Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Niluthpol Mithun, Ravdeep Pasricha, Evangelos Papalexakis, Amit Roy-Chowdhury

Auto-TLDR; Robust Joint Embedding for Image-Text Retrieval Using Web Images

In this paper, we address the problem of utilizing web images to train robust joint embedding models for the image-text retrieval task. Prior webly supervised approaches directly leverage weakly annotated web images in the joint embedding learning framework. The objective of these approaches suffers significantly when the ratio of noisy and missing tags associated with the web images is very high. In this regard, we propose a CP-decomposition-based tensor completion framework to refine the tags of web images by modeling the observed ternary inter-relations between the sets of labeled images, tags, and web images as a tensor. To effectively deal with the high ratio of missing entries likely in our case, we incorporate intra-modal correlation as side information in the proposed framework. Our tag refinement approach, combined with existing webly supervised image-text embedding approaches, provides a more principled way to learn joint embedding models in the presence of significant noise from web data and limited clean labeled data. Experiments on benchmark datasets demonstrate that the proposed approach achieves a significant performance gain in image-text retrieval.
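
To make the completion step concrete, here is a minimal sketch (not the paper's exact formulation, which additionally incorporates intra-modal correlation as side information) that fits a masked rank-R CP model to the observed entries of the images x tags x web-images tensor by gradient descent; the function name and hyperparameters are illustrative only.

```python
import numpy as np

def cp_complete(T, mask, rank=10, lr=0.01, iters=500, lam=0.1, seed=0):
    """Sketch: recover missing entries of a 3-way tensor T (images x tags x
    web-images) by fitting rank-R CP factors A, B, C to observed entries only."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank)) * 0.1
    B = rng.standard_normal((J, rank)) * 0.1
    C = rng.standard_normal((K, rank)) * 0.1
    for _ in range(iters):
        # CP reconstruction: T_hat[i,j,k] = sum_r A[i,r] * B[j,r] * C[k,r]
        T_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
        E = mask * (T_hat - T)            # error on observed entries only
        # Gradients of 0.5*||mask*(T_hat - T)||^2 + 0.5*lam*||.||^2 per factor
        gA = np.einsum('ijk,jr,kr->ir', E, B, C) + lam * A
        gB = np.einsum('ijk,ir,kr->jr', E, A, C) + lam * B
        gC = np.einsum('ijk,ir,jr->kr', E, A, B) + lam * C
        A -= lr * gA
        B -= lr * gB
        C -= lr * gC
    return np.einsum('ir,jr,kr->ijk', A, B, C)   # refined (completed) tensor
```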

Similar papers

JECL: Joint Embedding and Cluster Learning for Image-Text Pairs

Sean Yang, Kuan-Hao Huang, Bill Howe

Auto-TLDR; JECL: Clustering Image-Caption Pairs with Parallel Encoders and Regularized Clusters

We propose JECL, a method for clustering image-caption pairs by training parallel encoders with regularized clustering and alignment objectives, simultaneously learning both representations and cluster assignments. Such image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce but free-text descriptions are common. JECL trains by minimizing the Kullback-Leibler divergence between the distributions of the images and text and a combined joint target distribution, and by optimizing the Jensen-Shannon divergence between the soft cluster assignments of the images and text. Regularizers are also applied to prevent trivial solutions. Experiments show that JECL outperforms both single-view and multi-view methods on large benchmark image-caption datasets, and is remarkably robust to missing captions and varying data sizes.
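
A rough sketch of the two objectives on a batch of soft assignments might look as follows; the direction of each KL term and the construction of the joint target are assumptions, since the abstract does not pin them down.

```python
import torch
import torch.nn.functional as F

def jecl_losses(p_img, p_txt, p_target, eps=1e-8):
    """Sketch of the two JECL objectives on soft cluster assignments.
    p_img, p_txt: (batch, K) soft assignments from each encoder;
    p_target: (batch, K) combined joint target distribution (its exact
    construction follows the paper and is not reproduced here)."""
    # KL(target || view) for each view, DEC-style self-training
    kl = F.kl_div((p_img + eps).log(), p_target, reduction='batchmean') \
       + F.kl_div((p_txt + eps).log(), p_target, reduction='batchmean')
    # Jensen-Shannon divergence pulls the two views' assignments together
    m = 0.5 * (p_img + p_txt)
    js = 0.5 * F.kl_div((m + eps).log(), p_img, reduction='batchmean') \
       + 0.5 * F.kl_div((m + eps).log(), p_txt, reduction='batchmean')
    return kl, js
```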

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Tobia Tesan, Pasquale Coscia, Lamberto Ballan

Auto-TLDR; Context-Based Image Annotation with Multiple Semantic Embeddings and Recurrent Neural Networks

Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or uncommon images that need more context to be correctly annotated. Metadata accompanying images on social media represent an ideal source of additional information for retrieving proper neighborhoods that ease the image annotation task. To this end, we blend visual features extracted from neighbors and their metadata to jointly leverage context and visual cues. Our models use multiple semantic embeddings to achieve the dual objective of being robust to vocabulary changes between train and test sets and of decoupling the architecture from the low-level metadata representation. Convolutional and recurrent neural networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors and query images. We perform comprehensive experiments on the NUS-WIDE dataset showing that our models outperform state-of-the-art architectures based on images and metadata, and decrease both sensory and semantic gaps to better annotate images.

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Li Ren, Kai Li, Liqiang Wang, Kien Hua

Auto-TLDR; Adversarial Discriminative Domain Regularization for Efficient Cross-Modal Matching

Matching information across image and text modalities is a fundamental challenge for many applications that involve both vision and natural language processing. The objective is to find efficient similarity metrics for comparing visual and textual information. Existing approaches mainly match local visual objects and sentence words in a shared space with attention mechanisms. The matching performance is still limited because the similarity computation is based on simple comparisons of the matching features, ignoring the characteristics of their distribution in the data. In this paper, we address this limitation with an efficient learning objective that considers the discriminative feature distributions between the visual objects and sentence words. Specifically, we propose a novel Adversarial Discriminative Domain Regularization (ADDR) learning framework that goes beyond the standard metric learning objective by constructing a set of discriminative data domains within each image-text pair. Our approach can generally improve the learning efficiency and performance of existing metric learning frameworks by regulating the distribution of the hidden space between the matching pairs. The experimental results show that this new approach significantly improves the overall performance of several popular cross-modal matching techniques (SCAN, VSRN, BFAN) on the MS-COCO and Flickr30K benchmarks.
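
The abstract does not spell out the ADDR construction, but a generic adversarial regularizer of this flavor, a discriminator that tries to separate the two modalities in the shared space while the embedding networks try to fool it, can be sketched as follows; all module names and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDiscriminator(nn.Module):
    """Tries to tell image features from text features in the shared space."""
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)   # one logit per feature vector

def adversarial_regularizer(disc, img_emb, txt_emb):
    """Loss term for the embedding networks: fool the discriminator so the two
    feature distributions become indistinguishable (labels are flipped here;
    the discriminator itself is trained separately with the true labels)."""
    d_img, d_txt = disc(img_emb), disc(txt_emb)
    return F.binary_cross_entropy_with_logits(d_img, torch.zeros_like(d_img)) \
         + F.binary_cross_entropy_with_logits(d_txt, torch.ones_like(d_txt))
```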

Embedding Shared Low-Rank and Feature Correlation for Multi-View Data Analysis

Zhan Wang, Lizhi Wang, Hua Huang

Auto-TLDR; embedding shared low-rank and feature correlation for multi-view data analysis

The diversity of multimedia data in the real world usually takes the form of multi-view features. How to explore the structure information and correlations among multi-view features is still an open problem. In this paper, we propose a novel multi-view subspace learning method, named embedding shared low-rank and feature correlation (ESLRFC), for multi-view data analysis. First, in the embedding subspace, we propose a robust low-rank model on each feature set and enforce a shared low-rank constraint to characterize the common structure information of multiple feature data. Second, we develop an enhanced correlation analysis in the embedding subspace for simultaneously removing the redundancy of each feature set and exploring the correlations of multiple feature data. Finally, we incorporate the low-rank model and the correlation analysis into a unified framework. The shared low-rank constraint not only depicts the data distribution consistency among multiple feature data, but also assists robust subspace learning. Experimental results on recognition tasks demonstrate the superior performance and noise robustness of the proposed method.

Transformer Reasoning Network for Image-Text Matching and Retrieval

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato

Auto-TLDR; A Transformer Encoder Reasoning Network for Image-Text Matching in Large-Scale Information Retrieval

Image-text matching is a fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by intertwining image and text features from the two different processing pipelines, usually through mutual attention mechanisms. However, this forfeits any chance to extract the separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to reason separately about the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the network produces compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed from caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
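
The weight-sharing idea can be sketched with standard transformer encoder layers: modality-specific lower stacks followed by a single shared upper stack applied to both pipelines. Layer counts, dimensions, and the use of the first token as a global descriptor are assumptions, not TERN's exact configuration.

```python
import torch
import torch.nn as nn

class TERNSketch(nn.Module):
    """Minimal sketch of the TERN idea: modality-specific transformer encoders
    for regions and words, with the deeper layers' weights shared so that both
    pipelines map into a common abstract space."""
    def __init__(self, dim=512, heads=8, sep_layers=2, shared_layers=2):
        super().__init__()
        mk = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads), num_layers=n)
        self.visual = mk(sep_layers)     # reasons over region features
        self.textual = mk(sep_layers)    # reasons over word features
        self.shared = mk(shared_layers)  # same weights applied to both modalities

    def forward(self, regions, words):
        # Inputs: (seq_len, batch, dim); the first token acts as a global summary
        v = self.shared(self.visual(regions))[0]   # compact visual feature
        t = self.shared(self.textual(words))[0]    # compact textual feature
        return v, t
```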

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Hui Yuan, Yan Huang, Dongbo Zhang, Zerui Chen, Wenlong Cheng, Liang Wang

Auto-TLDR; Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Image-text matching has made great progress recently, but challenges still remain in fine-grained matching. To deal with this problem, we propose an Improved Visual Semantic Reasoning model (VSR++), which jointly models 1) global alignment between images and texts and 2) local correspondence between regions and words in a unified framework. To exploit their complementary advantages, we also develop a suitable learning strategy to balance their relative importance. As a result, our model can distinguish image regions and text words at a fine-grained level, and thus achieves the current state-of-the-art performance on two benchmark datasets.

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
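
A minimal sketch of such a learnable score-based reduction, conditioning each element's score on a summary of the other modality, is given below; the paper's cross-attention variant is richer than this, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveReduction(nn.Module):
    """Sketch of a learnable reduction: score each element of a set (regions
    or words), softmax the scores, and take the weighted sum. Here the scores
    are conditioned on a context vector summarizing the other modality."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, elems, ctx):
        # elems: (batch, n, dim); ctx: (batch, dim) summary of the other modality
        ctx = ctx.unsqueeze(1).expand(-1, elems.size(1), -1)
        w = torch.softmax(self.score(torch.cat([elems, ctx], -1)), dim=1)
        return (w * elems).sum(dim=1)    # one reduced vector per sample
```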

Fast Discrete Cross-Modal Hashing Based on Label Relaxation and Matrix Factorization

Donglin Zhang, Xiaojun Wu, Zhen Liu, Jun Yu, Josef Kittler

Auto-TLDR; LRMF: Label Relaxation and Discrete Matrix Factorization for Cross-Modal Retrieval

In recent years, cross-media retrieval has drawn considerable attention due to the exponential growth of multimedia data. Many hashing approaches have been proposed for the cross-media search task. However, there are still open problems that warrant investigation. For example, most existing supervised hashing approaches employ a binary label matrix, which achieves small margins between wrong labels (0) and true labels (1). This may affect the retrieval performance by generating many false negatives and false positives. In addition, some methods adopt a relaxation scheme to solve the binary constraints, which may cause large quantization errors. Some discrete hashing methods have also been presented, but most of them are time-consuming. To overcome these problems, we present a label relaxation and discrete matrix factorization method (LRMF) for cross-modal retrieval. It offers a number of innovations. First of all, the proposed approach employs a novel label relaxation scheme to control the margins adaptively, which has the benefit of reducing the quantization error. Second, by virtue of the proposed discrete matrix factorization method designed to learn the binary codes, large quantization errors caused by relaxation can be avoided. The experimental results obtained on two widely used databases demonstrate that LRMF outperforms state-of-the-art cross-media methods.

Cross-Media Hash Retrieval Using Multi-head Attention Network

Zhixin Li, Feng Ling, Chuansheng Xu, Canlong Zhang, Huifang Ma

Auto-TLDR; Unsupervised Cross-Media Hash Retrieval Using Multi-Head Attention Network

Cross-media hash retrieval methods encode multimedia data into a common binary hash space, which can effectively measure the correlation between samples from different modalities. In order to further improve the retrieval accuracy, this paper proposes an unsupervised cross-media hash retrieval method based on a multi-head attention network. First of all, we use a multi-head attention network to better match images and texts, which contain rich semantic information. At the same time, an auxiliary similarity matrix is constructed to integrate the original neighborhood information from different modalities. This method can therefore capture the potential correlations both between different modalities and within the same modality, making up for the differences between them. Secondly, the method is unsupervised and does not require additional semantic labels, so it has the potential to scale to large cross-media retrieval. In addition, batch normalization and replacement hash code generation functions are adopted to optimize the model, and two loss functions are designed, which make the performance of this method exceed many supervised deep cross-media hash methods. Experiments on three datasets show that the average performance of this method is about 5 to 6 percentage points higher than the state-of-the-art unsupervised method, which demonstrates its effectiveness and superiority.

Discrete Semantic Matrix Factorization Hashing for Cross-Modal Retrieval

Jianyang Qin, Lunke Fei, Shaohua Teng, Wei Zhang, Genping Zhao, Haoliang Yuan

Auto-TLDR; Discrete Semantic Matrix Factorization Hashing for Cross-Modal Retrieval

Hashing has been widely studied for cross-modal retrieval due to its promising efficiency and effectiveness in massive data analysis. However, most existing supervised hashing methods suffer from inefficiency in very large-scale search and from the intractable discrete constraint on hash code learning. In this paper, we propose a new supervised hashing method, namely Discrete Semantic Matrix Factorization Hashing (DSMFH), for cross-modal retrieval. First, we conduct matrix factorization by directly utilizing the available label information to obtain a latent representation, so that both the inter-modality and intra-modality similarities are well preserved. Then, we simultaneously learn the discriminative hash codes and corresponding hash functions by deriving the matrix factorization into a discrete optimization. Finally, we adopt an alternatively iterative procedure to efficiently optimize the matrix factorization and discrete learning. Extensive experimental results on three widely used image-tag databases demonstrate the superiority of DSMFH over state-of-the-art cross-modal hashing methods.

Tensorized Feature Spaces for Feature Explosion

Ravdeep Pasricha, Pravallika Devineni, Evangelos Papalexakis, Ramakrishnan Kannan

Auto-TLDR; Tensor Rank Decomposition for Hyperspectral Image Classification

In this paper, we present a novel framework that uses tensor factorization to generate richer feature spaces for pixel classification in hyperspectral images. In particular, we assess the performance of different tensor rank decomposition methods compared to traditional kernel-based approaches for the hyperspectral image classification problem. We propose ORION, which takes as input a hyperspectral image tensor and a rank, and outputs an enhanced feature space from the factor matrices of the decomposed tensor. Our method is a feature explosion technique that inherently maps a low-dimensional input space in R^K to a high-dimensional space in R^R, where R >> K (say, on the order of 1000x), like a kernel. We show how the proposed method exploits the multi-linear structure of the three-dimensional hyperspectral tensor. We demonstrate the effectiveness of our method with experiments on three publicly available hyperspectral datasets with labeled pixels, comparing its classification performance against traditional linear and non-linear supervised learning methods such as SVM with linear, polynomial, and RBF kernels, and the multi-layer perceptron model. Finally, we explore the relationship between the rank of the tensor decomposition and the classification accuracy using several hyperspectral datasets with ground truth.
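
In the spirit of this pipeline, the sketch below CP-decomposes a hyperspectral cube with tensorly and builds a per-pixel feature from the spatial factor matrices; combining the factors by elementwise product is an assumption, not necessarily ORION's exact construction.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def cp_pixel_features(hsi, rank=50):
    """Sketch: CP-decompose an H x W x bands hyperspectral cube and build a
    rank-dimensional feature per pixel from the spatial factor matrices."""
    weights, (A, B, C) = parafac(tl.tensor(hsi, dtype=tl.float32), rank=rank)
    A, B = tl.to_numpy(A), tl.to_numpy(B)          # (H, rank), (W, rank)
    H, W = hsi.shape[0], hsi.shape[1]
    feats = A[:, None, :] * B[None, :, :]          # (H, W, rank) pixel features
    feats = feats * tl.to_numpy(weights)           # fold in the CP weights
    return feats.reshape(H * W, rank)              # one R-dim vector per pixel
```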

T-SVD Based Non-Convex Tensor Completion and Robust Principal Component Analysis

Tao Li, Jinwen Ma

Auto-TLDR; Non-Convex tensor rank surrogate function and non-convex sparsity measure for tensor recovery

In this paper, we propose a novel non-convex tensor rank surrogate function and a novel non-convex sparsity measure. The basic idea is to sidestep the bias of the $\ell_1$-norm by introducing concavity. Furthermore, we employ this non-convex penalty in tensor recovery problems such as tensor completion and tensor robust principal component analysis. Due to the concavity, the parameters of these models are difficult to solve. To tackle this problem, we devise a majorization-minimization algorithm that optimizes an upper bound of the original function in each iteration, where every sub-problem is solved by the alternating direction method of multipliers. We also analyze the theoretical properties of the proposed algorithm. Finally, experimental results on natural and hyperspectral images demonstrate the efficacy and efficiency of the proposed method.

Making Every Label Count: Handling Semantic Imprecision by Integrating Domain Knowledge

Clemens-Alexander Brust, Björn Barz, Joachim Denzler

Auto-TLDR; Class Hierarchies for Imprecise Label Learning and Annotation eXtrapolation

Noisy data, crawled from the web or supplied by volunteers such as Mechanical Turkers or citizen scientists, is considered an alternative to professionally labeled data. There has been research focused on mitigating the effects of label noise. It is typically modeled as inaccuracy, where the correct label is replaced by an incorrect label from the same set. We consider an additional dimension of label noise: imprecision. For example, a non-breeding snow bunting is labeled as a bird. This label is correct, but not as precise as the task requires. Standard softmax classifiers cannot learn from such a weak label because they consider all classes mutually exclusive, which non-breeding snow bunting and bird are not. We propose CHILLAX (Class Hierarchies for Imprecise Label Learning and Annotation eXtrapolation), a method based on hierarchical classification, to fully utilize labels of any precision. Experiments on noisy variants of NABirds and ILSVRC2012 show that our method outperforms strong baselines by as much as 16.4 percentage points, and the current state of the art by up to 3.9 percentage points.

Enriching Video Captions with Contextual Text

Philipp Rimle, Pelin Dogan, Markus Gross

Auto-TLDR; Contextualized Video Captioning Using Contextual Text

Understanding video content and generating captions with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing extracted information from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input and mines relevant knowledge such as names and locations from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing it to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.

Cross-Supervised Joint-Event-Extraction with Heterogeneous Information Networks

Yue Wang, Zhuo Xu, Yao Wan, Lu Bai, Lixin Cui, Qian Zhao, Edwin Hancock, Philip Yu

Auto-TLDR; Joint-Event-extraction from Unstructured corpora using Structural Information Network

Joint-event-extraction, which extracts structural information (i.e., entities or triggers of events) from unstructured real-world corpora, has attracted more and more research attention in natural language processing. Most existing works do not fully address the sparse co-occurrence relationships between entities and triggers, which exacerbates the error-propagation problem and may degrade extraction performance. To mitigate this issue, we first define joint-event-extraction as a sequence-to-sequence labeling task with a tag set composed of the tags of triggers and entities. Then, to incorporate the missing information in the aforementioned co-occurrence relationships, we propose a Cross-Supervised Mechanism (CSM) to alternately supervise the extraction of either triggers or entities based on the type distribution of each other. Moreover, since the connected entities and triggers naturally form a heterogeneous information network (HIN), we leverage the latent pattern along meta-paths for a given corpus to further improve the performance of our proposed method. To verify the effectiveness of our proposed method, we conduct extensive experiments on real-world datasets and compare our method with state-of-the-art methods. Empirical results and analysis show that our approach outperforms the state-of-the-art methods in both entity and trigger extraction.

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight about the video and, more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots, and the network generates a text description for each shot. Next, a visual-language content matching unit, trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 and 5 hours long and is associated with over 3000 textual descriptions on average. The generated textual summaries, including only 5 percent (or less) of the generated descriptions, are compared to ground-truth summaries in the text domain using well-established metrics in natural language processing.

RWMF: A Real-World Multimodal Foodlog Database

Pengfei Zhou, Cong Bai, Kaining Ying, Jie Xia, Lixin Huang

Auto-TLDR; Real-World Multimodal Foodlog: A Real-World Foodlog Database for Diet Assistants

With increasing health concerns about diet, it is worthwhile to develop an intelligent assistant that can help users eat healthier. Such assistants can automatically give personal advice on the users' diet and generate health reports about eating on a regular basis. To boost research on such diet assistants, we establish a real-world foodlog database using various methods such as filtering, clustering, and graph convolutional networks. This database is built from real-world lifelog and medical data and is named Real-World Multimodal Foodlog (RWMF). It contains 7500 multimodal pairs, where each pair consists of a food image paired with a line of personal biometrics data (such as blood glucose) and a textual description of food composition paired with a line of food nutrition data. In this paper, we present the detailed procedures for setting up the database. We evaluate the performance of RWMF using different food classification and cross-modal retrieval approaches. We also test the performance of multimodal fusion on RWMF through ablation experiments. The experimental results show that the RWMF database is quite challenging and can be widely used to evaluate the performance of food analysis methods based on multimodal data.

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Ding Li, Rui Wu, Zhizhong Zhang, Yongqiang Tang, Wensheng Zhang

Auto-TLDR; Multi-scale 2D Representation Learning for Weakly Supervised Video Moment Retrieval

Video moment retrieval aims to search for the moment most relevant to a given language query. However, most existing methods in this community require temporal boundary annotations, which are expensive and time-consuming to label. Hence, weakly supervised methods that use only coarse video-level labels have been put forward recently. Despite their effectiveness, these methods usually process moment candidates independently, ignoring the critical fact that there are natural temporal dependencies between candidates at different temporal scales. To cope with this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates; the two dimensions of this map indicate the start and end time points of the candidates. Then, we select the top-K candidates from each scale-varied map with a learnable convolutional neural network. With a newly designed Moments Evaluation Module, we obtain the alignment scores of the selected candidates. At last, the similarity between captions and the language query serves as supervision for further training the candidate selector. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, demonstrate that our approach achieves performance superior to state-of-the-art results.
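
The two-dimensional map for a single temporal scale can be sketched directly: cell (i, j) holds a pooled representation of the candidate moment spanning clips i through j, with the lower triangle left invalid. Mean pooling here is an assumption; the paper may pool differently.

```python
import numpy as np

def build_2d_map(clip_feats):
    """Sketch: build the two-dimensional moment map for one temporal scale.
    clip_feats: (T, d) features of T consecutive clips; cell (i, j) holds a
    candidate moment starting at clip i and ending at clip j."""
    T, d = clip_feats.shape
    map2d = np.zeros((T, T, d), dtype=clip_feats.dtype)
    for i in range(T):
        for j in range(i, T):
            map2d[i, j] = clip_feats[i:j + 1].mean(axis=0)  # valid upper triangle
    return map2d  # lower triangle (j < i) stays zero / invalid
```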

Enhanced User Interest and Expertise Modeling for Expert Recommendation

Tongze He, Caili Guo, Yunfei Chu

Auto-TLDR; A Unified Framework for Expert Recommendation in Community Question Answering

The rapid development of Community Question Answering (CQA) satisfies users' requests for professional and personal knowledge. In CQA, one key issue is to recommend users with high expertise and willingness to answer the given questions, namely expert recommendation. However, most existing methods for expert recommendation ignore key information, such as time information and historical feedback information, degrading their performance. On the one hand, users' interests change over time, and it is biased not to consider these dynamics. On the other hand, feedback information is critical to estimating users' expertise. To solve these problems, we propose a unified framework for expert recommendation that exploits user interest and expertise more precisely. Considering the inconsistency between them, we propose to learn their embeddings separately. We leverage Long Short-Term Memory (LSTM) to model users' short-term interest and combine it with long-term interest. The user expertise is learned by the designed user expertise network, which explicitly models feedback on users' historical behavior. Extensive experiments on a large-scale dataset from a real-world CQA site demonstrate the superior performance of our method over state-of-the-art solutions to the problem.

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about texts and visual contents in images is a burgeoning and important research topic in VQA, especially for visually impaired assistance applications. Given an image, the task aims to predict an answer to a provided natural language question closely related to the image's textual contents. In this paper, we propose a novel end-to-end textual content based VQA model, which grounds question answering on both the visual and textual information. After encoding the image, question, and recognized text words, it uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features, respectively. The complex correlations among different features can be captured efficiently. To ensure the model's extendibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic embedding based classifier to perform answer prediction. Extensive experiments on the newly proposed benchmark TextVQA demonstrate that the proposed model achieves promising results.

Probabilistic Latent Factor Model for Collaborative Filtering with Bayesian Inference

Jiansheng Fang, Xiaoqing Zhang, Yan Hu, Yanwu Xu, Ming Yang, Jiang Liu

Auto-TLDR; Bayesian Latent Factor Model for Collaborative Filtering

The Latent Factor Model (LFM) is one of the most successful methods for collaborative filtering (CF) in recommendation systems, in which both users and items are projected into a joint latent factor space. Based on matrix factorization, as usually applied in pattern recognition, LFM models user-item interactions as inner products of the user and item factor vectors in that space, and can be efficiently solved by least squares methods with optimal estimation. However, such optimal estimation methods are prone to overfitting due to the extreme sparsity of user-item interactions. In this paper, we propose a Bayesian treatment for LFM, named Bayesian Latent Factor Model (BLFM). Based on observed user-item interactions, we build a probabilistic factor model in which regularization is introduced by placing prior constraints on the latent factors, and the likelihood function is established over observations and parameters. We then draw samples of latent factors from the posterior distribution with Variational Inference (VI) to predict expected values. We further make an extension to BLFM, called BLFMBias, incorporating user-dependent and item-dependent biases into the model to enhance performance. Extensive experiments on a movie rating dataset show the effectiveness of our proposed models in comparison with several strong baselines.
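
As a simplified illustration of the generative model, the sketch below does MAP estimation with Gaussian priors on the factors, which reduces to L2-regularized matrix factorization; the paper's full variational treatment goes further and infers a posterior.

```python
import numpy as np

def map_lfm(R, mask, rank=16, lam=0.1, lr=0.01, iters=300, seed=0):
    """Sketch: Gaussian priors on latent factors turn maximum a posteriori
    estimation into L2-regularized matrix factorization on observed ratings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.standard_normal((n_users, rank)) * 0.1   # user latent factors
    V = rng.standard_normal((n_items, rank)) * 0.1   # item latent factors
    for _ in range(iters):
        E = mask * (U @ V.T - R)          # error on observed ratings only
        gU = E @ V + lam * U              # the prior acts as weight decay
        gV = E.T @ U + lam * V
        U -= lr * gU
        V -= lr * gV
    return U @ V.T                        # expected rating matrix
```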

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

Ekraam Sabir, Ayush Jaiswal, Wael Abdalmageed, Prem Natarajan

Auto-TLDR; Scalable Image Repurposing Detection with Graph Neural Network Based Model

Image repurposing is a category of fake news in which a digitally unmanipulated image is misrepresented by means of its accompanying metadata such as captions, location, etc., where the image and accompanying metadata together comprise a multimedia package. The problem setup is to authenticate a query multimedia package using a reference dataset of potentially related packages as evidence. Existing methods are limited to using a single piece of evidence (retrieved package), which ignores the potential performance improvement from using multiple evidences. In this work, we introduce a novel graph neural network based model for image repurposing detection, which effectively utilizes multiple retrieved packages as evidence and is scalable with the number of evidences. We compare the scalability and performance of our model against existing methods. Experimental results show that the proposed model outperforms the existing state-of-the-art for image repurposing detection with an error reduction of up to 25%.

Context Visual Information-Based Deliberation Network for Video Captioning

Min Lu, Xueyong Li, Caihua Liu

Auto-TLDR; Context visual information-based deliberation network for video captioning

Video captioning is the task of automatically and accurately generating a textual description for a video. Typical methods following the encoder-decoder architecture directly utilize hidden states to predict words. Nevertheless, these methods do not amend inaccurate hidden states before feeding them into word prediction, which leads to a cascade of errors as words are generated one by one. In this paper, the context visual information-based deliberation network is proposed, abbreviated as CVI-DelNet. Its key idea is to introduce a deliberator into the encoder-decoder framework. The encoder-decoder first generates a raw hidden state sequence. Unlike in existing methods, the raw hidden states are no longer used directly for word prediction but are fed into the deliberator to generate refined hidden states. Words are then predicted according to the refined hidden states and the contextual visual features. Results on two datasets show that the proposed method significantly outperforms the baselines.

Picture-To-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images

Jiatong Li, Fangda Han, Ricardo Guerrero, Vladimir Pavlovic

Auto-TLDR; PITA: A Deep Learning Architecture for Predicting the Relative Amount of Ingredients from Food Images

Increased awareness of the impact of food consumption on health and lifestyle today has given rise to novel data-driven food analysis systems. Although these systems may recognize the ingredients, a detailed analysis of their amounts in the meal, which is paramount for estimating the correct nutrition, is usually ignored. In this paper, we study the novel and challenging problem of predicting the relative amount of each ingredient from a food image. We propose PITA, the Picture-to-Amount deep learning architecture, to solve the problem. More specifically, we predict the ingredient amounts using a domain-driven Wasserstein loss from image-to-recipe cross-modal embeddings learned to align the two views of food data. Experiments on a dataset of recipes collected from the Internet show that the model generates promising results and improves over the baselines on this challenging task.

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu

Auto-TLDR; MAGNet: A Multi-Region Attention-Aware Grounding Network for Free-form Textual Queries

Grounding free-form textual queries necessitates an understanding of these textual phrases and their relation to the visual cues in order to reliably reason about the described locations. Spatial attention networks are known to learn this relationship and focus their gaze on salient objects in the image. Thus, we propose to utilize spatial attention networks for image-level visual-textual fusion, preserving local (word) and global (phrase) information, to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. We focus only on the phrase query - ground truth pair (referring expression), yielding a model independent of dataset-specific constraints such as additional attributes, context, etc. On the ReferIt Game referring expression dataset, our Multi-Region Attention-assisted Grounding network (MAGNet) achieves over 12% improvement over the state-of-the-art. Without the context from image captions and attribute information in Flickr30k Entities, we still achieve competitive results compared to the state-of-the-art.

Double Manifolds Regularized Non-Negative Matrix Factorization for Data Representation

Jipeng Guo, Shuai Yin, Yanfeng Sun, Yongli Hu

Auto-TLDR; Double Manifolds Regularized Non-negative Matrix Factorization for Clustering

Non-negative matrix factorization (NMF) is an important method for learning latent data representations. Exploiting the local geometrical structure can make the learned representation more effective and significantly improve the performance of NMF. However, most existing graph-based learning methods are determined by a predefined similarity graph which may not be optimal for specific tasks. To solve this problem, we propose the Double Manifolds Regularized NMF (DMR-NMF) model, which jointly learns an adaptive affinity matrix together with the non-negative matrix factorization. The learned affinity matrix can guide the NMF to fit the clustering task. Moreover, we develop iterative updating optimization schemes for DMR-NMF and provide a strict convergence proof of our optimization strategy. Empirical experiments on four different real-world data sets demonstrate the state-of-the-art performance of DMR-NMF in comparison with other related algorithms.
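
For intuition, here is a classic graph-regularized NMF with a fixed affinity matrix, using standard multiplicative updates; DMR-NMF's distinguishing step, jointly learning the affinity matrix, is omitted from this sketch.

```python
import numpy as np

def graph_reg_nmf(X, A, k=10, lam=1.0, iters=200, eps=1e-9, seed=0):
    """Sketch of graph-regularized NMF with a fixed affinity matrix A
    (GNMF-style multiplicative updates). X: (features, samples) non-negative
    data; A: (samples, samples) affinity. Minimizes, approximately,
    ||X - W H||^2 + lam * Tr(H L H^T) with L = D - A."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    D = np.diag(A.sum(axis=1))                    # degree matrix
    for _ in range(iters):
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X + lam * H @ A) / (W.T @ W @ H + lam * H @ D + eps)
    return W, H
```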

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Edward Kim, Connor Onweller, Kathleen F. Mccoy

Auto-TLDR; A multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to blind or visually impaired users

We present a multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to a person who is blind or visually impaired. The framework utilizes the visual, textual, positional, and size characteristics extracted from the image to create the summary. Different and complementary neural architectures are optimized for each task using crowdsourced training data. From our quantitative experiments and results, we explain the reasoning behind our framework and show the effectiveness of our models. Our qualitative results showcase text generated from our framework and show that Mechanical Turk participants favor it over other automatic and human generated summarizations. We describe the design of an experiment to evaluate the utility of our system for people who have visual impairments in the context of understanding Twitter tweets containing line graphs.

RGB-Infrared Person Re-Identification Via Image Modality Conversion

Huangpeng Dai, Qing Xie, Yanchun Ma, Yongjian Liu, Shengwu Xiong

Auto-TLDR; CE2L: A Novel Network for Cross-Modality Re-identification with Feature Alignment

As a cross-modality retrieval task, RGB-infrared person re-identification (Re-ID) is an important and challenging task because of its important role in video surveillance applications and the large cross-modality variations between visible and infrared images. Most previous works addressed the problem of the cross-modality gap with feature alignment learned directly from the original feature representations. In this paper, different from existing works, we propose a novel network (CE2L) to tackle the cross-modality gap with feature alignment. CE2L mainly focuses on adding discriminative information and learning robust features by converting modality between visible and infrared images. Its merits are highlighted in two aspects: 1) using CycleGAN to convert infrared images into color images can not only increase the recognition characteristics of images but also allow our network to better learn the features of the two modalities; 2) our method can serve as data augmentation, since converting labeled training images into the other modality increases data diversity and the total amount of data, guarding against over-fitting. Extensive experimental results on two datasets demonstrate superior performance compared to the baseline and state-of-the-art methods.

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Recent research demonstrates that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially the correlations in the feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, involving both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as a global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network for capturing label correlation. Then, the correlation between the covariance representations (i.e., feature correlation) and the outputs of the GCN (i.e., label correlation) is modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

As an essential high-level task in video understanding, automatically describing a video with natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as gaps in current semantic representations and the inexpressibility of the generated captions. To deal with these limitations, in this paper we present a new architecture that we call the Attentive Visual Semantic Specialized Network (AVSSN), an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information in the text generation process. The adaptive gate enables the decoder to automatically select the relevant information, providing a better temporal state representation than existing decoders. Besides, the model is capable of learning to improve the expressiveness of generated captions by attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance with several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE-L.
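
The adaptive-gate idea reduces to a learned convex blend of the two context vectors at each decoding step; a minimal sketch, with the gating network and its inputs as assumptions rather than the paper's exact layer:

```python
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Sketch: at each decoding step, learn a scalar gate from the decoder
    state that blends the visual and semantic context vectors."""
    def __init__(self, hid=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(hid, 1), nn.Sigmoid())

    def forward(self, h, visual_ctx, semantic_ctx):
        # h: (batch, hid) decoder state; contexts: (batch, dim)
        g = self.gate(h)                                 # (batch, 1) in [0, 1]
        return g * visual_ctx + (1 - g) * semantic_ctx   # blended context
```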

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Yasunori Ohishi, Yuki Tanaka, Kunio Kashino

Auto-TLDR; A guided attention scheme for audio-visual co-segmentation

Audio-visual co-segmentation is a task to extract segments and regions corresponding to specific events from unlabelled audio and video signals. It is particularly important to accomplish it in an unsupervised way, since it is generally very difficult to manually label all the objects and events appearing in audio-visual signals for supervised learning. Here, we propose to take advantage of the temporal proximity of corresponding audio and video entities included in the signals. For this purpose, we newly employ a guided attention scheme for this task to efficiently detect and utilize temporal co-occurrences of audio and video information. Experiments using a real TV broadcast of Sumo wrestling, a sport event, with live commentaries show that our model can automatically extract specific athlete movements and their spoken descriptions in an unsupervised manner.

Soft Label and Discriminant Embedding Estimation for Semi-Supervised Classification

Fadi Dornaika, Abdullah Baradaaji, Youssof El Traboulsi

Auto-TLDR; Graph-Based Semi-Supervised Learning for Linear Feature Extraction and Label Propagation

In recent times, graph-based semi-supervised learning has proved to be a powerful paradigm for processing and mining large datasets. Its main advantage relies on the fact that these methods can be useful in propagating a small set of known labels to a large set of unlabeled data. The scarcity of labeled data may affect the performance of semi-supervised learning. This paper introduces a new semi-supervised framework for simultaneous linear feature extraction and label propagation. The proposed method simultaneously estimates a discriminant transformation and the unknown labels by exploiting both labeled and unlabeled data. In addition, the unknowns of the learning model are estimated by integrating two types of graph-based smoothness constraints. The resulting semi-supervised model is expected to learn more discriminative information. Experiments are conducted on six public image datasets. The experimental results show that the performance of the proposed method can be better than that of many state-of-the-art graph-based semi-supervised algorithms.

Exploiting Elasticity in Tensor Ranks for Compressing Neural Networks

Jie Ran, Rui Lin, Hayden Kwok-Hay So, Graziano Chesi, Ngai Wong

Auto-TLDR; Nuclear-Norm Rank Minimization Factorization for Deep Neural Networks

Elasticities in depth, width, kernel size and resolution have been explored in compressing deep neural networks (DNNs). Recognizing that the kernels in a convolutional neural network (CNN) are 4-way tensors, we further exploit a new elasticity dimension along the input-output channels. Specifically, a novel nuclear-norm rank minimization factorization (NRMF) approach is proposed to dynamically and globally search for the reduced tensor ranks during training. Correlation between tensor ranks across multiple layers is revealed, and a graceful tradeoff between model size and accuracy is obtained. Experiments then show the superiority of NRMF over the previous non-elastic variational Bayesian matrix factorization (VBMF) scheme.

Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information

Robin Ruede, Verena Heusser, Lukas Frank, Monica Haurilet, Alina Roitberg, Rainer Stiefelhagen

Auto-TLDR; Pic2kcal: Learning Food Recipes from Images for Calorie Estimation

A rapidly growing amount of content posted online, such as food recipes, opens doors to new exciting applications at the intersection of vision and language. In this work, we aim to estimate the calorie amount of a meal directly from an image by learning from recipes people have published on the Internet, thus skipping time-consuming manual data annotation. Since there are few large-scale publicly available datasets captured in unconstrained environments, we propose the pic2kcal benchmark comprising 308,000 images from over 70,000 recipes including photographs, ingredients and instructions. To obtain nutritional information of the ingredients and automatically determine the ground-truth calorie value, we match the items in the recipes with structured information from a food item database. We evaluate various neural networks for regression of the calorie quantity and extend them with the multi-task paradigm. Our learning procedure combines the calorie estimation with prediction of proteins, carbohydrates, and fat amounts as well as a multi-label ingredient classification. Our experiments demonstrate clear benefits of multi-task learning for calorie estimation, surpassing the single-task calorie regression by 9.9%. To encourage further research on this task, we make the code for generating the dataset and the models publicly available.
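
A plausible shape for the multi-task objective, with uniform task weights and dictionary keys as illustrative assumptions (the paper tunes and combines the tasks in its own way), is:

```python
import torch.nn as nn

def pic2kcal_loss(pred, target, w_ing=1.0):
    """Sketch of a multi-task objective: regress calories, protein, carbs and
    fat, and classify ingredients with a multi-label loss. Field names and the
    uniform task weighting are assumptions."""
    mse = nn.functional.mse_loss
    bce = nn.functional.binary_cross_entropy_with_logits
    loss = mse(pred['kcal'], target['kcal'])
    for k in ('protein', 'carbs', 'fat'):            # auxiliary regression heads
        loss = loss + mse(pred[k], target[k])
    # multi-label ingredient classification head (one logit per ingredient)
    loss = loss + w_ing * bce(pred['ingredients'], target['ingredients'])
    return loss
```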

Temporal Collaborative Filtering with Graph Convolutional Neural Networks

Esther Rodrigo-Bonet, Minh Duc Nguyen, Nikos Deligiannis

Auto-TLDR; Temporal Collaborative Filtering with Graph Neural Networks and Recurrent Neural Networks

Temporal collaborative filtering (TCF) methods aim at modelling non-static aspects behind recommender systems, such as the dynamics in users' preferences and social trends around items. State-of-the-art TCF methods employ recurrent neural networks (RNNs) to model such aspects. These methods deploy matrix-factorization-based (MF-based) approaches to learn the user and item representations. Recently, graph-neural-network-based (GNN-based) approaches have shown improved performance in providing accurate recommendations over traditional MF-based approaches in non-temporal CF settings. Motivated by this, we propose a novel TCF method that leverages GNNs to learn user and item representations, and RNNs to model their temporal dynamics. A challenge with this method lies in the increased data sparsity, which negatively impacts obtaining meaningful quality representations with GNNs. To overcome this challenge, we train a GNN model at each time step using a set of observed interactions accumulated time-wise. Comprehensive experiments on real-world data show the improved performance obtained by our method over several state-of-the-art temporal and non-temporal CF models.

Low Rank Representation on Product Grassmann Manifolds for Multi-View Subspace Clustering

Jipeng Guo, Yanfeng Sun, Junbin Gao, Yongli Hu, Baocai Yin

Auto-TLDR; Low Rank Representation on Product Grassmann Manifold for Multi-View Data Clustering

Clustering high-dimensional multi-view data with complex intrinsic properties and nonlinear manifold structure is a challenging task, since these data are always embedded in low-dimensional manifolds. Inspired by Low Rank Representation (LRR), some researchers have extended classic LRR to the Grassmann manifold or the Product Grassmann manifold to represent data with non-linear metrics. However, most of these methods utilize the convex nuclear norm to leverage a low-rank structure, which over-relaxes the true rank and can lead to results that deviate from the true underlying ones. Moreover, the computational complexity of the singular value decomposition required for nuclear norm minimization is high. In this paper, we propose a new low-rank model for high-dimensional multi-view data clustering on the Product Grassmann Manifold, with a matrix tri-factorization used to control the upper bound of the true rank of the representation matrix. The original problem can then be transformed into a nuclear norm minimization with smaller-scale matrices. An effective solution and theoretical analysis are also provided. The experimental results show that the proposed method clearly outperforms other state-of-the-art methods on several multi-source human/crowd action video datasets.

VSB^2-Net: Visual-Semantic Bi-Branch Network for Zero-Shot Hashing

Xin Li, Xiangfeng Wang, Bo Jin, Wenjie Zhang, Jun Wang, Hongyuan Zha

Auto-TLDR; VSB^2-Net: inductive zero-shot hashing for image retrieval

Zero-shot hashing aims at learning a hashing model from seen classes such that the obtained model is capable of generalizing to unseen classes for image retrieval. Inspired by zero-shot learning, existing zero-shot hashing methods usually transfer supervised knowledge from seen to unseen classes by embedding the Hamming space into a shared semantic space. However, this makes instances difficult to distinguish due to the limited number of hashing bits, especially for semantically similar unseen classes. We propose a novel inductive zero-shot hashing framework, i.e., VSB^2-Net, where both the semantic space and the visual feature space are instead embedded into the same Hamming space. Reconstructive semantic relationships are established in the Hamming space, preserving local similarity relationships and explicitly enlarging the discrepancy between semantic Hamming vectors. A two-task architecture, comprising a classification module and a visual feature reconstruction module, is employed to enhance generalization and transfer abilities. Extensive evaluation results on several benchmark datasets demonstrate the superiority of our proposed method compared to several state-of-the-art baselines.

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Saleem Ahmed, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Auto-TLDR; Representational Learning for Similarity Based Retrieval of Mathematical Expressions

Representation learning in the form of high-dimensional embeddings has been used for multiple pattern recognition applications. There has been significant interest in building embedding-based systems for learning representations in the mathematical domain. At the same time, retrieval of structured information such as mathematical expressions is an important need for modern IR systems. In this work, our motivation is to introduce a robust framework for learning representations for similarity-based retrieval of mathematical expressions. Given a query by example, the embedding can find the closest matching expression as a function of the Euclidean distance between them. We leverage recent advancements in image-based and graph-based deep learning algorithms to learn our similarity embeddings. We do this first by using uni-modal encoders in graph space and image space, and then by a multi-modal combination of the same. To overcome the lack of training data, we force the networks to learn a deep metric using triplets generated with a heuristic scoring function. We also adopt a custom strategy for mining hard samples to train our neural networks. Our system produces rankings similar to those generated by the original scoring function, but using only a fraction of the time. Our results establish the viability of using such a multi-modal embedding for this task.
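
The deep-metric objective with hard-sample mining can be sketched as a triplet hinge where each anchor is paired with its hardest negative from a candidate pool; this mining rule is a simplification of the paper's custom strategy.

```python
import torch

def hard_triplet_loss(anchor, pos, neg_pool, margin=0.2):
    """Sketch: triplet hinge on squared Euclidean distances, where each anchor
    expression embedding is matched with the closest negative from a pool."""
    d_pos = (anchor - pos).pow(2).sum(dim=1)        # (batch,) positive distances
    # distances from each anchor to every candidate negative: (batch, pool)
    d_neg = torch.cdist(anchor, neg_pool).pow(2)
    hard = d_neg.min(dim=1).values                  # hardest negative per anchor
    return torch.clamp(d_pos - hard + margin, min=0).mean()
```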

Label Incorporated Graph Neural Networks for Text Classification

Yuan Xin, Linli Xu, Junliang Guo, Jiquan Li, Xin Sheng, Yuanyuan Zhou

Responsive image

Auto-TLDR; Graph Neural Networks for Semi-supervised Text Classification

Slides Poster Similar

Graph Neural Networks (GNNs) have achieved great success on graph-structured data, and their applications to traditional tasks such as natural language processing and semi-supervised text classification have been extensively explored in recent years. However, previous works consider only the text information when building the graph, ignoring heterogeneous information such as labels. In this paper, we incorporate the label information when building the graph by adding text-label-text paths, through which supervision propagates more directly across the graph. Specifically, we treat labels as nodes in a graph that also contains text and word nodes, and connect each label with the texts belonging to it. Through graph convolutions, label embeddings are jointly learned with text embeddings in the same latent semantic space. The newly incorporated label nodes facilitate learning more accurate text embeddings by introducing label information, and thus benefit downstream text classification tasks. Extensive results on several benchmark datasets show that the proposed framework outperforms baseline methods by a significant margin.
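
A toy NumPy sketch of the graph construction follows: label nodes are appended to a text-word graph and wired to the texts of their class, so a two-hop walk realizes the text-label-text paths along which supervision propagates. Shapes and labels are illustrative assumptions.

```python
# Minimal sketch: append label nodes to a text-word adjacency matrix and
# connect each label to the texts of its class. Sizes are illustrative.
import numpy as np

n_text, n_word, n_label = 4, 6, 2
labels = np.array([0, 0, 1, 1])          # class of each text node
n = n_text + n_word + n_label
A = np.zeros((n, n))

# (text-word and word-word edges, e.g. TF-IDF / PMI weights, go here)

# connect each label node to every text node of that class
for t, y in enumerate(labels):
    l = n_text + n_word + y              # index of the label node
    A[t, l] = A[l, t] = 1.0

# a two-hop walk text -> label -> text now links same-class texts,
# which is exactly the path supervision propagates along
two_hop = A @ A
print(two_hop[0, 1] > 0)                 # True: texts 0 and 1 share label 0
```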

Supervised Domain Adaptation Using Graph Embedding

Lukas Hedegaard, Omar Ali Sheikh-Omar, Alexandros Iosifidis

Responsive image

Auto-TLDR; Domain Adaptation from the Perspective of Multi-view Graph Embedding and Dimensionality Reduction

Slides Poster Similar

Getting deep convolutional neural networks to perform well requires a large amount of training data. When the available labelled data is limited, it is often beneficial to use transfer learning to leverage a related, larger dataset (source) in order to improve performance on the small dataset (target). Among transfer learning approaches, domain adaptation methods assume that the distributions of the two domains are shifted and attempt to realign them. In this paper, we consider the domain adaptation problem from the perspective of multi-view graph embedding and dimensionality reduction. Instead of solving the generalised eigenvalue problem to perform the embedding, we formulate the graph-preserving criterion as a loss in the neural network and learn a domain-invariant feature transformation in an end-to-end fashion. We show that the proposed approach leads to a powerful domain adaptation framework which generalises the prior methods CCSA and d-SNE and enables simple and effective loss designs; an LDA-inspired instantiation of the framework achieves performance on par with the state-of-the-art on the most widely used domain adaptation benchmarks, Office31 and MNIST-to-USPS.
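
A graph-preserving criterion of this kind can be written directly as a loss over embedding distances, as in the hedged PyTorch sketch below: an intrinsic graph W weights pairs to pull together and a penalty graph W_p weights pairs to push apart. The toy weight matrices and the raw difference form are assumptions; actual instantiations (e.g., the LDA-inspired one) constrain or transform these terms.

```python
# Minimal sketch of a graph-preserving loss over embeddings z: weighted
# attraction via W, weighted repulsion via W_p. Matrices are toy choices.
import torch

def graph_embedding_loss(z, W, W_p):
    # squared pairwise distances ||z_i - z_j||^2
    d2 = torch.cdist(z, z, p=2) ** 2
    return (W * d2).sum() - (W_p * d2).sum()

z = torch.randn(6, 32, requires_grad=True)     # embeddings from the net
same = torch.tensor([0, 0, 1, 1, 0, 1])        # class labels (toy)
W = (same[:, None] == same[None, :]).float()   # attract same-class pairs
W_p = 1.0 - W                                  # repel different-class pairs
loss = graph_embedding_loss(z, W, W_p)
loss.backward()
```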

Multi-Modal Deep Clustering: Unsupervised Partitioning of Images

Guy Shiran, Daphna Weinshall

Responsive image

Auto-TLDR; Multi-Modal Deep Clustering for Unlabeled Images

Slides Poster Similar

The clustering of unlabeled raw images is a daunting task, which has recently been approached with some success by deep learning methods. Here we propose an unsupervised clustering framework that learns a deep neural network in an end-to-end fashion, providing direct cluster assignments of images without additional processing. Multi-Modal Deep Clustering (MMDC) trains a deep network to align its image embeddings with target points sampled from a Gaussian Mixture Model distribution. The cluster assignments are then determined by the mixture component associated with each image embedding. Simultaneously, the same deep network is trained to solve an additional self-supervised task. This pushes the network to learn more meaningful image representations and stabilizes training. Experimental results show that MMDC achieves or exceeds state-of-the-art performance on four challenging benchmarks. On natural image datasets we improve on previous results by significant margins of up to 11% absolute accuracy points, yielding an accuracy of 70% on CIFAR-10 and 61% on STL-10.
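
The alignment step can be sketched as follows in PyTorch: targets are sampled from a mixture of Gaussians, each embedding is matched to a target, and the cluster label is the matched component. The nearest-target matching used here is a simplification of the paper's assignment, and all sizes are illustrative.

```python
# Minimal sketch: align image embeddings with GMM-sampled target points;
# the cluster label is the mixture component of the matched target.
import torch

k, d, n = 10, 64, 32                         # clusters, embed dim, batch
means = torch.randn(k, d) * 5                # fixed component centers
comp = torch.randint(k, (n,))                # sample a component per target
targets = means[comp] + torch.randn(n, d)    # GMM-sampled target points

emb = torch.randn(n, d, requires_grad=True)  # network image embeddings
assign = torch.cdist(emb, targets).argmin(1) # match each image to a target
loss = ((emb - targets[assign]) ** 2).sum(1).mean()
loss.backward()
cluster_of_image = comp[assign]              # component index = cluster id
```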

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Responsive image

Auto-TLDR; TransformerNetworks for Trajectory Prediction of People Interactions

Slides Poster Similar

Most recent successes in forecasting people's motion are based on LSTM models, and most recent progress has been achieved by modelling the social interaction among people and the interaction of people with the scene. We question the use of LSTM models and propose the novel use of Transformer Networks for trajectory forecasting. This is a fundamental switch from the sequential step-by-step processing of LSTMs to the only-attention-based memory mechanisms of Transformers. In particular, we consider both the original Transformer Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposed Transformers predict the trajectories of the individual people in the scene. These are “simple” models because each person is modelled separately, without any complex human-human or scene interaction terms. In particular, the TF model without bells and whistles yields the best score on the largest and most challenging trajectory forecasting benchmark, TrajNet [1]. Additionally, its extension which predicts multiple plausible future trajectories performs on par with more engineered techniques on the 5 datasets of ETH [2] + UCY [3]. Finally, we show that Transformers may deal with missing observations, as may be the case with real sensor data. Code is available at github.com/FGiuliari/Trajectory-Transformer
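
A minimal per-person setup of this kind might look like the PyTorch sketch below: observed (x, y) positions are linearly embedded and fed to a vanilla nn.Transformer that regresses future positions. Sizes are illustrative, and no social or scene terms are modelled, in line with the “simple” models described.

```python
# Minimal sketch: a vanilla Transformer regressing one person's future
# positions from observed ones. Sizes and training details are toy choices.
import torch
import torch.nn as nn

d_model, obs_len, pred_len = 64, 8, 12
embed = nn.Linear(2, d_model)                # (x, y) -> model dimension
tf = nn.Transformer(d_model=d_model, nhead=4,
                    num_encoder_layers=2, num_decoder_layers=2,
                    batch_first=True)
head = nn.Linear(d_model, 2)                 # model dimension -> (x, y)

obs = torch.randn(1, obs_len, 2)             # one person's observed track
tgt = torch.zeros(1, pred_len, 2)            # decoder inputs (teacher-forced
                                             # with ground truth in training)
out = head(tf(embed(obs), embed(tgt)))       # predicted future positions
print(out.shape)                             # torch.Size([1, 12, 2])
```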

Joint Supervised and Self-Supervised Learning for 3D Real World Challenges

Antonio Alliegro, Davide Boscaini, Tatiana Tommasi

Responsive image

Auto-TLDR; Self-supervision for 3D Shape Classification and Segmentation in Point Clouds

Slides Similar

Point cloud processing and 3D shape understanding are very challenging tasks for which deep learning techniques have demonstrated great potential. Still, further progress is essential to allow artificial intelligent agents to interact with the real world. In many practical conditions the amount of annotated data may be limited, and integrating new sources of knowledge becomes crucial to support autonomous learning. Here we consider several scenarios involving synthetic and real-world point clouds where supervised learning fails due to data scarcity and large domain gaps. We propose to enrich standard feature representations by leveraging self-supervision through a multi-task model that solves a 3D puzzle while learning the main task of shape classification or part segmentation. An extensive analysis investigating few-shot, transfer learning and cross-domain settings shows the effectiveness of our approach, with state-of-the-art results for 3D shape classification and part segmentation.
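
The multi-task coupling can be sketched as a shared encoder with two heads: one for shape classification and one predicting which spatial bin each shuffled puzzle piece came from. The toy encoder, bin count, and flattened input below are assumptions, not the paper's network.

```python
# Minimal sketch: one shared encoder, a supervised classification head,
# and a self-supervised 3D-puzzle head trained with a joint loss.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(3 * 1024, 256), nn.ReLU())  # toy encoder
cls_head = nn.Linear(256, 40)        # e.g. 40 shape classes
puzzle_head = nn.Linear(256, 27)     # 3x3x3 spatial bins for puzzle pieces

points = torch.randn(8, 3 * 1024)    # flattened point clouds (toy input)
feat = encoder(points)
ce = nn.CrossEntropyLoss()
loss = ce(cls_head(feat), torch.randint(40, (8,))) \
     + ce(puzzle_head(feat), torch.randint(27, (8,)))  # joint objective
loss.backward()
```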

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Responsive image

Auto-TLDR; PICK: A Graph Learning Framework for Key Information Extraction from Documents

Slides Poster Similar

Computer vision with state-of-the-art deep learning models has recently achieved huge success in the field of Optical Character Recognition (OCR), including text detection and recognition tasks. However, Key Information Extraction (KIE) from documents, the downstream task of OCR with a large number of real-world use scenarios, remains a challenge, because documents not only have textual features extracted by OCR systems but also semantic visual features that are not fully exploited yet play a critical role in KIE. Too little work has been devoted to efficiently making full use of both the textual and visual features of documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operations, yielding a richer semantic representation containing the textual and visual features and the global layout without ambiguity. Extensive experiments on real-world datasets show that our method outperforms baseline methods by significant margins.
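
A rough sketch of the fusion-plus-graph-learning idea: each document segment node carries concatenated textual and visual features, a soft adjacency is learned from node pairs, and a graph convolution mixes information across nodes. All sizes and the pairwise scoring function are illustrative assumptions.

```python
# Minimal sketch: fuse per-node textual and visual features, learn a soft
# adjacency from node pairs, and run one graph-convolution step.
import torch
import torch.nn as nn

n, dt, dv, dh = 12, 128, 128, 64         # nodes, text/visual/hidden dims
text_f, vis_f = torch.randn(n, dt), torch.randn(n, dv)
x = torch.cat([text_f, vis_f], dim=1)    # fused node features

score = nn.Linear(2 * (dt + dv), 1)      # graph learning: score node pairs
pairs = torch.cat([x.unsqueeze(1).expand(n, n, -1),
                   x.unsqueeze(0).expand(n, n, -1)], dim=-1)
A = torch.softmax(score(pairs).squeeze(-1), dim=1)  # soft learned adjacency

gcn = nn.Linear(dt + dv, dh)
h = torch.relu(gcn(A @ x))               # one graph-convolution step
print(h.shape)                           # torch.Size([12, 64])
```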

Adaptive L2 Regularization in Person Re-Identification

Xingyang Ni, Liang Fang, Heikki Juhani Huttunen

Responsive image

Auto-TLDR; AdaptiveReID: Adaptive L2 Regularization for Person Re-identification

Slides Poster Similar

We introduce an adaptive L2 regularization mechanism termed AdaptiveReID, in the setting of person re-identification. In the literature, it is common practice to utilize hand-picked regularization factors which remain constant throughout the training procedure. Unlike existing approaches, the regularization factors in our proposed method are updated adaptively through backpropagation. This is achieved by incorporating trainable scalar variables as the regularization factors, which are further fed into a scaled hard sigmoid function. Extensive experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework. Most notably, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification. Source code will be published at https://github.com/nixingyang/AdaptiveReID.
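
The mechanism lends itself to a compact sketch: the L2 penalty weight is itself a trainable scalar passed through a (scaled) hard sigmoid, so backpropagation tunes the regularization strength alongside the network weights. The scaling constant below is an illustrative assumption.

```python
# Minimal sketch: a trainable regularization factor squashed by a hard
# sigmoid and learned jointly with the weights it regularizes.
import torch
import torch.nn.functional as F

weight = torch.randn(64, 128, requires_grad=True)   # some layer's weights
theta = torch.zeros(1, requires_grad=True)          # trainable reg factor

def l2_penalty(w, theta, scale=1e-3):
    # hardsigmoid keeps the factor in [0, 1]; scale sets its magnitude
    return scale * F.hardsigmoid(theta) * (w ** 2).sum()

task_loss = (weight @ torch.randn(128)).pow(2).mean()  # stand-in loss
loss = task_loss + l2_penalty(weight, theta)
loss.backward()           # gradients flow into both weight and theta
```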

Deep Convolutional Embedding for Digitized Painting Clustering

Giovanna Castellano, Gennaro Vessio

Responsive image

Auto-TLDR; A Deep Convolutional Embedding Model for Clustering Artworks

Slides Poster Similar

Clustering artworks is difficult for several reasons. On the one hand, recognizing meaningful patterns in accordance with domain knowledge and visual perception is extremely hard. On the other hand, applying traditional clustering and feature reduction techniques to the highly dimensional pixel space can be ineffective. To address these issues, we propose a deep convolutional embedding model for digitized painting clustering, in which the task of mapping the input raw data to an abstract, latent space is jointly optimized with the task of finding a set of cluster centroids in this latent feature space. Quantitative and qualitative experimental results show the effectiveness of the proposed method, which also outperforms other state-of-the-art deep clustering approaches on the same problem. The proposed method may be beneficial to several art-related tasks, particularly visual link retrieval and historical knowledge discovery in painting datasets.
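
The joint objective can be illustrated in the DEC style, which the sketch below follows under the assumption that the model uses a Student's t soft-assignment around learnable centroids and a sharpened target distribution; the reconstruction term that keeps the latent space faithful is omitted for brevity, and all sizes are toy choices.

```python
# Minimal DEC-style sketch: soft cluster assignments in the latent space
# are sharpened via a KL term that moves codes and centroids jointly.
import torch

z = torch.randn(32, 10, requires_grad=True)       # latent codes (encoder out)
mu = torch.randn(5, 10, requires_grad=True)       # learnable centroids

q = 1.0 / (1.0 + torch.cdist(z, mu) ** 2)         # Student's t similarity
q = q / q.sum(dim=1, keepdim=True)                # soft assignments

p = (q ** 2) / q.sum(dim=0)                       # sharpened target dist.
p = (p / p.sum(dim=1, keepdim=True)).detach()

loss = (p * (p / q).log()).sum(dim=1).mean()      # KL(p || q)
loss.backward()                                   # moves z and mu jointly
```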

Person Recognition with HGR Maximal Correlation on Multimodal Data

Yihua Liang, Fei Ma, Yang Li, Shao-Lun Huang

Responsive image

Auto-TLDR; A correlation-based multimodal person recognition framework that learns discriminative embeddings of persons by joint learning visual features and audio features

Slides Poster Similar

Multimodal person recognition is a common task in video analysis and public surveillance, where information from multiple modalities, such as images and audio extracted from videos, is used to jointly determine the identity of a person. Previous person recognition techniques either use only uni-modal data or only consider shared representations between different input modalities, leaving the extraction of their relationship with identity information to downstream tasks. Furthermore, real-world data often contain noise, which makes recognition more challenging in practical situations. In our work, we propose a novel correlation-based multimodal person recognition framework that is relatively simple but can efficaciously learn supervised information in multimodal data fusion and resist noise. Specifically, our framework learns discriminative embeddings of persons by jointly learning visual and audio features while maximizing the HGR maximal correlation among the multimodal inputs and persons' identities. Experiments are conducted on a subset of VoxCeleb2. Compared with state-of-the-art methods, the proposed method demonstrates improved accuracy and robustness to noise.
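
One common neural surrogate for the HGR maximal correlation is the "soft-HGR" objective sketched below; treat the exact form as an assumption about the paper's loss. It maximizes the correlation of paired visual and audio embeddings while penalizing the product of their covariances.

```python
# Minimal soft-HGR sketch: maximize E[f^T g] - 0.5 * tr(cov(f) cov(g))
# for zero-mean paired features f (visual) and g (audio).
import torch

def soft_hgr(f, g):
    f = f - f.mean(0)                   # center both feature sets
    g = g - g.mean(0)
    n = f.shape[0]
    corr = (f * g).sum() / n            # E[f^T g] for paired samples
    cov_f = f.t() @ f / (n - 1)
    cov_g = g.t() @ g / (n - 1)
    return -(corr - 0.5 * (cov_f * cov_g).sum())   # negate to minimize

f = torch.randn(64, 16, requires_grad=True)   # visual embeddings
g = torch.randn(64, 16, requires_grad=True)   # audio embeddings
loss = soft_hgr(f, g)
loss.backward()
```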

Text Recognition - Real World Data and Where to Find Them

Klára Janoušková, Lluis Gomez, Dimosthenis Karatzas, Jiri Matas

Responsive image

Auto-TLDR; Exploiting Weakly Annotated Images for Text Extraction

Slides Poster Similar

We present a method for exploiting weakly annotated images to improve text extraction pipelines. The approach uses an arbitrary end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions. The proposed method includes matching of imprecise transcriptions to weak annotations and an edit-distance-guided neighbourhood search. It produces nearly error-free, localised instances of scene text, which we treat as "pseudo ground truth" (PGT). We apply the method to two weakly annotated datasets. Training with the extracted PGT consistently improves the accuracy of a state-of-the-art recognition model, by 3.7% on average across different benchmark datasets (image domains), and by 24.5% on one of the weakly annotated datasets.
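
The matching step can be illustrated with plain Python: each possibly erroneous transcription is assigned to the closest weak annotation by Levenshtein distance and kept as pseudo ground truth only when the match is close enough. The threshold rule below is an illustrative assumption, not the paper's exact criterion.

```python
# Minimal sketch: match an OCR transcription to weak annotations by edit
# distance and accept near matches as pseudo ground truth (PGT).

def edit_distance(a, b):
    # Levenshtein distance via single-row dynamic programming
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[n]

weak_tags = ["STARBUCKS", "COFFEE", "OPEN"]
ocr_out = "C0FFEE"                      # possibly erroneous transcription
best = min(weak_tags, key=lambda t: edit_distance(ocr_out, t))
if edit_distance(ocr_out, best) <= 2:   # accept near matches as PGT
    print("pseudo ground truth:", best)
```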