RWMF: A Real-World Multimodal Foodlog Database

Pengfei Zhou, Cong Bai, Kaining Ying, Jie Xia, Lixin Huang

Responsive image

Auto-TLDR; Real-World Multimodal Foodlog: A Real-World Foodlog Database for Diet Assistant

Slides Poster

With the increasing health concerns on diet, it's worthwhile to develop an intelligent assistant that can help users eat healthier. Such assistants can automatically give personal advice for the users' diet and generate health reports about eating on a regular basis. To boost the research on such diet assistant, we establish a real-world foodlog database using various methods such as filter, cluster and graph convolutional network. This database is built based on real-world lifelog and medical data, which is named as Real-World Multimodal Foodlog (RWMF). It contains 7500 multimodal pairs, and each pair consists of a food image paired with a line of personal biometrics data (such as Blood Glucose) and a textual food description of food composition paired with a line of food nutrition data. In this paper, we present the detailed procedures for setting up the database. We evaluate the performance of RWMF using different food classification and cross-modal retrieval approaches. We also test the performance of multimodal fusion on RWMF through ablation experiments. The experimental results show that the RWMF database is quite challenging and can be widely used to evaluate the performance of food analysis methods based on multimodal data.

Similar papers

Picture-To-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images

Jiatong Li, Fangda Han, Ricardo Guerrero, Vladimir Pavlovic

Responsive image

Auto-TLDR; PITA: A Deep Learning Architecture for Predicting the Relative Amount of Ingredients from Food Images

Slides Poster Similar

Increased awareness of the impact of food consumption on health and lifestyle today has given rise to novel data-driven food analysis systems. Although these systems may recognize the ingredients, a detailed analysis of their amounts in the meal, which is paramount for estimating the correct nutrition, is usually ignored. In this paper, we study the novel and challenging problem of predicting the relative amount of each ingredient from a food image. We propose PITA, the Picture-to-Amount deep learning architecture to solve the problem. More specifically, we predict the ingredient amounts using a domain-driven Wasserstein loss from image-to-recipe cross-modal embeddings learned to align the two views of food data. Experiments on a dataset of recipes collected from the Internet show the model generates promising results and improves the baselines on this challenging task.

Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information

Robin Ruede, Verena Heusser, Lukas Frank, Monica Haurilet, Alina Roitberg, Rainer Stiefelhagen

Responsive image

Auto-TLDR; Pic2kcal: Learning Food Recipes from Images for Calorie Estimation

Slides Poster Similar

A rapidly growing amount of content posted online, such as food recipes, opens doors to new exciting applications at the intersection of vision and language. In this work, we aim to estimate the calorie amount of a meal directly from an image by learning from recipes people have published on the Internet, thus skipping time-consuming manual data annotation. Since there are few large-scale publicly available datasets captured in unconstrained environments, we propose the pic2kcal benchmark comprising 308,000 images from over 70,000 recipes including photographs, ingredients and instructions. To obtain nutritional information of the ingredients and automatically determine the ground-truth calorie value, we match the items in the recipes with structured information from a food item database. We evaluate various neural networks for regression of the calorie quantity and extend them with the multi-task paradigm. Our learning procedure combines the calorie estimation with prediction of proteins, carbohydrates, and fat amounts as well as a multi-label ingredient classification. Our experiments demonstrate clear benefits of multi-task learning for calorie estimation, surpassing the single-task calorie regression by 9.9%. To encourage further research on this task, we make the code for generating the dataset and the models publicly available.

Partially Supervised Multi-Task Network for Single-View Dietary Assessment

Ya Lu, Thomai Stathopoulou, Stavroula Mougiakakou

Responsive image

Auto-TLDR; Food Volume Estimation from a Single Food Image via Geometric Understanding and Semantic Prediction

Slides Poster Similar

Food volume estimation is an essential step in the pipeline of dietary assessment and demands the precise depth estimation of the food surface and table plane. Existing methods based on computer vision require either multi-image input or additional depth maps, reducing convenience of implementation and practical significance. Despite the recent advances in unsupervised depth estimation from a single image, the achieved performance in the case of large texture-less areas needs to be improved. In this paper, we propose a network architecture that jointly performs geometric understanding (i.e., depth prediction and 3D plane estimation) and semantic prediction on a single food image, enabling a robust and accurate food volume estimation regardless of the texture characteristics of the target plane. For the training of the network, only monocular videos with semantic ground truth are required, while the depth map and 3D plane ground truth are no longer needed. Experimental results on two separate food image databases demonstrate that our method performs robustly on texture-less scenarios and is superior to unsupervised networks and structure from motion based approaches, while it achieves comparable performance to fully-supervised methods.

Uncertainty-Aware Data Augmentation for Food Recognition

Eduardo Aguilar, Bhalaji Nagarajan, Rupali Khatun, Marc Bolaños, Petia Radeva

Responsive image

Auto-TLDR; Data Augmentation for Food Recognition Using Epistemic Uncertainty

Slides Poster Similar

Food recognition has recently attracted attention of many researchers. However, high food ambiguity, inter-class variability and intra-class similarity define a real challenge for the Deep learning and Computer Vision algorithms. In order to improve their performance, it is necessary to better understand what the model learns and, from this, to determine the type of data that should be additionally included for being the most beneficial to the training procedure. In this paper, we propose a new data augmentation strategy that estimates and uses the epistemic uncertainty to guide the model training. The method follows an active learning framework, where the new synthetic images are generated from the hard to classify real ones present in the training data based on the epistemic uncertainty. Hence, it allows the food recognition algorithm to focus on difficult images in order to learn their discriminatives features. On the other hand, avoiding data generation from images that do not contribute to the recognition makes it faster and more efficient. We show that the proposed method allows to improve food recognition and provides a better trade-off between micro- and macro-recall measures.

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Responsive image

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Slides Poster Similar

Recent researches demonstrate that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study, a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, which involves both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network for capturing label correlation. Then, correlation between covariance representations (i.e., feature correlation ) and the outputs of GCN (i.e., label correlation) are modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.

Transformer Reasoning Network for Image-Text Matching and Retrieval

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato

Responsive image

Auto-TLDR; A Transformer Encoder Reasoning Network for Image-Text Matching in Large-Scale Information Retrieval

Slides Poster Similar

Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from the two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.

Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Niluthpol Mithun, Ravdeep Pasricha, Evangelos Papalexakis, Amit Roy-Chowdhury

Responsive image

Auto-TLDR; Robust Joint Embedding for Image-Text Retrieval Using Web Images

Slides Similar

In this paper, we address the problem of utilizing web images in training robust joint embedding models for the image-text retrieval task. Prior webly supervised approaches directly leverage weakly annotated web images in the joint embedding learning framework. The objective of these approaches would suffer significantly when the ratio of noisy and missing tags associated with the web images is very high. In this regard, we propose a CP decomposition based tensor completion framework to refine the tags of web images by modeling observed ternary inter-relations between the sets of labeled images, tags, and web images as a tensor. To effectively deal with the high ratio of missing entries likely in our case, we incorporate intra-modal correlation as side information in the proposed framework. Our tag refinement approach combined with existing web supervised image-text embedding approaches provide a more principled way for learning the joint embedding models in the presence of significant noise from web data and limited clean labeled data. Experiments on benchmark datasets demonstrate that the proposed approach helps to achieve a significant performance gain in image-text retrieval.

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin

Responsive image

Auto-TLDR; Semantically Extended Graph Convolutional Network for Zero-shot Text Classification

Slides Poster Similar

As a challenging task of Natural Language Processing(NLP), zero-shot text classification has attracted more and more attention recently. It aims to detect classes that the model has never seen in the training set. For this purpose, a feasible way is to construct connection between the seen and unseen classes by semantic extension and classify the unseen classes by information propagation over the connection. Although many related zero-shot text classification methods have been exploited, how to realize semantic extension properly and propagate information effectively is far from solved. In this paper, we propose a novel zero-shot text classification method called Semantically Extended Graph Convolutional Network (SEGCN). In the proposed method, the semantic category knowledge from ConceptNet is utilized to semantic extension for linking seen classes to unseen classes and constructing a graph of all classes. Then, we build upon Graph Convolutional Network (GCN) for predicting the textual classifier for each category, which transfers the category knowledge by the convolution operators on the constructed graph and is trained in a semi-supervised manner using the samples of the seen classes. The experimental results on Dbpedia and 20newsgroup datasets show that our method outperforms the state of the art zero-shot text classification methods.

Price Suggestion for Online Second-Hand Items

Liang Han, Zhaozheng Yin, Zhurong Xia, Li Guo, Mingqian Tang, Rong Jin

Responsive image

Auto-TLDR; An Intelligent Price Suggestion System for Online Second-hand Items

Slides Poster Similar

This paper describes an intelligent price suggestion system for online second-hand listings. In contrast to conventional pricing strategies which are employed to a large number of identical products, or to non-identical but similar products such as homes on Airbnb, the proposed system provides price suggestions for online second-hand items which are non-identical and fall into numerous different categories. Moreover, simplifying the item listing process for users is taken into consideration when designing the price suggestion system. Specifically, we design a truncate loss to train a vision-based price suggestion module which mainly takes some vision-based features as input to first classify whether an uploaded item image is qualified for price suggestion, and then offer price suggestions for items with qualified images. For the items with unqualified images, we encourage users to input some text descriptions of the items, and with the text descriptions, we design a multimodal item retrieval module to offer price suggestions. Extensive experiments demonstrate the effectiveness of the proposed system.

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Tobia Tesan, Pasquale Coscia, Lamberto Ballan

Responsive image

Auto-TLDR; Context-Based Image Annotation with Multiple Semantic Embeddings and Recurrent Neural Networks

Slides Poster Similar

Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or non-common images needing more context to be correctly annotated. Metadata accompanying images on social-media represent an ideal source of additional information for retrieving proper neighborhoods easing image annotation task. To this end, we blend visual features extracted from neighbors and their metadata to jointly leverage context and visual cues. Our models use multiple semantic embeddings to achieve the dual objective of being robust to vocabulary changes between train and test sets and decoupling the architecture from the low-level metadata representation. Convolutional and recurrent neural networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors and query images. We perform comprehensive experiments on the NUS-WIDE dataset showing that our models outperform state-of-the-art architectures based on images and metadata, and decrease both sensory and semantic gaps to better annotate images.

Cross-Media Hash Retrieval Using Multi-head Attention Network

Zhixin Li, Feng Ling, Chuansheng Xu, Canlong Zhang, Huifang Ma

Responsive image

Auto-TLDR; Unsupervised Cross-Media Hash Retrieval Using Multi-Head Attention Network

Slides Poster Similar

The cross-media hash retrieval method is to encode multimedia data into a common binary hash space, which can effectively measure the correlation between samples from different modalities. In order to further improve the retrieval accuracy, this paper proposes an unsupervised cross-media hash retrieval method based on multi-head attention network. First of all, we use a multi-head attention network to make better matching images and texts, which contains rich semantic information. At the same time, an auxiliary similarity matrix is constructed to integrate the original neighborhood information from different modalities. Therefore, this method can capture the potential correlations between different modalities and within the same modality, so as to make up for the differences between different modalities and within the same modality. Secondly, the method is unsupervised and does not require additional semantic labels, so it has the potential to achieve large-scale cross-media retrieval. In addition, batch normalization and replacement hash code generation functions are adopted to optimize the model, and two loss functions are designed, which make the performance of this method exceed many supervised deep cross-media hash methods. Experiments on three datasets show that the average performance of this method is about 5 to 6 percentage points higher than the state-of-the-art unsupervised method, which proves the effectiveness and superiority of this method.

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Responsive image

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

Slides Poster Similar

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.

Automatic Classification of Human Granulosa Cells in Assisted Reproductive Technology Using Vibrational Spectroscopy Imaging

Marina Paolanti, Emanuele Frontoni, Giorgia Gioacchini, Giorgini Elisabetta, Notarstefano Valentina, Zacà Carlotta, Carnevali Oliana, Andrea Borini, Marco Mameli

Responsive image

Auto-TLDR; Predicting Oocyte Quality in Assisted Reproductive Technology Using Machine Learning Techniques

Slides Poster Similar

In the field of reproductive technology, the biochemical composition of female gametes has been successfully investigated with the use of vibrational spectroscopy. Currently, in assistive reproductive technology (ART), there are no shared criteria for the choice of oocyte, and automatic classification methods for the best quality oocytes have not yet been applied. In this paper, considering the lack of criteria in Assisted Reproductive Technology (ART), we use Machine Learning (ML) techniques to predict oocyte quality for a successful pregnancy. To improve the chances of successful implantation and minimize any complications during the pregnancy, Fourier transform infrared microspectroscopy (FTIRM) analysis has been applied on granulosa cells (GCs) collected along with the oocytes during oocyte aspiration, as it is routinely done in ART, and specific spectral biomarkers were selected by multivariate statistical analysis. A proprietary biological reference dataset (BRD) was successfully collected to predict the best oocyte for a successful pregnancy. Personal health information are stored, maintained and backed up using a cloud computing service. Using a user-friendly interface, the user will evaluate whether or not the selected oocyte will have a positive result. This interface includes a dashboard for retrospective analysis, reporting, real-time processing, and statistical analysis. The experimental results are promising and confirm the efficiency of the method in terms of classification metrics: precision, recall, and F1-score (F1) measures.

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Tianling Jiang, Yi Ji, Chunping Liu

Responsive image

Auto-TLDR; Integrating Historical States and Co-attention for Visual Dialog

Slides Poster Similar

Visual dialog is a typical multi-modal task which involves both vision and language. Nowadays, it faces two major difficulties. In this paper, we propose Integrating Historical States and Co-attention (HSCA) for visual dialog to solve them. It includes two main modules, Co-ATT and MATCH. Specifically, the main purpose of the Co-ATT module is to guide the image with questions and answers in the early stage to get more specific objects. It tackles the temporal sequence issue in historical information which may influence the precise answer for multi-round questions. The MATCH module is, based on a question with pronouns, to retrieve the best matching historical information block. It overcomes the visual reference problem which requires to solve pronouns referring to unknowns in the text message and then to locate the objects in the given image. We quantitatively and qualitatively evaluate our model on VisDial v1.0, at the same time, ablation studies are carried out. The experimental results demonstrate that HSCA outperforms the state-of-the-art methods in many aspects.

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

Ekraam Sabir, Ayush Jaiswal, Wael Abdalmageed, Prem Natarajan

Responsive image

Auto-TLDR; Scalable Image Repurposing Detection with Graph Neural Network Based Model

Slides Poster Similar

Image repurposing is a category of fake news where a digitally unmanipulated image is misrepresented by means of its accompanying metadata such as captions, location, etc., where the image and accompanying metadata together comprise a multimedia package. The problem setup is to authenticate a query multimedia package using a reference dataset of potentially related packages as evidences. Existing methods are limited to using a single evidence (retrieved package), which ignores potential performance improvement from the use of multiple evidences. In this work, we introduce a novel graph neural network based model for image repurposing detection, which effectively utilizes multiple retrieved packages as evidences and is scalable with the number of evidences. We compare the scalability and performance of our model against existing methods. Experimental results show that the proposed model outperforms existing state-of-the-art for image repurposing detection with an error reduction of up to 25%.

Hierarchical Multimodal Attention for Deep Video Summarization

Melissa Sanabria, Frederic Precioso, Thomas Menguy

Responsive image

Auto-TLDR; Automatic Summarization of Professional Soccer Matches Using Event-Stream Data and Multi- Instance Learning

Slides Poster Similar

The way people consume sports on TV has drastically evolved in the last years, particularly under the combined effects of the legalization of sport betting and the huge increase of sport analytics. Several companies are nowadays sending observers in the stadiums to collect live data of all the events happening on the field during the match. Those data contain meaningful information providing a very detailed description of all the actions occurring during the match to feed the coaches and staff, the fans, the viewers, and the gamblers. Exploiting all these data, sport broadcasters want to generate extra content such as match highlights, match summaries, players and teams analytics, etc., to appeal subscribers. This paper explores the problem of summarizing professional soccer matches as automatically as possible using both the aforementioned event-stream data collected from the field and the content broadcasted on TV. We have designed an architecture, introducing first (1) a Multiple Instance Learning method that takes into account the sequential dependency among events and then (2) a hierarchical multimodal attention layer that grasps the importance of each event in an action. We evaluate our approach on matches from two professional European soccer leagues, showing its capability to identify the best actions for automatic summarization by comparing with real summaries made by human operators.

Assessing the Severity of Health States Based on Social Media Posts

Shweta Yadav, Joy Prakash Sain, Amit Sheth, Asif Ekbal, Sriparna Saha, Pushpak Bhattacharyya

Responsive image

Auto-TLDR; A Multiview Learning Framework for Assessment of Health State in Online Health Communities

Slides Poster Similar

The unprecedented growth of Internet users has resulted in an abundance of unstructured information on social media including health forums, where patients request health-related information or opinions from other users. Previous studies have shown that online peer support has limited effectiveness without expert intervention. Therefore, a system capable of assessing the severity of health state from the patients' social media posts can help health professionals (HP) in prioritizing the user’s post. In this study, we inspect the efficacy of different aspects of Natural Language Understanding (NLU) to identify the severity of the user’s health state in relation to two perspectives(tasks) (a) Medical Condition (i.e., Recover, Exist, Deteriorate, Other) and (b) Medication (i.e., Effective, Ineffective, Serious Adverse Effect, Other) in online health communities. We propose a multiview learning framework that models both the textual content as well as contextual-information to assess the severity of the user’s health state. Specifically, our model utilizes the NLU views such as sentiment, emotions, personality, and use of figurative language to extract the contextual information. The diverse NLU views demonstrate its effectiveness on both the tasks and as well as on the individual disease to assess a user’s health.

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Li Ren, Kai Li, Liqiang Wang, Kien Hua

Responsive image

Auto-TLDR; Adversarial Discriminative Domain Regularization for Efficient Cross-Modal Matching

Slides Poster Similar

Matching information across image and text modalities is a fundamental challenge for many applications that involve both vision and natural language processing. The objective is to find efficient similarity metrics to compare the similarity between visual and textual information. Existing approaches mainly match the local visual objects and the sentence words in a shared space with attention mechanisms. The matching performance is still limited because the similarity computation is based on simple comparisons of the matching features, ignoring the characteristics of their distribution in the data. In this paper, we address this limitation with an efficient learning objective that considers the discriminative feature distributions between the visual objects and sentence words. Specifically, we propose a novel Adversarial Discriminative Domain Regularization (ADDR) learning framework, beyond the paradigm metric learning objective, to construct a set of discriminative data domains within each image-text pairs. Our approach can generally improve the learning efficiency and the performance of existing metrics learning frameworks by regulating the distribution of the hidden space between the matching pairs. The experimental results show that this new approach significantly improves the overall performance of several popular cross-modal matching techniques (SCAN, VSRN, BFAN) on the MS-COCO and Flickr30K benchmarks.

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Hui Yuan, Yan Huang, Dongbo Zhang, Zerui Chen, Wenlong Cheng, Liang Wang

Responsive image

Auto-TLDR; Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Slides Poster Similar

Image-text matching has made great progresses recently, but there still remains challenges in fine-grained matching. To deal with this problem, we propose an Improved Visual Semantic Reasoning model (VSR++), which jointly models 1) global alignment between images and texts and 2) local correspondence between regions and words in a unified framework. To exploit their complementary advantages, we also develop a suitable learning strategy to balance their relative importance. As a result, our model can distinguish image regions and text words in a fine-grained level, and thus achieves the current stateof-the-art performance on two benchmark datasets.

Hybrid Decomposition Convolution Neural Network and Vocabulary Forest for Image Retrieval

Djenouri Youcef, Jon Hjelmervik

Responsive image

Auto-TLDR; DCNN-vForest: Convolutional Neural Network and Vocabulary Forest for Efficient Image Retrieval

Slides Poster Similar

This paper introduces a highly efficient image retrieval technique called DCNN-vForest (Decomposition Convolution Neural Network and vocabulary Forest), which aims to retrieve the relevant images to the given image query by studying the correlation between images in the image database based on decomposition. The regional and global features of the image database are first extracted using the convolution neural network, and then divided into similar clusters using the Kmeans algorithm. We propose a new structure called vForest (vocabulary Forest), by calculating the vocabulary tree on each cluster of images. The retrieval process benefits from the knowledge provided by the vForest, and instead of considering the whole image database, only the most similar clusters to the image query are explored. To demonstrate the usefulness of our approach, intensive experiments have been carried out on ground-truth image databases, the results reveal the superiority of DCNN-vForest against the baseline image retrieval solutions, in terms of runtime and accuracy.

BAT Optimized CNN Model Identifies Water Stress in Chickpea Plant Shoot Images

Shiva Azimi, Taranjit Kaur, Tapan Gandhi

Responsive image

Auto-TLDR; BAT Optimized ResNet-18 for Stress Classification of chickpea shoot images under water deficiency

Slides Poster Similar

Stress due to water deficiency in plants can significantly lower the agricultural yield. It can affect many visible plant traits such as size and surface area, the number of leaves and their color, etc. In recent years, computer vision-based plant phenomics has emerged as a promising tool for plant research and management. Such techniques have the advantage of being non-destructive, non-evasive, fast, and offer high levels of automation. Pulses like chickpeas play an important role in ensuring food security in poor countries owing to their high protein and nutrition content. In the present work, we have built a dataset comprising of two varieties of chickpea plant shoot images under different moisture stress conditions. Specifically, we propose a BAT optimized ResNet-18 model for classifying stress induced by water deficiency using chickpea shoot images. BAT algorithm identifies the optimal value of the mini-batch size to be used for training rather than employing the traditional manual approach of trial and error. Experimentation on two crop varieties (JG and Pusa) reveals that BAT optimized approach achieves an accuracy of 96% and 91% for JG and Pusa varieties that is better than the traditional method by 4%. The experimental results are also compared with state of the art CNN models like Alexnet, GoogleNet, and ResNet-50. The comparison results demonstrate that the proposed BAT optimized ResNet-18 model achieves higher performance than the comparison counterparts.

Discrete Semantic Matrix Factorization Hashing for Cross-Modal Retrieval

Jianyang Qin, Lunke Fei, Shaohua Teng, Wei Zhang, Genping Zhao, Haoliang Yuan

Responsive image

Auto-TLDR; Discrete Semantic Matrix Factorization Hashing for Cross-Modal Retrieval

Slides Poster Similar

Hashing has been widely studied for cross-modal retrieval due to its promising efficiency and effectiveness in massive data analysis. However, most existing supervised hashing has the limitations of inefficiency for very large-scale search and intractable discrete constraint for hash codes learning. In this paper, we propose a new supervised hashing method, namely, Discrete Semantic Matrix Factorization Hashing (DSMFH), for cross-modal retrieval. First, we conduct the matrix factorization via directly utilizing the available label information to obtain a latent representation, so that both the inter-modality and intra-modality similarities are well preserved. Then, we simultaneously learn the discriminative hash codes and corresponding hash functions by deriving the matrix factorization into a discrete optimization. Finally, we adopt an alternatively iterative procedure to efficiently optimize the matrix factorization and discrete learning. Extensive experimental results on three widely used image-tag databases demonstrate the superiority of the DSMFH over state-of-the-art cross-modal hashing methods.

Multi-Modal Identification of State-Sponsored Propaganda on Social Media

Xiaobo Guo, Soroush Vosoughi

Responsive image

Auto-TLDR; A balanced dataset for detecting state-sponsored Internet propaganda

Slides Poster Similar

The prevalence of state-sponsored propaganda on the Internet has become a cause for concern in the recent years. While much effort has been made to identify state-sponsored Internet propaganda, the problem remains far from being solved because the ambiguous definition of propaganda leads to unreliable data labelling, and the huge amount of potential predictive features causes the models to be inexplicable. This paper is the first attempt to build a balanced dataset for this task. The dataset is comprised of propaganda by three different organizations across two time periods. A multi-model framework for detecting propaganda messages solely based on the visual and textual content is proposed which achieves a promising performance on detecting propaganda by the three organizations both for the same time period (training and testing on data from the same time period) (F1=0.869) and for different time periods (training on past, testing on future) (F1=0.697). To reduce the influence of false positive predictions, we change the threshold to test the relationship between the false positive and true positive rates and provide explanations for the predictions made by our models with visualization tools to enhance the interpretability of our framework. Our new dataset and general framework provide a strong benchmark for the task of identifying state-sponsored Internet propaganda and point out a potential path for future work on this task.

Label or Message: A Large-Scale Experimental Survey of Texts and Objects Co-Occurrence

Koki Takeshita, Juntaro Shioyama, Seiichi Uchida

Responsive image

Auto-TLDR; Large-scale Survey of Co-occurrence between Objects and Scene Text with a State-of-the-art Scene Text detector and Recognizer

Similar

Our daily life is surrounded by textual information. Nowadays, the automatic collection of textual information becomes possible owing to the drastic improvement of scene text detectors and recognizer. The purpose of this paper is to conduct a large-scale survey of co-occurrence between visual objects (such as book and car) and scene texts with a large image dataset and a state-of-the-art scene text detector and recognizer. Especially, we focus on the function of ``label'' texts, which are attached to objects for detailing the objects. By analyzing co-occurrence between objects and scene texts, it is possible to observe the statistics about the label texts and understand how the scene texts will be useful for recognizing the objects and vice versa.

Deep Convolutional Embedding for Digitized Painting Clustering

Giovanna Castellano, Gennaro Vessio

Responsive image

Auto-TLDR; A Deep Convolutional Embedding Model for Clustering Artworks

Slides Poster Similar

Clustering artworks is difficult because of several reasons. On one hand, recognizing meaningful patterns in accordance with domain knowledge and visual perception is extremely hard. On the other hand, the application of traditional clustering and feature reduction techniques to the highly dimensional pixel space can be ineffective. To address these issues, we propose to use a deep convolutional embedding model for digitized painting clustering, in which the task of mapping the input raw data to an abstract, latent space is jointly optimized with the task of finding a set of cluster centroids in this latent feature space. Quantitative and qualitative experimental results show the effectiveness of the proposed method. The model is also able to outperform other state-of-the-art deep clustering approaches to the same problem. The proposed method may be beneficial to several art-related tasks, particularly visual link retrieval and historical knowledge discovery in painting datasets.

Automatic Annotation of Corpora for Emotion Recognition through Facial Expressions Analysis

Alex Mircoli, Claudia Diamantini, Domenico Potena, Emanuele Storti

Responsive image

Auto-TLDR; Automatic annotation of video subtitles on the basis of facial expressions using machine learning algorithms

Slides Poster Similar

The recent diffusion of social networks has made available an unprecedented amount of user-generated content, which may be analyzed in order to determine people's opinions and emotions about a large variety of topics. Research has made many efforts in defining accurate algorithms for analyzing emotions expressed by users in texts; however, their performance often rely on the existence of large annotated datasets, whose current scarcity represents a major issue. The manual creation of such datasets represents a costly and time-consuming activity and hence there is an increasing demand for techniques for the automatic annotation of corpora. In this work we present a methodology for the automatic annotation of video subtitles on the basis of the analysis of facial expressions of people in videos, with the goal of creating annotated corpora that may be used to train emotion recognition algorithms. Facial expressions are analyzed through machine learning algorithms, on the basis of a set of manually-engineered facial features that are extracted from video frames. The soundness of the proposed methodology has been evaluated through an extensive experimentation aimed at determining the performance on real datasets of each methodological step.

JECL: Joint Embedding and Cluster Learning for Image-Text Pairs

Sean Yang, Kuan-Hao Huang, Bill Howe

Responsive image

Auto-TLDR; JECL: Clustering Image-Caption Pairs with Parallel Encoders and Regularized Clusters

Poster Similar

We propose JECL, a method for clustering image-caption pairs by training parallel encoders with regularized clustering and alignment objectives, simultaneously learning both representations and cluster assignments. These image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce, but free-text descriptions are common. JECL trains by minimizing the Kullback-Leibler divergence between the distribution of the images and text to that of a combined joint target distribution and optimizing the Jensen-Shannon divergence between the soft cluster assignments of the images and text. Regularizers are also applied to JECL to prevent trivial solutions. Experiments show that JECL outperforms both single-view and multi-view methods on large benchmark image-caption datasets, and is remarkably robust to missing captions and varying data sizes.

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Edward Kim, Connor Onweller, Kathleen F. Mccoy

Responsive image

Auto-TLDR; A multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to blind or visually impaired

Slides Similar

We present a multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to a person who is blind or visually impaired. The framework utilizes the visual, textual, positional, and size characteristics extracted from the image to create the summary. Different and complimentary neural architectures are optimized for each task using crowdsourced training data. From our quantitative experiments and results, we explain the reasoning behind our framework and show the effectiveness of our models. Our qualitative results showcase text generated from our framework and show that Mechanical Turk participants favor them to other automatic and human generated summarizations. We describe the design and of of an experiment to evaluate the utility of our system for people who have visual impairments in the context of understanding Twitter Tweets containing line graphs.

RGB-Infrared Person Re-Identification Via Image Modality Conversion

Huangpeng Dai, Qing Xie, Yanchun Ma, Yongjian Liu, Shengwu Xiong

Responsive image

Auto-TLDR; CE2L: A Novel Network for Cross-Modality Re-identification with Feature Alignment

Slides Poster Similar

As a cross modality retrieval task, RGB-infrared person re-identification(Re-ID) is an important and challenging tasking, because of its important role in video surveillance applications and large cross-modality variations between visible and infrared images. Most previous works addressed the problem of cross-modality gap with feature alignment by original feature representation learning straightly. In this paper, different from existing works, we propose a novel network(CE2L) to tackle the cross-modality gap with feature alignment. CE2L mainly focuses on adding discriminative information and learning robust features by converting modality between visible and infrared images. Its merits are highlighted in two aspects: 1)Using CycleGAN to convert infrared images into color images can not only increase the recognition characteristics of images, but also allow the our network to better learn the two modal image features; 2)Our novel method can serve as data augmentation. Specifically, it can increase data diversity and total data against over-fitting by converting labeled training images to another modal images. Extensive experimental results on two datasets demonstrate superior performance compared to the baseline and the state-of-the-art methods.

VSB^2-Net: Visual-Semantic Bi-Branch Network for Zero-Shot Hashing

Xin Li, Xiangfeng Wang, Bo Jin, Wenjie Zhang, Jun Wang, Hongyuan Zha

Responsive image

Auto-TLDR; VSB^2-Net: inductive zero-shot hashing for image retrieval

Slides Poster Similar

Zero-shot hashing aims at learning hashing model from seen classes and the obtained model is capable of generalizing to unseen classes for image retrieval. Inspired by zero-shot learning, existing zero-shot hashing methods usually transfer the supervised knowledge from seen to unseen classes, by embedding the hamming space to a shared semantic space. However, this makes instances difficult to distinguish due to limited hashing bit numbers, especially for semantically similar unseen classes. We propose a novel inductive zero-shot hashing framework, i.e., VSB^2-Net, where both semantic space and visual feature space are embedded to the same hamming space instead. The reconstructive semantic relationships are established in the hamming space, preserving local similarity relationships and explicitly enlarging the discrepancy between semantic hamming vectors. A two-task architecture, comprising of classification module and visual feature reconstruction module, is employed to enhance the generalization and transfer abilities. Extensive evaluation results on several benchmark datasets demonstratethe superiority of our proposed method compared to several state-of-the-art baselines.

Fast Discrete Cross-Modal Hashing Based on Label Relaxation and Matrix Factorization

Donglin Zhang, Xiaojun Wu, Zhen Liu, Jun Yu, Josef Kittler

Responsive image

Auto-TLDR; LRMF: Label Relaxation and Discrete Matrix Factorization for Cross-Modal Retrieval

Poster Similar

In recent years, cross-media retrieval has drawn considerable attention due to the exponential growth of multimedia data. Many hashing approaches have been proposed for the cross-media search task. However, there are still open problems that warrant investigation. For example, most existing supervised hashing approaches employ a binary label matrix, which achieves small margins between wrong labels (0) and true labels (1). This may affect the retrieval performance by generating many false negatives and false positives. In addition, some methods adopt a relaxation scheme to solve the binary constraints, which may cause large quantization errors. There are also some discrete hashing methods that have been presented, but most of them are time-consuming. To conquer these problems, we present a label relaxation and discrete matrix factorization method (LRMF) for cross-modal retrieval. It offers a number of innovations. First of all, the proposed approach employs a novel label relaxation scheme to control the margins adaptively, which has the benefit of reducing the quantization error. Second, by virtue of the proposed discrete matrix factorization method designed to learn the binary codes, large quantization errors caused by relaxation can be avoided. The experimental results obtained on two widely-used databases demonstrate that LRMF outperforms state-of-the-art cross-media methods.

To Honor Our Heroes: Analysis of the Obituaries of Australians Killed in Action in WWI and WWII

Marc Cheong, Mark Alfano

Responsive image

Auto-TLDR; Obituaries of World War I and World War II: A Map of Values and Virtues attributed to Australian Military Personnel

Slides Poster Similar

Obituaries represent a prominent way of expressing the human universal of grief. According to philosophers, obituaries are a ritualized way of evaluating both individuals who have passed away and the communities that helped to shape them. The basic idea is that you can tell what it takes to count as a good person of a particular type in a particular community by seeing how persons of that type are described and celebrated in their obituaries. Obituaries of those killed in conflict, in particular, are rich repositories of communal values, as they reflect the values and virtues that are admired and respected in individuals who are considered to be heroes in their communities. In this paper, we use natural language processing techniques to map the patterns of values and virtues attributed to Australian military personnel who were killed in action during World War I and World War II. Doing so reveals several clusters of values and virtues that tend to be attributed together. In addition, we use named entity recognition and geotagging the track the movements of these soldiers to various theatres of the wars, including North Africa, Europe, and the Pacific.

On Identification and Retrieval of Near-Duplicate Biological Images: A New Dataset and Protocol

Thomas E. Koker, Sai Spandana Chintapalli, San Wang, Blake A. Talbot, Daniel Wainstock, Marcelo Cicconet, Mary C. Walsh

Responsive image

Auto-TLDR; BINDER: Bio-Image Near-Duplicate Examples Repository for Image Identification and Retrieval

Poster Similar

Manipulation and re-use of images in scientific publications is a growing issue, not only for biomedical publishers, but also for the research community in general. In this work we introduce BINDER -- Bio-Image Near-Duplicate Examples Repository, a novel dataset to help researchers develop, train, and test models to detect same-source biomedical images. BINDER contains 7,490 unique image patches for model training, 1,821 same-size patch duplicates for validation and testing, and 868 different-size image/patch pairs for image retrieval validation and testing. Except for the training set, patches already contain manipulations including rotation, translation, scale, perspective transform, contrast adjustment and/or compression artifacts. We further use the dataset to demonstrate how novel adaptations of existing image retrieval and metric learning models can be applied to achieve high-accuracy inference results, creating a baseline for future work. In aggregate, we thus present a supervised protocol for near-duplicate image identification and retrieval without any "real-world" training example. Our dataset and source code are available at hms-idac.github.io/BINDER.

Face Anti-Spoofing Using Spatial Pyramid Pooling

Lei Shi, Zhuo Zhou, Zhenhua Guo

Responsive image

Auto-TLDR; Spatial Pyramid Pooling for Face Anti-Spoofing

Slides Poster Similar

Face recognition system is vulnerable to many kinds of presentation attacks, so how to effectively detect whether the image is from the real face is particularly important. At present, many deep learning-based anti-spoofing methods have been proposed. But these approaches have some limitations, for example, global average pooling (GAP) easily loses local information of faces, single-scale features easily ignore information differences in different scales, while a complex network is prune to be overfitting. In this paper, we propose a face anti-spoofing approach using spatial pyramid pooling (SPP). Firstly, we use ResNet-18 with a small amount of parameter as the basic model to avoid overfitting. Further, we use spatial pyramid pooling module in the single model to enhance local features while fusing multi-scale information. The effectiveness of the proposed method is evaluated on three databases, CASIA-FASD, Replay-Attack and CASIA-SURF. The experimental results show that the proposed approach can achieve state-of-the-art performance.

Self-Supervised Learning with Graph Neural Networks for Region of Interest Retrieval in Histopathology

Yigit Ozen, Selim Aksoy, Kemal Kosemehmetoglu, Sevgen Onder, Aysegul Uner

Responsive image

Auto-TLDR; Self-supervised Contrastive Learning for Deep Representation Learning of Histopathology Images

Slides Poster Similar

Deep learning has achieved successful performance in representation learning and content-based retrieval of histopathology images. The commonly used setting in deep learning-based approaches is supervised training of deep neural networks for classification, and using the trained model to extract representations that are used for computing and ranking the distances between images. However, there are two remaining major challenges. First, supervised training of deep neural networks requires large amount of manually labeled data which is often limited in the medical field. Transfer learning has been used to overcome this challenge, but its success remained limited. Second, the clinical practice in histopathology necessitates working with regions of interest (ROI) of multiple diagnostic classes with arbitrary shapes and sizes. The typical solution to this problem is to aggregate the representations of fixed-sized patches cropped from these regions to obtain region-level representations. However, naive methods cannot sufficiently exploit the rich contextual information in the complex tissue structures. To tackle these two challenges, we propose a generic method that utilizes graph neural networks (GNN), combined with a self-supervised training method using a contrastive loss. GNN enables representing arbitrarily-shaped ROIs as graphs and encoding contextual information. Self-supervised contrastive learning improves quality of learned representations without requiring labeled data. The experiments using a challenging breast histopathology data set show that the proposed method achieves better performance than the state-of-the-art.

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Responsive image

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

Slides Similar

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the communities of computer vision and natural language processing. Reading and reasoning about texts and visual contents in the images is a burgeoning and important research topic in VQA, especially for the visually impaired assistance applications. Given an image, it aims to predict an answer to a provided natural language question closely related to its textual contents. In this paper, we propose a novel end-to-end textual content based VQA model, which grounds question answering both on the visual and textual information. After encoding the image, question and recognized text words, it uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features respectively. The complex correlations among different features can be captured efficiently. To ensure the model's extendibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts semantic embedding based classifier to perform answer prediction. Extensive experiments on the newly proposed benchmark TextVQA demonstrate that the proposed model can achieve promising results.

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Jianrong Wang, Tong Wu, Shanyu Wang, Mei Yu, Qiang Fang, Ju Zhang, Li Liu

Responsive image

Auto-TLDR; Lip Motion Network for Text-Independent and Text-Dependent Speaker Recognition

Slides Poster Similar

Lip motion reflects behavior characteristics of speakers, and thus can be used as a new kind of biometrics in speaker recognition. In the literature, lots of works used two dimensional (2D) lip images to recognize speaker in a text-dependent context. However, 2D lip easily suffers from face orientations. To this end, in this work, we present a novel end-to-end 3D lip motion Network (3LMNet) by utilizing the sentence-level 3D lip motion (S3DLM) to recognize speakers in both the text-independent and text-dependent contexts. A novel regional feedback module (RFM) is proposed to explore attentions in different lip regions. Besides, prior knowledge of lip motion is investigated to complement RFM, where landmark-level and frame-level features are merged to form a better feature representation. Moreover, we present two methods, i.e., coordinate transformation and face posture correction to pre-process the LSD-AV dataset, which contains 68 speakers and 146 sentences per speaker. The evaluation results on this dataset demonstrate that our proposed 3LMNet is superior to the baseline models, i.e., LSTM, VGG-16 and ResNet-34, and outperforms the state-of-the-art using 2D lip image as well as the 3D face. The code of this work is released at https://github.com/wutong18/Three-Dimensional-Lip-Motion-Ne twork-for-Text-Independent-Speaker-Recognition.

Person Recognition with HGR Maximal Correlation on Multimodal Data

Yihua Liang, Fei Ma, Yang Li, Shao-Lun Huang

Responsive image

Auto-TLDR; A correlation-based multimodal person recognition framework that learns discriminative embeddings of persons by joint learning visual features and audio features

Slides Poster Similar

Multimodal person recognition is a common task in video analysis and public surveillance, where information from multiple modalities, such as images and audio extracted from videos, are used to jointly determine the identity of a person. Previous person recognition techniques either use only uni-modal data or only consider shared representations between different input modalities, while leaving the extraction of their relationship with identity information to downstream tasks. Furthermore, real-world data often contain noise, which makes recognition more challenging practical situations. In our work, we propose a novel correlation-based multimodal person recognition framework that is relatively simple but can efficaciously learn supervised information in multimodal data fusion and resist noise. Specifically, our framework learns a discriminative embeddings of persons by joint learning visual features and audio features while maximizing HGR maximal correlation among multimodal input and persons' identities. Experiments are done on a subset of Voxceleb2. Compared with state-of-the-art methods, the proposed method demonstrates an improvement of accuracy and robustness to noise.

A Systematic Investigation on End-To-End Deep Recognition of Grocery Products in the Wild

Marco Leo, Pierluigi Carcagni, Cosimo Distante

Responsive image

Auto-TLDR; Automatic Recognition of Products on grocery shelf images using Convolutional Neural Networks

Slides Poster Similar

Automatic recognition of products on grocery shelf images is a new and attractive topic in computer vision and machine learning since, it can be exploited in different application areas. This paper introduces a complete end-to-end pipeline (without preliminary radiometric and spatial transformations usually involved while dealing with the considered issue) and it provides a systematic investigation of recent machine learning models based on convolutional neural networks for addressing the product recognition task by exploiting the proposed pipeline on a recent challenging grocery product dataset. The investigated models were never been used in this context: they derive from the successful and more generic object recognition task and have been properly tuned to address this specific issue. Besides, also ensembles of nets built by most advanced theoretical fundaments have been taken into account. Gathered classification results were very encouraging since the recognition accuracy has been improved up to 15\% with respect to the leading approaches in the state of art on the same dataset. A discussion about the pros and cons of the investigated solutions are discussed by paving the path towards new research lines.

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai

Responsive image

Auto-TLDR; Attention-based Deep Metric Learning for Near-duplicate Video Retrieval

Slides Similar

Near-duplicate video retrieval (NDVR) is an important and challenging problem due to the increasing amount of videos uploaded to the Internet. In this paper, we propose an attention-based deep metric learning method for NDVR. Our method is based on well-established principles: We leverage two-stream networks to combine RGB and optical flow features, and incorporate an attention module to effectively deal with distractor frames commonly observed in near duplicate videos. We further aggregate the features corresponding to multiple video segments to enhance the discriminative power. The whole system is trained using a deep metric learning objective with a Siamese architecture. Our experiments show that the attention module helps eliminate redundant and noisy frames, while focusing on visually relevant frames for solving NVDR. We evaluate our approach on recent large-scale NDVR datasets, CC_WEB_VIDEO, VCDB, FIVR and SVD. To demonstrate the generalization ability of our approach, we report results in both within- and cross-dataset settings, and show that the proposed method significantly outperforms state-of-the-art approaches.

Prior Knowledge about Attributes: Learning a More Effective Potential Space for Zero-Shot Recognition

Chunlai Chai, Yukuan Lou

Responsive image

Auto-TLDR; Attribute Correlation Potential Space Generation for Zero-Shot Learning

Slides Poster Similar

Zero-shot learning (ZSL) aims to recognize unseen classes accurately by learning seen classes and known attributes, but correlations in attributes were ignored by previous study which lead to classification results confused. To solve this problem, we build an Attribute Correlation Potential Space Generation (ACPSG) model which uses a graph convolution network and attribute correlation to generate a more discriminating potential space. Combining potential discrimination space and user-defined attribute space, we can better classify unseen classes. Our approach outperforms some existing state-of-the-art methods on several benchmark datasets, whether it is conventional ZSL or generalized ZSL.

Multi-Graph Convolutional Network for Relationship-Driven Stock Movement Prediction

Jiexia Ye, Juanjuan Zhao, Kejiang Ye, Cheng-Zhong Xu

Responsive image

Auto-TLDR; Multi-GCGRU: A Deep Learning Framework for Stock Price Prediction with Cross Effect

Slides Poster Similar

Stock price movement prediction is commonly accepted as a very challenging task due to the volatile nature of financial markets. Previous works typically predict the stock price mainly based on its own information, neglecting the cross effect among involved stocks. However, it is well known that an individual stock price is correlated with prices of other stocks in complex ways. To take the cross effect into consideration, we propose a deep learning framework, called Multi-GCGRU, which comprises graph convolutional network (GCN) and gated recurrent units (GRU) to predict stock movement. Specifically, we first encode multiple relationships among stocks into graphs based on financial domain knowledge and utilize GCN to extract the cross effect based on the pre-defined graphs. The cross-correlation features produced by GCN are concatenated with historical records and fed into GRU to model the temporal pattern in stock price. To further get rid of prior knowledge, we explore an adaptive stock graph learned by data automatically. Experiments on two stock indexes in China market show that our model outperforms other baselines. Note that our model is rather feasible to incorporate more effective pre-defined stock relationships. What's more, it can also learn a data-driven relationship without any domain knowledge.

Weight Estimation from an RGB-D Camera in Top-View Configuration

Marco Mameli, Marina Paolanti, Nicola Conci, Filippo Tessaro, Emanuele Frontoni, Primo Zingaretti

Responsive image

Auto-TLDR; Top-View Weight Estimation using Deep Neural Networks

Slides Poster Similar

The development of so-called soft-biometrics aims at providing information related to the physical and behavioural characteristics of a person. This paper focuses on bodyweight estimation based on the observation from a top-view RGB-D camera. In fact, the capability to estimate the weight of a person can be of help in many different applications, from health-related scenarios to business intelligence and retail analytics. To deal with this issue, a TVWE (Top-View Weight Estimation) framework is proposed with the aim of predicting the weight. The approach relies on the adoption of Deep Neural Networks (DNNs) that have been trained on depth data. Each network has also been modified in its top section to replace classification with prediction inference. The performance of five state-of-art DNNs has been compared, namely VGG16, ResNet, Inception, DenseNet and Efficient-Net. In addition, a convolutional auto-encoder has also been included for completeness. Considering the limited literature in this domain, the TVWE framework has been evaluated on a new publicly available dataset: “VRAI Weight estimation Dataset”, which also collects, for each subject, labels related to weight, gender, and height. The experimental results have demonstrated that the proposed methods are suitable for this task, bringing different and significant insights for the application of the solution in different domains.

Open Set Domain Recognition Via Attention-Based GCN and Semantic Matching Optimization

Xinxing He, Yuan Yuan, Zhiyu Jiang

Responsive image

Auto-TLDR; Attention-based GCN and Semantic Matching Optimization for Open Set Domain Recognition

Slides Poster Similar

Open set domain recognition has got the attention in recent years. The task aims to specifically classify each sample in the practical unlabeled target domain, which consists of all known classes in the manually labeled source domain and target-specific unknown categories. The absence of annotated training data or auxiliary attribute information for unknown categories makes this task especially difficult. Moreover, exiting domain discrepancy in label space and data distribution further distracts the knowledge transferred from known classes to unknown classes. To address these issues, this work presents an end-to-end model based on attention-based GCN and semantic matching optimization, which first employs the attention mechanism to enable the central node to learn more discriminating representations from its neighbors in the knowledge graph. Moreover, a coarse-to-fine semantic matching optimization approach is proposed to progressively bridge the domain gap. Experimental results validate that the proposed model not only has superiority on recognizing the images of known and unknown classes, but also can adapt to various openness of the target domain.

Making Every Label Count: Handling Semantic Imprecision by Integrating Domain Knowledge

Clemens-Alexander Brust, Björn Barz, Joachim Denzler

Responsive image

Auto-TLDR; Class Hierarchies for Imprecise Label Learning and Annotation eXtrapolation

Slides Poster Similar

Noisy data, crawled from the web or supplied by volunteers such as Mechanical Turkers or citizen scientists, is considered an alternative to professionally labeled data. There has been research focused on mitigating the effects of label noise. It is typically modeled as inaccuracy, where the correct label is replaced by an incorrect label from the same set. We consider an additional dimension of label noise: imprecision. For example, a non-breeding snow bunting is labeled as a bird. This label is correct, but not as precise as the task requires. Standard softmax classifiers cannot learn from such a weak label because they consider all classes mutually exclusive, which non-breeding snow bunting and bird are not. We propose CHILLAX (Class Hierarchies for Imprecise Label Learning and Annotation eXtrapolation), a method based on hierarchical classification, to fully utilize labels of any precision. Experiments on noisy variants of NABirds and ILSVRC2012 show that our method outperforms strong baselines by as much as 16.4 percentage points, and the current state of the art by up to 3.9 percentage points.

Fusion of Global-Local Features for Image Quality Inspection of Shipping Label

Sungho Suh, Paul Lukowicz, Yong Oh Lee

Responsive image

Auto-TLDR; Input Image Quality Verification for Automated Shipping Address Recognition and Verification

Slides Poster Similar

The demands of automated shipping address recognition and verification have increased to handle a large number of packages and to save costs associated with misdelivery. A previous study proposed a deep learning system where the shipping address is recognized and verified based on a camera image capturing the shipping address and barcode area. Because the system performance depends on the input image quality, inspection of input image quality is necessary for image preprocessing. In this paper, we propose an input image quality verification method combining global and local features. Object detection and scale-invariant feature transform in different feature spaces are developed to extract global and local features from several independent convolutional neural networks. The conditions of shipping label images are classified by fully connected fusion layers with concatenated global and local features. The experimental results regarding real captured and generated images show that the proposed method achieves better performance than other methods. These results are expected to improve the shipping address recognition and verification system by applying different image preprocessing steps based on the classified conditions.

A Systematic Investigation on Deep Architectures for Automatic Skin Lesions Classification

Pierluigi Carcagni, Marco Leo, Andrea Cuna, Giuseppe Celeste, Cosimo Distante

Responsive image

Auto-TLDR; RegNet: Deep Investigation of Convolutional Neural Networks for Automatic Classification of Skin Lesions

Slides Poster Similar

Computer vision-based techniques are more and more employed in healthcare and medical fields nowadays in order, principally, to be as a support to the experienced medical staff to help them to make a quick and correct diagnosis. One of the hot topics in this arena concerns the automatic classification of skin lesions. Several promising works exist about it, mainly leveraging Convolutional Neural Networks (CNN), but proposed pipeline mainly rely on complex data preprocessing and there is no systematic investigation about how available deep models can actually reach the accuracy needed for real applications. In order to overcome these drawbacks, in this work, an end-to-end pipeline is introduced and some of the most recent Convolutional Neural Networks (CNNs) architectures are included in it and compared on the largest common benchmark dataset recently introduced. To this aim, for the first time in this application context, a new network design paradigm, namely RegNet, has been exploited to get the best models among a population of configurations. The paper introduces a threefold level of contribution and novelty with respect the previous literature: the deep investigation of several CNN architectures driving to a consistent improvement of the lesions recognition accuracy, the exploitation of a new network design paradigm able to study the behavior of populations of models and a deep discussion about pro and cons of each analyzed method paving the path towards new research lines.

Learning Neural Textual Representations for Citation Recommendation

Thanh Binh Kieu, Inigo Jauregi Unanue, Son Bao Pham, Xuan-Hieu Phan, M. Piccardi

Responsive image

Auto-TLDR; Sentence-BERT cascaded with Siamese and triplet networks for citation recommendation

Slides Poster Similar

With the rapid growth of the scientific literature, manually selecting appropriate citations for a paper is becoming increasingly challenging and time-consuming. While several approaches for automated citation recommendation have been proposed in the recent years, effective document representations for citation recommendation are still elusive to a large extent. For this reason, in this paper we propose a novel approach to citation recommendation which leverages a deep sequential representation of the documents (Sentence-BERT) cascaded with Siamese and triplet networks in a submodular scoring function. To the best of our knowledge, this is the first approach to combine deep representations and submodular selection for a task of citation recommendation. Experiments have been carried out using a popular benchmark dataset -- the ACL Anthology Network corpus -- and evaluated against baselines and a state-of-the-art approach using metrics such as the MRR and F1@k score. The results show that the proposed approach has been able to outperform all the compared approaches in every measured metric.