ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

Hong Liu, Yawei Wang, Bing Yang

Auto-TLDR; Mutual Iterative Attention for Audio Visual Speech Recognition

Abstract

Asynchrony between the different modalities is one of the major problems in audio-visual speech recognition (AVSR) research. However, most AVSR systems merely rely on up-sampling of video or down-sampling of audio to align audio and visual features, assuming that the feature sequences are aligned frame by frame. These pre-processing steps oversimplify the asynchrony between the acoustic signal and lip motion, which reduces flexibility and impairs system performance. Although some systems model the asynchrony between the modalities, they sometimes fail to align speech and video precisely under some, or even all, noise conditions. In this paper, we propose a mutual feature alignment method for AVSR that makes full use of cross-modality information to address the asynchrony issue by introducing a Mutual Iterative Attention (MIA) mechanism. Our method automatically learns an alignment by performing mutual attention iteratively between the audio and visual features, relying on a modified Transformer encoder structure. Experimental results show that the proposed method obtains absolute improvements of up to 20.42% over the audio modality alone, depending on the signal-to-noise ratio (SNR) level. It also achieves better recognition performance than the traditional feature concatenation method under both clean and noisy conditions. We expect that the proposed mutual feature alignment method can be easily generalized to other multimodal tasks with semantically correlated information.
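The abstract gives no architectural details, so the following is only a rough illustration of what iterating attention in both directions between two feature sequences could look like. It is not the authors' MIA implementation. The function names, the residual update, the absence of learned projections, and the toy feature sizes are all assumptions made for the sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context):
    """Scaled dot-product attention of `query` frames over `context` frames.

    query:   (T_q, d) array
    context: (T_c, d) array
    Returns the attended features (T_q, d) and the weights (T_q, T_c).
    No learned projections are used here (a simplifying assumption).
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights


def mutual_iterative_attention(audio, video, n_iters=2):
    """Hypothetical sketch: each modality attends to the other, repeatedly.

    Both directions are computed from the same pre-update features,
    then added back through a residual connection (an assumption).
    """
    for _ in range(n_iters):
        audio_ctx, _ = cross_attention(audio, video)
        video_ctx, _ = cross_attention(video, audio)
        audio = audio + audio_ctx
        video = video + video_ctx
    return audio, video


# Toy input: 8 audio frames and 3 video frames, both with 4-dim features.
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((8, 4))
video_feats = rng.standard_normal((3, 4))
audio_out, video_out = mutual_iterative_attention(audio_feats, video_feats)
```

In this sketch the two sequences keep their own lengths, so neither stream has to be up-sampled or down-sampled before the attention step; the attention weights play the role of a soft alignment between frames.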

Similar papers

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

Hong Liu, Wenhao Li, Bing Yang

Auto-TLDR; Hybrid Fusion Based AVSR with Residual Networks and Bidirectional Gated Recurrent Unit for Robust Speech Recognition in Noise Conditions

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Context Matters: Self-Attention for Sign Language Recognition

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Person Recognition with HGR Maximal Correlation on Multimodal Data

DenseRecognition of Spoken Languages

Talking Face Generation Via Learning Semantic and Temporal Synchronous Landmarks

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Spatial Bias in Vision-Based Voice Activity Detection

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Audio-Video Detection of the Active Speaker in Meetings

Hybrid Network for End-To-End Text-Independent Speaker Identification

A Transformer-Based Radical Analysis Network for Chinese Character Recognition

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

ESResNet: Environmental Sound Classification Based on Visual Domain Models

3D Audio-Visual Speaker Tracking with a Novel Particle Filter

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Multi-Stage Attention Based Visual Question Answering

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

VTT: Long-Term Visual Tracking with Transformers

Anticipating Activity from Multimodal Signals

Global Context-Based Network with Transformer for Image2latex

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

Let's Play Music: Audio-Driven Performance Video Generation

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Video Face Manipulation Detection through Ensemble of CNNs

ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition

Context Visual Information-Based Deliberation Network for Video Captioning

Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-Tuning

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition

A Multi-Head Self-Relation Network for Scene Text Recognition

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Ballroom Dance Recognition from Audio Recordings

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Which are the factors affecting the performance of audio surveillance systems?

Attention-Driven Body Pose Encoding for Human Activity Recognition