ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

DenseRecognition of Spoken Languages

Jaybrata Chakraborty, Bappaditya Chakraborty, Ujjwal Bhattacharya

Auto-TLDR; DenseNet: A Dense Convolutional Network Architecture for Speech Recognition in Indian Languages


In the present study, we have, for the first time, considered a large number of Indian languages for recognition from audio signals of different sources. A dense convolutional network architecture (DenseNet) has been proposed for this classification problem. Dynamic elimination of low-energy frames from the input speech signal has been considered as a preprocessing operation. The mel-spectrogram of the pre-processed speech signal is fed to a DenseNet architecture for recognition of its language. The recognition performance of the proposed architecture has been compared with that of several state-of-the-art deep architectures, which include a traditional convolutional neural network (CNN), multiple ResNet architectures, and CNN-BLSTM and DenseNet-BLSTM hybrid architectures. Additionally, we obtained the recognition performance of a stacked BLSTM architecture fed with different sets of handcrafted features for comparison. Simulations have been performed on two different standard datasets: (i) the IITKGP-MLILSC dataset of news clips in 27 different Indian languages and (ii) the Linguistic Data Consortium (LDC) dataset of telephonic conversations in 5 different Indian languages. The recognition performance of the proposed framework has been found to be consistently and significantly better than that of all other frameworks implemented in this study.
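The abstract does not say how low-energy frames are eliminated, so the following is only a minimal sketch of one plausible form of the preprocessing step, not the authors' method. It assumes a threshold set per utterance as a fraction of the mean short-time frame energy. The function name `drop_low_energy_frames`, the frame length, the hop size, and the `ratio` parameter are all hypothetical choices made for illustration.

```python
import math

def drop_low_energy_frames(signal, frame_len=400, hop=160, ratio=0.1):
    """Keep frames whose short-time energy is at least `ratio` times the
    mean frame energy of this signal (assumed rule, not from the paper)."""
    # Slice the signal into overlapping fixed-length frames.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    if not frames:
        return []
    # Short-time energy: sum of squared samples in each frame.
    energies = [sum(x * x for x in f) for f in frames]
    # The threshold adapts to each utterance because it depends on its mean energy.
    threshold = ratio * (sum(energies) / len(energies))
    return [f for f, e in zip(frames, energies) if e >= threshold]

# Toy input: 0.5 s of silence followed by 0.5 s of a 440 Hz tone at 16 kHz.
sr = 16000
silence = [0.0] * (sr // 2)
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 2)]
kept = drop_low_energy_frames(silence + tone)
```

In this toy case the all-silent frames are discarded and the tone frames are retained. In the pipeline the abstract describes, the retained audio would then be converted to a mel-spectrogram before being passed to the classifier.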

Similar papers

Hybrid Network for End-To-End Text-Independent Speaker Identification

Wajdi Ghezaiel, Luc Brun, Olivier Lezoray

Auto-TLDR; Text-Independent Speaker Identification with Scattering Wavelet Network and Convolutional Neural Networks

Which are the factors affecting the performance of audio surveillance systems?

Ballroom Dance Recognition from Audio Recordings

The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition

ESResNet: Environmental Sound Classification Based on Visual Domain Models

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Detection of Calls from Smart Speaker Devices

Audio-Video Detection of the Active Speaker in Meetings

Toward Text-Independent Cross-Lingual Speaker Recognition Using English-Mandarin-Taiwanese Dataset

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Adversarially Training for Audio Classifiers

Spatial Bias in Vision-Based Voice Activity Detection

Improving Gravitational Wave Detection with 2D Convolutional Neural Networks

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Influence of Event Duration on Automatic Wheeze Classification

Anticipating Activity from Multimodal Signals

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

RWF-2000: An Open Large Scale Video Database for Violence Detection

On the Information of Feature Maps and Pruning of Deep Neural Networks

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Graph Convolutional Neural Networks for Power Line Outage Identification

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Person Recognition with HGR Maximal Correlation on Multimodal Data

Detecting Marine Species in Echograms Via Traditional, Hybrid, and Deep Learning Frameworks

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

S2I-Bird: Sound-To-Image Generation of Bird Species Using Generative Adversarial Networks

Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection

Automatic Annotation of Corpora for Emotion Recognition through Facial Expressions Analysis

Are Multiple Cross-Correlation Identities Better Than Just Two? Improving the Estimate of Time Differences-Of-Arrivals from Blind Audio Signals

Wireless Localisation in WiFi Using Novel Deep Architectures

Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-Tuning

ResMax: Detecting Voice Spoofing Attacks with Residual Network and Max Feature Map

Personalized Models in Human Activity Recognition Using Deep Learning

Recursive Recognition of Offline Handwritten Mathematical Expressions

Modulation Pattern Detection Using Complex Convolutions in Deep Learning