ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subjected to the IEEE Copyright Policy. Any other use is strongly forbidden.

Detection of Calls from Smart Speaker Devices

Vinay Maddali, David Looney, Kailash Patil

Auto-TLDR; Distinguishing Between Smart Speaker and Cell Devices Using Only the Audio Using a Feature Set

Abstract Slides Poster

The ubiquity of smart speakers is increasing, with a growing number of households utilising these devices to make calls over the telephony network. As the technology is typically configured to retain the cellular phone number of the user, it presents challenges in applications where knowledge of the true call origin is required. There are a wide range of makes and models for these devices, as is the case with cell phones, and it is challenging to detect the general category as a smart speaker or cell, independent of the designated phone number. In this paper, we present an approach to differentiate between calls originating from smart speakers and ones from cellular devices using only the audio. We present a feature set that characterises the relevant acoustic information, such as the degree of reverberation and noise, to distinguish between these categories. When evaluated on a dataset spanning multiple models for each device category, as well as different modes-of-usage and microphone-speaker distances, the method yields an Equal Error Rate (EER) of 12.6%.

Similar papers

Toward Text-Independent Cross-Lingual Speaker Recognition Using English-Mandarin-Taiwanese Dataset

Yi-Chieh Wu, Wen-Hung Liao

Auto-TLDR; Cross-lingual Speech for Biometric Recognition

Abstract Poster Similar

Over 40% of the world's population is bilingual. Existing speaker identification/verification systems, however, assume the same language type for both enrollment and recognition stages. In this work, we investigate the feasibility of employing multilingual speech for biometric application. We establish a dataset containing audio recorded in English, Mandarin and Taiwanese. Three acoustic features, namely, i-vector, d-vector and x-vector have been evaluated for both speaker verification (SV) and identification (SI) tasks. Preliminary experimental results indicate that x-vector achieves the best overall performance. Additionally, model trained with hybrid data demonstrates highest accuracy associated with the cost of data collection efforts. In SI tasks, we obtained over 91\% cross-lingual accuracy all models using 3-second audio. In SV tasks, the EER among cross-lingual test is at most 6.52\%, which is observed on the model trained by English corpus. The outcome suggests the feasibility of adopting cross-lingual speech in building text-independent speaker recognition systems.

DenseRecognition of Spoken Languages

Jaybrata Chakraborty, Bappaditya Chakraborty, Ujjwal Bhattacharya

Auto-TLDR; DenseNet: A Dense Convolutional Network Architecture for Speech Recognition in Indian Languages

Detection of Calls from Smart Speaker Devices

Similar papers

Toward Text-Independent Cross-Lingual Speaker Recognition Using English-Mandarin-Taiwanese Dataset

DenseRecognition of Spoken Languages

Hybrid Network for End-To-End Text-Independent Speaker Identification

ResMax: Detecting Voice Spoofing Attacks with Residual Network and Max Feature Map

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

Which are the factors affecting the performance of audio surveillance systems?

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Spatial Bias in Vision-Based Voice Activity Detection

Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection

Ballroom Dance Recognition from Audio Recordings

Electroencephalography Signal Processing Based on Textural Features for Monitoring the Driver’s State by a Brain-Computer Interface

Influence of Event Duration on Automatic Wheeze Classification

The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition

Audio-Video Detection of the Active Speaker in Meetings

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Handwritten Signature and Text Based User Verification Using Smartwatch

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

3D Audio-Visual Speaker Tracking with a Novel Particle Filter

Automatic Classification of Human Granulosa Cells in Assisted Reproductive Technology Using Vibrational Spectroscopy Imaging

Anticipating Activity from Multimodal Signals

Translation Resilient Opportunistic WiFi Sensing

Appliance Identification Using a Histogram Post-Processing of 2D Local Binary Patterns for Smart Grid Applications

Adversarially Training for Audio Classifiers

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures

Person Recognition with HGR Maximal Correlation on Multimodal Data

Feasibility Study of Using MyoBand for Learning Electronic Keyboard

Are Multiple Cross-Correlation Identities Better Than Just Two? Improving the Estimate of Time Differences-Of-Arrivals from Blind Audio Signals

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Automatic Annotation of Corpora for Emotion Recognition through Facial Expressions Analysis

One-Shot Representational Learning for Joint Biometric and Device Authentication

Documents Counterfeit Detection through a Deep Learning Approach

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Personalized Models in Human Activity Recognition Using Deep Learning

Face Anti-Spoofing Based on Dynamic Color Texture Analysis Using Local Directional Number Pattern

Wireless Localisation in WiFi Using Novel Deep Architectures

Cross-People Mobile-Phone Based Airwriting Character Recognition

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Graph Convolutional Neural Networks for Power Line Outage Identification

Real-Time Drone Detection and Tracking with Visible, Thermal and Acoustic Sensors

Computational Data Analysis for First Quantization Estimation on JPEG Double Compressed Images

Exploring Seismocardiogram Biometrics with Wavelet Transform

EEG-Based Cognitive State Assessment Using Deep Ensemble Model and Filter Bank Common Spatial Pattern

Radar Image Reconstruction from Raw ADC Data Using Parametric Variational Autoencoder with Domain Adaptation