ICPR2020 Papers

Visual Localization for Autonomous Driving: Mapping the Accurate Location in the City Maze

Dongfang Liu, Yiming Cui, Xiaolei Guo, Wei Ding, Baijian Yang, Yingjie Chen
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.2

Auto-TLDR; Feature Voting for Robust Visual Localization in Urban Settings

Accurate localization is a foundational capability required for autonomous vehicles to accomplish other tasks such as navigation or path planning. It is common practice for vehicles to use GPS to acquire location information. However, GPS can face severe challenges when vehicles run within the inner city, where different kinds of structures may shadow the GPS signal and lead to inaccurate location results. To address the localization challenges of urban settings, we propose a novel feature voting technique for visual localization. Different from the conventional front-view-based method, our approach employs views from three directions (front, left, and right) and thus significantly improves the robustness of location prediction. In our work, we craft the proposed feature voting method into three state-of-the-art visual localization networks and modify their architectures properly so that they can be applied to vehicular operation. Extensive field test results indicate that our approach can predict location robustly even in challenging inner-city settings. Our research sheds light on using visual localization to help autonomous vehicles find accurate location information in a city maze, within a desirable time constraint.
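The three-view voting idea can be sketched as a confidence-weighted fusion of per-view location estimates. The function below is a minimal illustration, not the paper's actual network: the per-view predictions, the confidences, and the weighted-average fusion rule are all assumptions.

```python
import numpy as np

def vote_location(view_predictions, view_confidences):
    """Fuse per-view (x, y) location estimates (front, left, right)
    by confidence-weighted voting. All inputs are hypothetical."""
    preds = np.asarray(view_predictions, dtype=float)
    conf = np.asarray(view_confidences, dtype=float)
    weights = conf / conf.sum()          # normalise confidences to weights
    return weights @ preds               # weighted average of the estimates

# Front, left, and right views each propose a location estimate.
fused = vote_location([[10.0, 5.0], [10.4, 5.2], [9.8, 4.9]],
                      [0.5, 0.3, 0.2])
```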

Automatical Enhancement and Denoising of Extremely Low-Light Images

Yuda Song, Yunfang Zhu, Xin Du
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 14:00 in session PS T3.8

Auto-TLDR; INSNet: Illumination and Noise Separation Network for Low-Light Image Restoring

Deep convolutional neural network (DCNN) based methodologies have recently achieved remarkable performance on various low-level vision tasks. Restoring images captured at night is one of the trickiest low-level vision tasks due to their high noise levels and low intensity. We propose a DCNN-based methodology, the Illumination and Noise Separation Network (INSNet), which performs both denoising and enhancement on these extremely low-light images. INSNet fully utilizes globally aware and locally aware features using a modified network structure and image sampling scheme. Compared to well-designed complex neural networks, our proposed methodology only needs to add a bypass network to the existing network, yet it boosts the quality of recovered images dramatically while increasing the computational cost by less than 0.1%. Even without any manual settings, INSNet can stably restore extremely low-light images to the desired high-quality images.

A GAN-Based Blind Inpainting Method for Masonry Wall Images

Yahya Ibrahim, Balázs Nagy, Csaba Benedek
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 16:00 in session PS T5.6

Auto-TLDR; An End-to-End Blind Inpainting Algorithm for Masonry Wall Images

In this paper we introduce a novel end-to-end blind inpainting algorithm for masonry wall images, performing the automatic detection and virtual completion of occluded or damaged wall regions. For this purpose, we propose a three-stage deep neural network that comprises a U-Net-based sub-network for wall segmentation into brick, mortar and occluded regions, followed by a two-stage adversarial inpainting model. The first adversarial network predicts the schematic mortar-brick pattern of the occluded areas based on the observed wall structure, providing in itself valuable structural information for archeological and architectural applications. Finally, the second adversarial network predicts the RGB pixel values, yielding a realistic visual experience for the observer. While the three stages implement a sequential pipeline, they interact through dependencies of their loss functions, allowing hidden feature dependencies between the different network components to be taken into account. For training and testing the network a new dataset has been created, and an extensive qualitative and quantitative evaluation versus the state-of-the-art is given.

Cross-People Mobile-Phone Based Airwriting Character Recognition

Yunzhe Li, Hui Zheng, He Zhu, Haojun Ai, Xiaowei Dong
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.14

Auto-TLDR; Cross-People Airwriting Recognition from Motion Sensor Signals via Deep Neural Network

Airwriting using mobile phones has many applications in human-computer interaction. However, recognizing airwriting characters requires a large amount of training data from each user, which greatly hinders practical application. A model learnt from a specific person often cannot yield satisfactory results when used on another person. The data gap between people is mainly caused by the following factors: personal writing styles, mobile phone sensors, and ways of holding mobile phones. To address this cross-people problem, we propose a deep neural network (DNN) that combines a convolutional neural network (CNN) and bidirectional long short-term memory (BLSTM). In each layer of the network, we also add an AdaBN layer, which increases the generalization ability of the DNN. Different from the original AdaBN method, we explore its feasibility for semi-supervised learning, incorporate it into our design, and conduct comprehensive experiments. The evaluation results show that our system can achieve an accuracy of 99% for recognition and an improvement of 10% on average for transfer learning across factors such as people, devices and postures. To the best of our knowledge, our work is the first to implement cross-people airwriting recognition via motion sensor signals, which is a fundamental step towards ubiquitous sensing.
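The AdaBN idea the abstract builds on can be sketched in a few lines: keep the affine batch-norm parameters learned on the source person, but recompute the normalization statistics on the target person's data. This is a minimal NumPy illustration of that recipe, not the authors' network; the feature shapes and parameter values are assumptions.

```python
import numpy as np

def adabn_transfer(features_target, gamma, beta, eps=1e-5):
    """AdaBN-style adaptation sketch: reuse the learned affine parameters
    (gamma, beta) but replace the batch-norm statistics with the mean and
    variance computed on the *target* person's data."""
    mu = features_target.mean(axis=0)
    var = features_target.var(axis=0)
    return gamma * (features_target - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
# Target-person features with a shifted distribution (illustrative).
x_target = rng.normal(loc=3.0, scale=2.0, size=(256, 4))
y = adabn_transfer(x_target, gamma=np.ones(4), beta=np.zeros(4))
```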

Small Object Detection by Generative and Discriminative Learning

Yi Gu, Jie Li, Chentao Wu, Weijia Jia, Jianping Chen
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 12:00 in session PS T3.4

Auto-TLDR; Generative and Discriminative Learning for Small Object Detection

With the development of deep convolutional neural networks (CNNs), object detection accuracy has been greatly improved. But the performance of small object detection is still far from satisfactory, mainly because small objects are so tiny that the information contained in the feature map is limited. Existing methods focus on improving classification accuracy but still suffer from the limitation of bounding box prediction. To solve this issue, we propose a detection framework based on generative and discriminative learning. First, a reconstruction generator network is designed to reconstruct the mapping from low frequency to high frequency for anchor box prediction. Then, a detector module extracts the regions of interest (RoIs) from the generated results and applies an RoI-Head to predict the object category and refine the bounding box. To guide the reconstructed image towards its corresponding original, a discriminator module is adopted to distinguish the generated result from the original image. Extensive evaluations on the challenging MS-COCO dataset demonstrate that our model outperforms most state-of-the-art models in detecting small objects; in particular, the reconstruction module improves the average precision for small objects (APs) by 7.7%.

Local Gradient Difference Based Mass Features for Classification of 2D-3D Natural Scene Text Images

Lokesh Nandanwar, Shivakumara Palaiahnakote, Raghavendra Ramachandra, Tong Lu, Umapada Pal, Daniel Lopresti, Nor Badrul Anuar
Track 4: Document and Media Analysis
Thu 14 Jan 2021 at 12:00 in session PS T4.3

Auto-TLDR; Classification of 2D and 3D Natural Scene Images Using COLD

Methods developed for normal 2D text detection do not work well for text that is rendered using decorative 3D effects. This paper proposes a new method for classifying 2D and 3D natural scene images so that an appropriate method can be chosen or modified according to the complexity of the individual classes. The proposed method explores local gradient differences for obtaining candidate pixels, which represent a stroke. To study the spatial distribution of candidate pixels, we propose a measure we call COLD, which is denser for pixels toward the center of strokes and scattered for non-stroke pixels. This observation leads us to introduce mass features for extracting the regular spatial pattern of COLD, which indicates a 2D text image. The extracted features are fed to a neural network (NN) for classification. The proposed method is tested on both a new dataset introduced in this work and a standard dataset assembled from different natural scene datasets, and compared to existing methods to show its effectiveness. The approach improves text detection performance significantly after classification.

Image Defocus Analysis for Finger Detection on a Virtual Keyboard

Miwa Michio, Honda Kenji, Sato Makoto
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Analysis of defocus information when a finger touches a virtual keyboard, using DCT (Discrete Cosine Transform) coefficients without detecting the finger's 3D position

This paper describes the analysis of defocus information when a finger touches a virtual keyboard, using DCT (Discrete Cosine Transform) coefficients without detecting the 3D position (in particular, the exact depth) of the finger. We use two cameras and a half mirror to capture images of the finger along the same optical axis. The focal lengths of the two cameras are slightly different. We can tell that the finger is touching the virtual keyboard when it lies midway between the two focal lengths, because the virtual keyboard is positioned at the midpoint of the two focal positions, where the defocus information from the two cameras is the same. Previous virtual keyboards were realized by first detecting a feature point of the finger and then checking whether the finger's position and depth match those of the keyboard. A problem with such approaches is that a 3D point of the finger (a feature point, edge, or small region) must be detected, which is very time-consuming. We overcome this problem by comparing the defocus information of the finger in the key area of the virtual keyboard across the two images captured by the two cameras. In this paper we describe the optical system of the virtual keyboard and the depth resolution for the finger. Experimental results make it clear that the middle range of the DCT coefficients is effective for detecting the finger. In our system, a finger touch is detected by comparing the DCT coefficients of the two images. The minimum distance at which a finger can be detected depends on the diameter and focal length of the lens and the resolution of the image sensor. We formulate this theoretical minimum distance and verify it by experiment. Besides fingers, our system can also be generalized to detect whether other objects touch the keyboard.
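The touch test described above can be sketched as follows: take the key-area patch from each camera, measure the energy of mid-range DCT coefficients as a sharpness cue, and declare a touch when the two energies are close. This is an illustrative reconstruction, not the authors' code; the band limits and tolerance are assumed values.

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block via the transform matrix."""
    n = block.shape[0]
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c @ block @ c.T

def midband_energy(block, lo=2, hi=6):
    """Energy of mid-range DCT coefficients, used as a sharpness cue."""
    d = dct2(block)
    return float(np.sum(d[lo:hi, lo:hi] ** 2))

def is_touching(patch_cam1, patch_cam2, tol=0.2):
    """Touch is declared when the two cameras' mid-band energies are close,
    i.e. the finger sits near the midpoint of the two focal distances."""
    e1, e2 = midband_energy(patch_cam1), midband_energy(patch_cam2)
    return abs(e1 - e2) / max(e1, e2, 1e-12) < tol

rng = np.random.default_rng(0)
patch = rng.normal(size=(8, 8))
touch = is_touching(patch, patch)   # identical defocus in both views
```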

Auto Encoding Explanatory Examples with Stochastic Paths

Cesar Ali Ojeda Marin, Ramses J. Sanchez, Kostadin Cvejoski, Bogdan Georgiev
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 14:00 in session PS T1.11

Auto-TLDR; Semantic Stochastic Path: Explaining a Classifier's Decision Making Process using latent codes

In this paper we ask which main factors determine a classifier's decision-making process, and we uncover such factors by studying latent codes produced by auto-encoding frameworks. To deliver an explanation of a classifier's behaviour, we propose a method that provides a series of examples highlighting semantic differences between the classifier's decisions. These examples are generated through interpolations in latent space. We introduce and formalize the notion of a semantic stochastic path as a suitable stochastic process defined in feature (data) space via latent code interpolations. We then introduce the concept of semantic Lagrangians as a way to incorporate the desired classifier behaviour, and we find that the solution of the associated variational problem allows for highlighting differences in the classifier's decisions. Importantly, within our framework the classifier is used as a black box, and only its evaluation is required.

A Spectral Clustering on Grassmann Manifold Via Double Low Rank Constraint

Xinglin Piao, Yongli Hu, Junbin Gao, Yanfeng Sun, Xin Yang, Baocai Yin
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session OS T1.4

Auto-TLDR; Double Low Rank Representation for High-Dimensional Data Clustering on Grassmann Manifold

High-dimensional data clustering is a fundamental topic in machine learning and data mining. In recent years, researchers have proposed a series of effective methods based on Low Rank Representation (LRR), which can effectively explore the low-dimensional subspace structure embedded in the original data. Traditional LRR methods usually treat the original data as samples in Euclidean space and generally adopt a linear metric to measure the distance between two data points. However, high-dimensional data (such as video clips or image sets) are better modeled as non-linear manifold data, for example on the Grassmann manifold, so the traditional linear Euclidean metric is no longer suitable for such data. In addition, traditional LRR clustering methods typically adopt the nuclear norm as the low rank constraint, which can lead to suboptimal solutions and decrease clustering accuracy. In this paper, we propose a new low rank method on the Grassmann manifold for high-dimensional data clustering. In the proposed method, a double low rank representation approach combines the nuclear norm and a bilinear representation to better construct the representation matrix. Experimental results on several public datasets show that the proposed method outperforms state-of-the-art clustering methods.
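For readers unfamiliar with the nuclear-norm constraint the abstract discusses, the standard building block of LRR-style solvers is singular value thresholding (SVT), the proximal operator of the nuclear norm. The sketch below shows SVT in isolation on a hand-made matrix; it is not the paper's double low rank method on the Grassmann manifold.

```python
import numpy as np

def svt(m, tau):
    """Singular value thresholding: shrink each singular value by tau and
    clip at zero -- the proximal operator of the nuclear norm."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)
    return u @ np.diag(s_thr) @ vt

# A matrix with singular values 5.0, 1.0, 0.1; thresholding at 0.5
# keeps 4.5 and 0.5 and kills the smallest one, lowering the rank.
m = np.diag([5.0, 1.0, 0.1])
shrunk = svt(m, tau=0.5)
```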

Unsupervised Domain Adaptation for Person Re-Identification through Source-Guided Pseudo-Labeling

Fabian Dubourvieux, Romaric Audigier, Angélique Loesch, Ainouz-Zemouche Samia, Stéphane Canu
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; Pseudo-labeling for Unsupervised Domain Adaptation for Person Re-Identification

Person Re-Identification (re-ID) aims at retrieving images of the same person taken by different cameras. A challenge for re-ID is the performance preservation when a model is used on data of interest (target data) which belong to a different domain from the training data domain (source data). Unsupervised Domain Adaptation (UDA) is an interesting research direction for this challenge as it avoids a costly annotation of the target data. Pseudo-labeling methods achieve the best results in UDA-based re-ID. They incrementally learn with identity pseudo-labels which are initialized by clustering features in the source re-ID encoder space. Surprisingly, labeled source data are discarded after this initialization step. However, we believe that pseudo-labeling could further leverage the labeled source data in order to improve the post-initialization training steps. In order to improve robustness against erroneous pseudo-labels, we advocate the exploitation of both labeled source data and pseudo-labeled target data during all training iterations. To support our guideline, we introduce a framework which relies on a two-branch architecture optimizing classification in source and target domains, respectively, in order to allow adaptability to the target domain while ensuring robustness to noisy pseudo-labels. Indeed, shared low and mid-level parameters benefit from the source classification signal while high-level parameters of the target branch learn domain-specific features. Our method is simple enough to be easily combined with existing pseudo-labeling UDA approaches. We show experimentally that it is efficient and improves performance when the base method has no mechanism to deal with pseudo-label noise, and that it maintains performance when combined with a base method that already manages pseudo-label noise.
Our approach reaches state-of-the-art performance when evaluated on commonly used datasets, Market-1501 and DukeMTMC-reID, and outperforms the state of the art when targeting the bigger and more challenging dataset MSMT.

Selecting Useful Knowledge from Previous Tasks for Future Learning in a Single Network

Feifei Shi, Peng Wang, Zhongchao Shi, Yong Rui
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 12:00 in session PS T1.10

Auto-TLDR; Continual Learning with a Gradient-Based Threshold

Continual learning aims to learn new tasks incrementally while avoiding catastrophic forgetting. Recent work has shown that packing multiple tasks into a single network incrementally, by iteratively pruning and re-training the network, is a promising approach. We build upon this idea and propose an improved version of PackNet: specifically, a novel gradient-based threshold method to selectively reuse knowledge from previous tasks when learning new ones. Our experiments on a variety of classification tasks and different network architectures demonstrate that our method obtains competitive results compared to PackNet.
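A gradient-based selection step of the kind described could look like the following sketch: rank the previous tasks' weights by the magnitude of their gradient on the new task's loss and keep the top fraction for reuse. The function name, the quantile-based threshold, and the keep ratio are assumptions, not the paper's exact rule.

```python
import numpy as np

def select_reusable(weight_grads, keep_ratio=0.5):
    """Gradient-based selection sketch: previous-task weights whose
    gradient magnitude on the new task is largest are assumed most
    relevant, and are selected for reuse (boolean mask)."""
    flat = np.abs(weight_grads).ravel()
    thresh = np.quantile(flat, 1.0 - keep_ratio)
    return np.abs(weight_grads) >= thresh

# Toy gradient magnitudes for a 2x2 weight matrix.
grads = np.array([[0.9, 0.1], [0.05, 0.7]])
mask = select_reusable(grads, keep_ratio=0.5)
```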

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

Yuntao Ma, Yirui Wu, Tong Lu
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; Question-Guided Relational Reasoning for Visual Question Answering

The main challenges of visual question answering (VQA) lie in modeling an alignment between image and question to find informative image regions related to the question, and in reasoning about relations among visual objects according to the question. In this paper, we propose question-guided relational reasoning at multiple scales for visual question answering, in which each region is enhanced by regional attention. Specifically, we present regional attention, which consists of a soft attention and a hard attention, to pick out informative regions of the image according to informativeness evaluations produced by question-guided soft attention. Combinations of different informative regions are then concatenated with the question embedding at different scales to capture relational information. Relational reasoning can extract question-based relational information between regions, and the multi-scale mechanism gives it the ability to analyze relationships with diversity and sensitivity to numbers by modeling relationships at several scales. We conduct experiments showing that our proposed architecture is effective and achieves a new state of the art on VQA v2.
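The soft/hard regional attention combination can be illustrated schematically: a softmax produces soft weights over regions, and a hard top-k mask keeps only the most informative ones. This toy sketch assumes scalar region scores and a simple top-k rule; the paper's actual attention is question-guided and learned.

```python
import numpy as np

def regional_attention(scores, k=2):
    """Soft attention weights over regions plus a hard top-k selection,
    loosely mirroring a soft/hard regional attention scheme."""
    e = np.exp(scores - scores.max())     # numerically stable softmax
    soft = e / e.sum()
    hard = np.zeros_like(soft)
    hard[np.argsort(soft)[-k:]] = 1.0     # keep the k most informative regions
    return soft, soft * hard

# Four region scores; regions 0 and 2 are the most informative.
soft, attended = regional_attention(np.array([2.0, 0.5, 1.0, -1.0]), k=2)
```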

Using Machine Learning to Refer Patients with Chronic Kidney Disease to Secondary Care

Lee Au-Yeung, Xianghua Xie, Timothy Marcus Scale, James Anthony Chess
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 16:00 in session PS T1.16

Auto-TLDR; A Machine Learning Approach for Chronic Kidney Disease Prediction using Blood Test Data

There has been growing interest recently in using machine learning techniques as an aid in clinical medicine. Machine learning offers a range of classification algorithms which can be applied to medical data to aid in making clinical predictions, and recent studies have demonstrated the high predictive accuracy of various classification algorithms applied to clinical data. Several studies have already been conducted on diagnosing or predicting chronic kidney disease at various stages using different sets of variables. In this study we investigate the use of machine learning techniques with blood test data. Such a system could aid renal teams in making recommendations to primary care general practitioners to refer patients to secondary care, where patients may benefit from earlier specialist assessment and medical intervention. We achieve an overall accuracy of 88.48% using logistic regression, 87.12% using an ANN and 85.29% using an SVM. ANNs performed with the highest sensitivity, at 89.74%, compared to 86.67% for logistic regression and 85.51% for SVM.
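As a self-contained stand-in for the logistic regression classifier mentioned in the results, the sketch below trains plain logistic regression by gradient descent on toy data; the feature values, learning rate, and step count are illustrative, not the study's settings.

```python
import numpy as np

def train_logreg(x, y, lr=0.5, steps=2000):
    """Plain logistic regression by gradient descent (illustrative
    stand-in; real features would be blood test values)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted probabilities
        g = p - y                                 # gradient of log-loss
        w -= lr * x.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Toy separable data standing in for no-referral (0) / referral (1) cases.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logreg(x, y)
pred = (1.0 / (1.0 + np.exp(-(x @ w + b))) > 0.5).astype(float)
```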

Computational Data Analysis for First Quantization Estimation on JPEG Double Compressed Images

Sebastiano Battiato, Oliver Giudice, Francesco Guarnera, Giovanni Puglisi
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 15:00 in session PS T5.7

Auto-TLDR; Exploiting Discrete Cosine Transform Coefficients for Multimedia Forensics

The work of multimedia forensics experts consists in answering questions about the integrity of a specific media content and about its provenance. Exploiting traces from JPEG double compressed images is often one of the main investigative paths for these purposes. It is thus fundamental to have tools and algorithms able to reliably estimate the first quantization matrix, in order to further proceed with camera model identification and related tasks. In this paper, a technique based on extensive simulation is proposed, with the aim of inferring the first quantization for a certain number of Discrete Cosine Transform (DCT) coefficients by exploiting local image statistics, without using any a-priori knowledge. The method also provides a reliable confidence value for the estimation, which is of great importance for forensic purposes. Experimental results w.r.t. the state-of-the-art demonstrate the effectiveness of the proposed technique in terms of both precision and overall reliability.
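The simulation-based idea can be sketched for a single DCT frequency: double-quantize a pool of coefficients with each candidate first step, and keep the candidate whose histogram best matches the observed one. Everything below (the Laplacian coefficient model, the candidate set, the L1 histogram distance) is an illustrative assumption, not the paper's procedure.

```python
import numpy as np

def double_quantize(coeffs, q1, q2):
    """Simulate JPEG double compression of one DCT frequency:
    quantize with step q1, dequantize, then quantize with step q2."""
    return np.round(np.round(coeffs / q1) * q1 / q2)

def _hist(v):
    """Fixed-length histogram of absolute quantized values (clipped)."""
    return np.bincount(np.minimum(np.abs(v).astype(int), 63), minlength=64)

def estimate_q1(observed, q2, candidates, sim_coeffs):
    """Pick the candidate first quantization step whose simulated
    double-quantized histogram best matches the observed one."""
    obs = _hist(observed)
    obs = obs / obs.sum()
    best, best_err = None, np.inf
    for q1 in candidates:
        sim = _hist(double_quantize(sim_coeffs, q1, q2))
        err = np.abs(obs - sim / sim.sum()).sum()
        if err < best_err:
            best, best_err = q1, err
    return best

rng = np.random.default_rng(1)
raw = rng.laplace(scale=8.0, size=20000)        # Laplacian-like coefficients
observed = double_quantize(raw, q1=5, q2=3)     # ground truth q1 = 5
est = estimate_q1(observed, q2=3, candidates=[2, 3, 4, 5, 6, 7], sim_coeffs=raw)
```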

Adaptive Context-Aware Discriminative Correlation Filters for Robust Visual Object Tracking

Tianyang Xu, Zhenhua Feng, Xiaojun Wu, Josef Kittler
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 15:00 in session PS T3.1

Auto-TLDR; ACA-DCF: Adaptive Context-Aware Discriminative Correlation Filter with complementary attention mechanisms

In recent years, Discriminative Correlation Filters (DCFs) have gained popularity due to their superior performance in visual object tracking. However, existing DCF trackers usually learn filters using fixed attention mechanisms that focus on the centre of an image and suppress filter amplitudes in the surroundings. In this paper, we propose an Adaptive Context-Aware Discriminative Correlation Filter (ACA-DCF) that improves the existing DCF formulation with complementary attention mechanisms. Our ACA-DCF integrates foreground attention and background attention for complementary context-aware filter learning. More importantly, we ameliorate the design with an adaptive weighting strategy that takes complex appearance variations into account. The experimental results obtained on several well-known benchmarks demonstrate the effectiveness and superiority of the proposed method over state-of-the-art approaches.
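Context-aware correlation filter learning has a well-known closed form in the Fourier domain, where context patches enlarge the denominator and are thereby suppressed in the response. The single-channel sketch below follows that general formulation; the regularization weights and patch contents are illustrative, and the paper's adaptive weighting is not modeled.

```python
import numpy as np

def train_cacf(target, contexts, y, lam1=1e-2, lam2=0.5):
    """Context-aware DCF sketch (single channel): learn a filter that
    responds to the target patch while context patches are suppressed,
    solved per frequency in the Fourier domain."""
    T = np.fft.fft2(target)
    Y = np.fft.fft2(y)
    denom = np.conj(T) * T + lam1
    for c in contexts:
        C = np.fft.fft2(c)
        denom = denom + lam2 * np.conj(C) * C   # penalise context responses
    return np.conj(T) * Y / denom

def respond(filt_hat, patch):
    """Correlation response of a learned filter on a search patch."""
    return np.real(np.fft.ifft2(filt_hat * np.fft.fft2(patch)))

rng = np.random.default_rng(2)
target = rng.normal(size=(16, 16))
context = rng.normal(size=(16, 16))
y = np.zeros((16, 16))
y[8, 8] = 1.0                      # ideal response: a peak at the centre
filt = train_cacf(target, [context], y)
resp = respond(filt, target)
```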

Stochastic 3D Rock Reconstruction Using GANs

Sergio Damas, Andrea Valsecchi
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session PS T1.2

Auto-TLDR; Generative Adversarial Neural Networks for 3D-to-3D Reconstruction of Porous Media

The study of the physical properties of porous media is crucial for petrophysics laboratories. Even though micro computed tomography (CT) could be useful, the appropriate evaluation of flow properties would involve the acquisition of a large number of representative images. That is often unfeasible. Stochastic reconstruction methods aim to generate novel, realistic rock images from a small sample, thus avoiding a large acquisition process. In this contribution, we improve a previous method for 3D-to-3D reconstruction of the structure of porous media by applying generative adversarial neural networks (GANs). We compare several measures of pore morphology between simulated and acquired images. Experiments include Beadpack, Berea sandstone, and Ketton limestone images. Results show that our GANs-based method can reconstruct three-dimensional images of porous media at different scales that are representative of the morphology of the original images. Furthermore, the generation of multiple images is much faster than classical image reconstruction methods.

Enhancing Depth Quality of Stereo Vision Using Deep Learning-Based Prior Information of the Driving Environment

Weifu Li, Vijay John, Seiichi Mita
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.3

Auto-TLDR; A Novel Post-processing Mathematical Framework for Stereo Vision

Generation of high density depth values of the driving environment is indispensable for autonomous driving. Stereo vision is one of the practical and effective methods to generate these depth values. However, the accuracy of stereo vision is limited by texture-less regions, such as sky and road areas, and by repeated patterns in the image. To overcome these problems, we propose to enhance the stereo generated depth by incorporating prior information of the driving environment. Prior information, generated by a deep learning-based U-Net model, is utilized in a novel post-processing mathematical framework to refine the stereo generated depth. The proposed mathematical framework is formulated as an optimization problem, which refines the errors due to texture-less regions and repeated patterns. Owing to its mathematical formulation, the post-processing framework is not a black box, is explainable, and can be readily utilized for depth maps generated by any stereo vision algorithm. The proposed framework is qualitatively validated on an acquired dataset and the KITTI dataset. The results obtained show that the proposed framework improves the stereo depth generation accuracy.

Explore and Explain: Self-Supervised Navigation and Recounting

Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 14:00 in session OS T3.2

Auto-TLDR; Exploring a Photorealistic Environment for Explanation and Navigation

Embodied AI has been recently gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes. Our model integrates a novel self-supervised exploration module with penalty, and a fully-attentive captioning model for explanation. Also, we investigate different policies for selecting proper moments for explanation, driven by information coming from both the environment and the navigation. Experiments are conducted on photorealistic environments from the Matterport3D dataset and investigate the navigation and explanation capabilities of the agent as well as the role of their interactions.

Near-Infrared Depth-Independent Image Dehazing using Haar Wavelets

Sumit Laha, Ankit Sharma, Shengnan Hu, Hassan Foroosh
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 16:00 in session PS T5.6

Auto-TLDR; A fusion algorithm for haze removal using Haar wavelets

We propose a fusion algorithm for haze removal that combines color information from an RGB image and edge information extracted from its corresponding NIR image using Haar wavelets. The proposed algorithm is based on the key observation that NIR edge features are more prominent in the hazy regions of the image than the RGB edge features in those same regions. To combine the color and edge information, we introduce a haze-weight map which proportionately distributes the color and edge information during the fusion process. Because NIR images are intrinsically nearly haze-free, our work makes no assumptions like existing works that rely on a scattering model, essentially yielding a depth-independent method. This helps minimize artifacts and gives a more realistic look to the restored haze-free image. Extensive experiments show that the proposed algorithm is both qualitatively and quantitatively better on several key metrics when compared to existing state-of-the-art methods.
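A one-level Haar decomposition and the haze-weighted blending of detail bands can be sketched as follows; the weight value and the per-band blending rule are assumptions for illustration, not the paper's exact fusion.

```python
import numpy as np

def haar_level(img):
    """One level of the 2-D Haar transform: average (LL) and the three
    detail bands (LH, HL, HH), computed by pairwise sums/differences."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0    # row averages
    d = (img[0::2, :] - img[1::2, :]) / 2.0    # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def fuse_details(rgb_band, nir_band, haze_weight):
    """Blend detail bands: hazier pixels (weight -> 1) take NIR edges."""
    return haze_weight * nir_band + (1.0 - haze_weight) * rgb_band

rng = np.random.default_rng(3)
rgb = rng.normal(size=(8, 8))   # grayscale stand-in for the RGB luminance
nir = rng.normal(size=(8, 8))
w = 0.7                         # assumed haze weight: hazy, so favour NIR
fused_lh = fuse_details(haar_level(rgb)[1], haar_level(nir)[1], w)
```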

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

Michelangelo Acconcjaioco, Stavros Ntalampiras
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 16:30 in session PS T5.4

Auto-TLDR; One-shot Learning in the Bioacoustics Domain using Siamese Neural Networks

This work introduces the one-shot learning paradigm to the computational bioacoustics domain. Even though most of the related literature assumes the availability of data characterizing the entire class dictionary of the problem at hand, this is rarely true, as a habitat's species composition is only known up to a certain extent. The problem thus needs to be addressed by methodologies able to cope with non-stationarity. To this end, we propose a framework able to detect changes in the class dictionary and incorporate new classes on the fly. We design a one-shot learning architecture composed of a Siamese Neural Network operating in the logMel spectrogram space. We extensively examine the proposed approach on two datasets of various bird species using suitable figures of merit. Interestingly, such a learning scheme exhibits state-of-the-art performance, while taking extreme non-stationarity cases into account.
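The detect-and-incorporate behaviour can be illustrated with a nearest-prototype sketch: query embeddings (here made up by hand) are matched against one stored example per known species, and a distance above a novelty threshold signals a change in the class dictionary. The Siamese embedding network itself is not modeled here.

```python
import numpy as np

def classify_or_novel(embedding, prototypes, labels, novelty_thresh=1.0):
    """One-shot matching sketch: compare a query embedding against one
    stored prototype per known class; if every distance exceeds the
    novelty threshold, report a change in the class dictionary."""
    dists = np.linalg.norm(prototypes - embedding, axis=1)
    i = int(np.argmin(dists))
    if dists[i] > novelty_thresh:
        return None                      # unseen species: add on the fly
    return labels[i]

# Hand-made 2-D embeddings standing in for logMel-space features.
protos = np.array([[0.0, 0.0], [3.0, 3.0]])
names = ["great_tit", "blackbird"]
known = classify_or_novel(np.array([0.1, -0.1]), protos, names)
novel = classify_or_novel(np.array([10.0, 10.0]), protos, names)
```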