Classification of sound scenes and events in real-world scenarios with deep learning techniques
- Naranjo Alcazar, Javier
- Pedro Diego Zuccarello Doktorvater
- Máximo Cobos Serrano Co-Doktorvater
Universität der Verteidigung: Universitat de València
Fecha de defensa: 14 von Januar von 2021
- Valeriana Naranjo Ornedo Präsident/in
- Miguel Arevalillo Herráez Sekretär
- Julio Jose Carabias Orti Vocal
Art: Dissertation
Zusammenfassung
The classification of sound events is a field of machine listening that is becoming increasingly interesting due to the large number of applications that could benefit from this technology. Unlike other fields of machine listening related to music information retrieval or speech recognition, sound event classification has a number of intrinsic problems. These problems are the polyphonic nature of most environmental sound recordings, the difference in the nature of each sound, the lack of temporal structure and the addition of background noise and reverberation in the recording process. These problems are fields of study for the scientific community today. However, it should be noted that when a machine listening solution is deployed in real environments, a number of extra problems may arise. These problems are Open-Set Recognition (OSR), Few-Shot Learning (FSL) and consideration of system runtime (low-complexity). OSR is defined as the problem that appears when an artificial intelligence system has to face an unknown situation where classes unseen during the training stage are present at a usage stage. FSL corresponds to the problem that occurs when there are very few samples available for each considered class. Finally, since these systems are normally deployed in edge devices, the consideration of the execution time must be taken into account, as the less time the system takes to give a response, the better the experience perceived by the users. Solutions based on Deep Learning techniques for similar problems in the image domain have shown promising results. The most widespread solutions are those that implement Convolutional Neural Networks (CNNs). Therefore, many state-of-the-art audio systems propose to convert audio signals into a two-dimensional representation that can be treated as an image. The generation of internal maps is often done by the convolutional layers of the CNNs. However, these layers have a series of limitations that must be studied in order to be able to propose techniques for improving the resulting feature maps. To this end, novel networks have been proposed that merge two different methods such as residual learning and squeeze-excitation techniques. The results show an improvement in the accuracy of the system with the addition of few number of extra parameters. On the other hand, these solutions based on two-dimensional inputs can show a certain bias since the choice of audio representation can be specific to a particular task. Therefore, a comparative study of different residual networks directly fed by the raw audio signal has been carried out. These solutions are known as end-to-end. While similar studies have been carried out in the literature in the image domain, the results suggest that the best performing residual blocks for computer vision tasks may not be the same as those for audio classification. Regarding the FSL and OSR problems, an autoencoder-based framework capable of mitigating both problems together is proposed. This solution is capable of creating robust representations of these audio patterns from just a few samples while being able to reject unwanted audio classes.