Audio analysis has become very important, since a growing share of interactions takes place through interactive voice interfaces. The main information extracted from audio is its semantics, that is, the meaning of the concepts expressed.
Some areas of application are:
– recognition of feelings and/or emotions from the voice;
– identification of the speaker's geographical origin;
– medical information (indications on the symptoms of some pathologies).
Solving tasks related to audio, and to voice in particular, is very demanding: the techniques used are often based on Deep Learning approaches that require high computing power.
Federico Schipani, Machine learning engineer
Voice interactions are increasingly widespread because they improve both the relationship between users and the relationship between users and software or machines.
Some examples are: voice controls inside cars, on smartphones or even for industrial applications and, more generally, all those situations in which you cannot, or do not want to, use your hands to type a text.
For this reason, audio analysis has become a very important topic: compared to a text file, a spoken sentence carries much more detailed information, such as the volume and tone of voice.
In the world of sales it is increasingly difficult to establish oneself and to be attractive in the eyes of potential customers; for this reason, having information about conversations can play a decisive role in both the sales phase and the analysis phase. During a negotiation it is possible to collect a lot of information that may improve the sales experience, for example to quickly understand the degree of interest of the parties, to decide whether to continue, or to adjust an offer to make it more appropriate. At a strategic level, it is important for managers to know the progress of campaigns and to be able to explore aggregate data, supported by appropriate evaluation metrics, on which to base future decisions.
For written information, such problems are usually solved with deep learning techniques. With audio, a major limitation is the large amount of data needed to train models of this type. Furthermore, most of the available models have been developed for the English language; applications for the Italian language are less common and often do not perform well enough, so it is often necessary to go through a data collection phase that can be both long and expensive.
Going into detail, we can identify four audio analysis tasks: speech recognition, sentiment analysis, emotion analysis and dialect recognition.
Speech recognition is the cornerstone of audio analysis: starting from a voice recording, the aim is to transcribe its content into text, which makes it comparable to a more complex version of a sound classifier. Problems of this type are often modelled as phoneme classifiers, possibly followed by other models that predict the text from the recognised phonemes.
Recognition of dialects
The task whose nature is easiest to understand is undoubtedly dialect identification: starting from an audio file, the aim is to identify the origin of the speaker. Although a nation may share a single language, that language often varies between regions, giving rise to dialects. Italy offers a borderline case: the national language is Italian, but in Sardinia people speak Sardinian, which is considered an autonomous language rather than a dialect.
Figure: Grouping of dialects in Italy 
In general, however, most dialects tend to resemble one another, so although the task can be framed in the same way as a language classification problem, it is much more complicated.
Sentiment and Emotion analysis
The expression sentiment analysis is often used interchangeably with emotion analysis. The substantial difference between the two lies in the number of classes. Sentiment analysis can be traced back to a binary classification problem, in which the input is assigned to either a negative or a positive class. In some cases, however, it is preferable to add a further level of precision by introducing a neutral class.
Emotion analysis, on the other hand, is a more complex problem that is harder to manage. Unlike sentiment analysis, there are not two or three clearly distinct classes, but a larger number of classes that may resemble one another. For example, happiness and excitement are two different emotions, yet they have several characteristics in common, such as volume or tone of voice.
In some contexts, emotion analysis can be seen as an extension of sentiment analysis, leading to a more detailed and less simplified view of a conversation. For example, angry and bored are two emotions that are not exactly positive, so in a sentiment analysis they would both fall into the negative class.
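The relationship between the two tasks can be sketched as a simple mapping from fine-grained emotion labels to coarse sentiment classes. This is only an illustration: the label set and the dictionary name are hypothetical, and only the assignment of angry and bored to the negative class comes from the text; the other assignments are assumptions.

```python
# Hypothetical label set: only "angry" and "bored" -> "negative" come from the text,
# the remaining assignments are illustrative assumptions.
EMOTION_TO_SENTIMENT = {
    "happiness": "positive",
    "excitement": "positive",
    "neutral": "neutral",
    "angry": "negative",
    "bored": "negative",
    "sadness": "negative",
}


def emotion_to_sentiment(emotion: str) -> str:
    """Collapse a fine-grained emotion label into a coarse sentiment class."""
    return EMOTION_TO_SENTIMENT[emotion.lower()]


# Example: emotion predictions for four segments of a conversation
predicted_emotions = ["angry", "happiness", "bored", "neutral"]
sentiments = [emotion_to_sentiment(e) for e in predicted_emotions]
print(sentiments)
```

The mapping only works in one direction: several emotions collapse into the same sentiment class, which is why emotion analysis gives the more detailed view.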
Because of the nature of these problems, when only textual information is available, such as product reviews, sentiment analysis tends to be preferred over emotion analysis: without audio or images, the latter is much more difficult.
Techniques for analysing audio
Audio analysis can be carried out using different techniques, each with its own type of pre-processing of the audio file. The most natural approach is to carry out speech recognition and then work on the text using Natural Language Processing tools. Other, slightly less intuitive techniques involve generating audio features that are then analysed by a classifier built with a Neural Network.
Classification through Neural Networks
An artificial neural network is nothing more than a mathematical model whose purpose is to approximate a certain function. Its structure is described as layered, as it contains a first input layer, intermediate layers called hidden layers and a final output layer. Each hidden layer of the network is made up of a predefined number of processing units called neurons. Each neuron at layer n receives input signals from the neurons at layer n-1 and sends an output signal to the neurons at layer n+1.
Figure: Artificial Neural Network 
The most common networks, which we describe below, are called feedforward, since their computation graph is directed and acyclic. We will now see how a neural network learns. First of all, the internal structure of a neuron must be described in more detail. As previously mentioned, the single computational unit has both an input and an output, but how is the data transformed inside it?
Figure: Artificial neuron 
For each connection i entering a neuron, we define an input $x_i$ and an associated weight $w_i$; the inputs are then combined linearly with the associated weights by carrying out the operation
$$z = \sum_i w_i x_i$$
The result of this operation is then given as input to a function called the activation function, denoted here by $\sigma$. Summarising and formalising the operations carried out within a single neuron more concisely, the output is:
$$y = \sigma\Big(\sum_i w_i x_i\Big)$$
The activation function is the component of the model that determines the output value from the linear combination of the neuron weights with the outputs of the previous layer. There are several activation functions: for the current purpose, it is sufficient to know that the (non-linear) function called ReLU is often used for the intermediate layers, while in a classification problem the last layer uses a function called Softmax, whose property is to output a probability distribution over the classes.
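As a minimal sketch of the computation described so far, the NumPy snippet below evaluates a single neuron and then a tiny feedforward network with one ReLU hidden layer and a Softmax output. The layer sizes, the random weights and the function names are assumptions chosen for illustration, and bias terms are omitted because the text does not introduce them.

```python
import numpy as np


def relu(z):
    """ReLU activation: keeps positive values, sets negative ones to zero."""
    return np.maximum(0.0, z)


def softmax(z):
    """Softmax activation: turns a vector of scores into a probability distribution."""
    exp = np.exp(z - np.max(z))  # subtracting the maximum improves numerical stability
    return exp / np.sum(exp)


def neuron_output(x, w, activation):
    """Output of one neuron: activation applied to the weighted sum of its inputs."""
    return activation(np.dot(w, x))


rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])        # input signals (illustrative values)
W_hidden = rng.normal(size=(4, 3))    # 4 hidden neurons, each with 3 input weights
W_out = rng.normal(size=(2, 4))       # 2 output classes

hidden = relu(W_hidden @ x)           # all hidden neurons computed at once
probs = softmax(W_out @ hidden)       # class probabilities

print("hidden layer:", hidden)
print("class probabilities:", probs)
```

Each row of `W_hidden` holds the weights of one neuron, so the matrix product is simply the per-neuron weighted sum computed for the whole layer in one step.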
The last problem to be solved is how to train a model of this type. The training of a Neural Network takes place in two phases: the first, called forward propagation, in which the network is fed with data taken from a training set, and the second, called back propagation, in which the actual training takes place.
In forward propagation, a part of the dataset called the training set is used to calculate the prediction of the model and compare it with reality. The comparison, calculated by means of a loss function, is an objective measure of how correct the prediction is. Training is the mathematical optimisation of the loss function: in other words, the model weights are modified in an attempt to reduce the loss and therefore bring the prediction of the model as close as possible to reality. In the back propagation phase, a rule for calculating derivatives called the chain rule is used: by calculating the derivative of the loss function with respect to the weights of each layer, we can vary the weights so as to decrease the value of the function being optimised.
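The two phases can be sketched on a toy problem. The example below is an assumption-laden illustration, not a recipe: the synthetic data, the mean squared error loss, the network size, the learning rate and the number of steps are all chosen arbitrarily, and biases are omitted for brevity. It shows forward propagation, the loss, and the chain rule applied layer by layer to update the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (an assumption for illustration): y = |x0| + x1
X = rng.normal(size=(64, 2))
y = np.abs(X[:, 0]) + X[:, 1]

# One hidden layer with ReLU and a single linear output neuron
W1 = rng.normal(scale=0.5, size=(2, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))
learning_rate = 0.02
losses = []

for step in range(300):
    # Forward propagation: compute the prediction and compare it with reality
    Z = X @ W1                     # weighted sums of the hidden layer
    H = np.maximum(0.0, Z)         # ReLU activation
    pred = (H @ W2)[:, 0]          # network output
    error = pred - y
    losses.append(np.mean(error ** 2))   # mean squared error loss

    # Back propagation: chain rule from the loss back to each weight matrix
    d_pred = 2.0 * error[:, None] / len(y)   # derivative of the loss w.r.t. the output
    d_W2 = H.T @ d_pred                      # derivative w.r.t. the output weights
    d_H = d_pred @ W2.T                      # derivative w.r.t. the hidden activations
    d_Z = d_H * (Z > 0)                      # ReLU passes gradient only where Z > 0
    d_W1 = X.T @ d_Z                         # derivative w.r.t. the hidden weights

    # Move the weights against the gradient to reduce the loss
    W1 -= learning_rate * d_W1
    W2 -= learning_rate * d_W2

print(f"loss before training: {losses[0]:.4f}, after training: {losses[-1]:.4f}")
```

In practice Deep Learning libraries compute these derivatives automatically; writing them by hand here only serves to make the chain rule visible.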
The audio files, before they are given as input to a classifier, must undergo a preprocessing phase where coefficients called features are extracted.
An audio file has a characteristic called the sampling rate. It is usually assumed to be 16 kHz, which means that one second of audio contains sixteen thousand samples. We will use this rate as a standard in the following examples. Calculating a set of coefficients for each sample is a very onerous operation, and also a useless one, since the variation between one sample and the next, or the previous one, is minimal. We therefore consider batches of 400 consecutive samples, corresponding to 25 ms. The audio features are then calculated for each batch of samples. In speech recognition tasks a particular type of feature called the Mel-Frequency Cepstral Coefficient (MFCC) is generally used; these are calculated with the following procedure:
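The arithmetic behind the batches of samples can be checked with a short NumPy sketch. The function name `frame_signal`, the synthetic 440 Hz tone and the choice of non-overlapping batches are assumptions for illustration; overlapping batches (a hop shorter than the batch length) are also a common choice.

```python
import numpy as np

SAMPLE_RATE = 16_000   # samples per second (16 kHz)
FRAME_LENGTH = 400     # samples per batch


def frame_signal(signal, frame_length=FRAME_LENGTH, hop_length=FRAME_LENGTH):
    """Split a 1-D signal into batches of `frame_length` consecutive samples.

    `hop_length` is the distance between the starts of two batches; using the
    batch length itself (the default, an assumption) gives non-overlapping batches.
    """
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one batch of samples")
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)[:, None]
    offsets = np.arange(frame_length)[None, :]
    return signal[starts + offsets]


# One second of a synthetic 440 Hz tone stands in for a real recording
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
signal = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(signal)
frame_duration_ms = 1000 * FRAME_LENGTH / SAMPLE_RATE

print(frames.shape)        # (40, 400): 40 batches of 400 samples
print(frame_duration_ms)   # 25.0 milliseconds per batch
```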
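- Once a batch of samples (a frame) is taken, its Fourier Transform is calculated. Considering the p-th frame $x_p[n]$ of $N$ samples and a window function $w[n]$:
$$X_p(k) = \sum_{n=0}^{N-1} x_p[n]\, w[n]\, e^{-j 2 \pi k n / N}, \qquad k = 0, \dots, N-1$$
And given the sampling rate $f_s$, the frequency corresponding to $k$ is given by
$$f(k) = \frac{k\, f_s}{N}$$
Calculating it for all $P$ batches of samples we will then get
$$X = [X_1, X_2, \dots, X_P]$$
the Short Time Fourier Transform Matrix (STFT Matrix), whose element $X(k, p)$ is the k-th coefficient of the p-th frame.
- The Mel Filterbank is calculated:
  - Using the conversion formula, you pass from the frequency scale to the Mel scale:
$$\phi(f) = 2595 \log_{10}\left(\frac{f}{700} + 1\right)$$
  - With $F$ the number of filters, and $\phi_{\min}$, $\phi_{\max}$ the Mel values of the lowest and highest frequencies considered, the centres of the filters on the Mel scale are calculated as
$$\phi_c(m) = \phi_{\min} + m\, \frac{\phi_{\max} - \phi_{\min}}{F + 1}, \qquad m = 0, \dots, F+1$$
then, using the inverse formula, this value is scaled to the frequencies, obtaining
$$f_c(m) = 700 \left(10^{\phi_c(m) / 2595} - 1\right)$$
  - The Mel filter bank is given by this function, for $m = 1, \dots, F$:
$$M(m, k) = \begin{cases} 0 & f(k) < f_c(m-1) \\ \dfrac{f(k) - f_c(m-1)}{f_c(m) - f_c(m-1)} & f_c(m-1) \le f(k) < f_c(m) \\ \dfrac{f(k) - f_c(m+1)}{f_c(m) - f_c(m+1)} & f_c(m) \le f(k) < f_c(m+1) \\ 0 & f(k) \ge f_c(m+1) \end{cases}$$
- We proceed with the calculation of MFCC:
  - We pass to the logarithmic scale of the Mel:
$$L(m, p) = \ln\left(\sum_{k=0}^{N-1} M(m, k)\, |X(k, p)|\right)$$
  - The discrete cosine transform is applied:
$$\Phi(r, p) = \sum_{m=1}^{F} L(m, p) \cos\left(\frac{r\,(2m - 1)\,\pi}{2F}\right), \qquad r = 1, \dots, F$$
therefore obtaining that the p-th column of the matrix $\Phi$ represents the MFCC coefficients of the corresponding frame of the signal.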
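The procedure above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions rather than a reference implementation: the Hamming window, the 26 filters, the 13 coefficients kept, the frequency range from 0 Hz to half the sampling rate, the small constant added before the logarithm and all function names are choices made here, and a dedicated audio library would normally be used in practice.

```python
import numpy as np


def hz_to_mel(f):
    """Frequency scale -> Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(mel):
    """Mel scale -> frequency scale (inverse formula)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)                            # edges and centres in Hz
    fft_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft  # f(k) = k * fs / N
    bank = np.zeros((n_filters, fft_freqs.size))
    for m in range(1, n_filters + 1):
        left, centre, right = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[m - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc(frames, sample_rate, n_filters=26, n_coeffs=13):
    """MFCC matrix with one column per frame."""
    n_fft = frames.shape[1]
    window = np.hamming(n_fft)                        # assumed window function
    # rfft keeps only the non-redundant half of the spectrum (k = 0 .. N/2),
    # which is where the filters are non-zero
    magnitude = np.abs(np.fft.rfft(frames * window, axis=1))
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    log_mel = np.log(magnitude @ bank.T + 1e-10)      # small constant avoids log(0)
    m = np.arange(1, n_filters + 1)
    r = np.arange(1, n_coeffs + 1)                    # keep only the first coefficients
    dct_basis = np.cos(np.pi * np.outer(r, 2 * m - 1) / (2 * n_filters))
    return dct_basis @ log_mel.T                      # shape (n_coeffs, number of frames)


# One second of a synthetic tone at 16 kHz, split into 40 frames of 400 samples
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)
frames = signal.reshape(-1, 400)

coeffs = mfcc(frames, sample_rate)
print(coeffs.shape)   # (13, 40)
```

Keeping only the first few coefficients is a common way to obtain a compact description of each frame; the number kept here is an arbitrary choice.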
From these coefficients it is then possible to generate spectrograms, or matrices, on which the relevant analyses will be carried out.
Audio analysis through recurrent networks
Nowadays two very effective types of recurrent network are Long Short Term Memory (LSTM) and Bidirectional LSTM (BiLSTM). For an accurate description of this type of structure, please refer to this previously published article.
A significant disadvantage of this type of structure, at least as it is commonly used in practice, is that the size of the input cannot be varied: the network always receives a vector of a fixed size.
However, there are recent works based on the attention mechanism that solve this problem and at the same time manage long-term dependencies more effectively. By way of example, one of the simplest attention mechanisms can be described as a vector. Let's first define
$$H \in \mathbb{R}^{T \times F}$$
the input matrix of the form T x F, where T is the number of frames and F is the number of features per frame. The attention mechanism is a vector
$$a \in \mathbb{R}^{F}$$
such that, if multiplied by H, forming
$$\alpha = H a \in \mathbb{R}^{T}$$
(typically normalised with a softmax), it produces an attention map. A possible use for this attention map is as a weighting factor for the output of an RNN, so that it tells the model where to pay more attention.
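A minimal NumPy sketch of this simple attention mechanism is shown below. Everything numeric is a stand-in: the matrix H and the RNN outputs are random, the attention vector `a` is random rather than learned, the dimensions are arbitrary, and the softmax normalisation is an assumption about how the map is turned into weights.

```python
import numpy as np


def softmax(z):
    """Normalise scores so that they are positive and sum to one."""
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()


rng = np.random.default_rng(0)
T, F, D = 50, 13, 32                     # frames, features per frame, RNN output size

H = rng.normal(size=(T, F))              # input matrix: one row of features per frame
a = rng.normal(size=F)                   # attention vector (learned in a real model)
rnn_outputs = rng.normal(size=(T, D))    # stand-in for the per-frame outputs of an RNN

alpha = softmax(H @ a)                   # attention map: one weight per frame
context = alpha @ rnn_outputs            # weighted sum of the RNN outputs over time

print(alpha.shape, context.shape)        # (50,) (32,)
print("most attended frame:", int(np.argmax(alpha)))
```

The resulting `context` vector has a fixed size regardless of the number of frames T, which is what makes this kind of pooling convenient in front of a classifier.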
Audio analysis via CNN
From the audio features extracted during preprocessing it is possible to generate a spectrogram, which is a three-dimensional graphic representation in the form of an image that shows the variation of the frequency spectrum over time. On the horizontal axis, there is therefore time, while on the vertical axis there is frequency. The third dimension is represented by the colour scale that is used to show the change in amplitude of a frequency.
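As a rough sketch of how such a representation can be built, the snippet below computes a log-magnitude spectrogram as a two-dimensional array with frequency on the rows and time on the columns; the values play the role of the colour scale. The function name, the Hann window, the hop of 160 samples and the synthetic 1 kHz tone are assumptions for illustration, and the spectrogram is computed here directly from the short time Fourier transform.

```python
import numpy as np


def log_spectrogram(signal, frame_length=400, hop_length=160):
    """Log-magnitude spectrogram with shape (frequency bins, time frames)."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)[:, None]
    frames = signal[starts + np.arange(frame_length)[None, :]]
    frames = frames * np.hanning(frame_length)         # assumed window function
    magnitude = np.abs(np.fft.rfft(frames, axis=1))    # one spectrum per frame
    spec_db = 20.0 * np.log10(magnitude + 1e-10)       # amplitude in decibels
    return spec_db.T                                   # rows: frequency, columns: time


# One second of a synthetic 1 kHz tone sampled at 16 kHz
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000 * t)

spec = log_spectrogram(signal)
peak_bin = int(spec.mean(axis=1).argmax())    # frequency bin with the highest energy
peak_frequency = peak_bin * sample_rate / 400

print(spec.shape)        # (201, 98)
print(peak_frequency)    # 1000.0
```

The array can then be rendered as an image with any plotting library, or passed directly to an image classifier.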
Figure: Examples of spectrograms for each emotion class. Panels include (d) Happiness, (e) Sadness and (f) Neutral.
To all intents and purposes, spectrograms are images, so the task becomes an image classification problem and it becomes possible to use a CNN, or models based on it, to classify them. For a quick and detailed explanation of what a Convolutional Neural Network is, please refer to this previously published article.
Nowadays there are many Deep Learning models suitable for Image Classification; one of the most famous is VGG, in all its variants.
Furthermore, thanks to pre-training carried out on other datasets, it is possible to achieve satisfactory results with relatively little data at our disposal. In fact, the DL libraries that implement these models often let you choose whether to download a set of weights from which to start the training. In this case, we speak of Transfer Learning or Fine Tuning.
Audio analysis is spreading in many fields of application and makes it possible to control software and machines through voice interaction, solving many situations in which the user is unable to interact physically with the device, or is hindered in doing so. Through deep learning tools, it is possible to extract a lot of information from voice interaction. For example, information concerning the emotion or the origin of the interlocutor improves the interaction and the user experience, and also leads to the collection of new data from which to generate much more exhaustive and precise performance indicators.
The use of these tools is having a very positive impact in the field of telephone assistance, as it allows for better call management, accurate real-time analysis and the planning of more effective strategies. Furthermore, it is possible to use question answering models to support answers much faster than with classic search tools.
Finally, in the field of preventative medicine, the latest research has investigated the possibility of using audio analysis, especially of coughing, to help doctors identify subjects that are positive for COVID-19. The studies, although preliminary, seem very promising and would allow for fast, non-invasive and large-scale screening with simple audio collection tools such as smartphones.
In summary, the strengths of using audio analysis techniques are:
- Acquisition of new data and creation of more effective KPIs
- Workflow optimisation
- Improvement of human-machine interactions
The weaknesses are:
- Requires a lot of annotated data
- Difficulty in finding Italian datasets
- Privacy and data content management
 Choi, Keunwoo, et al. “A comparison of audio signal preprocessing methods for deep neural networks on music tagging.” 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018.
 Muaidi, Hasan, et al. “Arabic Audio News Retrieval System Using Dependent Speaker Mode, Mel Frequency Cepstral Coefficient and Dynamic Time Warping Techniques.” Research Journal of Applied Sciences, Engineering and Technology 7.24 (2014): 5082-5097.
 Kopparapu, Sunil Kumar, and M. Laxminarayana. “Choice of Mel filter bank in computing MFCC of a resampled speech.” 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010). IEEE, 2010.
 Ramet, Gaetan, et al. “Context-aware attention mechanism for speech emotion recognition.” 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018.
 Laguarta et al. “COVID-19 Artificial Intelligence Diagnosis using only Cough Recordings”, IEEE Open Journal of Engineering in Medicine and Biology, IEEE, 2020.