AI & Medicine: How modern machines will improve healthcare

Healthcare is showing great interest in Artificial Intelligence, and especially in Machine Learning technologies capable of providing better instruments to the entire medical staff and more personalized care. Research is now focused on Machine Learning techniques capable of improving diagnoses. This may lead to the identification of the patterns of a given pathology and, more generally, to better services for doctors, so that they can treat individual patients with measures adapted to their specific needs. Furthermore, recent deep learning techniques could, in the near future, fully automate some phases of therapy, allowing doctors and practitioners to devote more time to the study of new solutions.

AI & Medicine

What if a group of experts could always be at the service of an individual patient? How can we reduce the errors made when formulating a diagnosis?

In an ideal world, each therapy would be formulated to meet the needs of the individual with extreme precision, taking cues both from his or her clinical history and from the collective knowledge given by millions of other patients. Thanks to Machine Learning, physicians can overcome the limits of circumstantial knowledge, which results from applying therapies to a small number of patients, and draw on a “shared experience” to formulate therapies optimized for a specific person [1].

The adoption of artificial intelligence techniques can allow the total automation of some clinical steps, facilitating the work of specialists.

Prognosis

A prognosis is a process by which the development of a certain disease can be predicted. A machine learning model can allow doctors to predict future events: how likely is it that a patient will be able to return to everyday life after undergoing treatment? How quickly will a disease progress?
An algorithmic model requires data that provide a complete picture, including the results of past treatments [1]. In the clinical area, different types of data are collected, such as phenotypic, genomic, proteomic and pathological test results, along with medical images [2].

Diagnosis

The best doctors are able to understand whether a particular clinical event is actually normal or represents a risk for the patient’s health. The American Institute of Medicine has pointed out that most people will experience at least one misdiagnosis in the course of their lives [3]: reducing any kind of error can be crucial in the case of uncommon pathologies, and it can also be beneficial for diseases that are more familiar to us. Suffice it to say that diseases thought to be eradicated, such as tuberculosis and dysentery, still have a chance of going undetected, even though developed countries have adequate access to therapies capable of dealing with them [4]. Through data collected during everyday therapies, AI techniques can identify the most likely diagnoses during a clinical visit and project which conditions will manifest themselves in the patient in the future [1].

Treatment

In a nationwide healthcare system, with thousands of physicians treating as many patients, variations may arise in how certain symptoms are treated. An ML algorithm can detect these natural variations in order to help doctors identify when one treatment is to be preferred over another [1]. One application might be to compare what the doctor would prescribe to a patient with the treatment suggested by an algorithmic model [1].

Workflows for physicians

The introduction of electronic health records (EHRs) has facilitated access to data, but at the same time has brought out “bottlenecks” resulting from bureaucratic and administrative steps, creating additional complications for physicians.
ML techniques can streamline inefficient and cumbersome steps within the clinical workflow [1]. The same technologies that are used in search engines can highlight relevant information in a patient’s medical record, facilitating the work of specialists. They can also make new data entry easier by taking into account a subject’s clinical history [1].

Involvement of more experts

The adoption of artificial intelligence can make it possible to reach more specialists, who can provide a medical assessment without being directly involved [1]. For example, a patient could send a photo from his or her smartphone to obtain an immediate diagnosis, without resorting to medical channels intended for more urgent cases [1].

Machine Learning techniques commonly used in the medical field

The ML techniques most commonly used in the medical literature are discussed below. The treatment here is oriented towards the medical field and does not delve into the technical aspects.

Support Vector Machine

SVMs are mainly used to classify subjects into two groups, labelled Y = 1 and Y = -1 respectively [5]. These groups are separated by a decision boundary defined on the input data X:

$$a_i = \sum_{j} w_j X_{ij} + b$$

where X_{ij} is the j-th feature of subject i, w is the weight vector and b is a bias term; subject i is assigned to one group or the other according to the sign of a_i. The goal of SVM training is to find the optimal parameter w so that the classification is as accurate as possible (Figure 1). One of the most important properties of SVMs is that parameter determination is a convex optimization problem, so the solution is always a global optimum.

Figure 1: An example of how a Support Vector Machine works [5].

Convolutional Neural Network

The growth of the computational capabilities of modern devices has allowed Deep Learning to become one of the most popular fields of research within various scientific disciplines, and medicine has been no exception [5][6].
Thanks to these “deep” learning techniques (so called because of their many layers, which can abstract very complex patterns), it is possible to perform a detailed analysis of medical images, such as X-ray scans, exploiting the ability of a neural network to handle voluminous and extremely complex data efficiently [5]. Over the years, Convolutional Neural Networks have gained enormous popularity, especially in the medical world: suffice it to say that from 2014 onwards, this particular type of neural network has overtaken methodologies such as Recurrent Neural Networks and Deep Belief Networks [5] (Figure 2).

Figure 2: popularity of Deep Learning algorithms in the medical field [5].

A CNN is based on an operation called convolution, which can track how multidimensional data varies as a function of position within a space [7]. Considering a two-dimensional image I and a kernel K (a multidimensional array carrying the parameters learned by the algorithm), the convolution is:

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n)$$

Figure 3 shows the broad structure of a CNN, while Figure 4 shows the result of applying the convolution operation.

Figure 3: Draft of a CNN [8].

Figure 4: Result of some convolution operations [8].

Random Forest

Random Forest is a technique that involves the use of multiple regressors structured as trees (i.e., decision trees). Each tree expresses its own candidate through a classification algorithm; subsequently, the votes of all the trees are averaged:

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$

In the equation, B is the number of bagged trees, i.e., decision trees f_b each trained on a different bootstrap sample of the dataset [9]. In the medical field, this type of technique can be used, for example, to discriminate phenotypic characters of an organism [10], or to classify the clinical data of a patient so that an accurate diagnosis can be provided [11].
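The averaging step described for Random Forest can be illustrated with a minimal sketch. It shows only bagging (bootstrap samples plus averaging) on one-split trees and a single feature, so it is not a full Random Forest, which would also sample features at random. The function names `fit_stump` and `fit_bagged` and the toy data are hypothetical and chosen for illustration only.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (a 'stump') by minimising squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        lm, rm = mean(left), mean(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x is identical: fall back to the mean
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_bagged(xs, ys, B=25, seed=0):
    """Train B stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(tree(x) for tree in trees)

# Hypothetical toy data: one feature, outcome 0 below 10 and 1 from 10 upwards.
xs = list(range(20))
ys = [0 if x < 10 else 1 for x in xs]
predict = fit_bagged(xs, ys)
print(predict(2), predict(17))
```

Each stump sees a slightly different resampling of the data, so individual split points vary; averaging the B predictions smooths out that variation.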
Limitations of AI in the medical field

Availability of quality data

One of the central issues in building an ML model is being able to draw on a dataset that is representative of all possible subjects, so that it is as diverse as possible [1]. The ideal would be to train algorithms on data very similar, if not identical, to those reported in electronic medical records [1]. Unfortunately, we often have to deal with small datasets, collected by small clinical centers and sometimes of poor quality (containing noise, i.e., irregularities generated by erroneous data).

Privacy

As mentioned, having datasets consisting of correctly compiled medical records would be ideal. However, these data are considered sensitive under current legislation and are therefore difficult to obtain, which makes it harder to build ML models. A natural solution might be to hand over the clinical data to patients themselves, who would then decide what to do with them [1].

Learning from past bad practices

All human activities are unintentionally subject to cognitive bias: one of the issues to consider when developing an ML system is how much these biases, reflected in the data, will affect the final model [12] and what tools to put in place to address this issue [13].

Experience in final evaluation

As with healthcare systems, the application of ML techniques requires a sophisticated regulatory framework that can ensure the proper use of algorithms in the medical field [1]. Physicians and patients must understand the limitations of these tools, such as the inability of a given framework to generalize to another type of problem [13]. Blindly relying on ML models can lead to erroneous decisions: for example, a physician might let his or her guard down if the algorithm returns an incorrect result that falls below a certain alarm threshold [1].
Interdisciplinary cooperation

Teams of computer scientists, biologists, and physicians must collaborate so that they can build models that can be used in their respective fields. A lack of communication can lead to results that are unusable by physicians [1]. Scientific publications are often published online as preprints on portals such as arXiv and bioRxiv, not to mention the multitude of computer science manuscripts that are published not in traditional scientific journals but at conferences such as NeurIPS and ICML [1].

AI techniques applied to COVID-19

Since the beginning of the SARS-CoV-2 pandemic, the scientific world has focused its attention on methods that can counteract the growth of infections: in March alone, 24000 preprints were published on the arXiv and bioRxiv portals concerning the use of AI techniques with the task of identifying patients affected by COVID-19 [15]. Many authors have used CNNs to discriminate COVID patients from others with more common diseases, using datasets consisting of X-ray scans of patients with pneumonia (Figure 5) [16][17].

Figure 5: structure of a CNN for the detection of SARS-CoV-2 patients [16].

In addition, several real-time methods have been proposed for the immediate detection of the disease: the use of smartwatches, for example, allows the monitoring of several physical parameters that can indicate whether or not the subject has contracted the virus [18].

Conclusions

Marco, a 49-year-old patient, felt a pain in his shoulder and, despite this, decided not to seek medical assistance. A few months later, he decides to see a doctor, who diagnoses seborrheic keratosis (a skin growth similar to a large mole). Later, Marco undergoes a colonoscopy and a nurse notices a dark spot on his shoulder. Marco decides to visit a dermatologist, who takes a sample of the growth: the analyses carried out show a benign pigmented lesion.
The dermatologist, however, is not convinced and decides to carry out a second analysis: this time the diagnosis is invasive melanoma. An oncologist then subjects Marco to chemotherapy but, in the meantime, a physician friend asks the poor patient why he has not yet undergone immunotherapy [1].

If Marco had had access to the latest ML technology, he could have simply taken a picture of his shoulder with his smartphone and forwarded the image to an experienced dermatologist via a dedicated app. Following a biopsy of the lesion, recommended by the dermatologist, a diagnosis of stage 1 melanoma would have been made: at that point, the dermatologist could have excised the lesion [1].

The application of artificial intelligence in medicine will be able to save time, make better use of specialists’ know-how, enable more accurate diagnoses and, more generally, improve patients’ lives and streamline the work of practitioners.

Pros:
– More accurate diagnoses
– Streamlining of medical and bureaucratic procedures
– Acquisition of global therapeutic knowledge
– Increased specialization of medical personnel

Cons:
– Availability of quality data
– Lack of shared data collection procedures
– Fundamental human component in the final evaluation of the diagnosis


[1] Rajkomar, A. – Dean, J. – Kohane, I. “Machine Learning in Medicine.” New England Journal of Medicine, Vol. 380, pp. 1347-1358 (2019)
[2] Qayyum, A. – Junaid, Q. – Muhammad, B. – Ala, A. “Secure and Robust Machine Learning for Healthcare: A Survey.” IEEE Reviews in Biomedical Engineering (2020)
[3] McGlynn, E. – McDonald, K. – Cassel, C. “Measurement is Essential for Improving Diagnosis and Reducing Diagnostic Error: A Report from the Institute of Medicine.” The Journal of the American Medical Association, Vol. 314(23), pp. 1-2 (2015)
[4] Das, J. – Woskie, L. – Rajbhandari, R. – Abbasi, K. – Jha, A. “Rethinking assumptions about delivery of healthcare: implications for universal health coverage.” The British Medical Journal (Clinical research ed.), Vol. 361 (2018)
[5] Jiang, F. – Jiang, Y. – Zhi, H. – Dong, Y. – Li, H. – Ma, S. – Wang, Y. – Dong, Q. – Shen, H. – Wang, Y. “Artificial Intelligence in Healthcare: Past, Present and Future.” Stroke and Vascular Neurology, Vol. 2 (2017)
[6] Mori, J. – Kaji, S. – Kawai, H. – Kida, S. – Tsubokura, M. – Fukatsu, M. – Harada, K. – Noji, H. – Ikezoe, T. – Maeda, T. – Matsuda, A. “Assessment of dysplasia in bone marrow smear with convolutional neural network.” Scientific Reports, Vol. 10 (2020)
[7] Goodfellow, I. – Bengio, Y. – Courville, A. “Deep Learning.” MIT Press (2016)
[8] Yamashita, R. – Nishio, M. – Do, R. – Togashi, K. “Convolutional neural networks: an overview and application in radiology.” Insights into Imaging, Vol. 9(4), pp. 611–629 (2018)
[9] Breiman, L. “Random Forests.” Machine Learning, Vol. 45(1), pp. 5–32 (2001)
[10] Chen, T. – Cao, Y. – Zhang, Y. – Liu, J. – Bao, Y. – Wang, C. – Jia, W. – Zhao, A. “Random forest in clinical metabolomics for phenotypic discrimination and biomarker selection.” Evidence-based Complementary and Alternative Medicine: eCAM (2013)
[11] Alam, Z. – Rahman, S. – Rahman, S. “A Random Forest based predictor for medical data classification using feature ranking.” Informatics in Medicine Unlocked, Vol. 15 (2019)
[12] Gianfrancesco, M. A. – Tamang, S. – Yazdany, J. – Schmajuk, G. “Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data.” JAMA Internal Medicine, Vol. 178(11), pp. 1544-1547 (2018)
[13] Rajkomar, A. – Hardt, M. – Howell, M. D. – Corrado, G. – Chin, M. H. “Ensuring Fairness in Machine Learning to Advance Health Equity.” Annals of Internal Medicine, Vol. 169(12), pp. 866–872 (2018)
[14] Krumholz, H. M. “Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system.” Health Affairs (Project Hope), Vol. 33(7), pp. 1163-70 (2014)
[15] Hao, K. “Over 24,000 coronavirus research papers are now available in one place.” MIT Technology Review (2020)
[16] Jain, G. – Mittal, D. – Thakur, D. – Mittal, M. K. “A deep learning approach to detect Covid-19 coronavirus with X-Ray images.” Biocybernetics and Biomedical Engineering, Vol. 40(4), pp. 1391-1405 (2020)
[17] Hemdan, E. E. – Shouman, M. – Karar, M. “COVIDX-Net: A Framework of Deep Learning Classifiers to Diagnose COVID-19 in X-Ray Images.” ArXiv, abs/2003.11055 (2020)
[18] Mishra, T. – Wang, M. – Metwally, A. A. – Bogu, G. K. – Brooks, A. W. – Bahmani, A. – Alavi, A. – Celli, A. – Higgs, E. – Dagan-Rosenfeld, O. – Fay, B. – Kirkpatrick, S. – Kellogg, R. – Gibson, M. – Wang, T. – Hunting, E. M. – Mamic, P. – Ganz, A. B. – Rolnik, B. – Li, X. – Snyder, M. P. “Pre-symptomatic detection of COVID-19 from smartwatch data.” Nature Biomedical Engineering (2020)

Deep Learning and synthetic datasets

How to deal with the lack of data

In applications such as Industry 4.0, the available data does not always meet the needs of complex algorithmic structures or carry correct labelling. This is why synthetic data is used to make up for any shortcomings that may be present in real datasets.

For a number of fundamental reasons, interest in procedures capable of generating synthetic data is gradually growing in the world of Industry 4.0: in addition to the well-known difficulties associated with respecting privacy, complex algorithmic structures (such as neural networks) need a large amount of data. Not only that: to obtain realistic results, the available data must have one decisive characteristic, namely accurate annotation (also called labelling). Unfortunately, for a number of reasons, these requirements cannot always be met.

Over the last few years, multiple methods have been developed that can effectively generate synthetic datasets, adding a further step to the learning pipeline. This makes it possible to obtain very plausible data while cutting the costs of collecting and annotating real data.

When preparing Deep Learning algorithms able (for example) to automate robots within a factory or to provide assistance when driving a car, it is essential to be able to draw on a lot of visual data (such as images and videos) annotated in great detail. The act of highlighting specific regions in photos and videos is called segmentation: the precision with which these areas are defined can determine whether or not an algorithm is capable of interacting effectively with the surrounding environment. Usually, this is done by human annotators, which results not only in an additional cost but also in poorly accurate annotations. Another factor to consider is the privacy restrictions that may emerge when the required data fall into categories that current legislation considers “sensitive”.

All these difficulties can be overcome thanks to new methodologies capable of generating synthetic datasets, i.e., sets of data created by taking a cue from real data. This is possible thanks to techniques that learn what “distinguishes” the real data: among the most famous architectures in the literature are GANs (Generative Adversarial Networks), which shape new data through the training of two structures, namely a generator G and a discriminator D. The new synthetic data can be made even more realistic by combining hybridisation techniques. In addition, modern game engines can contribute to the creation of virtual scenarios, which are useful for effectively training robots within an automated production chain.

1 – Limitations of currently existing datasets

How is it possible to generate annotations that are as precise as possible when dealing with a huge amount of data (such as 100,000 images)? Although more than satisfactory datasets are already available for some tasks (e.g., ImageNet for image classification), in other situations one may have to deal with data that lack precise annotations (Fig. 1). This phenomenon is referred to in the literature as the curse of dataset annotation [1]. Other limits to be taken into account are:

  • Evaluation of the model: real data alone may not be sufficient to test a “deep” architecture in an attempt to reveal flaws in its design. Synthetic datasets can bring critical issues to light, allowing hypotheses to be formulated and tested in a “controlled environment” [2].
  • Alleviation of bias: some datasets may not adequately cover all the cases that a learning algorithm will be exposed to. The risk is to train a structure that is “deficient” with respect to some input data: in statistics, this type of problem is referred to as bias. Synthetic datasets can therefore help cover statistical flaws present in a real data collection [2].
  • Space optimisation: if the synthetic data improve the algorithm to which they are fed, new datasets can be produced “on-site”, thus optimising the use of memory [2].
  • Solving privacy issues: the production of synthetic data can help overcome obstacles related to the use of sensitive data, for example information regarding a person’s health or financial situation.

Figure 1: Two images from the MS COCO dataset. Note the imprecise and raw segmentation produced by human annotators [2]

2 – Examples of synthetic datasets

Synthetic datasets are currently being successfully applied in two fields: autonomous driving (AD) and industrial automation. For the record, two famous virtual datasets for AD are mentioned here: Virtual KITTI [3] and SYNTHIA [5].

Virtual KITTI was born with the idea of mimicking the acquisition of videos and images within an urban setting, exactly as in its real counterpart, KITTI [4]: instead of driving a real car equipped with cameras, 3D scanners and lasers, the same acquisition operations are performed in a virtual world within the Unity game engine [3]. An aspect not to be underestimated is that this type of acquisition makes it possible to accurately record 3D and 2D bounding boxes: as the name suggests, these are spatial coordinates that identify the region occupied by an object within a 3D space or a 2D area.

SYNTHIA places greater emphasis on the semantic annotation of objects in the virtual world: it offers 13 labelling classes, in order to have a large number of segmentation areas available [5]. Here too, the urban area was generated with Unity [5]. One of the most interesting features of SYNTHIA is that all the 3D objects have been made downloadable [5]: consequently, it is possible to generate new metropolitan areas in a completely random way, using each component as a single unit (Figure 2).

Figure 2: some images taken within virtual environments generated through SYNTHIA’s 3D assets
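To make the notion of a 2D bounding box concrete, here is a minimal sketch that derives one from a per-object segmentation mask, the kind of annotation a virtual environment can export exactly. The function name `bounding_box_2d` and the toy mask are hypothetical; neither Virtual KITTI nor SYNTHIA is claimed to use this code or this output format.

```python
def bounding_box_2d(mask):
    """Return (x_min, y_min, x_max, y_max) of the non-zero pixels, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for row in mask for c, v in enumerate(row) if v]
    return (min(cols), rows[0], max(cols), rows[-1])

# Hypothetical 4x5 mask: 1 marks pixels belonging to the object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = bounding_box_2d(mask)
print(box)  # (1, 1, 3, 2)
```

Because the mask comes from the renderer rather than from a human annotator, the resulting box is exact by construction.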

3 – From synthetic to real: Autoencoders

Autoencoders (AE) were originally introduced with the aim of reducing the dimensionality of the features of a dataset more efficiently than better-known analytical techniques, such as PCA [7]. Taking a cue from the Restricted Boltzmann Machine (RBM), in which each pair of layers acts as a feature detector so as to acquire the characteristics needed to outline the correct activation functions, the various layers are merged in a sort of “funnel” scheme: in this way, a stacked structure is created that is capable of carrying out the dimensionality reduction while maintaining the generalisation capabilities granted by the Boltzmann architecture. The new deep network is subject to a loss function, which must provide a measure of the quality of the reconstructed data compared to the original data passed as input.

Figure 3: the architecture of an AE, starting from the pairs of layers of an RBM [7]
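As a sketch of the learning rule involved, and assuming the layer-wise RBM pretraining described in [7], the weight update between a visible unit v_i and a hidden unit h_j can be written as follows. Here ε is the learning rate, and the angle brackets denote the correlation between the two units measured on the original data and on its reconstruction; the symbols are assumptions introduced for illustration.

```latex
\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)
```

The update vanishes when the reconstruction reproduces the statistics of the data, which is the sense in which the discrepancy drives learning.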

The learning rule depends on a learning rate and on the discrepancy between the original data and its reconstruction [7]. In summary, an autoencoder is defined by two sub-structures, an encoder and a decoder, which respectively reduce and restore the dimensionality of the data. It is also possible to have an autoencoder abstract the probability distribution that governs the dataset, so as to generate new data through the latent space derived from the encoder: this is achievable with a Variational Auto-Encoder (VAE), in which the encoder and decoder are defined in a probabilistic manner [8]. This time, the reconstruction discrepancy is formulated through a Kullback-Leibler divergence between the two distributions.
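One common way to write this objective is the variational lower bound. The form below assumes that the two distributions are the approximate posterior q_φ(z|x) produced by the encoder and the prior p_θ(z) over the latent space, with p_θ(x|z) as the decoder; this notation is an assumption and does not appear elsewhere in this text.

```latex
\mathcal{L}(\theta, \phi; x) =
  -D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)
  + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
```

The first term keeps the encoded distribution close to the prior, while the second rewards faithful reconstruction; training maximizes their sum.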

Figure 4: the probabilistic graph of a VAE [8]

With a deep structure it is possible to draw a sample from the learned statistical distribution while keeping the sampling step differentiable: this technique is called the reparametrization trick.
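A minimal sketch of the reparametrization trick, assuming a diagonal Gaussian latent distribution: instead of sampling z directly, noise ε is drawn from a standard normal and z is computed as μ + σ·ε, so that μ and σ remain ordinary inputs of a deterministic expression. The names `reparameterize`, `mu` and `log_var` are hypothetical stand-ins for the outputs of an encoder.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)            # fixed seed so the run is repeatable
mu = [1.0, -2.0]                  # hypothetical encoder means
log_var = [0.0, math.log(0.25)]   # log-variances, i.e. variances 1.0 and 0.25

samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
means = [sum(s[k] for s in samples) / len(samples) for k in range(2)]
print(means)  # empirical means, expected to be close to mu
```

Parameterising the spread through the log-variance is a common choice because it keeps σ positive without any constraint on the network output.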

Although autoencoders have reported remarkable results regarding data generation, their statistical inference is strongly based on the Monte-Carlo Methods, which can lead to a difficult analytical treatment of the problem to be addressed [9].

4 – Learning to create: the GAN

Generative Adversarial Networks (GAN) are now considered a standard for data generation. The logic behind this deep architecture is the following: two models, G and D, are trained on a portion of the dataset. The task of D (the “Discriminator”) is to estimate the probability that a sample comes from the training dataset rather than from G (the “Generator”). The purpose of this “two-player game” is for G to become skilled enough to deceive D [9]. Figure 5 illustrates all of this: the distribution learned by G and that of the training set overlap, so that D is no longer able to discern the origin of the data. In statistical terms, D is trained to maximize the probability of assigning the correct label both to training samples and to samples produced by G, while G is trained to minimize it: this competition is known as a minimax game [9].
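In the notation of [9], the minimax game can be written as follows, where p_data is the distribution of the real data and p_z is the noise distribution fed to G (both symbols are introduced here for illustration):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

D pushes both terms up by scoring real samples near 1 and generated samples near 0, while G pushes the second term down by producing samples that D scores as real.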

The main strength of GANs lies in the fact that all the problems arising from the use of Markov chains and the Monte Carlo method are bypassed: the gradient is obtained through the backpropagation algorithm (backprop), removing the need for statistical inference during learning [9]. Once training is finished, it is possible to generate data from scratch, using the distribution learned by G.
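To make the two-player objective concrete, the sketch below estimates the value of the game from discriminator outputs: the mean of log D(x) over real samples plus the mean of log(1 − D(G(z))) over generated ones, following [9]. The function name `gan_value` and the probabilities are hypothetical; no networks are trained here.

```python
import math

def gan_value(d_real, d_fake):
    """Empirical value of the game: mean log D(x) on real samples
    plus mean log(1 - D(G(z))) on generated samples."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator: the value is close to 0, its upper bound.
sharp = gan_value(d_real=[0.99, 0.98, 0.97], d_fake=[0.02, 0.01, 0.03])

# A fooled discriminator that outputs 0.5 everywhere: the value is -log(4).
fooled = gan_value(d_real=[0.5] * 3, d_fake=[0.5] * 3)

print(sharp, fooled)
```

The second case corresponds to the situation in which D can no longer discern the origin of the data: every output is 0.5 and the value settles at −log 4.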

Figure 5: Probability distributions learned by D (the blue dotted line) and G (the green line), and the one implied by the data (the black dotted line) [9]

Also worthy of mention are Adversarial Auto-Encoders (AAE), i.e., probabilistic autoencoders that use GANs to perform variational inference by matching the aggregated posterior distribution of the autoencoder’s encoding vector with an arbitrary prior distribution (Figure 6) [10]. The likelihood measured in the experiments suggests that AAEs could lead to further steps forward in the generation of synthetic data [10].

Figure 6: An AAE with the ability to generate new data. The labelling information is separated from the hidden vector: in this way, the characteristics of the input are learned from the structure[10]
Figure 7: Faces of people generated via an AAE. The original portraits are located on the last column on the right[10]

5 – Greater Realism: Parallel Imaging and PBR

It is important to underline that the mere use of algorithmic structures to model new data risks introducing new biases, due to repeated patterns that could compromise the variability of the new dataset [2]. This largely stems from the fact that using artificially generated data for real applications is equivalent to “moving” from a fictitious world to a real one: taking visual data as an example, it has been noted how important realistic rendering is to soften the shock resulting from the change of domain. This problem is known as domain adaptation [2]. To solve it, hybrid architectures have been proposed over the years, able to make the most of deep structures in conjunction with methods for manipulating real data: this is the case of Parallel Imaging, a framework that uses AI-guided approaches together with others based on real images and videos. These schemes do not exclude the use of game engines, which can add a further degree of realism [12]. In some cases, architectures have been proposed that use rendering techniques borrowed directly from the world of special effects [13], with impressive results to say the least (Figure 8).

Figure 8: 3D image rendered with the use of PBR (Physically Based Rendering) techniques [13]

6 – Examples of industrial application

There are many applications of synthetic data within factories and industries: they not only allow the entrepreneur to shorten the time needed to adopt automation-oriented solutions, but also bring gains in terms of training, both for humans and machines. Taking a cue from the architectures seen above, VAEs make it possible to train machines entirely within virtual environments [6]: it is then possible to use the exact same decoder to perform an effective domain translation from a simulated scenario to a real one, using two different encoders to handle the two cases. Once the various structures have been trained, a Convolutional Neural Network (CNN) can be used to find the spatial coordinates of any object with extreme precision (Figure 9) [6].

In the case of dataset generation, Falling Things (FAT) is certainly worth mentioning: developed by Nvidia, FAT allows optimal training of automata for industrial applications, providing a way of randomly creating highly realistic scenarios. This generation process is quite unique: once the backdrop has been outlined, the objects are literally dropped into the environment, so that the final arrangement is determined by the physics engine of the Unreal game engine [11]. Once this operation is completed, the user has a detailed annotation of the scenario available, useful for training deep structures.

Figure 9: Structure that uses VAEs to manage the virtual-to-real domain shift. A CNN will then take care of identifying the spatial coordinates of an object

7 – Conclusions

Neural networks, and more generally the so-called “deep” architectures, are changing the procedures normally used in some fields related to learning algorithms: instead of developing methods to solve specific problems, research is moving towards the adoption of “universal paradigms”, able to map an input space (that is, the data provided as input by a user) directly onto the corresponding output space (the data produced by the algorithm) [1]. Synthetic data will not only accelerate research in this direction but will also help to make up for the shortcomings that may be present in real datasets.

  • Pros
    • Qualitative improvement of datasets
    • Possibility of building data on the fly
    • Optimization of automation processes (just think that BMW is currently training its artificial co-drivers for 95% on virtual tracks)
    • Deep tools that can be inspected and analyzed thanks to “controlled” data, resulting in less of a black box
  • Cons
    • Risk of introducing new biases if the generation is not varied enough
    • Need for the generated data to be sufficiently realistic
    • Need to hybridize with reality to alleviate the domain shift

Speech analysis: feelings, emotions and dialect

Audio analysis is very important today, since many interactions now take place through voice interfaces. The main information extracted from audio is its semantics, that is, the meaning of the concepts expressed.

Some areas of application are the recognition of:
– feelings and/or emotions in the voice;
– the geographical origin of the speaker;
– medical information (indications of the symptoms of some pathologies).

Solving tasks related to audio, and in particular to voice, is very demanding, and the techniques used are often based on Deep Learning approaches that require high computing power.

Federico Schipani, Machine learning engineer


Voice interactions are increasingly widespread because they improve both the relationship between users and the relationship between users and software or machines.
Some examples are voice controls inside cars, on smartphones or in industrial applications and, more generally, all those situations in which you cannot, or do not want to, use your hands to type a text.
For this reason, audio analysis has become a very important topic in the contemporary world since, compared to a text file, the information contained within a spoken sentence is much more detailed.
In the world of sales it is increasingly difficult both to establish oneself and to appear attractive in the eyes of potential customers; for this reason, having information about conversations can play a decisive role both in the sales phase and in the analysis phase. During a negotiation it is in fact possible to collect a lot of information that may improve the sales experience, for example to quickly gauge the degree of interest of the parties, in order to decide whether to continue or to adjust an offer to make it more appropriate. At a strategic level, it is important for managers to know the progress of campaigns and to be able to analyse aggregate data, supported by appropriate evaluation metrics, which can be explored and used as a basis for future decisions.

With written information, such problems are usually solved with deep learning techniques; with audio, however, a major limitation is the large amount of data needed to train models of this type. Furthermore, most of the available models have been developed for the English language; applications for the Italian language are less common and often do not perform well enough, which means that it is often necessary to go through a data collection phase that can be both long and expensive.

Going into detail, we can identify four tasks in the analysis of audio: speech recognition, sentiment analysis, emotion analysis and dialect recognition.

Speech recognition

Speech recognition is the cornerstone of audio analysis: starting from a voice audio input, the aim is to transcribe its content into a text document, which makes it comparable to a more complex version of a sound classifier. Problems of this type are often modelled as phoneme classifiers, which may then be followed by other models that predict the text from the recognised phonemes.

Recognition of dialects

Without a doubt, the task whose nature is easiest to understand is the identification of a dialect. In this case, starting from an audio file, the aim is to identify the geographical origin of the speaker. Although a nation may share a single language, that language often varies between regions, giving rise to dialects. Italy offers a borderline case: the national language is Italian, but in Sardinia people also speak Sardinian, which is considered an autonomous language rather than a dialect.

Figure: Grouping of dialects in Italy [7]

In general, however, most dialects tend to resemble each other, so although this task can be treated in the same way as a language classification problem, it becomes much more complicated.

Sentiment and Emotion analysis

The expression sentiment analysis is often used interchangeably with emotion analysis. The substantial difference between the two lies in the number of classes. Sentiment analysis can be traced back to a binary classification problem, in which the input is assigned to a negative or a positive class. In some cases, however, it is preferable to add a further level of detail by introducing a neutral class.

Emotion analysis, on the other hand, is a more complex problem and more difficult to manage. Unlike sentiment analysis, there are not two or three classes with obvious differences, but a larger number of classes that may resemble each other. For example, you can have the happiness and excitement classes which, despite being two different types of emotions, have several characteristics in common, such as volume or tone of voice.

In some contexts, emotion analysis can be seen as an extension of sentiment analysis, leading to a more detailed and less simplified view of a conversation. For example, anger and boredom are two emotions that are clearly not positive; in a sentiment analysis they would be included in the negative class.
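The relationship between the two tasks can be sketched as a simple mapping from emotion classes to sentiment classes. This is only an illustration: the label names and the `to_sentiment` and `sentiment_counts` helpers are hypothetical and do not come from a specific dataset or library.

```python
# Hypothetical label set, chosen only to illustrate the idea.
EMOTION_TO_SENTIMENT = {
    "happiness": "positive",
    "excitement": "positive",
    "neutral": "neutral",
    "anger": "negative",
    "boredom": "negative",
    "sadness": "negative",
}


def to_sentiment(emotion):
    """Collapse a fine-grained emotion label into a coarse sentiment class."""
    return EMOTION_TO_SENTIMENT[emotion.lower()]


def sentiment_counts(emotions):
    """Count how many emotion labels fall into each sentiment class."""
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for emotion in emotions:
        counts[to_sentiment(emotion)] += 1
    return counts


# Emotions detected in three segments of a (made-up) conversation.
summary = sentiment_counts(["happiness", "boredom", "anger"])
```

Going in the opposite direction is not possible: a "negative" label does not say whether the speaker was angry or bored, which is why emotion analysis gives the more detailed view.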

Because of the nature of these problems, when only textual information is available, such as product reviews, sentiment analysis tends to be preferred over emotion analysis: without audio or images, the latter task is much more difficult.

Techniques for analysing audio

Audio analysis can be done using different techniques, each with a different type of pre-processing of the audio file. The most natural approach is of course to carry out speech recognition and then work on the text using Natural Language Processing tools. Other techniques, a little less intuitive, involve generating audio features which are then analysed using a classifier built with a Neural Network.

Classification through Neural Networks

An artificial neural network is nothing more than a mathematical model whose purpose is to approximate a certain function. Its structure is called layered, as it contains a first input layer, intermediate layers that are called hidden and a last output layer. Each hidden layer of the network is made up of a predefined number of processing units called neurons. Each neuron at layer n receives input signals from the neurons at layer n-1 and sends an output signal to the neurons at layer n+1.

Figure: Artificial Neural Network [2]

The most common networks, which we describe below, are called feedforward since their computation graph is directed and acyclic. We will now see how learning in a neural network works. First of all, the internal structure of a neuron must be described in more detail. As previously mentioned, the single computational unit has both an input and an output, but how is the data transformed inside it?

Figure: Artificial neuron [3]

We define, for each connection $i$ entering a neuron, an input $x_i$ and an associated weight $w_i$; the inputs are then combined linearly with the associated weights by carrying out the operation

$$z = \sum_i w_i x_i$$

The result of this operation is then given as input to a function called the activation function. Summarising and formalising the operations carried out within a single neuron more concisely, the output is:

$$y = \varphi\left(\sum_i w_i x_i\right)$$

where $\varphi$ is the activation function.

The activation function is the component of the model responsible for determining the output value from the linear combination of the neuron weights with the outputs of the previous layer. There are several activation functions: for the current purpose, it is sufficient to know that the (non-linear) function called ReLU is often used for the intermediate layers. For a classification problem, the last layer instead uses a function called Softmax, whose property is to output a probability distribution over the classes.
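The computation inside a single neuron, together with the two activation functions mentioned above, can be sketched in a few lines of plain Python. The function names and the numbers are illustrative assumptions, and no bias term is included because the description above does not mention one.

```python
import math


def relu(z):
    """ReLU activation: passes positive values through, clips negatives to 0."""
    return max(0.0, z)


def softmax(scores):
    """Turn a list of scores into a probability distribution over classes."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neuron_output(inputs, weights, activation=relu):
    """Weighted sum of the inputs followed by the activation function."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)


# A hidden neuron with two inputs: 0.5 * 1.0 + (-1.0) * 2.0 = -1.5, clipped by ReLU.
hidden = neuron_output([1.0, 2.0], [0.5, -1.0])

# Scores of a three-class output layer turned into probabilities.
probs = softmax([2.0, 1.0, 0.1])
```

The probabilities returned by `softmax` always sum to one, and the class with the highest score receives the highest probability.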

The last problem to be solved is the training of a model of this type. The training of a Neural Network takes place in two phases: the first, called forward propagation, in which the network is fed with data taken from a training set, and the second, called back propagation, in which the actual training takes place.

In forward propagation, a part of the dataset called the training set is used to calculate the prediction of the model and to compare it with reality. The comparison, calculated by means of a loss function, is an objective measure of how correct the prediction is. Training is the mathematical optimisation of the loss function: in other words, the model weights are modified in an attempt to reduce the loss and therefore ensure that the prediction of the model is as close as possible to reality. In the back propagation phase, the rule for calculating derivatives known as the chain rule is used: by calculating the derivative of the loss function with respect to the weights of each layer, we can vary the weights to try to decrease the value of the function being optimised.
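As a minimal sketch of these two phases, the example below trains a single linear neuron with gradient descent. The toy dataset, the mean squared error loss, the learning rate and the number of epochs are all assumptions chosen for illustration; a real network repeats the same chain-rule step layer by layer.

```python
def predict(weights, x):
    """Forward propagation for one linear neuron: weighted sum of the inputs."""
    return sum(w * xi for w, xi in zip(weights, x))


def mse_loss(weights, data):
    """Mean squared error between predictions and targets."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


def gradient(weights, data):
    """Derivative of the loss with respect to each weight (chain rule)."""
    grads = [0.0] * len(weights)
    for x, y in data:
        error = predict(weights, x) - y
        for i, xi in enumerate(x):
            # dLoss/dw_i = dLoss/dprediction * dprediction/dw_i
            grads[i] += 2.0 * error * xi / len(data)
    return grads


def train(data, n_weights, lr=0.05, epochs=500):
    """Repeat forward propagation and weight updates for a number of epochs."""
    weights = [0.0] * n_weights
    history = []
    for _ in range(epochs):
        history.append(mse_loss(weights, data))
        grads = gradient(weights, data)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, history


# Toy training set generated by the rule y = 2 * x1 - 3 * x2.
train_set = [((1.0, 0.0), 2.0), ((0.0, 1.0), -3.0),
             ((1.0, 1.0), -1.0), ((2.0, 1.0), 1.0)]
weights, history = train(train_set, n_weights=2)
```

The loss stored in `history` decreases from one epoch to the next, and the learned weights move towards the values 2 and -3 used to generate the data.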

Audio preprocessing

The audio files, before they are given as input to a classifier, must undergo a preprocessing phase where coefficients called features are extracted.

An audio file has a characteristic called the sampling rate. It is usually assumed to be 16 kHz, which means that one second of audio contains sixteen thousand samples. We will use this frequency as a standard in the following examples. Calculating a set of coefficients for each sample is a very onerous operation, as well as a useless one, since the variation between one sample and the next, or indeed the previous one, is minimal. We therefore consider batches of 400 consecutive samples, corresponding to 25 ms. For each batch of samples, the audio features are then calculated. In speech recognition tasks a particular type of feature called the Mel-Frequency Cepstral Coefficient (MFCC) is generally used; these are calculated with the following procedure:

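  1. Once a batch of samples has been taken, its Fourier Transform is calculated. Considering the $p$-th batch $x_p[n]$, with $n = 0, \dots, N-1$:

$$X_p(k) = \sum_{n=0}^{N-1} x_p[n] \, e^{-j 2 \pi k n / N}, \qquad k = 0, \dots, N-1$$

And given the sampling rate $f_s$, the frequency corresponding to the index $k$ is given by

$$f(k) = \frac{k \, f_s}{N}$$

Calculating it for all $P$ batches of samples we will then get

$$X = [\, |X_1|, |X_2|, \dots, |X_P| \,]$$

the Short Time Fourier Transform Matrix (STFT Matrix).

  2. The Mel Filterbank is calculated:
    1. Using the conversion formula, you pass from the frequency scale to the Mel scale:

$$m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

    2. Given the minimum and maximum frequencies $f_{min}$ and $f_{max}$, let

$$m_{min} = m(f_{min}), \qquad m_{max} = m(f_{max})$$

with $F$ the number of filters.

    3. The equally spaced Mel points are calculated

$$m_i = m_{min} + i \, \frac{m_{max} - m_{min}}{F + 1}, \qquad i = 0, \dots, F + 1$$

then, using the inverse formula, these values are converted back to frequencies, obtaining

$$f_i = 700 \left(10^{m_i / 2595} - 1\right)$$

    4. The Mel filter bank is given by this function, for $i = 1, \dots, F$:

$$M_i(k) = \begin{cases} 0 & f(k) < f_{i-1} \\ \dfrac{f(k) - f_{i-1}}{f_i - f_{i-1}} & f_{i-1} \le f(k) \le f_i \\ \dfrac{f_{i+1} - f(k)}{f_{i+1} - f_i} & f_i < f(k) \le f_{i+1} \\ 0 & f(k) > f_{i+1} \end{cases}$$

  3. We proceed with the calculation of MFCC:
    1. We pass to the logarithmic scale of the Mel, where $M$ is the matrix whose $i$-th row is the filter $M_i$:

$$L = \log(M X)$$

    2. The discrete cosine transform is applied to each column of $L$:

$$\Phi = \mathrm{DCT}(L)$$

therefore obtaining that the $p$-th column of the matrix $\Phi$ represents the MFCC coefficients of the corresponding batch of samples.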

From these coefficients it is then possible to generate spectrograms or matrices on which the subsequent analyses are carried out.
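The whole procedure can be sketched with NumPy as follows. This is a simplified illustration rather than a reference implementation: the 10 ms hop between batches, the Hamming window, the use of the power spectrum, the 26 filters spanning 0 Hz to half the sampling rate and the 13 output coefficients are common choices assumed here, not values stated in the text. The function names are hypothetical, and the result stores one batch per row rather than per column.

```python
import numpy as np


def hz_to_mel(f):
    """Conversion from the frequency scale (Hz) to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse conversion from the Mel scale back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def frame_signal(signal, frame_len=400, hop=160):
    """Split the signal into overlapping batches of frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]  # shape (n_frames, frame_len)


def mel_filterbank(n_filters, frame_len, sample_rate):
    """Triangular filters whose centres are equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    # Frequency of each Fourier bin: f(k) = k * fs / N
    bin_freqs = np.arange(frame_len // 2 + 1) * sample_rate / frame_len
    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(1, n_filters + 1):
        left, centre, right = hz_points[i - 1], hz_points[i], hz_points[i + 1]
        rising = (bin_freqs - left) / (centre - left)
        falling = (right - bin_freqs) / (right - centre)
        bank[i - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def dct_ii(x, n_coeffs):
    """Unnormalised DCT-II along the last axis, keeping n_coeffs outputs."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T


def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_filters=26, n_coeffs=13):
    """Compute MFCC features, one row per batch of samples."""
    frames = frame_signal(signal, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energies = power @ mel_filterbank(n_filters, frame_len, sample_rate).T
    log_mel = np.log(mel_energies + 1e-10)  # small offset avoids log(0)
    return dct_ii(log_mel, n_coeffs)


# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)
features = mfcc(tone)
```

With these settings, one second of audio yields 98 batches of 400 samples (25 ms each), so `features` has one row of 13 coefficients per batch.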

Audio analysis through recurrent networks

Nowadays two very effective types of recurrent network are Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM). For an accurate description of this type of structure, please refer to this previously published article.

The main disadvantage of this type of structure is the impossibility of varying the size of the input, as it always receives a vector of a fixed size.

However, there are recent works based on the attention mechanism that solve this problem and, at the same time, handle long-term dependencies more effectively. By way of example, it is possible to describe one of the simplest attention mechanisms as a vector. Let’s first define

$$H \in \mathbb{R}^{T \times F}$$

the input matrix of the form T x F, where T is the number of frames and F is the number of features per frame. The attention mechanism is a vector

$$w \in \mathbb{R}^{F}$$

such that, if multiplied by H, forming

$$\alpha = H w$$

produces an attention map with one value per frame. A possible use for this attention map is as a weighting factor for the output of an RNN, so that it tells the model where to pay more attention.
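A minimal NumPy sketch of this idea follows. The softmax normalisation of the scores is an assumption added here so that the weights sum to one, and the random values stand in for real features and for a learned attention vector.

```python
import numpy as np


def attention_map(H, w):
    """One attention weight per frame, from the product of H and the vector w."""
    scores = H @ w                   # shape (T,)
    scores = scores - scores.max()   # numerical stability before exponentiation
    weights = np.exp(scores)
    return weights / weights.sum()   # softmax: assumed normalisation


def attend(H, w):
    """Weighted average of the frames, using the attention map as weights."""
    alpha = attention_map(H, w)
    return alpha @ H                 # shape (F,), independent of T


rng = np.random.default_rng(0)
T, F = 50, 13
H = rng.normal(size=(T, F))   # stand-in for T frames of F features each
w = rng.normal(size=F)        # stand-in for a learned attention vector

alpha = attention_map(H, w)
context = attend(H, w)
```

Because the weighted average has length F whatever the number of frames T, the same vector `w` can be applied to recordings of different durations.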

Audio analysis via CNN

From the audio features extracted during preprocessing it is possible to generate a spectrogram, a three-dimensional graphic representation in the form of an image that shows how the frequency spectrum varies over time. Time is on the horizontal axis and frequency on the vertical axis. The third dimension is represented by the colour scale, which shows the amplitude of each frequency at each instant.
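As a rough sketch, a spectrogram of this kind can be built with NumPy as a two-dimensional array whose rows are frequencies and whose columns are instants of time. The batch length, hop, Hann window and decibel scale are assumptions chosen for illustration, and `log_spectrogram` is a hypothetical helper name.

```python
import numpy as np


def log_spectrogram(signal, sample_rate=16000, frame_len=400, hop=160):
    """Return a (frequency x time) image in decibels with its two axes."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    spec_db = 10.0 * np.log10(power + 1e-10)  # the colour dimension
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)  # vertical axis, Hz
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate  # seconds
    return spec_db.T, freqs, times


# A one-second test tone whose frequency rises from 500 Hz to 3500 Hz.
t = np.arange(16000) / 16000.0
chirp = np.sin(2 * np.pi * (500.0 * t + 1500.0 * t ** 2))
image, freqs, times = log_spectrogram(chirp)
```

For this rising tone, the brightest row of the image moves upwards from the first column to the last. The array can be displayed with any image plotting tool or passed directly to an image classifier.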

Figure: Examples of spectrograms for each emotion class; panels include (d) Happiness, (e) Sadness and (f) Neutral

Since spectrograms are to all intents and purposes images, the task becomes an image classification problem, and it is therefore possible to classify them with a CNN or with models based on it. For a quick and detailed explanation of what a Convolutional Neural Network is, please refer to this previously published article.

Nowadays there are many Deep Learning models that are suitable for Image Classification; one of the most famous is VGG, in all its variants.

Furthermore, thanks to pre-training carried out on other datasets, it is possible to achieve satisfactory results with relatively little data at our disposal. In fact, the DL libraries that implement these models often let you choose whether or not to download a set of weights from which to start the training. In this case, we speak of Transfer Learning or Fine Tuning.


Audio analysis is spreading in many fields of application and makes it possible, through voice interaction, to control software and machines, resolving many situations in which the user is unable to interact physically with the device, or is hindered in doing so. Through deep learning tools, it is possible to extract a lot of information from voice interaction. For example, information about the emotion or the origin of the interlocutor improves the interaction and the user experience, and also leads to the collection of new data from which much more exhaustive and precise performance indicators can be generated.

The use of these tools is having a very positive impact in the field of telephone assistance, as it allows for better call management, accurate real-time analysis and the planning of more effective strategies. Furthermore, it is possible to use question answering models to help provide answers much faster than with classic search tools.

Finally, in the field of preventative medicine, recent research has investigated the possibility of using audio analysis, especially of cough recordings, to help doctors identify subjects that are positive for COVID-19. The studies [8], although preliminary, seem very promising and would allow for fast, non-invasive and large-scale screening with simple audio collection tools such as smartphones.

In summary, the strengths and weaknesses of using audio analysis techniques are:

  • Pros
    • Acquisition of new data and creation of more effective KPIs
    • Workflow optimisation
    • Improvement of human-machine interactions
  • Cons
    • Requires a lot of annotated data
    • Difficulty in finding Italian datasets
    • Privacy and data content management


[1] Choi, Keunwoo, et al. “A comparison of audio signal preprocessing methods for deep neural networks on music tagging.” 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018.



[4] Muaidi, Hasan, et al. “Arabic Audio News Retrieval System Using Dependent Speaker Mode, Mel Frequency Cepstral Coefficient and Dynamic Time Warping Techniques.” Research Journal of Applied Sciences, Engineering and Technology 7.24 (2014): 5082-5097.

[5] Kopparapu, Sunil Kumar, and M. Laxminarayana. “Choice of Mel filter bank in computing MFCC of a resampled speech.” 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010). IEEE, 2010.

[6] Ramet, Gaetan, et al. “Context-aware attention mechanism for speech emotion recognition.” 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018.


[8] Laguarta et al. “COVID-19 Artificial Intelligence Diagnosis using only Cough Recordings”, IEEE Open Journal of Engineering in Medicine and Biology, IEEE, 2020.