Deep Learning and synthetic datasets

Deepclever AI Articles, Sep 13, 2021

How to deal with the lack of data

In applications such as Industry 4.0, the available data does not always meet the needs of complex algorithmic structures, nor does it always come with correct labelling. This is why synthetic data is used to make up for the shortcomings that may be present in real datasets.

For a number of fundamental reasons, interest in procedures capable of generating synthetic data is steadily growing in the world of Industry 4.0: in addition to the well-known difficulties associated with respecting privacy, complex algorithmic structures (such as neural networks) need to procure a large amount of data. Not only that: to obtain realistic results, the available data must show a decisive characteristic, namely accurate annotation (also called labelling). Unfortunately, for a number of reasons, these requirements cannot always be met.

Over the last few years, multiple methods have been developed that can effectively generate synthetic datasets, adding a further procedural level within a learning pipeline. This makes it possible to collect very plausible data while cutting the costs that derive from collecting and annotating real data.

In preparing Deep Learning algorithms able, for example, to automate robots within a factory or to provide driving assistance, it is essential to be able to draw on a large amount of visual data (such as images and videos) described in great detail. The act of highlighting specific regions shown in photos and videos is called segmentation: the precision in defining these areas can make the difference between having and not having an algorithm capable of interacting effectively with the surrounding environment. Usually this work is done by human annotators, which brings not only an additional cost but also the risk of poor annotation accuracy. Another factor to consider is the privacy restrictions that may emerge whenever categories of data considered “sensitive” by current legislation are needed.

All these difficulties can be overcome thanks to new methodologies capable of generating synthetic datasets, that is, sets of data created by taking a cue from real data. This is possible thanks to techniques that learn what “distinguishes” the real data: among the most famous architectures in the literature, in this sense, are GANs (Generative Adversarial Networks), able to shape new data through the joint training of two structures, namely a generator G and a discriminator D. The new synthetic data can be made even more realistic by combining hybridisation techniques. In addition, modern game engines can contribute to the creation of virtual scenarios, useful for effectively training robots within an automated production chain.

1 – Limitations of currently existing datasets

How is it possible to generate annotations that are as precise as possible when dealing with a huge amount of data (say, 100,000 images)? Although for some tasks (e.g. image classification) more than satisfactory datasets are already available (ImageNet), in other situations one may have to deal with data that lack precise annotations (Fig. 1). This phenomenon is referred to in the literature as the curse of dataset annotation [1]. Other limits to be taken into account are:

  • Evaluation of the model: the use of real data alone may not be sufficient when testing a “deep” architecture in an attempt to reveal flaws in its design. Synthetic datasets can put such critical issues in the spotlight, allowing hypotheses to be tested within a “controlled environment” [2].
  • Alleviation of bias: some datasets may not generalise well to all the cases that a learning algorithm could be exposed to. The risk is to train a structure that is “deficient” with respect to some input data: in statistics, this type of problem is generally referred to as bias. Synthetic datasets can therefore help cover statistical gaps present within a real data collection [2].
Figure 1: Two images from the MS COCO dataset. Note the imprecise and raw segmentation produced by human annotators [2]
  • Space optimisation: if the generation of synthetic data improves the algorithm it is fed to, new datasets can be produced “on site”, thus optimising the use of memory (a minimal sketch follows after this list) [2].
  • Solving privacy issues: the production of synthetic data can help overcome obstacles related to the use of sensitive data, for example when dealing with information about a person’s health or financial situation.
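
To make the “controlled environment” idea mentioned above concrete, here is a minimal, purely illustrative sketch in Python (NumPy only): simple images are generated on the fly together with exact segmentation masks, so the annotation cost disappears and the statistics of the data are entirely under the experimenter’s control. All names and parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def synth_sample(size: int = 64):
    """One synthetic image plus its pixel-perfect segmentation mask."""
    image = rng.normal(0.0, 0.05, (size, size))    # background noise
    mask = np.zeros((size, size), dtype=bool)      # exact annotation, no human labelling needed
    side = int(rng.integers(8, 20))                # the object size is fully under our control...
    y, x = rng.integers(0, size - side, size=2)    # ...and so is its position (no hidden bias)
    image[y:y + side, x:x + side] += 1.0           # draw a bright square "object"
    mask[y:y + side, x:x + side] = True
    return image, mask

# Samples are produced on the fly during training, so nothing has to be stored on disk.
batch = [synth_sample() for _ in range(32)]
```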

2 – Examples of synthetic datasets

Synthetic datasets are currently being applied successfully in two fields: autonomous driving (AD) and industrial automation. For the record, two famous virtual datasets for autonomous driving are worth mentioning: Virtual KITTI [3] and SYNTHIA [5].

Virtual KITTI was born with the idea of mimicking the acquisition of videos and images within an urban setting, exactly as happened in its real counterpart, KITTI [4]: instead of driving a real car equipped with cameras, 3D scanners and lasers, the same acquisition operations are performed in a virtual world built with the Unity game engine [3]. An aspect not to be underestimated is that this type of acquisition makes it possible to record 3D and 2D bounding boxes accurately: as the name suggests, these are spatial coordinates that identify the area occupied by an object within a 3D space or a 2-dimensional image. SYNTHIA places greater emphasis on the semantic annotation of objects in the virtual world: it offers 13 labelling classes, so as to have a large number of segmentation areas available [5]. Here too, the urban area was generated using Unity [5]. One of the most interesting features of SYNTHIA lies in the fact that all its 3D objects have been made downloadable [5]: consequently, it is possible to generate new metropolitan areas in a completely random way, using each component as a single unit (Figure 2).

Figure 2: some images taken within virtual environments generated through SYNTHIA’s 3D assets
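
As a toy illustration of the 2D bounding boxes mentioned above: when the segmentation mask comes from a rendered scene, it is pixel-perfect, so the box can be computed exactly instead of being drawn by hand. A minimal sketch, assuming a NumPy boolean mask (names are illustrative):

```python
import numpy as np

def bbox_from_mask(mask: np.ndarray):
    """Return (x_min, y_min, x_max, y_max) of the object pixels in a boolean mask."""
    ys, xs = np.nonzero(mask)                 # row and column indices of the object pixels
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy mask: in a synthetic dataset the mask is known exactly, so the box is exact too.
mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 3:7] = True                         # a small rectangular "object"
print(bbox_from_mask(mask))                   # (3, 2, 6, 4)
```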

3 – From synthetic to real: Autoencoders

Auto-encoders (AE) were originally introduced with the aim of reducing the dimensionality of a dataset’s features more efficiently than better-known analytical techniques such as PCA [7]. Taking a cue from what happens in a Restricted Boltzmann Machine (RBM), in which each pair of layers acts as a feature detector so as to acquire the characteristics needed to outline the correct activation functions, the various layers are merged in a sort of “funnel” scheme: in this way a stacked structure is created that can carry out the correct dimensionality reduction while maintaining the generalisation capabilities granted by the Boltzmann architecture. The new deep network is subject to a loss function, which must provide a qualitative measure of the reconstructed data compared with the original data passed as input.
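
A minimal sketch of the encoder/decoder “funnel” described above, written in PyTorch; the layer sizes are illustrative and are not those used in [7]:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Funnel-shaped encoder followed by a mirrored decoder."""
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(            # progressively reduces dimensionality
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(            # mirrors the encoder to rebuild the input
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)                          # a batch of flattened images
loss = nn.functional.mse_loss(model(x), x)       # loss: reconstructed data vs. original input
loss.backward()
```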

Figure 3: the architecture of an AE, starting from the pairs of layers of an RBM [7]
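
The update rule referred to in the next paragraph is most likely the contrastive-divergence weight update used for the RBM pre-training in [7]; in that form it reads:

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$$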

The formula shows the learning rate and the discrepancy between the original data and the reconstructed data [7]. In summary, an autoencoder is defined by two sub-structures, an encoder and a decoder, whose functions are respectively to reduce and to restore the dimensionality of the data. It is also possible to make an auto-encoder abstract the probabilistic distribution that governs the dataset, so as to generate new data through the latent space derived from the encoder: this is achievable with the Variational Auto-Encoder (VAE), in which the encoder and decoder are described in a probabilistic manner [8]. This time, the reconstruction discrepancy is formulated through a Kullback-Leibler divergence between the two distributions involved.
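
In the standard VAE formulation [8], these are the approximate posterior $q_\phi(z \mid x)$ produced by the encoder and the prior $p_\theta(z)$, and the objective to be maximised can be written as:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right)$$

where the first term rewards faithful reconstructions and the second keeps the approximate posterior close to the prior.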

Figure 4: the probabilistic graph of a VAE [8]

Using a deep structure, it is possible to create a sample starting from the statistical distribution: this technique is called the reparametrization trick.
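
A minimal PyTorch sketch of the trick: instead of sampling z directly from N(mu, sigma^2), a noise term is sampled separately and rescaled, so that gradients can flow through the parameters produced by the encoder (variable names are illustrative):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(mu, sigma^2) in a differentiable way."""
    std = torch.exp(0.5 * log_var)      # sigma, from the log-variance produced by the encoder
    eps = torch.randn_like(std)         # noise sampled outside the computational graph
    return mu + eps * std               # gradients flow through mu and std, not through eps

mu = torch.zeros(4, 8, requires_grad=True)
log_var = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, log_var)         # latent sample usable by the decoder
z.sum().backward()                      # backpropagation works thanks to the trick
```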

Although autoencoders have shown remarkable results in data generation, their statistical inference relies heavily on Monte Carlo methods, which can make the analytical treatment of the problem difficult [9].

4 – Learning to create: the GAN

Generative Adversarial Networks (GANs) are now considered a standard for data generation. The logic that controls this deep architecture is the following: two models, G and D, are trained on a portion of the dataset. The task of D (the “Discriminator”) is to estimate the probability that a sample comes from the training dataset rather than from G (the “Generator”). The purpose of this “two-player game” is to make G skilled enough to deceive D [9]. Figure 5 illustrates this: the distribution learned by G and that of the training set come to overlap, until D is no longer able to discern the origin of the data. In statistical terms, D is trained so as to maximise the probability of correctly telling whether a sample comes from the dataset, while G must minimise it: this competition is known as a minimax game.
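
In the original formulation [9], the value function of this game reads:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$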

The main strength of GANs lies in the fact that all the problems arising from the use of Markov chains and Monte Carlo methods are bypassed: the gradient is obtained through backpropagation, removing the need for statistical inference [9]. Once the training of this structure is finished, it is possible to generate data from scratch, using the probability distribution learned by G.

Figure 5: Probability distributions learned by D (the blue dotted line) and G (the green line), together with the one implied by the data (the black dotted line) [9]
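
A compressed PyTorch sketch of the alternating training loop just described: both networks are updated purely through backpropagation, with no Markov chains or Monte Carlo sampling involved. All module sizes and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(32, data_dim)                      # stand-in for a batch of real data
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

# Discriminator step: push D(real) towards 1 and D(G(z)) towards 0.
z = torch.randn(32, latent_dim)
d_loss = bce(D(real), ones) + bce(D(G(z).detach()), zeros)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: push D(G(z)) towards 1, i.e. learn to fool the discriminator.
z = torch.randn(32, latent_dim)
g_loss = bce(D(G(z)), ones)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```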

Worthy of mention are Adversarial Auto-Encoders (AAE), that is, probabilistic auto-encoders that use GANs to perform statistical inference on the variability of the data, ensuring that the posterior distribution of the auto-encoder’s encoding vector matches an arbitrary prior distribution (Figure 6) [10]. The likelihood measured on the outputs of the experiments suggests that AAEs could lead to further steps forward in the generation of synthetic data [10].

Figure 6: An AAE with the ability to generate new data. The labelling information is separated from the hidden vector: in this way, the characteristics of the input are learned by the structure [10]
Figure 7: Faces of people generated via an AAE. The original portraits are in the last column on the right [10]

5 – Greater Realism: Parallel Imaging and PBR

It is important to underline that the mere use of algorithmic structures to model new data risks introducing new biases, due to repeated patterns that could compromise the variability of the new dataset [2]. This largely stems from the fact that using artificially generated data for real applications is equivalent to “moving” from a fictitious world to a real one: taking visual data as an example, it has been observed how important realistic rendering is in softening the shock caused by this translation of domain. The problem is known as domain adaptation [2]. To address it, hybrid architectures have been proposed over the years, able to make the most of deep structures in conjunction with methods that manipulate real data: this is the case with Parallel Imaging, a framework that uses AI-guided approaches together with others based on real images and videos. These schemes do not exclude the use of game engines, which can add a further degree of realism [12]. In some cases, architectures have been proposed that adopt rendering techniques borrowed directly from the world of special effects [13], with results that are impressive to say the least (Figure 8).

Figure 8: 3D image rendered using Physically Based Rendering (PBR) techniques [13]
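
One of the simplest forms of the hybridisation mentioned above is to mix real and synthetic samples within the same training batch; a minimal sketch (the ratio and all names are illustrative):

```python
import torch

def hybrid_batch(real: torch.Tensor, synthetic: torch.Tensor, real_fraction: float = 0.5):
    """Compose a batch mixing real and synthetic samples to soften the domain shift."""
    n_real = int(real_fraction * len(real))
    idx_r = torch.randperm(len(real))[:n_real]
    idx_s = torch.randperm(len(synthetic))[:len(real) - n_real]
    return torch.cat([real[idx_r], synthetic[idx_s]])

real = torch.rand(64, 3, 32, 32)          # e.g. photographs
synthetic = torch.rand(64, 3, 32, 32)     # e.g. game-engine renderings
batch = hybrid_batch(real, synthetic)     # shape: (64, 3, 32, 32)
```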

6 – Examples of industrial application

There are many examples of synthetic data being applied within factories and industries: this not only allows the entrepreneur to accelerate the adoption of solutions oriented towards automation, but also brings gains in terms of training, both for humans and for machines. Taking a cue from the architectures seen above, the use of VAEs makes it possible to train machines entirely within virtual environments [6]: the very same decoder can then be used to perform an effective domain translation from a simulated scenario to a real one, by using two different encoders to handle the two cases. Once the various structures have been trained, a Convolutional Neural Network (CNN) can be used to find the spatial coordinates of any object with extreme precision (Figure 9) [6].
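
A structural sketch of the idea just described, loosely interpreting [6]: two domain-specific encoders share a single decoder, so the latent space becomes the bridge between the simulated and the real domain. Sizes and names are illustrative:

```python
import torch
import torch.nn as nn

def make_encoder(in_dim: int = 256, latent_dim: int = 32) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))

encoder_sim = make_encoder()     # trained on simulated frames
encoder_real = make_encoder()    # trained on real frames
decoder = nn.Sequential(         # shared: maps both domains to the same reconstruction space
    nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 256),
)

x_sim = torch.rand(8, 256)       # stand-in for flattened simulated images
x_real = torch.rand(8, 256)      # stand-in for flattened real images

# Both domains pass through the same decoder, so their shared latent space
# acts as the bridge used for the simulated-to-real translation.
recon_sim = decoder(encoder_sim(x_sim))
recon_real = decoder(encoder_real(x_real))
```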

In the field of dataset generation, Falling Things (FAT) is certainly worth mentioning: developed by NVIDIA, FAT enables the effective training of robots for industrial applications by providing a way to randomly create highly realistic scenarios. The generation process is quite distinctive: once the backdrop has been outlined, the objects are literally thrown into the environment, so that their final arrangement is determined by the physics engine of Unreal Engine [11]. Once this operation is completed, the user has a detailed annotation of the scenario available, useful for training deep structures.

Figure 9: Structure that uses VAEs to manage the virtual-to-real domain shift. A CNN will then take care of identifying the spatial coordinates of an object

7 – Conclusions

Neural networks, and more generally the so-called “deep” architectures, are changing the procedures normally used in several fields related to learning algorithms: instead of developing methods to solve specific problems, research is moving towards the adoption of “universal paradigms”, able to map an input space (that is, the data given as input by a user) directly into the corresponding output space (the data produced by the algorithm) [1]. Synthetic data will not only accelerate research in this direction, but will also help to make up for the shortcomings that may be present in real datasets.

  • Pros
    • Qualitative improvement of datasets
    • Possibility of building data on the fly
    • Optimisation of automation processes (consider that BMW currently trains its artificial co-drivers 95% of the time on virtual tracks)
    • Deep tools that can be inspected and analysed thanks to “controlled” data, making them less of a black box
  • Cons
    • Risk of introducing new biases if the generation is not varied enough
    • The generated data must be sufficiently realistic to be useful
    • Need to hybridise with real data to alleviate the domain shift
