Chatbot and Vocal Assistant: New frontiers of communication

Sep 13, 2021 | Deepclever Ai Articles

What are chatbots? How can they improve customer service? What lies behind the functioning of a chatbot? In this article we try to answer these questions and give a practical example of how automated dialogue software can be integrated into a real context.

A chatbot is software designed to hold a conversation with a human being that is as smooth and natural as possible. Deep Learning has changed this field as it has many others, transforming the way automated conversational agents are built. Earlier systems were simple programs driven by fixed rules, able to answer only a limited number of questions phrased in particular ways; today’s chatbots can interpret human language in an almost completely natural and transparent way. Many of them also use Text-To-Speech or Speech-To-Text modules to make the interaction between human and machine simpler and more intuitive, and some add CNN-based modules [1] for recognizing and describing images.

Traditionally, chatbots are divided into two categories: closed domain and open domain. Closed-domain chatbots answer questions limited to a restricted domain; examples include chatbots that collect orders or reservations, or that power assistance systems. These agents tend to be robust because they operate in only a few fields and cases, which keeps the handling of unexpected inputs simple. Open-domain agents instead try to mimic a conversation with a human being as faithfully as possible. At the time of writing, the three best-performing models are Google’s Meena, Facebook’s Blender, and OpenAI’s GPT-3.

Recently, some works have tried to unite these two worlds by creating chatbots capable of both holding a conversation and answering specific questions about the environment for which they were designed (Generative Encoder-Decoder Models for Task-Oriented Spoken Dialog Systems with Chatting Capability, 2017).
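As a point of reference for the fixed-rule systems mentioned above, here is a minimal sketch of a rule-based, closed-domain bot. The rules, the canned replies and the `reply` function are hypothetical and chosen only for illustration; they are not taken from any real product.

```python
import re

# Hypothetical rules for a restaurant reservation bot: (pattern, canned reply).
RULES = [
    (re.compile(r"\b(book|reserve)\b.*\btable\b", re.I),
     "For how many people, and at what time?"),
    (re.compile(r"\b(opening|open)\b.*\b(hours|time)\b", re.I),
     "We are open from 12:00 to 23:00."),
    (re.compile(r"\bcancel\b", re.I),
     "Please give me your reservation code."),
]
FALLBACK = "Sorry, I did not understand. Could you rephrase?"


def reply(message: str) -> str:
    """Return the reply of the first rule whose pattern matches the message."""
    for pattern, answer in RULES:
        if pattern.search(message):
            return answer
    # No rule matched: the bot cannot handle this phrasing.
    return FALLBACK
```

A request such as "I would like to book a table for tonight" matches the first rule, while "Is there a table free tonight?" falls through to the fallback even though the intent is the same. This rigidity is what learned language models try to overcome.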

A possible practical application

Figure 1: AVA operation diagram

The classic example of a chatbot is one built for customer service. A recent work (A Financial Service Chatbot based on Deep Bidirectional Transformers, 2020) introduces “AVA: A Vanguard Assistant”, a chatbot created specifically for assistance in the financial sector. Because it is built for customer service, it is a task-oriented, closed-domain chatbot, and it is used within the call centre. The operating scheme is simple: during a phone call, the operator interacts with the customer and answers the questions for as long as he knows the answers. Meanwhile AVA listens and, when the operator is in difficulty, queries an internal repository of documents to provide quick assistance. Figure 1 shows the operating diagram.
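The "query an internal repository" step can be pictured with a deliberately simple sketch. It ranks documents by how many words they share with the customer's question. This is not AVA's method (the paper relies on BERT); the repository contents, `tokens` and `best_document` are hypothetical names used only to illustrate the idea of retrieving a document for the operator.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical internal repository: document id -> short description.
REPOSITORY: Dict[str, str] = {
    "ira-rollover": "How to roll over a retirement account into an IRA",
    "wire-transfer": "Steps and fees for sending a wire transfer from a brokerage account",
    "password-reset": "How a customer can reset the online password",
}


def tokens(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def best_document(question: str, repository: Dict[str, str]) -> Optional[str]:
    """Return the id of the document sharing the most words with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(text)), doc_id) for doc_id, text in repository.items()]
    score, doc_id = max(scored)
    # With no word in common there is nothing useful to suggest.
    return doc_id if score > 0 else None
```

A word-overlap score like this fails as soon as the customer uses different wording from the documents, which is one reason a contextual model such as BERT is attractive for this step.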

The advantage is clear: traditionally, when the operator does not know the answer to a question, he must consult an expert while putting the customer on hold. A chatbot is generally faster than a human being at providing an answer and is always available to the user. All of this takes customer service to another level, making it much faster, cheaper and smoother. In other words, Deep Learning makes more efficient a process that had remained unchanged for years. The question arises at this point: how does AVA work? The core of this work is a deep learning model released by Google, called BERT, which is explored in the next paragraph. For more details on how BERT was used and on the type of questions such a chatbot can answer, refer to the complete article (A Financial Service Chatbot based on Deep Bidirectional Transformers, 2020) and to the cyberyu / ava repository on GitHub.


BERT, an acronym of Bidirectional Encoder Representations from Transformers, is a language representation model designed by Google researchers. Its power can be summarized in two key points: the first concerns training, the second the self-attention mechanism. BERT training proceeds in two phases: a pre-training phase carried out in an unsupervised manner and a fine-tuning phase that adapts the model to the specific task. Briefly, pre-training is based on a principle called “Masked Language Model”, in which a token (word) of the sentence is masked and the model must predict it. The second pivotal point concerns the way BERT was conceived: unlike many other models such as GPT [2] or XLMNet, it operates bidirectionally on the input, using the context both to the left and to the right of each word, so that it gives importance to what precedes and what follows it. These aspects, mentioned only briefly for now, are dealt with in more detail below. BERT therefore turns out to be a very general model, adaptable to multiple situations with minimal effort and capable of strong performance in tasks such as Question Answering, sentence classification and so on.

Structure of BERT

The internal structure of BERT is made up of several Transformer layers. The Transformer is an encoder-decoder architecture that uses the self-attention mechanism to drastically improve performance. The purpose of an encoder is to take a sequence of symbols in a discrete space and represent them in a continuous space. Conversely, the purpose of a decoder is to transform this continuous representation back into symbols in a discrete space. As previously said, BERT is a model for representing language, so only the encoder part of the Transformer is used and the decoders are left out. Considering only vanilla BERT, there are two variants, BERT Base and BERT Large, which differ in size, most visibly in the number of encoder layers.

Let’s now look at the structure of a single encoder layer. BERT’s input is one or more sentences. A sentence is made up of words, which are split into tokens (BERT in particular uses WordPiece (Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016)) and transformed, through an embedding step, into vectors of fixed dimension (x_i). These vectors enter the first encoder, which consists of two parts. The first is the Self Attention mechanism, which returns as many vectors (z_i); these are fed to a Feed-Forward Neural Network, whose output is again as many vectors (r_i). This output becomes the input of the next encoder, and so on up to the last one. In total there are 12 encoders in BERT Base and 24 in BERT Large.
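Before words can be mapped to the vectors x_i, they have to be split into units that exist in the model's vocabulary. The sketch below shows a greedy longest-match-first split in the WordPiece style, where continuation pieces carry a `##` prefix. The tiny `VOCAB` set and the `wordpiece` function name are assumptions made for illustration; real vocabularies contain tens of thousands of entries.

```python
from typing import List, Set

# Hypothetical miniature vocabulary; "##" marks a piece that continues a word.
VOCAB: Set[str] = {"play", "##ing", "##ed", "chat", "##bot", "##s"}


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into sub-word pieces, always taking the longest match."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is known: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces
```

With this vocabulary, `wordpiece("chatbots", VOCAB)` yields `["chat", "##bot", "##s"]`. Each piece would then be looked up in an embedding table to obtain its fixed-dimension vector.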

Figure 2: Diagram of the encoder layer used in BERT

The distinctive element, however, is the Self Attention component. Through the definition of three matrices, called Query, Key and Value, whose values are adjusted during the training phase, it becomes possible to contextualize and better understand a sentence. For more details, see the full paper (Attention Is All You Need, 2017). At this point the question arises: if there is already a mechanism upstream that transforms words into vectors, why is all this processing necessary? The answer is easier than it seems. Let us take an edge case from Italian as an example. The Italian versions of the sentences “The peach fell from the tree” and “I went fishing for trout” have one word in common, “pesca”, which means “peach” in the first and “fishing” in the second. A system that considers only the single word when turning it into a machine-readable representation does not grasp this essential difference; BERT does, because it takes into account the context around a word.
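The role of the Query, Key and Value matrices can be made concrete with a small sketch of scaled dot-product self-attention as described in Attention Is All You Need. It is written in plain Python for readability and uses a single attention head; the helper names and the toy input are assumptions, and in a real model the three weight matrices are learned rather than fixed.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Turn a row of scores into positive weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Return one output vector per input vector, plus the attention weights."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)               # scores[i][j] = q_i . k_j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: three token vectors of dimension 2, with identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
z, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` says how much one token looks at every token in the sentence, so each output vector in `z` is a context-dependent mixture. This is how the same word can end up with different representations in different sentences.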

BERT training

BERT training proceeds in two phases: pre-training and fine-tuning. The first phase is a bit like making the model learn the language on which it will operate. Unsupervised training is carried out on as much text as possible and relies on two tasks. The first is the “Masked Language Model”, in which, with a certain probability, one or more words of a sentence are masked and the model must predict them. The second is “Next Sentence Prediction”, in which the model is given two sentences and performs a binary classification on them. The two classes are “isNext” and “notNext”, and indicate whether or not the second sentence is the continuation of the first. The fine-tuning phase, on the other hand, is much simpler and serves to adapt the model to the specific task it will perform. For example, to create a question answering system, it is sufficient to fine-tune BERT on question-answer pairs. For more details on BERT’s training phase, see the full article (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018).
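The two pre-training tasks can be sketched as data-preparation steps. The code below is a simplified illustration, not BERT's actual pipeline: it only replaces selected tokens with `[MASK]` (the paper also keeps or randomly swaps a fraction of them), and the function names `mask_tokens` and `make_nsp_pair` are hypothetical. The 15% masking rate and the 50/50 split between the two classes follow the BERT paper.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: List[str], prob: float = 0.15,
                rng: Optional[random.Random] = None
                ) -> Tuple[List[str], List[Optional[str]]]:
    """Mask each token with probability `prob`; labels keep the hidden words."""
    if rng is None:
        rng = random.Random()
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < prob:
            masked.append(MASK)
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # nothing to predict at this position
    return masked, labels


def make_nsp_pair(sentences: List[str], index: int,
                  rng: Optional[random.Random] = None) -> Tuple[str, str, str]:
    """Build one Next Sentence Prediction example starting at `index`.

    Assumes at least three sentences and index < len(sentences) - 1.
    """
    if rng is None:
        rng = random.Random()
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], "isNext"
    # Otherwise pick a sentence that is neither the first one nor its successor.
    others = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return first, rng.choice(others), "notNext"
```

Both functions need only raw text, which is why pre-training can use large unlabelled corpora: the labels are produced from the text itself.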


Deep Learning can substantially change customer support and make the conversation between a telephone operator and a client much easier. In particular, the activity can become more efficient in terms of:

  • Costs: the simplest interactions can be delegated to chatbots and voice assistants, leaving the operator focused on interactions with greater added value;
  • Effectiveness: artificial intelligence tools can find a lot of information and make it available to the operator and the client;
  • Speed: waiting times can lengthen considerably at certain times of the day or during particular events. Chatbots and vocal assistants can lighten the load by fulfilling the simplest requests and performing “standard” tasks such as collecting data prior to a request.

On the other hand, there are the disadvantages that come with any deep learning system, in particular a massive demand for data and the need for hardware (computers and video cards) to train the models.


Advantages

  • Lower operating costs
  • Speed
  • 24/7 availability


Disadvantages

  • Expensive hardware
  • Data collection

A Financial Service Chatbot based on Deep Bidirectional Transformers. Shi Yu, Yuxin Chen, Hussain Zaidi. 2020.

Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2018.

Generative Encoder-Decoder Models for Task-Oriented Spoken Dialog Systems with Chatting Capability. Tiancheng Zhao, Allen Lu, Kyusong Lee, Maxine Eskenazi. 2017.

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, H. 2016.


  1. ↑
  2. GPT and XLMNet are two models, released by OpenAI and Facebook respectively, that are also based on Transformer layers. ↑