Anurag Sarkar
Biomedical Engineering Enthusiast, ML Enthusiast, NLP Enthusiast.

Making Alexa Talk: Application of LSTM in Natural Language Processing

Imagine having this conversation with the Alexa that you bought in the Independence Day sale on Amazon:

“Me: Hey Alexa, play the song with Ritchie Blackmore and Dio that has some Babylon reference.

Alexa: You must be talking about The Gates of Babylon.”

But how did Alexa know that? How did it predict so well? Some will say that it uses AI. That is correct, but imprecise. A more precise answer is NLP, or Natural Language Processing.

The following analysis walks through the structure Alexa follows. The structure may seem simple, but it has deeper systems embedded in it. Let us discuss the core machinery, i.e. the text processing unit.


Alexa has the same structure as a computing system: an input, namely the voice command from the user; an interpreter or encoder, which converts the voice input to text; a processing unit, which is an LSTM network; a decoder, which converts the text output back to voice; and an output, which is Alexa’s voice. When you talk to Alexa, a mic captures your voice and feeds that audio input to a text converter, which is a neural network model built using millions of audio clips and their corresponding text.


What is a neural network, you might be wondering? Simply put, it is a program loosely modelled on our brain, often written in Python. Just as specific neurons fire in a specific order to send a specific signal in our brains, a neural network consists of ‘nodes’, or objects, which connect to other nodes to produce a particular output. After receiving the voice clip from you, Alexa sends it to a remote server containing the neural network. That system processes the input voice clip and converts it to text. Likewise, the decoder is the opposite kind of network: it converts the text output from the LSTM unit to a voice clip and sends it back to the Alexa device.

Fig: Basic Structure of Alexa
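
The three stages described above can be sketched as a small pipeline. This is only an illustration of the data flow, not Alexa's actual code: the class name `VoiceAssistantPipeline`, its field names, and the stand-in lambdas are all hypothetical, and in a real system each stage would be a trained neural network.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAssistantPipeline:
    """Three-stage pipeline: audio -> text -> reply text -> audio."""
    speech_to_text: Callable[[bytes], str]   # the encoder
    generate_reply: Callable[[str], str]     # the LSTM text processing unit
    text_to_speech: Callable[[str], bytes]   # the decoder

    def respond(self, audio_in: bytes) -> bytes:
        text_in = self.speech_to_text(audio_in)
        text_out = self.generate_reply(text_in)
        return self.text_to_speech(text_out)


# Stand-in stages so the sketch runs; real stages would be trained networks.
demo = VoiceAssistantPipeline(
    speech_to_text=lambda audio: audio.decode("utf-8"),
    generate_reply=lambda text: "You said: " + text,
    text_to_speech=lambda text: text.encode("utf-8"),
)
reply_audio = demo.respond(b"play Gates of Babylon")
```

Keeping the stages as separate callables mirrors the description: the encoder and decoder can be swapped or retrained without touching the text processing unit in the middle.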


Let’s talk about the LSTM unit that processes the signal received from the encoder and outputs a message to the decoder. Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. It can process not only single data points (such as images) but also entire sequences of data (such as speech or video). For example, LSTM applies to tasks such as unsegmented, connected handwriting recognition, speech recognition and anomaly detection in network traffic or IDSs (intrusion detection systems). A standard LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.
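
To make the cell and the three gates concrete, here is a minimal NumPy sketch of a single LSTM time step. It follows one common textbook formulation. The function name `lstm_step`, the stacked weight layout, the gate ordering and the tiny dimensions are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    x: input vector (n_in,); h_prev, c_prev: previous hidden and cell state (n_hid,)
    W: stacked gate weights (4 * n_hid, n_in + n_hid); b: bias (4 * n_hid,)
    """
    n_hid = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * n_hid:1 * n_hid])   # input gate: how much new information to write
    f = sigmoid(z[1 * n_hid:2 * n_hid])   # forget gate: how much old memory to keep
    o = sigmoid(z[2 * n_hid:3 * n_hid])   # output gate: how much memory to expose
    g = np.tanh(z[3 * n_hid:4 * n_hid])   # candidate values for the cell
    c = f * c_prev + i * g                # the cell carries memory across time steps
    h = o * np.tanh(c)                    # gated output passed to the next step
    return h, c


rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):      # a sequence of five input vectors
    h, c = lstm_step(x, h, c, W, b)
```

The feedback connection is visible in the loop: the `h` and `c` produced at one step are fed back in at the next, which is what lets the unit process a whole sequence rather than a single data point.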

LSTM networks are well-suited to classifying, processing and making predictions based on time series data since there can be lags of unknown duration between essential events in a time series. LSTMs were developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over RNNs, hidden Markov models and other sequence learning methods in numerous applications.
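
The vanishing gradient problem can be seen in a toy example. The sketch below assumes a deliberately simplified scalar RNN, h_t = tanh(w * h_{t-1}), which is not part of the original article. The function name `gradient_through_time` and the values of `w` and `h0` are hypothetical choices for the demonstration.

```python
import math


def gradient_through_time(w: float, steps: int, h0: float = 0.5) -> float:
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain rule through one tanh step
    return grad


short_gap = gradient_through_time(w=0.5, steps=1)
long_gap = gradient_through_time(w=0.5, steps=50)
```

Each step multiplies the gradient by a factor of at most 0.5 here, so after 50 steps almost nothing of the early signal is left to learn from. An LSTM instead updates its cell state additively, scaled by the forget gate, so that factor can stay close to 1 over long gaps.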

Fig: A Typical LSTM Structure

Alexa makes use of the excellent prediction capabilities of LSTM. The preprocessing is straightforward: read the data (the novels we want to use as inputs), create a dictionary of words, and create a list of sentences. These serve as the inputs of our neural network. With them we can build the neural network, train it, and generate new sentences!
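
A minimal sketch of these preprocessing steps, using only the Python standard library. The one-line `corpus` is a placeholder standing in for the novels, and the simple regular-expression tokenizer and the names `word_to_index` and `index_to_word` are assumptions for illustration.

```python
import re

# Placeholder text; the article uses novels as the corpus.
corpus = "The gates opened. The wizard spoke. The gates closed."

# List of sentences, each one a list of lowercase word tokens.
sentences = [
    re.findall(r"[a-z']+", s.lower())
    for s in re.split(r"[.!?]+", corpus)
    if s.strip()
]

# Dictionary of words: each distinct word gets an integer index, and back again.
vocabulary = sorted({word for sentence in sentences for word in sentence})
word_to_index = {word: i for i, word in enumerate(vocabulary)}
index_to_word = {i: word for word, i in word_to_index.items()}
```

The reverse mapping `index_to_word` is what turns the network's numeric predictions back into readable words later on.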


Now, we have to create the training dataset for our LSTM. We create two lists:
  • sequences: this list contains the series of words (i.e. lists of words) used to train the model,
  • next words: this list contains the word that follows each sequence in the sequences list.

However, these lists cannot be used “as is”. If we want the LSTM to ingest them, we have to transform them. A neural net does not understand “text”; we have to use numbers. However, we cannot simply map a word to its index in the vocabulary, because this value does not “represent” the word. We have to reorganize a sequence of words as a matrix of Booleans (1 or 0). We create the matrices X and y to be the data inputs of our model:

X: a matrix of the following dimensions:
  • Number of sequences,
  • Number of words in sequences,
  • The number of words in the vocabulary.
y: a matrix of the following dimensions:
  • Number of sequences,
  • The number of words in the vocabulary.

For each word, we retrieve its index in the vocabulary and set the corresponding position in the matrix to 1. X and y are our training data.
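
The construction of the two lists and of X and y can be sketched as follows. The toy word list, the window size `seq_len = 3` and the variable names are assumptions for illustration; a real corpus would be far larger.

```python
import numpy as np

# Toy corpus as a flat list of words (assumed to be tokenized already).
words = "the gates of babylon the gates of dawn".split()
vocabulary = sorted(set(words))
word_to_index = {w: i for i, w in enumerate(vocabulary)}

seq_len = 3  # hypothetical number of words per training sequence
sequences = [words[i:i + seq_len] for i in range(len(words) - seq_len)]
next_words = [words[i + seq_len] for i in range(len(words) - seq_len)]

# X: (sequences, words per sequence, vocabulary size)
# y: (sequences, vocabulary size)
X = np.zeros((len(sequences), seq_len, len(vocabulary)), dtype=bool)
y = np.zeros((len(sequences), len(vocabulary)), dtype=bool)
for s, sequence in enumerate(sequences):
    for t, word in enumerate(sequence):
        X[s, t, word_to_index[word]] = True   # mark this word's vocabulary slot
    y[s, word_to_index[next_words[s]]] = True  # mark the word that follows
```

Every word position in X has exactly one True entry, which is why this representation is usually called one-hot encoding.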

Once this data is fed into the network and the network is trained, it learns from the indices which word is likely to come at which position. By feeding this sequential data into the layers, we let our model see how text is formed. We are mentoring the model just as we were taught. The data received from the user is then fed into the system in the same way, and the LSTM model (a Bidirectional LSTM, or Bi-LSTM, to be precise) has to generate a new sequence of indices from the experience it has gained. Then it retrieves the corresponding words from the dictionary.

After the sentence is successfully generated, the decoder starts working. It converts the sentence into speech using another neural network, one trained with text as input and voice as output. After translating those words to an audio clip, Alexa can talk back to you.
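
The last two steps, generating indices and looking the words up in the dictionary, can be sketched like this. No trained model is used here: `fake_predict` is a hypothetical stand-in that returns a probability for each vocabulary word, and the greedy `argmax` choice is one simple decoding strategy assumed for the example.

```python
import numpy as np

vocabulary = ["babylon", "gates", "of", "the"]
index_to_word = dict(enumerate(vocabulary))


def fake_predict(indices):
    """Stand-in for a trained model: one probability per vocabulary word.

    It simply favours the next word in a fixed cycle after the last index.
    """
    probs = np.full(len(vocabulary), 0.05)
    cycle = {3: 1, 1: 2, 2: 0, 0: 3}   # the -> gates -> of -> babylon -> the
    probs[cycle[indices[-1]]] = 0.85
    return probs


def generate(seed_indices, n_words, predict=fake_predict):
    """Extend the seed one index at a time, then map indices back to words."""
    indices = list(seed_indices)
    for _ in range(n_words):
        probs = predict(indices)
        indices.append(int(np.argmax(probs)))   # greedy choice of the next index
    return [index_to_word[i] for i in indices]


sentence = generate([3], 3)
```

With a real network, `predict` would be replaced by the trained model's forward pass over the one-hot encoded sequence; the loop and the dictionary lookup stay the same.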


This complete procedure of processing and predicting comes under an umbrella of data science named Natural Language Processing. NLP is still new to this market, but it has already secured a strong place for further research work. With the arrival of LSTM, Bidirectional LSTM and Transformers, NLP tasks have become much simpler over the years. With a lot of future work still to come, LSTM was the first step towards this revolution. It set the standard for NLP tasks through its excellent scores across text processing. From ANN to LSTM, the journey that started as a simple correction to backpropagation errors and losses has made its way to a leading place in Natural Language Processing.





Note: Utmost care has been taken to credit the original authors/sources and to make these as apt as possible.