Next Word Prediction

I developed this project to learn more about NLP. It is the result of my self-study through the Keras documentation, data science forums, and educational content on the web.

I am particularly drawn to the use of neural networks for text analysis, since it opens up a world of possibilities for analyzing data found in, for example, social networks, forums, and newspapers. That’s why I decided to venture a little into text processing with neural networks through this project. Personally, this work made me more interested and involved in deep learning applications for problems with text data.

Project Description

Using natural language processing and deep learning methods, I built a model that predicts the next word of a given sentence. The model is a deep recurrent neural network (LSTM) trained on the text of the book Metamorphosis by Franz Kafka, found in Project Gutenberg.

Data Preprocessing

I started the data preprocessing by taking the original text, removing the extra information that is unnecessary for training, and saving the result in a new file. Then I read that file using utf8 encoding.

I added each line of text to a list, then joined the lines into a single string while removing leftover elements from unnecessary lines of text.

Then I made sure the text had no repeated words.
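The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the sample lines, and the choice of which characters to strip (newlines, carriage returns, and the byte-order mark) are assumptions made for the example.

```python
def load_lines(path):
    """Read a UTF-8 text file into a list of lines (path is a placeholder)."""
    with open(path, "r", encoding="utf8") as handle:
        return handle.readlines()


def join_lines(lines):
    """Join lines into one string, dropping newlines, carriage returns and the BOM."""
    text = " ".join(lines)
    for unwanted in ("\n", "\r", "\ufeff"):
        text = text.replace(unwanted, "")
    # Collapse the runs of spaces left behind by blank lines.
    return " ".join(text.split())


def drop_repeated_words(text):
    """Keep only the first occurrence of each word, preserving order."""
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


# Hypothetical sample standing in for the lines returned by load_lines().
sample = [
    "\ufeffOne morning, when Gregor Samsa woke\n",
    "\n",
    "from troubled dreams, he found\r\n",
    "when he woke\n",
]
joined = join_lines(sample)
unique = drop_repeated_words(joined)
print(unique)
```

Note that dropping repeated words shrinks the text to its unique vocabulary in order of first appearance, so the word order of the original sentences is only partly preserved.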

Tokenization

I used the Keras Tokenizer to vectorize the text and then transform it into a sequence of numbers.
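To make the idea concrete, here is a plain-Python sketch of what a word-level tokenizer does: it builds a word index (most frequent word first, starting at 1) and then replaces each word with its index. It is a simplified stand-in written for illustration, not the Keras Tokenizer itself; the function names and the regular expression used to split words are assumptions.

```python
import re
from collections import Counter

# Simplified word pattern: lowercase letters, digits and apostrophes.
WORD_PATTERN = re.compile(r"[a-z0-9']+")


def fit_word_index(text):
    """Map each word to an integer id, most frequent first, starting at 1."""
    counts = Counter(WORD_PATTERN.findall(text.lower()))
    # most_common() sorts by count (descending); ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def text_to_sequence(text, word_index):
    """Replace every known word with its id; unknown words are skipped."""
    words = WORD_PATTERN.findall(text.lower())
    return [word_index[w] for w in words if w in word_index]


text = "The door opened and the room was quiet and the light was dim"
word_index = fit_word_index(text)
sequence = text_to_sequence(text, word_index)
print(word_index)
print(sequence)
```

Index 0 is deliberately left unused, which is why the vocabulary size used later is usually the length of the word index plus one.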

Then I found the size of the word index, which is used later as the number of classes when converting the output data into categorical variables. Next, I built the data sequences from the encoded text and split them into the input and output data of the model.
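A sliding window is one common way to turn an encoded text into input and output pairs. The sketch below assumes that approach; the `context_size` parameter, the function name, and the toy word index are hypothetical, and the window length used in the project may differ.

```python
def make_training_pairs(sequence, context_size=1):
    """Slide a window over the token ids: `context_size` ids in, the next id out."""
    inputs, targets = [], []
    for end in range(context_size, len(sequence)):
        inputs.append(sequence[end - context_size:end])
        targets.append(sequence[end])
    return inputs, targets


# Hypothetical word index and encoded text, for illustration only.
word_index = {"the": 1, "and": 2, "was": 3, "door": 4, "opened": 5}
sequence = [1, 4, 5, 2, 1, 4, 3]

# Ids start at 1, so one extra slot is reserved for the unused index 0.
vocab_size = len(word_index) + 1

X, y = make_training_pairs(sequence, context_size=2)
print(vocab_size)
print(X)
print(y)
```

Each row of `X` holds the words the model sees, and the matching entry of `y` is the word it should predict.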

Finally, I converted the output data to categorical variables.
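Converting to categorical variables means one-hot encoding: each target id becomes a vector with a single 1 at that id's position. Keras provides `keras.utils.to_categorical` for this; the plain-Python version below is only a sketch of the same idea, with hypothetical names and toy values.

```python
def one_hot(targets, num_classes):
    """Turn each integer target into a vector with a single 1.0 at its index."""
    rows = []
    for target in targets:
        row = [0.0] * num_classes
        row[target] = 1.0
        rows.append(row)
    return rows


# Toy targets; num_classes plays the role of the vocabulary size.
y = [5, 2, 1]
encoded = one_hot(y, num_classes=6)
for row in encoded:
    print(row)
```

With this encoding, the model's final layer can output one probability per word in the vocabulary and be trained with a categorical loss.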

Model Creation

The model architecture is the following:

The model has approximately 15.7M trainable parameters, and you can download it here. Finally, I obtained a model with 57% prediction accuracy and a categorical error of 0.6163. You can see the code and documentation in detail in this Repo. For the prediction of the model I created a notebook that has the following characteristics:

Thanks for reading 😄