Predicting the Sentiment of a Text with Machine Learning

Suhas Maddali
5 min read · Aug 9, 2021

Machine learning applications are being developed everywhere, and the scope and demand for them are huge. Many organizations are leveraging machine learning and looking to apply it in scenarios where algorithms make decisions on their own based on past data labelled by humans.

Photo by Lukas Blazek on Unsplash

There are trivial tasks where machine learning could replace humans, freeing people to spend time on more productive work and increasing the revenue companies generate.

One of the coolest applications of machine learning is understanding the sentiment of a given text. It would be really handy to have algorithms that classify a text as positive or negative without human intervention, and doing so could save a lot of time and effort.

Consider a user who has written a negative comment about a product on a website such as Amazon. If we had to read every text and label it based on our understanding of its grammar and semantics, it would take a lot of time, because there are countless posts and comments from users all over the world. Therefore, we need automation that classifies texts as positive or negative based on past data that has already been labeled by humans.

One thing to keep in mind is that whatever data we give to a machine learning model must be in the form of numbers. Under the hood, the model performs mathematical computations, and its results are numerical as well. On the surface we only see the decisions the model takes, but it is also important to understand how it produces them. Considering that hyperparameter tuning can alter a model's performance, one can appreciate the importance of the mathematics behind machine learning. If you want to read about the mathematics behind machine learning and some of its important components, feel free to refer to my earlier article on hyperparameter tuning. The link is provided below.

In sentiment analysis, we detect the polarity of a text: whether it is positive or negative. First, the text is converted into a mathematical vector. Different vectorization techniques can be used, such as Bag of Words (BOW) and TF-IDF. Converting the words into numerical values in this way ensures that machine learning models can work with them and perform the computations that produce the right outcomes.
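As a rough sketch of what this vectorization step might look like, here is an example with scikit-learn and a tiny made-up two-review corpus (the reviews are not from the article, just illustrations):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A tiny made-up corpus for illustration
texts = [
    "the product is great and works well",
    "terrible quality, the product broke quickly",
]

# Bag of Words: each text becomes a vector of raw word counts
bow = CountVectorizer()
X_bow = bow.fit_transform(texts)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: counts are weighted down for words that appear in many documents
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)
print(X_tfidf.toarray())
```

Both vectorizers produce one row per text, which is exactly the numerical representation the models below expect.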

Photo by Stephen Dawson on Unsplash

When we deal with machine learning, we have to divide the data into three parts: the training data, the cross-validation data, and the test data. A typical split is about 70 percent for training, 15 percent for cross-validation, and 15 percent for testing. We train the machine learning model on the training data and evaluate it with appropriate metrics, such as accuracy for classification or mean squared error and mean absolute error for regression. Once the model is trained, we use the cross-validation data to tune the hyperparameters, as explained in the article linked above. Once we get the best accuracy or the lowest mean squared error, we compute the metrics on the test set to see how well the model does on data it has never seen. We take the test-set performance as an estimate of how the model would behave on real-world data.
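A minimal sketch of such a 70/15/15 split with scikit-learn, assuming X and y already hold the vectorized texts and their sentiment labels (names carried over from the snippet above), could look like this:

```python
from sklearn.model_selection import train_test_split

# Assume X holds the vectorized texts (e.g., the TF-IDF matrix from above)
# and y holds the 0/1 sentiment labels.
# First split off 30% that will become the cross-validation + test data.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
# Split the remaining 30% in half: 15% cross-validation, 15% test.
X_cv, X_test, y_cv, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest
)
```

The stratify argument keeps the proportion of positive and negative labels roughly the same in all three parts.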

Now it is time to deal with the text data. First, we remove the stopwords: words that occur frequently but do not add much meaning to the text, such as "and" and "or". Next, we lowercase the text so that the same word written in uppercase and lowercase is not treated as two different words. If we skip this step, the vectorizers would treat uppercase and lowercase versions of a word as different tokens. Finally, we apply machine learning techniques to produce predictions for the test set, as shown in the sketches below.
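Here is a small illustrative sketch of lowercasing and stopword removal, using scikit-learn's built-in English stopword list (the article does not prescribe a particular library, so this is just one way to do it):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_text(text):
    # Lowercase so that "Good" and "good" map to the same token
    tokens = text.lower().split()
    # Drop stopwords such as "the", "is", "and", "or"
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)

print(clean_text("The product is good AND the delivery was fast"))
# -> "product good delivery fast"
```

The cleaned strings are then what we pass to the vectorizer instead of the raw text.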

The last step is to feed the machine learning models the data that has been converted into mathematical vectors and see how they do on the test set. We use the cross-validation data to check how well a model is doing and to tune the important hyperparameters, so that we end up with good accuracy or a low mean squared error on the test set.
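As a rough sketch, continuing from the splits above and using logistic regression purely as an example classifier (the article does not pin down a specific model), the train-tune-test loop could look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Try a few hyperparameter values and keep the one that does best
# on the cross-validation data (C is the regularization strength).
best_model, best_cv_acc = None, 0.0
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_train, y_train)
    cv_acc = accuracy_score(y_cv, model.predict(X_cv))
    if cv_acc > best_cv_acc:
        best_model, best_cv_acc = model, cv_acc

# Report performance once on the held-out test set
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"CV accuracy: {best_cv_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Only the cross-validation data is used for choosing between models; the test set is touched once at the end, which is what makes it a fair estimate of real-world performance.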

Conclusion

All in all, we saw how to take text data and convert it into mathematical vectors so that machine learning models can make good predictions on the test set, that is, on data the models have never seen before. Feel free to share your thoughts. Thanks.

Below is my LinkedIn account if you want to connect so that we can further discuss machine learning and artificial intelligence.

LinkedIn: https://www.linkedin.com/in/suhas-maddali-b9b146136/

Github: https://github.com/suhasmaddali (Suhas Maddali)
