Exploratory Data Analysis (EDA) on Airline Sentiment Tweets

Learn various ways to explore airline sentiment data and understand key factors that contribute to the overall sentiment of tweets on various airlines

Suhas Maddali
10 min read · Jan 31, 2023
Photo by Daniel Klein on Unsplash

When we browse the internet, we see many advances and innovations taking place in the fields of machine learning and data science. Some of the most interesting applications of artificial intelligence are in the medical industry, where models help diagnose whether a person is suffering from a disease based on a particular set of indicators or features.

There are numerous other industries using artificial intelligence as well: the self-driving industry is building cars that can take us to our destinations without the aid of a driver, pharmaceutical companies understand the value of AI in improving patient lives, retailers rely on AI to build better shopping experiences, and manufacturers use AI to detect defects in steel.

Therefore, updating ourselves with the latest state-of-the-art techniques in this field can help us build a better society.

Now that we have understood how artificial intelligence can be used in a large variety of industries, we will look at one of the most important yet neglected steps that must be followed before building ML models for prediction: Exploratory Data Analysis.

The first step in machine learning is to explore the data and understand the meaningful relationships within it. A firm understanding of the data ensures that we gain domain knowledge and can identify whether there is bias in the data.

In this article, we will use exploratory data analysis (EDA) on a machine learning task: predicting customer sentiment in the airline industry (https://www.kaggle.com/datasets/thedevastator/sentiment-analysis-of-us-airline-twitter-data). We will also implement the code along with visualization plots to help us understand the airline tweets.

When we formulate this as a machine learning problem, one of the best ways to use this dataset is to predict users' sentiment toward various airlines based on their tweets. This falls under natural language processing (NLP), where we devise methods to convert the text into vectors before giving it to our ML models to predict the sentiment of tweets.

Now that we have formulated our ML problem, we will follow the procedural steps with a focus on exploratory data analysis and its usefulness. Let us begin with the first step: reading the data.

Reading the Data

The first step when approaching an ML problem is to read the dataset into a format on which we can perform operations and visualizations. We will use pandas to load the dataset below.

We see that the data is stored in the ‘.csv’ file format. Pandas is a library that can read ‘.csv’ files, so we use it to load the data into a dataframe.
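As a sketch, the Kaggle file (commonly named Tweets.csv) can be loaded in one line; a tiny inline sample stands in below so the snippet runs on its own — the sample rows are illustrative, not real data:

```python
import io

import pandas as pd

# In practice the Kaggle download is read directly, e.g.:
#   df = pd.read_csv("Tweets.csv")
# A tiny made-up sample stands in here so the snippet is self-contained.
sample = io.StringIO(
    "tweet_id,airline,airline_sentiment,text\n"
    "1,United,negative,@united flight delayed again\n"
    "2,Virgin America,positive,@VirginAmerica great crew today\n"
)
df = pd.read_csv(sample)
print(df.shape)  # (rows, columns)
```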

Understanding the Data

After the first step is completed, we take time to understand the data, using various methods to get to know its statistical properties. In this way, we can also expect to find correlations or underlying patterns present in the data. Below are some ways in which we can better understand our data in Python.
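A minimal sketch of these inspection calls (`head()`, `columns`, `describe()`), again using a small made-up stand-in for the real dataframe:

```python
import io

import pandas as pd

# Made-up stand-in for the Kaggle Tweets.csv so the snippet is self-contained.
csv = io.StringIO(
    "tweet_id,airline,airline_sentiment,negativereason,text\n"
    "1,United,negative,Late Flight,@united delayed again\n"
    "2,Virgin America,positive,,@VirginAmerica great crew\n"
    "3,US Airways,neutral,,@USAirways any update?\n"
)
df = pd.read_csv(csv)

print(df.head())                            # first 5 rows of the dataframe
print(df.columns.tolist())                  # names of the features
print(df["airline_sentiment"].describe())   # basic statistics for one column
```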

The code cell above displays the first 5 rows of the dataset, along with the names of the features, as seen below.

Note that not all the features are visible in the result below; we can scroll horizontally in our Jupyter notebooks to see the remaining features considered for machine learning.
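The missing-value count per feature can be obtained with `isnull().sum()`; the sample below is made up so the snippet is self-contained (the empty fields are read as NaN):

```python
import io

import pandas as pd

# Made-up sample; the empty fields are parsed as NaN by pandas.
csv = io.StringIO(
    "tweet_id,airline,airline_sentiment,negativereason,airline_sentiment_gold\n"
    "1,United,negative,Late Flight,\n"
    "2,Virgin America,positive,,\n"
    "3,US Airways,neutral,,\n"
)
df = pd.read_csv(csv)

print(df.isnull().sum())  # missing-value count per feature
```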

The code cell above displays the sum of total missing or null values for individual features.

This is the full list of features that we have considered so far for our problem. We can see that some features, such as ‘airline_sentiment_gold’ and ‘negativereason’, have quite a lot of missing values.

Many machine learning models perform suboptimally when the data contains missing values, especially in the important features that help the models determine the outcome. Hence, it is a good idea to deal with missing values, either by imputing them or by discarding them completely.
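A hedged sketch of both options — dropping a mostly-empty column such as ‘airline_sentiment_gold’ and imputing the rest with a placeholder category (the sample rows are illustrative, and which columns to drop is a judgment call for the real data):

```python
import io

import pandas as pd

# Made-up stand-in rows.
csv = io.StringIO(
    "tweet_id,airline_sentiment,negativereason,airline_sentiment_gold\n"
    "1,negative,Late Flight,\n"
    "2,positive,,\n"
    "3,neutral,,\n"
)
df = pd.read_csv(csv)

# Option 1: drop a column that is almost entirely empty in the real dataset.
df = df.drop(columns=["airline_sentiment_gold"])
# Option 2: impute the remaining gaps with a placeholder category.
df["negativereason"] = df["negativereason"].fillna("Not specified")

print(df.isnull().sum().sum())  # 0 -- no missing values remain
```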

Performing Exploratory Data Analysis (EDA)

Now that we have thoroughly understood the data and also added or replaced a few rows or columns depending on how appropriate they can be to our machine learning problem, we are now going to perform exploratory data analysis (EDA).

It is the step where we plot relationships between features, understand their distributions, and learn more about their statistical properties. If insights are gained in the process, we can explain them to the business so that it can take action and generate revenue. EDA also guides us in choosing the right feature engineering techniques based on the properties of the features and their relationships with the output.
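The counting-and-plotting step can be sketched with pandas and matplotlib as follows (the sample rows and the saved file name are illustrative choices):

```python
import io

import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt

# Made-up stand-in rows.
csv = io.StringIO(
    "airline,airline_sentiment\n"
    "United,negative\nUnited,negative\nUS Airways,negative\n"
    "Virgin America,positive\nUnited,neutral\n"
)
df = pd.read_csv(csv)

# Count how many tweets fall into each sentiment category.
counts = df["airline_sentiment"].value_counts()
print(counts)

# Plot the counts as a bar chart and keep a copy on the local machine.
counts.plot(kind="bar", title="Tweet count per sentiment")
plt.tight_layout()
plt.savefig("sentiment_barplot.png")
```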

Based on the above code, we are mostly interested in knowing the overall sentiment of customers toward the airlines. We use ‘value_counts()’ to list the categories present in a feature along with their counts.

Once we have the counts from ‘value_counts()’, the data is plotted as a bar chart representing the total number of tweets per sentiment category. It is also good practice to save the final figure so that it can be viewed later on a local machine.

After running the above code snippet, we get the barplot as shown below.

Barplot (Image from author)

Based on the figure, we see that a large number of people in our data have negative comments and tweets about the airlines. If we were to train ML models on this data, they would likely predict the negative sentiment of passengers more accurately than the other classes.

In general, when building ML applications for use cases such as credit fraud detection or heart disease prediction, we tend to have far fewer instances of the positive class than of the negative class. In such cases, it is advisable to apply oversampling to the minority class so that it is well represented and our ML models can predict it accurately.
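A minimal sketch of naive random oversampling with pandas alone (imbalanced-learn's RandomOverSampler or SMOTE are common, more principled alternatives; the toy rows below are made up):

```python
import io

import pandas as pd

# Made-up imbalanced toy data: 4 negative rows, 1 positive row.
csv = io.StringIO(
    "airline_sentiment,text\n"
    "negative,a\nnegative,b\nnegative,c\nnegative,d\npositive,e\n"
)
df = pd.read_csv(csv)

# Duplicate minority-class rows (sampling with replacement) until the
# class counts match.
majority = df[df["airline_sentiment"] == "negative"]
minority = df[df["airline_sentiment"] == "positive"]
upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, upsampled], ignore_index=True)

print(balanced["airline_sentiment"].value_counts())
```

In practice, oversample only the training split after the train/test split; duplicating rows before splitting leaks copies of the same tweet into the evaluation set.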

In addition, when we use metrics such as accuracy in the above use cases, there is a high probability that our models report good accuracy despite performing poorly on the minority class that is less prevalent in the data. Other metrics such as precision, recall, and F1-score must be taken into account, as they tell us how well the ML models did on the minority classes.
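A small illustration of why accuracy misleads on imbalanced data, using scikit-learn's metric functions on made-up labels: a model that always predicts the majority class looks accurate while completely missing the minority class.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy imbalanced labels: 9 negative, 1 positive.
y_true = ["negative"] * 9 + ["positive"]
# A degenerate "model" that always predicts the majority class.
y_pred = ["negative"] * 10

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks good
# Recall and F1 on the minority class expose the failure.
print(recall_score(y_true, y_pred, pos_label="positive", zero_division=0))  # 0.0
print(f1_score(y_true, y_pred, pos_label="positive", zero_division=0))      # 0.0
```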

Pie plots are quite useful for representing the categories present in a feature and each category's proportion of the total.
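The pie plot step can be sketched like this (made-up rows; matplotlib's `autopct` formats the percentage shown on each slice, and the saved file name is arbitrary):

```python
import io

import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt

# Made-up stand-in rows.
csv = io.StringIO(
    "airline\nUnited\nUnited\nUnited\nUS Airways\nUS Airways\nVirgin America\n"
)
df = pd.read_csv(csv)

counts = df["airline"].value_counts()
labels = counts.index.tolist()  # category names for the pie slices
sizes = counts.values           # slice sizes

plt.pie(sizes, labels=labels, autopct="%1.1f%%")
plt.title("Share of tweets per airline")
plt.savefig("airline_pie.png")
```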

From the above code snippet, we can see that we collect all the categories of the feature ‘airline’ and store them before passing them as pie plot arguments.

Similarly, the actual count for each category is stored in another variable (sizes) before being passed to the pie plot. Finally, we give the plot a title and display it in our Jupyter notebook. Below is how the plot looks once we run the code snippet.

Pie plot (Image from author)

From the plot above, we see that most of the tweets were about United, followed by US Airways. When performing machine learning analysis, there is a higher likelihood that our models will perform quite well on these airline categories, as we have more information about them in our data.

Only 3.4 percent of the tweets are about Virgin America, which means that our models might not be able to capture trends for this airline, as we do not have as much information as for the other airlines.

One of the best ways to get around this would be to artificially create new samples for the minority class (Virgin America) so that our models can accurately predict the sentiment of tweets for this airline.

WordCloud (Image from author)

Since we are mostly dealing with text data, it is quite handy to look at the set of words present in our tweets. A useful library called ‘wordcloud’ can represent the occurrence of the various words across our tweets.
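As a sketch, the word frequencies behind such a cloud can be computed with a plain `Counter` over made-up tweets; the tokenization regex here is a deliberately crude assumption:

```python
import re
from collections import Counter

# Made-up tweets standing in for the real text column.
tweets = [
    "@united flight delayed again, thanks for nothing",
    "@VirginAmerica thank you, great flight and great crew",
    "@USAirways my flight was cancelled, still waiting",
]

# Crude tokenization: lowercase everything, keep runs of letters.
words = re.findall(r"[a-z']+", " ".join(tweets).lower())
freqs = Counter(words)

print(freqs.most_common(3))  # most frequent words first
```

These frequencies can then be rendered with `wordcloud.WordCloud().generate_from_frequencies(freqs)`, which sizes each word by its count.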

The size of each word represents its frequency of occurrence. Based on this information, we see that words such as flight, thank, and plane are used most often. This makes sense because we are dealing with airline tweets, and most people are likely to mention these terms.

Positive Sentiment WordCloud (Image from Author)

Now that we are interested in the frequently occurring words in our corpus, we will also find the most common words for the positive sentiment. This is done by filtering the positive tweets from the entire dataframe.
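The filtering step can be sketched as follows (made-up rows; the joined corpus is what a wordcloud, or any frequency count, would consume):

```python
import io

import pandas as pd

# Made-up stand-in rows.
csv = io.StringIO(
    "airline_sentiment,text\n"
    "positive,@VirginAmerica thank you great crew\n"
    "negative,@united flight cancelled\n"
    "positive,@SouthwestAir thanks great service\n"
)
df = pd.read_csv(csv)

# Keep only positively labelled tweets, then join them into one corpus.
positive = df[df["airline_sentiment"] == "positive"]
corpus = " ".join(positive["text"])

print(len(positive), "positive tweets")
```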

After performing this step, we generate a wordcloud only for the positive tweets to find the underlying patterns. As can be seen from the plot, words such as thank and great are prevalent (judging by their size in the wordcloud). This shows that positive tweets most likely contain these words, and they can play an important role when an ML model makes predictions.

Similarly, we also take a look at the negative tweets to understand their most common words and how useful these features could be for our tweet classifier.

Negative Sentiment Wordcloud (Image from author)

We have now plotted the negative sentiment wordcloud, which shows the words that make up the overall negative sentiment. Words such as ‘cancelled’ and ‘delay’ occur most often in negative tweets.

After exploring the data, we found that these words can be useful features, especially when we use representations such as Bag of Words (BoW) to convert the tweets into vectors. The occurrence of words such as great and cancelled is of great help to our ML model in predicting airline sentiment.

Having found interesting trends and patterns in the data, we can share these insights with the various flight carriers to inform them of the overall sentiment of their passengers.

Had we jumped straight into data preparation without exploring the data, we would not have known how useful it was for model predictions. By performing exploratory data analysis (EDA), we give more importance to understanding the quality of the data before using it for ML predictions.

Key Takeaways

Exploratory data analysis (EDA) is a valuable step in machine learning. Performing it ensures that we assess the quality of the data before feeding it to our ML models. It also helps us gain key insights and identify opportunities for growth by delivering them to key players in an organization. It therefore makes the life of a data scientist or machine learning engineer easier, as data quality issues are identified before training and deployment.

