Using Machine Learning to Measure User Sentiment towards Climate Change

Climate change is a long-term shift in the distribution of weather patterns. While man-made climate change is widely accepted by the scientific community, there are also skeptics who believe otherwise.

By applying machine learning techniques to tweets discussing climate change, we can better understand the effect of sociopolitical changes on these different perspectives.

To do this, we need to gather relevant data and build a predictive model that will help us determine the sentiment of incoming tweets.

Here is the link to the repository containing all of the code shared in this article.

Building the Data set

Using the Tweepy library for Python, we retrieved all tweets containing at least one of these phrases: “climate change”, “global warming” and “warming planet”. Human annotators independently classified each tweet into one of the following categories:

  • 2 — The tweet links to a factual news article reporting climate change
  • 1 — The user believes in man-made climate change
  • 0 — The user does not indicate explicit belief or disbelief towards man-made climate change
  • -1 — The user does not believe in man-made climate change

A sample of climate change tweets shared between 2016 and 2018, along with their corresponding labels, can be found here. This will be used as the data to train our model.

Examining the distribution of the user sentiments reveals that the majority of the tweets support the belief in man-made climate change.

Our classes are quite unbalanced. However, since we do not have many data points (relatively speaking), I will not be downsampling the larger class.
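To make the imbalance concrete, here is a minimal sketch of how the class shares could be computed. The labels below are made up for illustration and are not the real data set; `class_shares` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical labels, for illustration only (not the real data set).
labels = [1, 1, 2, 0, 1, -1, 1, 2, 0, 1]

def class_shares(labels):
    """Return each class's share of the data set as a fraction of 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

shares = class_shares(labels)
for label, share in shares.items():
    print(f"{label:>2}: {share:.0%}")
```

Running the same function on the real labels would show how dominant the largest class is before deciding whether to resample.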

Cleaning the Data

Twitter data often contains emojis, foreign symbols and its own unique lingo, which makes processing such data rather challenging. Before we can build the model, we need to get rid of special symbols and URLs. We also remove any duplicate tweets.
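A cleaning step along these lines could look like the sketch below. The exact rules are assumptions: we keep ASCII letters, digits, `#` and `@`, lowercase everything, and treat two tweets as duplicates when their cleaned text matches. The function names and sample tweets are hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Assumption: keep ASCII letters, digits, whitespace, '#' and '@'; drop the rest.
SYMBOL_PATTERN = re.compile(r"[^A-Za-z0-9\s#@]")

def clean_tweet(text):
    """Strip URLs and special symbols, collapse whitespace, and lowercase."""
    text = URL_PATTERN.sub(" ", text)
    text = SYMBOL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def clean_and_deduplicate(tweets):
    """Clean every tweet and keep only the first copy of each cleaned text."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = clean_tweet(tweet)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

# Made-up tweets, for illustration only.
raw = [
    "Climate change is real!! 🌍 https://t.co/abc123",
    "Climate change is real!! 🌍 https://t.co/xyz789",
    "Is #globalwarming a hoax?",
]
cleaned = clean_and_deduplicate(raw)
print(cleaned)
```

Note that this character filter also removes apostrophes, so "don't" becomes "don t"; a gentler filter may be preferable depending on the embedding vocabulary.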

Train/Test Split

We split the data into 90% training and 10% testing.
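One way to sketch the 90/10 split with only the standard library is shown below. The shuffle seed and the helper name are assumptions, and the rows are placeholders rather than real tweets.

```python
import random

def train_test_split_90_10(rows, test_fraction=0.1, seed=42):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    rows = list(rows)
    rng = random.Random(seed)
    rng.shuffle(rows)
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

# Placeholder (text, label) pairs with labels cycling through -1, 0, 1, 2.
rows = [(f"tweet {i}", i % 4 - 1) for i in range(100)]
train_rows, test_rows = train_test_split_90_10(rows)
print(len(train_rows), len(test_rows))
```

With classes this imbalanced, a stratified split (one that preserves the class proportions in both parts) may be worth considering; this sketch does not do that.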

Building the Classifier

Using the cleaned data set, we proceed to train a classifier to predict the sentiment of subsequent tweets. We will be using a recurrent neural network with GRU units to do this.

Rationale for the Algorithm: recurrent neural networks are a natural choice for text analysis as they can capture the sequential nature of the text. GRU units were chosen over LSTM units because tweets are short.

Embedding the Tweets

Using the pre-trained glove.twitter.27B.200d embedding vectors, we can construct a representation of language which accurately reflects Twitter lingo.

Since Twitter data often contains many misspellings, we restrict the model to the 20,000 most frequently occurring tokens. The maximum length of each text is restricted to 150 tokens: anything longer will be truncated and anything shorter will be padded (side note: 150 tokens might be overkill). Lastly, we specify that each embedding vector has 200 dimensions.

Next, we fit a tokenizer using the training data.
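The idea behind the tokenizer can be sketched in plain Python. This is a simplified stand-in rather than the library tokenizer: it splits on whitespace, reserves index 0 for padding, drops out-of-vocabulary tokens, and pads or truncates at the end of the sequence. All of these choices, and the function names, are assumptions.

```python
from collections import Counter

MAX_FEATURES = 20000  # vocabulary cap from the text
MAX_LEN = 150         # sequence length from the text

def fit_vocabulary(texts, max_features=MAX_FEATURES):
    """Map the most frequent tokens to integer ids, most frequent first."""
    counts = Counter(token for text in texts for token in text.split())
    # Index 0 is reserved for padding, so real tokens start at 1.
    most_common = counts.most_common(max_features - 1)
    return {token: i + 1 for i, (token, _) in enumerate(most_common)}

def encode(text, vocabulary, max_len=MAX_LEN):
    """Turn a text into a fixed-length list of token ids."""
    ids = [vocabulary[t] for t in text.split() if t in vocabulary]
    ids = ids[:max_len]                      # truncate long texts
    return ids + [0] * (max_len - len(ids))  # pad short texts with 0

# Tiny made-up corpus with a tiny vocabulary cap, for illustration only.
train_texts = ["climate change is real", "climate change is a hoax"]
vocabulary = fit_vocabulary(train_texts, max_features=4)
print(vocabulary)
print(encode("climate change is real", vocabulary, max_len=6))
```

The vocabulary is fitted on the training texts only, so tokens that appear solely in the test set are treated as unknown.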

Then, we prepare the embedding matrix and find the vector representations for each tweet using the pre-trained embedding model and the fitted tokenizer.
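A rough sketch of the embedding matrix step follows. It assumes the GloVe file is plain text with one token per line followed by its vector components, separated by single spaces, and that `vocabulary` maps each token to a positive integer id with 0 reserved for padding. The helper names and the tiny three-dimensional vectors are made up for illustration.

```python
import numpy as np

EMBED_DIM = 200  # vector size from the text

def load_glove(lines, dim=EMBED_DIM):
    """Parse 'token v1 v2 ... v_dim' lines into a token -> vector dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(vocabulary, vectors, dim=EMBED_DIM):
    """Row i holds the vector for the token with id i; unknown rows stay zero."""
    matrix = np.zeros((len(vocabulary) + 1, dim), dtype="float32")
    for token, index in vocabulary.items():
        vector = vectors.get(token)
        if vector is not None:
            matrix[index] = vector
    return matrix

# Made-up 3-dimensional vectors instead of the real 200-dimensional file.
lines = ["climate 0.1 0.2 0.3", "change 0.4 0.5 0.6"]
vocabulary = {"climate": 1, "change": 2, "hoax": 3}
vectors = load_glove(lines, dim=3)
matrix = build_embedding_matrix(vocabulary, vectors, dim=3)
print(matrix.shape)
```

With the real embeddings, `load_glove` would be given an open file handle instead of a list of strings; the file name on disk depends on how the archive was downloaded.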

Training the Classifier

We let the model train for a maximum of 20 epochs, stipulate a batch size of 128 data points and a patience of 4 epochs. The patience parameter will be used in conjunction with early stopping to determine when/if we should prematurely end training.
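To illustrate how patience interacts with early stopping, here is a small simulation. It assumes the monitored quantity is the validation loss (the metric is not stated above) and uses made-up loss values; the function name is hypothetical.

```python
MAX_EPOCHS = 20  # maximum number of epochs from the text
PATIENCE = 4     # patience from the text

def best_epoch_with_early_stopping(val_losses, patience=PATIENCE,
                                   max_epochs=MAX_EPOCHS):
    """Return (best epoch, epoch at which training stops), both 1-based."""
    best_loss = float("inf")
    best_epoch = None
    stopped_at = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        stopped_at = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, stopped_at

# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.80, 0.75, 0.76, 0.77, 0.78, 0.79, 0.70]
best_epoch, stopped_at = best_epoch_with_early_stopping(losses)
print(best_epoch, stopped_at)
```

In this example training stops after epoch 7, four epochs after the best one, so the later improvement at epoch 8 is never seen. That is the trade-off a small patience value makes.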

The labels also need to be one-hot encoded.
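Because the labels run from -1 to 2, they have to be mapped to column positions first. A minimal sketch, assuming the column order -1, 0, 1, 2 (the order is a choice, not something fixed by the data):

```python
CLASSES = [-1, 0, 1, 2]  # assumed column order

def one_hot(labels, classes=CLASSES):
    """Encode each label as a row with a single 1 in its class column."""
    index = {label: i for i, label in enumerate(classes)}
    encoded = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        encoded.append(row)
    return encoded

encoded = one_hot([2, -1, 0, 1])
print(encoded)
```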

We build our network with one bidirectional GRU layer. The bidirectional component allows the network to learn each tweet by reading it from both directions. The GRU layer is followed by pooling layers, which help reduce the variance and computational complexity.
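An architecture of this shape could be sketched with Keras as follows. The framework, the number of GRU units (64), the pairing of average and max pooling, the optimizer and the loss are all assumptions made for illustration. The embedding layer here is randomly initialised; loading the pre-trained vectors into it is left out to keep the sketch short.

```python
from tensorflow.keras import layers, models

MAX_FEATURES = 20000  # vocabulary size from the text
MAX_LEN = 150         # sequence length from the text
EMBED_DIM = 200       # embedding size from the text
NUM_CLASSES = 4       # labels -1, 0, 1, 2

def build_model(gru_units=64):
    """One bidirectional GRU layer followed by two global pooling layers."""
    inputs = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(MAX_FEATURES, EMBED_DIM)(inputs)
    # return_sequences=True keeps one output per token for the pooling layers.
    x = layers.Bidirectional(layers.GRU(gru_units, return_sequences=True))(x)
    avg_pool = layers.GlobalAveragePooling1D()(x)
    max_pool = layers.GlobalMaxPooling1D()(x)
    x = layers.concatenate([avg_pool, max_pool])
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```

The softmax output has one unit per class, which matches one-hot encoded labels and the categorical cross-entropy loss.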

sample training output

Testing the Classifier

After loading the best model weights obtained from the above step, we can test the performance of our classifier.

Overall, the classifier achieved an accuracy of 70.9%. From the confusion matrix, we see that it is able to distinguish the classes reasonably well.

Displayed units are in percentages
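For reference, accuracy and a percentage confusion matrix can be computed as sketched below. We assume each row (true class) is normalised to sum to 100%; the predictions are made up and are not the classifier's actual output.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix_percent(y_true, y_pred, classes=(-1, 0, 1, 2)):
    """Rows are true classes, columns are predictions, values are row %."""
    index = {label: i for i, label in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    percent = []
    for row in counts:
        total = sum(row)
        percent.append([100.0 * c / total if total else 0.0 for c in row])
    return percent

# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 2, 2, 0, -1]
y_pred = [1, 1, 1, 0, 2, 1, 0, -1]
acc = accuracy(y_true, y_pred)
matrix = confusion_matrix_percent(y_true, y_pred)
print(acc)
for row in matrix:
    print(row)
```

Row normalisation makes the diagonal read as per-class recall, which is easier to compare across classes of very different sizes than raw counts.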

Analyzing Sentiment Over Time

A working dashboard which portrays the distribution of Twitter sentiment over time can be found here.

The dashboard uses a more comprehensive classifier (one that is trained on a larger data set and is an ensemble of four different models) to make the predictions. The data used in this dashboard was gathered over a longer period of time and supplemented by data purchased from third-party suppliers.

By analyzing sentiment over time, we can spot trends in the way that Twitter users perceive climate change. For instance, since 2016, there has been a steady decline in tweets which report climate change news.
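A trend like this comes down to tracking the share of one predicted label per time period. The sketch below groups by year and uses entirely made-up records, so the numbers it prints say nothing about the real data; the function name is hypothetical.

```python
from collections import Counter

def label_share_by_year(records, label):
    """Share of tweets per year whose predicted label equals `label`."""
    totals = Counter()
    hits = Counter()
    for year, predicted in records:
        totals[year] += 1
        if predicted == label:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Made-up (year, predicted label) pairs, for illustration only.
records = [
    (2016, 2), (2016, 2), (2016, 1), (2016, 0),
    (2017, 2), (2017, 1), (2017, 1), (2017, -1),
]
news_share = label_share_by_year(records, label=2)
print(news_share)
```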

If we overlay this data with major events, we can gain a better understanding of how sociopolitical changes influence climate change perspectives.

Thank you for reading. Stay tuned for more.

Check out my website for learning Data Science: https://www.dscrashcourse.com/