Sentiment Analysis using R – Lego Batman Movie

April 29, 2017
R Twitter

The Lego Batman Movie is a 3D computer-animated superhero comedy film released in 2017. The movie had 430,000+ followers on Facebook (as on Apr 2017). The movie was released on 17 Feb 2017 so we would like to analyse the sentiment of this movie post release.

I will be using two kind of approaches for text-mining tweets

Approach 1: Using ‘tm’ package – this is a basic text -mining framework package available in R
Approach 2: Using ‘syuzhet‘ package – this package is built on top of core NLP package that comes with advanced sentiment extraction tools

Tweet Extraction

The tweets were read through Twitter API directly in R. You can refer to great tutorial by 'Credera: Twitter Analytics Using R Part 1: Extract Tweets' for extracting tweets.

Tweet Pre-processing

Loading Libraries

You'll need the following libraries to pre process the tweets.

stringr - gives access to chararcter manipulation, whitespace tools, pattern matching and case transformation functions
readr - for reading rectangular datasets
wordcloud - to generate a beautiful wordcloud.
tm - the weighlifting text mining package
SnowballC - to implement Porter’s word stemming. Stemming refers to find a common root the words like, running, run are have run as their root
RSentiment - reads sentence sentiments in English in five categories namely Positive, Negative, very Positive, very negative, Neutral

library('stringr')
library('readr')
library('wordcloud')
library('tm')
library('SnowballC')
library('RWeka')
library('RSentiment')
library(DT)
library(ggplot2)

Loading Dataset

The scraped tweets were saved as .csv file. Each tweet items has features like name, retweet, tweet text, URL etc. For the purpose of sentiment analysis, we will use tweet text only.

tweets = read_csv("tweets.csv")
tweets_text = as.character(tweets$text)

Pre-Processing

The tweet text contains links, numbers, punctuation, emojis, white space, and related words that can be stemmed. The operations are done in the following code.

sample = sample(tweets_text, (length(tweets_text)))
corpus = Corpus(VectorSource(list(sample)))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, removeWords, stopwords('english'))
corpus = tm_map(corpus, stemDocument)
dtm_up = DocumentTermMatrix(VCorpus(VectorSource(corpus[[1]]$content)))
freq_up =

Computing Sentiments

sentiments_up = calculate_sentiment(names(freq_up))
sentiments_up = cbind(sentiments_up, as.data.frame(freq_up))
sent_pos_up = sentiments_up[sentiments_up$sentiment == 'Positive',]
sent_neg_up = sentiments_up[sentiments_up$sentiment == 'Negative',]

Plotting Positive Sentiments

For aesthetics I used a Windows font

windowsFonts(Lato.Regular = windowsFont('Lato Regular'))
 
 # PLotting the positive sentiments
 ggplot(aes(x = freq_up, y = text, label = freq_up, color=freq_up), 
 data = sent_pos_up[sent_pos_up$freq_up>5,]) + 
 geom_point(size=4) +
 geom_text(size=3, 
           position = position_nudge(x = 4), 
           family = "Lato.Regular") +
 theme_gray(base_family = "Lato.Regular") +
 ylab("Positive Sentiments") +
 xlab("Frequencies")

The output shows that win is the most frequent word. The words in the graph below are limited by their frequencies. Words that appear more than 5 times in the term-document matrix are plotted here.

Word Cloud

## Wordcloud positive sentiments
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
wordcloud(sent_pos_up$text,sent_pos_up$freq,min.freq=5,colors=brewer.pal(6,"Dark2"))

Plotting Negative Sentiments

ggplot(aes(x = freq_up, y = text, label = freq_up, color=freq_up), 
       data = sent_neg_up[sent_neg_up$freq_up>5,]) + 
  geom_point(size=4) +
  geom_text(size=3, position = position_nudge(x = 4), family = "Lato.Regular") +
  theme_gray(base_family = "Lato.Regular") +
  ylab("Negative Sentiments") +
  xlab("Frequencies")

Wordcloud of Negative Sentiments

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
set.seed(100)
wordcloud(sent_neg_up$text,sent_neg_up$freq, min.freq=5,colors=brewer.pal(6 ,"Dark2"))

Approach 2: Using 'syuzhet' package

‘Syuzhet’ essentially means the manner in which the elements of the story are organised. The package in R does so by revealing the latent structure using sentiment analysis. It considers 9 emotional expressions excluding positive and negative sentiments.

So start we need to first load the .csv file and extract the text only.

text = as.character(tweets$text)

Data Pre-processing

Now the data is loaded we need to take care of HTML tags, punctuations, username, retweets and numbers. Let’s remove them one by one.

##removing Retweets
some_txt = gsub("(RT|via)((?:\\b\\w*@\\w+)+)","",text)
##let's clean html links
some_txt = gsub("http[^[:blank:]]+","",some_txt)
##let's remove people names
some_txt = gsub("@\\w+","",some_txt)
##let's remove punctuations
some_txt = gsub("[[:punct:]]"," ",some_txt)
##let's remove number (alphanumeric)
some_txt = gsub("[^[:alnum:]]"," ",some_txt)

Calculating Sentiment score

Once the data is cleaned, we are ready to calculate sentiment score for each emotion

library(syuzhet)
mysentiment = get_nrc_sentiment((some_txt))
mysentiment.positive =sum(mysentiment$positive)
mysentiment.anger =sum(mysentiment$anger)
mysentiment.anticipation =sum(mysentiment$anticipation)
mysentiment.disgust =sum(mysentiment$disgust)
mysentiment.fear =sum(mysentiment$fear)
mysentiment.joy =sum(mysentiment$joy)
mysentiment.sadness =sum(mysentiment$sadness)
mysentiment.surprise =sum(mysentiment$surprise)
mysentiment.trust =sum(mysentiment$trust)
mysentiment.negative =sum(mysentiment$negative)

Sentiment Plot

In the plot, there are a higher number of positive sentiments forllowed by anticipation, owing the the fact that the tweets were taken just a few days after the movie was released and hence people were anticipating the movie would be good. The movie was intented to be funny, which is reflected in the chart as 'Joy' being the third leading sentiment.

Conclusion

Using both ‘tm’ and ‘syuzhet’ package it turns out that lego batman movie had a higher number of positive sentiments than the negative ones. This means that movie was well received by the audience. Also one could also notice that most negative sentiments were also related to the joker

As it turns out that movie is not looking for positive sentiments only, one can define the success of the movie by scoring the negative sentiments too. Because in the end, that is what lego batman movie wanted to achieve. The higher number of negative sentiments on joker is a true indicator of the success of the movie.

Previous Facebook Scraping with Netvizz

Next Complete Guide to Matplotlib | Jupyter Notebook