Lovecraft with Natural Language Processing — Part 1: Rule-Based Sentiment Analysis
I’ve been considering doing a Natural Language Processing project for a while now, and I finally decided to do a comprehensive analysis of a corpus taken from literature. I think classical literature is a really interesting application of NLP: you can showcase a wide array of topics, from word counts and sentiment analysis to neural network text generation.
I picked H. P. Lovecraft’s stories as the subject of my analysis. He was an American writer from the early 20th century, famous for his weird and cosmic horror fiction, with a huge influence on modern pop culture. If you want to read more about why I think he is a great choice for a project like this, have a look at my previous post, in which I also describe how I prepared the text we are going to analyse now.
In the first post of the NLP series, we are going to have a closer look at rule-based sentiment analysis. Sentiment analysis in general is definitely not the best first step to take in the world of NLP. However, the good thing about the rule-based variety is that it requires minimal pre-work. I wanted to start the project with something that can immediately be played with, without being bogged down by the technical details of tokenisation, lemmas and whatnot.
We are going to apply VADER-Sentiment-Analysis to rank the writings of Lovecraft, from most negative to most positive. According to its GitHub page, VADER is “specifically attuned to sentiments expressed in social media”. In other words, the library’s lexicon and rules were built around modern, slang-heavy text rather than learned from a training set. This should be interesting because there were three things Lovecraft abhorred from the bottom of his heart:
- eldritch beings from other dimensions;
- people who are not well-educated Anglo-Saxon males;
- slang.
Sentiment Analysis
Let’s start with a quick summary. Sentiment analysis is a branch of NLP that deals with classifying the emotions expressed in text data. You input a text, and the algorithm tells you how positive or negative it is. There are two main approaches one can take:
- machine learning: The algorithm learns from data. For example, you have 10,000 movie reviews, each with the number of stars the user gave and the text of the review. You can train a classification model with the text as features and the number of stars as the target variable. (Of course, transforming the unstructured text data into measurable features is going to be a challenge, but it’s fortunately not in the scope of this post.)
- rule-based: The algorithm calculates the sentiment score from a set of manually created rules. For example, you can count the number of times someone used “great” in their review, and increase the estimated sentiment for each occurrence. It sounds very simplistic, but that is basically what’s happening, just on a much larger scale.
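To make the rule-based idea concrete, here is a minimal sketch of the “count the word great” approach. It is only an illustration: the `TOY_LEXICON` dictionary, its weights and the `rule_based_score` function are names and values I made up for this example, and they are not how VADER itself is implemented.

```python
import re

# Hypothetical toy lexicon: word -> sentiment weight.
# These words and weights are invented for illustration only.
TOY_LEXICON = {
    "great": 1.0,
    "good": 0.5,
    "bad": -0.5,
    "terrible": -1.0,
}


def rule_based_score(text, lexicon=TOY_LEXICON):
    """Add up the weight of every lexicon word found in the text."""
    # Lower-case the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon contribute nothing to the score.
    return sum(lexicon.get(word, 0.0) for word in words)


print(rule_based_score("A great film with a great cast."))      # 2.0
print(rule_based_score("Terrible pacing, but a good ending."))  # -0.5
```

Each “great” adds one point, exactly as in the example above. A real rule-based library works with a far larger lexicon and extra rules on top of it, but the core mechanism is the same kind of lookup and sum.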
We are going with the rule-based approach in our current project, so let’s have a closer look at our library next!