Lovecraft with NLP 2: Tokenisation and Word Counts

  



Using spaCy, a Python NLP library, to analyse word usage in H. P. Lovecraft’s stories.


This is the second blog post in my series in which I analyse the works of H. P. Lovecraft through the lens of Natural Language Processing. In the first post, we kicked things off by analysing the overall sentiment of the stories. We can’t postpone it any longer: we have to talk about the basis of all NLP analysis, tokenisation. I thought it would be fun to pair this with a particular question, so in this post we are going to find out which words Lovecraft used the most in each story, and in his literary work as a whole.

One of the things you remember most about reading Lovecraft is the odd language. He tends to use lots of negative words (horror, madness, depression), especially adjectives: unnamable, indescribable, unspeakable horrors everywhere, right? There is something about this section from The Outsider (1921):

God knows it was not of this world — or no longer of this world — yet to my horror I saw in its eaten-away and bone-revealing outlines a leering, abhorrent travesty on the human shape; and in its mouldy, disintegrating apparel an unspeakable quality that chilled me even more.

that screams Lovecraft. On Arkham Archivist (the website from which I downloaded the texts), there is a separate page dedicated to word counts: people submitted suggestions, and the person who collected the stories counted them. I was curious to see whether this notion can be observed “scientifically”. Are the stories as negative as we thought? Are the most used adjectives really “horrible” and “unknown” and “ancient”? Are the verbs about knowledge and/or going crazy? Does he ever use the word “woman”? Well, let’s find out!

We are going to start with a bit of theory background, do some practical prep-work on the text, have a look at a couple of examples in spaCy, and then finally get on with the word counting.

Theory

First, let’s have a look at a few concepts that we are going to use in this project.

  • tokenisation: a document segmentation technique that breaks unstructured (text) data into small pieces that can be counted as discrete elements. In our analysis, individual tokens are going to be words, but that’s not necessarily the case: a token can also be a paragraph, a sentence, part of a word, or even a single character.
  • bag-of-words: an unordered, aggregated representation of a larger volume of text, which can be a document, chapter, paragraph, sentence, etc. The grammar, punctuation, and word order of the original text are ignored; the only things kept are the unique words and a number attached to each of them. That number can be the frequency with which the word occurs in the text, or it can be a binary 0 or 1, simply recording whether the word appears in the text or not.
  • normalisation: in the context of NLP tokens, normalisation is a group of techniques that take the original tokens as they appear in the text and convert them to another form. This form can be the root of the word: for example, when we count the word “house” in a text, we probably also want to include “houses”, and “House” when it is capitalised at the beginning of a sentence. There are many different approaches; the one we are going to use is lemmatisation, but stemming and even simply taking the lower-case version of the text are also forms of normalisation.
  • lemmatisation: the process of finding the lemma of the words. Lemmas are basically the dictionary forms of words. For example, the lemma of “houses” is “house”, “better” becomes “good”, “thought” becomes “think”, etc.
  • stop words: words that are so frequent in a language that their presence in a text does not carry much meaning regarding the topic of the document, for example “a”, “why”, “do”. For certain NLP tasks (like ours), it makes sense to ignore these words. The short sketch after this list shows how these concepts fit together in code.
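
To make these concepts a bit more tangible before we get to the actual stories, here is a minimal sketch of how tokenisation, lemmatisation, stop word removal, and a simple bag-of-words count might look in spaCy. The sample sentence and the en_core_web_sm model are just illustrative choices for this snippet, not necessarily the exact setup used later in the project.

```python
from collections import Counter

import spacy

# Load a small English pipeline
# (assumes `python -m spacy download en_core_web_sm` has been run).
nlp = spacy.load("en_core_web_sm")

text = (
    "God knows it was not of this world, yet to my horror I saw "
    "an unspeakable quality that chilled me even more."
)

doc = nlp(text)

# Tokenisation: spaCy splits the text into Token objects.
tokens = [token.text for token in doc]

# Normalisation via lemmatisation, dropping stop words and punctuation.
lemmas = [
    token.lemma_.lower()
    for token in doc
    if not token.is_stop and not token.is_punct and not token.is_space
]

# Bag-of-words: an unordered word -> frequency mapping.
word_counts = Counter(lemmas)

print(tokens[:8])
print(word_counts.most_common(5))
```

Running something like this over a whole story instead of a single sentence is, in essence, all the word counting we need.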


https://towardsdatascience.com/lovecraft-with-natural-language-processing-part-2-tokenisation-and-word-counts-f970f6ff5690


Lovecraft with Natural Language Processing — Part 1: Rule-Based Sentiment Analysis

I’ve been considering doing a Natural Language Processing project for a while now, and I finally decided to do a comprehensive analysis of a corpus taken from literature. I think classical literature is a really interesting application of NLP: you can showcase a wide array of topics, from word counts and sentiment analysis to neural network text generation.

I picked H. P. Lovecraft’s stories as a subject of my analysis. He was an American writer from the early 20th century, famous for his weird and cosmic horror fiction, with a huge influence on modern pop culture. If you want to read more about why I think he is a great choice for a project like this, have a look at my previous post in which I also describe how I prepared the text we are going to analyse now.

In the first post of the NLP series, we are going to have a closer look at rule-based sentiment analysis. Sentiment analysis in its wider scope is definitely not the best first step to take in the world of NLP; however, the good thing about the rule-based variety is that it requires minimal pre-work. I wanted to start the project with something that can immediately be played with, without getting bogged down in the technical details of tokenisation and lemmas and whatnot.

We are going to apply VADER-Sentiment-Analysis to rank the writings of Lovecraft, from most negative to most positive. According to its GitHub page, VADER is “specifically attuned to sentiments expressed in social media”. In other words, the library is trained on a modern, slang-heavy corpus. This should be interesting because there were three things Lovecraft abhorred from the bottom of his heart:

  1. eldritch beings from other dimensions;
  2. people who are not well-educated Anglo-Saxon males;
  3. slang.

Sentiment Analysis

Let’s start with a quick summary. Sentiment analysis is a branch of NLP that deals with the classification of emotions based on text data. You input a text, and the algorithm tells you how positive or negative it is. There are two main approaches one can take:

machine learning: the algorithm learns from data. For example, you have 10,000 movie reviews, with the number of stars each user gave and the text of the review. You can train a classification model with the text as features and the number of stars as the target variable. (Of course, transforming the unstructured text data into measurable features is going to be a challenge, but it’s fortunately not in the scope of this post.)

rule-based: the algorithm calculates the sentiment score from a set of manually created rules. For example, you can count the number of times someone used “great” in their review, and increase the estimated sentiment score for each occurrence. It sounds very simplistic, but that is basically what’s happening, just on a much larger scale; the toy sketch below illustrates the principle.
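
To illustrate the principle (and only the principle; this is not how VADER itself works internally), a deliberately naive rule-based scorer could look something like the snippet below. The two word lists are made up for the sake of the example.

```python
# Toy rule-based sentiment: count hits against tiny, hand-made word lists.
POSITIVE_WORDS = {"great", "wonderful", "marvellous"}
NEGATIVE_WORDS = {"horror", "abhorrent", "unspeakable"}


def naive_sentiment(text: str) -> int:
    """Return a crude score: +1 per positive word, -1 per negative word."""
    score = 0
    for word in text.lower().split():
        word = word.strip(".,;:!?")
        if word in POSITIVE_WORDS:
            score += 1
        elif word in NEGATIVE_WORDS:
            score -= 1
    return score


print(naive_sentiment("A leering, abhorrent travesty of unspeakable horror."))  # -3
```

A real lexicon-based tool works with thousands of scored words and extra rules for things like negation and intensifiers, but the core idea is the same.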

We are going with the rule-based approach in our current project, so let’s have a closer look at our library next!
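
As a quick preview of how the library is typically called (a minimal sketch, assuming the vaderSentiment package is installed), scoring a single piece of text looks something like this; ranking the stories then just means running the same call over each text and sorting by the compound score.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

sentence = "To my horror I saw a leering, abhorrent travesty on the human shape."

# polarity_scores returns the negative, neutral and positive proportions of the
# text, plus a normalised 'compound' score between -1 (most negative) and +1.
scores = analyzer.polarity_scores(sentence)
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```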
