Lovecraft with NLP 2: Tokenisation and Word Counts

Using spaCy, a Python NLP library, to analyse word usage in H. P. Lovecraft’s stories.

This is the second blog post in my series in which I analyse the works of H. P. Lovecraft through the lens of Natural Language Processing. In the first post, we kicked things off by analysing the overall sentiment of the stories. Now we can’t postpone it any longer: we have to talk about the basis of all NLP analysis, tokenisation. I thought it would be fun to pair this with a concrete question, so in this post we are going to find the words Lovecraft used the most, both in each story and in his literary work as a whole.

One of the things you remember most about reading Lovecraft is the odd language. He uses a lot of negative words about horror, madness, and depression, especially adjectives: unnamable, indescribable, unspeakable horrors everywhere, right? There is something about this passage from The Outsider (1921):

God knows it was not of this world — or no longer of this world — yet to my horror I saw in its eaten-away and bone-revealing outlines a leering, abhorrent travesty on the human shape; and in its mouldy, disintegrating apparel an unspeakable quality that chilled me even more.

that just screams Lovecraft. On Arkham Archivist (the website from which I downloaded the texts), there is a separate page dedicated to word counts: people submitted suggestions, and the person who collected the stories counted them. I was curious to see whether this notion can be observed “scientifically”. Are the stories as negative as we remember? Are the most used adjectives “horrible”, “unknown”, and “ancient”? Are the verbs about knowledge and/or going crazy? Does he ever use the word “woman”? Well, let’s find out!

We are going to start with a quick theory background, do some practical prep-work on the text, look at a couple of examples in spaCy, and then finally get on with the word counting.

Theory

First, let’s have a look at a few concepts that we are going to use in this project.

  • tokenisation: a document segmentation technique that breaks unstructured (text) data into small pieces that can be counted as discrete elements. In our analysis, individual tokens are going to be words, but that is not necessarily the case; a token can also be a paragraph, a sentence, a part of a word, or even a single character.
  • bag-of-words: an unordered, aggregated representation of a larger volume of text, which can be a document, chapter, paragraph, sentence, etc. The grammar, punctuation, and word order of the original text are ignored; the only things kept are the unique words and a number attached to each. That number can be the frequency with which the word occurs in the text, or a binary 0 or 1 simply indicating whether the word appears at all.
  • normalisation: in the context of NLP tokens, normalisation is a group of techniques that take the original tokens as they appear in the text and convert them to another form. This form can be the root of the word: for example, when we count the word “house” in a text, we probably also want to include “houses”, and “House” when it is capitalised at the beginning of a sentence. There are many different approaches; the one we are going to use is lemmatisation, but stemming and even simply lower-casing the text are also forms of normalisation.
  • lemmatisation: the process of finding the lemma of each word. Lemmas are basically the dictionary forms of words: the lemma of “houses” is “house”, “better” becomes “good”, “thought” becomes “think”, etc.
  • stop words: words that are so frequent in a language that their presence in a text carries little information about the document’s topic, for example “a”, “why”, “do”, etc. For certain NLP tasks (like ours), it makes sense to ignore these words. We will see all of these concepts in action in the two short sketches below.
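To make these ideas concrete, here is a minimal sketch of what spaCy gives us per token: the raw text, the lemma, and flags for stop words and punctuation. I’m assuming the small English model (en_core_web_sm) here purely for illustration; any English pipeline would do, and it has to be downloaded separately.

```python
import spacy

# Load an English pipeline; the small model is an assumption here and
# has to be downloaded first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# A shortened line from the quote above
doc = nlp("God knows it was not of this world, yet to my horror "
          "I saw a leering, abhorrent travesty on the human shape.")

# Tokenisation happens automatically: doc is a sequence of Token objects.
# Each token carries its lemma and stop-word / punctuation flags.
for token in doc:
    print(f"{token.text:>10}  lemma={token.lemma_:<10} "
          f"stop={token.is_stop}  punct={token.is_punct}")
```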
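With those pieces in place, a bag-of-words is nothing more than a frequency count over the normalised tokens. Continuing from the doc object in the sketch above, here is one way to build it with Python’s collections.Counter:

```python
from collections import Counter

# Bag-of-words: count lower-cased lemmas, ignoring stop words,
# punctuation, and whitespace, as described in the list above
bag = Counter(
    token.lemma_.lower()
    for token in doc
    if not (token.is_stop or token.is_punct or token.is_space)
)

print(bag.most_common(3))  # the three most frequent (lemma, count) pairs
```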


https://towardsdatascience.com/lovecraft-with-natural-language-processing-part-2-tokenisation-and-word-counts-f970f6ff5690
