Do Intelligent Robots Need Emotion?

What's your opinion?

The Robotic Future of Artificial Intelligence and Natural Language Processing (NLP)


.

The main goal of AI research is to program machines and software to mirror the ability and improvisation of the human mind. When it comes to business processes, the underlying task in achieving human-like cognition is to automate repetitive and lower-priority work. This allows employees to focus on more critical tasks and on long-term strategy and management. To free up that time for more important work, organizations have been using machines powered by artificial intelligence to handle routine jobs.

.

What Can Come From the Future of AI In NLP

As artificial intelligence becomes better equipped to comprehend human communication, more businesses will adopt the technology in communication-heavy areas where Natural Language Processing (NLP) can make a difference. AI technology is already being used in these areas:

.

Customer Service

The expansion of customer service to include emotionally intelligent chatbots has been under way for some time. Chatbots can understand text written in natural language and respond with answers to basic questions and simple problem resolutions. Some bots are equipped with NLP so capable that people cannot tell whether they are talking to a human or a bot. With further advances, natural language processing will allow AI-powered virtual customer service representatives and voice assistants to communicate vocally to solve complex problems. Potentially, these bots could be used for technical support, providing responses and services and recording notes for field staff.

.

Smart Home and Office Assistants

We’ve all had experience interacting with virtual assistants on web and mobile devices. They are becoming steadily better at completing basic operations by listening to and understanding common voice commands. As AI technology continues to advance, we will soon have in-vehicle voice assistants that can carry out various vehicle operations and other complex commands.

.

Homes furnished with smart amenities have in-home assistants that use natural language processing to recognize commands. As this technology advances, voice assistants like Alexa will be able to understand young children who have not yet developed clear speech, as well as people from different regions of the world who have accents or speak multiple languages. They will not only listen to voice commands but also respond in a natural manner.

.

Healthcare Filling and Recording

Physicians spend more time filling in health record documents than they do consulting with patients, and the medical industry serves billions of people a year. To make better use of physicians' time, prevent burnout and give patients better care, AI technology powered by natural language processing can capture dictated observations and details, automating much of the work of filling in the electronic health record (EHR).

.

Human Robotics

Just a few years ago, robots that can move, think and speak like humans seemed out of reach, but they may soon become commonplace. Humanoid robots that can function like humans are being developed to help organizations with tasks that are time consuming and predominantly unsafe for employees, including manufacturing. To achieve this, robots will need to comprehend human speech reliably, making natural language processing more important than ever. Until NLP is perfected, misinterpreted commands can lead a robot to perform an unwanted action.

.

There are many aspects of artificial intelligence and natural language processing that can be applied to our everyday lives and to everyday processes within organizations. Given the current capabilities of AI and machine learning, and the continuous advances in AI paired with NLP, the prospect of machines that can listen to and comprehend written and spoken language like humans makes the future of AI exciting.

.

Source:

https://blog.vsoftconsulting.com/blog/the-robotic-future-of-artificial-intelligence-and-natural-language-processing-nlp

Read More

What is Artificial Intelligence?

.

Artificial intelligence (AI) is the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.

.

The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience. 

.

Since the development of the digital computer in the 1940s, it has been demonstrated that computers can be programmed to carry out very complex tasks—as, for example, discovering proofs for mathematical theorems or playing chess—with great proficiency. 

.

Still, despite continuing advances in computer processing speed and memory capacity, there are as yet no programs that can match human flexibility over wider domains or in tasks requiring much everyday knowledge. 

.

On the other hand, some programs have attained the performance levels of human experts and professionals in performing certain specific tasks, so that artificial intelligence in this limited sense is found in applications as diverse as medical diagnosis, computer search engines, and voice or handwriting recognition.

.

https://www.britannica.com/technology/artificial-intelligence

Read More

One-Hot Encoding vs Word Embedding

 .

Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are integer encoding and one-hot encoding, although a newer technique called learned embedding may provide a useful middle ground between these two methods; a short sketch of all three follows the list below.

In this tutorial, you will discover how to encode categorical data when developing neural network models in Keras.

After completing this tutorial, you will know:

(1) The challenge of working with categorical data when using machine learning and deep learning models.

(2) How to integer encode and one hot encode categorical variables for modeling.

(3) How to learn an embedding distributed representation as part of a neural network for categorical variables.
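
As a rough illustration of the three options, here is a minimal sketch assuming TensorFlow/Keras and scikit-learn are installed; the "colour" variable is an invented toy example rather than data from the tutorial:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Embedding

colours = np.array(["red", "green", "blue", "green", "red"])

# (1) Integer encoding: each category becomes an arbitrary ordinal number.
integers = LabelEncoder().fit_transform(colours)
print(integers)              # [2 1 0 1 2] (categories are ordered alphabetically)

# (2) One-hot encoding: each integer becomes a sparse binary vector.
one_hot = to_categorical(integers)
print(one_hot.shape)         # (5, 3) -- one column per category

# (3) Learned embedding: each category maps to a small dense vector whose values
#     are trained together with the rest of the network.
embedding = Embedding(input_dim=3, output_dim=2)
print(embedding(integers.reshape(1, -1)).numpy().shape)   # (1, 5, 2)
```

The embedding weights start out random and only become meaningful once the layer is trained inside a model, which is what the tutorial walks through.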

https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/

Read More

Lovecraft with NLP 2 : Tokenisation and Word Counts

  


.

Using spaCy, a Python NLP library, to analyse word usage in H. P. Lovecraft’s stories.

.

This is the second blog post in my series in which I analyse the works of H. P. Lovecraft through the lens of Natural Language Processing. In the first post, we kicked things off by analysing the overall sentiment of the stories. We can't postpone this any longer: we have to talk about the basis of all NLP analysis, tokenisation. I thought it would be fun to pair this with a particular question, so in this post, we are going to find which words Lovecraft used the most in each story, and in his literary work combined.

One of the things you remember the most about reading Lovecraft is the odd language. He tends to use lots of negative words (horror, madness, depression), especially adjectives: unnamable, indescribable, unspeakable horrors everywhere, right? There is something about this section from The Outsider (1921):

God knows it was not of this world — or no longer of this world — yet to my horror I saw in its eaten-away and bone-revealing outlines a leering, abhorrent travesty on the human shape; and in its mouldy, disintegrating apparel an unspeakable quality that chilled me even more.

that screams Lovecraft. On Arkham Archivist (the website from which I downloaded the text), there is a separate page dedicated to word counts: people submitted suggestions, and the person who collected the stories counted them. I was curious to see whether this notion could be observed “scientifically”. Are the stories as negative as we thought? What are the most used adjectives, are they “horrible” and “unknown” and “ancient”? Are verbs about knowledge and/or going crazy? Does he ever use the word “woman”? Well, let's find out!

We are going to start with a quick theory background, apply some practical prep-work on the text, have a look at a couple of examples in spaCy, and then finally get on with word counting.

Theory

First, let's have a look at a few concepts that we are going to use in this project; a short spaCy sketch after the list shows them in action.

  • tokenisation: a kind of document segmentation technique that breaks unstructured (text) data into small pieces that can be counted as discrete elements. In our analysis, individual tokens are going to be words, but that's not necessarily the case; a token can be a paragraph, a sentence, a part of a word, or even a character.
  • bag-of-words: an unordered, aggregated representation of a larger volume of text, which can be a document, chapter, paragraph, sentence, etc. The grammar, punctuation, and word order of the original text are ignored; the only things kept are the unique words and a number attached to each. That number can be the frequency with which the word occurs in the text, or it can be a binary 0 or 1, simply recording whether the word is in the text or not.
  • normalisation: in the NLP token context, normalisation is a group of techniques that take the original tokens as they appear in the text and convert them to another form. This form can be the root of the word: for example, when we count the word “house” in a text, we probably also want to include “houses”, and “House” when it is capitalised at the beginning of a sentence. There are many different approaches; the one we are going to use is lemmatisation, but stemming and even lower-casing the text are also normalisations.
  • lemmatisation: the process of finding the lemma of each word. Lemmas are basically the dictionary forms of words. For example, the lemma of “houses” is “house”, “better” becomes “good”, “thought” becomes “think”, etc.
  • stop words: words that are so frequent in a language that their presence in a text does not say much about the topic of the document. For example, “a”, “why”, “do”, etc. For certain NLP tasks (like ours), it makes sense to ignore these words.
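
Here is a minimal sketch of these steps with spaCy, assuming the small English model has been downloaded with `python -m spacy download en_core_web_sm`; the sample sentence is just for illustration:

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("God knows it was not of this world, yet to my horror I saw a leering, "
        "abhorrent travesty on the human shape in its mouldy, disintegrating apparel.")

doc = nlp(text)  # tokenisation, tagging and lemmatisation all happen in this call

# Normalise to lemmas, keeping only alphabetic tokens that are not stop words.
lemmas = [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]
print(Counter(lemmas).most_common(5))

# The same idea restricted to adjectives, for the "most used adjectives" question.
adjectives = [tok.lemma_.lower() for tok in doc if tok.pos_ == "ADJ"]
print(Counter(adjectives).most_common(5))
```

Running the same counting over every story, instead of one sentence, is essentially all the word-count analysis in this post does.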

.

https://towardsdatascience.com/lovecraft-with-natural-language-processing-part-2-tokenisation-and-word-counts-f970f6ff5690

.

Read More

Lovecraft with NLP 1 : Rule-Based Sentiment Analysis



.

Lovecraft with Natural Language Processing — Part 1: Rule-Based Sentiment Analysis

I’ve been considering doing a Natural Language Processing project for a while now, and I finally decided to do a comprehensive analysis of a corpus taken from literature. I think classical literature is a really interesting application of NLP, you can showcase a wide array of topics from word counts and sentiment analysis to neural network text generation.

I picked H. P. Lovecraft’s stories as a subject of my analysis. He was an American writer from the early 20th century, famous for his weird and cosmic horror fiction, with a huge influence on modern pop culture. If you want to read more about why I think he is a great choice for a project like this, have a look at my previous post in which I also describe how I prepared the text we are going to analyse now.

In the first post of the NLP series, we are going to have a closer look at rule-based sentiment analysis. The wider field of sentiment analysis is definitely not the best first step to take in the world of NLP; however, the good thing about rule-based sentiment analysis is that it requires minimal pre-work. I wanted to start the project with something that can immediately be played with, without being bogged down by the technical details of tokenisations and lemmas and whatnot.

We are going to apply VADER-Sentiment-Analysis to rank the writings of Lovecraft, from most negative to most positive. According to its GitHub page, VADER is “specifically attuned to sentiments expressed in social media”. In other words, the library is trained on a modern, slang-heavy corpus. This should be interesting because there were three things Lovecraft abhorred from the bottom of his heart:

  1. eldritch beings from other dimensions;
  2. people who are not well-educated Anglo-Saxon males;
  3. slang.

Sentiment Analysis

Let's start with a quick summary. Sentiment analysis is a type of NLP task that deals with the classification of emotions based on text data. You input a text, and the algorithm tells you how positive or negative it is. There are two main approaches one can take:

machine learning: The algorithm learns from data. For example, you have 10,000 movie reviews, with the number of stars each user gave and the text of the reviews. You can train a classification model with the text as features and the number of stars as the target variable. (Of course, transforming the unstructured text data into measurable features is going to be a challenge, but it's fortunately not in the scope of this post.)

rule-based: The algorithm calculates the sentiment score from a set of manually created rules. For example, you can count the number of times someone used “great” in their review and increase the estimated sentiment for each occurrence. It sounds very simplistic, but that is basically what's happening, just on a much larger scale.

We are going with the rule-based approach in our current project, so let's have a closer look at our library next!
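
As a taste of what that looks like in practice, here is a minimal sketch assuming the vaderSentiment package is installed (`pip install vaderSentiment`); the sentences are illustrative, not drawn from the prepared corpus:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

sentences = [
    "I saw an unspeakable, abhorrent travesty of the human shape.",
    "The garden was lovely and the evening air smelled sweet.",
]

for sentence in sentences:
    scores = analyzer.polarity_scores(sentence)
    # 'compound' is a normalised score in [-1, 1]; averaging it over the
    # sentences of a story is one simple way to rank the stories.
    print(f"{scores['compound']:+.3f}  {sentence}")
```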

.

Read More

What is Language Modeling?

.

Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions. They are used in natural language processing (NLP) applications, particularly ones that generate text as an output. Some of these applications include machine translation and question answering.


How language modeling works

Language models determine word probability by analyzing text data. They interpret this data by feeding it through an algorithm that establishes rules for context in natural language. Then, the model applies these rules in language tasks to accurately predict or produce new sentences. The model essentially learns the features and characteristics of basic language and uses those features to understand new phrases.


There are several different probabilistic approaches to modeling language, which vary depending on the purpose of the language model. From a technical perspective, the various types differ by the amount of text data they analyze and the math they use to analyze it. For example, a language model designed to generate sentences for an automated Twitter bot may use different math and analyze text data in a different way than a language model designed for determining the likelihood of a search query.


Some common statistical language modeling types are:


N-gram. N-grams are a relatively simple approach to language models. They create a probability distribution for a sequence of n words. The n can be any number and defines the size of the "gram", or sequence of words being assigned a probability. For example, if n = 5, a gram might look like this: "can you please call me." The model then assigns probabilities to sequences of n words. Basically, n can be thought of as the amount of context the model is told to consider. Some types of n-grams are unigrams, bigrams, trigrams and so on.
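
As a rough illustration, here is a minimal sketch of a bigram (n = 2) model estimated by simple counting over a toy corpus; real language models are, of course, trained on far larger text collections:

```python
from collections import Counter

corpus = "can you please call me when you can".split()

# Count bigrams and the contexts (previous words) they start from.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def bigram_prob(prev, nxt):
    # P(next | previous) = count(previous, next) / count(previous)
    return bigrams[(prev, nxt)] / contexts[prev] if contexts[prev] else 0.0

print(bigram_prob("you", "please"))   # 0.5 -- "you" is followed by "please" once out of twice
print(bigram_prob("you", "can"))      # 0.5
```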

Unigram. The unigram is the simplest type of language model. It doesn't look at any conditioning context in its calculations. It evaluates each word or term independently. Unigram models commonly handle language processing tasks such as information retrieval. The unigram is the foundation of a more specific model variant called the query likelihood model, which uses information retrieval to examine a pool of documents and match the most relevant one to a specific query.
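
A hypothetical, stripped-down version of that query likelihood idea, scoring two toy "documents" against a query with a unigram model (real systems add smoothing and much larger collections):

```python
from collections import Counter

docs = {
    "doc1": "the cat sat on the mat".split(),
    "doc2": "dogs chase birds in the park".split(),
}

def query_likelihood(query, doc_tokens):
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    score = 1.0
    for term in query.split():
        # P(term | document) under a unigram model; unseen terms get a tiny
        # floor value here instead of proper smoothing, to keep the sketch short.
        score *= counts[term] / total if counts[term] else 1e-6
    return score

query = "the cat"
ranked = sorted(docs, key=lambda name: query_likelihood(query, docs[name]), reverse=True)
print(ranked)   # 'doc1' ranks first for this query
```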

Bidirectional. Unlike n-gram models, which analyze text in one direction (backwards), bidirectional models analyze text in both directions, backwards and forwards. These models can predict any word in a sentence or body of text by using every other word in the text. Examining text bidirectionally increases result accuracy. This type is often utilized in machine learning and speech generation applications. For example, Google uses a bidirectional model to process search queries.
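
A hedged sketch of a bidirectional model at work, using Hugging Face's transformers pipeline with a pretrained BERT checkpoint (downloaded on first run); the sentence is just an example:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words on BOTH sides of the [MASK] token to predict it.
for prediction in fill_mask("The doctor listened to the patient's [MASK].")[:3]:
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```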

Exponential. Also known as maximum entropy models, this type is more complex than n-grams. Simply put, the model evaluates text using an equation that combines feature functions and n-grams. Basically, this type specifies features and parameters of the desired results and, unlike n-grams, leaves the analysis parameters more ambiguous -- it doesn't specify individual gram sizes, for example. The model is based on the principle of maximum entropy, which states that the probability distribution with the most entropy is the best choice. In other words, the model with the most chaos, and least room for assumptions, is the most accurate. Exponential models are designed to maximize entropy, which minimizes the number of statistical assumptions that can be made. This enables users to better trust the results they get from these models.

Continuous space. This type of model represents words as a non-linear combination of weights in a neural network. The process of assigning a weight to a word is also known as word embedding. This type becomes especially useful as data sets get increasingly large, because larger datasets often include more unique words. The presence of many unique or rarely used words can cause problems for a linear model like an n-gram, because the number of possible word sequences increases and the patterns that inform results become weaker. By weighting words in a non-linear, distributed way, this model can "learn" to approximate words and therefore not be misled by any unknown values. Its "understanding" of a given word is not as tightly tethered to the immediate surrounding words as it is in n-gram models.
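
For a concrete feel, here is a minimal sketch of a continuous-space (neural) language model in Keras, assuming TensorFlow 2.x; the corpus, the one-word context and the hyperparameters are toy choices for illustration only:

```python
import numpy as np
import tensorflow as tf

corpus = "the old house stood on the old hill above the old town".split()

# Build a vocabulary and integer-encode the corpus.
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = np.array([word_to_id[w] for w in corpus])

X = ids[:-1].reshape(-1, 1)   # previous word
y = ids[1:]                   # next word to predict

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=8),  # learned dense vectors
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(len(vocab), activation="softmax"),        # next-word distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=300, verbose=0)

# After training, the words that actually follow "old" in the corpus
# (house, hill, town) should carry most of the probability mass.
probs = model.predict(np.array([[word_to_id["old"]]]), verbose=0)[0]
print(sorted(zip(vocab, probs.round(2)), key=lambda p: -p[1])[:3])
```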

The models listed above are more general statistical approaches from which more specific variant language models are derived. For example, as mentioned in the n-gram description, the query likelihood model is a more specific or specialized model that uses the n-gram approach. Model types may be used in conjunction with one another.


The models listed also vary significantly in complexity. Broadly speaking, more complex language models are better at NLP tasks, because language itself is extremely complex and always evolving. Therefore, an exponential model or continuous space model might be better than an n-gram for NLP tasks, because they are designed to account for ambiguity and variation in language.


A good language model should also be able to process long-term dependencies, handling words that may derive their meaning from other words that occur in far-away, disparate parts of the text. An LM should be able to understand when a word is referencing another word from a long distance, as opposed to always relying on proximal words within a certain fixed history. This requires a more complex model.


Importance of language modeling

Language modeling is crucial in modern NLP applications. It is the reason that machines can understand qualitative information. Each language model type, in one way or another, turns qualitative information into quantitative information. This allows people, to a limited extent, to communicate with machines as they do with each other.


It is used directly in a variety of industries including tech, finance, healthcare, transportation, legal, military and government. Additionally, it's likely most people reading this have interacted with a language model in some way at some point in the day, whether it be through Google search, an autocomplete text function or engaging with a voice assistant.


The roots of language modeling as it exists today can be traced back to 1948. That year, Claude Shannon published a paper titled "A Mathematical Theory of Communication." In it, he detailed the use of a stochastic model called the Markov chain to create a statistical model for the sequences of letters in English text. This paper had a large impact on the telecommunications industry and laid the groundwork for information theory and language modeling. The Markov model is still used today, and n-grams specifically are tied very closely to the concept.


Uses and examples of language modeling

Language models are the backbone of natural language processing (NLP). Below are some NLP tasks that use language modeling, what they mean, and some applications of those tasks:


Speech recognition -- involves a machine being able to process speech audio. This is commonly used by voice assistants like Siri and Alexa.

Machine translation -- involves the translation of one language to another by a machine. Google Translate and Microsoft Translator are two programs that do this. SDL Government is another, which is used to translate foreign social media feeds in real time for the U.S. government.

Parts-of-speech tagging -- involves the markup and categorization of words by certain grammatical characteristics. This is utilized in the study of linguistics, first and perhaps most famously in the study of the Brown Corpus, a body of text composed of random English prose that was designed to be studied by computers. This corpus has been used to train several important language models, including one used by Google to improve search quality.

Parsing -- involves analysis of any string of data or sentence that conforms to formal grammar and syntax rules. In language modeling, this may take the form of sentence diagrams that depict each word's relationship to the others. Spell checking applications use language modeling and parsing.

Sentiment analysis -- involves determining the sentiment behind a given phrase. Specifically, it can be used to understand opinions and attitudes expressed in a text. Businesses can use this to analyze product reviews or general posts about their product, as well as analyze internal data like employee surveys and customer support chats. Some services that provide sentiment analysis tools are Repustate and Hubspot's ServiceHub. Google's NLP tool -- called Bidirectional Encoder Representations from Transformers (BERT) -- is also used for sentiment analysis.

Optical character recognition -- involves the use of a machine to convert images of text into machine-encoded text. The image may be a scanned document or document photo, or a photo with text somewhere in it -- on a sign, for example. It is often used in data entry when processing old paper records that need to be digitized. It can also be used to analyze and identify handwriting samples.

Information retrieval -- involves searching within a document for information, searching for documents in general, and searching for metadata that corresponds to a document. Web search engines are the most common information retrieval applications.

.

From:

https://www.techtarget.com/searchenterpriseai/definition/language-modeling

Read More

Everything you always wanted to know about a dataset: Studies in data summarisation

 .

Summarising data as text helps people make sense of it. It also improves data discovery, as search algorithms can match this text against keyword queries. In this paper, we explore the characteristics of text summaries of data in order to understand what meaningful summaries look like. We present two complementary studies: a data-search diary study with 69 students, which offers insight into the information needs of people searching for data; and a summarisation study, with a lab and a crowdsourcing component involving 80 data-literate participants overall, who produced summaries for 25 datasets. In each study we carried out a qualitative analysis to identify key themes and commonly mentioned dataset attributes that people consider when searching for and making sense of data. The results helped us design a template for creating more meaningful textual representations of data, alongside guidelines for improving the data-search experience overall.

.

https://www.sciencedirect.com/science/article/pii/S1071581918306153

Read More

Twitter and Research: A Systematic Literature Review Through Text Mining

 


.

Researchers have collected Twitter data to study a wide range of topics. This growing body of literature, however, has not yet been reviewed systematically to synthesize Twitter-related papers. The existing literature review papers have been limited by the constraints of traditional methods to manually select and analyze samples of topically related papers. The goals of this retrospective study are to identify dominant topics of Twitter-based research, summarize the temporal trend of topics, and interpret the evolution of topics within the last ten years. This study systematically mines a large number of Twitter-based studies to characterize the relevant literature with an efficient and effective approach. This study collected relevant papers from three databases and applied text mining and trend analysis to detect semantic patterns and explore the yearly development of research themes across a decade. We found 38 topics in more than 18,000 manuscripts published between 2006 and 2019. By quantifying temporal trends, this study found that while 23.7% of topics did not show a significant trend (P ≥ 0.05), 21% of topics had increasing trends and 55.3% of topics had decreasing trends; these hot and cold topics represent three categories: application, methodology, and technology. The contributions of this paper can be utilized in the growing field of Twitter-based research and are beneficial to researchers, educators, and publishers.

.

Like the politics-related Twitter studies, sentiment analysis has a positive slope, indicating increasing research activity over time.

.

https://ieeexplore.ieee.org/ielx7/6287639/8948470/09047963.pdf

.

Read More