The Robotic Future of Artificial Intelligence and Natural Language Processing (NLP)

The main goal of AI research is to program machines and software to mirror the ability and improvisation of the human mind. In business processes, the practical aim of human-like cognition is to automate repetitive and lower-priority work. This allows employees to focus on more critical tasks and to implement long-term strategies and management. To free up time for that more important work, organizations have been using machines powered by artificial intelligence to accomplish routine jobs.

What Can Come From the Future of AI In NLP

As artificial intelligence becomes more equipped to comprehend human communication, more businesses will adopt this technology for areas that require communication where Natural Language Processing (NLP) would make a difference. AI technology is already being used in these areas:

Customer Service

The use of emotionally intelligent chatbots in customer service has been growing rapidly for some time. Chatbots can understand text written in natural language and respond with answers to basic questions and resolutions to simple problems. Some bots are equipped with NLP so capable that people can’t tell whether they are talking to a human or a robot. With further advances, natural language processing will allow AI-powered virtual customer service representatives and voice assistants to communicate by voice and solve complex problems. Potentially, the bots could be used for technical support, providing responses and services and recording notes for field staff.

Smart Home and Office Assistants

We’ve all had experience interacting with virtual assistants via web and mobile devices. They are becoming better and better at completing basic operations by listening to and understanding common voice commands. As AI technology continues to advance, we will soon have in-vehicle voice assistants that can carry out various vehicle operations and other complex commands.

Homes furnished with smart amenities have in-home assistants that use natural language processing to recognize commands. As the technology advances, voice assistants like Alexa will be able to understand young children who have not yet developed clear speech, as well as people from different regions of the world who have accents or speak multiple languages. They will not only listen to voice commands but also respond in a natural manner.

Healthcare Filling and Recording

Physicians spend more time filling in health record documents than they do consulting with patients. The medical industry serves billions of people a year. To free up time, prevent physician burnout and provide patients with better healthcare, AI technology powered by natural language processing can take dictated observations and details and use them to fill in the electronic health record (EHR) automatically.

Human Robotics

Just a few years ago, robots that can move, think and speak like humans seemed out of reach, but they will soon become familiar. Humanoid robots that can function like humans are being developed to assist organizations with tasks that are time-consuming and predominantly unsafe for employees, including in manufacturing. To achieve this, robots will need to comprehend human speech reliably, making natural language processing more important than ever. Until NLP is good enough, a misinterpreted command can lead a robot to perform an unwanted action.

There are many aspects of artificial intelligence and natural language processing that can be applied in our everyday lives and in everyday processes within organizations. Given the current level of competency in AI and machine learning, and the continuous advances in AI paired with NLP, the possibility of machines that can listen to and comprehend written and spoken language like humans makes the future of AI exciting.

Source:

https://blog.vsoftconsulting.com/blog/the-robotic-future-of-artificial-intelligence-and-natural-language-processing-nlp

In layman's terms, the self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”). The outputs are aggregates of these interactions and attention scores.

https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a

Illustrated: Self-Attention

A step-by-step guide to self-attention with illustrations and code

Raimi Karim

Nov 18, 2019·8 min read

The illustrations are best viewed on the Desktop. A Colab version can be found here (thanks to Manuel Romero!).

Changelog:
12 Jan 2022 — Improve clarity
5 Jan 2022 — Fix typos and improve clarity

What do BERT, RoBERTa, ALBERT, SpanBERT, DistilBERT, SesameBERT, SemBERT, SciBERT, BioBERT, MobileBERT, TinyBERT and CamemBERT all have in common? And I’m not looking for the answer “BERT” 🤭.

Answer: self-attention 🤗. We are not only talking about architectures bearing the name “BERT” but, more correctly, Transformer-based architectures. Transformer-based architectures, which are primarily used in modelling language understanding tasks, eschew recurrence in neural networks and instead rely entirely on self-attention mechanisms to draw global dependencies between inputs and outputs. But what’s the math behind this?

That’s what we’re going to find out today. This post walks you through the mathematical operations involved in a self-attention module. By the end of this article, you should be able to write or code a self-attention module from scratch.

This article does not aim to provide the intuitions and explanations behind the different numerical representations and mathematical operations in the self-attention module. It also does not seek to demonstrate the why’s and how-exactly’s of self-attention in Transformers (I believe there’s a lot out there already). Note that the difference between attention and self-attention is also not detailed in this article.

Now let’s get on to it!

0. What is self-attention?

If you are wondering whether self-attention is similar to attention, the answer is yes! They fundamentally share the same concept and many common mathematical operations.

A self-attention module takes in n inputs and returns n outputs. What happens in this module? In layman’s terms, the self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”). The outputs are aggregates of these interactions and attention scores.

1. Illustrations

The illustrations are divided into the following steps:

1. Prepare inputs
2. Initialise weights
3. Derive key, query and value
4. Calculate attention scores for Input 1
5. Calculate softmax
6. Multiply scores with values
7. Sum weighted values to get Output 1
8. Repeat steps 4–7 for Input 2 & Input 3

Note
In practice, the mathematical operations are vectorised, i.e. all the inputs undergo the mathematical operations together. We’ll see this later in the Code section.

Step 1: Prepare inputs

Fig. 1.1: Prepare inputs

We start with 3 inputs for this tutorial, each with dimension 4.

Input 1: [1, 0, 1, 0] 
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]

Step 2: Initialise weights

Every input must have three representations (see diagram below). These representations are called key (orange), query (red), and value (purple). For this example, let’s say we want these representations to have a dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape of 4×3.

Note
We’ll see later that the dimension of value is also the output dimension.

Fig. 1.2: Deriving key, query and value representations from each input

To obtain these representations, every input (green) is multiplied with a set of weights for keys, a set of weights for querys (I know that’s not the correct spelling), and a set of weights for values. In our example, we initialise the three sets of weights as follows.

Weights for key:

[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

Weights for query:

[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

Weights for value:

[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]

Notes
In a neural network setting, these weights are usually small numbers, initialised randomly using an appropriate scheme such as a Gaussian distribution or Xavier or Kaiming initialisation. This initialisation is done once before training.
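As a rough illustration of random initialisation (the tutorial itself keeps the hand-picked integer weights above), the sketch below draws a 4×3 weight matrix in two ways using only the standard library. The seed, the standard deviation of 0.02 and the variable names are arbitrary assumptions made for this example; the Xavier (Glorot) uniform bound is sqrt(6 / (fan_in + fan_out)).

```python
import math
import random

random.seed(0)  # arbitrary seed, only to make the sketch repeatable

fan_in, fan_out = 4, 3  # input dimension 4, representation dimension 3

# Small Gaussian weights; the standard deviation 0.02 is an assumed value.
gaussian_w = [[random.gauss(0.0, 0.02) for _ in range(fan_out)]
              for _ in range(fan_in)]

# Xavier (Glorot) uniform: sample from U(-limit, limit).
limit = math.sqrt(6.0 / (fan_in + fan_out))
xavier_w = [[random.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

print(len(xavier_w), len(xavier_w[0]))  # 4 3
```

A deep learning framework would normally handle this through its own initialisation utilities.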

Step 3: Derive key, query and value

Now that we have the three sets of weights, let’s obtain the key, query and value representations for every input.

Key representation for Input 1:

               [0, 0, 1]
[1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 2:

               [0, 0, 1]
[0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 3:

               [0, 0, 1]
[1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
               [0, 1, 0]
               [1, 1, 0]

A faster way is to vectorise the above operations:

               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

Fig. 1.3a: Derive key representations from each input

Let’s do the same to obtain the value representations for every input:

               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3] 
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]

Fig. 1.3b: Derive value representations from each input

and finally the query representations:

               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]

Fig. 1.3c: Derive query representations from each input

Notes
In practice, a bias vector may be added to the product of matrix multiplication.
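The three matrix products in this step can be checked with a few lines of plain Python. This is a minimal sketch rather than the article's PyTorch code: the helper `matmul` and the variable names are assumptions of this example, and no bias is added.

```python
# Inputs and weights copied from Steps 1 and 2.
x = [[1, 0, 1, 0],
     [0, 2, 0, 2],
     [1, 1, 1, 1]]
w_key = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
w_query = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
w_value = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col))
             for col in zip(*b)]  # zip(*b) iterates over the columns of b
            for row in a]


keys = matmul(x, w_key)
querys = matmul(x, w_query)
values = matmul(x, w_value)

print(keys)    # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
print(querys)  # [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
print(values)  # [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
```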

Step 4: Calculate attention scores for Input 1

Fig. 1.4: Calculating attention scores (blue) from query 1

To obtain attention scores, we start by taking the dot product of Input 1’s query (red) with every key (orange), including its own. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue).

            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]

Notice that we only use the query from Input 1. Later we’ll work on repeating this same step for the other querys.

Note
The above operation is known as dot product attention, one of the several score functions. Other score functions include scaled dot product and additive/concat.
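The score computation for Input 1 can be sketched in plain Python as below. The helper `dot` is an assumption of this example. The last lines show the scaled dot product variant, which divides each score by the square root of the key dimension; the tutorial itself uses the unscaled scores.

```python
import math

query_1 = [1, 0, 2]                       # query for Input 1
keys = [[0, 1, 1], [4, 4, 0], [2, 3, 1]]  # keys for Inputs 1, 2 and 3


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# One score per key, including Input 1's own key.
attn_scores = [dot(query_1, key) for key in keys]
print(attn_scores)  # [2, 4, 4]

# Scaled dot product variant (not used in the rest of this tutorial).
d_k = len(query_1)
scaled_scores = [score / math.sqrt(d_k) for score in attn_scores]
```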

Step 5: Calculate softmax

Fig. 1.5: Softmax the attention scores (blue)

Take the softmax across these attention scores (blue).

softmax([2, 4, 4]) = [0.0, 0.5, 0.5]

Note that these values are approximated for readability; to two decimal places the softmax is [0.06, 0.47, 0.47].
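For reference, the softmax can be computed directly with the standard library. This is a minimal sketch; the function name `softmax` and the max-subtraction trick for numerical stability are choices made for this example, not something the tutorial prescribes.

```python
import math


def softmax(scores):
    """Exponentiate each score and normalise so the results sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


weights = softmax([2, 4, 4])
print([round(w, 2) for w in weights])  # [0.06, 0.47, 0.47]
```

The first weight is small but not zero, so the figures in the following steps are approximations of what an exact computation would give.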

Step 6: Multiply scores with values

Fig. 1.6: Derive weighted value representation (yellow) from multiply value (purple) and score (blue)

The softmaxed attention score for each input (blue) is multiplied by its corresponding value (purple). This results in 3 alignment vectors (yellow). In this tutorial, we’ll refer to them as weighted values.

1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]
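The same multiplication as a short plain-Python sketch, using the rounded softmax scores from Step 5 (the variable names are assumptions of this example):

```python
attn_weights = [0.0, 0.5, 0.5]              # rounded softmax scores for Input 1
values = [[1, 2, 3], [2, 8, 0], [2, 6, 3]]  # value representations from Step 3

# Scale every element of each value vector by that input's attention weight.
weighted_values = [[w * v for v in value]
                   for w, value in zip(attn_weights, values)]

print(weighted_values)
# [[0.0, 0.0, 0.0], [1.0, 4.0, 0.0], [1.0, 3.0, 1.5]]
```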

Step 7: Sum weighted values to get Output 1

Fig. 1.7: Sum all weighted values (yellow) to get Output 1 (dark green)

Take all the weighted values (yellow) and sum them element-wise:

  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all other keys, including itself.
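The element-wise sum can be sketched in plain Python by transposing the list of weighted values with `zip` and summing each column (the variable names are assumptions of this example):

```python
weighted_values = [[0.0, 0.0, 0.0],
                   [1.0, 4.0, 0.0],
                   [1.0, 3.0, 1.5]]

# zip(*rows) groups the first elements together, then the second, and so on.
output_1 = [sum(column) for column in zip(*weighted_values)]

print(output_1)  # [2.0, 7.0, 1.5]
```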

Step 8: Repeat for Input 2 & Input 3

Now that we’re done with Output 1, we repeat Steps 4 to 7 for Output 2 and Output 3. I trust that I can leave you to work out the operations yourself 👍🏼.

Fig. 1.8: Repeat previous steps for Input 2 & Input 3
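For readers who want to check their working, Steps 4 to 7 can be wrapped in one function and applied to each query in turn. This is a sketch in plain Python under the same inputs as above; the function names `softmax` and `attend` are assumptions of this example. It uses the unrounded softmax, so Output 1 comes out near [1.94, 6.68, 1.60] rather than the rounded [2.0, 7.0, 1.5] of the illustration.

```python
import math

querys = [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
keys = [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
values = [[1, 2, 3], [2, 8, 0], [2, 6, 3]]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Steps 4 to 7 for a single query: score, softmax, weight, sum."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    weighted = [[w * v for v in value] for w, value in zip(weights, values)]
    return [sum(column) for column in zip(*weighted)]


outputs = [attend(query, keys, values) for query in querys]
for i, out in enumerate(outputs, start=1):
    print(f"Output {i}:", [round(o, 2) for o in out])
```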

Notes
The dimension of query and key must always be the same because of the dot product score function. However, the dimension of value may be different from query and key. The resulting output will consequently follow the dimension of value.
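A small sketch of this point: if the values had dimension 2 instead of 3, the output would have dimension 2 as well, while the scores (which depend only on query and key) stay the same. The 2-dimensional values below are hypothetical numbers made up for this example.

```python
attn_weights = [0.0, 0.5, 0.5]       # rounded softmax scores for Input 1
values_2d = [[1, 2], [2, 8], [2, 6]]  # hypothetical values with dimension 2

weighted = [[w * v for v in value] for w, value in zip(attn_weights, values_2d)]
output_1 = [sum(column) for column in zip(*weighted)]

print(output_1)  # [2.0, 7.0]
```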

2. Code

Here is the code in PyTorch 🤗, a popular deep learning framework in Python. To enjoy the APIs for @ operator, .T and None indexing in the following code snippets, make sure you’re on Python≥3.6 and PyTorch 1.3.1. Just follow along and copy-paste these in a Python/IPython REPL or Jupyter Notebook.

Step 1: Prepare inputs

Step 2: Initialise weights

Step 3: Derive key, query and value

Step 4: Calculate attention scores

Step 5: Calculate softmax

Step 6: Multiply scores with values

Step 7: Sum weighted values
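The seven steps collapse into a handful of matrix products when written in vectorised form. As a dependency-free stand-in, here is a minimal sketch in plain Python rather than PyTorch. The helpers `matmul`, `transpose` and `softmax_rows` are assumptions of this example; in PyTorch they would correspond roughly to the `@` operator, `.T` and a softmax over the last dimension.

```python
import math

# Step 1: inputs (3 inputs, dimension 4)
x = [[1, 0, 1, 0],
     [0, 2, 0, 2],
     [1, 1, 1, 1]]

# Step 2: weights (4 x 3 each)
w_key = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
w_query = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
w_value = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]


def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Step 3: key, query and value for all inputs at once
keys = matmul(x, w_key)
querys = matmul(x, w_query)
values = matmul(x, w_value)

# Step 4: every query against every key, giving a 3 x 3 score matrix
attn_scores = matmul(querys, transpose(keys))
print(attn_scores)  # [[2, 4, 4], [4, 16, 12], [4, 12, 10]]

# Step 5: softmax over each row
attn_weights = softmax_rows(attn_scores)

# Steps 6 and 7: weighting and summing is one more matrix product
outputs = matmul(attn_weights, values)
```

Row i of `outputs` is the output for Input i; because the softmax is not rounded here, the first row is close to, but not exactly, the [2.0, 7.0, 1.5] of the illustration.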

Note
PyTorch has provided an API for this called nn.MultiheadAttention. However, this API requires that you feed in key, query and value PyTorch tensors. Moreover, the outputs of this module undergo a linear transformation.

3. Extending to Transformers

So, where do we go from here? Transformers! Indeed we live in exciting times of deep learning research and high compute resources. The Transformer is the architecture introduced in Attention Is All You Need, originally designed for neural machine translation. Researchers picked up from here, reassembling, cutting, adding and extending the parts, and extending its usage to more language tasks.

Here I will briefly mention how we can extend self-attention to a Transformer architecture.

Within the self-attention module:

Dimension
Bias

Inputs to the self-attention module:

Embedding module
Positional encoding
Truncating
Masking

Adding more self-attention modules:

Multihead
Layer stacking

Modules between self-attention modules:

Linear transformations
LayerNorm

That’s all folks! Hope you find the content easy to digest. Is there something that you think I should add or elaborate on further in this article? Do drop a comment! Also, do check out an illustration I created for attention below.

References

Attention Is All You Need (arxiv.org)

The Illustrated Transformer (jalammar.github.io)

Attn: Illustrated Attention (towardsdatascience.com)

Special thanks to Xin Jie, Serene, Ren Jie, Kevin and Wei Yih for ideas, suggestions and corrections to this article.

Follow me on Twitter @remykarem for digested articles and other tweets on AI, ML, Deep Learning and Python.

Raimi Karim

🇸🇬 Software Engineer at GovTech • MComp AI at NUS

Classifying semantic relations between entity pairs in sentences is an important task in natural language processing (NLP).

Most previous models applied to relation classification rely on high-level lexical and syntactic features obtained by NLP tools such as WordNet, the dependency parser, part-of-speech (POS) tagger, and named entity recognizers (NER).

In addition, state-of-the-art neural models based on attention mechanisms do not fully utilize information related to the entity, which may be the most crucial feature for relation classification.

To address these issues, we propose a novel end-to-end recurrent neural model that incorporates an entity-aware attention mechanism with a latent entity typing (LET) method.

Our model not only effectively utilizes entities and their latent types as features, but also builds word representations by applying self-attention based on symmetrical similarity of a sentence itself.

Moreover, the model is interpretable by visualizing applied attention mechanisms.

Experimental results obtained with the SemEval-2010 Task 8 dataset, which is one of the most popular relation classification tasks, demonstrate that our model outperforms existing state-of-the-art models without any high-level features.

https://www.mdpi.com/2073-8994/11/6/785

https://www.mdpi.com/2073-8994/11/6/785/htm

Natural Language Processing: part 1 of lecture notes

Lecture Synopsis

Aims

This course aims to introduce the fundamental techniques of natural language processing, to develop an understanding of the limits of those techniques and of current research issues, and to evaluate some current and potential applications.

• Introduction. Brief history of NLP research, current applications, generic NLP system architecture, knowledge-based versus probabilistic approaches.

• Finite state techniques. Inflectional and derivational morphology, finite-state automata in NLP, finite-state transducers.

• Prediction and part-of-speech tagging. Corpora, simple N-grams, word prediction, stochastic tagging, evaluating system performance.

• Parsing and generation I. Generative grammar, context-free grammars, parsing and generation with context-free grammars, weights and probabilities.

• Parsing and generation II. Constraint-based grammar, unification, simple compositional semantics.

• Lexical semantics. Semantic relations, WordNet, word senses, word sense disambiguation.

• Discourse. Anaphora resolution, discourse relations.

• Applications. Machine translation, email response, spoken dialogue systems.

https://www.cl.cam.ac.uk/teaching/2002/NatLangProc/nlp1-4.pdf

Natural Language Processing Tutorial

Meaning of self attention mechanism

Semantic Relation Classification via Bidirectional LSTM Networks with Entity-Aware Attention Using Latent Entity Typing
