Meaning of self attention mechanism

In layman's terms, the self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”). The outputs are aggregates of these interactions and attention scores.

https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a

Illustrated: Self-Attention

A step-by-step guide to self-attention with illustrations and code

Raimi Karim

Nov 18, 2019

The illustrations are best viewed on the Desktop. A Colab version can be found here (thanks to Manuel Romero!).

Changelog:
12 Jan 2022 — Improve clarity
5 Jan 2022 — Fix typos and improve clarity

What do BERT, RoBERTa, ALBERT, SpanBERT, DistilBERT, SesameBERT, SemBERT, SciBERT, BioBERT, MobileBERT, TinyBERT and CamemBERT all have in common? And I’m not looking for the answer “BERT” 🤭.

Answer: self-attention 🤗. We are not only talking about architectures bearing the name “BERT” but, more correctly, Transformer-based architectures. Transformer-based architectures, which are primarily used in modelling language understanding tasks, eschew recurrence in neural networks and instead rely entirely on self-attention mechanisms to draw global dependencies between inputs and outputs. But what’s the math behind this?

That’s what we’re going to find out today. The main content of this post is to walk you through the mathematical operations involved in a self-attention module. By the end of this article, you should be able to write or code a self-attention module from scratch.

This article does not aim to provide the intuitions and explanations behind the different numerical representations and mathematical operations in the self-attention module. It also does not seek to demonstrate the why’s and how-exactly’s of self-attention in Transformers (I believe there’s a lot out there already). Note that the difference between attention and self-attention is also not detailed in this article.

Content

0. What is self-attention?
1. Illustrations
2. Code
3. Extending to Transformers

Now let’s get on to it!

0. What is self-attention?

If you think that self-attention is similar to attention, the answer is yes! They fundamentally share the same concept and many common mathematical operations.

A self-attention module takes in n inputs and returns n outputs. What happens in this module? In layman’s terms, the self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”). The outputs are aggregates of these interactions and attention scores.

1. Illustrations

The illustrations are divided into the following steps:

Prepare inputs
Initialise weights
Derive key, query and value
Calculate attention scores for Input 1
Calculate softmax
Multiply scores with values
Sum weighted values to get Output 1
Repeat steps 4–7 for Input 2 & Input 3

Note
In practice, the mathematical operations are vectorised, i.e. all the inputs undergo the mathematical operations together. We’ll see this later in the Code section.

Step 1: Prepare inputs

Fig. 1.1: Prepare inputs

We start with 3 inputs for this tutorial, each with dimension 4.

Input 1: [1, 0, 1, 0] 
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]

Step 2: Initialise weights

Every input must have three representations (see diagram below). These representations are called key (orange), query (red), and value (purple). For this example, let’s say we want these representations to have a dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape of 4×3.

Note
We’ll see later that the dimension of value is also the output dimension.

Fig. 1.2: Deriving key, query and value representations from each input

To obtain these representations, every input (green) is multiplied with a set of weights for keys, a set of weights for querys (I know that’s not the correct spelling), and a set of weights for values. In our example, we initialise the three sets of weights as follows.

Weights for key:

[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

Weights for query:

[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

Weights for value:

[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]

Notes
In a neural network setting, these weights are usually small numbers, initialised randomly using an appropriate random distribution like Gaussian, Xavier and Kaiming distributions. This initialisation is done once before training.
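
As a rough illustration of what such a random initialisation could look like, here is a minimal plain-Python sketch of Xavier (Glorot) uniform initialisation, which draws each weight from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). The helper name `xavier_uniform` and the fixed seed are assumptions made for this sketch; a framework such as PyTorch ships its own initialisers.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out matrix drawn from U(-a, a),
    where a = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One 4x3 weight matrix each for key, query and value, as in this tutorial.
w_key = xavier_uniform(4, 3, rng)
w_query = xavier_uniform(4, 3, rng)
w_value = xavier_uniform(4, 3, rng)
```

The tutorial keeps using the small hand-picked integer weights above so that the arithmetic stays easy to follow.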

Step 3: Derive key, query and value

Now that we have the three sets of weights, let’s obtain the key, query and value representations for every input.

Key representation for Input 1:

               [0, 0, 1]
[1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 2:

               [0, 0, 1]
[0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 3:

               [0, 0, 1]
[1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
               [0, 1, 0]
               [1, 1, 0]

A faster way is to vectorise the above operations:

               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

Fig. 1.3a: Derive key representations from each input

Let’s do the same to obtain the value representations for every input:

               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3] 
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]

Fig. 1.3b: Derive value representations from each input

and finally the query representations:

               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]

Fig. 1.3c: Derive query representations from each input
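
The three matrix products above can be checked with a few lines of plain Python. This is only a sketch: the `matmul` helper is written here for illustration and is not taken from any library.

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

x = [[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]]           # 3 inputs, dimension 4
w_key = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]     # 4 x 3
w_query = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]   # 4 x 3
w_value = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]   # 4 x 3

keys = matmul(x, w_key)       # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
querys = matmul(x, w_query)   # [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
values = matmul(x, w_value)   # [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
```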

Notes
In practice, a bias vector may be added to the product of matrix multiplication.

Step 4: Calculate attention scores for Input 1

Fig. 1.4: Calculating attention scores (blue) from query 1

To obtain attention scores, we start by taking the dot product of Input 1’s query (red) with every key (orange), including its own. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue).

            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]

Notice that we only use the query from Input 1. Later we’ll work on repeating this same step for the other querys.

Note
The above operation is known as dot product attention, one of the several score functions. Other score functions include scaled dot product and additive/concat.
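
To make the difference between two of these score functions concrete, the sketch below computes the plain dot-product scores for query 1 and then the scaled variant, which divides each score by the square root of the key dimension. The `dot` helper and the variable names are illustrative assumptions.

```python
import math

query_1 = [1, 0, 2]
keys = [[0, 1, 1], [4, 4, 0], [2, 3, 1]]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Dot-product attention: one score per key.
scores = [dot(query_1, k) for k in keys]   # [2, 4, 4]

# Scaled dot-product attention: divide by sqrt(d_k), the key dimension.
d_k = len(query_1)
scaled_scores = [s / math.sqrt(d_k) for s in scores]
```

The rest of this tutorial continues with the unscaled scores.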

Step 5: Calculate softmax

Fig. 1.5: Softmax the attention scores (blue)

Take the softmax across these attention scores (blue).

softmax([2, 4, 4]) = [0.0, 0.5, 0.5]

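Note that the weights are coarsely rounded here so that the arithmetic in the next steps stays readable; to more decimal places they are roughly [0.06, 0.47, 0.47].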
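
For reference, the softmax itself can be sketched in plain Python as follows. Subtracting the maximum score before exponentiating is a common numerical-stability habit and does not change the result; the function name is our own.

```python
import math

def softmax(scores):
    """Exponentiate each score and normalise so the results sum to 1."""
    peak = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2, 4, 4])
print([round(w, 4) for w in weights])   # [0.0634, 0.4683, 0.4683]
```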

Step 6: Multiply scores with values

Fig. 1.6: Derive weighted value representations (yellow) by multiplying each value (purple) by its score (blue)

The softmaxed attention score for each input (blue) is multiplied by its corresponding value (purple). This results in 3 alignment vectors (yellow). In this tutorial, we’ll refer to them as weighted values.

1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]

Step 7: Sum weighted values to get Output 1

Fig. 1.7: Sum all weighted values (yellow) to get Output 1 (dark green)

Take all the weighted values (yellow) and sum them element-wise:

  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all keys, including its own.
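
Steps 6 and 7 together amount to a weighted sum of the value vectors. A minimal plain-Python sketch, using the rounded weights from Step 5 so that the numbers match the illustration:

```python
values = [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
weights = [0.0, 0.5, 0.5]   # rounded softmax weights from Step 5

# Step 6: scale each value vector by its attention weight.
weighted_values = [[w * v for v in value]
                   for w, value in zip(weights, values)]

# Step 7: sum the weighted values element-wise (column by column).
output_1 = [sum(column) for column in zip(*weighted_values)]
print(output_1)   # [2.0, 7.0, 1.5]
```

With unrounded softmax weights the same two steps give approximately [1.94, 6.68, 1.60] instead.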

Step 8: Repeat for Input 2 & Input 3

Now that we’re done with Output 1, we repeat Steps 4 to 7 for Output 2 and Output 3. I trust that I can leave you to work out the operations yourself 👍🏼.

Fig. 1.8: Repeat previous steps for Input 2 & Input 3

Notes
The dimension of query and key must always be the same because of the dot product score function. However, the dimension of value may be different from query and key. The resulting output will consequently follow the dimension of value.
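
The dimension rule in this note can be seen in a small sketch. The function `attend` and the 2-dimensional values below are hypothetical and chosen only for illustration: the query and keys keep dimension 3, the values have dimension 2, and the output follows the values.

```python
import math

def attend(query, keys, values):
    """Return the attention output for one query (plain dot-product scores)."""
    if any(len(k) != len(query) for k in keys):
        raise ValueError("query and key dimensions must match")
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Weighted sum over the value vectors, one output entry per value column.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

query_1 = [1, 0, 2]                        # dimension 3
keys = [[0, 1, 1], [4, 4, 0], [2, 3, 1]]   # dimension 3, must match the query
values_2d = [[1, 0], [0, 1], [1, 1]]       # hypothetical values, dimension 2

output_1 = attend(query_1, keys, values_2d)
print(len(output_1))   # 2
```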

2. Code

Here is the code in PyTorch 🤗, a popular deep learning framework in Python. To use the @ operator, .T and None indexing in the following code snippets, make sure you’re on Python ≥ 3.6 and PyTorch 1.3.1. Just follow along and copy-paste these into a Python/IPython REPL or Jupyter Notebook.

Step 1: Prepare inputs

Step 2: Initialise weights

Step 3: Derive key, query and value

Step 4: Calculate attention scores

Step 5: Calculate softmax

Step 6: Multiply scores with values

Step 7: Sum weighted values
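
Steps 1 to 7 can also be sketched end to end without any framework. The version below is an assumption-laden stand-in that uses plain Python lists rather than PyTorch tensors, so that it runs with no dependencies; the helper names `matmul` and `softmax` are our own. It processes all three inputs at once, which is the vectorised form mentioned earlier.

```python
import math

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax(row):
    """Normalised exponentials of one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Step 1: prepare inputs
x = [[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]]

# Step 2: initialise weights
w_key = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
w_query = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
w_value = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]

# Step 3: derive key, query and value
keys = matmul(x, w_key)
querys = matmul(x, w_query)
values = matmul(x, w_value)

# Step 4: attention scores, one row per query (querys times keys transposed)
keys_t = [list(column) for column in zip(*keys)]
scores = matmul(querys, keys_t)

# Step 5: softmax across each row of scores
weights = [softmax(row) for row in scores]

# Steps 6 and 7: weight the values and sum them, as one matrix product
outputs = matmul(weights, values)

for row in outputs:
    print([round(v, 4) for v in row])
```

The first printed row is Output 1 computed with unrounded softmax weights, so it comes out close to, but not exactly, the [2.0, 7.0, 1.5] of the illustration.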

Note
PyTorch has provided an API for this called nn.MultiheadAttention. However, this API requires that you feed in key, query and value PyTorch tensors. Moreover, the outputs of this module undergo a linear transformation.

3. Extending to Transformers

So, where do we go from here? Transformers! Indeed we live in exciting times of deep learning research and high compute resources. The transformer is the incarnation from Attention Is All You Need, originally born to perform neural machine translation. Researchers picked up from here, reassembling, cutting, adding and extending the parts, and extending its usage to more language tasks.

Here I will briefly mention how we can extend self-attention to a Transformer architecture.

Within the self-attention module:

Dimension
Bias

Inputs to the self-attention module:

Embedding module
Positional encoding
Truncating
Masking

Adding more self-attention modules:

Multihead
Layer stacking

Modules between self-attention modules:

Linear transformations
LayerNorm

That’s all folks! Hope you find the content easy to digest. Is there something that you think I should add or elaborate on further in this article? Do drop a comment! Also, do check out an illustration I created for attention (Attn: Illustrated Attention, in the references below).

References

Attention Is All You Need (arxiv.org)

The Illustrated Transformer (jalammar.github.io)

Attn: Illustrated Attention (towardsdatascience.com)


Special thanks to Xin Jie, Serene, Ren Jie, Kevin and Wei Yih for ideas, suggestions and corrections to this article.


Raimi Karim

🇸🇬 Software Engineer at GovTech • MComp AI at NUS

Sentiment Analysis using Partial Textual Entailment

The internet is an ocean of raw data in the form of social text messages, tweets, blogs etc. - arbitrary and unconnected (at the first glance).

Only after online text snippets are processed, organized and structured does the data become information.

From the literature survey, we observed that present sentiment analysis tools are slow, bulky and computationally heavy, which makes the task at hand inefficient.

Therefore, to overcome the aforementioned problem of analyzing sentiments efficiently, a new method is proposed in the present paper to drive the task of Sentiment Analysis by exploiting the idea of Partial Textual Entailment.

We propose to use Partial Textual Entailment for measuring semantic similarity between the tweets so as to group similar tweets together.

The method is anticipated to reduce the burden on the sentiment analyzer and make processing faster.

Moreover, we also propose a modification in an existing method of Partial Textual Entailment which can be further adopted for many Natural Language Processing applications.

https://ieeexplore.ieee.org/document/8862241

conf

What is One Hot Encoding?

Machine learning algorithms cannot work with categorical data directly.

Categorical data must be converted to numbers.

In this tutorial, you will discover how to convert your input or output sequence data to a one hot encoding for use in sequence classification problems with deep learning in Python.

After completing this tutorial, you will know:

(1) What an integer encoding and one hot encoding are and why they are necessary in machine learning.

(2) How to calculate an integer encoding and one hot encoding by hand in Python.

(3) How to use the scikit-learn and Keras libraries to automatically encode your sequence data in Python.
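
As a small illustration of point (2), an integer encoding and a one hot encoding can be worked out by hand in plain Python. The example string and the alphabet below are assumptions chosen for this sketch.

```python
data = "hello world"
alphabet = "abcdefghijklmnopqrstuvwxyz "   # every symbol we expect to see

# Map each character to an integer index and back again.
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for c, i in char_to_int.items()}

# Integer encoding: one index per character.
integer_encoded = [char_to_int[c] for c in data]

# One hot encoding: a vector of zeros with a single 1 at the index.
one_hot = []
for index in integer_encoded:
    row = [0] * len(alphabet)
    row[index] = 1
    one_hot.append(row)

# Decoding: find the position of the 1 and map it back to a character.
decoded = "".join(int_to_char[row.index(1)] for row in one_hot)
print(decoded)   # hello world
```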

https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

Semantic Relation Classification via Bidirectional LSTM Networks with Entity-Aware Attention Using Latent Entity Typing

Classifying semantic relations between entity pairs in sentences is an important task in natural language processing (NLP).

Most previous models applied to relation classification rely on high-level lexical and syntactic features obtained by NLP tools such as WordNet, the dependency parser, part-of-speech (POS) tagger, and named entity recognizers (NER).

In addition, state-of-the-art neural models based on attention mechanisms do not fully utilize information related to the entity, which may be the most crucial feature for relation classification.

To address these issues, we propose a novel end-to-end recurrent neural model that incorporates an entity-aware attention mechanism with a latent entity typing (LET) method.

Our model not only effectively utilizes entities and their latent types as features, but also builds word representations by applying self-attention based on symmetrical similarity of a sentence itself.

Moreover, the model is interpretable by visualizing applied attention mechanisms.

Experimental results obtained with the SemEval-2010 Task 8 dataset, which is one of the most popular relation classification tasks, demonstrate that our model outperforms existing state-of-the-art models without any high-level features.

https://www.mdpi.com/2073-8994/11/6/785

https://www.mdpi.com/2073-8994/11/6/785/htm

What Happened To Google Allo?

Google shut down Allo (TheVerge: 5th December 2018).

Allo was one of Google’s numerous attempts at creating an instant messaging app able to compete with the giants on the market – Apple’s iMessage and Facebook’s Messenger and WhatsApp.

Allo used phone numbers for identifying users and didn’t require emails or social media accounts. It introduced several additions to the world of messaging, such as Selfie Stickers, Smart Reply, and Google Assistant.

Google has been trying to find a foothold in the messaging market for years. Starting with Google Talk, going through Hangouts and its ever-growing text, voice, and video features, and finishing with newer apps like Allo, Google Duo, and Google Messenger – Google seems to have tried it all.

When Allo came out in September 2016, things looked promising. The app used phone numbers as identifiers, which was good for users who needed a texting app that wasn’t connected to their social media or email.

While Allo offered users many entertaining tools to enhance their messaging experience, it just wasn’t doing all that well in terms of numbers. The peak of downloads was the first 12 weeks after launch when it reached around 10 million downloads, and by the time Google decided to discontinue it, the app had had less than 50 million downloads.

That might seem like a lot, but it simply wasn’t enough compared to the more than a billion people using Facebook’s Messenger monthly or the 2 billion using WhatsApp.

One recurring criticism of the service was that when Allo launched in 2016, it was available only on one device since it was connected to the user’s phone number. That didn’t help attract more users, and even though in 2017 Google added the option to have the app on one mobile device and use it on the web, it might have been a bit too late.

Finally, the lack of SMS support was a big hindrance. People could use Hangouts to talk on the web and via SMS at the same time, so it was a legitimate question why people would prefer to switch to Allo.

Having this in mind, it’s not much of a surprise that Google decided to redirect time and resources elsewhere. In April 2018, Anil Sabharwal, the new head of Google’s communications group, announced that the tech giant was “pausing” the development of the Allo project and would be focusing on something called Android Messages and the new RCS standards and Chat.

In December 2018, Google announced Allo was to be officially discontinued in March 2019, and users were given the option to save their data beforehand.

Source: https://www.failory.com/google/allo

A Study on Information Retrieval Methods in Text Mining

ABSTRACT:

Information in the legal domain is often stored as text in relatively unstructured forms. For example, statutes, judgments and commentaries are typically stored as free documents. Discovering knowledge by the automatic analysis of free text is a field of research that is evolving from information retrieval and is often called text mining.

Dozier states that text mining is a new field and there is still debate about what its definition should be. Jockson observes that text mining involves discovering something interesting about the relationship between the text and the world. Hearst proposes that text mining involves discovering relationships between the content of multiple texts and linking this information together to create new information.

Text information retrieval and data mining have thus become increasingly important. In this paper, various information retrieval techniques based on text mining are presented.

KEYWORDS:

Information Retrieval, Information Extraction and Indexing Techniques

How to prepare for writing a research paper on any topic in computer science?

Start reading the literature.

http://scholar.google.com is your friend. I’ve also found Microsoft Academic is useful. Type in the topic area, whatever it is.

Begin looking at the links that are returned. Read the abstracts. Download the papers that interest you - when I download them I add the date of publication and the title of the paper as the name (because often the names give you no contextual clue as to the contents). It is also a good practice to capture the bibliographic data (I grab the BibTeX since I write my CS papers in LaTeX) because you will need it when you start writing.

Start reading the interesting papers. Usually by the time you are done reading the introduction you will know if you want to read further. If you really liked the paper, then use the search engines to find papers that reference the paper you like.

If you are looking for the latest research, restrict your search to the past four years. I usually work backwards from the more recent work to older work. Sometimes I find survey articles, which can be great. Usually I will find a few papers that are heavily referenced (100+ references), and that is often indicative of the importance of the work.

Summarize those papers. For the ones you thought most relevant try to note what you learned from the paper. What did you like? What did you not like? Were there any areas they failed to address? Did they conflict with other work in the field? If so, why?

If you are looking for a research question, try to identify something the paper didn’t do: maybe a technique you think they could have used, or a dataset, or specific equipment. Look to see if any of the other papers have done that. If not, you now have a potential research question: “What happens if we use X when addressing problem Y?” Then think up ways that you can “fail fast” - how can you see if there is merit to that approach? If not, you want to know quickly so you can discard that approach before you spend too much time on it. If it looks promising, then push further and see where your research takes you.

As you do this work, write it up. If you are really disciplined, you will start writing your research paper before you have done the research. That will help you think about how you want to present your findings, which in turn helps you focus on what you need to do to generate the necessary data. Of course, you are likely to find out things don’t work the way that you wanted and you’ll have to rewrite your paper. This way you do have a history of what you did as well. When you’re done, you’ll have a research paper.

That’s the point at which you’ll have to tear it apart. Criticize it. Find its weaknesses. Think about how you will fix those weaknesses or address them (“we did not investigate X and leave it for future work…”). Then ask other people to read it - they won’t have any of your insight, so if you aren’t explaining it so they can understand why your work is important (the abstract and introduction!) then you need to go back and rewrite those sections.

At some point you decide you are done with the paper: the time allotted to it has expired, the work has been accepted for publication, you’re sick of it and want to do anything else.

Good luck!

Tony Mason, PhD from The University of British Columbia (2022)

https://www.quora.com/How-do-I-prepare-for-writing-a-research-paper-on-any-topic-in-computer-science/answer/Tony-Mason-10
