Twitter and Research: A Systematic Literature Review Through Text Mining

Researchers have collected Twitter data to study a wide range of topics. This growing body of literature, however, has not yet been reviewed systematically to synthesize Twitter-related papers. The existing literature review papers have been limited by constraints of traditional methods to manually select and analyze samples of topically related papers. The goals of this retrospective study are to identify dominant topics of Twitter-based research, summarize the temporal trend of topics, and interpret the evolution of topics within the last ten years. This study systematically mines a large number of Twitter-based studies to characterize the relevant literature by an efficient and effective approach. This study collected relevant papers from three databases and applied text mining and trend analysis to detect semantic patterns and explore the yearly development of research themes across a decade. We found 38 topics in more than 18,000 manuscripts published between 2006 and 2019. By quantifying temporal trends, this study found that while 23.7% of topics did not show a significant trend ( $P \geq 0.05$ ), 21% of topics had increasing trends and 55.3% of topics had decreasing trends; these hot and cold topics fall into three categories: application, methodology, and technology. The contributions of this paper can be utilized in the growing field of Twitter-based research and are beneficial to researchers, educators, and publishers.

Like the politics-related Twitter studies, sentiment analysis has a positive slope, indicating increasing research activity over time.

https://ieeexplore.ieee.org/ielx7/6287639/8948470/09047963.pdf

Why are neural networks always described as so complicated you can't even use it?

I don’t think they’re always described like that, but they are complex yes.

What’s your source for that description?

Because a bunch of people say it's going to be a mess, and the most important thing about mental networks and the next generation of signal processing and software is that AI is going to be built on VR, and nobody knows that, and then they brag about it.

A2A: I don’t think that neural networks are “always” described that way. People can and do use various forms of artificial neural network technology for many kinds of useful applications.

The complexity in using these things comes from a lack of good theory to tell the user what size and shape of network will be best (or good enough) for a given problem. By that I mean the number of layers, number of units in each layer, activation functions, hyper-parameters such as the learning rate, when and where to use features such as convolutional or recurrent layers, and so on.

So generally you guess what kind of network will do best on a problem, and you usually end up trying a lot of different guesses before you find one that works. The guessing gets better (at least a little bit) as the user gains experience, but it’s still a black art.

It also is very hard to understand what is going on inside a network, so if it exhibits some anomalous behavior for certain inputs, it’s hard to figure out the reason or what the fix would be.

Because we are talking about using our neurological system to function as servers and workstations, plus hundreds of pieces of intermediary equipment, without neurologically supportive infrastructure. Experiments on using neurology as a means of communication were conducted more than ten years ago, but we still have not seen any application of it today, all due to such challenges. There were, however, some experiments along the same lines in the Soviet Union at the height of the Cold War. They were only distantly relevant to the idea, so the technology sometimes failed. Soviet spies were equipped with special methods of sending and receiving information through interaction with signs, objects, and verbal sentences (codes/scripted text). Each of them was fitted with a hidden sensor to decode hidden texts, which he could view inside his mind. They also used a technique of brain collaboration, exchanging electrical pulses that were decoded by a sensor and converted into images.

Well - People are using neural networks, so clearly there is some understanding.

Neural networks are only complicated, poorly understood, or hard to interpret if the underlying training data is poorly understood. NNs are useful because the geometry of the feature space is complex, multimodal and confused. In fact, if there are sub-populations with separate, clear explanations that may be applied to particular groups, it is generally a good idea to train on those groups in subnets. It should be understood that if you have a handle on the geometry and complexity of the decision space, the required number of layers and nodes can be estimated.

The other part of the complexity is the training and convergence to optimum performance. This can get mind-bogglingly complex, since depending on the complexity, order of presentation, initial conditions, and character of the confusion boundaries, convergence can vary tremendously, and may be mathematically chaotic. Such is stochastic convergence.

They are not so complicated, and it is still much easier to use them than to design them. One can use existing models even outside the deep learning frameworks, e.g. network models for vision can be used in OpenCV. Of course, for each model one has to know how to prepare suitable input for the network and how to interpret its output, and these processes are a bit complicated. See e.g. www.agentspace.org

I don’t think they are usually described as too complicated to use. From what I have heard, neural networks are described as so complicated largely because there’s still a lot we are learning about them theoretically to understand how to best design and use them.

There are still a lot of questions about how to construct the networks so that you can have a good generalizing model for whatever dataset you are investigating. What should the neural network architecture be? What activation function(s) should you use? What optimization scheme should you use? These questions and more ultimately influence the function space you will optimize over for some problem, and there’s a lot we have yet to learn about these things.

Larger neural networks can have hundreds, thousands, even millions of neurons, and we have yet to fully understand what exactly is going on inside those networks. As a result we can’t bug check them: we can’t tell whether a neural network is working properly without testing it, so bugs that didn’t appear in testing could present themselves during real use. For example, you could train a neural network to drive a car. During testing it might seem to be doing fine, so you let it out onto the real road. Then, out of nowhere, it drives straight into a Starbucks, and you have no idea why, as you don’t fully understand how the thing works.

https://www.quora.com/Why-are-neural-networks-always-described-as-so-complicated-you-cant-even-use-it/log

In layman's terms, the self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”). The outputs are aggregates of these interactions and attention scores.

https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a

Illustrated: Self-Attention

A step-by-step guide to self-attention with illustrations and code

Raimi Karim

Nov 18, 2019·8 min read

The illustrations are best viewed on the Desktop. A Colab version can be found here (thanks to Manuel Romero!).

Changelog:
12 Jan 2022 — Improve clarity
5 Jan 2022 — Fix typos and improve clarity

What do BERT, RoBERTa, ALBERT, SpanBERT, DistilBERT, SesameBERT, SemBERT, SciBERT, BioBERT, MobileBERT, TinyBERT and CamemBERT all have in common? And I’m not looking for the answer “BERT” 🤭.

Answer: self-attention 🤗. We are not only talking about architectures bearing the name “BERT” but, more correctly, Transformer-based architectures. Transformer-based architectures, which are primarily used in modelling language understanding tasks, eschew recurrence in neural networks and instead rely entirely on self-attention mechanisms to draw global dependencies between inputs and outputs. But what’s the math behind this?

That’s what we’re going to find out today. The main content of this post is to walk you through the mathematical operations involved in a self-attention module. By the end of this article, you should be able to write or code a self-attention module from scratch.

This article does not aim to provide the intuitions and explanations behind the different numerical representations and mathematical operations in the self-attention module. It also does not seek to demonstrate the why’s and how-exactly’s of self-attention in Transformers (I believe there’s a lot out there already). Note that the difference between attention and self-attention is also not detailed in this article.

Now let’s get on to it!

0. What is self-attention?

If you think that self-attention is similar to attention, the answer is yes! They fundamentally share the same concept and many common mathematical operations.

A self-attention module takes in n inputs and returns n outputs. What happens in this module? In layman’s terms, the self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”). The outputs are aggregates of these interactions and attention scores.

1. Illustrations

The illustrations are divided into the following steps:

1. Prepare inputs
2. Initialise weights
3. Derive key, query and value
4. Calculate attention scores for Input 1
5. Calculate softmax
6. Multiply scores with values
7. Sum weighted values to get Output 1
8. Repeat steps 4–7 for Input 2 & Input 3

Note
In practice, the mathematical operations are vectorised, i.e. all the inputs undergo the mathematical operations together. We’ll see this later in the Code section.

Step 1: Prepare inputs

Fig. 1.1: Prepare inputs

We start with 3 inputs for this tutorial, each with dimension 4.

Input 1: [1, 0, 1, 0] 
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]

Step 2: Initialise weights

Every input must have three representations (see diagram below). These representations are called key (orange), query (red), and value (purple). For this example, let’s say we want these representations to have a dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape of 4×3.

Note
We’ll see later that the dimension of value is also the output dimension.

Fig. 1.2: Deriving key, query and value representations from each input

To obtain these representations, every input (green) is multiplied with a set of weights for keys, a set of weights for querys (I know that’s not the correct spelling), and a set of weights for values. In our example, we initialise the three sets of weights as follows.

Weights for key:

[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

Weights for query:

[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

Weights for value:

[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]

Notes
In a neural network setting, these weights are usually small numbers, initialised randomly using an appropriate random distribution like Gaussian, Xavier and Kaiming distributions. This initialisation is done once before training.
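As a rough illustration of the note above, here is a minimal sketch of Xavier (Glorot) uniform initialisation in plain Python, which draws each weight from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). The function name xavier_uniform, the fixed seed and the use of the standard library instead of a deep learning framework are assumptions made for this example only.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out matrix with entries drawn from U(-a, a),
    where a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable

# One 4x3 weight matrix each for key, query and value, as in this tutorial.
w_key = xavier_uniform(4, 3, rng)
w_query = xavier_uniform(4, 3, rng)
w_value = xavier_uniform(4, 3, rng)

bound = math.sqrt(6.0 / (4 + 3))
print(len(w_key), len(w_key[0]))  # 4 3
```

The hand-picked integer weights used in this tutorial are chosen only to keep the arithmetic readable.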

Step 3: Derive key, query and value

Now that we have the three sets of weights, let’s obtain the key, query and value representations for every input.

Key representation for Input 1:

               [0, 0, 1]
[1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 2:

               [0, 0, 1]
[0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 3:

               [0, 0, 1]
[1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
               [0, 1, 0]
               [1, 1, 0]

A faster way is to vectorise the above operations:

               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

Fig. 1.3a: Derive key representations from each input

Let’s do the same to obtain the value representations for every input:

               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3] 
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]

Fig. 1.3b: Derive value representations from each input

and finally the query representations:

               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]

Fig. 1.3c: Derive query representations from each input
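The three vectorised multiplications above can be checked with a short sketch in plain Python. The helper name matmul is illustrative and not taken from any library; the inputs and weights are the ones given in Steps 1 and 2.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

inputs = [[1, 0, 1, 0],
          [0, 2, 0, 2],
          [1, 1, 1, 1]]

w_key = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
w_query = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
w_value = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]

keys = matmul(inputs, w_key)
queries = matmul(inputs, w_query)
values = matmul(inputs, w_value)

print(keys)     # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
print(queries)  # [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
print(values)   # [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
```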

Notes
In practice, a bias vector may be added to the product of matrix multiplication.

Step 4: Calculate attention scores for Input 1

Fig. 1.4: Calculating attention scores (blue) from query 1

To obtain attention scores, we start by taking a dot product between Input 1’s query (red) and all keys (orange), including its own. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue).

            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]

Notice that we only use the query from Input 1. Later we’ll work on repeating this same step for the other querys.

Note
The above operation is known as dot product attention, one of several score functions. Other score functions include scaled dot product and additive/concat.
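A minimal sketch of the dot product score for Input 1’s query, using the key and query values derived in Step 3. The scaled variant divides each score by the square root of the key dimension, as in Attention Is All You Need; the helper name dot is illustrative.

```python
import math

keys = [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
query_1 = [1, 0, 2]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Dot product attention: one score per key.
scores = [dot(query_1, k) for k in keys]
print(scores)  # [2, 4, 4]

# Scaled dot product: divide by sqrt(d_k), where d_k is the key dimension (3 here).
d_k = len(keys[0])
scaled_scores = [s / math.sqrt(d_k) for s in scores]
```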

Step 5: Calculate softmax

Fig. 1.5: Softmax the attention scores (blue)

Take the softmax across these attention scores (blue).

softmax([2, 4, 4]) = [0.0, 0.5, 0.5]

Note that the values are rounded here for readability; the unrounded softmax is approximately [0.063, 0.468, 0.468].
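For reference, a minimal softmax sketch in plain Python that reproduces the unrounded values for the scores [2, 4, 4]. Subtracting the maximum before exponentiating is a common numerical-stability choice and does not change the result; the function name softmax is illustrative.

```python
import math

def softmax(xs):
    """Exponentiate each score (shifted by the max for stability) and normalise."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2, 4, 4])
print([round(w, 4) for w in weights])  # [0.0634, 0.4683, 0.4683]
```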

Step 6: Multiply scores with values

Fig. 1.6: Derive weighted value representation (yellow) by multiplying value (purple) and score (blue)

The softmaxed attention score for each input (blue) is multiplied by its corresponding value (purple). This results in 3 alignment vectors (yellow). In this tutorial, we’ll refer to them as weighted values.

1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]

Step 7: Sum weighted values to get Output 1

Fig. 1.7: Sum all weighted values (yellow) to get Output 1 (dark green)

Take all the weighted values (yellow) and sum them element-wise:

  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all keys, including its own.
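Steps 6 and 7 can be sketched in a few lines of plain Python. This assumes the rounded softmax weights [0.0, 0.5, 0.5] used in the illustration and the value representations from Step 3.

```python
values = [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
weights = [0.0, 0.5, 0.5]  # rounded softmax scores for Input 1

# Step 6: scale each value vector by its attention weight.
weighted_values = [[w * x for x in v] for w, v in zip(weights, values)]

# Step 7: sum the weighted values element-wise.
output_1 = [sum(column) for column in zip(*weighted_values)]

print(weighted_values)  # [[0.0, 0.0, 0.0], [1.0, 4.0, 0.0], [1.0, 3.0, 1.5]]
print(output_1)         # [2.0, 7.0, 1.5]
```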

Step 8: Repeat for Input 2 & Input 3

Now that we’re done with Output 1, we repeat Steps 4 to 7 for Output 2 and Output 3. I trust that I can leave you to work out the operations yourself 👍🏼.

Fig. 1.8: Repeat previous steps for Input 2 & Input 3

Notes
The dimension of query and key must always be the same because of the dot product score function. However, the dimension of value may be different from query and key. The resulting output will consequently follow the dimension of value.

2. Code

Here is the code in PyTorch 🤗, a popular deep learning framework in Python. To enjoy the APIs for @ operator, .T and None indexing in the following code snippets, make sure you’re on Python≥3.6 and PyTorch 1.3.1. Just follow along and copy-paste these in a Python/IPython REPL or Jupyter Notebook.

Step 1: Prepare inputs

Step 2: Initialise weights

Step 3: Derive key, query and value

Step 4: Calculate attention scores

Step 5: Calculate softmax

Step 6: Multiply scores with values

Step 7: Sum weighted values
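A minimal sketch of Steps 1 to 7 in vectorised form, written in plain Python with only the standard library rather than PyTorch. It assumes the inputs and weights from the Illustrations section; the helper names matmul, transpose and softmax are illustrative and not part of any library. Because the softmax is not rounded here, Output 1 differs slightly from the hand-worked [2.0, 7.0, 1.5].

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Step 1: prepare inputs (3 inputs of dimension 4).
x = [[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]]

# Step 2: initialise weights (each 4x3).
w_key = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
w_query = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
w_value = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]

# Step 3: derive key, query and value.
keys = matmul(x, w_key)
queries = matmul(x, w_query)
values = matmul(x, w_value)

# Step 4: attention scores of every query against every key.
attn_scores = matmul(queries, transpose(keys))

# Step 5: softmax over each row of scores.
attn_weights = [softmax(row) for row in attn_scores]

# Steps 6 and 7: weight the values and sum them, done as one matrix product.
outputs = matmul(attn_weights, values)

for row in outputs:
    print([round(v, 4) for v in row])
# [1.9366, 6.6831, 1.5951]
# [2.0, 7.964, 0.054]
# [1.9997, 7.7599, 0.3584]
```

Multiplying the 3×3 weight matrix by the 3×3 value matrix performs Steps 6 and 7 for all three inputs at once, which is the vectorisation mentioned in the Illustrations section.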

Note
PyTorch has provided an API for this called nn.MultiheadAttention. However, this API requires that you feed in key, query and value PyTorch tensors. Moreover, the outputs of this module undergo a linear transformation.

3. Extending to Transformers

So, where do we go from here? Transformers! Indeed we live in exciting times of deep learning research and high compute resources. The transformer is the incarnation from Attention Is All You Need, originally born to perform neural machine translation. Researchers picked up from here, reassembling, cutting, adding and extending the parts, and extending its usage to more language tasks.

Here I will briefly mention how we can extend self-attention to a Transformer architecture.

Within the self-attention module:

Dimension
Bias

Inputs to the self-attention module:

Embedding module
Positional encoding
Truncating
Masking

Adding more self-attention modules:

Multihead
Layer stacking

Modules between self-attention modules:

Linear transformations
LayerNorm

That’s all folks! Hope you find the content easy to digest. Is there something that you think I should add or elaborate on further in this article? Do drop a comment!

References

Attention Is All You Need (arxiv.org)

The Illustrated Transformer (jalammar.github.io)

Attn: Illustrated Attention (towardsdatascience.com)

Special thanks to Xin Jie, Serene, Ren Jie, Kevin and Wei Yih for ideas, suggestions and corrections to this article.

Raimi Karim

🇸🇬 Software Engineer at GovTech • MComp AI at NUS

Sentiment Analysis using Partial Textual Entailment

The internet is an ocean of raw data in the form of social text messages, tweets, blogs etc. - arbitrary and unconnected (at the first glance).

Only after online text snippets are processed, organized and structured does the data become information.

From the literature survey, we observed that the present sentiment analysis tools are slow, bulky and computationally heavy, which makes the task at hand inefficient.

Therefore, to overcome the aforementioned problem of analyzing sentiments efficiently, a new method is proposed in the present paper to drive the task of Sentiment Analysis by exploiting the idea of Partial Textual Entailment.

We propose to use Partial Textual Entailment for measuring semantic similarity between the tweets so as to group similar tweets together.

The method is anticipated to reduce the burden on the sentiment analyzer and make the processing faster.

Moreover, we also propose a modification in an existing method of Partial Textual Entailment which can be further adopted for many Natural Language Processing applications.

https://ieeexplore.ieee.org/document/8862241

What is One Hot Encoding?

Machine learning algorithms cannot work with categorical data directly.

Categorical data must be converted to numbers.

In this tutorial, you will discover how to convert your input or output sequence data to a one hot encoding for use in sequence classification problems with deep learning in Python.

After completing this tutorial, you will know:

(1) What an integer encoding and one hot encoding are and why they are necessary in machine learning.

(2) How to calculate an integer encoding and one hot encoding by hand in Python.

(3) How to use the scikit-learn and Keras libraries to automatically encode your sequence data in Python.
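To make point (2) concrete, here is a minimal sketch of an integer encoding followed by a one hot encoding done by hand in plain Python. The alphabet (lowercase letters plus space) and the example string are assumptions chosen for illustration, not taken from the tutorial.

```python
# Assumed vocabulary: lowercase letters plus the space character.
alphabet = "abcdefghijklmnopqrstuvwxyz "
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}

data = "hello world"  # hypothetical input sequence

# Integer encoding: map each character to its index in the alphabet.
integer_encoded = [char_to_int[c] for c in data]
print(integer_encoded)  # [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]

# One hot encoding: a binary vector per character with a single 1 at that index.
one_hot = [[1 if i == idx else 0 for i in range(len(alphabet))]
           for idx in integer_encoded]

# Invert the encoding by locating the 1 in each vector.
decoded = "".join(int_to_char[row.index(1)] for row in one_hot)
print(decoded)  # hello world
```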

https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
