RoboTech

Tools and Techniques for Text Mining and Visualization

This chapter covers 19 popular open-access text mining and visualization tools, including R, Topic-Modeling-Tool, RapidMiner, WEKA, Orange, Voyant Tools, Gephi, Tableau Public, Infogram, and Microsoft Power BI, among others, with their applications, pros, and cons. As there are many text mining and visualization tools available, we covered only those open-source tools that have a simple GUI so that information professionals who are new to these tools can learn to use and implement them in their daily work.

https://link.springer.com/chapter/10.1007/978-3-030-85085-2_10

References

Yang Q, Zhang X, Du X, Bielefield A, Liu Y (2016) Current market demand for core competencies of librarianship—a text mining study of American Library Association’s Advertisements from 2009 through 2014. Appl Sci 6(2):48. https://doi.org/10.3390/app6020048
Lee J, Lapira E, Bagheri B, Kao H (2013) Recent advances and trends in predictive manufacturing systems in big data environment. Manuf Lett 1(1):38–41. https://doi.org/10.1016/j.mfglet.2013.09.005
Noh Y (2015) Imagining Library 4.0: creating a model for future libraries. J Acad Librariansh 41(6):786–797. https://doi.org/10.1016/j.acalib.2015.08.020
Abinaya G, Winster SG (2014) Event identification in social media through latent Dirichlet allocation and named entity recognition. In: Proceedings of IEEE international conference on computer communication and systems ICCCS14, pp 142–146. https://doi.org/10.1109/ICCCS.2014.7068182
Google Code Archive. Long-term storage for Google Code Project Hosting. https://code.google.com/archive/p/topic-modeling-tool/wikis/TopicModelingTool.wiki. Accessed 12 Aug 2020
Nguyen G, Dlugolinsky S, Bobák M, Tran V, López García Á, Heredia I, Malík P, Hluchý L (2019) Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif Intell Rev 52:77–124. https://doi.org/10.1007/s10462-018-09679-z
Sci2 Team (2009) Science of Science (Sci2) Tool. Indiana University and SciTech Strategies. https://sci2.cns.iu.edu
LancsBox Manual (2020) http://corpora.lancs.ac.uk/lancsbox/docs/pdf/LancsBox_5.1_manual.pdf. Accessed 2 Apr 2021
Gephi (2017) https://gephi.org/users/download/. Accessed 12 Aug 2020
Power BI Tutorial (2021) https://data-flair.training/blogs/power-bi-tutorial/. Accessed 2 Apr 2021
RAWGraphs (2021) https://rawgraphs.io/. Accessed 2 Apr 2021

Word embeddings tutorial

This tutorial introduces word embeddings. You will train your own word embeddings using a simple Keras model for a sentiment classification task, and then visualize them in the Embedding Projector.

https://www.tensorflow.org/text/guide/word_embeddings
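
As a rough illustration of what an embedding gives you once it is trained, the sketch below looks words up in a small table of vectors and compares them by cosine similarity. The vocabulary and the three-dimensional vectors are made-up values chosen for illustration; they are not output of the Keras model from the tutorial, and the helper names are hypothetical.

```python
import math

# Hypothetical, hand-written 3-dimensional embeddings (illustrative only;
# real embeddings are learned during training and have many more dimensions).
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the other vocabulary word whose vector is closest to `word`."""
    target = EMBEDDINGS[word]
    candidates = [w for w in EMBEDDINGS if w != word]
    return max(candidates, key=lambda w: cosine_similarity(target, EMBEDDINGS[w]))


print(most_similar("good"))
```

With these toy vectors, words with similar sentiment point in similar directions, which is the kind of structure the Embedding Projector lets you inspect visually.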

Recurrent Neural Network problems and how Transformers solve the problems

1. Challenges with RNNs and how Transformer models can help overcome those challenges

1.1. RNN problem 1: RNNs struggle with long-range dependencies, so they do not work well with long text documents.

Transformer Solution: Transformer networks rely almost exclusively on attention blocks. Attention draws a direct connection between any two positions in the sequence, so a long-range dependency is as likely to be taken into account as a short-range one.
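
The idea that attention connects any two positions directly can be sketched with single-head scaled dot-product attention. This is a minimal sketch in plain Python, assuming the queries, keys, and values are already given as lists of vectors; it leaves out the learned projections, multiple heads, and masking that a real Transformer layer uses.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    width = len(values[0])
    outputs = []
    for q in queries:
        # Every query is scored against every key, whatever their distance.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
        )
    return outputs


# A query that matches the FIRST key, with three unrelated positions after it.
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [0.0], [0.0], [0.0]]
print(attention([[10.0, 0.0]], keys, values))
```

In the usage example the query attends almost entirely to the first position even though other positions sit in between, because the score depends on content rather than on distance.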

1.2. RNN problem 2: RNNs suffer from vanishing and exploding gradients.

Transformer Solution: Vanishing or exploding gradients are rarely an issue. A Transformer processes the entire sequence at once rather than step by step, so gradients do not have to pass through one step per token, and the depth of the network grows only with the number of stacked layers.
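
The vanishing and exploding behaviour in an RNN can be illustrated with simple arithmetic: backpropagating through many time steps multiplies the gradient by roughly the same factor once per step. The sketch below uses a single scalar factor as a stand-in for the recurrent Jacobian; the values 0.9 and 1.1 and the function name are arbitrary assumptions chosen for illustration.

```python
def gradient_scale(factor, steps):
    """Gradient magnitude after being multiplied by `factor` once per time step."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale


for steps in (10, 100):
    shrinking = gradient_scale(0.9, steps)
    growing = gradient_scale(1.1, steps)
    print(f"{steps:>3} steps: 0.9 -> {shrinking:.2e}, 1.1 -> {growing:.2e}")
```

A factor only slightly below 1 drives the gradient towards zero over a long sequence, and a factor only slightly above 1 makes it blow up. When the number of multiplications depends on a fixed layer count instead of the sequence length, this compounding is much more limited.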

1.3. RNN problem 3: RNNs need more training steps to reach a local or global minimum. An RNN can be visualized as a very deep unrolled network whose depth depends on the length of the sequence. The same parameters are reused at every step of this unrolled network, so their updates are tightly interlinked. As a result, optimization takes many steps and a long time.

Transformer Solution: A Transformer requires fewer training steps than an RNN.

1.4. RNN problem 4: RNNs do not allow parallel computation. GPUs are built for parallel computation, but RNNs are sequence models: every computation in the network depends on the previous step, so the work cannot be parallelized across the sequence.

Transformer Solution: Transformer networks have no recurrence, so the computation for every position in the sequence can be done in parallel.
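
The difference between a recurrent and a recurrence-free update can be sketched as follows. This is a simplified illustration, not a Transformer layer: the scalar update rule, the `weight` parameter, and both function names are assumptions made for the example. The thread pool only demonstrates that the positions are independent of each other; in practice the parallelism comes from matrix operations on a GPU.

```python
from concurrent.futures import ThreadPoolExecutor


def rnn_states(inputs, weight=0.5):
    """Recurrent update: each state needs the previous one, so the loop is sequential."""
    state = 0.0
    states = []
    for x in inputs:
        state = weight * state + x  # depends on the state from the previous step
        states.append(state)
    return states


def position_wise(inputs, weight=0.5):
    """Recurrence-free update: each position depends only on its own input."""

    def transform(x):
        return weight * x

    # Positions are independent, so they can be handed to workers in any order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(transform, inputs))


print(rnn_states([1.0, 1.0, 1.0]))
print(position_wise([1.0, 2.0, 3.0]))
```

In `rnn_states`, step 3 cannot start until step 2 has finished. In `position_wise`, nothing forces an order, which is the property that lets a Transformer process all positions of a sequence at the same time.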

https://towardsdatascience.com/all-you-need-to-know-about-attention-and-transformers-in-depth-understanding-part-1-552f0b41d021

https://www.kdnuggets.com/2019/08/deep-learning-transformers-attention-mechanism.html

Age and gender in language, emoji, and emoticon usage in instant messages

Highlights

• We analyze a dataset of 309,229 WhatsApp instant messages (N = 226).

• We identify age- and gender-linked variations in emoji, emoticon, and language usage.

• We use machine learning algorithms to significantly predict age and gender.

• We identify the most predictive language features.

• We discuss implications for user privacy in instant messaging.

Abstract

Text is one of the most prevalent types of digital data that people create as they go about their lives.

Digital footprints of people's language usage in social media posts were found to allow for inferences of their age and gender.

However, the even more prevalent and potentially more sensitive text from instant messaging services has remained largely uninvestigated.

We analyze language variations in instant messages with regard to individual differences in age and gender by replicating and extending the methods used in prior research on social media posts.

Using a dataset of 309,229 WhatsApp messages from 226 volunteers, we identify unique age- and gender-linked language variations.

We use cross-validated machine learning algorithms to predict volunteers' age (MAE_Md = 3.95, r_Md = 0.81, R²_Md = 0.49) and gender (Accuracy_Md = 85.7%, F1_Md = 0.67, AUC_Md = 0.82) significantly above baseline levels and identify the most predictive language features.

We discuss implications for psycholinguistic theory, present opportunities for application in author profiling, and suggest methodological approaches for making predictions from small text data sets.

Given the recent trend towards the dominant use of private messaging and increasingly weaker user data protection, we highlight rising threats to individual privacy rights in instant messaging.

Keywords

Age, Gender, Author profiling, Instant messages, Machine learning, Digital footprints

https://www.sciencedirect.com/science/article/pii/S0747563221003137
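
For readers unfamiliar with the evaluation metrics named in the abstract above, the sketch below computes mean absolute error, accuracy, and F1 in plain Python and takes a median over folds. It assumes that the Md subscript in the abstract denotes a median over cross-validation runs; the ages and predictions are made-up numbers, not data from the study, and the function names are illustrative.

```python
from statistics import median


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-fold results (made-up numbers, not data from the study).
fold_ages_true = [[20, 30, 40], [25, 35, 45]]
fold_ages_pred = [[22, 27, 40], [24, 39, 45]]
fold_maes = [mean_absolute_error(t, p) for t, p in zip(fold_ages_true, fold_ages_pred)]
print("median MAE over folds:", median(fold_maes))
```

Accuracy and F1 can diverge when the classes are imbalanced, because F1 only rewards correct predictions of the positive class.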

The processing of emoji-word substitutions: A self-paced-reading study

Highlights

• Sentence comprehension does not suffer when emojis replace words.

• Emojis within sentences are processed like pictures.

• In special circumstances, emojis can activate an entire lexical entry, including phonological information.

Abstract

In computer-mediated communication, emojis can be used for various purposes.

As small graphical images, many emojis depict abstract or concrete objects ideogrammatically.

We report on a self-paced reading experiment of sentences containing emojis.

We tested to what extent emojis encode lexical meanings when used in a sentence context.

First, we confirm earlier findings that sentence comprehension does not suffer when emojis replace words.

Second, we show that in addition to the graphically encoded concept, emojis in some cases enable the retrieval of an entire lexical entry, including the phonological value of the associated word.

This means that even emojis depicting a noun homophonous with the target word, such as “palm (tree)” for “palm (of hand)”, can be interpreted correctly in context.

Based on measured differences in the reading times between words, emojis depicting the intended target referent, and emojis depicting a homophonous noun, we propose a context-dependent account of emoji interpretation.

Keywords

Emojis, Self-paced reading, Lexical ambiguity, Homonymy, Processing

https://www.sciencedirect.com/science/article/pii/S074756322100399X

Development and validation of the ‘Lebender emoticon PANAVA’ scale (LE-PANAVA) for digitally measuring positive and negative activation, and valence via emoticons

Highlights

• ‘Experience Sampling Method’ (ESM) requires short and validated non-verbal scales.

• The non-verbal ‘Lebender Emoticon PANAVA’ scale (LE-PANAVA) is presented.

• LE-PANAVA captures positive and negative activation (PA/NA), and valence (VA).

• LE-PANAVA is available for future ESM research and practical application.

Abstract

Positive and negative activation (PA/NA) represent two general activation systems of affect that are of importance for studying personality.

Many studies focus on state assessment of PA and NA in everyday situations, using the ‘Experience Sampling Method’ (ESM) performed via mobile devices.

ESM studies require short, reliable and validated non-verbal scales for immediate and fast capturing of personality and situation characteristics.

In this study we present the non-verbal ‘Lebender Emoticon PANAVA’ scale (LE-PANAVA), consisting of five items capturing PA, NA, and valence (VA).

LE-PANAVA is based on the 10-item verbal PANAVA-KS scale developed by Schallberger (2005).

The development of LE-PANAVA consisted of a three-step process: the graphical development and selection of a set of emoticons (study 1), the validation of the set of emoticons and corresponding adjustments to the scale (study 2), and the validation of the final scale (study 3).

We conclude from the results that LE-PANAVA captures the two factors PA and NA, but are aware that they are closely interrelated.

In addition to LE-PANAVA, an ultra-short version was derived, that is, a forced-choice 2 × 2 matrix of emoticons – the ‘Lebender Emoticon PANA Matrix’ (LE-PANA-M). Both LE-PANAVA and LE-PANA-M are available for future research and practical application.

Keywords

Positive activation, Negative activation, Valence, PANAVA, Emoticon, Experience sampling method (ESM)

https://www.sciencedirect.com/science/article/pii/S0191886920301124

A comparison of five methodological variants of emoji questionnaires for measuring product elicited emotional associations: An application with seafood among Chinese consumers

Highlights

• Emoji were used to measure emotional associations to seafood product names.

• Emoji profiles for mussels, lobster, squid and abalone differed among Chinese consumers.

• Emoji product profiles did not largely vary with question wording.

• Higher emoji citation frequencies were found with forced Yes/No and RATA questions than with CATA questions.

• RATA improved discrimination among the product stimuli compared to CATA.

Abstract

Product insights beyond hedonic responses are increasingly sought and include emotional associations.

Various word-based questionnaires for direct measurement exist and an emoji variant was recently proposed.

In this variant, emotion words are replaced with emoji conveying a range of emotions.

Further assessment of emoji questionnaires is needed to establish their relevance in food-related consumer research.

Methodological research contributes to this goal, and the present research considers the effects of question wording and response format.

Specifically, a web study was conducted with Chinese consumers (n = 750) using four seafood names as stimuli (mussels, lobster, squid and abalone).

Emotional associations were elicited using 33 facial emoji.

Explicit reference to “how would you feel?” in the question wording changed product emoji profiles minimally.

Consumers selected only a few emoji per stimulus when using CATA (check-all-that-apply) questions, and layout of the CATA question had only a small impact on responses.

A comparison of CATA questions with forced yes/no questions and RATA (rate-all-that-apply) questions revealed an increase in frequency of emoji use for yes/no questions, but not a corresponding improvement in sample discrimination.

For the stimuli in this research, which elicited similar emotional associations, RATA was probably the best methodological choice, with 8.5 emoji being used per stimulus, on average, and increased sample discrimination relative to CATA (12% vs. 6–8%).

The research provided additional support for the potential of emoji surveys as a method for measurement of emotional associations to foods and beverages and began contributing to development of guidelines for implementation.

Keywords

Emotion measurement, Emoticons, Research methods, Consumers, China, Seafood

https://www.sciencedirect.com/science/article/abs/pii/S0963996917301898

Tools and Techniques for Text Mining and Visualization

References

Word embeddings tutorial

Recurrent Neural Network problems and how Transformers solve the problems

Age and gender in language, emoji, and emoticon usage in instant messages

The processing of emoji-word substitutions: A self-paced-reading study

Development and validation of the ‘Lebender emoticon PANAVA’ scale (LE-PANAVA)

A comparison of five methodological variants of emoji questionnaires for measuring product elicited emotional associations