
1. Challenges with RNNs and how Transformer models can help overcome those challenges

1.1. RNN problem 1 — RNNs struggle with long-range dependencies, so they do not work well on long text documents.

Transformer Solution — Transformer networks rely almost exclusively on attention blocks. Attention draws direct connections between any two parts of the sequence, so long-range dependencies are no longer a problem: a dependency between distant tokens is as likely to be captured as one between neighboring tokens.
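The idea above can be sketched numerically. The snippet below is a minimal, unprojected self-attention (real transformers first map the input through learned query/key/value projections; those are omitted here for brevity): every pair of positions gets an attention weight in a single step, so the path between the first and last token has length one, regardless of how far apart they are.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over a (seq_len, d) sequence.

    Queries, keys, and values are the raw inputs themselves — a
    simplification; real transformers apply learned projections first.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)        # (seq_len, seq_len): score for every pair
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))              # toy sequence of 6 tokens
out, w = self_attention(x)
# w[0, 5] is the weight token 0 puts on token 5: the connection between the
# first and last position is a single attention step, not 5 recurrent hops.
print(out.shape, w.shape)
```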

1.2. RNN problem 2 — Suffers from gradient vanishing and gradient explosion.

Transformer Solution — Transformers show little to no gradient vanishing or explosion. Because the entire sequence is processed at once rather than step by step, the effective depth of the network is a fixed, small number of layers instead of growing with the sequence length, and residual connections around each sub-layer give gradients a direct path back to the input. So gradient vanishing or explosion is rarely an issue.
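One way to see why fixed depth with residual connections helps: the Jacobian of a residual layer y = x + f(x) contains an identity term, so gradients do not have to pass through a long chain of shrinking weight matrices. A toy numerical sketch (random small weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, d = 50, 8
# Small random weights, so a plain deep stack shrinks the signal layer by layer.
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(depth)]

g_plain = np.eye(d)   # accumulated Jacobian of a plain stack: y = W x
g_resid = np.eye(d)   # accumulated Jacobian of a residual stack: y = x + W x
for W in Ws:
    g_plain = W @ g_plain
    g_resid = (np.eye(d) + W) @ g_resid  # identity term keeps a direct path

# The plain product collapses toward zero (vanishing gradient); the residual
# product stays at a useful magnitude thanks to the identity path.
print(np.linalg.norm(g_plain), np.linalg.norm(g_resid))
```

This is only a linear caricature of gradient flow, but it mirrors why an unrolled RNN (depth proportional to sequence length, shared weights) is far more fragile than a transformer's fixed, residually connected stack.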

1.3. RNN problem 3 — RNNs need more training steps to reach a local/global minimum. An RNN can be visualized as an unrolled network that is very deep, with a depth that depends on the length of the sequence. This gives rise to many parameters, most of which are interlinked with one another. As a result, optimization takes longer to train and requires many steps.

Transformer Solution — Transformers require fewer training steps than RNNs to converge.

1.4. RNN problem 4 — RNNs do not allow parallel computation. GPUs excel at parallel computation, but an RNN is a sequential model: each step's computation depends on the previous step's output, so the work cannot be parallelized across time.

Transformer Solution — The absence of recurrence in transformer networks allows parallel computation: every position in the sequence can be processed at the same time.
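The contrast can be made concrete. In the sketch below (toy dimensions, illustrative weights only), the RNN loop has a hard serial dependency — step t cannot start before step t-1 finishes — while the transformer-style computation is one matrix product over all time steps, exactly the shape of work a GPU parallelizes well:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 8, 4
x = rng.normal(size=(seq_len, d))        # toy input sequence

# RNN: each hidden state depends on the previous one — an inherently serial loop.
Wx = rng.normal(scale=0.1, size=(d, d))
Wh = rng.normal(scale=0.1, size=(d, d))
h = np.zeros(d)
for t in range(seq_len):                 # step t must wait for step t-1
    h = np.tanh(x[t] @ Wx + h @ Wh)

# Transformer-style: one matrix product touches all positions at once, so
# every time step is computed independently and can run in parallel.
Wq = rng.normal(scale=0.1, size=(d, d))
out = x @ Wq                             # all seq_len rows computed together
print(h.shape, out.shape)
```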


https://www.kdnuggets.com/2019/08/deep-learning-transformers-attention-mechanism.html