Some of the most powerful NLP models, like BERT and GPT-2, have one particular component in common: they all use the transformer architecture. Deep learning transformers are a very powerful tool that performs well across several domains, and the machine learning community already deploys them widely. The component common to all transformer architectures is the self-attention mechanism.
The most powerful aspect of self-attention is that it is completely unsupervised: no labelled data is required to apply it. As in any unsupervised method, all one needs is the input data.
Moreover, the self-attention mechanism is based on a very simple linear algebra operation that modern CPUs (and GPUs) can perform very quickly: the dot product between matrices, i.e. matrix multiplication.
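To make that concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. The function name, the random projection matrices and the tiny inputs are just for illustration; the point is that everything reduces to matrix products plus a row-wise softmax.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q = X @ Wq  # queries
    K = X @ Wk  # keys
    V = X @ Wv  # values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # dot products between every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax, row by row
    return weights @ V                               # each output is a weighted mix of the values

# Tiny example: a "sequence" of 4 inputs, each a 6-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
Wq, Wk, Wv = (rng.normal(size=(6, 6)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # -> (4, 6)
```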
In this episode I explain some technical aspects of these methods and the reasons behind their effectiveness. Why do they actually work so well?
With a very simple example from the field of recommender systems, I explain the nuts and bolts of both self-attention and the transformer architecture built on top of it. Needless to say, such an approach can be applied to many domains, from images to sound, text and, of course, numerical samples.
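As a flavour of what a recommender-style toy setup can look like (the items and embedding values below are made up for illustration, not the exact example from the episode), one can compute self-attention weights over a user's session of item vectors and see which items attend to which:

```python
import numpy as np

def attention_weights(X):
    """Plain self-attention weights for a session matrix X of shape (n_items, d)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])      # dot-product similarity between items
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)     # softmax: how much each item attends to the others

# Hypothetical session: embeddings for 3 items a user interacted with (values invented).
items = ["sci-fi movie", "space documentary", "cooking show"]
E = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.1],
              [0.0, 0.1, 0.9]])

W = attention_weights(E)
for name, row in zip(items, W):
    print(name, np.round(row, 2))  # the two space-related items attend more to each other than to the cooking show
```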
Don’t forget to subscribe to our Newsletter to receive our updates straight to your inbox (spam not included!).
Wouldn’t it be great to discuss previous episodes, or propose new ones? We are sure there is that one topic you would like to know more about. Come join the discussion on our Discord server and chat with us.
References
- Attention is all you need https://arxiv.org/abs/1706.03762
- The illustrated transformer https://jalammar.github.io/illustrated-transformer
- Self-attention for generative models http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture14-transformers.pdf