blog.eleuther.ai
harvardnlp.github.io
[AI summary] The provided code is a comprehensive implementation of the Transformer model, including data loading, model architecture, training, and visualization. It also includes functions for decoding and for visualizing attention mechanisms across different layers of the model. The code is structured to support both training and inference, with examples provided for running the model and visualizing attention patterns.
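This is not the notebook's own code, just a minimal PyTorch sketch of pulling per-head attention weights out of a single layer, the kind of tensor that per-layer attention visualizations are built from; the layer sizes and tensor names here are illustrative.

```python
# Minimal sketch (not the Annotated Transformer's code): run one multi-head
# attention layer and keep the per-head weights so they can be plotted later.
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 10
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)           # one example sequence
out, weights = attn(x, x, x,                   # self-attention: q = k = v
                    need_weights=True,
                    average_attn_weights=False) # per-head maps (recent PyTorch)

print(out.shape)      # (1, 10, 64)   attended representation
print(weights.shape)  # (1, 4, 10, 10) one (seq_len x seq_len) map per head
```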
peterbloem.nl
[AI summary] The text provides an in-depth overview of the Transformer architecture, its evolution, and its applications. It begins by introducing the Transformer as a foundational model for sequence modeling, highlighting its ability to handle long-range dependencies through self-attention mechanisms. The text then explores various extensions and improvements, such as the introduction of positional encodings, the development of models like Transformer-XL and Sparse Transformers to address the quadratic complexity of attention, and the use of techniques like gradient checkpointing and half-precision training to scale up model size. It also discusses the generality of the Transformer, its potential in multi-modal learning, and its future implications across d...
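As a rough illustration (a simplification, not the post's own code), the sketch below shows the two building blocks the summary mentions: scaled dot-product self-attention, whose seq_len x seq_len score matrix is the source of the quadratic cost, and sinusoidal positional encodings that inject order information. Learned query/key/value projections and multiple heads are omitted.

```python
import numpy as np

def self_attention(X):
    """X: (seq_len, d) -- every position attends to every other position."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # (seq_len, seq_len): quadratic in length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X                              # weighted mix of all positions

def positional_encoding(seq_len, d):
    """Fixed sin/cos encodings added to embeddings so order is visible."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X = np.random.randn(8, 16) + positional_encoding(8, 16)
print(self_attention(X).shape)   # (8, 16)
```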
teddykoker.com
Google AI recently released a paper, Rethinking Attention with Performers (Choromanski et al., 2020), which introduces Performer, a Transformer architecture that estimates full-rank attention using orthogonal random features to approximate the softmax kernel with linear space and time complexity. In this post we will investigate how this works, and how it is useful for the machine learning community.
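A rough sketch of the idea (not the paper's FAVOR+ implementation): approximate the softmax kernel exp(q . k) with positive random features, then rearrange attention so the seq_len x seq_len matrix is never formed. The feature count, scaling, and variable names below are illustrative, and the plain Gaussian features stand in for the orthogonalized ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4096                      # model dim, number of random features

def phi(X, W):
    """Positive random features: exp(q @ k) ~= phi(q) @ phi(k) in expectation."""
    return np.exp(X @ W.T - np.sum(X * X, axis=-1, keepdims=True) / 2) / np.sqrt(W.shape[0])

W = rng.standard_normal((m, d))      # Performer additionally orthogonalizes these rows

# Kernel approximation on a single query/key pair (kept small for low variance).
q = 0.1 * rng.standard_normal(d)
k = 0.1 * rng.standard_normal(d)
print(np.exp(q @ k), phi(q, W) @ phi(k, W))   # the two numbers should be close

# Linear-complexity attention: contract keys with values first (m x d),
# so the (L x L) attention matrix is never materialized.
L = 128
Q, K = 0.1 * rng.standard_normal((L, d)), 0.1 * rng.standard_normal((L, d))
V = rng.standard_normal((L, d))
Qp, Kp = phi(Q, W), phi(K, W)                 # (L, m)
out = (Qp @ (Kp.T @ V)) / (Qp @ Kp.sum(axis=0))[:, None]
print(out.shape)                              # (L, d)
```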
bdtechtalks.com
The transformer model has become one of the main highlights of advances in deep learning and deep neural networks.