nlp.seas.harvard.edu

teddykoker.com
This post is the first in a series of articles about natural language processing (NLP), a subfield of machine learning concerning the interaction between computers and human language. This article focuses on attention, a mechanism that forms the backbone of many state-of-the-art language models, including Google's BERT (Devlin et al., 2018) and OpenAI's GPT-2 (Radford et al., 2019).
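
To ground the idea the snippet refers to, here is a minimal sketch of scaled dot-product attention, the mechanism behind models like BERT and GPT-2. The function name, tensor shapes, and dimensions are illustrative assumptions, not code from the linked post.

```python
# Minimal scaled dot-product attention sketch (illustrative; not from the linked post).
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) -- assumed shapes for this sketch
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Normalize the scores into attention weights over the keys
    weights = torch.softmax(scores, dim=-1)
    # Return the weighted sum of the values
    return weights @ v

# Tiny usage example with random tensors
q = k = v = torch.randn(2, 5, 16)   # batch of 2, 5 tokens, 16 dimensions
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 16])
```
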
jalammar.github.io
Discussions: Hacker News (65 points, 4 comments), Reddit r/MachineLearning (29 points, 3 comments). Translations: Arabic, Chinese (Simplified) 1, Chinese (Simplified) 2, French 1, French 2, Italian, Japanese, Korean, Persian, Russian, Spanish 1, Spanish 2, Vietnamese. Watch: MIT's Deep Learning State of the Art lecture referencing this post. Featured in courses at Stanford, Harvard, MIT, Princeton, CMU, and others. Update: This post has now become a book! Check out LLM-book.com, which contains (Chapter 3) an updated and expanded version of this post covering the latest Transformer models and how they've evolved in the seven years since the original Transformer (like Multi-Query Attention and RoPE positional embeddings). In the previous post, we looked at Att...

comsci.blog
In this tutorial, we will implement transformers step by step and understand how they work. There are other great tutorials on implementing transformers, but they usually dive into the complex parts too early: they start with additions such as masks and multi-head attention, which are not very intuitive without first building the core of the transformer.
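
In that spirit, here is a minimal sketch of the kind of core the tutorial alludes to: a single-head attention layer followed by a position-wise feed-forward network, with no masks and no multi-head splitting. The class name, layer structure, and dimensions are assumptions for illustration, not the tutorial's actual code.

```python
# Minimal single-head encoder block sketch (assumed structure, not the tutorial's code).
import torch
import torch.nn as nn

class CoreEncoderBlock(nn.Module):
    def __init__(self, d_model=64, d_ff=256):
        super().__init__()
        # Single-head attention projections (no multi-head splitting, no masks)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Position-wise feed-forward network
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        attn = torch.softmax(scores, dim=-1) @ v
        x = self.norm1(x + attn)           # residual connection + layer norm
        return self.norm2(x + self.ff(x))  # residual connection + layer norm

# Usage: a batch of 2 sequences, 10 tokens each, d_model=64
x = torch.randn(2, 10, 64)
print(CoreEncoderBlock()(x).shape)  # torch.Size([2, 10, 64])
```
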
saturncloud.io
By combining Dask and PyTorch, you can easily speed up training a model across a cluster of GPUs. But how much of a benefit does that bring? This blog post finds out!