Outer Web | Explore

Explore >> Select a destination

You are here		blog.evjang.com Eric Jang: Meta-Learning in 50 Lines of JAX
\|	\|	teddykoker.com Learning to Learn with JAX \| Teddy Koker	7.7 parsecs away Travel
\|	\|	Gradient-descent-based optimizers have long been used as the optimization algorithm of choice for deep learning models. Over the years, various modifications to the basic mini-batch gradient descent have been proposed, such as adding momentum or Nesterov's Accelerated Gradient (Sutskever et al., 2013), as well as the popular Adam optimizer (Kingma & Ba, 2014). The paper Learning to Learn by Gradient Descent by Gradient Descent (Andrychowicz et al., 2016) demonstrates how the optimizer itself can be replac...
\|	\|	questionableengineering.com Numpy LeNet 5 with ADAM \| Questionable Engineering	11.3 parsecs away Travel
\|	\|	John W Grun. Abstract: In this paper, a manually implemented LeNet-5 convolutional NN with an Adam optimizer written in Numpy will be presented. This paper will also cover a description of the data use
\|	\|	iclr-blogposts.github.io How to compute Hessian-vector products? \| ICLR Blogposts 2024	9.9 parsecs away Travel
\|	\|	The product between the Hessian of a function and a vector, the Hessian-vector product (HVP), is a fundamental quantity to study the variation of a function. It is ubiquitous in traditional optimization and machine learning. However, the computation of HVPs is often considered prohibitive in the context of deep learning, driving practitioners to use proxy quantities to evaluate the loss geometry. Standard automatic differentiation theory predicts that the computational complexity of an HVP is of the same order of magnitude as the complexity of computing a gradient. The goal of this blog post is to provide a practical counterpart to this theoretical result, showing that modern automatic differentiation frameworks, JAX and PyTorch, allow for efficient computation of these HVPs in standard deep learning cost functions.
\|	\|	windowsontheory.org Yet another backpropagation tutorial - Windows On Theory	47.9 parsecs away Travel
\|		(Updated and expanded 12/17/2021) I am teaching deep learning this week in Harvard's CS 182 (Artificial Intelligence) course. As I'm preparing the back-propagation lecture, Preetum Nakkiran told me about Andrej Karpathy's awesome micrograd package which implements automatic differentiation for scalar variables in very few lines of code. I couldn't resist using this to show how...