I’m a third-year PhD student in Computer Science at NYU, where I’m advised by Joan Bruna and Chris Musco. My degree is supported by an NSF Graduate Research Fellowship. I’m affiliated with the Algorithms and Foundations, MaD, and CILVR groups. I also help organize the ML-NYC speaker series at the Flatiron Institute.
I study algorithms mathematically. The main themes in my research are linear algebra, approximation theory, and deep learning. I'm also broadly interested in scientific computing and optimization.
Previously, I was a research engineer at Reservoir Labs (now Qualcomm), working on modeling congestion control in communication networks. Before that, I was an undergrad at Yale, where I worked with Bob Frank as part of the CLAY lab and was advised by Dan Spielman. I've spent summers at Adobe Research, Polymathic, and the Weizmann Institute, where I worked on phylogenetic inference with Boaz Nadler and Ariel Jaffe.
Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. Recently, it has emerged as an important subroutine within the Muon algorithm for training deep neural networks. However, the requirements of this application differ sharply from classical settings: deep learning demands GPU-friendly algorithms that prioritize high throughput over high precision. We introduce Polar Express, a new method for computing the polar decomposition. Like Newton-Schulz and other classical polynomial methods, our approach uses only matrix-matrix multiplications, making it very efficient on GPUs. Inspired by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the update rule at each iteration by solving a minimax optimization problem. We prove that this strategy minimizes error in a worst-case sense, allowing Polar Express to converge as rapidly as possible both in the early iterations and asymptotically. We also address finite-precision issues, making it practical to use in bfloat16. When integrated into the Muon training framework, our method leads to consistent improvements in validation loss when training a GPT-2 model on one billion tokens from the FineWeb dataset, outperforming recent alternatives across a range of learning rates.
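For readers unfamiliar with this family of methods, here is a minimal sketch of the classical Newton-Schulz iteration mentioned above, not the Polar Express update rule itself; the iteration count and test matrix are chosen only for illustration.

```python
import numpy as np

def newton_schulz_polar(A, num_iters=30):
    """Classical Newton-Schulz iteration for the orthogonal polar factor of A.

    Each step uses only matrix-matrix products, which is what makes this
    family of methods GPU-friendly. Convergence requires the singular values
    of the initial iterate to lie in (0, sqrt(3)), so we scale by the
    Frobenius norm first. For ill-conditioned A the early iterations make
    slow progress; speeding up that phase is what adaptive coefficients
    (as in Polar Express) address.
    """
    X = A / np.linalg.norm(A)          # Frobenius norm >= spectral norm
    I = np.eye(A.shape[1])
    for _ in range(num_iters):
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X

# Sanity check: the output should be (nearly) orthogonal.
A = np.random.randn(50, 50)
U = newton_schulz_polar(A)
print(np.linalg.norm(U.T @ U - np.eye(50)))
```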
ICLR
Quality over Quantity in Attention Layers: When Adding More Heads Hurts
Attention-based mechanisms are widely used in machine learning, most prominently in transformers. However, hyperparameters such as the rank of the attention matrices and the number of heads are scaled nearly the same way in all realizations of this architecture, without theoretical justification. In this work we show that there are dramatic trade-offs between the rank and number of heads of the attention mechanism. Specifically, we present a simple and natural target function that can be represented using a single full-rank attention head for any context length, but that cannot be approximated by low-rank attention unless the number of heads is exponential in the embedding dimension, even for short context lengths. Moreover, we prove that, for short context lengths, adding depth allows the target to be approximated by low-rank attention. For long contexts, we conjecture that full-rank attention is necessary. Finally, we present experiments with off-the-shelf transformers that validate our theoretical findings.
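To make the rank-versus-heads trade-off concrete: in a standard multi-head layer with embedding dimension d and h heads, each head projects queries and keys to dimension d/h, so its score matrix has rank at most d/h. A toy illustration (the dimensions here are made up):

```python
import numpy as np

# Toy illustration of why more heads means lower rank per head: with
# embedding dimension d and h heads, each head projects queries and keys
# to dimension d // h, so its n x n score matrix has rank at most d // h.
d, h, n = 64, 8, 32                      # dims chosen only for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))          # token embeddings

ranks = []
for _ in range(h):
    Wq = rng.standard_normal((d, d // h))
    Wk = rng.standard_normal((d, d // h))
    scores = (X @ Wq) @ (X @ Wk).T       # one head's pre-softmax scores
    ranks.append(np.linalg.matrix_rank(scores))
print(ranks)                             # every entry is at most d // h = 8
```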
NeurIPS
Nearly Optimal Approximation of Matrix Functions by the Lanczos Method
We study the Lanczos method for approximating the action of a symmetric matrix function $f(\mathbf A)$ on a vector $\mathbf b$ (Lanczos-FA). For the function $\mathbf A^{-1}$, it is known that the error of Lanczos-FA after $k$ iterations matches the error of the best approximation from the Krylov subspace of degree $k$ when $\mathbf A$ is positive definite. We prove that the same holds, up to a multiplicative approximation factor, when $f$ is a rational function with no poles in the interval containing $\mathbf A$'s eigenvalues. The approximation factor depends on the degree of $f$'s denominator and the condition number of $\mathbf A$, but not on the number of iterations $k$. Experiments confirm that our bound accurately predicts the convergence of Lanczos-FA. Moreover, we believe that our result provides strong theoretical justification for the excellent practical performance that has long been observed for the Lanczos method, both for approximating rational functions and functions like $\mathbf A^{1/2} \mathbf b$ that are well approximated by rationals.
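For context, here is a bare-bones sketch of the Lanczos-FA approximation the abstract refers to (the textbook version, without reorthogonalization); the test problem at the bottom is made up for illustration.

```python
import numpy as np

def lanczos_fa(A, b, k, f):
    """Textbook Lanczos-FA sketch (no reorthogonalization).

    Runs k steps of the Lanczos iteration on symmetric A with starting
    vector b, then returns ||b|| * Q_k f(T_k) e_1 as the approximation
    to f(A) b.
    """
    n = len(b)
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k)
    q = b / np.linalg.norm(b)
    q_prev = np.zeros(n)
    for j in range(k):
        Q[:, j] = q
        w = A @ q
        alpha[j] = q @ w
        w = w - alpha[j] * q - (beta[j - 1] * q_prev if j > 0 else 0)
        beta[j] = np.linalg.norm(w)
        q_prev, q = q, w / beta[j]
    T = np.diag(alpha) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)
    # Apply f to the small tridiagonal matrix via its eigendecomposition.
    evals, evecs = np.linalg.eigh(T)
    fT_e1 = evecs @ (f(evals) * evecs[0, :])
    return np.linalg.norm(b) * (Q @ fT_e1)

# Example: approximate A^{-1} b for a random positive definite A.
n = 200
M = np.random.randn(n, n)
A = M @ M.T + n * np.eye(n)
b = np.random.randn(n)
approx = lanczos_fa(A, b, k=30, f=lambda x: 1.0 / x)
print(np.linalg.norm(approx - np.linalg.solve(A, b)))
```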
Miscellaneous
I have many interesting interests. I am relatable, and I like to have fun.
Contact
Please send me an email if you want to get in touch. If you're in my department, use Slack or drop by my desk at 60 5th Avenue, Room 442.