GPU Optimization Workshop

(T) Last month, I attended an excellent GPU optimization workshop organized by Chip Huyen. Chip is well known and very active in the Silicon Valley machine learning systems design community. She taught the class Machine Learning Systems Design at Stanford and is the author of two books, Designing Machine Learning Systems and AI Engineering.

Here are the presentations, reading materials, and the recording of the workshop:

Presentations

Crash course on GPU optimization (Mark Saroufim @ Meta)

Mark is a PyTorch core developer and cofounder of CUDA MODE. He also ran the really fun NeurIPS LLM Efficiency challenge last year. Previously, he was at Graphcore and Microsoft.

Mark will give an overview of why GPUs, the metrics that matter, and different GPU programming models (thread-based CUDA and block-based Triton). He promises this will be a painless guide to writing CUDA/Triton kernels! This talk will give us the basics to understand the rest of the workshop.

High-performance LLM serving on GPUs (Sharan Chetlur @ NVIDIA)

Sharan is a principal engineer working on TensorRT-LLM at NVIDIA. He’s been working on CUDA since 2012, optimizing the performance of deep learning models from a single GPU to a full data center scale. Previously, he was the Director of Engineering at Cerebras.

Sharan will discuss how to build performant, flexible solutions to optimize LLM serving given the rapid evolution of new models and techniques. The talk will cover optimization techniques such as token concatenation, different strategies for batching, and caching.

Block-based GPU Programming with Triton (Philippe Tillet @ OpenAI)

Philippe is currently leading the Triton team at OpenAI. Previously, he was at pretty much all major chip makers including NVIDIA, AMD, Intel, and Nervana.

Philippe will explain how Triton works and how its block-based programming model differs from the traditional single instruction, multiple threads (SIMT) programming model that CUDA follows. Triton aims to be higher-level than CUDA while being more expressive (lower-level) than common graph compilers like XLA and Torch-Inductor.
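To make the difference between the two models a bit more concrete, here is a minimal sketch in plain Python that emulates both on a vector addition. It does not use CUDA or Triton and needs no GPU. The names `simt_add`, `block_add`, and `BLOCK_SIZE` are hypothetical and chosen only for illustration, and the loops run sequentially where real hardware would run in parallel.

```python
# Plain-Python emulation (no GPU needed) contrasting the two models on a vector add.
# simt_add, block_add and BLOCK_SIZE are hypothetical names, not CUDA or Triton APIs.

BLOCK_SIZE = 4  # assumed block size, chosen only for illustration


def simt_add(x, y):
    """SIMT style: the kernel body is written for ONE thread handling ONE element."""
    n = len(x)
    out = [0.0] * n
    num_blocks = (n + BLOCK_SIZE - 1) // BLOCK_SIZE

    def kernel(block_idx, thread_idx):
        i = block_idx * BLOCK_SIZE + thread_idx  # global element index
        if i < n:  # per-thread bounds check
            out[i] = x[i] + y[i]

    # Real hardware runs these threads in parallel; here we loop sequentially.
    for b in range(num_blocks):
        for t in range(BLOCK_SIZE):
            kernel(b, t)
    return out


def block_add(x, y):
    """Block style: the kernel body is written for ONE program handling a whole block."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + BLOCK_SIZE - 1) // BLOCK_SIZE

    def kernel(program_id):
        offsets = [program_id * BLOCK_SIZE + k for k in range(BLOCK_SIZE)]
        mask = [i < n for i in offsets]  # a mask replaces the per-thread bounds check
        # Masked load, block-wide add, masked store.
        xs = [x[i] if m else 0.0 for i, m in zip(offsets, mask)]
        ys = [y[i] if m else 0.0 for i, m in zip(offsets, mask)]
        zs = [a + b for a, b in zip(xs, ys)]
        for i, m, z in zip(offsets, mask, zs):
            if m:
                out[i] = z

    for pid in range(num_programs):
        kernel(pid)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
assert simt_add(x, y) == block_add(x, y)
```

In the first function you reason about a single element per thread; in the second you reason about a whole block of elements at once, which is roughly the level at which a Triton kernel is written.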

Scaling data processing from CPU to distributed GPUs (William Malpica @ Voltron Data)

William is a co-founder of Voltron Data and the creator of BlazingSQL. He helped scale Theseus, a GPU-native query engine, to handle 100TB queries!

Most people today use GPUs for training and inference. A category of workloads that GPUs excel at but are underutilized for is data processing. In this talk, William will discuss why large-scale data processing should be done on GPUs instead of CPUs and how different tools like cuDF, RAPIDS, and Theseus leverage GPUs for data processing.

Reading materials

Tools that were discussed in the workshop:

  1. Development repository for the Triton language and compiler
    1. Introducing Triton: Open-source GPU programming for neural networks
    2. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
  2. TensorRT and TensorRT-LLM
  3. Check out Mark’s lecture on profiling CUDA in PyTorch – Model Inference Optimization Checklist – Accelerating Generative AI with PyTorch: Segment Anything, Fast
  4. rapidsai/cudf – GPU DataFrame Library
  5. Benchmarking Report: Theseus Engine | Voltron Data

Recommended resources:

  1. How CUDA Programming Works – Stephen Jones, NVIDIA (great lecture)
  2. The Best GPUs for Deep Learning in 2023 — An In-depth Analysis (Tim Dettmers)
  3. CUDA MODE Discord. They have a great lecture series on GPU optimization.

Recording

Note: The picture above is a flower arrangement at Town & Country Village in Palo Alto.

Copyright © 2005-2024 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.
