Approaches and Silicon for Parallel Training and Inference of LLMs

(T) Even with all the annoying and disturbing hype in AI, I feel that Silicon Valley venture capitalists have not invested in the silicon required to build large-scale machine learning data pipelines. Or maybe I am wrong, and it is not a lack of VC interest and money, but rather a lack of entrepreneurs who deeply master not only the silicon but also the full stack, from modeling to ML libraries to data pipelines, and in particular the self-supervised and autoregressive models that have so profoundly changed the landscape of model training and inference over the last two years.

On the training side, the only start-up that I have known for a long time is Graphcore, which I discovered at the third Scaled Machine Learning conference @ Stanford University in 2018, where Graphcore CTO Simon Knowles gave a presentation, “Scaling Throughput Processors for Machine Intelligence”. And in 2020, also at the fifth Scaled Machine Learning conference, I attended a presentation from Sean Lie about Cerebras Wafer Scale Clusters.

On the inference side, two start-ups have recently emerged: Groq and Modular. Modular is focusing on developing a model inference runtime and API library – Max Engine – that executes any type of model on any CPU or GPU, while Groq is focusing on developing a chip – a language processing unit (LPU) – to deploy LLMs for inference. Groq will compete with GPU and CPU vendors for inference pipelines, while Modular will not.

In parallel, two Silicon Valley start-ups, SambaNova and Together, loosely rooted in Stanford University, have developed a combination of hardware and software offerings for enterprise development of generative AI. SambaNova has developed its own chip – the SN40L – that is used in its platform for both training and inference. That platform – the DataScale SN30 – can be deployed on premises or be used as a cloud service. Together re-uses Nvidia H100 and A100 GPUs for model training and fine-tuning. Both companies offer the most popular open-source LLMs.

SambaNova and Together are part of a broader group of start-ups offering various stacks and APIs, such as Anyscale, MosaicML/Databricks, and Perplexity.

In order to get ready for the sessions of the upcoming Nvidia GTC conference, following are a few notes on the latest chips for machine learning data pipelines and the present state-of-the-art techniques for parallel training and inference of large language models (LLMs).

Parallel training – key approaches:

  • Tensor parallelism: parallelize computation across multiple workers/GPUs within an operation such as matrix-matrix multiplication.
  • Data parallelism: the training data is split across multiple workers/GPUs, each holding a replica of the same model (i.e. the same parameters) and training on its own shard at the same time. If training is done, for instance, with synchronous stochastic gradient descent (SGD), the back-propagation gradients are aggregated across workers and the same parameter update is applied on each worker (see the sketch after this list).
  • Model parallelism: distribute the entire model over multiple workers/GPUs, with each worker holding and training only a subset (e.g. a fraction of the parameters) of the model.
  • Pipeline parallelism: combine model parallelism with data parallelism by splitting the model into sequential stages (groups of layers) and streaming micro-batches through them so that the stages compute concurrently. While tensor parallelism can be considered intra-layer parallelism, pipeline parallelism can be considered inter-layer parallelism.
  • Mixture-of-experts (MoE): an ensemble-style architecture where multiple “experts” can be used in parallel and a gating (router) network decides which expert(s) each input is sent to.
  • References:
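
As a concrete illustration of the data-parallelism bullet above, here is a minimal sketch of synchronous SGD with gradient all-reduce, assuming PyTorch and a torch.distributed process group that has already been initialized (e.g. by torchrun); the model, optimizer, batch, and loss function are placeholders. In practice, torch.nn.parallel.DistributedDataParallel wraps this synchronization and overlaps it with the backward pass.

```python
import torch
import torch.distributed as dist
from torch import nn


def data_parallel_step(model: nn.Module, optimizer, batch, loss_fn):
    """One synchronous-SGD step on one worker; every worker runs the same code
    on its own shard of the data and ends up with identical parameters."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                          # local gradients on this worker's shard
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Aggregate gradients across all workers, then average them.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()                         # identical update on every replica
    return loss.item()
```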

Inference – key approaches:

  • Model optimization: model compression techniques
    • Quantization: reduce the precision of the model’s weights and activations (see the sketch after this list).
    • Sparsity: prune weights or entire structures that contribute little to the model’s outputs so that they can be skipped or compressed at inference time.
    • Distillation: train a smaller model (called a student) to mimic the behavior of a larger model (a teacher).
  • References:
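
To make the quantization bullet above concrete, here is a minimal sketch of symmetric, per-tensor post-training quantization of a weight matrix to int8. It only illustrates the precision-for-memory trade-off; production stacks use finer-grained (per-channel or per-group) scales and calibrated activation quantization.

```python
import torch


def quantize_int8(weights: torch.Tensor):
    """Map float32 weights to int8 values plus a single scale factor."""
    scale = weights.abs().max() / 127.0
    q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 tensor when the weight is used."""
    return q.to(torch.float32) * scale


w = torch.randn(4096, 4096)                  # stand-in for an LLM weight matrix
q, scale = quantize_int8(w)                  # 4x smaller than float32 in memory
w_hat = dequantize_int8(q, scale)
print("max abs quantization error:", (w - w_hat).abs().max().item())
```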

LLMs Inference – key approaches:

  • Model serving:
    • In-flight batching: execute multiple different requests (e.g. generating the LLM outputs) at the same time, inserting new sequences into the batch and evicting finished ones on the fly instead of waiting for an entire batch to complete.
    • Speculative inference: execute multiple future steps of the sequence (required to generate the LLM outputs) in parallel, typically by having a small draft model propose tokens that the large model then verifies, to try to save time (see the sketch after this list).
  • References:
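
And to illustrate speculative inference, here is a minimal sketch of speculative decoding with greedy verification: a small draft model proposes k tokens and the large target model checks them, keeping the prefix that matches. The draft_next and target_next callables are hypothetical stand-ins for the greedy next-token functions of the two models; real systems verify against full probability distributions and score all k draft positions in a single batched forward pass of the target model.

```python
from typing import Callable, List


def speculative_step(prompt: List[int],
                     draft_next: Callable[[List[int]], int],   # cheap draft model
                     target_next: Callable[[List[int]], int],  # large target model
                     k: int = 4) -> List[int]:
    """Return the tokens accepted in one speculative decoding step."""
    # 1. The draft model proposes k tokens autoregressively (fast but approximate).
    ctx, proposals = list(prompt), []
    for _ in range(k):
        t = draft_next(ctx)
        proposals.append(t)
        ctx.append(t)
    # 2. The target model verifies each proposed position; accept until the
    #    first disagreement, then substitute the target model's own token.
    ctx, accepted = list(prompt), []
    for t in proposals:
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)           # match: keep the draft token
            ctx.append(t)
        else:
            accepted.append(expected)    # mismatch: take the target's token, stop
            break
    return accepted


# Hypothetical toy usage: the draft always guesses token 1, the target wants
# 1, 1, then 2, so the first two proposals are accepted.  -> [1, 1, 2]
print(speculative_step([0], lambda ctx: 1, lambda ctx: 1 if len(ctx) < 3 else 2))
```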

Latest Trends – smaller models and more efficient training and inference of larger models:

Meta and Mistral have both launched highly performant smaller models that require only 7 billion parameters while matching some of the capabilities of OpenAI’s GPT-3, with its 175 billion parameters, or Google’s PaLM, with its 540 billion parameters. It is a trend that Google is now embracing with the recent launch of its open-source Gemma models.

But while the trend is toward smaller models for specific use cases, larger closed models are becoming more efficient to train and to serve.

Google’s Gemini 1.5 Pro achieves comparable performance to Gemini 1.0 Ultra while using significantly less training compute and being significantly more efficient to serve. Like Gemini 1.0, it is trained on multiple 4096-chip pods of Google’s TPUv4 accelerators (note that Google’s PaLM was trained with the Pathways system on 6,144 TPUv4 chips across two TPU pods).

Gemini 1.5, with its multimodal mixture-of-experts architecture, is more compute efficient both in training and in inference, since only a subset of its experts is activated for each input (as illustrated in the sketch below). And the increase of its context window to one million tokens (10 million in research) has not resulted in higher compute requirements (for more details, see the Gemini 1.5 Pro technical report).
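
To illustrate why a mixture-of-experts layer is more compute efficient, here is a minimal, toy sketch of top-k routing in PyTorch: a learned gate scores the experts for each token and only the k highest-scoring experts are evaluated, so the compute per token grows much more slowly than the total parameter count. Layer sizes and class names are illustrative only; real MoE layers add load-balancing losses and efficient batched expert dispatch.

```python
import torch
from torch import nn


class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer with per-token top-k gating."""

    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)         # learned router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)       # (tokens, n_experts)
        top_w, top_i = scores.topk(self.k, dim=-1)         # keep only k experts per token
        out = torch.zeros_like(x)
        for token in range(x.shape[0]):
            for weight, idx in zip(top_w[token], top_i[token]):
                # Only k of the n_experts run for this token, so compute per token
                # stays roughly constant while total parameters grow with n_experts.
                out[token] += weight * self.experts[int(idx)](x[token])
        return out


moe = TopKMoE()
tokens = torch.randn(5, 64)                  # 5 tokens of width 64
print(moe(tokens).shape)                     # torch.Size([5, 64])
```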

Use case – Meta’s GenAI infrastructure:

Meta’s engineering team recently published a blog article on its infrastructure for generative AI:

  • "Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training.
  • We are strongly committed to open compute and open source. We built these clusters on top of Grand Teton, OpenRack, and PyTorch and continue to push open innovation across the industry.
  • This announcement is one step in our ambitious infrastructure roadmap. By the end of 2024, we’re aiming to continue to grow our infrastructure build-out that will include 350,000 NVIDIA H100 GPUs as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.”

Nvidia’s GPUs:

Google’s TPUs:

AMD’s CPUs & GPUs:

Graphcore’s IPUs:

Cerebras Systems’ Wafer Scale Clusters:

Groq’s LPUs:

Samba Nova’s SN40L:

Modular’s Inference Runtime & Library:

References

Note: The picture above is the Paly Choirs Madrigal Feaste from the kids at Palo Alto High School.

Copyright © 2005-2024 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.