Key Takeaways from Jensen Huang’s Nvidia GTC Keynote

(T) I just finished attending Jensen Huang’s keynote today at the Nvidia GTC conference at the SAP Center in San Jose. When announcing Nvidia’s new Blackwell GPU to the world, Mr. Huang was quite emotional on stage. Some of his announcements are technically quite impressive, but even more impressive are the breadth and depth of Nvidia’s ecosystems and partnerships. It seems that the technology world and the leading technology companies all revolve around Nvidia.

Nvidia is not only selling its GPU infrastructure for model training and inference to large cloud providers (AWS, Microsoft, Google…) but also directly to enterprises.

An enterprise can access pre-trained models from Nvidia through Nvidia NIM (Nvidia Inference Microservices). NIM is a containerized inference microservice that provides optimized inference engines for many open-source models, including domain-specific ones, and can be accessed through Nvidia enterprise APIs.
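
To make that concrete, here is a minimal sketch of what querying a NIM microservice can look like, assuming a locally deployed container exposing an OpenAI-compatible chat-completions endpoint; the URL, API key, and model name below are illustrative placeholders, not official values.

```python
# Minimal sketch: querying a NIM microservice through an
# OpenAI-compatible chat-completions endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local NIM container endpoint
    api_key="not-needed-locally",         # placeholder credential
)

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",      # example model name, for illustration
    messages=[{"role": "user", "content": "Summarize the GB200 announcement."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```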

But that is not all…

An enterprise can train and deploy its own customized generative AI models with Nvidia NeMo. NeMo provides APIs for the different stages of the training pipeline: curator (datasets), customizer (fine-tuning and alignment), evaluator (benchmarks), retriever (RAG), and guardrails (dialog management). NeMo leverages NIM for inference.
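
To illustrate how those stages compose, here is a hypothetical sketch of such a pipeline in Python; every function below is an illustrative stand-in for the corresponding NeMo stage, not the actual NeMo API.

```python
# Hypothetical sketch of a NeMo-style customization pipeline.
# All functions are illustrative stand-ins, NOT the actual NeMo API.

def curate(raw_docs: list[str]) -> list[str]:
    """Curator: deduplicate and filter raw training documents."""
    return sorted({doc.strip() for doc in raw_docs if doc.strip()})

def customize(base_model: str, dataset: list[str]) -> str:
    """Customizer: fine-tune and align the base model on the dataset."""
    return f"{base_model}-tuned-on-{len(dataset)}-docs"

def evaluate(model: str) -> float:
    """Evaluator: score the customized model against benchmarks."""
    return 0.87  # placeholder benchmark score

def deploy(model: str) -> str:
    """Guardrails + NIM: wrap the model for safe, optimized inference."""
    return f"nim://{model}?guardrails=on"

docs = ["GTC keynote transcript", "Blackwell datasheet", " "]
model = customize("base-llm", curate(docs))
print(evaluate(model), deploy(model))
```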

Here are my key takeaways from the keynote, which was mostly focused on Blackwell and Nvidia’s new system and application software. Many parts of the keynote featured beautiful, state-of-the-art computer-generated videos.

On a side note, the keynote was a two-hour performance given solely by Mr. Huang, who is sixty-one years old, without anyone else on stage. And, probably without any break after the keynote, Mr. Huang went on to explain how accelerated computing works to Jim Cramer on CNBC. Impressive!

Mr. Huang’s Key Premise:

  • GPT-4 ~ assumed to be a 1.8-trillion-parameter model trained on several trillion tokens
    • (Note that OpenAI has never publicly disclosed the number of parameters of GPT-4, and the only official document about GPT-4 is its technical report)
  • Computing scale required:
    • 30 to 50 billion quadrillion floating-point operations in total (roughly the number of parameters multiplied by the number of tokens; see the back-of-the-envelope check after this list)
    • A quadrillion operations per second ~ one petaflop
    • A one-petaflop GPU would need 30 billion seconds to train that model ~ about 1,000 years!
    • Is it worth waiting 1,000 years 🙂?
    • To do it next week instead => more powerful GPUs connected together…
    • So here is how to go from 4 petaflops with the H100 to 20 petaflops with the…
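
Here is that back-of-the-envelope arithmetic as a quick check; the parameter and token counts are the keynote’s assumptions, not figures disclosed by OpenAI.

```python
# Back-of-the-envelope check of the keynote arithmetic.
params = 1.8e12                 # assumed GPT-4 parameter count (never confirmed)
tokens = 13e12                  # assumed training tokens ("several trillion")

total_ops = params * tokens     # ~2.3e25 operations: tens of billions of quadrillions
petaflop = 1e15                 # one quadrillion operations per second

seconds = total_ops / petaflop  # time on a single one-petaflop GPU
years = seconds / (3600 * 24 * 365)
print(f"{total_ops:.1e} ops, {seconds:.1e} s, ~{years:,.0f} years")
# -> 2.3e+25 ops, 2.3e+10 s, ~742 years: on the order of 1,000 years
```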

New GB200 Grace Blackwell Superchip (named after UC Berkeley mathematician David Blackwell):

The Nvidia GB200 Grace Blackwell Superchip connects two Nvidia B200 Blackwell Tensor Core GPUs to the Nvidia Grace CPU over a 900GB/s ultra-low-power NVLink chip-to-chip interconnect.

Six key features:

  • World’s Most Powerful Chip – Packed with 208 billion transistors, Blackwell-architecture GPUs are manufactured using a custom-built 4NP TSMC process, with two reticle-limit GPU dies connected by a 10 TB/second chip-to-chip link into a single, unified GPU.
  • Second-Generation Transformer Engine – Fueled by new micro-tensor scaling support and Nvidia’s advanced dynamic range management algorithms integrated into the Nvidia TensorRT™-LLM and NeMo Megatron frameworks, Blackwell will support double the compute and model sizes with new 4-bit floating-point AI inference capabilities (a toy sketch of block-wise scaling follows this list).
  • Fifth-Generation NVLink – To accelerate performance for multitrillion-parameter and mixture-of-experts AI models, the latest iteration of Nvidia NVLink® delivers groundbreaking 1.8 TB/s bidirectional throughput per GPU, ensuring seamless high-speed communication among up to 576 GPUs for the most complex LLMs.
  • RAS Engine – Blackwell-powered GPUs include a dedicated engine for reliability, availability, and serviceability. Additionally, the Blackwell architecture adds capabilities at the chip level to utilize AI-based preventative maintenance to run diagnostics and forecast reliability issues. This maximizes system uptime and improves resiliency for massive-scale AI deployments to run uninterrupted for weeks or even months at a time and to reduce operating costs.
  • Secure AI – Advanced confidential computing capabilities protect AI models and customer data without compromising performance, with support for new native interface encryption protocols, which are critical for privacy-sensitive industries like healthcare and financial services.
  • Decompression Engine – A dedicated decompression engine supports the latest formats, accelerating database queries to deliver the highest performance in data analytics and data science. In the coming years, data processing, on which companies spend tens of billions of dollars annually, will be increasingly GPU-accelerated.
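
As promised above, here is a toy illustration of the block-wise (“micro-tensor”) scaling idea behind low-precision inference: each small block of values gets its own scale factor, so quantization error stays low despite the tiny dynamic range of a 4-bit format. This is a conceptual sketch that approximates FP4 with signed integer levels, not Nvidia’s actual Transformer Engine algorithm.

```python
# Toy block-wise ("micro-tensor") scaling to 4-bit levels.
# Conceptual sketch only; approximates FP4 with signed int4 levels.
import numpy as np

def quantize_blockwise_4bit(x: np.ndarray, block: int = 32):
    """Quantize to 16 levels (4 bits) per block, with a per-block scale."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0 + 1e-12  # max maps to level 7
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)     # signed 4-bit range
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(-1)

weights = np.random.randn(1024).astype(np.float32)
q, s = quantize_blockwise_4bit(weights)
print(f"mean abs error: {np.abs(dequantize(q, s) - weights).mean():.4f}")
```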

Nvidia invested $10 billion in the design of the GB200 Superchip, which will be sold for between $30,000 and $40,000 per unit (see, from CNBC, “Nvidia CEO on the next generation of semiconductors and computing” and “Nvidia CEO Jensen Huang goes one-on-one with Jim Cramer”).

New Grace Blackwell-Powered DGX SuperPOD for AI data centers:

  • The Grace Blackwell-powered DGX SuperPOD features eight or more DGX GB200 systems and can scale to tens of thousands of GB200 Superchips connected via Nvidia Quantum InfiniBand; with eight systems, it provides 11.5 exaflops of AI supercomputing at FP4 precision and 240 terabytes of fast memory
  • Each DGX GB200 system features 36 Nvidia GB200 Superchips (36 Grace CPUs and 72 Blackwell GPUs) connected as one supercomputer via fifth-generation Nvidia NVLink (see the quick sanity check after this list)
  • The DGX SuperPOD features intelligent predictive-management capabilities to continuously monitor thousands of data points across hardware and software to predict and intercept sources of downtime and inefficiency
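
Here is a quick sanity check tying those numbers together, assuming the 20-petaflop FP4 figure per Blackwell GPU quoted in the keynote.

```python
# Sanity check of the DGX SuperPOD throughput at FP4 precision.
gpus_per_system = 36 * 2        # 36 GB200 Superchips x 2 Blackwell GPUs each
fp4_per_gpu = 20e15             # 20 petaflops per GPU at FP4 (keynote figure)

per_system = gpus_per_system * fp4_per_gpu  # flops per DGX GB200 system
superpod = 8 * per_system                   # eight systems per SuperPOD
print(f"{per_system / 1e18:.2f} EF per system, {superpod / 1e18:.2f} EF per SuperPOD")
# -> 1.44 EF per system, 11.52 EF per SuperPOD (the quoted 11.5 exaflops)
```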

New Data Center Switches:

New Applications Based on the GB200 and the DGX Cloud:

The Full Keynote on YouTube:



Copyright © 2005-2024 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.