Let’s pretrain a 3B LLM from scratch on 16+ H100 GPUs, no detail skipped.

In this lecture we pretrain a 3B-parameter LLM from scratch across multiple H100 machines, skipping no details. You will learn how to handle OOM (out of memory) errors, how to develop on cheap GPUs before scaling to multi-GPU, and finally how to run multinode training with FSDP and take the model beyond 3B parameters. This is a full lecture with no edits or details skipped; by the end you will have sharpened the skills and intuition needed to pretrain and scale LLMs beyond a simple demo.

The workflow: we start tuning and developing on cheap A10G GPUs, then run on 8 H100 GPUs, and finally scale to 2 machines for a total of 16 H100 GPUs. Developing on cheap hardware first saves a ton in cloud costs. I start at 1B parameters and scale to 3B; to go beyond 3B, apply the same process with more machines. Minimal sketches of the memory-handling and FSDP steps follow the chapter list below.

Chapters:
01:40 Run the Llama template
02:19 Llama template overview
05:00 Run the template on 1 GPU (A10G)
06:20 Monitor GPU memory usage
06:40 Code walkthrough
10:30 How to handle OOM (out of memory) errors
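The OOM-handling pattern the lecture alludes to is the standard one: shrink the per-GPU micro-batch, recover the effective batch size with gradient accumulation, and watch peak GPU memory as you tune. Here is a minimal sketch of that pattern; the toy model, optimizer, loss, and data are illustrative stand-ins, not the lecture's actual template code:

```python
import torch

# Hypothetical stand-ins for the template's model, optimizer, and dataloader.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
dataloader = (torch.randn(4, 1024, device="cuda") for _ in range(32))

# When you hit OOM: shrink the micro-batch (here, 4) and raise this instead,
# so the effective batch size (micro-batch x accumulation steps) stays constant.
grad_accum_steps = 8

for step, batch in enumerate(dataloader):
    # Toy loss; divide by the accumulation steps so accumulated
    # gradients average correctly across micro-batches.
    loss = model(batch).pow(2).mean() / grad_accum_steps
    loss.backward()
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Peak memory since the start of the run; reset between experiments
# with torch.cuda.reset_peak_memory_stats().
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```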
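For the multinode step, the usual PyTorch recipe is to launch one process per GPU with torchrun and wrap the model in FSDP, which shards parameters, gradients, and optimizer state across all ranks. A minimal sketch under those assumptions; the toy model and the rendezvous endpoint are placeholders, not the lecture's actual code:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch on each of the 2 machines (8 GPUs apiece, 16 ranks total):
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#       --rdzv_backend=c10d --rdzv_endpoint=<master_ip>:29500 train.py
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for us.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096)  # stand-in for the 1B-3B Llama model
# FSDP shards params, grads, and optimizer state across all 16 ranks.
model = FSDP(model, device_id=local_rank)

# ... training loop as usual; each rank consumes its own data shard ...

dist.destroy_process_group()
```

Going beyond 3B follows the same shape: adding machines raises --nnodes and the shard count, which is what lets the same per-GPU memory budget hold a larger model.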