HKUST-COMP4551-2026spring

Large-Scale Machine Learning for Foundation Models

Teaching Assistants: Xu Xu and Jiayi Cheng

Overview

In recent years, foundation models have fundamentally changed the state of the art in artificial intelligence. As a result, the computation involved in training and serving these models is arguably one of the most important workloads running on modern computer systems. This course examines how to deploy such workloads efficiently from a systems perspective. Specifically, we will i) explain how a modern machine learning system (i.e., PyTorch) works; ii) analyze the performance bottlenecks of machine learning computation on modern hardware (e.g., Nvidia GPUs); iii) discuss the four main parallel strategies in foundation model training (data, pipeline, tensor model, and optimizer parallelism); and iv) cover the real-world deployment of foundation models, including efficient inference and fine-tuning.

Syllabus

Date	Topic
W1 - 02/03,02/05	- Introduction and Logistics [Slides] - ML Preliminary [Slides]
W2 - 02/10,02/12	- Stochastic Gradient Descent [Slides] - Automatic Differentiation [Slides]
W3 - 02/17,02/19	- Spring Festival
W4 - 02/24,02/26	- Language Model Architecture [Slides] - Large-Scale Pretraining Overview [Slides]
W5 - 03/03,03/05	- Nvidia GPU Performance [Slides] - Collective Communication Library [Slides]
W6 - 03/10,03/12	- Data-, Pipeline- Parallel Training [Slides] - Tensor Model-, Optimizer- Parallel Training [Slides]
W7 - 03/17,03/19	- Sequence-, MoE- Parallelism [Slides] - Mid-Term Review [Slides]
W8 - 03/24,03/26	- Mid-Term Exam ✔️ - Generative Inference [Slides]
W9 - 03/31,04/02	- Inference Algorithm Optimizations [Slides] - Inference System Optimizations [Slides]
W10 - 04/07,04/09	- Spring Break - Prompt Engineering [Slides]
W11 - 04/14,04/16	- Inference Time Scaling [Slides] - Retrieval Augmented Generation [Slides]
W12 - 04/21,04/23	- LLM Agent - Parameter Efficient Fine-Tuning
W13 - 04/28,04/30	- RL Alignment - LLM Evaluation
W14 - 05/05,05/07	- Guest Speech - Final Review

Grading Policy

4 homework assignments (4 $\times$ 5% $=$ 20%);
Mid-term exam (30%);
Final exam (50%).
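
As a quick illustration of how the weights above combine, here is a minimal sketch. It assumes every component is scored on a 0-100 scale; the function name `course_score` and the sample scores are hypothetical, and the sketch says nothing about rounding, late penalties, or letter-grade cutoffs, which are not specified here.

```python
def course_score(homeworks, midterm, final):
    """Combine component scores (each assumed to be on a 0-100 scale)."""
    if len(homeworks) != 4:
        raise ValueError("expected exactly 4 homework scores")
    # Weights in percent: 5 per homework (20 total), 30 midterm, 50 final.
    weighted = 5 * sum(homeworks) + 30 * midterm + 50 * final
    return weighted / 100


# Hypothetical scores, for illustration only.
print(course_score([90, 80, 100, 70], midterm=85, final=88))
```

With these made-up inputs the homework contributes 17 points, the midterm 25.5, and the final 44, so the final exam dominates the result.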

Homework

Assignment	Release	Due
Homework 1	2026/02/22	2026/03/04
Homework 2	2026/03/07	2026/03/18
Homework 3	2026/04/11	2026/04/22
Homework 4	2026/04/24	2026/05/08
