Module 10
Transformers & LLMs
Submodule 01: Attention Mechanisms
- 1. Self-Attention Mechanism (Med.)
- 2. Multi-Head Attention (Hard)
- 3. Masked Self-Attention (Med.)
- 4. Positional Encoding Calculator (Hard)
- 5. Efficient Sparse Window Attention (Med.)
- 6. Grouped Query Attention (GQA) (Hard)
- 7. Build Scaled Dot-Product Attention (Med.)
- 8. Build a Transformer Encoder Layer (Hard)
- 9. Gated Attention (Hard)
- 10. Multiquery Attention (MQA) (Hard)
- 11. Rotary Positional Embeddings (RoPE) (Hard)
- 12. Sliding Window Attention (Med.)
- 13. Learned Positional Embeddings (concept, Easy)
- 14. NoPE (No Positional Embedding) with iRoPE Attention (Hard)
- 15. Attention Sink Detection (concept, Med.)
- 16. Inference Head Pruning for Transformers (concept, Easy)
Submodule 02: LLM Inference & Memory Systems
- 1. Building a Virtual Memory System for KV Cache (Hard)
- 2. Attention Memory Traffic and FLOPs (Hard)
- 3. Estimate KV Cache Size from Model Config (Med.)
- 4. Flash Attention v1 - Forward Pass (Hard)
- 5. Implementing PagedAttention: Block-wise Attention Computation (Hard)
- 6. KV Cache Memory Budget and Eviction Policy (Hard)
- 7. KV Cache Tiered Offloading Simulator (Hard)
- 8. KV Cache for Efficient Autoregressive Attention (Med.)
- 9. Continuous Batching (In-Flight Batching) Simulator (Hard)
- 10. Continuous Batching vs Static Batching Throughput Comparison (Med.)
- 11. Chunked Prefill Scheduling Alongside Decode (Hard)
- 12. Disaggregated Prefill-Decode Serving Simulator (Hard)
- 13. TTFT ITL and TPS from a Token Timestamp Stream (concept, Easy)
- 14. Prefix Cache Hit Rate Calculator (concept, Med.)
- 15. Preemption Strategies: Swapping vs Recomputation (concept, Med.)
- 16. Request Batching for Inference (concept, Med.)
- 17. Draft-Target Speculative Decoding Simulation (concept, Med.)
- 18. Speculative Decoding Verification (concept, Hard)
- 19. Speculative Decoding Acceptance Rate vs Temperature (concept, Med.)
- 20. Speculative Decoding End-to-End Simulation (concept, Hard)
- 21. Beam Search with Memory-Efficient Block Sharing (concept, Med.)
- 22. N-gram Speculation Dictionary Construction and Lookup (concept, Med.)
- 23. EAGLE-Style Draft Model from Hidden States (concept, Med.)
Submodule 03: MoE, Compression & Scaling
- 1. A Sparse Mixture of Experts Layer (Hard)
- 2. Computational Efficiency of MoE (Easy)
- 3. The Noisy Top-K Gating Function (Med.)
- 4. Mixture of Experts Load Balancing Loss (Hard)
- 5. Expert Parallelism Token Routing and Communication Cost (Hard)
- 6. Sigmoid MoE Router with Bias Correction (Hard)
- 7. INT8 Quantization (Hard)
- 8. Post-Training Quantization with Per-Channel Scale Factors (Hard)
- 9. Block-wise FP8 Quantization (Hard)
- 10. FP4 Quantization with Microscaling (MXFP4) (Hard)
- 11. Embedding Quantization Quality via Cosine Similarity (Med.)
- 12. Quantization Quality Check via Perplexity Delta (Hard)
- 13. Break-Even Pay-Per-Token API vs Dedicated GPU (Med.)
- 14. Computing Optimal Model Size with Scaling Laws (Med.)
- 15. MoE with Shared Expert Forward Pass (concept, Med.)
- 16. Sparse MoE Top-K Routing (concept, Med.)
- 17. LoRA: Low-Rank Adaptation Forward Pass (concept, Med.)
- 18. QLoRA: Quantized Low-Rank Adaptation Forward Pass (concept, Med.)