DeepSeek-V3-Lite
Full reimplementation of the DeepSeek-V3 architecture from scratch — a 27-layer transformer with 1 dense block and 26 Mixture-of-Experts blocks (2B effective parameters; 64 routed + 2 shared experts, top-6 activated per token). Implements Multi-Head Latent Attention (MLA) with 10–20× KV-cache compression, custom FP8 Triton kernels, Multi-Token Prediction, and a complete post-training pipeline.
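The MoE routing above can be sketched in plain Python — a minimal top-k gating function (function name and details are illustrative, not the repo's actual API): softmax over the routed-expert logits, keep the top 6, renormalize their weights, and always activate the shared experts.

```python
import math

def route_token(logits, top_k=6, n_shared=2):
    """Illustrative top-k gating for one token: softmax over routed-expert
    logits, keep the top_k experts, renormalize; shared experts bypass routing."""
    # numerically stable softmax over the 64 routed-expert logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # indices of the top_k highest-probability experts
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    gates = {i: probs[i] / norm for i in top}  # renormalized gate weights
    shared = list(range(n_shared))             # shared experts are always on
    return gates, shared

gates, shared = route_token([0.1 * i for i in range(64)])
assert len(gates) == 6 and abs(sum(gates.values()) - 1.0) < 1e-9
```

In the real model the gate weights scale each selected expert's FFN output before summation; the shared experts contribute unconditionally.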
- MLA with decoupled RoPE (YaRN), weight-absorption trick, 10–20× KV-cache compression
- FP8 E4M3FN training — custom Triton kernels for block-wise quant, GEMM, dequant
- Full post-training: SFT, GRPO (group_size=8), R1 distillation, speculative decoding
- FSDP distributed training across 8× RTX 5090 GPUs with safetensors checkpointing
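The 10–20× KV-cache figure follows from the MLA layout: standard attention caches full per-head keys and values, while MLA caches one compressed latent plus the small decoupled RoPE key. A back-of-envelope calculation (the dimensions below are illustrative, not this repo's actual config):

```python
def mla_cache_ratio(n_heads, head_dim, d_latent, d_rope):
    """KV-cache compression from MLA: standard attention stores full K and V
    (2 * n_heads * head_dim values per token per layer); MLA stores a single
    compressed KV latent plus the decoupled RoPE key (d_latent + d_rope)."""
    full = 2 * n_heads * head_dim
    compressed = d_latent + d_rope
    return full / compressed

# Hypothetical dimensions chosen only to illustrate the arithmetic:
ratio = mla_cache_ratio(n_heads=16, head_dim=128, d_latent=256, d_rope=64)
assert 10 <= ratio <= 20  # lands in the 10-20x range cited above
```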
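Block-wise FP8 quantization works by scaling each block of values so its absolute maximum fits the E4M3FN dynamic range (max ≈ 448), storing one scale per block for dequantization. A minimal sketch of the scaling step in pure Python (the actual Triton kernels also perform the 8-bit rounding, which is omitted here; function names are illustrative):

```python
def quantize_blockwise(x, block=128, fp8_max=448.0):
    """Per-block absmax scaling toward the FP8 E4M3FN range (max ~448).
    Returns scaled values and one scale per block for dequantization.
    Note: real kernels also round to 8-bit; this sketch only scales."""
    scales, q = [], []
    for start in range(0, len(x), block):
        blk = x[start:start + block]
        amax = max(abs(v) for v in blk) or 1.0  # avoid divide-by-zero
        s = amax / fp8_max                      # scale so the block fits FP8
        scales.append(s)
        q.extend(v / s for v in blk)            # values now lie in [-448, 448]
    return q, scales

def dequantize_blockwise(q, scales, block=128):
    """Invert the per-block scaling using the stored scales."""
    return [v * scales[i // block] for i, v in enumerate(q)]
```

Keeping one scale per block (rather than per tensor) limits the damage from outlier values, which is why block-wise schemes are standard for FP8 training.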
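GRPO with group_size=8 samples 8 completions per prompt and replaces a learned value baseline with group-relative advantages: each reward is normalized by its group's mean and standard deviation. A minimal sketch of that advantage computation (function name is illustrative):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: normalize each completion's
    reward by the mean and std of its sampling group (here, len(rewards)=8)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One group of 8 sampled completions with binary rewards:
adv = grpo_advantages([1, 0, 1, 1, 0, 0, 1, 1])
assert abs(sum(adv)) < 1e-6  # advantages are zero-mean within the group
```

The resulting advantages weight the policy-gradient update, so no separate critic network is needed.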