The earlier chapters built each component in isolation. This project assembles them into a single codebase that prepares data, trains a model, and generates text. Each stage builds directly on the previous one, so work through them in order.
Setup & Data
Download FineWeb-Edu shards, encode them to token IDs, and build the training batch pipeline.
GPT Architecture
Implement the GPT model and write the forward pass with cross-entropy loss.
Training Mechanics
Train GPT with AdamW, a learning-rate schedule, validation, and checkpointing.
Inference & Generation
Load a trained checkpoint, control the sampling, and generate text.
We will train GPT-2 Small, a ~124M-parameter model, on FineWeb-Edu, a filtered educational web corpus. Training at this scale requires an NVIDIA GPU. If you are on a CPU or Apple Silicon, the architecture chapter includes a lighter configuration with fewer layers and a narrower embedding; no other code changes are needed.
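As a rough sketch of what that looks like, a model's size can be captured in a small config object, and the lighter setup just shrinks a few fields. The `GPTConfig` name and field values below are illustrative assumptions, not the chapter's exact definitions:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max context length
    vocab_size: int = 50257  # GPT-2 BPE vocabulary
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding width

# GPT-2 Small: the defaults above (~124M parameters).
small = GPTConfig()

# A hypothetical lighter configuration for CPU / Apple Silicon:
# fewer layers, a narrower embedding, a shorter context window.
light = GPTConfig(block_size=256, n_layer=4, n_head=4, n_embd=256)
```

Because the rest of the code only reads these fields, swapping configs changes the model size without touching the training or generation scripts.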
- `prepare_data.py` tokenizes FineWeb-Edu shards into `train.bin` and `val.bin`
- `data.py` serves aligned `x`/`y` batches from those files
- `model.py` wraps the Transformer stack into a trainable GPT
- `train.py` runs the training loop and saves checkpoints
- `generate.py` loads a checkpoint and samples continuations from a prompt
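The aligned `x`/`y` batching in `data.py` is the piece that ties the files together: `y` is `x` shifted one token to the right, so each position's target is the next token. A minimal numpy-only sketch, assuming the `.bin` files store raw `uint16` token IDs (the real loader would return framework tensors, and `get_batch` is a hypothetical name):

```python
import numpy as np

def get_batch(path, batch_size, block_size):
    # Memory-map the token file so the full corpus never loads into RAM.
    data = np.memmap(path, dtype=np.uint16, mode="r")
    # Random start offsets, leaving room for block_size + 1 tokens.
    ix = np.random.randint(0, len(data) - block_size, size=batch_size)
    # x is a window of tokens; y is the same window shifted right by one,
    # so y[t] is the prediction target for x[t].
    x = np.stack([data[i : i + block_size] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix]).astype(np.int64)
    return x, y

# Tiny demo: write a fake token file, then draw one batch from it.
tokens = np.arange(1000, dtype=np.uint16)
tokens.tofile("demo.bin")
x, y = get_batch("demo.bin", batch_size=4, block_size=8)
```

With consecutive fake tokens, every row of `y` equals the matching row of `x` plus one, which makes the shift-by-one alignment easy to check by eye.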