This chapter builds the data pipeline that feeds the rest of the project.
prepare_data.py runs once to download raw text, tokenize it, and write the
result to train.bin and val.bin. data.py reads from those files and
serves aligned x / y batches that the training loop will consume.
pip install torch numpy tiktoken pyarrow huggingface_hub
torch and numpy handle model computation and array manipulation.
tiktoken provides GPT-2's tokenizer: 50,257 tokens, the same vocabulary
the original model used. We reuse it rather than train a new one. pyarrow
reads the parquet files the dataset ships as; huggingface_hub downloads
individual files from Hugging Face.
Download the Training Data
The training data comes from FineWeb-Edu, a filtered subset of Common Crawl curated by Hugging Face for educational quality. Andrej Karpathy published a pre-shuffled, 100B-token slice of it called fineweb-edu-100b-shuffle, split into 1,823 parquet shards. That is our dataset.
Each shard contains roughly 53,000 documents and 55 million GPT-2 tokens. To size the download, the Chinchilla paper estimates that a model trains most efficiently on roughly 20 tokens per parameter. For GPT-2 Small's 124M parameters, that works out to about 2.5 billion tokens, or 45 shards. Training beyond that point still improves results, so more shards will give you a stronger model if you have the compute.
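The shard arithmetic above can be sketched directly; both constants are the rough estimates quoted in the text, not exact counts:

```python
# Back-of-the-envelope shard sizing from the Chinchilla ~20 tokens/parameter
# rule of thumb.
tokens_per_param = 20
params = 124_000_000           # GPT-2 Small
tokens_per_shard = 55_000_000  # approximate GPT-2 tokens per parquet shard

target_tokens = tokens_per_param * params
train_shards = round(target_tokens / tokens_per_shard)
print(f"{target_tokens:,} tokens -> {train_shards} training shards")
# 2,480,000,000 tokens -> 45 training shards
```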
A shared config holds the dataset source, shard count, and output paths so every script in the project reads from the same place.
# config.py
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DataConfig:
    dataset_repo: str = "karpathy/fineweb-edu-100b-shuffle"
    num_shards: int = 46  # one extra shard is reserved for validation
    raw_dir: Path = Path("data/raw")
    train_bin: Path = Path("data/tokens/train.bin")
    val_bin: Path = Path("data/tokens/val.bin")

For the lighter model (~30M parameters), Chinchilla scaling gives 20 × 30M = 600M tokens, or about 11 training shards. Set num_shards=12 to include one for validation.
The downloader fetches the first num_shards parquet files from Hugging Face. hf_hub_download caches each file locally, so re-running the script skips shards that are already on disk.
# prepare_data.py
from pathlib import Path

from huggingface_hub import hf_hub_download, list_repo_files

from config import DataConfig

def download_shards(cfg: DataConfig) -> list[Path]:
    files = sorted(
        f for f in list_repo_files(cfg.dataset_repo, repo_type="dataset")
        if f.endswith(".parquet")
    )[: cfg.num_shards]
    cfg.raw_dir.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for filename in files:
        path = hf_hub_download(
            repo_id=cfg.dataset_repo,
            repo_type="dataset",
            filename=filename,
            local_dir=cfg.raw_dir,
        )
        local_paths.append(Path(path))
    return local_paths

After the downloader runs, data/raw/ should look like this:
data/
└── raw/
├── shard_00000.parquet
├── shard_00001.parquet
└── ... (N shards total)
Explore the Raw Documents
Before encoding anything, it helps to look at what the raw documents contain.
Each parquet file stores its rows in small groups, with each row holding a
single document in a text column. By reading one group at a time, we can
stream through the file without loading it all into memory.
# prepare_data.py (continued)
import pyarrow.parquet as pq

def iter_documents(parquet_path: Path):
    pf = pq.ParquetFile(parquet_path)
    for row_group_idx in range(pf.num_row_groups):
        table = pf.read_row_group(row_group_idx, columns=["text"])
        yield from table.column("text").to_pylist()

cfg = DataConfig()
shard_path = sorted(cfg.raw_dir.glob("*.parquet"))[0]
docs = iter_documents(shard_path)
sample = next(docs)
print(sample[:500])
print("chars:", len(sample))

Running this should produce something like:
Shipment & Transport-Sea, Air, Rail, Road, Pipeline
The mode of transportation is an important consideration when planning the
shipment process. Besides the costs, the urgency of the shipment, the value
of the goods being shipped as well as the size and weight of the goods need
to be evaluated when determining the form of transportation.
Seaborne trade accounts for about 90% of the global trade, and as per
UNCTAD, 1687 million tons (2015 estimate) were carried in around 177.6
million containers
chars: 8657

Now that we know what the raw data looks like, the next step is to convert these strings into token IDs for training.
Tokenize and Store
GPT-2's tokenizer (loaded via tiktoken) converts every document into a
sequence of token IDs. The full result is written to train.bin and
val.bin. Because encoding billions of tokens is CPU-bound and the result
never changes, we run tokenization as a separate preprocessing step rather
than doing it on the fly during training.
train.bin and val.bin are flat uint16 sequences of token IDs with no
headers, no padding, and no metadata. uint16 is wide enough to store every
token ID, since GPT-2's vocabulary of 50,257 tokens fits in 16 bits. Inside
each file, documents are laid out back to back as one continuous stream
of token IDs, with an end-of-text token after each one to mark the boundary.
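A tiny round-trip makes that layout concrete. The sketch below uses hand-picked toy IDs so it runs without the tokenizer; 50256 is GPT-2's actual <|endoftext|> token ID.

```python
# Write a few token IDs in the flat-uint16 layout, then read them back.
import os
import tempfile

import numpy as np

ids = [15496, 11, 995]   # toy token IDs; any values below 2**16 work
eot = 50256              # GPT-2's <|endoftext|> token ID

path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(np.array(ids + [eot], dtype=np.uint16).tobytes())

back = np.memmap(path, dtype=np.uint16, mode="r")
print(back.tolist())  # [15496, 11, 995, 50256]
```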
# prepare_data.py (continued)
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token

def write_split(shards: list[Path], out_path: Path):
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as f:
        for shard in shards:
            for text in iter_documents(shard):
                ids = enc.encode_ordinary(text)
                arr = np.array(ids + [eot], dtype=np.uint16)
                f.write(arr.tobytes())

cfg = DataConfig()
shard_paths = download_shards(cfg)
train_shards = shard_paths[:-1]
val_shards = shard_paths[-1:]
write_split(train_shards, cfg.train_bin)
write_split(val_shards, cfg.val_bin)

encode_ordinary tokenizes the text without processing special tokens,
so eot never appears in the output unless we add it ourselves. We append
it once after each document to mark the boundary in the token stream.
After both files are written, the project directory looks like this:
data/
├── raw/ (parquet shards, safe to delete after tokenizing)
└── tokens/
├── train.bin
└── val.bin

The Batch Pipeline
Now that train.bin and val.bin are on disk, the final piece is a
function that serves training batches from them. For each training
example, the function picks a random starting index in the file, reads
seq_len tokens as the input x, and reads another seq_len tokens
starting one position forward as the target y.
With seq_len = 4, a training example drawn from a token stream t0, t1, t2, ... starting at index 0 would look like this:

x = [t0, t1, t2, t3]
y = [t1, t2, t3, t4]

Each position in y holds the token that follows the corresponding position in x, which is exactly the next-token prediction target.
data.py opens both token files with np.memmap, which treats each one
as a NumPy array without loading it into memory. When the training loop
calls get_batch, it slices directly into that array and reads only the
tokens it needs into memory.
# data.py
import numpy as np
import torch

from config import DataConfig

cfg = DataConfig()
train_data = np.memmap(cfg.train_bin, dtype=np.uint16, mode="r")
val_data = np.memmap(cfg.val_bin, dtype=np.uint16, mode="r")

def get_batch(split: str, batch_size: int, seq_len: int, device: str):
    if split == "train":
        data = train_data
    elif split == "val":
        data = val_data
    else:
        raise ValueError(f"unknown split: {split}")
    starts = torch.randint(0, len(data) - seq_len, (batch_size,))
    x = torch.stack([
        torch.from_numpy(np.array(data[i : i + seq_len], dtype=np.int64))
        for i in starts.tolist()
    ])
    y = torch.stack([
        torch.from_numpy(np.array(data[i + 1 : i + 1 + seq_len], dtype=np.int64))
        for i in starts.tolist()
    ])
    return x.to(device), y.to(device)

# quick sanity check
x, y = get_batch("train", batch_size=4, seq_len=8, device="cpu")
print(x.shape) # (4, 8)
print(y.shape) # (4, 8)
print(x[0])
print(y[0])
import tiktoken
enc = tiktoken.get_encoding("gpt2")
print(enc.decode(x[0].tolist()))
print(enc.decode(y[0].tolist()))

If get_batch() is correct, decoding both tensors should show y as
x shifted forward by one token.
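That shift can also be verified without the .bin files. This sketch mirrors get_batch's slicing on a synthetic token stream, using NumPy only:

```python
# Standalone check of the x/y offset logic, mirroring get_batch's slicing
# on a synthetic token stream instead of a memmapped train.bin.
import numpy as np

data = np.arange(1000, dtype=np.uint16)  # stand-in for a memmapped token file
seq_len = 8
rng = np.random.default_rng(0)
starts = rng.integers(0, len(data) - seq_len, size=4)

x = np.stack([data[i : i + seq_len].astype(np.int64) for i in starts])
y = np.stack([data[i + 1 : i + 1 + seq_len].astype(np.int64) for i in starts])

# Within every example, position t of y equals position t + 1 of x.
assert (y[:, :-1] == x[:, 1:]).all()
print("x/y alignment ok")
```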
prepare_data.py can now download the first N shuffled shards and write
train.bin and val.bin, and data.py reads from those files and returns
aligned x / y batches. That completes the data pipeline. In the next
chapter, we build the model that trains on these batches.