This chapter builds the data pipeline that feeds the rest of the project.
prepare_data.py runs once to download raw text, tokenize it, and write the
result to train.bin and val.bin. data.py reads from those files and
serves aligned x / y batches that the training loop will consume.
pip install torch numpy tiktoken pyarrow huggingface_hub
torch and numpy handle model computation and array manipulation.
tiktoken provides GPT-2's tokenizer: 50,257 tokens, the same vocabulary
the original model used. We reuse it rather than train a new one. pyarrow
reads the parquet files the dataset ships as; huggingface_hub downloads
individual files from Hugging Face.
Download the Training Data
The training data comes from FineWeb-Edu, a filtered subset of Common Crawl curated by Hugging Face for educational quality. Andrej Karpathy published a pre-shuffled, 100B-token slice of it called fineweb-edu-100b-shuffle, split into 1,823 parquet shards. That is our dataset.
Each shard contains roughly 53,000 documents and 55 million GPT-2 tokens. To size the download, the Chinchilla paper estimates that a model trains most efficiently on roughly 20 tokens per parameter. For GPT-2 Small's 124M parameters, that works out to about 2.5 billion tokens, or 45 shards. Training beyond that point still improves results, so more shards will give you a stronger model if you have the compute.
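The shard arithmetic above can be sketched directly; both constants are the rough estimates quoted in the text, not exact counts:

```python
# Back-of-the-envelope shard sizing from the Chinchilla ~20 tokens/parameter
# rule of thumb.
tokens_per_param = 20
params = 124_000_000           # GPT-2 Small
tokens_per_shard = 55_000_000  # approximate GPT-2 tokens per parquet shard

target_tokens = tokens_per_param * params
train_shards = round(target_tokens / tokens_per_shard)
print(f"{target_tokens:,} tokens -> {train_shards} training shards")
# 2,480,000,000 tokens -> 45 training shards
```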
A shared config holds the dataset source, shard count, and output paths so every script in the project reads from the same place.
# config.py
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DataConfig:
    dataset_repo: str = "karpathy/fineweb-edu-100b-shuffle"
    num_shards: int = 46  # one extra shard is reserved for validation
    raw_dir: Path = Path("data/raw")
    train_bin: Path = Path("data/tokens/train.bin")
    val_bin: Path = Path("data/tokens/val.bin")

For the lighter model (~30M parameters), Chinchilla scaling gives 20 × 30M = 600M tokens, or about 11 training shards. Set num_shards=12 to include one for validation.
The downloader fetches the first num_shards parquet files from Hugging Face. hf_hub_download caches each file locally, so re-running the script skips shards that are already on disk.
# prepare_data.py
from pathlib import Path

from huggingface_hub import hf_hub_download, list_repo_files

from config import DataConfig

def download_shards(cfg: DataConfig) -> list[Path]:
    files = sorted(
        f for f in list_repo_files(cfg.dataset_repo, repo_type="dataset")
        if f.endswith(".parquet")
    )[: cfg.num_shards]
    cfg.raw_dir.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for filename in files:
        path = hf_hub_download(
            repo_id=cfg.dataset_repo,
            repo_type="dataset",
            filename=filename,
            local_dir=cfg.raw_dir,
        )
        local_paths.append(Path(path))
    return local_paths

After the downloader runs, data/raw/ should look like this:
data/
└── raw/
├── shard_00000.parquet
├── shard_00001.parquet
└── ... (N shards total)
Explore the Raw Documents
Before encoding anything, it helps to look at what the raw documents contain.
Each parquet file stores its rows in small groups, with each row holding a
single document in a text column. By reading one group at a time, we can
stream through the file without loading it all into memory.
# prepare_data.py (continued)
import pyarrow.parquet as pq

def iter_documents(parquet_path: Path):
    pf = pq.ParquetFile(parquet_path)
    for row_group_idx in range(pf.num_row_groups):
        table = pf.read_row_group(row_group_idx, columns=["text"])
        yield from table.column("text").to_pylist()

cfg = DataConfig()
shard_path = sorted(cfg.raw_dir.glob("*.parquet"))[0]
docs = iter_documents(shard_path)
sample = next(docs)
print(sample[:500])
print("chars:", len(sample))

Running this should produce something like:
Shipment & Transport-Sea, Air, Rail, Road, Pipeline
The mode of transportation is an important consideration when planning the
shipment process. Besides the costs, the urgency of the shipment, the value
of the goods being shipped as well as the size and weight of the goods need
to be evaluated when determining the form of transportation.
Seaborne trade accounts for about 90% of the global trade, and as per
UNCTAD, 1687 million tons (2015 estimate) were carried in around 177.6
million containers
chars: 8657

Now that we know what the raw data looks like, the next step is to convert these strings into token IDs for training.
Tokenize and Store
GPT-2's tokenizer (loaded via tiktoken) converts every document into a
sequence of token IDs. The full result is written to train.bin and
val.bin. Because encoding billions of tokens is CPU-bound and the result
never changes, we run tokenization as a separate preprocessing step rather
than doing it on the fly during training.
train.bin and val.bin are flat uint16 sequences of token IDs with no
headers, no padding, and no metadata. uint16 is wide enough to store every
token ID, since GPT-2's vocabulary of 50,257 tokens fits in 16 bits. Inside
each file, documents are laid out back to back as one continuous stream
of token IDs, with an end-of-text token after each one to mark the boundary.
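A tiny round-trip makes that layout concrete. The sketch below uses hand-picked toy IDs so it runs without the tokenizer; 50256 is GPT-2's actual <|endoftext|> token ID.

```python
# Write a few token IDs in the flat-uint16 layout, then read them back.
import os
import tempfile

import numpy as np

ids = [15496, 11, 995]   # toy token IDs; any values below 2**16 work
eot = 50256              # GPT-2's <|endoftext|> token ID

path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(np.array(ids + [eot], dtype=np.uint16).tobytes())

back = np.memmap(path, dtype=np.uint16, mode="r")
print(back.tolist())  # [15496, 11, 995, 50256]
```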
# prepare_data.py (continued)
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token

def write_split(shards: list[Path], out_path: Path):
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as f:
        for shard in shards:
            for text in iter_documents(shard):
                ids = enc.encode_ordinary(text)
                arr = np.array(ids + [eot], dtype=np.uint16)
                f.write(arr.tobytes())

cfg = DataConfig()
shard_paths = download_shards(cfg)
train_shards = shard_paths[:-1]
val_shards = shard_paths[-1:]
write_split(train_shards, cfg.train_bin)
write_split(val_shards, cfg.val_bin)

encode_ordinary tokenizes the text without processing special tokens,
so eot never appears in the output unless we add it ourselves. We append
it once after each document to mark the boundary in the token stream.
After both files are written, the project directory looks like this:
data/
├── raw/ (parquet shards, safe to delete after tokenizing)
└── tokens/
├── train.bin
└── val.bin

The Batch Pipeline
Now that train.bin and val.bin are on disk, the final piece is a
function that serves training batches from them. For each training
example, the function picks a random starting index in the file, reads
seq_len tokens as the input x, and reads another seq_len tokens
starting one position forward as the target y.
With seq_len = 4, a training example drawn from a token stream t0, t1, t2, ... starting at index 0 would look like this:

x = [t0, t1, t2, t3]
y = [t1, t2, t3, t4]

Each position in y holds the token that follows the corresponding position in x, which is exactly the next-token prediction target.
data.py opens both token files with np.memmap, which treats each one
as a NumPy array without loading it into memory. When the training loop
calls get_batch, it slices directly into that array and reads only the
tokens it needs into memory.
# data.py
import numpy as np
import torch

from config import DataConfig

cfg = DataConfig()
train_data = np.memmap(cfg.train_bin, dtype=np.uint16, mode="r")
val_data = np.memmap(cfg.val_bin, dtype=np.uint16, mode="r")

def get_batch(split: str, batch_size: int, seq_len: int, device: str):
    if split == "train":
        data = train_data
    elif split == "val":
        data = val_data
    else:
        raise ValueError(f"unknown split: {split}")
    starts = torch.randint(0, len(data) - seq_len, (batch_size,))
    x = torch.stack([
        torch.from_numpy(np.array(data[i : i + seq_len], dtype=np.int64))
        for i in starts.tolist()
    ])
    y = torch.stack([
        torch.from_numpy(np.array(data[i + 1 : i + 1 + seq_len], dtype=np.int64))
        for i in starts.tolist()
    ])
    return x.to(device), y.to(device)

# quick sanity check
x, y = get_batch("train", batch_size=4, seq_len=8, device="cpu")
print(x.shape) # (4, 8)
print(y.shape) # (4, 8)
print(x[0])
print(y[0])
import tiktoken
enc = tiktoken.get_encoding("gpt2")
print(enc.decode(x[0].tolist()))
print(enc.decode(y[0].tolist()))

If get_batch() is correct, decoding both tensors should show y as
x shifted forward by one token.
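That shift can also be verified without the .bin files. This sketch mirrors get_batch's slicing on a synthetic token stream, using NumPy only:

```python
# Standalone check of the x/y offset logic, mirroring get_batch's slicing
# on a synthetic token stream instead of a memmapped train.bin.
import numpy as np

data = np.arange(1000, dtype=np.uint16)  # stand-in for a memmapped token file
seq_len = 8
rng = np.random.default_rng(0)
starts = rng.integers(0, len(data) - seq_len, size=4)

x = np.stack([data[i : i + seq_len].astype(np.int64) for i in starts])
y = np.stack([data[i + 1 : i + 1 + seq_len].astype(np.int64) for i in starts])

# Within every example, position t of y equals position t + 1 of x.
assert (y[:, :-1] == x[:, 1:]).all()
print("x/y alignment ok")
```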
prepare_data.py can now download the first N shuffled shards and write
train.bin and val.bin, and data.py reads from those files and returns
aligned x / y batches. That completes the data pipeline. In the next
chapter, we build the model that trains on these batches.