Pre-training a GPT-2 AI model on 32 GB of data. I made it both draw and recognise sketches!
Hello everyone!
In this post I will tell you how I trained a GPT-2 "Small" AI model from scratch on a huge dataset of around 32 GB of raw text files, using PyTorch and HuggingFace's Transformers library. I will explain the steps of such a project, tell you how it went for me and how I would recommend doing these things now, and of course I will show you the code that pre-trains the AI model from a dataset too.
Let's start from the beginning:
1. The idea, purpose and raw dataset
First and foremost, a project like this needs an idea and should have a purpose. I pre-trained (trained from scratch) the AI model just to learn more about the process. Pre-training LLMs of that scale is rarely done and is usually pointless, as there are a lot of base models already trained on specific languages that can simply be fine-tuned, for example into a specific response format.
But let's assume that we want to pre-train such a model anyway, for example as an academic experiment, or to make the model truly ours, or because we want something that at its core works on something other than a human or programming language (like the model I pre-trained). The challenge is gathering data to pre-train on, and to train something coherent we need at least tens of gigabytes of it.
Raw text datasets are rare; after all, there are a lot of LLMs already trained on existing languages that can simply be fine-tuned if someone needs them to behave in a specific way. So if you wanted to pre-train your own LLM for a language, you would have to create the dataset yourself, for example by fetching texts from websites. I recommend doing such work on a VPS, because running data crawlers at that scale on a private computer at home is a bad idea: you not only risk fetching something malicious, but your IP can also get blocked, which makes it problematic to access those websites normally later. In any case, if you fetch online texts (or any other kind of data), please always respect the terms of use of both the data and the services that host it.
In my case, I decided to try something original and train GPT-2 to draw simple sketches by writing 1s and 0s. To do that, I decided to use the dataset from the "Quick, Draw!" online game created by Google.
2. Data sanitization and preparation
Let's assume that we have raw data in any form. Now we have to prepare it for training.
We must first clean the texts of unnecessary elements, such as people's personal information, URLs, emojis, hashtags and so on. Then we must write them to text files; it is recommended that each text file be 100-500 MB in size. The best format for GPT-2 is to separate different parts of the dataset with at least 2 or 3 newlines. You may also want to add special tags, like <|msgstart|> or <|msgend|>. An example dataset text file would then look like this:
<bot> This is first conversation in the dataset </bot>
<user> Yes, this is first conversation in the dataset </user>
<bot> This is second talk in the dataset </bot>
<user> Yes, this is second conversation in the dataset, and we discuss something else here. </user>
If you were writing conversations, they of course should be longer. Also, LLMs are usually first trained on general text data to understand the language and its grammar, and then fine-tuned into a format such as a conversation.
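The cleaning step described above can be sketched roughly like this. It is only a minimal illustration, not the program I used: the regular expressions are simplified assumptions (the emoji ranges in particular are not exhaustive), and the function names `sanitize` and `join_records` are made up for this example.

```python
import re

# Deliberately simple patterns; real-world cleaning usually needs more rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
HASHTAG_RE = re.compile(r"#\w+")
# A few common emoji/pictograph ranges (an assumption, not a complete list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def sanitize(text: str) -> str:
    """Strip URLs, e-mail addresses, hashtags and emojis, then tidy spaces."""
    for pattern in (URL_RE, EMAIL_RE, HASHTAG_RE, EMOJI_RE):
        text = pattern.sub("", text)
    return re.sub(r"[ \t]+", " ", text).strip()


def join_records(records: list[str]) -> str:
    """Clean each record and separate the non-empty ones with three newlines."""
    cleaned = [sanitize(r) for r in records]
    return "\n\n\n".join(c for c in cleaned if c) + "\n"
```

The result of `join_records` is what would be written to one of the 100-500 MB text files.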
If you write the program for sanitizing and preparing the training files in something fast like C++, make it work in parallel, and run it on a computer with a decent CPU and SSD, it should process around 30 GB of text data in 2 hours at worst. I wrote my program in Python and ran it on 6 cores of my laptop's Intel Core i7 (8 cores, 2.8 GHz, from 2017, OS: Ubuntu, unknown 500 GB SSD), and it finished in around 4 hours.
In my case, the data from Google was in NDJSON files which, of the data that mattered to me, contained the names of the drawings and the drawings themselves in vector format. My program converted each drawing to a 24x24 text grid where 1 was white and 0 was black (24*24=576, so even if every single 0 and 1 became a separate token, that is around half of GPT-2's 1024-token window). I also included special tokens that mark the start and end of an image and of a message, and the name of the object before and after the image. I got around 32 GB of raw text for pre-training this way.
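A conversion like this could be sketched as below. This is not my original program, just a minimal illustration. It assumes the simplified "Quick, Draw!" format, where each NDJSON line has a "word" field and a "drawing" field holding strokes as [[x0, x1, ...], [y0, y1, ...]] with coordinates from 0 to 255. The names `rasterize` and `to_record` are made up here, and lines are drawn one cell wide.

```python
import json

SIZE = 24  # the output grid is SIZE x SIZE characters


def rasterize(strokes, size=SIZE, src_max=255):
    """Draw each stroke as '0' (black) on a grid of '1' (white)."""
    grid = [["1"] * size for _ in range(size)]
    scale = (size - 1) / src_max

    def to_cell(v):
        # Scale a source coordinate to a grid index and clamp it to the grid.
        return min(size - 1, max(0, round(v * scale)))

    for xs, ys in strokes:
        pts = [(to_cell(x), to_cell(y)) for x, y in zip(xs, ys)]
        if len(pts) == 1:
            pts = pts * 2  # a single point becomes a zero-length segment
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            # Walk along the segment in enough steps to touch every cell.
            steps = max(abs(x1 - x0), abs(y1 - y0), 1)
            for s in range(steps + 1):
                x = x0 + round((x1 - x0) * s / steps)
                y = y0 + round((y1 - y0) * s / steps)
                grid[y][x] = "0"
    return ["".join(row) for row in grid]


def to_record(ndjson_line):
    """Turn one NDJSON line into a text record with the special tokens."""
    item = json.loads(ndjson_line)
    name = item["word"]
    rows = rasterize(item["drawing"])
    return "\n".join(
        ["<MSGSTART>", name, "<IMGSTART>", *rows, "<IMGEND>", name, "<MSGEND>"]
    )
```

Each record is 30 lines long: three lines before the image, 24 rows of pixels, and three lines after it.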
This is an example of an apple in my format:
<MSGSTART>
apple
<IMGSTART>
111111111111111111111111
111111111111111111111111
111111111111111111111111
111111111111111111111111
111111111100111111111111
111111111100111111111111
111111111100111111111111
111110000000011111111111
111100000000000111111111
111000111111100000011111
110001111111111111000111
110011111111111111110001
110011111111111111111001
110011111111111111111001
111011111111111111110001
111001111111111111100011
111000111111111111100111
111110001111111111000111
111111000111111110001111
111111111000000000011111
111111111110000000111111
111111111111111111111111
111111111111111111111111
111111111111111111111111
<IMGEND>
apple
<MSGEND>
I hoped that thanks to this GPT-2 would be able to both generate and recognise images. For example, if you typed:
apple
<IMGSTART>
it would generate the image, and if you pasted an image followed by <IMGEND>, it would generate the name of what is in the "picture".
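The two ways of prompting described above can be expressed as small helper functions. This is a sketch with made-up names (`draw_prompt`, `recognise_prompt`, `extract_image`, `extract_name`), and it follows the prompts exactly as written above. Whether it helps to also put <MSGSTART> in front is something you would have to try.

```python
def draw_prompt(name: str) -> str:
    """Prompt that asks the model to continue with an image of `name`."""
    return f"{name}\n<IMGSTART>\n"


def recognise_prompt(rows: list[str]) -> str:
    """Prompt that ends right after the image, so the name should come next."""
    return "<IMGSTART>\n" + "\n".join(rows) + "\n<IMGEND>\n"


def extract_image(generated: str) -> list[str]:
    """Return the rows of 0s and 1s between <IMGSTART> and <IMGEND>."""
    body = generated.split("<IMGSTART>", 1)[-1].split("<IMGEND>", 1)[0]
    rows = [line.strip() for line in body.splitlines()]
    return [row for row in rows if row and set(row) <= {"0", "1"}]


def extract_name(generated: str) -> str:
    """Return the first non-empty line after <IMGEND>, or '' if there is none."""
    if "<IMGEND>" not in generated:
        return ""
    tail = generated.split("<IMGEND>", 1)[1].split("<MSGEND>", 1)[0]
    lines = [line.strip() for line in tail.splitlines() if line.strip()]
    return lines[0] if lines else ""
```

The first two functions build the text you feed to the model, and the last two pull the useful part out of whatever text the model returns.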
3. Tokenization and training
LLMs operate on tokens, so we must run a tokenizer to split the training data into them. GPT-2 uses byte-level BPE, which the tokenizers library provides as ByteLevelBPETokenizer. First, we create and train the tokenizer with this code:
import glob
import os
from tokenizers import ByteLevelBPETokenizer  # pip install tokenizers
# Collect every training text file; adjust the pattern to where your dataset lives.
files = glob.glob("data/*.txt")
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=50257,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # Add other special tokens that you use here
)
# Make sure the output directory exists before saving.
os.makedirs("model/tokenizer", exist_ok=True)
tokenizer.save_model("model/tokenizer")
This code will create two small files: vocab.json and merges.txt. They are needed later for encoding text into tokens and decoding tokens back into text. The tokenizer runs entirely on the CPU and the Python library automatically uses all its cores for performance. On the same laptop, running this took me around 30 minutes.
The next steps are splitting the whole dataset into tokens, and then packing them into 1024-token-long blocks (for much faster training). While that also happens only on the CPU, it requires much more computing power and needed 1 TB of total disk space to work (and in the end created around 200 GB of data).
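To show what packing means, here is a tiny sketch with a block size of 4 instead of 1024. The function name `pack_into_blocks` is made up for this illustration. The idea is simply to glue all token sequences together and cut the result into equal pieces, dropping the incomplete tail.

```python
def pack_into_blocks(sequences, block_size):
    """Concatenate token-id sequences and cut them into equal-length blocks."""
    flat = [tok for seq in sequences for tok in seq]
    usable = (len(flat) // block_size) * block_size  # drop the incomplete tail
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_into_blocks(docs, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]; the leftover token 9 is dropped
```

Because every block has the same length, no padding is needed and every position in a batch carries a real token, which is where the speed-up comes from.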
This is where I needed a better computer than the one I have at home. Looking at various services, I decided to try out RunPod, which allows renting very powerful Linux computers (paid for per hour) with or without GPUs, and gives you full control over them via SSH or Jupyter. They call these Pods. They also allow you to create AWS S3 buckets, which in a nutshell are a terminal-friendly way of cloud file storage.
I created a Pod with eight A100 PCIe GPUs and chose Ubuntu with pre-installed Torch and Jupyter for it. At first I was confused by the fact that its storage was split between "Container Disk" and "Volume Disk", so I set both to 200 GB. Then I copied the tokenizer and training data onto it, launched my code and... it crashed due to lack of disk space.
I stopped it so I would not keep paying for it, and created a second Pod to play around with the settings. This is when I discovered what "Container Disk" and "Volume Disk" actually are:
- Container Disk of a RunPod Pod contains the operating system and the software installed on the machine. Cache files usually get created there too (and this is what I did not have enough space for, resulting in my code crashing). When you shut down your Pod, the Container Disk is deleted and then created from scratch when you launch it again, so any data you created there is gone and you have to reinstall anything you installed, for example with package managers.
- Volume Disk is by default mounted as the directory "workspace", and it does not get deleted when you turn your Pod off. This is where you should store all files that you do not want to lose.
Once I increased both to 500 GB, the whole process went without issues. The code tokenized the dataset, created the 1024-token-long blocks and saved it all to disk so it could be reused if the code was run again. It took 4 hours in total from what I remember.
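A crash like mine can be caught much earlier with a small pre-flight check. The sketch below is something I would add today, not part of my original code. It assumes the Volume Disk is mounted at /workspace, and it points the HuggingFace datasets cache there through the HF_DATASETS_CACHE environment variable. The directory name hf_cache and the function names are made up for this example.

```python
import os
import shutil

# Assumption: the persistent Volume Disk is mounted at /workspace.
# This must be set before `datasets` is imported to take effect.
os.environ.setdefault("HF_DATASETS_CACHE", "/workspace/hf_cache")


def free_gb(path: str) -> float:
    """Free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def require_space(path: str, needed_gb: float) -> None:
    """Raise early instead of crashing hours into tokenization."""
    available = free_gb(path)
    if available < needed_gb:
        raise RuntimeError(
            f"{path}: {available:.0f} GB free, need about {needed_gb:.0f} GB"
        )


# Example call at the top of the training script (the number is a guess):
# require_space("/workspace", 400)
```

How much space to ask for depends on the dataset. In my run the tokenized and packed data came to around 200 GB, with more needed temporarily along the way.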
Now that I think about it, I could have chosen a powerful CPU-only Pod for that task. A GPU is not needed for the tokenization stage and I would have saved a lot of money that way.
Anyway, once the code reached the training stage and loaded what it needed onto the GPUs, the estimated training time was around 30 hours.
I then thought about something: what if I took their most powerful Pod with eight B200 GPUs? It would cost $50 per hour instead of around $20 per hour, but if the training time was much shorter, I would pay less in total. It turned out I was right!
I used AWS S3 to copy all files from my A100 Pod to an S3 bucket, and then I attached it directly to such a B200 Pod. When I ran the code, the training time was a little under 2.5 hours.
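Using the rounded figures above, the comparison works out as follows. This counts only the training run itself, not the time spent on setup or tokenization.

```python
def run_cost(hours: float, usd_per_hour: float) -> float:
    """Total cost of renting a Pod for the given number of hours."""
    return hours * usd_per_hour


a100 = run_cost(30, 20)    # estimated 30 h at around $20 per hour
b200 = run_cost(2.5, 50)   # about 2.5 h at $50 per hour
print(f"A100: ${a100:.0f}, B200: ${b200:.0f}, ratio: {a100 / b200:.1f}x")
# A100: $600, B200: $125, ratio: 4.8x
```

So the Pod that costs 2.5 times more per hour ends up almost 5 times cheaper for the training stage, because it finishes about 12 times faster.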
And this is the whole code I used, optimised for 8xB200:
# GPUs are needed only for final step - training
# Once you use the 8 B200 GPUs, start program with: torchrun --nproc_per_node=8 pretrain.py
# If something fails and produces a partial folder, delete it before re-launching the code.
import os
# Global env tuning
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# NCCL tuning (single-node NVLINK-heavy cluster; keep an eye on network interface if using RDMA)
os.environ.setdefault("NCCL_DEBUG", "WARN")
os.environ.setdefault("NCCL_IB_DISABLE", "0")
os.environ.setdefault("NCCL_P2P_LEVEL", "NVL")
import math
import torch
from transformers import (
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_from_disk, Dataset, DatasetDict, load_dataset
# Paths
TOKENIZER_PATH = "model/tokenizer"           #Tokenizer trained earlier
TOKENIZED_DIR = "tokenized_dataset"           # raw tokenized input
PACKED_TOKENIZED_DIR = "tokenized_dataset_packed"  # fixed-length output
RAW_DATASET = "data" # raw dataset path
BLOCK_SIZE = 1024
# Machine sizing (edit if different)
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 8))   # number of GPUs/processes
TOTAL_CPUS = os.cpu_count() or 1
CPUS_RESERVED = 32                                  # leave some for system / other tasks
CPUS_FOR_MAP = max(1, TOTAL_CPUS - CPUS_RESERVED)
# Per-process dataloader workers — roughly TOTAL_CPUS / WORLD_SIZE
DATALOADER_WORKERS_PER_PROC = max(2, TOTAL_CPUS // max(1, WORLD_SIZE))
def tokenize_raw_dataset():
    # 1. Load pretrained tokenizer
    tokenizer = GPT2TokenizerFast.from_pretrained(TOKENIZER_PATH, add_prefix_space=True)
    tokenizer.add_special_tokens({"pad_token": "<pad>"})
    # 2. Load raw dataset
    dataset = load_dataset("text", data_files={"train": RAW_DATASET+"/*.txt"})
    # 3. Define tokenization function
    def tokenize_batch(batch):
        return tokenizer(
            batch["text"],
            truncation=True,
            max_length=1024,
            return_special_tokens_mask=False,
        )
    # 4. Apply tokenization
    tokenized = dataset["train"].map(
        tokenize_batch,
        batched=True,
        batch_size=20000,
        num_proc=16,            # parallel workers
        remove_columns=["text"],
        keep_in_memory=False,   # use disk cache to avoid OOM
    )
    # 5. Save tokenized dataset
    print(f"Saving tokenized dataset to {TOKENIZED_DIR}…")
    tokenized.save_to_disk(TOKENIZED_DIR)
def _is_fixed_block(ds, block_size=BLOCK_SIZE, sample_k=1000):
    n = min(len(ds), sample_k)
    if n == 0:
        return False
    for i in range(n):
        ids = ds[i]["input_ids"]
        if not isinstance(ids, (list, tuple)) or len(ids) != block_size:
            return False
    return True
def _pack_group_texts(examples):
    # Efficient pure-Python pack; used with many processes / large batch_size
    all_ids = []
    for ids in examples["input_ids"]:
        all_ids.extend(ids)
    total_len = (len(all_ids) // BLOCK_SIZE) * BLOCK_SIZE
    all_ids = all_ids[:total_len]
    chunks = [all_ids[i: i + BLOCK_SIZE] for i in range(0, total_len, BLOCK_SIZE)]
    return {
        "input_ids": chunks,
        "attention_mask": [[1] * BLOCK_SIZE for _ in range(len(chunks))],
    }
def get_or_build_packed_dataset():
    # Tokenize raw dataset if never done yet
    if not os.path.exists(TOKENIZED_DIR):
        tokenize_raw_dataset()
    
    # Load original (tokenized) dataset from disk — accept either Dataset or DatasetDict
    base = load_from_disk(TOKENIZED_DIR)
    if isinstance(base, DatasetDict):
        if "train" not in base:
            raise ValueError(f"{TOKENIZED_DIR} is a DatasetDict but has no 'train' split.")
        base = base["train"]
    if not isinstance(base, Dataset):
        raise ValueError(f"Loaded object from {TOKENIZED_DIR} is not a Dataset or DatasetDict.")
    # If a packed dataset already exists and is correct, reuse it
    if os.path.isdir(PACKED_TOKENIZED_DIR):
        ds = load_from_disk(PACKED_TOKENIZED_DIR)
        if isinstance(ds, DatasetDict):
            ds = ds["train"]
        if _is_fixed_block(ds, BLOCK_SIZE):
            print("Found valid packed dataset — loading.")
            return ds
        print("Packed dataset exists but not fixed-length; rebuilding.")
    # If original is already fixed blocks, return
    if _is_fixed_block(base, BLOCK_SIZE):
        print("Original dataset already fixed-length blocks; using as-is.")
        return base
    # Repack: use large batch_size + many num_proc workers to utilize CPU cores
    print("Repacking dataset into fixed 1024-token blocks...")
    pack_num_proc = min(256, CPUS_FOR_MAP)  # use many CPUs but leave headroom
    pack_batch = 100_000                     # large batch to reduce Python overhead per-map
    remove_cols = base.column_names
    packed = base.map(
        _pack_group_texts,
        batched=True,
        batch_size=pack_batch,
        num_proc=pack_num_proc,
        remove_columns=remove_cols,
        load_from_cache_file=True,
        keep_in_memory=False,
        desc="Packing to 1024-token blocks",
    )
    print(f"Saving packed dataset to {PACKED_TOKENIZED_DIR}...")
    packed.save_to_disk(PACKED_TOKENIZED_DIR)
    return packed
class FixedLengthCausalCollator:
    def __init__(self, block_size: int):
        self.block_size = block_size
    def __call__(self, features):
        input_ids = torch.tensor([f["input_ids"] for f in features], dtype=torch.long)
        if "attention_mask" in features[0]:
            attention_mask = torch.tensor([f["attention_mask"] for f in features], dtype=torch.long)
        else:
            attention_mask = torch.ones((input_ids.size(0), self.block_size), dtype=torch.long)
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": input_ids}
def main():
    # Low-level speed knobs
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    # Tokenizer
    tokenizer = GPT2TokenizerFast.from_pretrained(TOKENIZER_PATH, add_prefix_space=True)
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "<pad>"})
    # Model config: GPT-2 Small
    config = GPT2Config(
        vocab_size=len(tokenizer),
        n_positions=BLOCK_SIZE,
        n_ctx=BLOCK_SIZE,
        n_embd=768,
        n_layer=12,
        n_head=12,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    config.use_cache = False
    model = GPT2LMHeadModel(config)
    model.resize_token_embeddings(len(tokenizer))
    # Dataset
    dataset = get_or_build_packed_dataset()
    # Collator
    data_collator = FixedLengthCausalCollator(block_size=BLOCK_SIZE)
    # TrainingArguments tuned for 8xB200
    training_args = TrainingArguments(
        output_dir="model/gpt2-pretrained-tmp",
        overwrite_output_dir=True,
        do_train=True,
        do_eval=False,
        num_train_epochs=3,
        # Per-GPU batch
        per_device_train_batch_size=64,
        gradient_accumulation_steps=1,
        learning_rate=3e-4,
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        warmup_ratio=0.01,
        # Use bfloat16 on Blackwell for best throughput
        bf16=True,
        fp16=False,
        # Compile for best runtime (PyTorch 2.x + Inductor)
        torch_compile=True,
        torch_compile_backend="inductor",
        # DDP/NCCL tuning
        ddp_backend="nccl",
        ddp_bucket_cap_mb=2048,   # larger buckets to reduce sync overhead on fast GPUs
        ddp_find_unused_parameters=False,
        # Dataloader perf (per-process). With 288 vCPUs, each process can use ~36 workers:
        dataloader_num_workers=DATALOADER_WORKERS_PER_PROC,
        dataloader_pin_memory=True,
        dataloader_persistent_workers=True,
        dataloader_prefetch_factor=4,
        dataloader_drop_last=True,
        # Optimizer choices
        optim="adamw_torch_fused",
        max_grad_norm=1.0,
        # Logging and saving
        logging_strategy="steps",
        logging_steps=100,
        save_strategy="steps",
        save_steps=10000,
        save_total_limit=2,
        report_to=[],  # disable reporters unless you configure them
        run_name="gpt2-pretraining-b200",
        remove_unused_columns=False,
    )
    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
        tokenizer=tokenizer,
    )
    # Train
    trainer.train()
    # Only rank-0 saves (Trainer handles this internally in ddp mode)
    trainer.save_model("model/gpt2-pretrained")
if __name__ == "__main__":
    main()
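To get a feeling for the scale of these settings, here is some back-of-the-envelope arithmetic. It is a sketch that assumes the standard GPT-2 layout, with the output layer sharing weights with the token embeddings, and a vocabulary of exactly 50257 entries. The real number in my run depends on len(tokenizer), which can differ slightly. Both function names are made up for this example.

```python
def tokens_per_step(per_device_batch, world_size, block_size, grad_accum=1):
    """Tokens processed in one optimizer step across all GPUs."""
    return per_device_batch * world_size * block_size * grad_accum


def gpt2_param_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Parameter count of a GPT-2 style model with tied output embeddings."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d   # token + position tables
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # c_attn + c_proj
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # c_fc + c_proj
    layer_norms = 2 * (2 * d)                       # ln_1 and ln_2, weight + bias
    per_layer = attention + mlp + layer_norms
    return embeddings + n_layer * per_layer + 2 * d  # plus the final layer norm


print(tokens_per_step(64, 8, 1024))  # 524288
print(gpt2_param_count(50257))       # 124439808
```

So with a batch of 64 per GPU on 8 GPUs, every optimizer step sees about half a million tokens, and the model has roughly 124 million parameters.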
4. Results
I checked on the Pod after 2.5 hours. When I noticed it had finished successfully, I quickly turned it off so it would stop charging me. I then launched it with a single minimal CPU just to take a look at the model files and download them.
All files of the successfully pre-trained GPT-2 "Small" are on the left. On the right are the sizes of the folders on the machine. When it finished I was too excited to take screenshots, so I took this one many hours later.
In total I spent around $327 on this project. If I had had my current knowledge of RunPod, planned it better, prepared the whole final, optimized code earlier instead of experimenting with it on the go, and done the tokenization on a CPU-only Pod before switching to a GPU one only for the final training stage that really requires a GPU, I believe it could have cost less than half of that.
But let's return to the GPT-2 model I trained.
After testing it with a simple GPT-2 inference script, I wrote a nice GUI app for it that lets you change the model's generation settings and has a nice canvas for displaying the images it generates.
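A simple test script could look roughly like the sketch below. It is not the exact code I used. It assumes that the model repository can be loaded with the Auto classes from Transformers, and the sampling settings (top_k, temperature, max_new_tokens) are arbitrary placeholder values to experiment with. The names `render` and `generate_text` are made up for this example.

```python
def render(rows, ink="#", paper="."):
    """Make a grid of 0s (ink) and 1s (paper) easier to read in a terminal."""
    return "\n".join(r.replace("0", ink).replace("1", paper) for r in rows)


def generate_text(prompt, repo="Wojtekb30/gpt-2-pencil-small", max_new_tokens=700):
    """Sample a continuation of `prompt`. Needs torch and transformers."""
    # Imported here so that render() works without these libraries installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # placeholder; keep prompt + output under 1024
        do_sample=True,
        top_k=50,          # placeholder sampling settings
        temperature=0.8,
    )
    return tokenizer.decode(output[0])


# Example (downloads the model on first use):
# print(generate_text("apple\n<IMGSTART>\n"))
```

The `render` helper is only a stand-in for the canvas in the GUI app: it swaps the digits for characters that make the shape easier to see.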
It turned out to be quite good at generating images, but worse at recognising them (a stroke width of 2 pixels seems to help it, though). Here are a few examples:
Thank you for reading my post! For me this experiment was a great experience I learned a lot from, and I wish I could do this again but on a real language dataset.
This model's HuggingFace repository: https://huggingface.co/Wojtekb30/gpt-2-pencil-small








 
 