Pre-training a GPT-2 AI model on 32 GB of data. I made it both draw and recognise sketches!

Hello everyone! In this post I will tell you how I trained a GPT-2 "Small" AI model from scratch on a huge dataset of around 32 GB of raw text files, using the PyTorch and HuggingFace Transformers libraries. I will explain the steps of such a project, tell you how it went for me and how I would recommend approaching it now, and of course I will also show you the code that pre-trains the AI model from a dataset. Let's start from the beginning:

1. The idea, purpose and raw dataset

First and foremost, a project like this needs an idea and should have a purpose. I pre-trained (trained from scratch) the AI model just to learn more about the process, but pre-training LLM AI models at that scale is rarely done and is usually pointless, as there are lots of base models already trained on specific languages that can then simply be fine-tuned into a specific response format, for example. But let's assume that we want to pre-train such a model anyway, for example as an a...
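To give you a taste of what such a pre-training script looks like, here is a minimal sketch using HuggingFace Transformers and PyTorch. The file paths, hyperparameters and the pre-trained tokenizer directory are illustrative assumptions for this sketch, not my exact setup.

```python
# Minimal sketch: pre-training a GPT-2 "Small" model from scratch.
# Paths and hyperparameters below are assumptions, adjust them to your corpus.
from datasets import load_dataset
from transformers import (
    GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Assumption: a tokenizer was already trained on the corpus and saved here.
tokenizer = GPT2TokenizerFast.from_pretrained("tokenizer/")
tokenizer.pad_token = tokenizer.eos_token

# GPT-2 "Small": 12 layers, 12 heads, 768-dim embeddings (~124M parameters).
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=1024, n_embd=768, n_layer=12, n_head=12,
)
model = GPT2LMHeadModel(config)  # random weights, i.e. training from scratch

# Assumption: the raw dataset is a directory of plain .txt files.
dataset = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the causal language-modeling objective GPT-2 uses.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-from-scratch",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,  # effective batch size of 64
    num_train_epochs=1,
    fp16=True,                      # mixed precision; requires a GPU
    save_steps=10_000,
    logging_steps=500,
)

Trainer(
    model=model, args=args,
    train_dataset=tokenized, data_collator=collator,
).train()
```

On a 32 GB corpus, most of the wall-clock time goes into the training loop itself, so things like tokenizing once and caching the result (which `datasets.map` does for you) matter a lot in practice.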