AssemblerGPT - my GPT fine-tuning adventure
Hello everyone reading this!
Before I start with the topic of this post, I would like to mention that, due to being busy, the next ISKRA Project articles will sadly be delayed. I won't be able to publish them at regular intervals.
Anyway, let's get into the topic I would like to write about here.
I have fine-tuned various AI models in the past, but I had never fine-tuned one of OpenAI's latest GPT models. Some time ago, I decided to finally try that.
But first, for those of you who don't know, I will explain what fine-tuning is.
In a nutshell, it is a process of training an artificial intelligence, usually to shape its output (responses) in a specific manner, to teach it to deal with new tasks, or to provide it with new data. It is usually done by providing a lot of examples of input (for example, questions) and the desired output for each of those inputs (for example, correct answers to all the questions).
An example could be: if you fine-tune a text generator AI on theatrical play scripts, the text it generates after fine-tuning will resemble theatrical play scripts.
Fine-tuning usually makes output more reliable and content generation faster and (in the long run) cheaper, compared to modifying AI behavior by other means, such as the System message (an overall behavior directive) that modern GPT models support.
Lastly, unlike the initial training which is necessary to create an AI model, fine-tuning can be redone multiple times, and it does not have to depend on the original training material. For example, if you want to create a theatrical-play-script text generator AI, you can first train a default model on any type of text so it gains an understanding of the language, and then fine-tune it into the specific response pattern: theatrical play scripts. And once you gather even more good examples of theatrical play scripts in the future, you can fine-tune the previously fine-tuned AI further to improve its responses even more. However, too much fine-tuning, or fine-tuning on AI-generated content, might corrupt the AI model by solidifying mistakes or making it too repetitive to be useful.
The idea
The first, and in fact the main, challenge for me was to decide what to fine-tune the AI for and on what data. I did not have any specific needs, nor a large dataset that could be used for fine-tuning. I also wanted it to be something that would let me see the effects of my fine-tuning without any doubt. If, for example, I fine-tuned it to be sarcastic, I would not really know whether the AI is sarcastic thanks to my fine-tuning or because it is simply good at imitating human writing (and ChatGPT is one of the best when it comes to that).
I finally came up with an idea. I remembered how, months ago, I once tried to make ChatGPT write assembly code, and it was unable to write anything that actually worked (or at least I never managed to get any of it to work). I considered it a great opportunity: I'll teach ChatGPT real, working x86-64 assembler!
But what is assembly code?
Putting it simply, every computer program is a set of instructions which the CPU (processor) executes one by one. For the CPU, each of these instructions is a different number, and the whole program is a long sequence of such numbers. These numbers are called machine code. For example, the x86-64 instruction that loads the number 1 into the EAX register, mov eax, 1, is encoded as the five bytes B8 01 00 00 00.
A very long time ago, computer programs had to be written directly in machine code, which programmers would do by punching holes in paper cards or tapes to represent the binary numbers.
Some time later, assembly code appeared. It allowed programmers to write programs as text, with each text instruction translating very directly into machine code.
Different CPU models may have different architectures (internal structure), which makes their machine code unique. This is what makes programs not cross-compatible between devices. Since the machine code is different between CPUs, the assembly code for different CPUs is different and unique too.
When you nowadays write a program in a programming language that gets compiled (converted) into machine code (for example C++), your code is first translated into assembly code for the processor architecture of choice, and then that assembly code is converted into the final machine code which can be run as a program.
Assembly code is unique to each CPU architecture, it is often very complex, and people nowadays rarely code directly in assembly (except in academic settings or for very specialised uses such as the creation of new operating systems), preferring more advanced and simpler programming languages that get automatically translated into assembly anyway. Because of all that, there are not enough examples online for ChatGPT to properly learn it and to understand how it differs between CPUs. This is why "default" ChatGPT (as of when I did the tests, July 2024) is not really able to write assembly code that actually works.
Getting the data for fine-tuning
As I described earlier, fine-tuning GPT (and many other AI models) requires providing a lot of examples of possible inputs and desired outputs. Therefore, I needed examples of questions asking to generate certain computer programs, paired with working assembly code of those programs.
I decided to use the fact that the GCC C/C++ compiler can save the assembly code generated from C++ code into a text file, when given a proper command in the command line.
But first, I needed many C++ programs to convert into assembly, preferably very simple ones written in pure C++ without much use of libraries.
I decided that the easiest way to get them would be to generate them with AI.
But didn't I say earlier that fine-tuning AI on AI generated content is a bad idea?
Well, the answer is that fine-tuning AI directly on AI-generated content solidifies mistakes and causes repetition in responses, since it brings nothing new to the model. In my case, however, I was going to use the C++ code only to generate assembly code with GCC, and then train the AI on that assembly code, which would be completely new data for the model. I was not going to train GPT on assembly code it had generated itself; that would definitely make it useless and be pointless. And of course, I agree that it would be best to use human-made C or C++ code, but for the sake of this experiment, AI-generated C++ code is enough.
Anyway, to generate a lot of C++ code examples, I decided to use OpenAI's Batches. Batches allow mass generation of GPT model responses. They work faster than generating responses manually one by one, and cheaper, since they run on unused computing power of OpenAI's servers whenever it is available, instead of having to respond instantly. OpenAI promises to finish all batch generation requests within 24 hours, though usually it goes much faster than that.
To start batch processing, a proper JSONL file is needed. A JSONL file contains multiple JSON objects, one per line. This is the one I created and used:
{"custom_id": "cpp-batch-1-test3", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo", "messages": [{"role": "system", "content": "You will write only the code of a random program in pure C++ and nothing else. Do not include any explanations, comments, or additional text."}], "n": 100, "max_tokens": 1000}}
{"custom_id": "cpp-batch-2-test3", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo", "messages": [{"role": "system", "content": "You will write only the code of a random complex program in pure C++ and nothing else. The program shall contain multiple functions. Do not include any explanations, comments, or additional text."}], "n": 100, "max_tokens": 3000}}
Each line (each line being a separate request) contains, in order:
- A user-friendly custom ID of the request.
- Web request method to use.
- URL to API endpoint (/v1/chat/completions is for ChatGPT).
- Body of the API web request, which in turn contains, in order:
- AI model to use (I chose GPT-3.5-turbo, the fastest).
- System message (behavior directive; here, a command to generate C++ code and nothing else).
- Number of possible responses to generate (here 100).
- Maximum token count (here 1000 for the first batch and 3000 for the second; think of it as a length limit up to which the AI generates text).
The first batch is to generate random simple C++ programs, the second to generate more complex C++ programs which contain things such as functions.
After less than a minute of processing, I could download a JSONL containing in total 200 simple C++ codes that could be fully compiled and run.
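For reference, this is roughly how such a batch can be submitted and collected with the openai Python library; it is a minimal sketch, and the file names and the polling loop are my own assumptions:

```python
# Minimal sketch of submitting and collecting an OpenAI batch.
# File names are placeholders; assumes OPENAI_API_KEY is set.
import time
from openai import OpenAI

client = OpenAI()

# Upload the JSONL with the two requests shown above.
batch_input = client.files.create(file=open("cpp_batch.jsonl", "rb"),
                                  purpose="batch")

# Start the batch; OpenAI promises results within 24 hours.
batch = client.batches.create(input_file_id=batch_input.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")

# Wait for it to finish, then save the generated responses.
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status == "completed":
        break
    time.sleep(30)

output = client.files.content(batch.output_file_id)
open("cpp_batch_output.jsonl", "wb").write(output.read())
```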
I then used a script to write each of these C++ programs into a separate text file, and another script to convert them into text files containing assembly code. I was working on Linux Ubuntu and ran the command "gcc -S [C++ code file name] -o [file name to save assembly code into]" on each of them. After several minutes of computing, I finally had 200 examples of various very simple programs written in real x86-64 assembler.
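A minimal sketch of what those two scripts boil down to (the file names are placeholders of mine; the JSONL layout is OpenAI's batch output format):

```python
# Sketch: split the batch output into .cpp files, then have GCC
# emit the corresponding x86-64 assembly with -S.
import json
import subprocess

with open("cpp_batch_output.jsonl") as f:
    for i, line in enumerate(f):
        request_result = json.loads(line)
        choices = request_result["response"]["body"]["choices"]
        for j, choice in enumerate(choices):  # n=100 responses per request
            cpp_path = f"sample_{i}_{j}.cpp"
            with open(cpp_path, "w") as cpp_file:
                cpp_file.write(choice["message"]["content"])
            # The same command I ran on Ubuntu for each file; check=False
            # so a snippet that fails to compile doesn't abort the loop.
            subprocess.run(["gcc", "-S", cpp_path,
                            "-o", f"sample_{i}_{j}.s"], check=False)
```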
I had proper examples of output, but I still lacked examples of input. I decided to generate those with AI as well, based on what the C++ code samples do. After a while of experimenting with different System messages for GPT-3.5-turbo, I came up with this one:
"You will be given code in C++.
Write detailed description what it does, but in form of a question to write a Assembler program that works like this. For example:
"Write program that will add, divide and multiply two numbers"
Sound like a human and do not be repetetive. Never ask "How...", but give commands."
Thanks to that, it ended up generating requests for code that were unique and sounded human (human enough for my needs, at least), like for example "Can you create an Assembler program that generates a random number between 1 and 100 and prints it to the screen?" or "Write an Assembler program that takes a vector of integers {3, 2, 5, 1, 4}, sorts it in ascending order, and then outputs the sorted numbers separated by spaces.". I again broke the rule about not fine-tuning AI on AI-generated content, but in my opinion, in this specific case it is not an issue that would have visible negative effects.
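For reference, generating those inputs boils down to a loop like this; a sketch, with file names following the earlier sketch as placeholders:

```python
# Sketch: show every C++ sample to GPT-3.5-turbo together with the
# System message quoted above, and save the resulting question.
import glob
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You will be given code in C++.
Write detailed description what it does, but in form of a question to write a Assembler program that works like this. For example:
"Write program that will add, divide and multiply two numbers"
Sound like a human and do not be repetetive. Never ask "How...", but give commands."""

for cpp_path in sorted(glob.glob("sample_*.cpp")):
    with open(cpp_path) as f:
        code = f.read()
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": code}],
    )
    question = reply.choices[0].message.content
    with open(cpp_path[:-len(".cpp")] + ".question.txt", "w") as f:
        f.write(question)
```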
Anyway, I now had inputs and assembler outputs. I also needed a System for the fine-tuned model. I came up with this:
"You are a coding assistant AI specialised in writing x86-64 g++ Assembler code.
The Assembler code you provide is for 64-bit Intel Core i7 8th gen CPU and Linux Ubuntu Mate operating system.
The Assembler code is x86-64 and can be further compiled with g++ 13.2.0 (Ubuntu 13.2.0-23ubuntu4).
Before writing the Assembler code in your responses, always write C++ code first, preferably use pure C++."
I wrote g++ instead of GCC in this System prompt, since the assembly code had to be compiled further with g++ instead of GCC to properly use external libraries. For some reason, I could not get the further compilation to machine code to work with GCC.
Anyway, I finally wrote a script to create a JSONL for fine-tuning (fine-tuning is based on data provided in JSONL format too, just like batches), where the examples had the following structure:
- The System prompt (always the same as written above).
- Input: One of the ready input examples.
- Output: "Sure! But let's first write C++ code which works like that:" then C++ code corresponding with the input, then "Now, let's write the Assembler code:" and finally the corresponding assembly code.
I made the AI depend on that exact System prompt to use my fine-tuning, and I technically hardcoded what the beginning and middle of the answer look like (which does not really matter, because the code is what matters here). I decided to include the C++ code in the response too, for more context regarding the assembly code itself.
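Here is a minimal sketch of what that generator script amounts to, using the standard chat fine-tuning format (the file names are the placeholders from the earlier sketches):

```python
# Sketch: build the fine-tuning JSONL. Each training example is one
# JSON object with system/user/assistant messages.
import glob
import json

SYSTEM_PROMPT = """You are a coding assistant AI specialised in writing x86-64 g++ Assembler code.
The Assembler code you provide is for 64-bit Intel Core i7 8th gen CPU and Linux Ubuntu Mate operating system.
The Assembler code is x86-64 and can be further compiled with g++ 13.2.0 (Ubuntu 13.2.0-23ubuntu4).
Before writing the Assembler code in your responses, always write C++ code first, preferably use pure C++."""

def read(path: str) -> str:
    with open(path) as f:
        return f.read()

with open("finetune_data.jsonl", "w") as out:
    for cpp_path in sorted(glob.glob("sample_*.cpp")):
        base = cpp_path[:-len(".cpp")]
        # The hardcoded answer shape described above.
        answer = ("Sure! But let's first write C++ code which works like that:\n"
                  + read(cpp_path)
                  + "\nNow, let's write the Assembler code:\n"
                  + read(base + ".s"))
        example = {"messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": read(base + ".question.txt")},
            {"role": "assistant", "content": answer},
        ]}
        out.write(json.dumps(example) + "\n")
```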
One attempt to run the script to generate the training JSONL failed partway through, producing a version only 111 examples long. That will be relevant later. Of course, I managed to produce the full 200-example training material too.
Finally, it was time to start the fine-tuning.
I wanted the first fine-tuning attempt to be cheap, and finally chose to fine-tune GPT-4o-mini-2024-07-18 (on which there was a big discount). To make the first try even cheaper, I decided to use the 111-example dataset I mentioned earlier instead of the 200-example one.
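Starting such a job through the API takes only two calls; a sketch, assuming the training JSONL from the earlier sketch:

```python
# Sketch: upload the training JSONL and start the fine-tuning job.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

training_file = client.files.create(file=open("finetune_data.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-4o-mini-2024-07-18")
print(job.id)  # progress can also be watched in the OpenAI dashboard
```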
The final cost of the fine-tuning was $19.09, and it took around 20 minutes.
Once it finished (I was extremely surprised it worked on the first try, by the way), it was time to try it out. The best environment for that was, in my opinion, the OpenAI Playground, which even allows output comparisons with a base GPT model.
The screenshots I took right after I fine-tuned the AI don't show much except the raw results, so further in this post I will use screenshots I took more recently.
The setup involves the fine-tuned model with the System message it was fine-tuned with. I also put a default GPT-4o-mini with the same System on the left for output comparison. The temperature parameter of both is set to 0.90, since I noticed it makes the assembly code a little more reliable.
I then asked both to write an Assembler program that will add the numbers 8 and 120:
This is what a default GPT-4o-mini answered:
Here’s a simple C++ program that adds the numbers 8 and 120. After that, I will provide you with the corresponding x86-64 Assembly code.
### C++ Code
```cpp
#include <iostream>

int main() {
    int a = 8;
    int b = 120;
    int sum = a + b;
    std::cout << "The sum of " << a << " and " << b << " is " << sum << std::endl;
    return 0;
}
```
### Assembly Code
Below is the corresponding x86-64 Assembly code that performs the same operation as the C++ program above.
```assembly
section .data
    msg db "The sum of 8 and 120 is ", 0   ; Message to print
    len equ $ - msg                        ; Length of the message

section .bss
    sum resb 4                             ; Reserve space for sum

section .text
    global _start

_start:
    ; Initialize values
    mov rax, 8          ; Move 8 into rax
    add rax, 120        ; Add 120 to rax (now rax = 128)

    ; Store the result in 'sum'
    mov [sum], rax      ; Store the value of rax in 'sum'

    ; Print message
    mov rax, 1          ; syscall: write
    mov rdi, 1          ; file descriptor: stdout
    mov rsi, msg        ; address of string to output
    mov rdx, len        ; length of string
    syscall             ; invoke syscall

    ; Print the sum
    mov rax, 1          ; syscall: write
    mov rdi, 1          ; file descriptor: stdout
    mov rsi, sum        ; address of the sum (to convert to string)
    mov rdx, 4          ; length of sum (we will just print raw bytes)
    syscall             ; invoke syscall

    ; Exit program
    mov rax, 60         ; syscall: exit
    xor rdi, rdi        ; exit code 0
    syscall             ; invoke syscall
```
### Explanation:
1. The assembly program initializes the numbers 8 and 120, then computes their sum.
2. It uses `syscall` for outputting messages.
3. The message is printed to the console, followed by displaying the raw byte value of the sum (note that you would need additional formatting to print it properly).
To compile the assembly code, you would typically save it into a file called `sum.asm`, and you can assemble and link it using the following commands:
```sh
nasm -f elf64 -o sum.o sum.asm
ld -o sum sum.o
./sum
```
Make sure you have the necessary toolchain installed (NASM for assembly and `ld` for linking) on your Ubuntu system to execute the above commands.
And this is what my fine-tuned model answered:
Sure! But let's first write C++ code which works like that:
#include <iostream>

int main() {
    int number1 = 8;
    int number2 = 120;
    int sum = number1 + number2;
    std::cout << "The sum of " << number1 << " and " << number2 << " is: " << sum << std::endl;
    return 0;
}
Now, let's write the Assembler code:
        .file   "cppcode.cpp"
        .text
#APP
        .globl  _ZSt21ios_base_library_initv
        .section        .rodata
.LC0:
        .string "The sum of "
.LC1:
        .string " and "
.LC2:
        .string " is: "
#NO_APP
        .text
        .globl  main
        .type   main, @function
main:
.LFB1988:
        .cfi_startproc
        endbr64
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movl    $8, -12(%rbp)
        movl    $120, -8(%rbp)
        movl    -12(%rbp), %edx
        movl    -8(%rbp), %eax
        addl    %edx, %eax
        movl    %eax, -4(%rbp)
        leaq    .LC0(%rip), %rax
        movq    %rax, %rsi
        leaq    _ZSt4cout(%rip), %rax
        movq    %rax, %rdi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@PLT
        movq    %rax, %rdx
        movl    -12(%rbp), %eax
        movl    %eax, %esi
        movq    %rdx, %rdi
        call    _ZNSolsEi@PLT
        movq    %rax, %rdx
        leaq    .LC1(%rip), %rax
        movq    %rax, %rsi
        movq    %rdx, %rdi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@PLT
        movq    %rax, %rdx
        movl    -8(%rbp), %eax
        movl    %eax, %esi
        movq    %rdx, %rdi
        call    _ZNSolsEi@PLT
        movq    %rax, %rdx
        leaq    .LC2(%rip), %rax
        movq    %rax, %rsi
        movq    %rdx, %rdi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@PLT
        movq    %rax, %rdx
        movl    -4(%rbp), %eax
        movl    %eax, %esi
        movq    %rdx, %rdi
        call    _ZNSolsEi@PLT
        movq    _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_@GOTPCREL(%rip), %rdx
        movq    %rdx, %rsi
        movq    %rax, %rdi
        call    _ZNSolsEPFRSoS_E@PLT
        movl    $0, %eax
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE1988:
        .size   main, .-main
        .section        .rodata
        .type   _ZNSt8__detail30__integer_to_chars_is_unsignedIjEE, @object
        .size   _ZNSt8__detail30__integer_to_chars_is_unsignedIjEE, 1
_ZNSt8__detail30__integer_to_chars_is_unsignedIjEE:
        .byte   1
        .type   _ZNSt8__detail30__integer_to_chars_is_unsignedImEE, @object
        .size   _ZNSt8__detail30__integer_to_chars_is_unsignedImEE, 1
_ZNSt8__detail30__integer_to_chars_is_unsignedImEE:
        .byte   1
        .type   _ZNSt8__detail30__integer_to_chars_is_unsignedIyEE, @object
        .size   _ZNSt8__detail30__integer_to_chars_is_unsignedIyEE, 1
_ZNSt8__detail30__integer_to_chars_is_unsignedIyEE:
        .byte   1
        .ident  "GCC: (Ubuntu 13.2.0-23ubuntu4) 13.2.0"
        .section        .note.GNU-stack,"",@progbits
        .section        .note.gnu.property,"a"
        .align 8
        .long   1f - 0f
        .long   4f - 1f
        .long   5
0:
        .string "GNU"
1:
        .align 8
        .long   0xc0000002
        .long   3f - 2f
2:
        .long   0x3
3:
        .align 8
4:
Now, let's save the code generated by my model into a file, finish compiling it, and run it!
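(For the record, on my setup that finishing step amounts to something like "g++ answer.s -o program" followed by "./program"; the file names here are just placeholders of mine.)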
As we can see, the code generated by my AI works as intended.
Now let's also have it print some text, "Made by Wojtekb30", before the math. My input was: "Create a Assembler program that will first display "Made by Wojtekb30" and then add numbers 8 and 120".
My AI's full response to this was:
Sure! But let's first write C++ code which works like that:
#include <iostream>

int main() {
    std::cout << "Made by Wojtekb30" << std::endl;
    int number1 = 8;
    int number2 = 120;
    int sum = number1 + number2;
    std::cout << "The sum of " << number1 << " and " << number2 << " is: " << sum << std::endl;
    return 0;
}
Now, let's write the Assembler code:
        .file   "cppcode.cpp"
        .text
#APP
        .globl  _ZSt21ios_base_library_initv
        .section        .rodata
.LC0:
        .string "Made by Wojtekb30"
#NO_APP
        .text
        .globl  main
        .type   main, @function
main:
.LFB1988:
        .cfi_startproc
        endbr64
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        leaq    .LC0(%rip), %rax
        movq    %rax, %rsi
        leaq    _ZSt4cout(%rip), %rax
        movq    %rax, %rdi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@PLT
        movq    _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_@GOTPCREL(%rip), %rdx
        movq    %rdx, %rsi
        movq    %rax, %rdi
        call    _ZNSolsEPFRSoS_E@PLT
        movl    $8, -12(%rbp)
        movl    $120, -8(%rbp)
        movl    -12(%rbp), %edx
        movl    -8(%rbp), %eax
        addl    %edx, %eax
        movl    %eax, -4(%rbp)
        leaq    .LC1(%rip), %rax
        movq    %rax, %rsi
        leaq    _ZSt4cout(%rip), %rax
        movq    %rax, %rdi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@PLT
        movq    %rax, %rdx
        movl    -12(%rbp), %eax
        movl    %eax, %esi
        movq    %rdx, %rdi
        call    _ZNSolsEi@PLT
        movq    %rax, %rdx
        leaq    .LC2(%rip), %rax
        movq    %rax, %rsi
        movq    %rdx, %rdi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@PLT
        movq    %rax, %rdx
        movl    -8(%rbp), %eax
        movl    %eax, %esi
        movq    %rdx, %rdi
        call    _ZNSolsEi@PLT
        movq    %rax, %rdx
        leaq    .LC3(%rip), %rax
        movq    %rax, %rsi
        movq    %rdx, %rdi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@PLT
        movq    %rax, %rdx
        movl    -4(%rbp), %eax
        movl    %eax, %esi
        movq    %rdx, %rdi
        call    _ZNSolsEi@PLT
        movq    _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_@GOTPCREL(%rip), %rdx
        movq    %rdx, %rsi
        movq    %rax, %rdi
        call    _ZNSolsEPFRSoS_E@PLT
        movl    $0, %eax
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE1988:
        .size   main, .-main
        .section        .rodata
.LC1:
        .string " and "
.LC2:
        .string " is: "
.LC3:
        .string "Made by Wojtekb30"
        .ident  "GCC: (Ubuntu 13.2.0-23ubuntu4) 13.2.0"
        .section        .note.GNU-stack,"",@progbits
        .section        .note.gnu.property,"a"
        .align 8
        .long   1f - 0f
        .long   4f - 1f
        .long   5
0:
        .string "GNU"
1:
        .align 8
        .long   0xc0000002
        .long   3f - 2f
2:
        .long   0x3
3:
        .align 8
4:
Now let's again save it to a text file, compile and run it:
It kinda worked, but "Made by Wojtekb30" is printed to the terminal twice, and the second time in the wrong place. Looking at the generated assembly explains why: the model shifted the string constants of the sum sentence by one, so .LC1 holds " and " where "The sum of " should be, "The sum of " is never emitted at all, and the final label .LC3 holds a second copy of "Made by Wojtekb30", which reappears in the middle of the output.
Conclusions
Personally, I am very happy and impressed that it worked, especially since it did on the first try.
I have done more tests than described here and noticed that the AI may write code that causes memory violations or doesn't work as expected. Still, it is a very good result, in my opinion, for how little training material it had, and it is especially good compared with the assembly written by default GPTs, which I never managed to get running at all (again, it might be because I was doing something wrong).
By the way, if I wanted to use my fine-tuned model via the API, I would just have to put the full, very long model name as the model name in the API request.
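In the openai Python library, that would look roughly like this; the "ft:..." model name below is a made-up placeholder, the real one appears when the fine-tuning job finishes:

```python
# Sketch: call the fine-tuned model with the exact System message
# it was trained on (it depends on that System to behave as tuned).
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are a coding assistant AI specialised in writing x86-64 g++ Assembler code.
The Assembler code you provide is for 64-bit Intel Core i7 8th gen CPU and Linux Ubuntu Mate operating system.
The Assembler code is x86-64 and can be further compiled with g++ 13.2.0 (Ubuntu 13.2.0-23ubuntu4).
Before writing the Assembler code in your responses, always write C++ code first, preferably use pure C++."""

reply = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:personal::abc12345",  # hypothetical name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Write an Assembler program that will add numbers 8 and 120"},
    ],
    temperature=0.9,
)
print(reply.choices[0].message.content)
```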
When fine-tuning, I also made some mistakes. The System message I fine-tuned it on was definitely too long and complex, and the C++ code in the answers was quite unnecessary. As a result, some examples in the fine-tuning JSONL went beyond the maximum context size (they were too long), so they likely got omitted during the process. I also chose assembly code that I could easily run directly on my own computer, but if I were fine-tuning such a model seriously for public use, I would rather go with ARM64 or RISC-V assembler, since those are more commonly written from scratch than x86-64 assembly, and the model would therefore be more useful to people. In a serious model, I would also use C/C++ code, and if available, assembly code written by humans rather than AI-generated.
Again, I am glad it worked.
Thank you for reading this post! I hope you liked it.
The next thing I publish will most likely be the next ISKRA Project experiment. Or an article on how GPT AIs work internally, or a tutorial-log on training and fine-tuning GPT-2 or another AI model from scratch on your own computer. I'll see, and I am happy to hear suggestions!
See you in other posts! And check out the other stuff I have written; I hope it will be interesting to you too.