ISKRA experiment 02_10_2024-1 - "Defeat the Minecraft creeper" [reupload]

Hello everyone! In this post, I will describe my recent and first public ISKRA Project experiment. I have conducted some experiments before, but I believe they are not worth publishing. Let's consider this one the first official ISKRA Project experiment.

Let's begin with a description of the setup.

I've built an installation in Minecraft Bedrock Edition:

This was the starting point and viewport of our AI:

The goal of the AI was to defeat the Creeper by using a set of pre-defined control commands:

W - move forward 1 block

S - move backwards 1 block

A - strafe left 1 block

D - strafe right 1 block

P - place a brown dirt block in front of you

R - destroy block in front of you

F - punch to attack/fight

The commands P and R were never actually implemented since they were unnecessary for this experiment. The Creeper was trapped in a structure to prevent it from escaping. The game mode was set to creative.

In each iteration, the AI saw the current viewport downscaled from the real screen resolution (4K): to 720p in the first attempt, and to 1080p from the second attempt onward.
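As a rough illustration of the downscaling step, the sketch below computes the target size for a given width while keeping the aspect ratio. The function name `scaled_size` is hypothetical, and 3840x2160 is an assumed 4K resolution; the experiment only specifies "4K".

```python
def scaled_size(width: int, height: int, target_width: int) -> tuple:
    """Return (target_width, new_height), preserving the aspect ratio."""
    new_height = int(target_width * height / width)
    return target_width, new_height


# Assumed 4K UHD screen (3840x2160); not stated in the post.
SCREEN = (3840, 2160)

print(scaled_size(*SCREEN, 1280))  # 720p-sized frame
print(scaled_size(*SCREEN, 1920))  # 1080p-sized frame
```

A 1080p frame carries 2.25 times as many pixels as a 720p frame, which may be why small targets become easier to recognize.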

I used my own OpenAI Python classes to interact with GPT models in this experiment.

I employed two GPT instances: one named Brain, which controlled the situation, and another named Movement, which extracted movement commands from Brain's output. These commands were then read one by one and executed in Minecraft by simulating keyboard presses. From here on, I will refer to the AIs as Brain and Movement. In the code, I used Polish variable names, calling Brain "mozg" ("mózg") and Movement "ruch". Lastly, I used text-to-speech to read Brain's responses aloud, primarily for the video linked at the end of this article.

Overall the algorithm looked like this:

1. Get command text from the user. Define Brain and Movement.
2. Start program loop.
3. Get a screenshot and send it to Brain with an ambiguous command "Do something".
4. Give Brain's response to Movement, extract movement instructions in the proper format.
5. Execute the instructions one by one. End the whole program if the instruction says "DONE"; otherwise, continue by returning to point 3.
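The steps above can be sketched with stand-in functions instead of real GPT calls. Everything here is illustrative: `run_loop`, the fake Brain and Movement, and the `max_iterations` safety cap are assumptions and not part of the original program, and the screenshot step is omitted.

```python
from typing import Callable, List


def run_loop(
    brain: Callable[[str], str],
    movement: Callable[[str], str],
    execute: Callable[[str], None],
    max_iterations: int = 10,  # hypothetical safety cap, not in the original
) -> int:
    """Run the loop from steps 2-5; return the number of iterations used."""
    for iteration in range(1, max_iterations + 1):
        thoughts = brain("Do something")              # step 3 (no screenshot here)
        instructions = movement(thoughts).split("|")  # step 4
        for instruction in instructions:              # step 5
            command = instruction.strip().upper()
            if command == "DONE":
                return iteration
            execute(command)
    return max_iterations


# Stand-ins for the two models: Brain replays canned replies, and
# Movement simply takes the last whitespace-separated word of the text.
replies = iter([
    "I will walk forward and attack. W|W|F",
    "The Creeper is gone. DONE",
])
pressed: List[str] = []

used = run_loop(
    brain=lambda prompt: next(replies),
    movement=lambda text: text.rsplit(" ", 1)[-1],
    execute=pressed.append,
)
print(used, pressed)
```

Passing the agents in as plain callables keeps the control flow testable without an API key or a running game.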

Initial SYSTEM messages of the AIs (these define their behavior):

Movement (never changed):

Extract movement instructions from given text.
Available instructions are:
W - move forward 1 block
S - move backwards 1 block
A - strafe left 1 block
D - strafe right 1 block
P - place block in front of you
R - destroy block in front of you
F - punch to attack/fight
Separate each instruction with |
Example output:
W|W|A|P
If the text is 'DONE', add instruction 'DONE' too. If the text indicates that the objective was accomplished, also add instruction 'DONE'.
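A minimal sketch of how output in this format could be parsed on the program side. The helper name `parse_instructions` and the choice to silently drop unknown tokens are assumptions made for illustration.

```python
from typing import List

# Instruction letters from the SYSTEM message above, plus the DONE marker.
VALID = {"W", "S", "A", "D", "P", "R", "F", "DONE"}


def parse_instructions(text: str) -> List[str]:
    """Split on | and keep only known instructions, normalised to upper case."""
    parsed = []
    for raw in text.split("|"):
        token = raw.strip().upper()
        if token in VALID:
            parsed.append(token)
    return parsed


print(parse_instructions("W|W|A|P"))
print(parse_instructions(" w | f |jump| done "))
```

Normalising case and whitespace makes the parser tolerant of small formatting deviations in the model's reply.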

Brain:

You will be playing Minecraft. You will receive images of what your character sees now. You can control your character by writing commands. Separate them with |.
Available instructions are:
W - move forward 1 block
S - move backwards 1 block
A - strafe left 1 block
D - strafe right 1 block
P - place a brown dirt block in front of you
R - destroy block in front of you
F - punch to attack/fight
Do your best to accomplish the objective. When you accomplish the full objective and verify that, write DONE.
Current objective:
[objective typed by the user gets added here in a new line]

Movement was always GPT-3.5-turbo, while Brain started as GPT-4o-mini and used GPT-4o in some tests. Only GPT-4o and 4o-mini can process images as of now. Both were set to a generation limit of 1000 tokens (a token is usually a whole word or a part of a word generated by the AI, "roughly 4 characters" as OpenAI states). I will not specify which attempts used GPT-4o and which used 4o-mini, since both behaved very similarly.
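To get a feel for what a 1000-token limit means, here is a back-of-the-envelope estimate based on the "roughly 4 characters" rule of thumb. The helper names are hypothetical, and real tokenizers can deviate noticeably from this average, especially for short or symbol-heavy strings.

```python
CHARS_PER_TOKEN = 4  # rule of thumb only; actual tokenisation varies


def approx_tokens(text: str) -> int:
    """Estimate the token count of a text, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def approx_chars(token_limit: int) -> int:
    """Estimate how many characters fit into a token limit."""
    return token_limit * CHARS_PER_TOKEN


print(approx_chars(1000))
print(approx_tokens("Move closer to the Creeper and attack."))
```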

For all attempts, the objective was simply "defeat the Creeper."

Attempt #1:

Brain returned random movements that looked like an intense fight. It would strafe left and right a lot while performing attacks in the meantime.

I concluded that it could not see the enemy. After confirming that screenshots were taken and sent to the AI correctly, I decided to increase the screenshot size it received from 720p to 1080p.

Attempt #2:

The bot would (try to) approach the Creeper, indicating that it could finally see and recognize it. However, it either did not approach close enough to successfully attack or would get lost and begin behaving like in the first attempt.

I concluded that it had difficulty judging distance and understanding the task.

Attempt #3:

I added
"The cross-like cursor in the middle shows what you are currently facing and will interact with. You must be very, very close to objects or mobs to interact with them."
and
"Remember to make sure you are not too far away from objects."
to the system message to help Brain understand that it may be too far.

I also introduced Komandor (commander) - GPT-3.5-turbo, which would expand the initial command to include details on how to perform the task. Its SYSTEM message was:

"Expand given input by adding more details, such as how to tell when the objective will be accomplished. The context is Minecraft. Keep your answer short, only slightly expand the original input."

After starting the program, not much changed in its behavior. In fact, it displayed even more instances of the random behavior observed in the first attempt. I concluded that Komandor's comments were confusing it, and Brain's SYSTEM message needed further updates.

Also, I noticed a typo in "keep" ("kepp") when writing this now, but it functioned well nonetheless.

Attempt #4:

I deactivated Komandor by commenting out the part of the code that added its text to the objective input and further edited Brain's SYSTEM message.

The AI needed context to work; the next tokens (words it generated) depended on all previous tokens in the entire conversation. So I allowed it to provide context and added:

"Write your own detailed thoughts and comments too, not only movement instructions."

Since making this addition, the AI began writing descriptions of what it saw and intended to do. Random behavior, like in the first attempt, did not occur again. However, the AI still remained too far away to successfully attack the target.

Attempt #5 - final:

I further upgraded Brain's SYSTEM message, finally reaching this one:

Do your best to accomplish the objective. When you accomplish the full objective and verify that, write DONE.
Remember to make sure you are not too far away from objects; ensure you move very close to targets. Being too far away is most likely the reason nothing changes. Objects are farther away than you think.
Try different methods if your current one does not work.
Always make sure you are correctly positioned. Always get as close to targets as you can.
The cross-like cursor in the middle shows what you are currently facing and will interact with. You must be very, very close to objects or mobs to interact with them.
Write your own detailed thoughts and comments too, not just movement instructions.

The AI indeed attempted different solutions when previous attempts were unsuccessful. It even tried the block destruction command, which did nothing because R was never implemented, and it seemingly recognized that.

After several more failed attempts to attack the Creeper, it finally moved close enough to do so and successfully attacked.

Success! The bot managed to defeat the Creeper. It went even further; instead of writing DONE, it wanted to collect what the Creeper dropped. I decided to leave that issue for the next experiment to fix.

Interestingly, it struck the Creeper exactly the number of times needed to defeat it. It recognized that we were using a diamond sword. Did it know how many attacks were required to defeat the Creeper with a diamond sword? Similarly, in previous steps when it was too far for the attack to be effective, the AI would attack three times.
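For reference, the arithmetic behind that question is short. The numbers below are assumptions taken from general Minecraft knowledge rather than measured in this experiment: a Creeper has 20 health points, and an unenchanted diamond sword deals 8 damage per hit in Bedrock Edition (7 in Java Edition), ignoring critical hits.

```python
import math


def hits_to_defeat(health: int, damage_per_hit: int) -> int:
    """Number of hits needed to bring health to zero or below."""
    return math.ceil(health / damage_per_hit)


CREEPER_HEALTH = 20  # assumed value, not taken from the experiment

print(hits_to_defeat(CREEPER_HEALTH, 8))  # assumed Bedrock Edition damage
print(hits_to_defeat(CREEPER_HEALTH, 7))  # assumed Java Edition damage
```

Under these assumptions, both editions need the same number of hits, so the edition would not change the answer. Whether the model actually reasoned this way remains an open question.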

Video of the final attempt: https://youtu.be/HsWzNVjqzZQ


What next?
I will try to improve the AI by:
- Fine-tuning* the ways of writing descriptions of the current situation and movement sequences.
- Better SYSTEM message.
- Finding a way to allow it to judge distance accurately.
- Better comparison between previous and current states.
- Improving task memory and understanding of tasks.
- Providing additional context regarding what to do during the program's execution; so far, each loop it would just receive the command "Do something."
- Experimenting with TEMPERATURE and other advanced model parameters. So far, all settings have been left at their defaults.
- Mixing AI models. I noticed that this yields surprisingly good results. AIs tend to get stuck on their own ideas, but switching the entire conversation to a different AI model often brings new insights. I experienced this when asking an AI to fix code: one model on its own would keep failing and eventually revert to the initial (and still broken) state, getting trapped in a loop of failures. However, switching the model and presenting it with the entire conversation often allowed the new model to resolve the issue. Dynamic (even random) switching between GPT-4o and GPT-4o-mini (since other models cannot analyze images) should produce better results and behavior. We'll see.
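The random model switching idea from the last point could look roughly like this. It is only a sketch: `pick_model` and `switch_probability` are hypothetical names, and how the chosen model would be handed to the agent classes, together with the conversation history, is left open because those classes are not shown.

```python
import random
from typing import Optional

VISION_MODELS = ["gpt-4o", "gpt-4o-mini"]


def pick_model(
    rng: random.Random,
    previous: Optional[str] = None,
    switch_probability: float = 0.5,  # hypothetical tuning knob
) -> str:
    """Keep the previous model or switch to another one at random."""
    if previous is None:
        return rng.choice(VISION_MODELS)
    if rng.random() < switch_probability:
        others = [m for m in VISION_MODELS if m != previous]
        return rng.choice(others)
    return previous


rng = random.Random(0)  # seeded only to make the demo repeatable
model = None
for step in range(5):
    model = pick_model(rng, model)
    print(step, model)
```

A `switch_probability` of 1.0 alternates strictly between the two models, while 0.0 never switches, so one parameter covers both extremes.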

Once this works well, I will implement actions for placing and removing blocks. Currently, its view is fixed in one position. Then we will be able to present it with new challenges!

* Fine-tuning is a process of adapting an AI model to specific tasks and data. In simpler terms, you can consider it further teaching the AI specific skills, knowledge, and behaviors.

Full final code (excluding the API classes themselves):


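klucz = 'api-key'  # OpenAI API key (placeholder)
import pyautogui
from WojAPIOpenAI import *  # my own classes: ChatGPTAgent, ImgChatGPTAgent
from PIL import Image  # Image.LANCZOS is used in downscale_to_1080p below
import time
import pyttsx3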

def speak(text):
    engine = pyttsx3.init()
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)
    engine.say(text)
    engine.runAndWait()
    del engine

def press(key: str, blocks=1):
    pyautogui.keyDown(key)
    time.sleep(0.15 * blocks)
    pyautogui.keyUp(key)

def downscale_to_1080p(image):
    width, height = image.size
    new_width = 1920
    new_height = int(new_width * height / width)
    resized_image = image.resize((new_width, new_height), resample=Image.LANCZOS)
    return resized_image

def ekran():
    screen_width, screen_height = pyautogui.size()
    screenshot = pyautogui.screenshot()
    first_monitor_screenshot = screenshot.crop((0, 0, screen_width, screen_height))
    return downscale_to_1080p(first_monitor_screenshot)

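def place():
    # Not implemented: placing blocks was unnecessary for this experiment.
    pass

def destroy():
    # Not implemented: destroying blocks was unnecessary for this experiment.
    pass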

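# Komandor is deactivated (see Attempt #4): the agent is created, but its output is never added to the objective.
komandor = ChatGPTAgent(klucz, 'gpt-3.5-turbo', 500, "Expand given input by adding more details, such as how to tell when the objective will be accomplished. The context is Minecraft. Keep your answer short, only slightly expand the original input.")
# The prompt below is Polish for "Enter the task [ENG]".
zadanie = "\n" + str(input('Wpisz zadanie [ENG]: '))
del komandor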

print("Current objective: " + zadanie)
speak("Current objective: " + zadanie)

print("Starting in 10 seconds...")
speak("Starting in 10 seconds...")
time.sleep(10)

rola1 = '''
You will be playing Minecraft. You will receive images of what your character sees now. You can control your character by writing commands. Separate them with |.
Available instructions are:
W - move forward 1 block
S - move backwards 1 block
A - strafe left 1 block
D - strafe right 1 block
P - place a brown dirt block in front of you
R - destroy block in front of you
F - punch to attack/fight

Do your best to accomplish the objective. When you accomplish the full objective and verify that, write DONE.
Remember to make sure you are not too far away from objects; ensure you move very close to targets. Being too far away is most likely the reason nothing changes. Objects are farther away than you think.
Try different methods if your current one does not work.
Always make sure you are correctly positioned. Always get as close to targets as you can.
The cross-like cursor in the middle shows what you are currently facing and will interact with. You must be very, very close to objects or mobs to interact with them.
Write your own detailed thoughts and comments too, not just movement instructions.
Current objective:
'''

rola2 = '''
Extract movement instructions from given text.
Available instructions are:
W - move forward 1 block
S - move backwards 1 block
A - strafe left 1 block
D - strafe right 1 block
P - place block in front of you
R - destroy block in front of you
F - punch to attack/fight

Separate each instruction with |
Example output:
W|W|A|P

If the text is 'DONE', add instruction 'DONE' too. If the text indicates that the objective was accomplished, also add instruction 'DONE' too.
'''

mozg = ImgChatGPTAgent(klucz, 'gpt-4o', 1000, rola1 + zadanie)
ruch = ChatGPTAgent(klucz, 'gpt-3.5-turbo', 1000, rola2)

lock = True
while lock:
    time.sleep(1)
    obraz = ekran()
    tekst1 = mozg.GetResponseWithImg(obraz, "Do something")
    del obraz
    print("Original: " + tekst1)
    speak(tekst1)
    print(" ")
    tekst2 = ruch.GetResponse(tekst1)
    print("Movement: " + tekst2)
    print(" ")
    ruchy = tekst2.split("|")
    for i in ruchy:
        time.sleep(1)
        t = i.lower().strip()
        if t in ("w", "a", "s", "d"):
            press(t, 1)  # hold the movement key for one block
        elif t == "p":
            place()
        elif t == "r":
            destroy()
        elif t == "f":
            pyautogui.click()  # left mouse click = attack
        elif t == "done":
            lock = False
            break  # skip any instructions listed after DONE

print("PROG_DONE")

Lastly, I apologize for the delay in publishing this article since the day of the experiment, and I regret that the code is not very well organized or readable. This is my first post of this kind, and I made the decision to publish it suddenly; I wasn't writing the code with publication in mind. However, I thought it was better to share the original code I used, despite its imperfections.

Thank you for reading and stay tuned for more!
