Using an agent-based approach to make ChatGPT capable of video analysis - part 1 - proof of concept

Hello everyone!

Today, we will create a way to let ChatGPT watch videos.

But first, I will describe my approach to this problem.

Models like GPT-4o and GPT-4o-mini are able to analyze static images. My approach to letting the AI analyze movies looks like this:

- The video will be split into individual frames at a specific interval.

- A certain number of frames will form a scene (for example, 8 frames taken every 4 seconds make a 32-second-long scene in total). Each scene will then be assembled into a single image containing all of its frames, creating a sort of "comic".

- If set to do so, the audio fragment for each scene will be processed through speech recognition to get a transcript of the dialogue. As I am writing this, OpenAI has released the gpt-4o-audio-preview model, which they claim can analyze audio; but since I haven't used it yet, let's first extract speech with a classic speech recognition library (a rough sketch of this planned step is shown after this list). In this part I am not going to add audio support yet, though.

- Once a "comic" is generated and any speech from the scene extracted, they will be provided to an AI agent to write a detailed description of the scene/fragment. GPT-4o-mini can be used here for better speed and lower costs of operation, but we could use GPT-4o for much more detailed descriptions and image analysis.

- Finally, once descriptions of all the scenes/fragments (and therefore of the whole video) are ready, they will be used as context by the master AI, which will be interacting with the user. Since that instance will need to process a large context, GPT-4o or o1-mini will be the best in that role. 
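
The audio step above is not implemented in this part, but to give an idea of what it could look like, here is a rough sketch that cuts a scene's audio out with MoviePy and runs it through the SpeechRecognition package (the function name, file names and timings are just placeholders, and I am assuming MoviePy's subclip and the free Google recognizer here):

import speech_recognition as sr
from moviepy.video.io.VideoFileClip import VideoFileClip

def transcribe_scene(video_path, start_s, end_s, wav_path="scene_audio.wav"):
    # Cut the scene out of the video and save its audio track to a WAV file.
    with VideoFileClip(video_path) as clip:
        if clip.audio is None:
            return ""  # the video has no audio track
        clip.subclip(start_s, end_s).audio.write_audiofile(wav_path)
    # Run classic speech recognition on the extracted audio.
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""  # no recognizable speech in this scene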

I am going to make the whole code a Python class, so it will be easy to import and use in any other projects.

Let's begin.

I am going to initialize the class with the OpenAI API key, the path to the video file, the time between frames, the number of frames per scene/fragment, a boolean determining whether to process audio, and the frame height (used to downscale the video frames, since processing bigger images costs more and takes more time).

The initializer method of the class now looks like this:

class Video_ChatGPT:
    def __init__(self, openai_api_key:str, video_path: str, time_between_frames_seconds: float,
                 frames_per_scene: int, process_audio=True, frame_height = 480):
        self.video_path = video_path
        self.time_between_frames_seconds = time_between_frames_seconds
        self.api_key = openai_api_key
        self.frames_per_scene = frames_per_scene
        self.process_audio = process_audio
        self.frame_height = frame_height

First, we need to load the video and split it into frames at the given interval. For that, we will implement a class method based on MoviePy, which will return a list containing all the frames as PIL Image objects. The function will also downscale the images while keeping the original aspect ratio.

    def get_video_images(self, video_path, seconds_between_frames=1.0, desired_height=480):
        print("Extracting frames from the video "+video_path)
        images = []
        try:
            with VideoFileClip(video_path) as clip:
                current_time = 0
                current_step = 0
                steps = int(int(clip.duration)/seconds_between_frames)
                while current_time < clip.duration:
                    image = Image.fromarray(clip.get_frame(current_time))
                    x,y = image.size
                    # Scale the frame down to the desired height while keeping the aspect ratio.
                    divider = desired_height/y
                    new_width = x*divider
                    image = image.convert('RGB')
                    image = image.resize((int(new_width),desired_height), resample=Image.LANCZOS)
                    images.append(image)
                    print("Extracted frame "+str(current_step)+" out of "+str(steps))
                    current_step += 1
                    current_time += seconds_between_frames
            print("Frames extracted.")
            return images
        except Exception as e:
            print(f"An error occurred when extracting video frames: {e}")
            return images

To test it, at the end of the script I called this function and then used a simple loop to save all the images from the list to files. Let's try it on "Steamboat Willie", which entered the public domain this year.
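
That test code is not part of the class itself; a minimal sketch of it could look like this (the file name and the output names are just examples, and no API calls are made at this stage, so the key can stay empty):

if __name__ == "__main__":
    agent = Video_ChatGPT(openai_api_key="", video_path="film.webm",
                          time_between_frames_seconds=5, frames_per_scene=8,
                          process_audio=False, frame_height=300)
    frames = agent.get_video_images(agent.video_path, agent.time_between_frames_seconds, agent.frame_height)
    # Save every extracted frame to disk so the result can be inspected.
    for i, frame in enumerate(frames):
        frame.save("frame_" + str(i) + ".jpg")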

After about a minute, the script had successfully extracted the images from the movie (one frame every 5 seconds).

Now we must combine the frames into comic-like pages for the AI to analyze. I wrote the following function to do this:

    def create_comics(self, images, amount_per_page: int):
        list_l = len(images)
        comics = []
        if list_l>=1:
            x,y = images[0].size
            n=0
            paste_y = 0
            canvas = Image.new('RGB', (x, y*amount_per_page), color='white')
            while n<list_l:
                if n%amount_per_page==0 and n>0:
                    # The current page is full: store it and start a new one.
                    comics.append(canvas)
                    print("Created image page "+str(int(n/amount_per_page))+" out of "+str(int(list_l/amount_per_page)))
                    paste_y = 0
                    canvas = Image.new('RGB', (x, y*amount_per_page), color='white')
                # Paste the current frame below the previous one on the current page.
                canvas.paste(images[n],(0,paste_y))
                n+=1
                paste_y+=y
            # Store the last page as well, even if it is not completely filled.
            comics.append(canvas)
        return comics

This function turns the images from the previous list into comic-like pages. Here is an example of what such a page looks like:


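To look at the generated pages yourself, a small addition to the earlier test snippet (again just a sketch with example names) could be:

comics = agent.create_comics(frames, agent.frames_per_scene)
# Save each comic-style page to disk, one file per scene.
for i, page in enumerate(comics):
    page.save("scene_" + str(i) + ".jpg")
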
Now we have the bare minimum to test my approach. Let's modify the class to use these functions and add the AI agents using my own OpenAI API interface. As I described earlier, each 'comic' will now be sent to the AI to be described.

Finally, I ended up with the following code:

from moviepy.video.io.VideoFileClip import VideoFileClip
from PIL import Image
import os
import easyapiopenai
from WojSafeAPI import * #My own library for loading API keys safely, you cannot use it.

class Video_ChatGPT:
    def __init__(self,openai_api_key:str, video_path: str, time_between_frames_seconds: float,
                 frames_per_scene: int, process_audio=True, frame_height = 480,
                 watcher_model='gpt-4o-mini',master_model='gpt-4o',
                 watcher_token_limit = 500, master_token_limit=15*1000, add_descriptions_to_system=False,
                 add_to_watchers_system=""):
        
        self.video_path = video_path
        self.time_between_frames_seconds = time_between_frames_seconds
        self.api_key = openai_api_key
        self.frames_per_scene = frames_per_scene
        self.process_audio = process_audio
        self.frame_height = frame_height
        self.add_descriptions_to_system = add_descriptions_to_system
        
        temp1 = self.get_video_images(self.video_path, self.time_between_frames_seconds, self.frame_height)
        self.video_comics = self.create_comics(temp1, self.frames_per_scene)
        del temp1 #We can delete the list with all frames since only the one with the 'comics' will be used.
        
        watcher_system = "You will receive a fragment of a video in the form of a comic with "+str(self.frames_per_scene)+" frames. Write a description of the fragment as a whole. Never mention it being a comic."+add_to_watchers_system
        self.frame_watcher_ai = easyapiopenai.ImgChatGPTAgent(self.api_key,watcher_model,watcher_token_limit,watcher_system)
        
        self.video_description = self.describe_video()
        #del self.video_comics
        
        master_system = "You are a helpful assistant capable of video analysis. Do not speak about the video unless specifically asked about it."
        if self.add_descriptions_to_system==True:
            master_system = master_system + "\n\n[Description of the video scene by scene:\n" + self.video_description+"]"
        
        self.master_ai = easyapiopenai.ChatGPTAgent(self.api_key,master_model,master_token_limit,master_system)
        
    def GetResponseFromMaster(self, prompt):
        final_prompt = prompt
        if self.add_descriptions_to_system == False:
            final_prompt = final_prompt + "\n\nDescription of the video:\n\n"+self.video_description
        return self.master_ai.GetResponse(final_prompt)
        
    def describe_video(self):
        description = ""
        n=1
        print("Describing each scene...")
        for i in self.video_comics:
            try:
                #i.show()
                text = self.frame_watcher_ai.GetResponseWithImg(i,"Describe this fragment of the video based on the comic.").replace("\n"," ")
                description = description + "Scene number "+str(n)+":\n"+text+"\n\n"
                print("Scene "+str(n)+" "+text)
                print(" ")
                n+=1
            except Exception as e:
                print(f"An error occurred describing the fragment: {e}")
                self.frame_watcher_ai.ClearHistory()
        print("Done describing.")
        return description
        
    def get_video_images(self, video_path, seconds_between_frames=1.0, desired_height=480):
        print("Extracting frames from the video "+video_path)
        images = []
        try:
            with VideoFileClip(video_path) as clip:
                current_time = 0
                current_step = 0
                steps = int(int(clip.duration)/seconds_between_frames)
                while current_time < clip.duration:
                    image = Image.fromarray(clip.get_frame(current_time))
                    x,y = image.size
                    # Scale the frame down to the desired height while keeping the aspect ratio.
                    divider = desired_height/y
                    new_width = x*divider
                    image = image.convert('RGB')
                    image = image.resize((int(new_width),desired_height), resample=Image.LANCZOS)
                    images.append(image)
                    print("Extracted frame "+str(current_step)+" out of "+str(steps))
                    current_step += 1
                    current_time += seconds_between_frames
            print("Frames extracted.")
            return images
        except Exception as e:
            print(f"An error occurred when extracting video frames: {e}")
            return images

    def create_comics(self, images, amount_per_page: int):
        list_l = len(images)
        comics = []
        if list_l>=1:
            x,y = images[0].size
            n=0
            paste_y = 0
            canvas = Image.new('RGB', (x, y*amount_per_page), color='white')
            while n<list_l:
                if n%amount_per_page==0 and n>0:
                    # The current page is full: store it and start a new one.
                    comics.append(canvas)
                    print("Created image page "+str(int(n/amount_per_page))+" out of "+str(int(list_l/amount_per_page)))
                    paste_y = 0
                    canvas = Image.new('RGB', (x, y*amount_per_page), color='white')
                # Paste the current frame below the previous one on the current page.
                canvas.paste(images[n],(0,paste_y))
                n+=1
                paste_y+=y
            # Store the last page as well, even if it is not completely filled.
            comics.append(canvas)
        return comics
                

if __name__ == "__main__":
    agent = Video_ChatGPT(openai_api_key=YourAPIKeyHere('openai'),video_path="film.webm",time_between_frames_seconds=5,frames_per_scene=5,process_audio=False,frame_height=300)
    while True:
        print(agent.GetResponseFromMaster(input("Type: ")))

The __init__ method splits the video into the 'comics' and then uses describe_video() to produce a description of each fragment, all of which are collected into one string. That string is then added either to the user's message or to the master AI's system prompt.

A description of a single scene looks, for example, like this:

"Scene 1: In this animated fragment, the scene unfolds in a rustic kitchen or barn setting highlighted by a large wooden box labeled "POTATO BIN," overflowing with potatoes.   The first panel features Mickey Mouse, with a cheerful and spirited demeanor, standing next to the potato bin. He is likely engaged in a lighthearted task, possibly preparing to grab a potato. His posture is energetic, hinting at a playful adventure ahead.  In the second panel, Mickey, still focused on the task at hand, raises his arms in excitement, perhaps in anticipation of discovering something amusing in the bin. The background maintains a cluttered but charming vibe that complements the playful atmosphere.  The third panel introduces a mischievous bird character, likely a parrot, peering in through a window. The bird’s exaggerated expression, with wide eyes and an open beak, suggests it’s ready to join in the fun or perhaps cause some trouble, adding an element of surprise to the scene.  The final panel returns to Mickey, now holding a significant object (possibly a potato) and sporting a big grin. The bird is still visible in the foreground, enhancing the whimsical nature of the narrative.   Overall, the fragment captures a delightful mix of humor, anticipation, and character interactions, characteristic of classic animation, showcasing a charming moment filled with playful antics."

Once the processing is done, the master AI can use the data as context and answer queries about the video.

What's next?

I think I will end this post here and write follow-up posts where I will keep improving it. For now, it serves as a proof of concept.

First, to understand videos it also needs audio data, especially dialogue. The code is currently not very efficient and the descriptions are huge. I also need to make it more modular than it currently is.

Currently, the descriptions lack specific details. As a result, the AI will likely not be able to answer specific questions well.

However, I have ideas how to solve that.

Instead of processing the whole video at the beginning, we can wait until the user actually asks a question about the video.

First, we can pass the user's input to a minimalistic AI agent (like 4o-mini or 3.5-turbo) which will determine whether the prompt asks about something related to the video.

If not, the question will simply be passed to the master AI without any video analysis.

But if it does, the AI agent could determine what to focus on (for example a character's clothing, or who wins a battle; whatever the user asked about) and then start the video analysis.

Knowing what aspect to focus on or look for, the AI agent responsible for analyzing the video can look at the frames in that context. If it sees that a fragment of the video does not contain any information related to the prompt, it could simply answer with, for example, NO_INFO_HERE.

That approach would shorten the video descriptions to only what's necessary, and the description would be tailored to what is needed to answer the user's prompt. Fewer tokens (words) would be generated by the AI too, resulting in faster execution and lower computation cost. Lastly, such an approach would make it easy to change the analyzed video during the chat, without needing to reinitialize the whole class and therefore reset the whole conversation.
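
To give a rough idea of that routing step, here is a sketch of how it could look using the same easyapiopenai helpers as above (the router prompt, the YES/NO convention and the focused describe_video call are only illustrative, not the final part 2 design):

    # In __init__, a small and cheap router agent could be created, for example:
    # router_system = "Decide whether the user's message requires looking at the video. Answer only YES or NO."
    # self.router_ai = easyapiopenai.ChatGPTAgent(self.api_key, 'gpt-4o-mini', 10, router_system)

    def GetResponseFromMaster(self, prompt):
        # Ask the router agent whether the prompt is about the video at all.
        if "YES" in self.router_ai.GetResponse(prompt).upper():
            # Only now analyse the video, focused on what the user asked about
            # (describe_video would take that focus as an argument in part 2).
            prompt = prompt + "\n\nRelevant video information:\n\n" + self.describe_video()
        return self.master_ai.GetResponse(prompt)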

But I will implement that in part 2, so stay tuned!

This Christmas, I also hope to find time to write about the things I mentioned earlier, like the continuation of my ISKRA project experiments or more tutorials.

Thank you for reading this post!
