Using an agent-based approach to make ChatGPT capable of video analysis - part 1 - proof of concept
Hello everyone!
Today, we will create a way to let ChatGPT watch videos.
But first, I will describe my approach to this problem.
Models like GPT-4o and GPT-4o-mini are able to analyze static images. My approach to letting the AI analyze movies looks like this:
- The video will be split into individual frames at a specific interval.
- Groups of consecutive frames will become scenes (for example, 8 frames taken every 4 seconds make a 32-second-long scene). The scenes will then be assembled into single images containing all of their frames, creating a sort of "comic".
- If set to do so, the audio fragment of each scene will be processed through speech recognition to get a transcript of the dialogue. As I am writing this, OpenAI released a model gpt-4o-audio-preview, which they claim can analyze audio; but since I haven't used it yet, let's first extract speech with a classic speech recognition library. I am not going to add audio support in this part yet, though (a rough sketch of this step is shown right after this list).
- Once a "comic" is generated and any speech from the scene extracted, they will be provided to an AI agent that writes a detailed description of the scene/fragment. GPT-4o-mini can be used here for better speed and lower operating costs, but we could use GPT-4o for more detailed descriptions and image analysis.
- Finally, once the descriptions of all the scenes/fragments (and therefore of the whole video) are ready, they will be used as context by the master AI, which will interact with the user. Since that instance will need to process a large context, GPT-4o or o1-mini will be the best fit for that role.
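Since the audio step is skipped in this part, here is only a rough sketch of what it could look like later, assuming MoviePy for cutting out each scene's audio and the classic SpeechRecognition library for the transcript; none of this is in the class yet, and the function and file names are just placeholders.

from moviepy.video.io.VideoFileClip import VideoFileClip
import speech_recognition as sr

# Rough sketch only (audio is not implemented in this part): transcribe one scene's audio.
# scene_start and scene_end would come from the frame interval and frames_per_scene.
def transcribe_scene(video_path, scene_start, scene_end):
    with VideoFileClip(video_path) as clip:
        # Cut out the scene's audio and write it to a temporary WAV file.
        clip.audio.subclip(scene_start, scene_end).write_audiofile("scene_audio.wav")
    recognizer = sr.Recognizer()
    with sr.AudioFile("scene_audio.wav") as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""  # no recognizable speech in this scene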
I am going to make the whole code a Python class, so it will be easy to import and use in other projects.
Let's begin.
I am going to initialize the class with the OpenAI API key, the path to the video file, the time between frames, the number of frames per scene/fragment, a boolean variable to determine whether to process audio, and a frame height (to downscale the video frames; processing bigger images costs more and takes more time).
The initializer method of the class now looks like this:
class Video_ChatGPT:
    def __init__(self, openai_api_key: str, video_path: str, time_between_frames_seconds: float,
                 frames_per_scene: int, process_audio=True, frame_height=480):
        self.video_path = video_path
        self.time_between_frames_seconds = time_between_frames_seconds
        self.api_key = openai_api_key
        self.frames_per_scene = frames_per_scene
        self.process_audio = process_audio
        self.frame_height = frame_height
First, we need to load the video and split it into frames at the given interval. For that, we will implement a class method based on MoviePy, which will return a list containing all the frames as PIL Image objects. The function will also downscale the images, keeping the original aspect ratio.
    def get_video_images(self, video_path, seconds_between_frames=1.0, desired_height=480):
        print("Extracting frames from the video "+video_path)
        images = []
        try:
            with VideoFileClip(video_path) as clip:
                current_time = 0
                current_step = 0
                steps = int(int(clip.duration)/seconds_between_frames)
                while current_time < clip.duration:
                    image = Image.fromarray(clip.get_frame(current_time))
                    # Downscale the frame while keeping the original aspect ratio.
                    x, y = image.size
                    divider = desired_height/y
                    new_width = x*divider
                    image = image.convert('RGB')
                    image = image.resize((int(new_width), desired_height), resample=Image.LANCZOS)
                    images.append(image)
                    print("Extracted frame "+str(current_step)+" out of "+str(steps))
                    current_step += 1
                    current_time += seconds_between_frames
            print("Frames extracted.")
            return images
        except Exception as e:
            print(f"An error occurred when extracting video frames: {e}")
            return images
In order to test it, at the end of the script I called this function and then used a simple loop to save all the images from the list to files. Let's try it on "Steamboat Willie", which entered the public domain this year.
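The test looked roughly like this (the file name and API key are placeholders; at this point the class only stores its parameters, so we can call the method directly):

# Rough sketch of the quick test; file and key values are placeholders.
test = Video_ChatGPT(openai_api_key="sk-...", video_path="steamboat_willie.mp4",
                     time_between_frames_seconds=5, frames_per_scene=8,
                     process_audio=False, frame_height=480)
frames = test.get_video_images(test.video_path, test.time_between_frames_seconds, test.frame_height)
for number, frame in enumerate(frames):
    frame.save("frame_"+str(number)+".jpg")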
After about a minute, the script had successfully extracted the images from the movie (one frame every 5 seconds).
Now we must combine the frames into comic-like pages for the AI to analyze. I wrote the following function to do this:
    def create_comics(self, images, amount_per_page: int):
        list_l = len(images)
        comics = []
        if list_l >= 1:
            x, y = images[0].size
            n = 0
            paste_y = 0
            canvas = Image.new('RGB', (x, y*amount_per_page), color='white')
            while n < list_l:
                if n % amount_per_page == 0 and n > 0:
                    # The current page is full: store it and start a new one.
                    comics.append(canvas)
                    print("Created image page "+str(int(n/amount_per_page))+" out of "+str(int(list_l/amount_per_page)))
                    paste_y = 0
                    canvas = Image.new('RGB', (x, y*amount_per_page), color='white')
                    canvas.paste(images[n], (0, paste_y))
                    n += 1
                    paste_y += y
                else:
                    canvas.paste(images[n], (0, paste_y))
                    n += 1
                    paste_y += y
            # Don't forget the last (possibly not full) page.
            comics.append(canvas)
        return comics
This function turns the images from the previous list into comic-like pages. Here is an example of what such a page looks like:
Now we have the bare minimum to test my approach. Let's modify the class to use these functions and add the AI agents using my own OpenAI API interface. As I described earlier, I will send each 'comic' to an AI agent to describe it.
from moviepy.video.io.VideoFileClip import VideoFileClip
from PIL import Image
import os
import easyapiopenai
from WojSafeAPI import * #My own library for loading API keys safely, you cannot use it.
class Video_ChatGPT:
    def __init__(self, openai_api_key: str, video_path: str, time_between_frames_seconds: float,
                 frames_per_scene: int, process_audio=True, frame_height=480,
                 watcher_model='gpt-4o-mini', master_model='gpt-4o',
                 watcher_token_limit=500, master_token_limit=15*1000, add_descriptions_to_system=False,
                 add_to_watchers_system=""):
        self.video_path = video_path
        self.time_between_frames_seconds = time_between_frames_seconds
        self.api_key = openai_api_key
        self.frames_per_scene = frames_per_scene
        self.process_audio = process_audio
        self.frame_height = frame_height
        self.add_descriptions_to_system = add_descriptions_to_system
        temp1 = self.get_video_images(self.video_path, self.time_between_frames_seconds, self.frame_height)
        self.video_comics = self.create_comics(temp1, self.frames_per_scene)
        del temp1  # We can delete the list with all frames since only the one with the 'comics' will be used.
        watcher_system = "You will receive a fragment of a video in the form of a "+str(self.frames_per_scene)+"-frame comic. Write a description of the fragment as a whole. Never mention it being a comic."+add_to_watchers_system
        self.frame_watcher_ai = easyapiopenai.ImgChatGPTAgent(self.api_key, watcher_model, watcher_token_limit, watcher_system)
        self.video_description = self.describe_video()
        #del self.video_comics
        master_system = "You are a helpful assistant capable of video analysis. Do not speak about the video unless specifically asked about it."
        if self.add_descriptions_to_system == True:
            master_system = master_system + "\n\n[Description of the video scene by scene:\n" + self.video_description + "]"
        self.master_ai = easyapiopenai.ChatGPTAgent(self.api_key, master_model, master_token_limit, master_system)
    def GetResponseFromMaster(self, prompt):
        final_prompt = prompt
        if self.add_descriptions_to_system == False:
            final_prompt = final_prompt + "\n\nDescription of the video:\n\n" + self.video_description
        return self.master_ai.GetResponse(final_prompt)
    def describe_video(self):
        description = ""
        n = 1
        print("Describing each scene...")
        for i in self.video_comics:
            try:
                #i.show()
                text = self.frame_watcher_ai.GetResponseWithImg(i, "Describe this fragment of the video based on the comic.").replace("\n", " ")
                description = description + "Scene number "+str(n)+":\n"+text+"\n\n"
                print("Scene "+str(n)+" "+text)
                print(" ")
                n += 1
            except Exception as e:
                print(f"An error occurred describing the fragment: {e}")
            # Clear the watcher's chat history so each scene is described independently.
            self.frame_watcher_ai.ClearHistory()
        print("Done describing.")
        return description
    def get_video_images(self, video_path, seconds_between_frames=1.0, desired_height=480):
        print("Extracting frames from the video "+video_path)
        images = []
        try:
            with VideoFileClip(video_path) as clip:
                current_time = 0
                current_step = 0
                steps = int(int(clip.duration)/seconds_between_frames)
                while current_time < clip.duration:
                    image = Image.fromarray(clip.get_frame(current_time))
                    # Downscale the frame while keeping the original aspect ratio.
                    x, y = image.size
                    divider = desired_height/y
                    new_width = x*divider
                    image = image.convert('RGB')
                    image = image.resize((int(new_width), desired_height), resample=Image.LANCZOS)
                    images.append(image)
                    print("Extracted frame "+str(current_step)+" out of "+str(steps))
                    current_step += 1
                    current_time += seconds_between_frames
            print("Frames extracted.")
            return images
        except Exception as e:
            print(f"An error occurred when extracting video frames: {e}")
            return images
    def create_comics(self, images, amount_per_page: int):
        list_l = len(images)
        comics = []
        if list_l >= 1:
            x, y = images[0].size
            n = 0
            paste_y = 0
            canvas = Image.new('RGB', (x, y*amount_per_page), color='white')
            while n < list_l:
                if n % amount_per_page == 0 and n > 0:
                    # The current page is full: store it and start a new one.
                    comics.append(canvas)
                    print("Created image page "+str(int(n/amount_per_page))+" out of "+str(int(list_l/amount_per_page)))
                    paste_y = 0
                    canvas = Image.new('RGB', (x, y*amount_per_page), color='white')
                    canvas.paste(images[n], (0, paste_y))
                    n += 1
                    paste_y += y
                else:
                    canvas.paste(images[n], (0, paste_y))
                    n += 1
                    paste_y += y
            # Don't forget the last (possibly not full) page.
            comics.append(canvas)
        return comics
if __name__ == "__main__":
    agent = Video_ChatGPT(openai_api_key=YourAPIKeyHere('openai'), video_path="film.webm",
                          time_between_frames_seconds=5, frames_per_scene=5,
                          process_audio=False, frame_height=300)
    while True:
        print(agent.GetResponseFromMaster(input("Type: ")))
Once the processing is done, the master AI can use the data as context and answer queries about the video.
What's next?
I think I will end this post here and write follow-up posts where I will keep improving it. For now, this is a proof of concept.
First, to understand videos it also needs audio data, especially dialogue. The code is currently not very efficient and the descriptions are huge. I also need to make it more modular than it currently is.
Currently, the descriptions lack specific details. As a result, the AI will likely not be able to answer specific questions well.
However, I have ideas on how to solve that.
Instead of processing the whole video up front, we can wait until the user asks a question about the video.
First, we can pass the user's input to a minimalistic AI agent (like 4o-mini or 3.5-turbo) which will determine whether the prompt asks about something related to the video.
If not, the question will simply be passed to the master AI without any video analysis.
But if it does, the AI agent could determine what to focus on (for example a character's clothing, or who wins a battle; whatever the user asked about) and then start the video analysis.
Knowing what aspect to focus on or look for, the AI agent responsible for analyzing the video can look at the frames in that context. If it sees that a fragment of the video does not contain any information related to the prompt, it could simply answer with, for example, NO_INFO_HERE.
That approach would shorten the video descriptions to only what's necessary, and the description would be better tailored to what's needed to answer the user's prompt. Fewer tokens (words) would be generated by the AI too, resulting in faster execution and a lower computation price. Lastly, such an approach would make it easy to change the analyzed video during the chat, without needing to reinitialize the whole class and therefore reset the whole conversation.
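As a rough, hypothetical sketch of that routing step (reusing the ChatGPTAgent wrapper from my easyapiopenai library shown above; the prompt, reply format, and token limit are placeholders, and the actual part 2 implementation may look different):

import easyapiopenai

# Hypothetical sketch of the part-2 routing idea; prompt and reply format are placeholders.
router_system = ("You decide whether the user's message requires looking at the video. "
                 "Reply with 'VIDEO: <aspect to focus on>' if it does, or 'NO_VIDEO' if it does not.")

def route_prompt(openai_api_key, user_prompt):
    router_ai = easyapiopenai.ChatGPTAgent(openai_api_key, 'gpt-4o-mini', 100, router_system)
    decision = router_ai.GetResponse(user_prompt)
    if decision.strip().startswith("VIDEO:"):
        focus = decision.split("VIDEO:", 1)[1].strip()
        return True, focus   # analyze the video with this focus before answering
    return False, None       # pass the prompt straight to the master AI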
But I will implement that in part 2, so stay tuned!
This Christmas, I also hope to find time to write about the things I mentioned earlier that I want to cover in more depth, like the continuation of my ISKRA project experiments or more tutorials.
Thank you for reading this post!