Creating an Automated Workflow for Clip Farming
/ 3 min read
Introduction
I kept encountering Minecraft clips while scrolling through YouTube Shorts, from content creators such as DrDonut, JudeLow, and some clips from the UnstableSMP.
I personally never watch streamers while they're live, so I could never create YouTube Shorts clips consistently, but I always thought the whole process could be automated, from searching for clip-worthy moments to editing them. With the recent wave of AI models and my time all freed up, I could finally put this idea to use and maybe learn how to integrate AI into my workflows.
I had two requirements for this project: everything should run on a single computer (including the AI models), and it should take no more than 30 minutes of inspecting clips to produce 5 new videos. You might ask: why not rent GPU power from a big cloud provider such as Amazon Web Services or Google Cloud? As of April 2026, running a g6.2xlarge instance (24 GB of NVIDIA VRAM) on Amazon Web Services for 8 hours a day would cost around $220 per month; I haven't even bothered to check what it would cost on Google Cloud.
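As a back-of-envelope sanity check on that figure, the monthly cost is just rate × hours; the hourly rate below is an assumption for illustration, not a quoted AWS price:

```python
# Rough monthly-cost check for a rented GPU instance.
HOURLY_RATE_USD = 0.92   # assumed on-demand rate for a g6.2xlarge-class instance (illustrative)
HOURS_PER_DAY = 8
DAYS_PER_MONTH = 30

monthly_cost = HOURLY_RATE_USD * HOURS_PER_DAY * DAYS_PER_MONTH
print(f"${monthly_cost:.0f} per month")  # lands in the ballpark of $220
```

Even at a generously low hourly rate, eight hours a day adds up fast, which is why a single local machine won the argument.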
I've also been asked whether I will open-source the code and the details of the infrastructure. I probably won't, beyond what's covered in this article.
Development
I wasn't sure how to actually implement everything, so I separated the project into 3 subprojects:
- A process that watches YouTube streams (let’s call this the “worker”)
- The backend that can start the processes mentioned above
- The frontend to interact with the backend
Building the Minimum Viable Product
Implementing the worker process was a little difficult, since it was the only part that was new to me at the time. I used uv to manage the Python dependencies and OpenAI's Whisper to transcribe the audio from a stream provided by Streamlink. The transcribed text was then fed to Meta's Llama 3.2 model to determine whether the audio was clip-worthy. However, this alone wasn't enough to tell whether a clip was high-quality enough to upload to a short-form content platform.
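The worker loop can be sketched roughly like this, assuming the `streamlink` and `ffmpeg` CLIs are on `PATH` and llama3.2 is served locally through Ollama's default HTTP endpoint. The function names and the prompt are illustrative, not the project's actual code:

```python
# Hypothetical worker sketch: pull stream audio via streamlink, extract a
# chunk with ffmpeg, transcribe it (e.g. with Whisper), then ask a local
# llama3.2 model whether the moment is clip-worthy.
import json
import subprocess
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_prompt(transcript: str) -> str:
    """Wrap a transcript chunk in a yes/no classification prompt (illustrative)."""
    return (
        "You are rating livestream moments for short-form content.\n"
        "Transcript:\n"
        f"{transcript}\n"
        "Answer with exactly YES if this moment is clip-worthy, otherwise NO."
    )


def parse_verdict(reply: str) -> bool:
    """Treat anything that doesn't start with YES as not clip-worthy."""
    return reply.strip().upper().startswith("YES")


def is_clip_worthy(transcript: str) -> bool:
    """Ask the local llama3.2 model (via Ollama) to judge a transcript chunk."""
    payload = json.dumps({
        "model": "llama3.2",
        "prompt": build_prompt(transcript),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_verdict(json.loads(resp.read())["response"])


def record_audio_chunk(stream_url: str, seconds: int, out_path: str) -> None:
    """Grab `seconds` of stream audio: streamlink piped into ffmpeg,
    downmixed to 16 kHz mono, which is what Whisper expects."""
    streamlink = subprocess.Popen(
        ["streamlink", "--stdout", stream_url, "best"],
        stdout=subprocess.PIPE,
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", "pipe:0", "-t", str(seconds),
         "-vn", "-ar", "16000", "-ac", "1", out_path],
        stdin=streamlink.stdout, check=True,
    )
    streamlink.terminate()
```

The key design point is that transcription and judgment are decoupled: the chunk recorder, the transcriber, and the classifier can each be swapped out independently, which mattered later when the text-only judge turned out to be the weak link.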
You can see the program transcribing Farex's YouTube stream reasonably well.
The results were straight trash. My program couldn't tell who was speaking, since some of the Minecraft servers the streamers played on had proximity chat; it clipped random moments, such as boring conversations; and it sometimes completely misunderstood what was going on in the stream.
The following is the only video worth clipping out of the hundreds the program produced from lerdi's YouTube stream. The worst part is that the program clipped it for a completely different reason: it misinterpreted what the streamer was saying.
The Next Steps
I was about to give up on this project until a bit more research led me to discover that Google DeepMind had released an open model called Gemma and Alibaba Cloud (阿里云) had released their own called Qwen. The difference between these and OpenAI's Whisper is that these models can analyze video and audio together, which could be game-changing for my clipfarmer.