Building SpeechCut – Speech-Driven Video Editing on Mobile Wasn't Easy

SpeechCut may look simple from the outside: you just tap words and the video edits itself. But behind the scenes, any kind of video editor is a pretty complex thing to build. You need to carefully choreograph the dance between audio, video, and text. In this post I want to share some of the challenges I faced while building it, and why it was worth it.

Going Speech First

Traditional video editors are built around the video timeline first: you drag clips, set in/out points, and work with different layers. That is usually a very flexible way of working, but for editing fast-paced, speech-based short videos on mobile, it's really not the nicest way to work. This is why I decided to flip that model on its head.

In SpeechCut, your speech and its transcription are at the heart of everything. When you record, we create a detailed transcript that captures your voice with precise word timings. Every word is linked to the exact moment you said it. When you tap to remove a word, that moment gets skipped in your final video. A removed word isn't really deleted, either: you can toggle it back on just as easily, whenever you want.
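To make the idea concrete, here is a minimal sketch of how such a transcript could be modeled. The `Word` class, its field names, and `toggle_word` are illustrative assumptions for this post, not SpeechCut's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds into the recording where the word begins
    end: float    # seconds into the recording where the word ends
    enabled: bool = True  # False means "skipped", never "deleted"


def toggle_word(words: List[Word], index: int) -> None:
    """Flip a word between kept and skipped; its text and timing stay intact."""
    words[index].enabled = not words[index].enabled


transcript = [
    Word("So", 0.00, 0.25),
    Word("um", 0.25, 0.75),
    Word("hello", 0.75, 1.25),
]

toggle_word(transcript, 1)  # tap "um" to remove it
toggle_word(transcript, 1)  # tap again to bring it back
```

Because removal is just a flag on the word, undoing an edit never requires restoring any data.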

This approach makes editing feel natural, stress-free, and effortless.

All your edits are automatically based on what actually matters: your speech and its timing.
We adapt the video quality to your device, ensuring smooth playback whether you're using a flagship phone or an older model.
The captions stay in sync as you edit, even if you remove phrases or words from your video.
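One way captions can stay in sync is to shift each caption timestamp by the total duration removed before it. The sketch below shows that idea under some assumptions: the function name `to_edited_time` is made up, and the removed ranges are assumed to be sorted and non-overlapping. It is not necessarily how SpeechCut does it.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # (start, end) in source-video seconds


def to_edited_time(t: float, removed: List[Range]) -> float:
    """Map a source timestamp onto the edited timeline.

    Assumes `removed` is sorted by start time and has no overlaps.
    """
    shift = 0.0
    for start, end in removed:
        if t >= end:
            shift += end - start  # the whole cut lies before t
        elif t > start:
            shift += t - start    # t is inside a cut: collapse it to the cut's start
    return t - shift


removed = [(1.0, 1.5), (3.0, 4.0)]
caption_start = to_edited_time(5.0, removed)  # 1.5 s were cut before this point
```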

Video Sync and Word-Level Timing

To make this work:

I generate a transcript with word-level timestamps and group the words into logical phrases.
Then, I map each word and phrase to the video player using those timestamps.
If you disable a word or phrase, it doesn't just get grayed out; it is actually skipped during playback, in both preview and export.
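The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the app's real implementation: the tuple layout, the pause-based grouping rule, and the 0.5-second `max_gap` threshold are all hypothetical.

```python
from typing import List, Tuple

# (text, start_sec, end_sec, enabled): a hypothetical word record
Word = Tuple[str, float, float, bool]


def group_into_phrases(words: List[Word], max_gap: float = 0.5) -> List[List[Word]]:
    """Start a new phrase whenever the pause before a word exceeds max_gap."""
    phrases: List[List[Word]] = []
    for word in words:
        if phrases and word[1] - phrases[-1][-1][2] <= max_gap:
            phrases[-1].append(word)
        else:
            phrases.append([word])
    return phrases


def kept_segments(words: List[Word]) -> List[Tuple[float, float]]:
    """Merge runs of consecutive enabled words into the time ranges to play."""
    segments: List[Tuple[float, float]] = []
    previous_enabled = False
    for _text, start, end, enabled in words:
        if enabled:
            if previous_enabled:
                segments[-1] = (segments[-1][0], end)  # extend the current run
            else:
                segments.append((start, end))
        previous_enabled = enabled
    return segments


words = [
    ("So", 0.0, 0.25, True),
    ("um", 0.25, 0.75, False),  # disabled by the user
    ("hello", 0.75, 1.25, True),
    ("everyone", 1.25, 2.0, True),
    ("Today", 3.0, 3.5, True),  # follows a one-second pause
]
phrases = group_into_phrases(words)
segments = kept_segments(words)
```

In this sketch, pauses between two kept words stay in the video; only the disabled words themselves are cut. Both preview and export could then work from the same list of segments.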

All this happens on-device, with minimal backend interaction.

Real-Time Preview

One of the hardest features to build was the real-time preview, where the video skips disabled words as if it had already been edited, and does so quickly. Most mobile players aren't built for this kind of nonlinear playback, so I had to build a custom preview engine that tracks audio chunks and aligns video playback with each word. Nerdy stuff!
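The core decision in that kind of preview is simple to state: on every playback tick, check whether the current time falls inside a skipped range and, if so, jump past it. Here is a simplified sketch of that check. The function name and the sorted, non-overlapping `skipped` list are assumptions, and the real engine also has to deal with audio chunks and seek latency.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def next_playable_time(t: float, skipped: List[Tuple[float, float]]) -> Optional[float]:
    """Return the time to seek to if t is inside a skipped range, else None.

    Assumes `skipped` is sorted by start time and has no overlaps.
    """
    starts = [start for start, _end in skipped]
    i = bisect_right(starts, t) - 1  # last range that starts at or before t
    if i < 0 or t >= skipped[i][1]:
        return None  # t is in playable territory, keep playing
    target = skipped[i][1]
    # Chain through back-to-back skipped ranges so a single seek clears them all.
    while i + 1 < len(skipped) and skipped[i + 1][0] <= target:
        i += 1
        target = max(target, skipped[i][1])
    return target


skipped = [(1.0, 1.5), (1.5, 2.0), (4.0, 5.0)]
seek_to = next_playable_time(1.2, skipped)  # inside the first of two adjacent cuts
```

A player loop would call this from its periodic time callback and seek only when the result is not `None`.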

Why Go Through All This?

Because creators don't want to mess around with timelines, razor tools, and nested clips. They want to say something meaningful and clean it up quickly, on their phones, while the idea is still fresh.

SpeechCut is still growing, but I'm proud of how far it's come. If you're into video editing, tech experiments, or mobile UX design, I'd love to hear what you think, or what you'd build next.