
Building SpeechCut – Audio-Driven Editing on Mobile Wasn't Easy

By Juuso

27 October 2025

SpeechCut may look simple from the outside: you just tap words and the video edits itself. But behind the scenes, any kind of video editor is a pretty complex thing to build. You need to carefully choreograph the dance between audio, video, and text. In this post I want to share some of the challenges I faced while building it, and why it was worth it.

Going Speech First

Traditional video editors are built around the video timeline first: you drag clips, set in/out points, and work with different layers. It's usually a very flexible way of working, but when you're editing fast-paced, speech-based short videos on mobile, it's really not the nicest way to work. This is why I decided to flip that model on its head.

In SpeechCut, your speech and its transcription are at the heart of everything. When you record, we create a perfect transcript that matches your voice. Every word is linked to the exact moment you said it. When you tap to remove a word, that moment gets skipped in your final video. A removed word isn't really deleted, either: you can toggle it back on whenever you want.

This approach makes editing feel natural, stress-free, and effortless.

  • All your edits are automatically based on what actually matters: your speech and its timing
  • We adapt the video quality to your device, ensuring smooth playback whether you're using a flagship phone or an older model
  • The captions always stay perfectly in sync, even if you remove phrases or words from your video
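To show why captions can stay in sync after cuts, here is a rough sketch in Python. It is not SpeechCut's actual code: the function name `retime_captions`, the tuple layout, and the millisecond units are all assumptions made for illustration, and it assumes a disabled word removes exactly its own time span. Each kept word is simply shifted earlier by the total duration removed before it.

```python
def retime_captions(words):
    """Move kept words onto the edited timeline.

    words: (text, start_ms, end_ms, enabled) tuples, sorted by start time.
    Assumption: a disabled word removes exactly its own [start_ms, end_ms) span.
    Returns (text, start_ms, end_ms) tuples for the enabled words only.
    """
    removed_ms = 0  # total duration cut so far
    captions = []
    for text, start_ms, end_ms, enabled in words:
        if enabled:
            # Shift the word earlier by everything that was cut before it.
            captions.append((text, start_ms - removed_ms, end_ms - removed_ms))
        else:
            removed_ms += end_ms - start_ms
    return captions


words = [
    ("So", 0, 200, True),
    ("um", 300, 600, False),   # toggled off by the user
    ("hello", 700, 1100, True),
]
print(retime_captions(words))
```

Because the caption times are derived from the same word list that drives the cuts, toggling a word back on just means running the function again.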

Video Sync and Word-Level Timing

To make this work:

  • I generate a transcript with word-level timestamps, and combine the words into logical phrases.
  • Then, I map each word and phrase to the video player using those timestamps.
  • If you disable a word or phrase, it doesn't just get grayed out; it is actually skipped during preview and export.

All this happens on-device, with minimal backend interaction.
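The steps above could be sketched roughly like this in Python. Everything here is hypothetical: the `Word` class, the `group_into_phrases` and `skip_ranges` helpers, and the 400 ms pause threshold are assumptions for illustration, not how SpeechCut actually does it. Phrases are formed by splitting at long pauses, and neighbouring disabled words are merged into a single range to skip.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int
    enabled: bool = True  # toggled off instead of being deleted


def group_into_phrases(words, max_gap_ms=400):
    """Start a new phrase whenever the pause before a word exceeds max_gap_ms."""
    phrases = []
    for word in words:
        if phrases and word.start_ms - phrases[-1][-1].end_ms <= max_gap_ms:
            phrases[-1].append(word)
        else:
            phrases.append([word])
    return phrases


def skip_ranges(words):
    """Return (start_ms, end_ms) ranges to skip, merging consecutive disabled words."""
    ranges = []
    previous_disabled = False
    for word in words:
        if not word.enabled:
            if previous_disabled:
                # Extend the current range so the gap between the words is cut too.
                ranges[-1] = (ranges[-1][0], word.end_ms)
            else:
                ranges.append((word.start_ms, word.end_ms))
        previous_disabled = not word.enabled
    return ranges


transcript = [
    Word("So", 0, 200),
    Word("um", 300, 600, enabled=False),
    Word("uh", 650, 800, enabled=False),
    Word("hello", 900, 1300),
    Word("again", 2500, 2900),
]
print([[w.text for w in phrase] for phrase in group_into_phrases(transcript)])
print(skip_ranges(transcript))
```

Keeping the `enabled` flag on each word, rather than deleting words from the list, is what makes an edit reversible: flipping the flag and recomputing the ranges is enough.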

Real-Time Preview

One of the hardest features to build was the real-time preview, where the video skips disabled words as if it had already been edited, and does so swiftly. Most mobile players aren't built for this kind of nonlinear playback, so I had to build a custom preview engine that tracks audio chunks and aligns video playback with each word. Nerdy stuff!
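To give a feel for the skipping logic, here is a simplified sketch in Python. It is only an illustration under stated assumptions: the names `resolve_playhead` and `simulate_preview` are made up, the skip ranges are assumed to be sorted and non-overlapping, and a real preview engine would react to player callbacks and issue seeks rather than step a fake clock.

```python
def resolve_playhead(position_ms, skip_ranges):
    """Return the position playback should continue from.

    skip_ranges: sorted, non-overlapping (start_ms, end_ms) spans to skip.
    If the position falls inside a span, jump to that span's end.
    """
    for start_ms, end_ms in skip_ranges:
        if start_ms <= position_ms < end_ms:
            # Keep looping: the next span may start exactly where this one ends.
            position_ms = end_ms
    return position_ms


def simulate_preview(duration_ms, skip_ranges, tick_ms=100):
    """Step a fake playhead and collect the source positions that would be shown."""
    shown = []
    position_ms = resolve_playhead(0, skip_ranges)
    while position_ms < duration_ms:
        shown.append(position_ms)
        position_ms = resolve_playhead(position_ms + tick_ms, skip_ranges)
    return shown


print(simulate_preview(1000, [(300, 800)]))
```

In this toy run the playhead goes 0, 100, 200 and then jumps straight to 800, which is the behavior the preview needs: the disabled span never reaches the screen.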

Why Go Through All This?

Because creators don't want to mess around with timelines, razor tools, and nested clips. They want to say something meaningful and clean it up quickly, on their phones, while the idea is still fresh.

SpeechCut is still growing, but I'm proud of how far it's come. If you're into video editing, tech experiments, or mobile UX design, I'd love to hear what you think, or what you'd build next.