At the 2025 GStreamer Conference I gave a talk titled "Costly Speech: an introduction".
This was in reference to the fact that all the speech-related elements used in the pipeline I presented were wrappers around for-pay cloud services or for-pay on-site servers.
At the end of the talk, I mentioned that plans for future development included new, "free" backends. The first piece of the puzzle was a Whisper-based transcriber.
I am pleased to announce that it is now implemented and published. Thank you to Ray Tiley from Tightrope Media Systems for sponsoring this work!
Design / Implementation
The main design goal was for the new transcriber to behave identically to the existing transcribers, in particular:
- It needed to output timestamped words one at a time
- It needed to handle live streams with a configurable latency
To fulfill that second requirement, the implementation feeds the model audio in chunks of a configurable duration.
This approach works well for constraining the latency, but on its own it didn't give the best accuracy: words close to the chunk boundaries would often go missing, be poorly transcribed, or be duplicated.
To address this, the implementation uses two mechanisms:
- It always feeds the previous chunk when running inference for a given chunk
- It extracts tokens from a sliding window at a configurable distance from the "live edge"
Here's an example with a 4-second chunk duration and a 1-second live edge offset:
```
0     1     2     3     4     5     6     7     8
|    4-second chunk     |    4-second chunk     |
                  | 4-second token window |
```
This approach greatly mitigates the boundary issues, as the tokens are always extracted from a "stable" region of the model's output.
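To make the interplay between chunks and the token window more concrete, here is a minimal Python sketch of the general scheme. The names (`run_inference`, `transcribe_stream`) and the exact timing arithmetic are illustrative assumptions of mine, not code taken from the element's actual implementation:

```python
# Illustrative sketch: chunked inference with an overlapping context
# chunk and a sliding token window. All timing values are in seconds.

CHUNK_DURATION = 4.0      # configurable chunk duration
LIVE_EDGE_OFFSET = 1.0    # configurable distance from the live edge


def transcribe_stream(chunks, run_inference):
    """`chunks` yields (start_time, samples) tuples, each CHUNK_DURATION long.

    `run_inference(samples)` is assumed to return tokens carrying a
    `.timestamp` (relative to the start of the audio it was given,
    simplified to a single instant here) and a `.text`.
    """
    previous = None
    for start, samples in chunks:
        # Always feed the previous chunk together with the current one,
        # so words near the chunk boundary sit in the middle of the input.
        if previous is not None:
            audio = previous + samples
            audio_start = start - CHUNK_DURATION
        else:
            audio = samples
            audio_start = start

        live_edge = start + CHUNK_DURATION
        # Only keep tokens from a "stable" window that ends
        # LIVE_EDGE_OFFSET before the live edge and slides forward
        # by one chunk duration at each iteration.
        window_end = live_edge - LIVE_EDGE_OFFSET
        window_start = window_end - CHUNK_DURATION

        for token in run_inference(audio):
            t = audio_start + token.timestamp
            if window_start <= t < window_end:
                yield (t, token.text)

        previous = samples
```

With the values from the diagram above, the second iteration feeds seconds 0 to 8 to the model and only emits tokens falling between seconds 3 and 7.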
With the above settings, the element reports a 5-second latency, to which a configurable processing latency is added. That processing latency depends on the hardware: on my machine, using CUDA with an NVIDIA RTX 5080 GPU, processing runs at around 10x real time, which means a 1-second processing latency is sufficient.
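In other words, the reported latency is simply the chunk duration plus the live edge offset. A back-of-the-envelope version of that arithmetic (my own summary of the scheme above, not a formula quoted from the element):

```python
chunk_duration = 4.0        # seconds, configurable
live_edge_offset = 1.0      # seconds, configurable
processing_latency = 1.0    # seconds, configurable, hardware-dependent

# A word spoken just after the previous token window ended has to wait
# for the next chunk to complete, plus the live edge offset, before it
# can be emitted.
reported_latency = chunk_duration + live_edge_offset       # 5.0 s
total_latency = reported_latency + processing_latency      # 6.0 s
```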
The obvious drawback of this approach is that resource usage doubles, as each chunk is fed through the inference model twice. It could be further refined to feed only part of the previous chunk, improving performance without sacrificing accuracy.
As the interface of the element follows that of other transcribers, it can be used as an alternative transcriber within transcriberbin.
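From Python, plugging it in might look roughly like the sketch below. The factory name `whispertranscriber` is an assumption made for illustration (check `gst-inspect-1.0` for the actual element name); transcriberbin's `transcriber` property is how alternative transcription elements are swapped in:

```python
# Hypothetical usage sketch: using the new transcriber inside transcriberbin.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# "whispertranscriber" is an assumed factory name, used here only for
# illustration purposes.
transcriber = Gst.ElementFactory.make("whispertranscriber", None)
tbin = Gst.ElementFactory.make("transcriberbin", None)

# Swap in the Whisper-based transcriber in place of the default one.
tbin.set_property("transcriber", transcriber)
```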
Future prospects
The biggest missing piece for bringing the transcriber to feature parity with other transcribers, such as the Speechmatics-based one, is speaker diarization (that is, identifying who is speaking).
Whisper itself does not support diarization. The tinydiarize project aimed to fine-tune models to address this, but it only supported detecting speaker changes, not identifying individual speakers, and has unfortunately been put on hold for now.
It is not clear at the moment what the best open-source option to integrate for this task would be. Models such as NVIDIA's Streaming Sortformer look promising, but are limited to four speakers, for example.
We are very interested in suggestions on this front. Don't hesitate to hit us up if you have any or are interested in sponsoring further improvements to our growing stack of speech-related elements!