Devlog

Read about our latest work!

At the 2025 GStreamer Conference I gave a talk titled "Costly Speech: an introduction".

This was in reference to the fact that all the speech-related elements used in the pipeline I presented were wrappers around for-pay cloud services or for-pay on-site servers.

At the end of the talk, I mentioned that plans for future development included new, "free" backends. The first piece of the puzzle was a Whisper-based transcriber.

I'm pleased to announce that it is now implemented and published. Thank you to Ray Tiley from Tightrope Media Systems for sponsoring this work!

Design / Implementation

The main design goal was for the new transcriber to behave identically to the existing transcribers, in particular:

  • It needed to output timestamped words one at a time
  • It needed to handle live streams with a configurable latency

In order to fulfill that second requirement, the implementation has to feed the model with chunks of a configurable duration.

This approach works well for constraining the latency, but on its own it didn't give the best accuracy: words close to the chunk boundaries would often go missing, be poorly transcribed or get duplicated.

To address this, the implementation uses two mechanisms:

  • It always feeds the previous chunk when running inference for a given chunk
  • It extracts tokens from a sliding window at a configurable distance from the "live edge"

Here's an example with a 4-second chunk duration and a 1-second live-edge offset:

0     1     2     3     4     5     6     7     8
| 4-second chunk        | 4-second chunk        |
                  | 4-second token window |

This approach greatly mitigates the boundary issues, as the tokens are always extracted from a "stable" region of the model's output.

With the above settings, the element reports a 5-second latency, to which a configurable processing latency is added. That processing latency depends on the hardware: on my machine, using CUDA with an NVIDIA RTX 5080 GPU, processing runs at around 10x real time, which means a 1-second processing latency is sufficient.
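
To make the arithmetic concrete, here is a minimal sketch (not the element's actual code) of how the token window and the reported latency follow from the chunk duration and the live-edge offset:

use gstreamer as gst;

// Illustrative only: given a chunk duration and a live-edge offset, compute
// the token window for the chunk ending at `chunk_end`, plus the latency the
// element would report before any processing latency is added on top.
fn token_window_and_latency(
    chunk_duration: gst::ClockTime,
    live_edge_offset: gst::ClockTime,
    chunk_end: gst::ClockTime,
) -> ((gst::ClockTime, gst::ClockTime), gst::ClockTime) {
    // Tokens are taken from a chunk-sized window shifted back from the live
    // edge by the configured offset, i.e. [3 s, 7 s] in the diagram above.
    let window_end = chunk_end - live_edge_offset;
    let window_start = window_end - chunk_duration;
    // A 4-second chunk plus a 1-second offset gives the reported 5 seconds.
    let reported_latency = chunk_duration + live_edge_offset;
    ((window_start, window_end), reported_latency)
}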

The obvious drawback of this approach is that resource usage doubles, as each chunk is fed through the inference model twice. It could be further refined to only feed part of the previous chunk, improving performance without sacrificing accuracy.

As the interface of the element follows that of other transcribers, it can be used as an alternative transcriber within transcriberbin.
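
As a rough sketch of what that looks like from application code (the element name below is only a placeholder for the new Whisper transcriber, and I'm assuming transcriberbin's transcriber property here, so check gst-inspect-1.0 for the actual names):

use gstreamer as gst;

// Hypothetical sketch: "whispertranscriber" is a placeholder name, and
// transcriberbin is assumed to accept the element through its "transcriber"
// property, as it does for the other transcriber elements.
fn build_transcriberbin() -> Result<gst::Element, gst::glib::BoolError> {
    let whisper = gst::ElementFactory::make("whispertranscriber").build()?;
    gst::ElementFactory::make("transcriberbin")
        .property("transcriber", &whisper)
        .build()
}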

Future prospects

The biggest missing piece to bring the transcriber to feature parity with other transcribers such as the speechmatics-based one is speaker diarization (~ identification).

Whisper itself does not support diarization. The tinydiarize project aimed to fine-tune models to address this, but it has unfortunately been put on hold for now, and it only supported detecting speaker changes, not identifying individual speakers.

It is not clear at the moment what the best open source option to integrate for this task would be. Models such as NVIDIA's streaming Sortformer are promising, but they are limited to four speakers, for example.

We are very interested in suggestions on this front. Don't hesitate to hit us up if you have any or are interested in sponsoring further improvements to our growing stack of speech-related elements!



Icecast is a Free and Open Source multimedia streaming server, primarily used for audio and radio streaming over HTTP(S).

In GStreamer you can send an audio stream to such a server with the shout2send sink element based on libshout2.

This works perfectly fine, but has one limitation: it does not support the AAC audio codec, which for some use cases and target systems is the preferred audio codec. This is because libshout2 does not support it and will not support it, at least not officially upstream.

Some streaming servers such as the Rocket Streaming Audio Server (RSAS) do support this though, and as such it would be nice to be able to send streams to them in AAC format as well.

Enter icecastsink, which is a new sink element written in Rust to send audio to an Icecast server.

It supports sending AAC audio in addition to Ogg/Vorbis, Ogg/Opus, FLAC and MP3, and it also supports automatic reconnection in case the server kicks the client off, which might happen if the client doesn't send data for a while.
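
As a minimal sketch of what sending AAC to such a server could look like (avenc_aac is just an example encoder here, and the server connection still needs to be configured through icecastsink's properties):

use gstreamer as gst;

// Sketch only: encode a test tone to AAC and hand it to icecastsink. Check
// gst-inspect-1.0 icecastsink for the connection properties to set for a
// real server.
fn build_aac_stream_pipeline() -> Result<gst::Element, gst::glib::Error> {
    gst::parse::launch(
        "audiotestsrc is-live=true ! audioconvert ! avenc_aac ! aacparse ! icecastsink",
    )
}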

Give it a spin and let us know how it goes!



One of the many items on my "nice-to-have" TODO list has been shipping a GStreamer installer that natively targets Windows ARM64. Cerbero has had support for cross-compiling to Windows ARM64 since GStreamer 1.16, in the form of targeting UWP. However, once UWP support was laid to rest with GStreamer 1.22, we didn't start shipping plain Windows ARM64 installers in its place, because it looked like Microsoft's ARM64 experiment had also failed.

Lately, however, there's been a significant resurgence of ARM64 laptops that run Windows, and they seem to actually have compelling features for some types of users. So I spent a day or two and reinstated support for Windows ARM64 built with MSVC in Cerbero.

My purpose was just to find the shortest path to getting that to a usable state, so a bunch of plugins are missing. In particular all Rust plugins had to be disabled due to an issue building the ring crate. I am optimistic that someone will come along and help fix these issues 😉

You can find the installer at the usual location: https://gstreamer.freedesktop.org/download/#windows

Note that these binaries are cross-compiled from x86_64, so the installer itself is x86, and the contents are missing gobject-introspection and Python bindings. We are also unable to generate Python wheels for Windows ARM64 because of this. If someone would like to help with any of this, please get in touch on the Windows channel in GStreamer's Matrix community.



Currently most code using the GStreamer Analytics library is written in C or Python. To check how well the API works from Rust, and to have an excuse to play with the Rust burn deep-learning framework, I've implemented an object detection inference element based on the YOLOX model and a corresponding tensor decoder that allows usage with other elements based on the GstAnalytics API. I started this work at the last GStreamer hackfest, but it has now finally been merged and will be part of the GStreamer 1.28.0 release.

burn is a deep-learning framework in Rust that is approximately on the same level of abstraction as PyTorch. It features lots of computation backends (CPU-based, Vulkan, CUDA, ROCm, Metal, libtorch, ...), has loaders (or better: code generation) for e.g. ONNX or PyTorch models, and compiles and optimizes the model for a specific backend. It also comes with a repository containing various example models and links to other community models.

The first element is burn-yoloxinference. It takes raw RGB video frames and passes them through burn; as of the time of this writing, either through a CPU-based or a Vulkan-based computation backend. The output is then the very same video frames with the raw object detection results attached as a GstTensorMeta. This is essentially an 85x8400 float matrix containing 8400 candidate object detections, each consisting of a bounding box (4 floats), confidence values for the classes (80 floats for the models pre-trained on the COCO classes) and one confidence value for the overall box. The element itself is mostly boilerplate, caps negotiation code and glue code between GStreamer and burn.

The second element is yoloxtensordec. This takes the output of the first element and decodes the GstTensorMeta into a GstAnalyticsRelationMeta, which describes the detected objects with their bounding boxes in an abstract way. As part of this it also implements a non-maximum suppression (NMS) filter using the intersection over union (IoU) of bounding boxes to reduce the 8400 candidate boxes to a much lower number of likely actual object detections. The GstAnalyticsRelationMeta can then be used e.g. by the generic objectdetectionoverlay element to render rectangles on top of the video, or by the ioutracker element to track objects over a sequence of frames. Again, this element is mostly boilerplate and caps negotiation code, plus around 100 SLOC of algorithm. In comparison, the C YOLOv9 tensor decoder element is about 3x as much code, mostly due to the overhead of C memory book-keeping, the lack of useful data structures and the lack of higher-level language abstractions.
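
For illustration, here is a rough sketch (not the element's actual code) of the kind of IoU-based non-maximum suppression step described above:

// Axis-aligned bounding box with top-left and bottom-right corners.
#[derive(Clone, Copy)]
struct BBox {
    x1: f32,
    y1: f32,
    x2: f32,
    y2: f32,
}

// Intersection over union: overlap area divided by the area of the union.
fn iou(a: BBox, b: BBox) -> f32 {
    let inter_w = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let inter_h = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    let inter = inter_w * inter_h;
    let area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    let area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    inter / (area_a + area_b - inter)
}

// Non-maximum suppression: keep the highest-scoring candidates and drop any
// candidate that overlaps an already-kept box by more than `iou_threshold`.
fn nms(mut candidates: Vec<(BBox, f32)>, iou_threshold: f32) -> Vec<(BBox, f32)> {
    candidates.sort_by(|a, b| b.1.total_cmp(&a.1));
    let mut kept: Vec<(BBox, f32)> = Vec::new();
    for (bbox, score) in candidates {
        if kept.iter().all(|(k, _)| iou(*k, bbox) < iou_threshold) {
            kept.push((bbox, score));
        }
    }
    kept
}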

The reason why the tensor decoder is a separate element is mostly to have one such element per model and to have it implemented independently of the actual implementation and runtime of the model. The same tensor decoder should, for example, also work fine on the output of the onnxinference element with a YOLOX model. From GStreamer 1.28 onwards it will also be possible to autoplug suitable tensor decoders via the tensordecodebin element.

That the tensor decoders are independent of the actual implementation of the model also has the advantage that they can be implemented in a different language, preferably one that is safer and less verbose than C.

To use both elements together, with objectdetectionoverlay rendering rectangles around the detected objects, the following pipeline can be used:

gst-launch-1.0 souphttpsrc location=https://raw.githubusercontent.com/tracel-ai/models/f4444a90955c1c6fda90597aac95039a393beb5a/squeezenet-burn/samples/cat.jpg \
    ! jpegdec ! videoconvertscale ! "video/x-raw,width=640,height=640" \
    ! burn-yoloxinference model-type=large backend-type=vulkan ! yoloxtensordec label-file=COCO_classes.txt \
    ! videoconvertscale ! objectdetectionoverlay \
    ! videoconvertscale ! imagefreeze ! autovideosink -v

The output should look similar to this image.

I also did a lightning talk about this at the GStreamer conference this year.



When using HTTP Live Streaming (HLS), a common approach is to use MPEG-TS segments or fragmented MP4 fragments. This is done so that the overall stream is available as a sequence of small HTTP-based file downloads, each being one short chunk of an overall bounded or unbounded media stream.

The playlist file (.m3u8) contains a list of these small segments or fragments. This is the standard and most common approach for HLS. For the HLS CMAF case, a multi-segment playlist would look like the one below.

#EXTM3U
#EXT-X-VERSION:6
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-TARGETDURATION:5
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="init00000.mp4"
#EXTINF:5,
segment00000.m4s
#EXTINF:5,
segment00001.m4s
#EXTINF:5,
segment00002.m4s

An alternative approach is to use a single media file with the EXT-X-BYTERANGE tag. This method is primarily used for on-demand (VOD) streaming, where the complete media file already exists, and it can reduce the number of files that need to be managed on the server. Using a single file with byte ranges requires the server and client to support HTTP byte-range requests and 206 Partial Content responses.
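
For illustration, a BYTERANGE value of the form length@offset simply maps to an HTTP range request over the single media file (a minimal sketch):

// Illustrative only: "#EXT-X-BYTERANGE:<length>@<offset>" asks the client to
// request bytes [offset, offset + length - 1] of the single media file.
fn byterange_to_http_range(length: u64, offset: u64) -> String {
    format!("Range: bytes={}-{}", offset, offset + length - 1)
}

// e.g. a playlist entry of "1000@0" maps to "Range: bytes=0-999".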

The single media file use case wasn't supported so far by either hlssink3 or hlscmafsink. A new property, single-media-file, has been added, which lets users specify the use of a single media file.

hlscmafsink.set_property("single-media-file", "main.mp4");
hlssink3.set_property("single-media-file", "main.ts");

For the HLS CMAF case, this would generate a playlist like the one below.

#EXTM3U
#EXT-X-VERSION:6
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-TARGETDURATION:5
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="main.mp4",BYTERANGE="768@0"
#EXT-X-BYTERANGE:100292@768
#EXTINF:5,
main.mp4
#EXT-X-BYTERANGE:98990@101060
#EXTINF:5,
main.mp4
#EXT-X-BYTERANGE:99329@200050
#EXTINF:5,
main.mp4

This can be useful when storage requirements make the use of a single media file for HLS favourable.
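
A minimal programmatic sketch for the CMAF case could look like this (single-media-file is the new property; playlist-location is an assumption on my side, so double-check the element's property set with gst-inspect-1.0 hlscmafsink):

use gstreamer as gst;

// Sketch only: write all segments into one fragmented MP4 file that the
// playlist references via byte ranges.
fn make_hls_sink() -> Result<gst::Element, gst::glib::BoolError> {
    gst::ElementFactory::make("hlscmafsink")
        .property("single-media-file", "main.mp4")
        .property("playlist-location", "main.m3u8")
        .build()
}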



Audio source separation describes the process of splitting an already mixed audio stream into its individual, logical sources, for example splitting a song into separate streams for its individual instruments and vocals. This can be used for karaoke or music practice, or for isolating a speaker from background noise, either for easier understanding by humans or to improve the results of speech-to-text processing.

Starting with GStreamer 1.28.0, an element for this purpose will be included. It is based on the Python/PyTorch implementation of demucs and comes with various pre-trained models that have different performance and accuracy characteristics and differ in which sets of sources they can separate. CPU-based processing is generally multiple times faster than real time on modern CPUs (around 8x on mine), but GPU-based processing via PyTorch is also possible.

The element itself is part of the GStreamer Rust plugins and can either run demucs locally in-process using an embedded Python interpreter via pyo3, or via a small Python service over WebSockets that can run either locally or remotely (e.g. for thin clients). The model used, the chunk size and the overlap between chunks can be configured. Chunk size and overlap provide control over the introduced latency (lower values give lower latency) and the quality (higher values give better quality).

The separated sources are provided on individual source pads of the element, so it effectively behaves like a demuxer. A karaoke pipeline would, for example, look as follows:

gst-launch-1.0 uridecodebin uri=file:///path/to/music/file ! audioconvert ! tee name=t ! \
  queue max-size-time=0 max-size-bytes=0 max-size-buffers=2 ! demucs name=demucs model-name=htdemucs \
  demucs.src_vocals ! queue ! audioamplify amplification=-1 ! mixer.sink_0 \
  t. ! queue max-size-time=9000000000 max-size-bytes=0 max-size-buffers=0 ! mixer.sink_1 \
  audiomixer name=mixer ! audioconvert ! autoaudiosink

This takes a URI to a music file and passes it through the demucs element to extract the vocals; it then takes the original input via a tee and subtracts the vocals from it by first inverting all samples of the vocals stream with the audioamplify element and then mixing the result with the original input in an audiomixer.
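
Programmatically, hooking up one of the separated streams could look roughly like this (a sketch that assumes the source pads show up dynamically, as they do for a typical demuxer; "src_vocals" is the pad name used in the pipeline above):

use gstreamer as gst;
use gst::prelude::*;

// Sketch only: link the separated vocals stream to a downstream sink pad as
// soon as demucs exposes it.
fn link_vocals(demucs: &gst::Element, downstream_sinkpad: gst::Pad) {
    demucs.connect_pad_added(move |_demucs, pad| {
        if pad.name() == "src_vocals" {
            pad.link(&downstream_sinkpad)
                .expect("failed to link vocals pad");
        }
    });
}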

I also did a lightning talk about this at the GStreamer conference this year.



Back in June '25, I implemented a new speech synthesis element using the ElevenLabs API.

In this post I will briefly explain some of the design choices I made, and provide one or two usage examples.

POST vs. WSS

ElevenLabs offers two interfaces for speech synthesis:

  • Either open a websocket and feed the service small chunks of text (e.g. words) to receive a continuous audio stream

  • Or POST longer segments of text to receive independent audio fragments

The websocket API is well adapted to conversational use cases and can offer the lowest latency, but it isn't the best fit for the use case I was targeting: my goal was to synthesize audio from text that was first transcribed, then translated, from an original input audio stream.

In this situation we have two constraints we need to be mindful of:

  • For translation purposes we need to construct large enough text segments prior to translating, in order for the translation service to operate with enough context to do a good job.

  • Once audio has been synthesized, we might also need to resample it in order to have it fit within the original duration of the speech.

Given that:

  • The latency benefits from using the websocket API are largely negated by the larger text segments we would use as the input

  • Resampling the continuous stream we would receive to make sure individual words are time-shifted back to the "correct" position, while possible thanks to the sync_alignment option, would have increased the complexity of the resulting element

I chose to use the POST API for this element. We might still choose to implement a websocket-based version if there is a good story for using GStreamer in a conversational pipeline, but that is not on my radar for now.

Additionally, we already have a speech synthesis element around the AWS Polly API which is also POST-based, so both elements can share a similar design.

Audio resampling

As mentioned previously, the ElevenLabs API does not offer direct control over the duration of the output audio.

For instance, you might be dubbing speech from a fast speaker with a slow voice, potentially causing the output audio to drift out of sync.

To address this, the element can optionally make use of signalsmith_stretch to resample the audio in a pitch-preserving manner.

When the feature is enabled it can be used through the overflow=compress property.

The effect can sometimes be pretty jarring for very short inputs, so an extra property, max-overflow, is also exposed to allow some tolerance for drift. It represents the maximum duration by which the audio output is allowed to drift out of sync, and the element does a good job of using up intervals of silence between utterances.

Voice cloning

The ElevenLabs API exposes a pretty powerful feature, Instant Voice Cloning. It can be used to create a custom voice that will sound very much like a reference voice, requiring only a handful of seconds to a few minutes of reference audio data to produce useful results.

Using the multilingual model, that newly-cloned voice can even be used to generate convincing speech in a different language.

A typical pipeline for my target use case can be represented as (pseudo gst-launch):

input_audio_src ! transcriber ! translator ! synthesizer

When using a transcriber element such as speechmaticstranscriber, speaker "diarization" (fancy word for detection) can be used to determine when a given speaker was speaking, thus making it possible to clone voices even in a multi-speaker situation.

The challenge in this situation, however, is that the synthesizer element doesn't have access to the original audio samples, as it only deals with text as its input.

I thus decided on the following solution:

input_audio_src ! voicecloner ! transcriber ! .. ! synthesizer

The voice cloner element accumulates audio samples; upon receiving custom upstream events from the transcriber element with information about speaker timings, it starts cloning voices and trims its internal sample queue.

To be compatible, a transcriber simply needs to send the appropriate events upstream. The speechmaticstranscriber element can be used as a reference.

Finally, once a voice clone is ready, the cloner element sends another event downstream with a mapping of speaker id to voice id. The synthesizer element can then intercept the event and start using the newly-created voice clone.

The cloner element can also be used in a single-speaker setup by just setting the speaker property to some identifier and watching for messages on the bus:

gst-launch-1.0 -m -e alsasrc ! audioconvert ! audioresample ! queue ! elevenlabsvoicecloner api-key=$ELEVENLABS_API_KEY speaker="Mathieu" ! fakesink

Putting it all together

At this year's GStreamer conference I gave a talk where I demo'd these new elements.

This is the pipeline I used then:

AWS_ACCESS_KEY_ID="XXX" AWS_SECRET_ACCESS_KEY="XXX" gst-launch-1.0 uridecodebin uri=file:///home/meh/Videos/spanish-convo-trimmed.webm name=ud \
  ud. ! queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! clocksync ! autovideosink \
  ud. ! audioconvert ! audioresample ! clocksync ! elevenlabsvoicecloner api-key=XXX ! \
    speechmaticstranscriber url=wss://eu2.rt.speechmatics.com/v2 enable-late-punctuation-hack=false join-punctuation=false api-key="XXX" max-delay=2500 latency=4000 language-code=es diarization=speaker ! \
    queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! textaccumulate latency=3000 drain-on-final-transcripts=false extend-duration=true ! \
    awstranslate latency=1000 input-language-code="es-ES" output-language-code="en-EN" ! \
    elevenlabssynthesizer api-key=XXX retry-with-speed=false overflow=compress latency=3000 language-code="en" voice-id="iCKVfVbyCo5AAswzTkkX" model-id="eleven_multilingual_v2" max-overflow=0 ! \
    queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! audiomixer name=m ! autoaudiosink audiotestsrc volume=0.03 wave=violet-noise ! clocksync ! m.

Watch my talk for the result, or try it yourself (you will need API keys for speechmatics / AWS / elevenlabs)!



The GStreamer Material Exchange Format (MXF) muxer and demuxer elements so far only supported Vertical Ancillary Data (VANC) in the form of closed captions. Any other VANC data was silently dropped. This was primarily reflected in the sink pad template of mxfmux.

  SINK template: 'vanc_sink_%u'
    Availability: On request
    Capabilities:
      closedcaption/x-cea-708
                 format: cdp
              framerate: [ 0/1, 2147483647/1 ]

mxfmux and mxfdemux have now been extended to support arbitrary VANC data.

The SMPTE 436 (pdf) specification defines how ancillary data is stored in MXF. SMPTE 2038 (pdf) defines the carriage of ancillary data packets in an MPEG-2 transport stream and acts as a more structured format (ST2038) compared to the line-based format (ST436M). mxfdemux converts from ST436M to ST2038, while mxfmux converts from ST2038 to ST436M. So mxfdemux now outputs VANC (ST436M) essence tracks as ST2038 streams, and mxfmux consumes ST2038 streams to output VANC (ST436M) essence tracks.

Supporting this in the muxer required a breaking change: the acceptable caps on the pad were updated. The sink pad template of mxfmux now uses meta/x-st-2038 instead of the earlier closedcaption/x-cea-708. Applications can use cctost2038anc to convert closed captions to ST2038.

  SINK template: 'vanc_sink_%u'
    Availability: On request
    Capabilities:
      meta/x-st-2038
              alignment: frame (gchararray)

While the pad templates of mxfdemux haven't changed, as shown below, the caps on the source pad will now be meta/x-st-2038 for VANC data, so applications have to handle the different caps. Closed captions can be extracted via st2038anctocc.

  SRC template: 'track_%u'
    Availability: Sometimes
    Capabilities:
      ANY
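
As an example of how the conversion elements fit together on the demuxing side (a sketch only; the track index and surrounding elements depend on the actual file):

use gstreamer as gst;

// Sketch only: demux an MXF file and convert a VANC essence track (now output
// as meta/x-st-2038) back into closed captions with st2038anctocc. "track_0"
// stands in for whichever track carries the VANC data in a given file.
fn build_vanc_extraction_pipeline() -> Result<gst::Element, gst::glib::Error> {
    gst::parse::launch(
        "filesrc location=input.mxf ! mxfdemux name=d \
         d.track_0 ! queue ! st2038anctocc ! fakesink",
    )
}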

The older behaviour is still available via the GST_VANC_AS_CEA708 environment variable. In addition, mxfdemux can now read both 8-bit and 10-bit VANC data from MXF files.

The ST2038 elements available in the Rust plugins, which were described in an earlier post here, have also seen some fixes for correctly handling alignment and framerate.