Devlog

Currently most code using the GStreamer Analytics library is written in C or Python. To check how well the API works from Rust, and to have an excuse to play with the Rust burn deep-learning framework, I've implemented an object detection inference element based on the YOLOX model and a corresponding tensor decoder that allows usage with other elements based on the GstAnalytics API. I started this work at the last GStreamer hackfest, and it has now finally been merged and will be part of the GStreamer 1.28.0 release.

burn is a deep-learning framework in Rust that sits at approximately the same level of abstraction as PyTorch. It features lots of computation backends (CPU-based, Vulkan, CUDA, ROCm, Metal, libtorch, ...), has loaders (or rather: code generation) for e.g. ONNX or PyTorch models, and compiles and optimizes the model for a specific backend. It also comes with a repository containing various example models and links to other community models.

The first element is burn-yoloxinference. It takes raw RGB video frames and passes them through burn, at the time of writing either via a CPU-based or a Vulkan-based computation backend. The output is the very same video frames with the raw object detection results attached as a GstTensorMeta. This is essentially an 85x8400 float matrix containing 8400 candidate detections, each consisting of a bounding box (4 floats), one confidence value for the overall box, and per-class confidence values (80 floats for the models pre-trained on the COCO classes). The element itself is mostly boilerplate, caps negotiation code and glue code between GStreamer and burn.

The second element is yoloxtensordec. It takes the output of the first element and decodes the GstTensorMeta into a GstAnalyticsRelationMeta, which describes the detected objects with their bounding boxes in an abstract way. As part of this it also implements a non-maximum suppression (NMS) filter using the intersection over union (IoU) of bounding boxes to reduce the 8400 candidate boxes to a much smaller number of actual likely object detections. The GstAnalyticsRelationMeta can then be used e.g. by the generic objectdetectionoverlay element to render rectangles on top of the video, or by the ioutracker element to track objects over a sequence of frames. Again, this element is mostly boilerplate and caps negotiation code, plus around 100 SLOC of algorithm. In comparison, the C YOLOv9 tensor decoder element is about 3x as much code, mostly due to the overhead of manual memory book-keeping in C and its lack of useful data structures and higher-level language abstractions.

The tensor decoder is a separate element mostly so that there can be one such element per model, implemented independently of the actual implementation and runtime of the model. The same tensor decoder should, for example, also work fine on the output of the onnxinference element with a YOLOX model. From GStreamer 1.28 onwards it will also be possible to autoplug suitable tensor decoders via the tensordecodebin element.
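As a rough, untested sketch of that combination (assuming a YOLOX ONNX model file and that onnxinference attaches its results as a GstTensorMeta the decoder understands), such a pipeline could look like this:

gst-launch-1.0 v4l2src ! videoconvertscale ! "video/x-raw,format=RGB,width=640,height=640" \
    ! onnxinference model-file=yolox_l.onnx \
    ! yoloxtensordec label-file=COCO_classes.txt \
    ! videoconvertscale ! objectdetectionoverlay \
    ! videoconvertscale ! autovideosink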

Because the tensor decoders are independent of the actual implementation of the model, they can also be implemented in a different language, preferably one that is safer and less verbose than C.

To use both elements together, with objectdetectionoverlay rendering rectangles around the detected objects, the following pipeline can be used:

gst-launch-1.0 souphttpsrc location=https://raw.githubusercontent.com/tracel-ai/models/f4444a90955c1c6fda90597aac95039a393beb5a/squeezenet-burn/samples/cat.jpg \
    ! jpegdec ! videoconvertscale ! "video/x-raw,width=640,height=640" \
    ! burn-yoloxinference model-type=large backend-type=vulkan ! yoloxtensordec label-file=COCO_classes.txt \
    ! videoconvertscale ! objectdetectionoverlay \
    ! videoconvertscale ! imagefreeze ! autovideosink -v

The output should look similar to this image.

I also did a lightning talk about this at the GStreamer conference this year.



When using HTTP Live Streaming (HLS), a common approach is to use MPEG-TS segments or fragmented MP4 (fMP4) fragments. This makes the overall stream available as a sequence of small HTTP-based file downloads, each being one short chunk of an overall bounded or unbounded media stream.

The playlist file (.m3u8) contains a list of these small segments or fragments. This is the standard and most common approach for HLS. For the HLS CMAF case, a multi-segment playlist looks like the following.

#EXTM3U
#EXT-X-VERSION:6
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-TARGETDURATION:5
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="init00000.mp4"
#EXTINF:5,
segment00000.m4s
#EXTINF:5,
segment00001.m4s
#EXTINF:5,
segment00002.m4s

An alternative approach is to use a single media file with the EXT-X-BYTERANGE tag, which identifies a sub-range of the file in the form <length>@<offset>. This method is primarily used for on-demand (VOD) streaming, where the complete media file already exists, and it can reduce the number of files that need to be managed on the server. Using a single file with byte ranges requires the server and client to support HTTP byte range requests and 206 Partial Content responses.

The single-media-file use case was previously not supported by either hlssink3 or hlscmafsink. A new property, single-media-file, has been added, which lets users specify the single media file to use.

hlscmafsink.set_property("single-media-file", "main.mp4");
hlssink3.set_property("single-media-file", "main.ts");
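As an untested, complete sketch with hlscmafsink (single-media-file is the new property; the other properties shown are the element's existing ones, so double-check them against the documentation), a pipeline could look like this:

gst-launch-1.0 videotestsrc num-buffers=900 ! "video/x-raw,framerate=30/1" \
    ! x264enc key-int-max=150 ! h264parse \
    ! hlscmafsink single-media-file=main.mp4 playlist-location=main.m3u8 target-duration=5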

For the HLS CMAF case, this would generate a playlist like the following.

#EXTM3U
#EXT-X-VERSION:6
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-TARGETDURATION:5
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="main.mp4",BYTERANGE="768@0"
#EXT-X-BYTERANGE:100292@768
#EXTINF:5,
main.mp4
#EXT-X-BYTERANGE:98990@101060
#EXTINF:5,
main.mp4
#EXT-X-BYTERANGE:99329@200050
#EXTINF:5,
main.mp4

This can be useful when storage constraints make the use of a single media file for HLS preferable.



Audio source separation describes the process of splitting an already mixed audio stream into its individual, logical sources, for example splitting a song into separate streams for its individual instruments and vocals. This can be used for karaoke or music practice, or for isolating a speaker from background noise, either to make them easier to understand for humans or to improve the results of speech-to-text processing.

Starting with GStreamer 1.28.0, an element for this purpose will be included. It is based on the Python/pytorch implementation of demucs and comes with various pre-trained models that differ in performance and accuracy characteristics, as well as in which sets of sources they can separate. CPU-based processing generally runs at multiple times real-time on modern CPUs (around 8x on mine), but GPU-based processing via pytorch is also possible.

The element itself is part of the GStreamer Rust plugins and can either run demucs locally in-process using an embedded Python interpreter via pyo3, or via a small Python service over WebSockets that can run either locally or remotely (e.g. for thin clients). The model used, the chunk size and the overlap between chunks can be configured. Chunk size and overlap provide control over the introduced latency (lower values give lower latency) and quality (higher values give better quality).

The separate sources are provided on individual source pads of the element and it effectively behaves like a demuxer. A pipeline for karaoke would for example look as follows:

gst-launch-1.0 uridecodebin uri=file:///path/to/music/file ! audioconvert ! tee name=t ! \
  queue max-size-time=0 max-size-bytes=0 max-size-buffers=2 ! demucs name=demucs model-name=htdemucs \
  demucs.src_vocals ! queue ! audioamplify amplification=-1 ! mixer.sink_0 \
  t. ! queue max-size-time=9000000000 max-size-bytes=0 max-size-buffers=0 ! mixer.sink_1 \
  audiomixer name=mixer ! audioconvert ! autoaudiosink

This takes a URI to a music file, passes it through the demucs element to extract the vocals, then takes the original input via a tee and subtracts the vocals from it by first inverting all samples of the vocals stream with the audioamplify element and then mixing it with the original input in an audiomixer.
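To instead extract all stems to separate files, a pipeline along the following lines can be used. This is an untested sketch: the src_vocals pad name is taken from the example above, while src_drums, src_bass and src_other are assumed to follow the same naming scheme for the four stems the htdemucs model separates.

gst-launch-1.0 uridecodebin uri=file:///path/to/music/file ! audioconvert ! demucs name=d model-name=htdemucs \
  d.src_vocals ! queue ! audioconvert ! wavenc ! filesink location=vocals.wav \
  d.src_drums ! queue ! audioconvert ! wavenc ! filesink location=drums.wav \
  d.src_bass ! queue ! audioconvert ! wavenc ! filesink location=bass.wav \
  d.src_other ! queue ! audioconvert ! wavenc ! filesink location=other.wav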

I also did a lightning talk about this at the GStreamer conference this year.



Back in June '25, I implemented a new speech synthesis element using the ElevenLabs API.

In this post I will briefly explain some of the design choices I made, and provide one or two usage examples.

POST vs. WSS

ElevenLabs offers two interfaces for speech synthesis:

  • Either open a websocket and feed the service small chunks of text (e.g. words) to receive a continuous audio stream

  • Or POST longer segments of text to receive independent audio fragments

The websocket API is well-adapted to conversational use cases and can offer the lowest latency, but it isn't the best suited to the use case I was targeting: my goal was to synthesize audio from text that was first transcribed and then translated from an original input audio stream.

In this situation we have two constraints we need to be mindful of:

  • For translation purposes we need to construct large enough text segments prior to translating, in order for the translation service to operate with enough context to do a good job.

  • Once audio has been synthesized, we might also need to resample it in order to have it fit within the original duration of the speech.

Given that:

  • The latency benefits from using the websocket API are largely negated by the larger text segments we would use as the input

  • Resampling the continuous stream we would receive to make sure individual words are time-shifted back to the "correct" position, while possible thanks to the sync_alignment option, would have increased the complexity of the resulting element

I chose to use the POST API for this element. We might still choose to implement a websocket-based version if there is a good story for using GStreamer in a conversational pipeline, but that is not on my radar for now.

Additionally, we already have a speech synthesis element around the AWS Polly API which is also POST-based, so both elements can share a similar design.

Audio resampling

As mentioned previously, the ElevenLabs API does not offer direct control over the duration of the output audio.

For instance, you might be dubbing speech from a fast speaker with a slow voice, potentially causing the output audio to drift out of sync.

To address this, the element can optionally make use of signalsmith_stretch to resample the audio in a pitch-preserving manner.

When the feature is enabled, it can be activated through the overflow=compress property.

The effect can sometimes be pretty jarring for very short input, so an extra property, max-overflow, is also exposed to allow some tolerance for drift. It represents the maximum duration by which the audio output is allowed to drift out of sync, and the element does a good job of using up intervals of silence between utterances to absorb that drift.

Voice cloning

The ElevenLabs API exposes a pretty powerful feature, Instant Voice Cloning. It can be used to create a custom voice that will sound very much like a reference voice, requiring only a handful of seconds to a few minutes of reference audio data to produce useful results.

Using the multilingual model, that newly-cloned voice can even be used to generate convincing speech in a different language.

A typical pipeline for my target use case can be represented as (pseudo gst-launch):

input_audio_src ! transcriber ! translator ! synthesizer

When using a transcriber element such as speechmaticstranscriber, speaker "diarization" (a fancy word for detecting who speaks when) can be used to determine when a given speaker was speaking, making it possible to clone voices even in a multi-speaker situation.

The challenge in this situation, however, is that the synthesizer element doesn't have access to the original audio samples, as it only deals with text as its input.

I thus decided on the following solution:

input_audio_src ! voicecloner ! transcriber ! .. ! synthesizer

The voice cloner element will accumulate audio samples; upon receiving custom upstream events from the transcriber element with information about speaker timings, it will start cloning voices and trim its internal sample queue.

To be compatible, a transcriber simply needs to send the appropriate events upstream. The speechmaticstranscriber element can be used as a reference.

Finally, once a voice clone is ready, the cloner element sends another event downstream with a mapping of speaker id to voice id. The synthesizer element can then intercept the event and start using the newly-created voice clone.

The cloner element can also be used in single-speaker mode by just setting the speaker property to some identifier and watching for messages on the bus:

gst-launch-1.0 -m -e alsasrc ! audioconvert ! audioresample ! queue ! elevenlabsvoicecloner api-key=$ELEVENLABS_API_KEY speaker="Mathieu" ! fakesink

Putting it all together

At this year's GStreamer conference I gave a talk where I demo'd these new elements.

This is the pipeline I used then:

AWS_ACCESS_KEY_ID="XXX" AWS_SECRET_ACCESS_KEY="XXX" gst-launch-1.0 uridecodebin uri=file:///home/meh/Videos/spanish-convo-trimmed.webm name=ud \
  ud. ! queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! clocksync ! autovideosink \
  ud. ! audioconvert ! audioresample ! clocksync ! elevenlabsvoicecloner api-key=XXX ! \
    speechmaticstranscriber url=wss://eu2.rt.speechmatics.com/v2 enable-late-punctuation-hack=false join-punctuation=false api-key="XXX" max-delay=2500 latency=4000 language-code=es diarization=speaker ! \
    queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! textaccumulate latency=3000 drain-on-final-transcripts=false extend-duration=true ! \
    awstranslate latency=1000 input-language-code="es-ES" output-language-code="en-EN" ! \
    elevenlabssynthesizer api-key=XXX retry-with-speed=false overflow=compress latency=3000 language-code="en" voice-id="iCKVfVbyCo5AAswzTkkX" model-id="eleven_multilingual_v2" max-overflow=0 ! \
    queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! audiomixer name=m ! autoaudiosink audiotestsrc volume=0.03 wave=violet-noise ! clocksync ! m.

Watch my talk for the result, or try it yourself (you will need API keys for speechmatics / AWS / elevenlabs)!



So far, the GStreamer Material Exchange Format (MXF) muxer and demuxer elements only supported handling Vertical Ancillary Data (VANC) in the form of closed captions. Any other VANC data was silently dropped. This was reflected in the sink pad template of mxfmux:

  SINK template: 'vanc_sink_%u'
    Availability: On request
    Capabilities:
      closedcaption/x-cea-708
                 format: cdp
              framerate: [ 0/1, 2147483647/1 ]

mxfmux and mxfdemux have now been extended to support arbitrary VANC data.

The SMPTE 436 (pdf) specification defines how ancillary data is stored in MXF. SMPTE 2038 (pdf) defines the carriage of ancillary data packets in an MPEG-2 transport stream, and acts as a more structured format (ST 2038) compared to the line-based format (ST 436M). mxfdemux now converts from ST 436M to ST 2038 and outputs VANC essence tracks as ST 2038 streams, while mxfmux converts from ST 2038 to ST 436M and consumes ST 2038 streams to write VANC essence tracks.

Supporting this in the muxer required a breaking change: the acceptable caps on the pad were updated. The sink pad template of mxfmux has now changed to meta/x-st-2038 instead of the earlier closedcaption/x-cea-708. Applications can use cctost2038anc to convert closed captions to ST 2038.

  SINK template: 'vanc_sink_%u'
    Availability: On request
    Capabilities:
      meta/x-st-2038
              alignment: frame (gchararray)
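As a rough, untested sketch of muxing closed captions as a VANC track next to an MPEG-2 video track, a pipeline could look like the following. The closed-caption to ST 2038 conversion chain is the same one as in the RTP example further down this page; depending on the negotiated ST 2038 alignment, additional helper elements may be needed in between.

gst-launch-1.0 mxfmux name=mux ! filesink location=out.mxf \
    videotestsrc num-buffers=300 ! "video/x-raw,framerate=30/1" ! avenc_mpeg2video ! mpegvideoparse ! queue ! mux. \
    filesrc location=captions.srt ! subparse ! tttocea708 ! "closedcaption/x-cea-708,framerate=30/1" ! ccconverter ! \
    cctost2038anc ! queue ! mux.vanc_sink_0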

While the pad templates of mxfdemux haven't changed, as shown below, the caps on the source pad are going to be meta/x-st-2038 for VANC data, so applications now have to handle different caps. Closed captions can be extracted via st2038anctocc.

  SRC template: 'track_%u'
    Availability: Sometimes
    Capabilities:
      ANY

The older behaviour is still available via the GST_VANC_AS_CEA708 environment variable. In addition, mxfdemux can now read both 8-bit and 10-bit VANC data from MXF files.

The ST 2038 elements available in the Rust plugins, described in an earlier post here, have also seen some fixes for correctly handling alignment and framerate.



As part of our ongoing efforts to extend GStreamer's support for ancillary data, I've recently improved the ancillary data handling in the Blackmagic DeckLink plugin. This plugin can be used to capture or output SDI/HDMI/ST2110 streams with Blackmagic DeckLink capture/output cards.

Previously, only CEA 608/708 closed captions and AFD/Bar ancillary data were handled by that plugin. Now it can additionally handle any other kind of ancillary data via GstAncillaryMeta, leaving interpretation and handling of the concrete payload to the application or other elements.

This new behaviour was added in this MR, which is part of git main now, and can be enabled via the output-vanc properties on the video source / sink elements.

The same was already supported before by the plugin for AJA capture/output cards.

For example, the following pipeline can be used to forward an SDI stream from a DeckLink card to an AJA card:

gst-launch-1.0 decklinkvideosrc output-vanc=true ! queue ! combiner.video \
  decklinkaudiosrc ! queue ! combiner.audio \
  ajasinkcombiner name=combiner ! ajasink handle-ancillary-meta=true

With both the AJA and DeckLink sink elements, special care is needed not to output closed captions twice: both sinks can retrieve them from GstVideoClosedCaptionMeta as well as GstAncillaryMeta, and outputting them from both will likely lead to problems at the consumer of the output.



While working on other ancillary data related features in GStreamer (more on that some other day), I noticed that we didn't have support for sending or receiving ancillary data via RTP in GStreamer, despite there being a quite simple RTP mapping defined in RFC 8331, which is also used as part of ST 2110.

The new rtpsmpte291pay RTP payloader and rtpsmpte291depay depayloader can be found in this MR for gst-plugins-rs, which should be merged in the next few days.

The new elements pass the SMPTE ST 291-1 ancillary data as ST 2038 streams through the pipeline. ST 2038 streams can be directly extracted from or stored in MXF or MPEG-TS containers, can be extracted from or inserted into SDI streams with the AJA or Blackmagic DeckLink sources/sinks, or can be handled generically by the ST 2038 elements from the rsclosedcaption plugin.

For example, the following pipeline can be used to convert an SRT subtitle file to CEA-708 closed captions, which are then converted to an ST 2038 stream and sent over RTP:

$ gst-launch-1.0 filesrc location=file.srt ! subparse ! \
    tttocea708 ! closedcaption/x-cea-708,framerate=30/1 ! ccconverter ! \
    cctost2038anc ! rtpsmpte291pay ! \
    udpsink host=123.123.123.123 port=45678
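On the receiving side, a corresponding untested sketch could look like the following. The caps on udpsrc are an assumption based on RFC 8331's video/smpte291 media type and 90 kHz clock rate, so check the payloader's negotiated caps (run the sender with -v) for the exact values:

$ gst-launch-1.0 udpsrc port=45678 \
    caps="application/x-rtp,media=(string)video,encoding-name=(string)SMPTE291,clock-rate=(int)90000" ! \
    rtpjitterbuffer ! rtpsmpte291depay ! fakesink dump=true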

Now you might be wondering how ST 291-1 and ST 2038 are related to each other and what ST 2038 has to do with RTP.

ST 291-1 is the basic standard that defines the packet format for ancillary packets as e.g. transmitted over SDI. ST 2038, on the other hand, defines a mechanism for packaging ST 291-1 into MPEG-TS, and in addition to the plain ST 291-1 packets it provides some additional information like the line number on which the ST 291-1 packet is to be stored. RFC 8331 defines a similar mapping just for RTP; apart from one field it provides exactly the same information, so conversion between the two formats is relatively simple.

Using ST 2038 as the generic ancillary data stream format in GStreamer seemed like the pragmatic choice here: GStreamer already had support for handling ST 2038 streams in various elements, a set of helper elements for ST 2038 streams, and e.g. GStreamer's MXF ANC support (ST 436) also uses ST 2038 as its stream format.



For one of our recent projects, we worked on adding multitrack audio capabilities to the GStreamer FLV plugin following the Enhanced RTMP (v2) specification. All changes are now merged upstream (see MR 9682).

Enhanced RTMP

As the name suggests, this is an enhancement to the RTMP (and FLV) specifications. The latest version was released earlier this year and is aimed at meeting the technical requirements of current and future online media broadcasting, which include:

  • Contemporary audio/video codecs (HEVC, AV1, Opus, FLAC, etc.)
  • Multitrack capabilities (for concurrent management and processing)
  • Connection stability and resilience
  • and more

FLV and RTMP in GStreamer

The existing FLV and RTMP2 plugins followed the previous versions of the RTMP/FLV specifications, so they could handle at most one video and one audio track at a time. This is where most of the work was needed: adding the ability to handle multiple tracks.

Multitrack Audio

We considered a couple of options for adding multitrack audio and enhanced FLV capabilities:

  • Write completely new element(s), preferably in Rust, or
  • Extend the current FLV muxer and demuxer elements

Writing a fresh set of elements from scratch, perhaps even in Rust, would potentially have made it easier to accommodate newer versions of the specification. But the second option, extending the existing FLV muxer/demuxer elements, turned out to be simpler.

Problems to Solve

So, at a high level, we had two problems to solve:

  1. Handle multiple tracks

    As mentioned above, the FLV and RTMP plugins were equipped to handle only one audio and one video track. So we needed to add support for handling multiple audio and video tracks.

  2. Maintain backwards compatibility

    There should be no breakage in any existing applications that stream using the legacy FLV format. So, the muxer needs a mechanism to decide whether a given audio input needs to be written into the FLV container in the enhanced format or the legacy format.

A two-step solution

We arrived at a two-step solution for the implementation of multiple track handling:

  1. Use the audio template pad only for the legacy format and define a new audio_%u template for the enhanced format. That makes it clear which stream needs to be written as a legacy FLV track or an enhanced FLV track. The index of the audio_%u pads is also used as the track ID when writing enhanced FLV.

  2. Derive a new element, called eflvmux, from the existing FLV muxer, which defines the new audio_%u pad templates. The old flvmux will continue to support only the legacy codec/format. That way, existing applications that use flvmux for legacy FLV streaming will not face any conflicts while requesting pads.

Minor Caveat

Note that applications that use eflvmux need to specify the correct pad template name (audio or audio_%u) when requesting sink pads to ensure that the input audio data is written to the correct FLV track (legacy or enhanced).

Some formats such as MP3 and AAC are supported in both legacy and enhanced tracks, so we can't just auto-detect the right thing to do.

Interoperability issues

An interesting thing we noticed while testing multitrack audio streaming with Twitch.tv is that when we tried to stream multiple enhanced FLV tracks, or a mix of a single legacy track and one or more enhanced FLV tracks, none of the combinations worked.

On the other hand, OBS was able to stream multitrack audio just fine to the same endpoint. Dissecting the RTMP packets sent out by OBS revealed that Twitch can accept at most two tracks, one legacy and one enhanced, and the enhanced FLV track's ID needs to be a non-zero value. To our knowledge, this is not documented anywhere.

It was a simple matter of track ID semantics which could easily be missed without referring to the OBS Studio code. As we recently noticed, the same applies to FFmpeg.

So we requested clarification on the track ID semantics from the Enhanced RTMP specification maintainers and got confirmation that 0 remains a valid value for the track ID. As mentioned in the specification, it can be used to represent the highest priority track or the default track.

However, when streaming to servers like Twitch, you may need to take care to request only pads with an index greater than 0 from eflvmux, because the server may not accept tracks with ID 0.

Sample Pipelines to test

During the implementation I tested the muxer and demuxer with a number of sample pipelines.
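As an untested sketch, an audio-only pipeline with one legacy AAC track and one enhanced AAC track could look like the following; note the different pad names (audio for the legacy track, audio_1 for the enhanced track), and that avenc_aac is just one of several possible AAC encoders:

gst-launch-1.0 eflvmux name=mux ! filesink location=multitrack.flv \
    audiotestsrc num-buffers=500 ! audioconvert ! avenc_aac ! aacparse ! queue ! mux.audio \
    audiotestsrc num-buffers=500 freq=880 ! audioconvert ! avenc_aac ! aacparse ! queue ! mux.audio_1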

Scope for other features

The FLV muxer and demuxer have undergone significant structural changes in order to support multiple audio tracks. This should make it easy to update the existing multitrack video capability merge request as well as add support for advanced codecs listed in the specification, some of which (like H265 and AV1) are already in progress.

There is also a work-in-progress merge request to add the eRTMP related support to the rtmp2 plugin.

P.S.: You can also refer to my talk on this topic at the GStreamer Conference that took place in London last month. The recording will soon be published on Ubicast.