Centricular

Expertise, Straight from the Source




Devlog


Back in June '25, I implemented a new speech synthesis element using the ElevenLabs API.

In this post I will briefly explain some of the design choices I made, and provide one or two usage examples.

POST vs. WSS

ElevenLabs offers two interfaces for speech synthesis:

  • Either open a websocket and feed the service small chunks of text (e.g. words) to receive a continuous audio stream

  • Or POST longer segments of text to receive independent audio fragments

The websocket API is well-adapted to conversational use cases and can offer the lowest latency, but it isn't well suited to the use cases I was targeting: my goal was to synthesize audio from text that was first transcribed, then translated from an original input audio stream.

In this situation we have two constraints we need to be mindful of:

  • For translation purposes we need to construct large enough text segments prior to translating, in order for the translation service to operate with enough context to do a good job.

  • Once audio has been synthesized, we might also need to resample it in order to have it fit within the original duration of the speech.

Given that:

  • The latency benefits from using the websocket API are largely negated by the larger text segments we would use as the input

  • Resampling the continuous stream we would receive to make sure individual words are time-shifted back to the "correct" position, while possible thanks to the sync_alignment option, would have increased the complexity of the resulting element

I chose to use the POST API for this element. We might still choose to implement a websocket-based version if there is a good story for using GStreamer in a conversational pipeline, but that is not on my radar for now.

Additionally, we already have a speech synthesis element around the AWS Polly API which is also POST-based, so both elements can share a similar design.

Audio resampling

As mentioned previously, the ElevenLabs API does not offer direct control over the duration of the output audio.

For instance, you might be dubbing speech from a fast speaker with a slow voice, potentially causing the output audio to drift out of sync.

To address this, the element can optionally make use of signalsmith_stretch to resample the audio in a pitch-preserving manner.

When the element is built with that feature enabled, resampling can be activated through the overflow=compress property.

The effect can sometimes be pretty jarring for very short input, so an extra property, max-overflow, is also exposed to allow some tolerance for drift. It represents the maximum duration by which the audio output is allowed to drift out of sync, and the element does a good job of using up intervals of silence between utterances to catch back up.
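To make the idea concrete, here is a toy sketch of the kind of bookkeeping this implies. This is purely illustrative (the function name, units, and exact policy are invented, not the element's actual code): an utterance is only time-compressed when outputting it unchanged would push the accumulated drift past the allowed maximum, and silence intervals pay accumulated drift back down.

```python
def plan_overflow(synth_ms, slot_ms, drift_ms, max_overflow_ms):
    """Decide how much to time-compress one synthesized utterance.

    synth_ms: duration of the synthesized audio
    slot_ms: duration of the original speech it replaces
    drift_ms: how far out of sync the output already is
    max_overflow_ms: the tolerated drift (the max-overflow property)

    Returns (speed_ratio, new_drift_ms); a ratio > 1.0 means the audio
    must be compressed (pitch-preserved) before being output.
    """
    projected = drift_ms + (synth_ms - slot_ms)
    if projected <= max_overflow_ms:
        # Within tolerance: output unchanged. Utterances shorter than
        # their slot (i.e. intervals of silence) reduce the drift.
        return 1.0, max(0.0, projected)
    # Compress just enough to land exactly at the allowed maximum drift.
    budget_ms = slot_ms + max_overflow_ms - drift_ms
    return synth_ms / budget_ms, float(max_overflow_ms)
```

With max-overflow=0 every utterance is squeezed into its own slot; with a larger value, short overruns ride on accumulated silence instead of being audibly sped up.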

Voice cloning

The ElevenLabs API exposes a pretty powerful feature, Instant Voice Cloning. It can be used to create a custom voice that will sound very much like a reference voice, requiring only a handful of seconds to a few minutes of reference audio data to produce useful results.

Using the multilingual model, that newly-cloned voice can even be used to generate convincing speech in a different language.

A typical pipeline for my target use case can be represented as (pseudo gst-launch):

input_audio_src ! transcriber ! translator ! synthesizer

When using a transcriber element such as speechmaticstranscriber, speaker "diarization" (a fancy word for working out who spoke when) can be used to determine when a given speaker was speaking, thus making it possible to clone voices even in a multi-speaker situation.

The challenge in this situation however is that the synthesizer element doesn't have access to the original audio samples, as it only deals with text as the input.

I thus decided on the following solution:

input_audio_src ! voicecloner ! transcriber ! .. ! synthesizer

The voice cloner element will accumulate audio samples, then upon receiving custom upstream events from the transcriber element with information about speaker timings it will start cloning voices and trim its internal sample queue.

To be compatible, a transcriber simply needs to send the appropriate events upstream. The speechmaticstranscriber element can be used as a reference.

Finally, once a voice clone is ready, the cloner element sends another event downstream with a mapping of speaker id to voice id. The synthesizer element can then intercept the event and start using the newly-created voice clone.
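In pure-Python pseudocode, the cloner's bookkeeping might look like this. This is an illustrative sketch only (class and method names are invented; the real element queues raw audio buffers and exchanges GStreamer events, and the clone call is the ElevenLabs Instant Voice Cloning API):

```python
class VoiceClonerSketch:
    def __init__(self, clone_api):
        self.clone_api = clone_api  # stand-in for the voice cloning API call
        self.queue = []             # accumulated (timestamp, samples)
        self.voices = {}            # speaker id -> cloned voice id

    def push_audio(self, ts, samples):
        self.queue.append((ts, samples))

    def on_speaker_timing(self, speaker, start, end):
        """Handle the custom upstream event sent by the transcriber."""
        if speaker not in self.voices:
            audio = [s for ts, s in self.queue if start <= ts < end]
            self.voices[speaker] = self.clone_api(audio)
        # Trim everything the event has covered; it is no longer needed.
        self.queue = [(ts, s) for ts, s in self.queue if ts >= end]
        # In the element, this mapping travels downstream as another event.
        return {speaker: self.voices[speaker]}
```

The key point is the trimming step: the internal sample queue only grows until the transcriber tells the cloner which time ranges belong to which speaker.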

The cloner element can also be used in single-speaker mode by just setting the speaker property to some identifier and watching for messages on the bus:

gst-launch-1.0 -m -e alsasrc ! audioconvert ! audioresample ! queue ! elevenlabsvoicecloner api-key=$ELEVENLABS_API_KEY speaker="Mathieu" ! fakesink

Putting it all together

At this year's GStreamer conference I gave a talk where I demo'd these new elements.

This is the pipeline I used then:

AWS_ACCESS_KEY_ID="XXX" AWS_SECRET_ACCESS_KEY="XXX" gst-launch-1.0 uridecodebin uri=file:///home/meh/Videos/spanish-convo-trimmed.webm name=ud \
  ud. ! queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! clocksync ! autovideosink \
  ud. ! audioconvert ! audioresample ! clocksync ! elevenlabsvoicecloner api-key=XXX ! \
    speechmaticstranscriber url=wss://eu2.rt.speechmatics.com/v2 enable-late-punctuation-hack=false join-punctuation=false api-key="XXX" max-delay=2500 latency=4000 language-code=es diarization=speaker ! \
    queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! textaccumulate latency=3000 drain-on-final-transcripts=false extend-duration=true ! \
    awstranslate latency=1000 input-language-code="es-ES" output-language-code="en-EN" ! \
    elevenlabssynthesizer api-key=XXX retry-with-speed=false overflow=compress latency=3000 language-code="en" voice-id="iCKVfVbyCo5AAswzTkkX" model-id="eleven_multilingual_v2" max-overflow=0 ! \
    queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! audiomixer name=m ! autoaudiosink audiotestsrc volume=0.03 wave=violet-noise ! clocksync ! m.

Watch my talk for the result, or try it yourself (you will need API keys for speechmatics / AWS / elevenlabs)!



As part of our ongoing efforts to extend GStreamer's support for ancillary data, I've recently improved the ancillary data handling in the Blackmagic DeckLink plugin. This plugin can be used to capture or output SDI/HDMI/ST2110 streams with Blackmagic DeckLink capture/output cards.

Previously, only CEA 608/708 closed captions and AFD/Bar ancillary data were handled in that plugin. Now it can additionally handle any other kind of ancillary data via GstAncillaryMeta and leave interpretation or handling of the concrete payload to the application or other elements.

This new behaviour was added in this MR, which is part of git main now, and can be enabled via the output-vanc properties on the video source / sink elements.

The same was already supported before by the plugin for AJA capture/output cards.

For example, the following pipeline can be used to forward an SDI stream from one DeckLink card to an AJA card:

gst-launch-1.0 decklinkvideosrc output-vanc=true ! queue ! combiner.video \
  decklinkaudiosrc ! queue ! combiner.audio \
  ajasinkcombiner name=combiner ! ajasink handle-ancillary-meta=true

With both the AJA and DeckLink sink elements, special care is needed to not e.g. output closed captions twice. Both sinks can retrieve them from GstVideoClosedCaptionMeta and GstAncillaryMeta, and outputting from both will likely lead to problems at the consumer of the output.
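As a toy illustration of the double-output concern, application-side logic could filter caption-carrying ancillary packets when captions are already emitted from the caption meta. This is a hedged sketch of the idea, not what the sinks do internally; DID 0x61 is the data identifier SMPTE ST 334 assigns to CDP/CEA-708 packets.

```python
CDP_DID = 0x61  # SMPTE ST 334 DID for CDP (CEA-708) caption packets

def anc_to_output(anc_packets, captions_from_caption_meta):
    """If captions are already output from GstVideoCaptionMeta, drop the
    caption-carrying ancillary packets so they aren't emitted twice."""
    if not captions_from_caption_meta:
        return anc_packets
    return [p for p in anc_packets if p["did"] != CDP_DID]
```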



While working on other ancillary-data-related features in GStreamer (more on that some other day), I noticed that we didn't have support for sending or receiving ancillary data via RTP in GStreamer, despite it being a quite simple RTP mapping, defined in RFC 8331 and used as part of ST 2110.

The new rtpsmpte291pay RTP payloader and rtpsmpte291depay depayloader can be found in this MR for gst-plugins-rs, which should be merged in the next few days.

The new elements pass the SMPTE ST 291-1 ancillary data as ST 2038 streams through the pipeline. ST 2038 streams can be directly extracted from or stored in MXF or MPEG-TS containers, can be extracted or inserted into SDI streams with the AJA or Blackmagic Decklink sources/sinks, or can be handled generically by the ST 2038 elements from the rsclosedcaption plugin.

For example the following pipeline can be used to convert an SRT subtitle file to CEA-708 closed captions, which are then converted to an ST 2038 stream and sent over RTP:

$ gst-launch-1.0 filesrc location=file.srt ! subparse ! \
    tttocea708 ! closedcaption/x-cea-708,framerate=30/1 ! ccconverter ! \
    cctost2038anc ! rtpsmpte291pay ! \
    udpsink host=123.123.123.123 port=45678

Now you might be wondering how ST 291-1 and ST 2038 are related to each other and what ST 2038 has to do with RTP.

ST 291-1 is the basic standard that defines the packet format for ancillary packets as e.g. transmitted over SDI. ST 2038, on the other hand, defines a mechanism for packaging ST 291-1 into MPEG-TS, and in addition to the plain ST 291-1 packets it provides some additional information, like the line number on which the ST 291-1 packet is to be stored. RFC 8331 defines a similar mapping just for RTP; apart from one field it provides exactly the same information, so conversion between the two formats is relatively simple.
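To illustrate how close the two mappings are, here is a rough sketch of the shared per-packet fields. The dictionary keys follow RFC 8331's field names; the Python structure itself is a simplification I made up for illustration (on the wire the user data words are 10-bit values carrying parity bits, not plain bytes):

```python
from dataclasses import dataclass

@dataclass
class AncPacket:
    c_not_y: bool   # chroma or luma channel
    line: int       # video line the packet belongs on
    offset: int     # horizontal offset within the line
    did: int        # data identifier
    sdid: int       # secondary data identifier
    udw: bytes      # user data words (simplified 8-bit view)

def to_rfc8331(pkt: AncPacket, stream_num: int = 0) -> dict:
    """The mapping is essentially field for field; StreamNum is the one
    RFC 8331 field with no direct ST 2038 equivalent."""
    return {"C": pkt.c_not_y, "Line_Number": pkt.line,
            "Horizontal_Offset": pkt.offset, "StreamNum": stream_num,
            "DID": pkt.did, "SDID": pkt.sdid, "UDW": pkt.udw}
```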

Using ST 2038 as generic ancillary data stream format in GStreamer seemed like the pragmatic choice here. GStreamer already had support for handling ST 2038 streams in various elements, a set of helper elements to handle ST 2038 streams, and e.g. GStreamer's MXF ANC support (ST 436) also uses ST 2038 as stream format.



For one of our recent projects, we worked on adding multitrack audio capabilities to the GStreamer FLV plugin following the Enhanced RTMP (v2) specification. All changes are now merged upstream (see MR 9682).

Enhanced RTMP

As the name suggests, this is an enhancement to the RTMP (and FLV) specifications. The latest version was released earlier this year and is aimed at meeting the technical standards of current and future online media broadcasting requirements, which include:

  • Contemporary audio/video codecs (HEVC, AV1, Opus, FLAC, etc.)
  • Multitrack capabilities (for concurrent management and processing)
  • Connection stability and resilience
  • and more

FLV and RTMP in GStreamer

The existing FLV and RTMP2 plugins followed the previous versions of the RTMP/FLV specifications, so they could handle at most one video and one audio track at a time. This is where most of the work was needed, to add the ability to handle multiple tracks.

Multitrack Audio

We considered a couple of options for adding multitrack audio and enhanced FLV capabilities:

  • Write completely new element(s), preferably in Rust (or)
  • Extend the current FLV muxer and demuxer elements

Writing a fresh set of elements from scratch, perhaps even in Rust, would potentially have made it easier to accommodate newer versions of the specification. But the second option, extending the existing FLV muxer/demuxer elements, turned out to be simpler.

Problems to Solve

So, at a high level, we had two problems to solve:

  1. Handle multiple tracks

    As mentioned above, the FLV and RTMP plugins were equipped to handle only one audio and one video track. So we needed to add support for handling multiple audio and video tracks.

  2. Maintain backwards compatibility

    There should be no breakage in any existing applications that stream using the legacy FLV format. So, the muxer needs a mechanism to decide whether a given audio input needs to be written into the FLV container in the enhanced format or the legacy format.

A two-step solution

We arrived at a two-step solution for the implementation of multiple track handling:

  1. Use the audio template pad only for the legacy format and define a new audio_%u template for the enhanced format. That makes it clear which stream needs to be written as a legacy FLV track or an enhanced FLV track. The index of the audio_%u pads is also used as the track ID when writing enhanced FLV.

  2. Derive a new element from the existing FLV muxer called eflvmux, which defines the new audio_%u pad templates. The old flvmux will continue to support only the legacy codec/format. That way, the existing applications that use flvmux for legacy FLV streaming will not face any conflicts while requesting the pads.

Minor Caveat

Note that applications that use eflvmux need to specify the correct pad template name (audio or audio_%u) when requesting sink pads to ensure that the input audio data is written to the correct FLV track (legacy or enhanced).

Some formats such as MP3 and AAC are supported in both legacy and enhanced tracks, so we can't just auto-detect the right thing to do.
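A hypothetical helper makes that decision concrete. Note the codec sets below are illustrative assumptions of mine, except MP3/AAC, which the post names as supported in both formats:

```python
# MP3 and AAC are valid in both legacy and enhanced FLV (per the post),
# so the application must choose; the enhanced-only set is an assumption.
AMBIGUOUS = {"mp3", "aac"}
ENHANCED_ONLY = {"opus", "flac"}

def request_pad_name(codec, enhanced, track_id=0):
    """Pick the eflvmux pad template: 'audio' (legacy) or 'audio_%u'
    (enhanced); the pad index doubles as the enhanced FLV track ID."""
    if codec in ENHANCED_ONLY or (codec in AMBIGUOUS and enhanced):
        return f"audio_{track_id}"
    return "audio"
```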

Interoperability issues

An interesting thing we noticed while testing multitrack audio streaming with Twitch.tv is that none of the combinations we tried worked, whether multiple enhanced FLV tracks or a mix of a single legacy track and one or more enhanced FLV tracks.

On the other hand, OBS was able to stream multitrack audio just fine to the same endpoint. Dissecting the RTMP packets sent out by OBS revealed that Twitch can accept at most two tracks, one legacy and one enhanced, and the enhanced FLV track's ID needs to be a non-zero value. To our knowledge, this is not documented anywhere.

It was a simple matter of track ID semantics that could easily be missed without referring to the OBS Studio code. As we recently noticed, the same applies to FFmpeg.

So we have requested a clarification on the track ID semantics from the enhanced RTMP specification maintainers and got a confirmation that 0 remains a valid value for track ID. As mentioned in the specification, it can be used to represent the highest priority track or the default track.

However, when streaming to servers like Twitch you may need to take care to request only pads with index greater than 0 from eflvmux because it may not accept tracks with ID 0.

Sample Pipelines to test

Here are some sample pipelines I used for testing the muxer and demuxer during the implementation.

Scope for other features

The FLV muxer and demuxer have undergone significant structural changes in order to support multiple audio tracks. This should make it easy to update the existing multitrack video capability merge request as well as add support for advanced codecs listed in the specification, some of which (like H265 and AV1) are already in progress.

There is also a work-in-progress merge request to add the eRTMP related support to the rtmp2 plugin.

P.S.: You can also refer to my talk on this topic at the GStreamer Conference that took place in London last month. The recording will soon be published on Ubicast.



At the GStreamer project, we produce SDKs for lots of platforms: Linux, Android, macOS, iOS, and Windows. However, as we port more and more plugins to Rust 🦀, we are finding ourselves backed into a corner.

Rust static libraries are simply too big.

To give you an example, the AWS folks changed their SDK back in March to switch their cryptographic toolkit over to their aws-lc-rs crate [1]. However, that causes a 2-10x increase in code size (bug reports here and here), which gets duplicated on every plugin that makes use of their ecosystem!

What are Rust staticlibs made of?

To summarise, each Rust plugin packs a copy of its dependencies, plus a copy of the Rust standard library. By their very nature, shared libraries and executables don't suffer from this, but on static libraries it causes several issues.

First approach: Single-Object Prelinking

I won't bore you with the details as I've written another blog post on the subject; the gist is that you can unpack the library, and then ask the linker to perform "partial linking" or "relocatable linking" (Linux term) or "Single-Object Prelinking" (the Apple term, which I'll use throughout the post) over the object files. Setting which symbols you want to be visible for downstream consumers lets dead-code elimination take place at the plugin level, ensuring your libraries are now back to a reasonable size.
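A toy model shows why hiding symbols helps: once only the plugin's entry points remain exported, anything unreachable from them can be discarded. (Illustrative only; real linkers work on sections and relocations, not a neat call graph, and the names below are invented.)

```python
def live_symbols(exported, calls):
    """exported: symbols kept visible after prelinking.
    calls: {symbol: [symbols it references]}.
    Returns the set a dead-code-eliminating linker would keep."""
    keep, todo = set(), list(exported)
    while todo:
        sym = todo.pop()
        if sym not in keep:
            keep.add(sym)
            todo.extend(calls.get(sym, []))
    return keep
```

Everything not in the returned set, e.g. unused parts of a dependency or of the Rust standard library, can then be dropped from the plugin.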

Why is it not enough?

Single-Object Prelinking has two drawbacks:

  • Unoptimized code: the linker won't be able to deduplicate functions between melded objects, as they've been hidden by the prelinking process.
  • Windows: there are no officially supported tools (read: Visual Studio, LLVM, GCC) to perform this at the compiler level. It is possible to do this with binutils, but the PE-COFF format doesn't allow changing the visibility of unexported functions.

Melt all the object files with the power of dragons' fire!

As said earlier, no tools on Windows support prelinking officially yet, but there's another thing we can do: library deduplication.

Thanks to Rust's comprehensive crate ecosystem, I wrote a new CLI tool which I called dragonfire. Given a complete Rust workspace or list of static libraries, dragonfire:

  1. reads all the static libraries in one pass
  2. deduplicates the object files inside them based on their size and naming (Rust has its own, unique naming convention for object files -- pretty useful!)
  3. copies the duplicate objects into a new static library (usually called gstrsworkspace as its primary use is for the GStreamer ecosystem)
  4. removes the duplicates from the rest of the libraries
  5. updates the symbol table in each of the libraries with the bundled LLVM tools

Thanks to the ar crate, the unpacking and writing only happen at stage 3, ensuring no wasteful I/O slowdowns take place. The llvm-tools-preview component in turn takes care of locating and calling up llvm-ar for updating the workspace's symbol tables.

A special mention is deserved by the object files' naming convention. Given a Rust staticlib named libfoo, its object files will be named as follows:

  • crate_name-hash1.crate_name.hash2-cgu.nnn.rcgu.o
  • On Windows only: foo.crate_name-hash1.crate_name.hash2-cgu.nnn.rcgu.o
  • On non-Windows platforms: same as above, but replacing foo with libfoo-hash

In all cases, crate_name means a dependency present somewhere in the workspace tree, and nnn is a number that will be bigger than zero whenever -C codegen-units was set to higher than 1.
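A sketch of how that convention can be turned into a deduplication key (hedged: dragonfire's actual matching may differ, and the hashes in the test names are invented):

```python
import re

# Matches "crate_name-hash1.crate_name.hash2-cgu.nnn.rcgu.o", skipping any
# per-library prefix ("foo." on Windows, "libfoo-hash." elsewhere).
RCGU_TAIL = re.compile(r"[^.]+-[0-9a-f]+\.[^.]+\.[0-9a-f]+-cgu\.\d+\.rcgu\.o$")

def dedup_key(obj_name, size):
    """Object files agreeing on the crate-level name tail and on file size
    are considered duplicates across static libraries; anything that does
    not match the convention must be left where it is."""
    m = RCGU_TAIL.search(obj_name)
    return (m.group(0), size) if m else None
```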

For dragonfire's purposes, dropping the library prefix is enough to be able to deduplicate object files; however, on Windows we can also find import library stubs, which LLVM can generate on its own through the raw-dylib link kind (#[link(..., kind = "raw-dylib")]) [2]. Import stubs can have any extension, e.g. .dll, .exe and .sys (the latter two coming from private Win32 APIs). These stubs cannot be deduplicated as they are generated individually per imported function, so dragonfire must preserve them where they are.

Drawbacks of object file deduplication

This approach, too, has several disadvantages. On Apple platforms, deduplicating libraries triggers a strange linker error, which I've not seen before:

ld: multiple errors: compact unwind must have at least 1 fixup in '<framework>/GStreamer[arm64][1021](libgstrsworkspace_a-3f2b47962471807d-lse_ldset4_acq.o)'; r_symbolnum=-19 out of range in '<framework>/GStreamer[arm64][1022](libgstrsworkspace_a-compiler_builtins-350c23344d78cfbc.compiler_builtins.5e126dca1f5284a9-cgu.162.rcgu.o)'

This also led me to find that Rust libraries were packing bitcode, which is forbidden by Apple. (This was thankfully already fixed before shipping time, but we've not yet updated our Rust minimum version to take advantage of it.)

Another drawback is that Rust's implementation of LTO causes dead-code elimination at the crate level, as opposed to the workspace level. This makes object file deduplication impossible, as each copy is different.

For the Windows platform, there is an extra drawback which specifically affects object files produced by LLVM: the COMDAT sections are set to IMAGE_COMDAT_SELECT_NODUPLICATES. This means that the linker will outright reject functions with multiple definitions, rather than realising they're all duplicates and discarding all but one of the copies. MSVC in particular performs symbol resolution before dead-code elimination, so linking will fail because of unresolved symbols before dead-code elimination kicks in; to use deduplicated libraries, one must set the linker flags /OPT:REF /FORCE:UNRESOLVED to ensure the dead code can be successfully eliminated.

Results

With library deduplication, we can make static libraries up to 44x smaller when building under MSVC [3] (see the tables below for the full comparison):

  • gstaws.lib: from 173M to 71M (~2.5x)
  • gstrswebrtc.lib: from 193M to 66M (~2.9x)
  • gstwebrtchttp.lib: from 66M to 1,5M (~44x)
Table: before and after melding under MSVC
file no prelinking melded
gstaws.lib 173M 71M
gstcdg.lib 36M 572K
gstclaxon.lib 32M 568K
gstdav1d.lib 34M 936K
gstelevenlabs.lib 59M 1008K
gstfallbackswitch.lib 37M 2,3M
gstffv1.lib 34M 744K
gstfmp4.lib 39M 3,2M
gstgif.lib 34M 1,1M
gstgopbuffer.lib 30M 456K
gsthlsmultivariantsink.lib 46M 1,6M
gsthlssink3.lib 41M 1,2M
gsthsv.lib 34M 796K
gstjson.lib 31M 704K
gstlewton.lib 33M 1,2M
gstlivesync.lib 33M 728K
gstmp4.lib 38M 2,2M
gstmpegtslive.lib 31M 704K
gstndi.lib 38M 2,8M
gstoriginalbuffer.lib 34M 376K
gstquinn.lib 75M 23M
gstraptorq.lib 33M 2,4M
gstrav1e.lib 46M 11M
gstregex.lib 38M 404K
gstreqwest.lib 58M 1,4M
gstrsanalytics.lib 35M 1000K
gstrsaudiofx.lib 54M 22M
gstrsclosedcaption.lib 52M 8,4M
gstrsinter.lib 35M 604K
gstrsonvif.lib 46M 2,0M
gstrspng.lib 35M 1,2M
gstrsrtp.lib 59M 11M
gstrsrtsp.lib 57M 4,4M
gstrstracers.lib 40M 2,4M
gstrsvideofx.lib 48M 11M
gstrswebrtc.lib 193M 66M
gstrsworkspace.lib N/A 137M
gststreamgrouper.lib 30M 376K
gsttextahead.lib 30M 332K
gsttextwrap.lib 32M 2,1M
gstthreadshare.lib 52M 12M
gsttogglerecord.lib 35M 808K
gsturiplaylistbin.lib 31M 648K
gstvvdec.lib 34M 564K
gstwebrtchttp.lib 66M 1,5M

The results from the melding above can be compared with the file sizes obtained using LTO on Windows [4] (remember it doesn't actually fix linking against plugins):

  • gstaws.lib: from 71M (LTO) to 67M (melded) (-5.6%)
  • gstrswebrtc.lib: from 105M to 66M (-37.1%)
  • gstwebrtchttp.lib: from 28M to 1,5M (-94.6%)
Table: before and after LTO under MSVC (no melding involved)
file (codegen-units=1 in all cases) no prelinking lto=thin opt-level=s + lto=thin debug=1 + opt-level=s debug=1 + lto=thin + opt-level=s
old/gstaws.lib 199M 199M 171M 78M 67M
old/gstcdg.lib 11M 11M 11M 7,5M 7,5M
old/gstclaxon.lib 11M 11M 11M 7,7M 7,7M
old/gstdav1d.lib 12M 12M 12M 7,9M 7,8M
old/gstelevenlabs.lib 52M 52M 49M 24M 22M
old/gstfallbackswitch.lib 18M 18M 17M 11M 11M
old/gstffv1.lib 11M 11M 11M 7,6M 7,6M
old/gstfmp4.lib 20M 20M 19M 12M 11M
old/gstgif.lib 12M 12M 12M 7,9M 7,9M
old/gstgopbuffer.lib 9,7M 9,7M 9,7M 7,5M 7,4M
old/gsthlsmultivariantsink.lib 16M 16M 16M 9,6M 9,4M
old/gsthlssink3.lib 14M 14M 14M 8,9M 8,8M
old/gsthsv.lib 11M 11M 11M 7,8M 7,7M
old/gstjson.lib 12M 12M 12M 8,4M 8,2M
old/gstlewton.lib 12M 12M 12M 8,1M 8,1M
old/gstlivesync.lib 12M 12M 12M 8,3M 8,2M
old/gstmp4.lib 17M 17M 17M 9,9M 9,7M
old/gstmpegtslive.lib 12M 12M 12M 8,0M 7,9M
old/gstndi.lib 21M 21M 20M 12M 11M
old/gstoriginalbuffer.lib 9,6M 9,6M 9,7M 7,4M 7,3M
old/gstquinn.lib 94M 94M 86M 39M 35M
old/gstraptorq.lib 18M 18M 17M 9,8M 9,4M
old/gstrav1e.lib 39M 39M 37M 19M 18M
old/gstregex.lib 26M 26M 25M 14M 14M
old/gstreqwest.lib 53M 53M 49M 24M 22M
old/gstrsanalytics.lib 15M 15M 14M 9,2M 8,9M
old/gstrsaudiofx.lib 57M 57M 56M 23M 22M
old/gstrsclosedcaption.lib 40M 40M 36M 20M 18M
old/gstrsinter.lib 14M 14M 13M 8,5M 8,4M
old/gstrsonvif.lib 21M 21M 20M 11M 11M
old/gstrspng.lib 13M 13M 13M 8,2M 8,2M
old/gstrsrtp.lib 47M 47M 44M 22M 20M
old/gstrsrtsp.lib 35M 35M 33M 16M 15M
old/gstrstracers.lib 28M 28M 27M 16M 15M
old/gstrsvideofx.lib 16M 16M 35M 9,2M 15M
old/gstrswebrtc.lib 329M 329M 284M 124M 105M
old/gststreamgrouper.lib 9,6M 9,6M 9,7M 7,2M 7,2M
old/gsttextahead.lib 9,6M 9,6M 9,5M 7,4M 7,3M
old/gsttextwrap.lib 13M 13M 13M 8,4M 8,4M
old/gstthreadshare.lib 49M 49M 45M 23M 20M
old/gsttogglerecord.lib 13M 13M 13M 8,5M 8,4M
old/gsturiplaylistbin.lib 11M 11M 11M 7,9M 7,9M
old/gstvvdec.lib 11M 11M 11M 7,5M 7,5M
old/gstwebrtchttp.lib 69M 69M 63M 30M 28M

Conclusion

This article presents several longstanding pain points in Rust, namely staticlib binary sizes, symbol leaking, and incompatibilities between Rust and MSVC. I demonstrated dragonfire, a tool that aims to address or work around these issues where possible, and outlined the issues that remain.

As explained earlier, dragonfire-treated libraries are live on all platforms except Apple's if you use the development packages from mainline; it's hopefully on track for the 1.28 release of GStreamer. There's already a merge request pending to enable it for Apple platforms; we're only waiting to update the Rust minimum version.

If you want to have a look, dragonfire's source code is available at Freedesktop's GitLab instance. Please note that at the moment I have no plans to submit this to crates.io.

Feel free to contact me with any feedback, and thanks for reading!


  1. See its default-https-client feature at lib.rs, you will find it throughout the AWS SDK ecosystem. ↩︎

  2. https://doc.rust-lang.org/reference/items/external-blocks.html#dylib-versus-raw-dylib ↩︎

  3. In all cases the -C flags are debug=1 + codegen-units=1 + opt-level=s; see this comment for the complete results across all platforms. ↩︎

  4. Source: https://gitlab.freedesktop.org/gstreamer/cerbero/-/merge_requests/1895 ↩︎



Over the past few years, we've been slowly working on improving the platform-specific plugins for Windows, macOS, iOS, and Android, and making them work as well as the equivalent plugins on Linux. In this episode, we will look at audio device switching in the source and sink elements on macOS and Windows.

On Linux, if you're using the PulseAudio elements (both with the PulseAudio daemon and PipeWire), you get perfect device switching: quick, seamless, easy, and reliable. Simply set the device property whenever you want and you're off to the races. If the device gets unplugged, the pipeline will continue, and you will get notified of the unplug via the GST_MESSAGE_DEVICE_REMOVED bus message from GstDeviceMonitor so you can change the device.

As of a few weeks ago, the Windows Audio plugin wasapi2 implements the same behaviour. All you have to do is set the device property to whatever device you want (fetched using the GstDeviceMonitor API), at any time.

A merge request is open for adding the same feature to the macOS audio plugin, and is expected to be merged soon.

For graceful error handling, such as accidental device unplug or other unexpected errors, there's a new continue-on-error property. Setting that will cause the source to output silence after unplug, whereas the sink will simply discard the buffers. An element warning will be emitted to notify the app (alongside the GST_MESSAGE_DEVICE_REMOVED bus message if there was a hardware unplug), and the app can switch the device by setting the device property.
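Reduced to a toy sketch, the app-side reaction looks like this. Purely illustrative: a real application watches GstBus for the warning / GST_MESSAGE_DEVICE_REMOVED, picks a replacement via GstDeviceMonitor, and sets the device property on the actual element.

```python
def handle_unplug(messages, element_props, fallback_device):
    """React to a device unplug. With continue-on-error set, the pipeline
    keeps running (source outputs silence, sink drops buffers) until the
    app switches to another device by setting the device property."""
    for msg in messages:
        if msg == "device-removed":
            element_props["device"] = fallback_device
            return "switched"
    return "running"
```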

Thanks to Seungha and Piotr for working on this!



HIP (formerly known as Heterogeneous-computing Interface for Portability) is AMD’s GPU programming API that enables portable, CUDA-like development across both AMD and NVIDIA platforms.

  • On AMD GPUs, HIP runs natively via the ROCm stack.
  • On NVIDIA GPUs, HIP operates as a thin translation layer over the CUDA runtime and driver APIs.

This allows developers to maintain a single codebase that can target multiple GPU vendors with minimal effort.

Where HIP Is Used

HIP has seen adoption in AMD-focused GPU computing workflows, particularly in environments that require CUDA-like programmability. Examples include:

  • PyTorch ROCm backend for deep learning workloads
  • Select scientific applications like LAMMPS and GROMACS have experimented with HIP backends for AMD GPU support
  • GPU-accelerated media processing on systems that leverage AMD hardware

While HIP adoption has been more limited compared to CUDA, its reach continues to expand as support for AMD GPUs grows across a broader range of use cases.

The Challenge: Compile-Time Platform Lock-in

Despite its cross-vendor goal, HIP still has a fundamental constraint at the build level. As of HIP 6.3, HIP requires developers to statically define their target platform at compile time via macros like:

#define __HIP_PLATFORM_AMD__    // for AMD ROCm
#define __HIP_PLATFORM_NVIDIA__ // for CUDA backend

This leads to three key limitations:

  • You must compile separate binaries for AMD and NVIDIA
  • A single binary cannot support both platforms simultaneously
  • HIP does not support runtime backend switching natively

GstHip’s Solution

To overcome this limitation, GstHip uses runtime backend dispatch through:

  • dlopen() on Linux
  • LoadLibrary() on Windows

Instead of statically linking against a single HIP backend, GstHip loads both the ROCm HIP runtime and the CUDA driver/runtime API at runtime. This makes it possible to:

  • Detect available GPUs dynamically
  • Choose the appropriate backend per device
  • Even support simultaneous use of AMD and NVIDIA GPUs in the same process

Unified Wrapper API

GstHip provides a clean wrapper layer that abstracts backend-specific APIs via a consistent naming scheme:

hipError_t HipFooBar(GstHipVendor vendor, ...);

The Hip prefix (capital H) clearly distinguishes the wrapper from native hipFooBar(...) functions. The GstHipVendor enum indicates which backend to target:

  • GST_HIP_VENDOR_AMD
  • GST_HIP_VENDOR_NVIDIA

Internally, each HipFooBar(...) function dispatches to the correct backend by calling either:

  • hipFooBar(...) for AMD ROCm
  • cudaFooBar(...) for NVIDIA CUDA

These symbols are dynamically resolved via dlopen() / LoadLibrary(), enabling runtime backend selection without GPU vendor-specific builds.
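The dispatch idea, reduced to pure Python (names here are illustrative stand-ins; GstHip itself resolves the real hip*/cuda* symbols via dlopen()/LoadLibrary() and routes each HipFooBar(vendor, ...) call through per-vendor function tables):

```python
AMD, NVIDIA = "amd", "nvidia"

class HipDispatch:
    """Per-vendor tables of resolved entry points, chosen per call."""
    def __init__(self):
        self.vtables = {}

    def load(self, vendor, symbols):
        # Stand-in for dlopen()/LoadLibrary() + symbol lookup.
        self.vtables[vendor] = symbols

    def call(self, vendor, name, *args):
        # Analogue of HipFooBar(vendor, ...): route to the right backend.
        return self.vtables[vendor][name](*args)
```

Because both tables can be populated in the same process, devices from both vendors can be driven side by side.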

Memory Interop

All memory interop in GstHip is handled through the hipupload and hipdownload elements. While zero-copy is not supported due to backend-specific resource management and ownership ambiguity, GstHip provides optimized memory transfers between systems:

  • System Memory ↔ HIP Memory: Utilizes HIP pinned memory to achieve fast upload/download operations between host and device memory
  • GstGL ↔ GstHip: Uses HIP resource interop APIs to perform GPU-to-GPU memory copies between OpenGL and HIP memory
  • GstCUDA ↔ GstHip (on NVIDIA platforms): Since both sides use CUDA memory, direct GPU-to-GPU memory copies are performed using CUDA APIs.

GPU-Accelerated Filter Elements

GstHip includes GPU-accelerated filters optimized for real-time media processing:

  • hipconvertscale/hipconvert/hipscale: Image format conversion and image scaling
  • hipcompositor: composing multiple video streams into a single video stream

These filters use the same unified dispatch system and are compatible with both AMD and NVIDIA platforms.

Application Integration Support

As of merge request !9340, GstHip exposes public APIs that allow applications to access HIP resources managed by GStreamer. This also enables applications to implement custom GstHip-based plugins using the same underlying infrastructure without duplicating resource management.

Summary of GstHip Advantages

  • Single plugin/library binary supports both AMD and NVIDIA GPUs
  • Compatible with Linux and Windows
  • Supports multi-GPU systems, including hybrid AMD + NVIDIA configurations
  • Seamless memory interop with System Memory, GstGL, and GstCUDA
  • Provides high-performance GPU filters for video processing
  • Maintains a clean API layer via HipFooBar(...) wrappers, enabling backend-neutral development


For a recent project it was necessary to collect video frames from multiple streams during a specific interval (and, in the future, also audio) and pass them through an inference framework for extracting additional metadata from the media and attaching it to the frames.

While GStreamer has gained quite a bit of infrastructure for machine learning use cases in the analytics library over the past years, there was nothing for this specific use case yet.

As part of solving this, I proposed a design for a generic interface that allows combining and batching multiple streams into a single one, using empty buffers with a GstMeta that contains the buffers of the original streams, and caps that include the caps of the original streams so that format negotiation in the pipeline works as usual.

While this covers my specific use case of combining multiple streams, it should be generic enough to also handle other cases that came up during the discussions.

In addition, I wrote two new elements, analyticscombiner and analyticssplitter, that use this new API for combining and batching multiple streams in a generic, media-agnostic way over specific time intervals, and for later splitting them out again into the original streams. The combiner can be configured to collect all media in the time interval, or only the first or last buffer.

Conceptually the combiner element is similar to NVIDIA's DeepStream nvstreammux element, and in the future it should be possible to write a translation layer between the GStreamer analytics library and DeepStream.

The basic idea for the usage of these elements is to have a pipeline like

-- stream 1 --\                                                                  / -- stream 1 with metadata --
               -- analyticscombiner -- inference elements -- analyticssplitter --
-- stream 2 --/                                                                  \ -- stream 2 with metadata --
   ........                                                                           ......................
-- stream N --/                                                                  \ -- stream N with metadata --

The inference elements would only add additional metadata to each of the buffers, which can then be made use of further downstream in the pipeline for operations like overlays or blurring specific areas of the frames.
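To make the batching idea concrete, here is a small Python sketch of the concept (hypothetical types, not the actual analytics API): the combiner collects the original streams' buffers into a batch "meta", and the splitter restores the original streams from it.

```python
from dataclasses import dataclass, field

@dataclass
class Buffer:
    stream_id: int
    pts: int          # presentation timestamp
    data: bytes

@dataclass
class Batch:
    interval_start: int
    interval_end: int
    # stand-in for the GstMeta: original buffers, keyed by stream id
    streams: dict = field(default_factory=dict)

def combine(buffers, start, end, mode="all"):
    """Collect buffers per stream for one interval (mode: all/first/last)."""
    batch = Batch(start, end)
    for buf in buffers:
        if start <= buf.pts < end:
            batch.streams.setdefault(buf.stream_id, []).append(buf)
    if mode == "first":
        batch.streams = {k: v[:1] for k, v in batch.streams.items()}
    elif mode == "last":
        batch.streams = {k: v[-1:] for k, v in batch.streams.items()}
    return batch

def split(batch):
    """Restore per-stream buffer lists from the batch meta."""
    return {k: list(v) for k, v in batch.streams.items()}
```

In the real elements the batch buffer itself is empty and downstream inference elements only attach further metadata to the carried buffers, so the splitter can reconstruct the original streams unchanged.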

In the future there are likely going to be more batching elements for specific stream types, operating on multiple or a single stream, or making use of completely different batching strategies.

Special thanks also to Olivier and Daniel who provided very useful feedback during the review of the two merge requests.



With GStreamer 1.26, a new D3D12 backend GstD3D12 public library was introduced in gst-plugins-bad.

Now, with the new gstreamer-d3d12 Rust crate, Rust applications can finally access the Windows-native GPU features of GStreamer in a safe and idiomatic way.

What You Get with GStreamer D3D12 Support in Rust

  • Pass D3D12 textures created by your Rust application directly into GStreamer pipelines without data copying
  • Likewise, GStreamer-generated GPU resources (such as frames decoded by D3D12 decoders) can be accessed directly in your Rust app
  • GStreamer elements based on GstD3D12 can be written in Rust

Beyond Pipelines: General D3D12 Utility Layer

GstD3D12 is not limited to multimedia pipelines. It also acts as a convenient D3D12 runtime utility, providing:

  • GPU resource pooling such as command allocator and descriptor heap, to reduce overhead and improve reuse
  • Abstractions for creating and recycling GPU textures with consistent lifetime tracking
  • Command queue and fence management helpers, greatly simplifying GPU/CPU sync
  • A foundation for building custom GPU workflows in Rust, with or without the full GStreamer pipeline


As part of the GStreamer Hackfest in Nice, France I had some time to go through some outstanding GStreamer issues. One such issue that has been on my mind was this GStreamer OpenGL Wayland issue.

Now, the issue is that OpenGL is an old API and originally did not have the platform extensions it does today. As a result, most windowing system APIs allow creating an output surface (or a window) without ever showing it. This works just fine when you are creating an OpenGL context but not actually rendering anything to the screen, and this approach is used by all of the other major OpenGL platforms (Windows, macOS, X11, etc.) supported by GStreamer.

When Wayland initially arrived, this was not the case. A Wayland surface could be the back buffer (an OpenGL term for a surface being rendered to) but could not be hidden. This is very different from how other windowing APIs worked at the time. As a result, the initial Wayland implementation in GStreamer OpenGL used a heuristic for determining when a Wayland surface should be created, which basically boiled down to: if there is no shared OpenGL context, then create a window.

This heuristic obviously breaks in multiple different ways, the two most obvious being:

  1. gltestsrc ! gldownload ! some-non-gl-sink - there should be no surface used here.
  2. gltestsrc ! glimagesink gltestsrc ! glimagesink - there should be two output surfaces used here.

The good news is that this issue is now fixed by adding API that glimagesink can use to signal that it would like an output surface. This has been implemented in this merge request and will be part of GStreamer 1.28.



JPEG XS is a visually lossless, low-latency, intra-only video codec for video production workflows, standardised in ISO/IEC 21122.

A few months ago we added support for JPEG XS encoding and decoding in GStreamer, alongside MPEG-TS container support.

This initially covered progressive scan only though.

Unfortunately interlaced scan, which harks back to the days when TVs had cathode ray tube displays, is still quite common, especially in the broadcasting industry, so it was only a matter of time until support for that would be needed as well.

Long story short, GStreamer can now (with this pending Merge Request) also encode and decode interlaced video into/from JPEG XS.

When putting JPEG XS into MPEG-TS, the individual fields are coded separately, so there are two JPEG XS code streams per frame. Inside GStreamer pipelines, interlaced raw video can be carried in multiple ways, but the most common one is an "interleaved" image, where the two fields are interleaved row by row; this is also what capture cards such as AJA or Blackmagic Decklink produce in GStreamer.

When encoding interlaced video in this representation, we need to go over each frame twice and feed every second row of pixels to the underlying SVT JPEG XS encoder, which itself is not aware of the interlaced nature of the video content. We do this by specifying double the usual stride as the rowstride. This works fine, but unearthed some minor issues with the size checks on the codec side, for which we filed a pull request.
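The double-stride trick can be illustrated with a small Python sketch (assuming, for simplicity, a tightly packed frame with one byte per pixel):

```python
def field_rows(frame: bytes, width: int, height: int, field: int) -> bytes:
    """Extract one field of a row-interleaved interlaced frame by starting
    at the field's first row and stepping with double the normal stride.
    field=0 selects the top field, field=1 the bottom field."""
    stride = width  # 1 byte per pixel for this sketch
    out = bytearray()
    for row in range(field, height, 2):  # stepping 2 rows == 2 * stride
        out += frame[row * stride:(row + 1) * stride]
    return bytes(out)

frame = b"AAAA" + b"BBBB" + b"CCCC" + b"DDDD"  # 4x4 frame, fields interleaved
print(field_rows(frame, 4, 4, 0))  # b'AAAACCCC' (top field)
print(field_rows(frame, 4, 4, 1))  # b'BBBBDDDD' (bottom field)
```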

Please give it a spin, and let us know if you have any questions or are interested in additional container mappings such as MP4 or MXF, or RTP payloaders / depayloaders.



Some time ago, Edward and I wrote a new element that allows clocking a GStreamer pipeline from an MPEG-TS stream, for example received via SRT.

This new element, mpegtslivesrc, wraps around any existing live source element, e.g. udpsrc or srtsrc, and provides a GStreamer clock that approximates the sender's clock. By using this clock as the pipeline clock, it is possible to run the whole pipeline at the same speed as the sender is producing the stream, without having to implement any kind of clock drift mechanism like skewing or resampling. Without this, it is currently necessary to adjust the timestamps of media coming out of GStreamer's tsdemux element, which is problematic if accurate timestamps are needed or the stream is to be stored in a file: a 25fps stream, for example, wouldn't have exactly 40ms inter-frame timestamp differences anymore.

The clock is approximated by making use of the in-stream MPEG-TS PCR, which basically gives the sender's clock time at specific points inside the stream, and correlating that together with the local receive times via a linear regression to calculate the relative rate between the sender's clock and the local system clock.
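As a simplified sketch of that regression (plain least squares, ignoring the PCR wraparound and outlier handling a real implementation needs):

```python
def estimate_clock(local_times, pcr_times):
    """Least-squares fit pcr ~= rate * local + offset, as a stand-in for the
    regression mpegtslivesrc performs over (local receive time, PCR)
    observations to track the sender's clock rate."""
    n = len(local_times)
    mx = sum(local_times) / n
    my = sum(pcr_times) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(local_times, pcr_times))
    var = sum((x - mx) ** 2 for x in local_times)
    rate = cov / var
    offset = my - rate * mx
    return rate, offset

# A sender running ~0.1% fast relative to the local clock:
rate, offset = estimate_clock([0.0, 1.0, 2.0, 3.0],
                              [10.0, 11.001, 12.002, 13.003])
```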

Usage of the element is as simple as

$ gst-launch-1.0 mpegtslivesrc source='srtsrc location=srt://1.2.3.4:5678?latency=150&mode=caller' ! tsdemux skew-corrections=false ! ...
$ gst-launch-1.0 mpegtslivesrc source='udpsrc address=1.2.3.4 port=5678' ! tsdemux skew-corrections=false ! ...

Addition 2025-06-28: If you're using an older (< 1.28) version of GStreamer, you'll have to use the ignore-pcr=true property on tsdemux instead. skew-corrections=false was only added recently and allows for more reliable handling of MPEG-TS timestamp discontinuities.

A similar approach for clocking is implemented in the AJA source element and the NDI source element when the clocked timestamp mode is configured.



If you've ever seen a news or sports channel playing without sound in the background of a hotel lobby, bar, or airport, you've probably seen closed captions in action.

These TV-style captions are alphabet/character-based, with some very basic commands to control the positioning and layout of the text on the screen.

They are very low bitrate and were transmitted in the invisible part of TV images during the vertical blanking interval (VBI) back in those good old analogue days ("line 21 captions").

Nowadays they are usually carried as part of the MPEG-2 or H.264/H.265 video bitstream, unlike, say, text subtitles in a Matroska file, which are a separate stream in the container.

In GStreamer, closed captions can be carried in different ways: either implicitly as part of a video bitstream, or explicitly as part of a video bitstream with video caption metas on the buffers passing through the pipeline. Captions can also travel through a pipeline stand-alone, in the form of one of several raw caption bitstream formats.

To make handling these different options easier for applications, there are elements that can extract captions from the video bitstream into metas, split captions from metas into their own stand-alone stream, and do the reverse: combine and reinject them again.

SMPTE 2038 Ancillary Data

SMPTE 2038 (pdf) is a generic system to put VBI-style ancillary data into an MPEG-TS container. This could include all kinds of metadata such as scoreboard data or game clocks, and of course also closed captions, in this case in form of a distinct stream completely separate from any video bitstream.

We've recently added support for SMPTE 2038 ancillary data in GStreamer. This comes in form of a number of new elements in the GStreamer Rust closedcaption plugin and mappings for it in the MPEG-TS muxer and demuxer.

The new elements are:

  • st2038ancdemux: splits SMPTE ST-2038 ancillary metadata (as received from tsdemux) into separate streams per DID/SDID and line/horizontal_offset. Will add a sometimes pad with details for each ancillary stream. Also has an always source pad that just outputs all ancillary streams for easy forwarding or remuxing, in case none of the ancillary streams need to be modified or dropped.

  • st2038ancmux: muxes SMPTE ST-2038 ancillary metadata streams into a single stream for muxing into MPEG-TS with mpegtsmux. Combines ancillary data on the same line if needed, as is required for MPEG-TS muxing. Can accept individual ancillary metadata streams as inputs and/or the combined stream from st2038ancdemux.

    If the video framerate is known, it can be signalled to the ancillary data muxer via the output caps by adding a capsfilter behind it, with e.g. meta/x-st-2038,framerate=30/1.

    This allows the muxer to bundle all packets belonging to the same frame (with the same timestamp), but that is not required. In case there are multiple streams with the same DID/SDID that have an ST-2038 packet for the same frame, it will prioritise the one from more recently created request pads over those from earlier created request pads (which might contain a combined stream for example if that's fed first).

  • st2038anctocc: extracts closed captions (CEA-608 and/or CEA-708) from SMPTE ST-2038 ancillary metadata streams and outputs them on the respective sometimes source pad (src_cea608 or src_cea708). The data is output as a closed caption stream with caps closedcaption/x-cea-608,format=s334-1a or closedcaption/x-cea-708,format=cdp for further processing by other GStreamer closed caption processing elements.

  • cctost2038anc: takes closed captions (CEA-608 and/or CEA-708) as produced by other GStreamer closed caption processing elements and converts them into SMPTE ST-2038 ancillary data that can be fed to st2038ancmux and then to mpegtsmux for splicing/muxing into an MPEG-TS container. The line-number and horizontal-offset properties should be set to the desired line number and horizontal offset.

Please give it a spin and let us know how it goes!



What is JPEG XS?

JPEG XS is a visually lossless, low-latency, intra-only video codec for video production workflows, standardised in ISO/IEC 21122.

It's wavelet based, with low computational overhead and a latency measured in scanlines, and it is designed to allow easy implementation in software, GPU or FPGAs.

Multi-generation robustness means repeated decoding and encoding will not introduce unpleasant coding artefacts or noticeably degrade image quality, which makes it suitable for video production workflows.

It is often deployed in lieu of existing raw video workflows, where it allows sending multiple streams over links designed to carry a single raw video transport.

JPEG XS encoding / decoding in GStreamer

GStreamer has now gained basic support for this codec.

Encoding and decoding are supported via the open source Intel Scalable Video Technology JPEG XS (SVT JPEG XS) library, but third-party GStreamer plugins that provide GPU-accelerated encoding and decoding exist as well.

MPEG-TS container mapping

Support was also added for carriage inside MPEG-TS which should enable a wide range of streaming applications including those based on the Video Services Forum (VSF)'s Technical Recommendation TR-07.

JPEG XS caps in GStreamer

It actually took us a few iterations to come up with GStreamer caps that we were somewhat happy with for starters.

Our starting point was what the SVT encoder/decoder output/consume, and our initial target was MPEG-TS container format support.

We checked various specifications to see how JPEG XS is mapped there and how it could work, in particular:

  • ISO/IEC 21122-3 (Part 3: Transport and container formats)
  • MPEG-TS JPEG XS mapping and VSF TR-07 - Transport of JPEG XS Video in MPEG-2 Transport Stream over IP
  • RFC 9134: RTP Payload Format for ISO/IEC 21122 (JPEG XS)
  • SMPTE ST 2124:2020 (Mapping JPEG XS Codestreams into the MXF Generic Container)
  • MP4 mapping

and we think the current mapping will work for all of those cases.

Basically each mapping wants some extra headers in addition to the codestream data, for the out-of-band signalling required to make sense of the image data. Originally we thought about putting some form of codec_data header into the caps, but it wouldn't really have made anything easier, and would just have duplicated 99% of the info that's in the video caps already anyway.

The current caps mapping is based on ISO/IEC 21122-3, Annex D, with additional metadata in the caps, which should hopefully work just fine for RTP, MP4, MXF and other mappings in future.

Please give it a spin, and let us know if you have any questions or are interested in additional container mappings such as MP4 or MXF, or RTP payloaders / depayloaders.



webrtcsink already supported instantiating a data channel for the sole purpose of carrying navigation events from the consumer to the producer. It can now also create a generic control data channel through which the consumer can send JSON requests in the form:

{
    "id": identifier used in the response message,
    "mid": optional media identifier the request applies to,
    "request": {
        "type": currently "navigationEvent" and "customUpstreamEvent" are supported,
        "type-specific-field": ...
    }
}

The producer will reply with messages of the form:

{
  "id": identifier of the request,
  "error": optional error message, successful if not set
}

The example frontend was also updated with a text area for sending any arbitrary request.

The use case for this work was to make it possible for a consumer to control the mix matrix used for the audio stream, with such a pipeline running on the producer side:

gst-launch-1.0 audiotestsrc ! audioconvert ! webrtcsink enable-control-data-channel=true

As audioconvert now supports setting a mix matrix through a custom upstream event, the consumer can simply input the following text in the request field of the frontend to reverse the channels of a stereo audio stream:

{
  "type": "customUpstreamEvent",
  "structureName": "GstRequestAudioMixMatrix",
  "structure": {
    "matrix": [[0.0, 1.0], [1.0, 0.0]]
  }
}
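What the mix matrix does can be sketched in a few lines of Python (a simplified model of the per-frame processing; the real work happens inside audioconvert):

```python
def apply_mix_matrix(samples, matrix):
    """Apply an output-channels x input-channels mix matrix to one frame of
    audio samples: each output channel is a weighted sum of the inputs."""
    return [sum(coeff * s for coeff, s in zip(row, samples)) for row in matrix]

# The channel-swapping matrix from the request above:
swap = [[0.0, 1.0], [1.0, 0.0]]
print(apply_mix_matrix([0.25, -0.5], swap))  # [-0.5, 0.25]
```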


The default signaller for webrtcsink can now produce an answer when the consumer sends the offer first.

To test this with the example, you can simply follow the usual steps but also paste the following text in the text area before clicking on the producer name:

{
  "offerToReceiveAudio": 1,
  "offerToReceiveVideo": 1
}

I implemented this in order to test multiopus support with webrtcsink, as it seems to work better when munging the SDP offered by Chrome.



A couple of weeks ago I implemented support for static HDR10 metadata in the decklinkvideosink and decklinkvideosrc elements for Blackmagic video capture and playout devices. The culmination of this work is available in MR 7124 - decklink: add support for HDR output and input.

This adds support for both PQ and HLG HDR alongside some improvements in colorimetry negotiation. Static HDR metadata in GStreamer is conveyed through caps.

The first part of this is the 'colorimetry' field in video/x-raw caps. decklinkvideosink and decklinkvideosrc now support the colorimetry values 'bt601', 'bt709', 'bt2020', 'bt2100-hlg', and 'bt2100-pq' for any resolution. Previously, the colorimetry was fixed based on the resolution of the video frames being sent or received. With some glue code, the colorimetry is now retrieved from the Decklink API, and the Decklink API can ask us for the colorimetry of each submitted video frame. Arbitrary colorimetry is not supported on all Decklink devices, and we fall back to the previous resolution-based list where it isn't.

Support for HDR metadata is a separate feature flag in the Decklink API and may or may not be present independently of Decklink's arbitrary colour space support. If the Decklink device does not support HDR metadata, the colorimetry values 'bt2100-hlg' and 'bt2100-pq' are not supported.

In the case of HLG, all that is necessary is to provide information that the HLG gamma transfer function is being used. Nothing else is required.

In the case of PQ HDR, in addition to the correct gamma transfer function, Decklink also needs some other metadata conveyed in the caps, in the form of the 'mastering-display-info' and 'content-light-level' fields. With support from GstVideoMasteringDisplayInfo and GstVideoContentLightLevel, the relevant information is signalled to Decklink and can be retrieved from each individual video frame.
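For illustration, a PQ caps string on such a pipeline might look roughly like this (the resolution and format are placeholders, and the exact value syntax of the metadata fields is produced by GStreamer's video library helpers):

```
video/x-raw, format=(string)v210, width=(int)1920, height=(int)1080,
    colorimetry=(string)bt2100-pq,
    mastering-display-info=(string)...,
    content-light-level=(string)...
```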



In GStreamer 1.20 times I fixed the handling of RTSP control URIs in GStreamer's RTSP source element by making use of GstUri for joining URIs and resolving relative URIs instead of using a wrong, custom implementation of those basic URI operations (see RFC 2396).

This was in response to a bug report which was caused by a regression in 1.18 when fixing that custom implementation some years before. Now that this is handled according to the standards, one would expect that the topic is finally solved.

Unfortunately that was not the case. As it turns out, various RTSP servers do not actually implement the URI operations for constructing the control URI, but instead do simple string concatenation. This works fine for simple cases, but once path separators or query parameters are involved it is not sufficient. The fact that both VLC and ffmpeg also only do string concatenation on the client side unfortunately does not help either: such servers work fine with VLC and ffmpeg but not with GStreamer, so it initially appears to be a GStreamer bug.
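The divergence is easy to demonstrate with Python's urllib, which implements the same standard URI resolution that GstUri does (the URL and control attribute here are made up for the example):

```python
from urllib.parse import urljoin

# Hypothetical stream URL with a query parameter, plus a relative
# control attribute as an RTSP server might send it.
base = "rtsp://example.com/media/stream1?token=abc"
control = "trackID=1"

# Standards-compliant URI resolution, as rtspsrc does by default:
# the query is dropped and the reference is resolved against the
# base path's directory.
print(urljoin(base, control))  # rtsp://example.com/media/trackID=1

# Naive string concatenation, which is what many servers expect:
print(base + "/" + control)    # rtsp://example.com/media/stream1?token=abc/trackID=1
```

Both results are plausible-looking URLs, but only one of them matches what a string-concatenating server will accept, which is why the two behaviours need to coexist.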

To work around these cases automatically, a workaround with a couple of follow-up fixes 1 2 3 4 was implemented. This workaround is available since 1.20.4.

Unfortunately this was also not enough, as various servers don't just implement the URI RFC incorrectly, but also fail to implement the RTSP RFC correctly and don't return any kind of meaningful error but, for example, simply close the connection.

To solve this once and for all, Mathieu has now added a new property to rtspsrc that forces it to use string concatenation directly instead of attempting proper URI operations first.

$ gst-launch-1.0 rtspsrc location=rtsp://1.2.3.4/test force-non-compliant-url=true ! ...

This property is available since 1.24.7 and should make it possible to use such misbehaving and non-compliant servers.

If GStreamer's rtspsrc fails on an RTSP stream that is handled just fine by VLC and ffmpeg, give this a try.



Last month, as part of the GTK 4.14 release, GTK gained support for directly importing DMABufs on Wayland. Among other things, this allows passing decoded video frames from hardware decoders to GTK, and then, under certain circumstances, GTK can forward the DMABuf directly to the Wayland compositor. Under even more special circumstances, it can then be passed directly to the GPU driver. Matthias wrote some blog posts about the details.

In short, this reduces CPU usage and power consumption considerably when using a suitable hardware decoder and running GTK on Wayland. A suitable hardware decoder in this case is one provided via VA by e.g. Intel or (newer) AMD GPUs, but unfortunately not NVIDIA, as they simply don't support DMABufs.

I've added support for this to the GStreamer GTK4 video sink, gtk4paintablesink, that exists as part of the GStreamer Rust plugins. Previously it was only possible to pass RGB system memory (i.e. after downloading from the GPU in the case of hardware decoders) or GL textures (with all kinds of complications) from GStreamer to GTK4.

In general the GTK4 sink now offers the most complete GStreamer / UI toolkit integration, even more than the QML5/6 sinks, and it is used widely by various GNOME applications.



Hello and welcome to our little corner of the internet!

This is where we will post little updates and goings-on about GStreamer, Rust, Meson, Orc, GNOME, librice, and other Free and Open Source Software projects we love to contribute to.

This covers only a small part of our day-to-day upstream activity, but we'll try to make time to post about interesting happenings between the everyday hustle.

Please check in regularly and bear with us while we look into adding more convenient ways to get notified of updates.

In the meantime please follow us on Mastodon, Bluesky, or (yes we still call it) Twitter.