Devlog

Posts tagged with #rust

Back in June '25, I implemented a new speech synthesis element using the ElevenLabs API.

In this post I will briefly explain some of the design choices I made, and provide one or two usage examples.

POST vs. WSS

ElevenLabs offers two interfaces for speech synthesis:

  • Either open a websocket and feed the service small chunks of text (e.g. words) to receive a continuous audio stream

  • Or POST longer segments of text to receive independent audio fragments

The websocket API is well-adapted to conversational use cases and can offer the lowest latency, but it isn't the best fit for the use cases I was targeting: my goal was to use the element to synthesize audio from text that had first been transcribed, then translated, from an original input audio stream.

In this situation we have two constraints we need to be mindful of:

  • For translation purposes we need to construct large enough text segments prior to translating, in order for the translation service to operate with enough context to do a good job.

  • Once audio has been synthesized, we might also need to resample it in order to have it fit within the original duration of the speech.

Given that:

  • The latency benefits from using the websocket API are largely negated by the larger text segments we would use as the input

  • Resampling the continuous stream we would receive to make sure individual words are time-shifted back to the "correct" position, while possible thanks to the sync_alignment option, would have increased the complexity of the resulting element

I chose to use the POST API for this element. We might still choose to implement a websocket-based version if there is a good story for using GStreamer in a conversational pipeline, but that is not on my radar for now.

Additionally, we already have a speech synthesis element around the AWS Polly API which is also POST-based, so both elements can share a similar design.

Audio resampling

As mentioned previously, the ElevenLabs API does not offer direct control over the duration of the output audio.

For instance, you might be dubbing speech from a fast speaker with a slow voice, potentially causing the output audio to drift out of sync.

To address this, the element can optionally make use of signalsmith_stretch to resample the audio in a pitch-preserving manner.

When the feature is enabled, the resampling can be activated through the overflow=compress property.

The effect can sometimes be pretty jarring for very short inputs, so an extra property is also exposed to allow some tolerance for drift: max-overflow. It represents the maximum duration by which the audio output is allowed to drift out of sync, and the element does a good job of using up intervals of silence between utterances to catch up.
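
For completeness, here is what configuring those two properties could look like when building the element programmatically with the gstreamer-rs bindings. A minimal sketch: the property names come from this post, but the value types (in particular max-overflow being a nanosecond duration) are an assumption, so check gst-inspect-1.0 for the authoritative signatures.

use gst::prelude::*;

fn make_synthesizer() -> Result<gst::Element, Box<dyn std::error::Error>> {
    gst::init()?;

    let synth = gst::ElementFactory::make("elevenlabssynthesizer")
        .property("api-key", std::env::var("ELEVENLABS_API_KEY")?)
        // Resample overflowing audio in a pitch-preserving manner.
        .property_from_str("overflow", "compress")
        // Tolerate up to 2s of drift before compressing; assumed to be
        // a duration expressed in nanoseconds.
        .property("max-overflow", 2_000_000_000u64)
        .build()?;

    Ok(synth)
}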

Voice cloning

The ElevenLabs API exposes a pretty powerful feature, Instant Voice Cloning. It can be used to create a custom voice that will sound very much like a reference voice, requiring only a handful of seconds to a few minutes of reference audio data to produce useful results.

Using the multilingual model, that newly-cloned voice can even be used to generate convincing speech in a different language.

A typical pipeline for my target use case can be represented as (pseudo gst-launch):

input_audio_src ! transcriber ! translator ! synthesizer

When using a transcriber element such as speechmaticstranscriber, speaker "diarization" (a fancy word for detecting who is speaking and when) can be used to determine when a given speaker was speaking, thus making it possible to clone voices even in a multi-speaker situation.

The challenge in this situation however is that the synthesizer element doesn't have access to the original audio samples, as it only deals with text as the input.

I thus decided on the following solution:

input_audio_src ! voicecloner ! transcriber ! .. ! synthesizer

The voice cloner element will accumulate audio samples; upon receiving custom upstream events from the transcriber element with information about speaker timings, it will start cloning voices and trim its internal sample queue.

To be compatible, a transcriber simply needs to send the appropriate events upstream. The speechmaticstranscriber element can be used as a reference.

Finally, once a voice clone is ready, the cloner element sends another event downstream with a mapping of speaker id to voice id. The synthesizer element can then intercept the event and start using the newly-created voice clone.
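
To give an idea of what this event flow looks like in code, here is a rough sketch of building and pushing such custom events with the gstreamer-rs bindings. The structure and field names below are illustrative placeholders of my own, not the exact names used by the elements; treat the speechmaticstranscriber source as the reference for those.

use gst::prelude::*;

// Transcriber side: tell upstream (the cloner) when a speaker was active.
// Structure and field names here are hypothetical placeholders.
fn send_speaker_timing(pad: &gst::Pad, speaker: &str, start: gst::ClockTime, end: gst::ClockTime) {
    let s = gst::Structure::builder("speaker-timing")
        .field("speaker-id", speaker)
        .field("start-time", start)
        .field("end-time", end)
        .build();
    // Custom upstream events travel from the transcriber back to the cloner.
    let _ = pad.push_event(gst::event::CustomUpstream::new(s));
}

// Cloner side: announce the finished clone downstream to the synthesizer.
fn send_voice_mapping(pad: &gst::Pad, speaker: &str, voice_id: &str) {
    let s = gst::Structure::builder("voice-cloned")
        .field("speaker-id", speaker)
        .field("voice-id", voice_id)
        .build();
    let _ = pad.push_event(gst::event::CustomDownstream::new(s));
}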

The cloner element can also be used in single-speaker mode by just setting the speaker property to some identifier and watching for messages on the bus:

gst-launch-1.0 -m -e alsasrc ! audioconvert ! audioresample ! queue ! elevenlabsvoicecloner api-key=$ELEVENLABS_API_KEY speaker="Mathieu" ! fakesink

Putting it all together

At this year's GStreamer conference I gave a talk where I demo'd these new elements.

This is the pipeline I used then:

AWS_ACCESS_KEY_ID="XXX" AWS_SECRET_ACCESS_KEY="XXX" gst-launch-1.0 uridecodebin uri=file:///home/meh/Videos/spanish-convo-trimmed.webm name=ud \
  ud. ! queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! clocksync ! autovideosink \
  ud. ! audioconvert ! audioresample ! clocksync ! elevenlabsvoicecloner api-key=XXX ! \
    speechmaticstranscriber url=wss://eu2.rt.speechmatics.com/v2 enable-late-punctuation-hack=false join-punctuation=false api-key="XXX" max-delay=2500 latency=4000 language-code=es diarization=speaker ! \
    queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! textaccumulate latency=3000 drain-on-final-transcripts=false extend-duration=true ! \
    awstranslate latency=1000 input-language-code="es-ES" output-language-code="en-EN" ! \
    elevenlabssynthesizer api-key=XXX retry-with-speed=false overflow=compress latency=3000 language-code="en" voice-id="iCKVfVbyCo5AAswzTkkX" model-id="eleven_multilingual_v2" max-overflow=0 ! \
    queue max-size-time=15000000000 max-size-bytes=0 max-size-buffers=0 ! audiomixer name=m ! autoaudiosink audiotestsrc volume=0.03 wave=violet-noise ! clocksync ! m.

Watch my talk for the result, or try it yourself (you will need API keys for speechmatics / AWS / elevenlabs)!



At the GStreamer project, we produce SDKs for lots of platforms: Linux, Android, macOS, iOS, and Windows. However, as we port more and more plugins to Rust 🦀, we are finding ourselves backed into a corner.

Rust static libraries are simply too big.

To give you an example, the AWS folks changed their SDK back in March to switch their cryptographic toolkit over to their aws-lc-rs crate [1]. However, that causes a 2-10x increase in code size (bug reports here and here), which gets duplicated on every plugin that makes use of their ecosystem!

What are Rust staticlibs made of?

To summarise, each Rust plugin packs a copy of its dependencies, plus a copy of the Rust standard library. This is not a problem on shared libraries and executables by their very nature, but on static libraries it causes several issues: the same code is duplicated in every plugin, the combined size of the SDK balloons, and the bundled symbols leak out to downstream consumers, where they can collide.

First approach: Single-Object Prelinking

I won't bore you with the details as I've written another blog post on the subject; the gist is that you can unpack the library, and then ask the linker to perform "partial linking" or "relocatable linking" (Linux term) or "Single-Object Prelinking" (the Apple term, which I'll use throughout the post) over the object files. Setting which symbols you want to be visible for downstream consumers lets dead-code elimination take place at the plugin level, ensuring your libraries are now back to a reasonable size.

Why is it not enough?

Single-Object Prelinking has two drawbacks:

  • Unoptimized code: the linker won't be able to deduplicate functions between melded objects, as they've been hidden by the prelinking process.
  • Windows: there are no officially supported tools (read: Visual Studio, LLVM, GCC) to perform this at the compiler level. It is possible to do this with binutils, but the PE-COFF format doesn't allow changing the visibility of unexported functions.

Melt all the object files with the power of dragons' fire!

As mentioned earlier, no tools on Windows officially support prelinking yet, but there's another thing we can do: library deduplication.

Thanks to Rust's comprehensive crate ecosystem, I wrote a new CLI tool which I called dragonfire. Given a complete Rust workspace or list of static libraries, dragonfire:

  1. reads all the static libraries in one pass
  2. deduplicates the object files inside them based on their size and naming (Rust has its own, unique naming convention for object files -- pretty useful!)
  3. copies the duplicate objects into a new static library (usually called gstrsworkspace as its primary use is for the GStreamer ecosystem)
  4. removes the duplicates from the rest of the libraries
  5. updates the symbol table in each of the libraries with the bundled LLVM tools

Thanks to the ar crate, the unpacking and writing only happen at stage 3, ensuring no wasteful I/O slowdowns take place. The llvm-tools-preview component in turn takes care of locating and invoking llvm-ar to update the workspace's symbol tables.

The object files' naming convention deserves a special mention. Given a Rust staticlib named libfoo, its object files will be named as follows:

  • crate_name-hash1.crate_name.hash2-cgu.nnn.rcgu.o
  • On Windows only: foo.crate_name-hash1.crate_name.hash2-cgu.nnn.rcgu.o
  • On non-Windows platforms: same as above, but replacing foo with libfoo-hash

In all cases, crate_name refers to a dependency present somewhere in the workspace tree, and nnn is a number that will be bigger than zero whenever -C codegen-units was set higher than 1.

For dragonfire's purposes, dropping the library prefix is enough to be able to deduplicate object files; however, on Windows we can also find import library stubs, which LLVM can generate on its own through the raw-dylib link kind [2]. Import stubs can have any extension, e.g. .dll, .exe and .sys (the latter two coming from private Win32 APIs). These stubs cannot be deduplicated as they are generated individually per imported function, so dragonfire must preserve them where they are.
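
To illustrate the core of the approach, here is a condensed sketch of the deduplication step using the same ar crate. All file names are illustrative, only .rcgu.o members are considered (leaving import stubs untouched), and steps 4 and 5 (rewriting the original libraries and refreshing symbol tables with llvm-ar) are omitted; the real tool does considerably more.

use std::collections::HashMap;
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    let libs = ["gstaws.lib", "gstrswebrtc.lib"]; // illustrative inputs

    // Key object files by (canonical name, size); remember one copy of the
    // data and in how many libraries each object was seen.
    let mut objects: HashMap<(String, u64), (Vec<u8>, usize)> = HashMap::new();

    for lib in libs {
        let mut archive = ar::Archive::new(File::open(lib)?);
        while let Some(entry) = archive.next_entry() {
            let mut entry = entry?;
            let name = String::from_utf8_lossy(entry.header().identifier()).into_owned();
            let size = entry.header().size();
            let mut data = Vec::new();
            entry.read_to_end(&mut data)?;

            // Import library stubs (arbitrary extensions, one per imported
            // function) must stay put; only Rust codegen units qualify.
            if !name.ends_with(".rcgu.o") {
                continue;
            }
            // Drop the per-library prefix ("foo." on Windows, "libfoo-hash."
            // elsewhere) so identical objects from different libs compare equal.
            let canonical = name.splitn(2, '.').nth(1).unwrap_or(&name).to_string();

            let slot = objects.entry((canonical, size)).or_insert((data, 0));
            slot.1 += 1;
        }
    }

    // Objects found in more than one library go into the shared workspace
    // library; dragonfire additionally strips them from the original
    // libraries and refreshes the symbol tables afterwards.
    let mut workspace = ar::Builder::new(File::create("gstrsworkspace.lib")?);
    for ((name, size), (data, count)) in objects {
        if count > 1 {
            workspace.append(&ar::Header::new(name.into_bytes(), size), data.as_slice())?;
        }
    }
    Ok(())
}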

Drawbacks of object file deduplication

Again, there are several disadvantages to this approach. On Apple platforms, deduplicating libraries triggers a strange linker error which I had not seen before:

ld: multiple errors: compact unwind must have at least 1 fixup in '<framework>/GStreamer[arm64][1021](libgstrsworkspace_a-3f2b47962471807d-lse_ldset4_acq.o)'; r_symbolnum=-19 out of range in '<framework>/GStreamer[arm64][1022](libgstrsworkspace_a-compiler_builtins-350c23344d78cfbc.compiler_builtins.5e126dca1f5284a9-cgu.162.rcgu.o)'

This also led me to find that Rust libraries were packing bitcode, which is forbidden by Apple. (This was thankfully already fixed before shipping time, but we've not yet updated our Rust minimum version to take advantage of it.)

Another drawback is that Rust's implementation of LTO causes dead-code elimination at the crate level, as opposed to the workspace level. This makes object file deduplication impossible, as each copy is different.

For the Windows platform, there is an extra drawback which specifically affects object files produced by LLVM: the COMDAT sections are set to IMAGE_COMDAT_SELECT_NODUPLICATES. This means that the linker will outright reject functions with multiple definitions, rather than realise they're all duplicates and discard all but one of the copies. MSVC in particular performs symbol resolution before dead-code elimination, so linking will fail because of unresolved symbols before dead-code elimination kicks in; to use deduplicated libraries, one must set the linker flags /OPT:REF /FORCE:UNRESOLVED to ensure the dead code can be successfully eliminated.

Results

With library deduplication, we can make static libraries up to 44x smaller when building under MSVC [3] (see the tables below for the full comparison):

  • gstaws.lib: from 173M to 71M (~2.5x)
  • gstrswebrtc.lib: from 193M to 66M (~2.9x)
  • gstwebrtchttp.lib: from 66M to 1.5M (~44x)
Table: before and after melding under MSVC
file                       | no prelinking | melded
gstaws.lib                 | 173M          | 71M
gstcdg.lib                 | 36M           | 572K
gstclaxon.lib              | 32M           | 568K
gstdav1d.lib               | 34M           | 936K
gstelevenlabs.lib          | 59M           | 1008K
gstfallbackswitch.lib      | 37M           | 2.3M
gstffv1.lib                | 34M           | 744K
gstfmp4.lib                | 39M           | 3.2M
gstgif.lib                 | 34M           | 1.1M
gstgopbuffer.lib           | 30M           | 456K
gsthlsmultivariantsink.lib | 46M           | 1.6M
gsthlssink3.lib            | 41M           | 1.2M
gsthsv.lib                 | 34M           | 796K
gstjson.lib                | 31M           | 704K
gstlewton.lib              | 33M           | 1.2M
gstlivesync.lib            | 33M           | 728K
gstmp4.lib                 | 38M           | 2.2M
gstmpegtslive.lib          | 31M           | 704K
gstndi.lib                 | 38M           | 2.8M
gstoriginalbuffer.lib      | 34M           | 376K
gstquinn.lib               | 75M           | 23M
gstraptorq.lib             | 33M           | 2.4M
gstrav1e.lib               | 46M           | 11M
gstregex.lib               | 38M           | 404K
gstreqwest.lib             | 58M           | 1.4M
gstrsanalytics.lib         | 35M           | 1000K
gstrsaudiofx.lib           | 54M           | 22M
gstrsclosedcaption.lib     | 52M           | 8.4M
gstrsinter.lib             | 35M           | 604K
gstrsonvif.lib             | 46M           | 2.0M
gstrspng.lib               | 35M           | 1.2M
gstrsrtp.lib               | 59M           | 11M
gstrsrtsp.lib              | 57M           | 4.4M
gstrstracers.lib           | 40M           | 2.4M
gstrsvideofx.lib           | 48M           | 11M
gstrswebrtc.lib            | 193M          | 66M
gstrsworkspace.lib         | N/A           | 137M
gststreamgrouper.lib       | 30M           | 376K
gsttextahead.lib           | 30M           | 332K
gsttextwrap.lib            | 32M           | 2.1M
gstthreadshare.lib         | 52M           | 12M
gsttogglerecord.lib        | 35M           | 808K
gsturiplaylistbin.lib      | 31M           | 648K
gstvvdec.lib               | 34M           | 564K
gstwebrtchttp.lib          | 66M           | 1.5M

The results from the melding above can be compared with the file sizes obtained using LTO on Windows [4] (remember that LTO alone doesn't actually fix linking against the plugins):

  • gstaws.lib: from 71M (LTO) to 67M (melded) (-5.6%)
  • gstrswebrtc.lib: from 105M to 66M (-37.1%)
  • gstwebrtchttp.lib: from 28M to 1.5M (-94.6%)
Table: before and after LTO under MSVC (no melding involved)
file (codegen-units=1 in all cases) | no prelinking | lto=thin | opt-level=s + lto=thin | debug=1 + opt-level=s | debug=1 + lto=thin + opt-level=s
old/gstaws.lib                      | 199M          | 199M     | 171M                   | 78M                   | 67M
old/gstcdg.lib                      | 11M           | 11M      | 11M                    | 7.5M                  | 7.5M
old/gstclaxon.lib                   | 11M           | 11M      | 11M                    | 7.7M                  | 7.7M
old/gstdav1d.lib                    | 12M           | 12M      | 12M                    | 7.9M                  | 7.8M
old/gstelevenlabs.lib               | 52M           | 52M      | 49M                    | 24M                   | 22M
old/gstfallbackswitch.lib           | 18M           | 18M      | 17M                    | 11M                   | 11M
old/gstffv1.lib                     | 11M           | 11M      | 11M                    | 7.6M                  | 7.6M
old/gstfmp4.lib                     | 20M           | 20M      | 19M                    | 12M                   | 11M
old/gstgif.lib                      | 12M           | 12M      | 12M                    | 7.9M                  | 7.9M
old/gstgopbuffer.lib                | 9.7M          | 9.7M     | 9.7M                   | 7.5M                  | 7.4M
old/gsthlsmultivariantsink.lib      | 16M           | 16M      | 16M                    | 9.6M                  | 9.4M
old/gsthlssink3.lib                 | 14M           | 14M      | 14M                    | 8.9M                  | 8.8M
old/gsthsv.lib                      | 11M           | 11M      | 11M                    | 7.8M                  | 7.7M
old/gstjson.lib                     | 12M           | 12M      | 12M                    | 8.4M                  | 8.2M
old/gstlewton.lib                   | 12M           | 12M      | 12M                    | 8.1M                  | 8.1M
old/gstlivesync.lib                 | 12M           | 12M      | 12M                    | 8.3M                  | 8.2M
old/gstmp4.lib                      | 17M           | 17M      | 17M                    | 9.9M                  | 9.7M
old/gstmpegtslive.lib               | 12M           | 12M      | 12M                    | 8.0M                  | 7.9M
old/gstndi.lib                      | 21M           | 21M      | 20M                    | 12M                   | 11M
old/gstoriginalbuffer.lib           | 9.6M          | 9.6M     | 9.7M                   | 7.4M                  | 7.3M
old/gstquinn.lib                    | 94M           | 94M      | 86M                    | 39M                   | 35M
old/gstraptorq.lib                  | 18M           | 18M      | 17M                    | 9.8M                  | 9.4M
old/gstrav1e.lib                    | 39M           | 39M      | 37M                    | 19M                   | 18M
old/gstregex.lib                    | 26M           | 26M      | 25M                    | 14M                   | 14M
old/gstreqwest.lib                  | 53M           | 53M      | 49M                    | 24M                   | 22M
old/gstrsanalytics.lib              | 15M           | 15M      | 14M                    | 9.2M                  | 8.9M
old/gstrsaudiofx.lib                | 57M           | 57M      | 56M                    | 23M                   | 22M
old/gstrsclosedcaption.lib          | 40M           | 40M      | 36M                    | 20M                   | 18M
old/gstrsinter.lib                  | 14M           | 14M      | 13M                    | 8.5M                  | 8.4M
old/gstrsonvif.lib                  | 21M           | 21M      | 20M                    | 11M                   | 11M
old/gstrspng.lib                    | 13M           | 13M      | 13M                    | 8.2M                  | 8.2M
old/gstrsrtp.lib                    | 47M           | 47M      | 44M                    | 22M                   | 20M
old/gstrsrtsp.lib                   | 35M           | 35M      | 33M                    | 16M                   | 15M
old/gstrstracers.lib                | 28M           | 28M      | 27M                    | 16M                   | 15M
old/gstrsvideofx.lib                | 16M           | 16M      | 35M                    | 9.2M                  | 15M
old/gstrswebrtc.lib                 | 329M          | 329M     | 284M                   | 124M                  | 105M
old/gststreamgrouper.lib            | 9.6M          | 9.6M     | 9.7M                   | 7.2M                  | 7.2M
old/gsttextahead.lib                | 9.6M          | 9.6M     | 9.5M                   | 7.4M                  | 7.3M
old/gsttextwrap.lib                 | 13M           | 13M      | 13M                    | 8.4M                  | 8.4M
old/gstthreadshare.lib              | 49M           | 49M      | 45M                    | 23M                   | 20M
old/gsttogglerecord.lib             | 13M           | 13M      | 13M                    | 8.5M                  | 8.4M
old/gsturiplaylistbin.lib           | 11M           | 11M      | 11M                    | 7.9M                  | 7.9M
old/gstvvdec.lib                    | 11M           | 11M      | 11M                    | 7.5M                  | 7.5M
old/gstwebrtchttp.lib               | 69M           | 69M      | 63M                    | 30M                   | 28M

Conclusion

This article presented several longstanding pain points with Rust staticlibs, namely binary sizes, symbol leaking, and incompatibilities between Rust and MSVC. I demonstrated dragonfire, a tool that aims to address or work around these issues where possible, and outlined the problems that remain.

As explained earlier, dragonfire-treated libraries are live on all platforms except Apple's if you use the development packages from mainline; it is hopefully on track for the 1.28 release of GStreamer. There's already a merge request pending to enable it for Apple platforms; we're only waiting to update the Rust minimum version.

If you want to have a look, dragonfire's source code is available at Freedesktop's GitLab instance. Please note that at the moment I have no plans to submit this to crates.io.

Feel free to contact me with any feedback, and thanks for reading!


  1. See its default-https-client feature at lib.rs; you will find it throughout the AWS SDK ecosystem. ↩︎

  2. https://doc.rust-lang.org/reference/items/external-blocks.html#dylib-versus-raw-dylib ↩︎

  3. In all cases the -C flags are debug=1 + codegen-units=1 + opt-level=s; see this comment for the complete results across all platforms. ↩︎

  4. Source: https://gitlab.freedesktop.org/gstreamer/cerbero/-/merge_requests/1895 ↩︎



Some time ago, Edward and I wrote a new element that allows clocking a GStreamer pipeline from an MPEG-TS stream, for example received via SRT.

This new element, mpegtslivesrc, wraps around any existing live source element, e.g. udpsrc or srtsrc, and provides a GStreamer clock that approximates the sender's clock. By making use of this clock as the pipeline clock, it is possible to run the whole pipeline at the same speed as the sender is producing the stream, without having to implement any kind of clock drift mechanism like skewing or resampling. Without this, it is currently necessary to adjust the timestamps of media coming out of GStreamer's tsdemux element, which is problematic if accurate timestamps are needed or if the stream is to be stored in a file: a 25fps stream, for example, wouldn't have exactly 40ms inter-frame timestamp differences anymore.

The clock is approximated by making use of the in-stream MPEG-TS PCR, which basically gives the sender's clock time at specific points inside the stream, and correlating it with the local receive times via a linear regression to calculate the relative rate between the sender's clock and the local system clock.
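
For intuition, the fit itself is just an ordinary least-squares regression: collect (local receive time, PCR time) pairs and estimate the slope (the relative clock rate) and intercept. A toy illustration of the idea, not the element's actual implementation:

/// Estimate the sender clock as `remote ≈ rate * local + offset` from
/// observed (local receive time, PCR time) pairs, via least squares.
fn estimate_clock(samples: &[(f64, f64)]) -> Option<(f64, f64)> {
    let n = samples.len() as f64;
    if samples.len() < 2 {
        return None;
    }
    let (sx, sy): (f64, f64) = samples
        .iter()
        .fold((0.0, 0.0), |(ax, ay), (x, y)| (ax + x, ay + y));
    let (mx, my) = (sx / n, sy / n);
    let (mut num, mut den) = (0.0, 0.0);
    for (x, y) in samples {
        num += (x - mx) * (y - my);
        den += (x - mx) * (x - mx);
    }
    if den == 0.0 {
        return None;
    }
    let rate = num / den;        // relative rate between the two clocks
    let offset = my - rate * mx; // sender time at local time zero
    Some((rate, offset))
}

fn main() {
    // Local receive times vs. PCR values (seconds); sender runs 0.01% fast.
    let samples: Vec<(f64, f64)> = (0..100)
        .map(|i| (i as f64 * 0.1, i as f64 * 0.1 * 1.0001 + 42.0))
        .collect();
    let (rate, offset) = estimate_clock(&samples).unwrap();
    println!("rate = {rate:.6}, offset = {offset:.3}");
}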

Usage of the element is as simple as

$ gst-launch-1.0 mpegtslivesrc source='srtsrc location=srt://1.2.3.4:5678?latency=150&mode=caller' ! tsdemux skew-corrections=false ! ...
$ gst-launch-1.0 mpegtslivesrc source='udpsrc address=1.2.3.4 port=5678' ! tsdemux skew-corrections=false ! ...

Addition 2025-06-28: If you're using an older (< 1.28) version of GStreamer, you'll have to use the ignore-pcr=true property on tsdemux instead. skew-corrections=false was only added recently and allows for more reliable handling of MPEG-TS timestamp discontinuities.

A similar approach for clocking is implemented in the AJA source element and the NDI source element when the clocked timestamp mode is configured.



When using hlssink3 and hlscmafsink elements, it's now possible to track new fragments being added by listening for the hls-segment-added message:

Got message #67 from element "hlscmafsink0" (element): hls-segment-added, location=(string)segment00000.m4s, running-time=(guint64)0, duration=(guint64)3000000000;
Got message #71 from element "hlscmafsink0" (element): hls-segment-added, location=(string)segment00001.m4s, running-time=(guint64)3000000000, duration=(guint64)3000000000;
Got message #74 from element "hlscmafsink0" (element): hls-segment-added, location=(string)segment00002.m4s, running-time=(guint64)6000000000, duration=(guint64)3000000000;

This is similar to how you would listen for splitmuxsink-fragment-closed when using the older hlssink2.
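
In code, listening for these messages with the gstreamer-rs bindings looks roughly like this (a minimal sketch with error handling elided):

use gst::prelude::*;

fn watch_segments(pipeline: &gst::Pipeline) {
    let bus = pipeline.bus().expect("pipeline has a bus");
    // Iterate messages synchronously; a real application would use a bus
    // watch integrated with its main loop instead.
    for msg in bus.iter_timed(gst::ClockTime::NONE) {
        if let gst::MessageView::Element(element) = msg.view() {
            if let Some(s) = element.structure() {
                if s.name() == "hls-segment-added" {
                    // Field names as shown in the messages above.
                    let location = s.get::<&str>("location").unwrap_or("?");
                    let duration = s.get::<u64>("duration").ok();
                    println!("new segment: {location} (duration: {duration:?})");
                }
            }
        }
    }
}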



Last month, as part of the GTK 4.14 release, GTK gained support for directly importing DMABufs on Wayland. Among other things, this allows passing decoded video frames from hardware decoders to GTK, and then, under certain circumstances, allows GTK to directly forward the DMABuf to the Wayland compositor. And under even more special circumstances, this can then be passed directly to the GPU driver. Matthias wrote some blog posts about the details.

In short, this reduces CPU usage and power consumption considerably when using a suitable hardware decoder and running GTK on Wayland. A suitable hardware decoder in this case is one provided by e.g. Intel or (newer) AMD GPUs via VA but unfortunately not NVIDIA because they simply don't support DMABufs.

I've added support for this to the GStreamer GTK4 video sink, gtk4paintablesink that exists as part of the GStreamer Rust plugins. Previously it was only possible to pass RGB system memory (i.e. after downloading from the GPU in case of hardware decoders) or GL textures (with all kinds of complications) from GStreamer to GTK4.
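
Hooking the sink up in a GTK4 application is straightforward: it exposes its output through a paintable property that can be handed to a gtk::Picture. A minimal sketch (assuming the gtk4 and gstreamer crates, inside an already-initialized GStreamer and GTK application):

use gst::prelude::*;
use gtk::prelude::*;

// Build a gtk4paintablesink and a GTK widget displaying its output.
fn build_video_widget() -> Result<(gst::Element, gtk::Picture), Box<dyn std::error::Error>> {
    let sink = gst::ElementFactory::make("gtk4paintablesink").build()?;

    // The sink renders into a GdkPaintable that any GTK4 widget can display.
    let paintable = sink.property::<gtk::gdk::Paintable>("paintable");
    let picture = gtk::Picture::for_paintable(&paintable);

    Ok((sink, picture))
}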

In general the GTK4 sink now offers the most complete GStreamer / UI toolkit integration, even more than the QML5/6 sinks, and it is used widely by various GNOME applications.