Centricular

Devlog

Posts tagged with #audio

Audio source separation describes the process of splitting an already mixed audio stream into its individual, logical sources, for example splitting a song into separate streams for its individual instruments and vocals. This can be used for karaoke or music practice, or for isolating a speaker from background noise, either to make the speech easier to understand for humans or to improve the results of speech-to-text processing.

Starting with GStreamer 1.28.0, an element for this purpose will be included. It is based on the Python/PyTorch implementation of demucs and comes with various pre-trained models that differ in their performance and accuracy characteristics, as well as in which sets of sources they can separate. CPU-based processing is generally multiple times faster than real-time on modern CPUs (around 8x on mine), but GPU-based processing via PyTorch is also possible.

The element itself is part of the GStreamer Rust plugins and can run demucs either locally in-process using an embedded Python interpreter via pyo3, or through a small Python service over WebSockets that can run locally or remotely (e.g. for thin clients). The model to use, the chunk size, and the overlap between chunks can be configured. Chunk size and overlap provide control over the introduced latency (lower values give lower latency) and the quality (higher values give better quality).
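
As a rough sketch of how the element might be set up from application code with the GStreamer Rust bindings (the model-name property matches the pipeline example below; the chunk size and overlap property names are not spelled out in this post, so the sketch simply lists what the element exposes instead of guessing them):

use gstreamer as gst;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // Create the demucs element (requires the gst-plugins-rs plugin to be
    // installed) and select one of the pre-trained models. property_from_str
    // parses the value the same way gst-launch-1.0 does.
    let demucs = gst::ElementFactory::make("demucs")
        .property_from_str("model-name", "htdemucs")
        .build()?;

    // Print the properties the element actually exposes, e.g. to find the
    // exact names of the chunk size and overlap settings (alternatively,
    // check `gst-inspect-1.0 demucs`).
    for pspec in demucs.list_properties().iter() {
        println!("{}", pspec.name());
    }

    Ok(())
}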

The separated sources are provided on individual source pads of the element, which effectively behaves like a demuxer. A karaoke pipeline would, for example, look as follows:

gst-launch-1.0 uridecodebin uri=file:///path/to/music/file ! audioconvert ! tee name=t ! \
  queue max-size-time=0 max-size-bytes=0 max-size-buffers=2 ! demucs name=demucs model-name=htdemucs \
  demucs.src_vocals ! queue ! audioamplify amplification=-1 ! mixer.sink_0 \
  t. ! queue max-size-time=9000000000 max-size-bytes=0 max-size-buffers=0 ! mixer.sink_1 \
  audiomixer name=mixer ! audioconvert ! autoaudiosink

This takes a URI to a music file and passes it through the demucs element to extract the vocals. The original input is taken via a tee, and the vocals are subtracted from it by first inverting all samples of the vocals stream with the audioamplify element and then mixing the result with the original input in an audiomixer.
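
For reference, the same pipeline can also be driven from application code. A minimal sketch with the GStreamer Rust bindings (error handling kept to a minimum; in older gstreamer-rs releases the parse function is called gst::parse_launch) could look like this:

use gstreamer as gst;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // Same pipeline description as the gst-launch-1.0 example above.
    let pipeline = gst::parse::launch(
        "uridecodebin uri=file:///path/to/music/file ! audioconvert ! tee name=t ! \
         queue max-size-time=0 max-size-bytes=0 max-size-buffers=2 ! \
         demucs name=demucs model-name=htdemucs \
         demucs.src_vocals ! queue ! audioamplify amplification=-1 ! mixer.sink_0 \
         t. ! queue max-size-time=9000000000 max-size-bytes=0 max-size-buffers=0 ! mixer.sink_1 \
         audiomixer name=mixer ! audioconvert ! autoaudiosink",
    )?;

    pipeline.set_state(gst::State::Playing)?;

    // Run until EOS or an error is posted on the bus.
    let bus = pipeline.bus().unwrap();
    for msg in bus.iter_timed(gst::ClockTime::NONE) {
        match msg.view() {
            gst::MessageView::Eos(..) => break,
            gst::MessageView::Error(err) => {
                eprintln!("Error: {}", err.error());
                break;
            }
            _ => (),
        }
    }

    pipeline.set_state(gst::State::Null)?;
    Ok(())
}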

I also did a lightning talk about this at the GStreamer conference this year.



Over the past few years, we've been slowly working on improving the platform-specific plugins for Windows, macOS, iOS, and Android, and making them work as well as the equivalent plugins on Linux. In this episode, we will look at audio device switching in the source and sink elements on macOS and Windows.

On Linux, if you're using the PulseAudio elements (whether with the PulseAudio daemon or with PipeWire), you get perfect device switching: quick, seamless, easy, and reliable. Simply set the device property whenever you want and you're off to the races. If the device gets unplugged, the pipeline will continue, and you will be notified of the unplug via the GST_MESSAGE_DEVICE_REMOVED bus message from GstDeviceMonitor so you can change the device.
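
As a minimal sketch of the GstDeviceMonitor side with the GStreamer Rust bindings (the device class filter and what is done with the notifications are illustrative; the same mechanism works for sinks with the "Audio/Sink" class):

use gstreamer as gst;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // Watch audio capture devices.
    let monitor = gst::DeviceMonitor::new();
    monitor.add_filter(Some("Audio/Source"), None);
    monitor.start()?;

    // Initial enumeration, e.g. to populate a device selection UI.
    for device in monitor.devices() {
        println!("found device: {}", device.display_name());
    }

    // Hotplug and unplug notifications arrive as messages on the monitor's bus.
    let bus = monitor.bus();
    for msg in bus.iter_timed(gst::ClockTime::NONE) {
        match msg.view() {
            gst::MessageView::DeviceAdded(m) => {
                println!("device added: {}", m.device().display_name());
            }
            gst::MessageView::DeviceRemoved(m) => {
                println!("device removed: {}", m.device().display_name());
                // Here the application would pick another device and update
                // the device property on its source or sink element.
            }
            _ => (),
        }
    }

    Ok(())
}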

As of a few weeks ago, the Windows Audio plugin wasapi2 implements the same behaviour. All you have to do is set the device property to whatever device you want (fetched using the GstDeviceMonitor API), at any time.

A merge request is open for adding the same feature to the macOS audio plugin, and is expected to be merged soon.

For graceful handling of errors such as an accidental device unplug, there's a new continue-on-error property. Setting it will cause the source to output silence after the unplug, whereas the sink will simply discard the buffers. An element warning will be emitted to notify the app (alongside the GST_MESSAGE_DEVICE_REMOVED bus message if there was a hardware unplug), and the app can switch to another device by setting the device property.
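
A minimal sketch of what that could look like with the GStreamer Rust bindings, assuming wasapi2src as the source element; the continue-on-error and device property names are the ones mentioned above, while "new-device-id" is only a placeholder for a real device ID that would be obtained e.g. via GstDeviceMonitor:

use gstreamer as gst;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // Windows audio capture; on macOS the equivalent source would be osxaudiosrc.
    let src = gst::ElementFactory::make("wasapi2src")
        // Keep the pipeline running (outputting silence) if the device goes away.
        .property("continue-on-error", true)
        .build()?;
    let convert = gst::ElementFactory::make("audioconvert").build()?;
    let sink = gst::ElementFactory::make("autoaudiosink").build()?;

    let pipeline = gst::Pipeline::new();
    pipeline.add_many([&src, &convert, &sink])?;
    gst::Element::link_many([&src, &convert, &sink])?;
    pipeline.set_state(gst::State::Playing)?;

    let bus = pipeline.bus().unwrap();
    for msg in bus.iter_timed(gst::ClockTime::NONE) {
        match msg.view() {
            // With continue-on-error set, an unplug shows up as an element
            // warning instead of tearing down the pipeline.
            gst::MessageView::Warning(w) => {
                eprintln!("warning: {}", w.error());
                // Pick a replacement device (e.g. via GstDeviceMonitor) and
                // switch to it at runtime; the placeholder below stands in for
                // a real platform-specific device ID.
                src.set_property("device", "new-device-id");
            }
            gst::MessageView::Error(e) => {
                eprintln!("error: {}", e.error());
                break;
            }
            gst::MessageView::Eos(..) => break,
            _ => (),
        }
    }

    pipeline.set_state(gst::State::Null)?;
    Ok(())
}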

Thanks to Seungha and Piotr for working on this!