Devlog

Posts tagged with #analytics

Currently most code using the GStreamer Analytics library is written in C or Python. To check how well the API works from Rust, and to have an excuse to play with the Rust burn deep-learning framework, I've implemented an object detection inference element based on the YOLOX model and a corresponding tensor decoder that allows usage with other elements based on the GstAnalytics API. I started this work at the last GStreamer hackfest, and it has now finally been merged and will be part of the GStreamer 1.28.0 release.

burn is a deep-learning framework in Rust that sits at approximately the same level of abstraction as PyTorch. It features lots of computation backends (CPU-based, Vulkan, CUDA, ROCm, Metal, libtorch, ...), has loaders (or rather: code generation) for e.g. ONNX or PyTorch models, and compiles and optimizes the model for a specific backend. It also comes with a repository containing various example models and links to other community models.
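
To give a rough idea of what using it looks like, here is a minimal sketch that runs the same small computation on two different backends. This follows recent burn releases (with the ndarray and wgpu cargo features enabled); exact type and function names may differ between versions.

use burn::tensor::backend::Backend;
use burn::tensor::Tensor;

// The computation backend is just a type parameter, so the same code can run
// on any of the supported backends.
fn sum_of_squares<B: Backend>(device: &B::Device) {
    let t = Tensor::<B, 1>::from_floats([1.0f32, 2.0, 3.0], device);
    println!("{:?}", (t.clone() * t).sum().into_data());
}

fn main() {
    // CPU-based ndarray backend...
    sum_of_squares::<burn::backend::NdArray>(&Default::default());
    // ...or a GPU backend via wgpu (e.g. Vulkan).
    sum_of_squares::<burn::backend::Wgpu>(&Default::default());
}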

The first element is burn-yoloxinference. It takes raw RGB video frames and passes them through burn, at the time of writing either through a CPU-based or a Vulkan-based computation backend. The output is then the very same video frames with the raw object detection results attached as a GstTensorMeta. This is essentially an 85x8400 float matrix containing 8400 candidate object detection boxes, each described by 4 floats for the box coordinates, confidence values for the classes (80 floats for the models pre-trained on the COCO classes) and one confidence value for the overall box. The element itself is mostly boilerplate, caps negotiation code and glue code between GStreamer and burn.
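
To make that layout more concrete, here is a hypothetical sketch of how one such 85-float candidate could be interpreted. The field names and ordering follow the usual YOLOX convention and are assumptions here, not the element's actual code.

// One candidate box out of the 8400, i.e. one 85-float slice of the tensor.
struct Candidate {
    // Box center and size, in pixels of the 640x640 model input.
    cx: f32,
    cy: f32,
    w: f32,
    h: f32,
    // Overall confidence that this box contains an object at all.
    objectness: f32,
    // One confidence value per class (80 for the COCO-trained models).
    class_scores: Vec<f32>,
}

fn parse_candidate(values: &[f32]) -> Candidate {
    assert_eq!(values.len(), 85);
    Candidate {
        cx: values[0],
        cy: values[1],
        w: values[2],
        h: values[3],
        objectness: values[4],
        class_scores: values[5..].to_vec(),
    }
}

fn main() {
    // 85 zeroes stand in for one column of the real tensor data.
    let values = vec![0.0f32; 85];
    let candidate = parse_candidate(&values);
    println!("objectness: {}", candidate.objectness);
}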

The second element is yoloxtensordec. It takes the output of the first element and decodes the GstTensorMeta into a GstAnalyticsRelationMeta, which describes the detected objects with their bounding boxes in an abstract way. As part of this it also implements a non-maximum suppression (NMS) filter using the intersection over union (IoU) of bounding boxes to reduce the 8400 candidate boxes to a much smaller number of likely actual object detections. The GstAnalyticsRelationMeta can then be used e.g. by the generic objectdetectionoverlay element to render rectangles on top of the video, or by the ioutracker element to track objects over a sequence of frames. Again, this element is mostly boilerplate and caps negotiation code, plus around 100 SLOC for the actual algorithm. In comparison, the C YOLOv9 tensor decoder element is about 3x as much code, mostly due to the overhead of manual memory book-keeping in C, the lack of useful data structures and the lack of language facilities for abstraction.
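
For illustration, a minimal sketch of IoU and greedy NMS over axis-aligned boxes could look roughly like the following; this is illustrative code, not the element's actual implementation.

#[derive(Clone, Copy)]
struct DetBox {
    x1: f32,
    y1: f32,
    x2: f32,
    y2: f32,
    score: f32,
}

// Intersection over union of two axis-aligned boxes.
fn iou(a: &DetBox, b: &DetBox) -> f32 {
    let ix = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let iy = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    let inter = ix * iy;
    let area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    let area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    let union = area_a + area_b - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}

// Keep the highest-scoring boxes, dropping any box that overlaps an already
// kept box by more than `iou_threshold`.
fn nms(mut boxes: Vec<DetBox>, iou_threshold: f32) -> Vec<DetBox> {
    boxes.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut kept: Vec<DetBox> = Vec::new();
    for candidate in boxes {
        if kept.iter().all(|k| iou(k, &candidate) <= iou_threshold) {
            kept.push(candidate);
        }
    }
    kept
}

fn main() {
    let boxes = vec![
        DetBox { x1: 0.0, y1: 0.0, x2: 100.0, y2: 100.0, score: 0.9 },
        DetBox { x1: 10.0, y1: 10.0, x2: 110.0, y2: 110.0, score: 0.8 },
        DetBox { x1: 200.0, y1: 200.0, x2: 300.0, y2: 300.0, score: 0.7 },
    ];
    // The second box heavily overlaps the first and is dropped, so 2 are kept.
    println!("{} boxes kept", nms(boxes, 0.5).len());
}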

The reason why the tensor decoder is a separate element is mostly to have one such element per model and to have it implemented independently of the actual implementation and runtime of the model. The same tensor decoder should, for example, also work fine on the output of the onnxinference element with a YOLOX model. From GStreamer 1.28 onwards it will also be possible to autoplug suitable tensor decoders via the tensordecodebin element.

That the tensor decoders are independent of the actual implementation of the model also has the advantage that they can be implemented in a different language, preferably a safer and less verbose one than C.

To use both elements together, with objectdetectionoverlay rendering rectangles around the detected objects, the following pipeline can be used:

gst-launch-1.0 souphttpsrc location=https://raw.githubusercontent.com/tracel-ai/models/f4444a90955c1c6fda90597aac95039a393beb5a/squeezenet-burn/samples/cat.jpg \
    ! jpegdec ! videoconvertscale ! "video/x-raw,width=640,height=640" \
    ! burn-yoloxinference model-type=large backend-type=vulkan ! yoloxtensordec label-file=COCO_classes.txt \
    ! videoconvertscale ! objectdetectionoverlay \
    ! videoconvertscale ! imagefreeze ! autovideosink -v

The output should look similar to this image.

I also did a lightning talk about this at the GStreamer conference this year.



For a recent project it was necessary to collect the video frames of multiple streams during a specific interval, and in the future also audio, and to pass them through an inference framework to extract additional metadata from the media and attach it to the frames.

While GStreamer has gained quite a bit of infrastructure for machine-learning use-cases in the analytics library over the past years, there was nothing for this specific use-case yet.

As part of solving this, I proposed a design for a generic interface that allows combining and batching multiple streams into a single one by using empty buffers with a GstMeta that contains the buffers of the original streams, and caps that include the caps of the original streams and allow format negotiation in the pipeline to work as usual.

While this covers my specific use case of combining multiple streams, it should be generic enough to also handle other cases that came up during the discussions.

In addition I wrote two new elements, analyticscombiner and analyticssplitter, which make use of this new API to combine and batch multiple streams in a generic, media-agnostic way over specific time intervals, and to later split the batch out again into the original streams. The combiner can be configured to collect all media in the time interval, or only the first or last.

Conceptually the combiner element is similar to NVIDIA's DeepStream nvstreammux element, and in the future it should be possible to write a translation layer between the GStreamer analytics library and DeepStream.

The basic idea for the usage of these elements is to have a pipeline like

-- stream 1 --\                                                                  / -- stream 1 with metadata --
               -- analyticscombiner -- inference elements -- analyticssplitter --
-- stream 2 --/                                                                  \ -- stream 2 with metadata --
   ........                                                                           ......................
-- stream N -/                                                                     \- stream N with metadata --

The inference elements would only add additional metadata to each of the buffers, which can then be used further downstream in the pipeline for operations like overlays or blurring specific areas of the frames.
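
As a rough sketch, a pipeline of this shape could be set up from Rust with gstreamer-rs along the following lines. The identity element stands in for the actual inference elements, and the pad names (sink_0/sink_1 on the combiner, src_0/src_1 on the splitter) are assumptions for illustration rather than confirmed API.

use gstreamer as gst;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // Two test streams are batched, passed through a placeholder element and
    // then split out again into two separate streams.
    let pipeline = gst::parse::launch(
        "analyticscombiner name=combiner ! identity ! analyticssplitter name=splitter \
         videotestsrc ! video/x-raw,width=320,height=240 ! combiner.sink_0 \
         videotestsrc pattern=ball ! video/x-raw,width=320,height=240 ! combiner.sink_1 \
         splitter.src_0 ! queue ! autovideosink \
         splitter.src_1 ! queue ! autovideosink",
    )?;

    pipeline.set_state(gst::State::Playing)?;

    // Run until an error or end-of-stream.
    let bus = pipeline.bus().unwrap();
    for msg in bus.iter_timed(gst::ClockTime::NONE) {
        match msg.view() {
            gst::MessageView::Eos(..) | gst::MessageView::Error(..) => break,
            _ => (),
        }
    }

    pipeline.set_state(gst::State::Null)?;
    Ok(())
}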

In the future there are likely going to be more batching elements for specific stream types, operating on multiple or a single stream, or making use of completely different batching strategies.

Special thanks also to Olivier and Daniel who provided very useful feedback during the review of the two merge requests.