As WebRTC-based communications take hold in ever more live-streamed use cases, developers will be glad to know they now have an opportunity to bring compelling, immersive spatial auditory experiences into play, whether or not extended reality (XR) technologies are involved.

The Emergence of Immersive Spatial Audio

This next phase in spatial audio evolution comes on the heels of recent initiatives by Google, Apple, Amazon, Dolby, and the many other tech suppliers leveraging advances in immersive audio technology to enable more realistic listening experiences for one-way streaming. The biggest impact so far stems from Google’s release of Resonance Audio software development kits (SDKs), which work in Android, iOS, Windows, Linux, and the Unity and Unreal game engine operating environments. The Resonance Audio SDK allows developers to dynamically model complex sound flows from, say, various animated characters and other audio sources in a game, in accord with each player’s movements in virtual space.

For example, a player entering a large, enclosed space hears cathedral-like echo effects. The noise made by an approaching enemy from behind lets the player know where the threat is coming from. Near and far battlefield sounds change as the player traverses the in-game environment.

According to The New York Times, spatial audio is either offered or soon will be on streaming platforms such as Apple Music, Tidal, Netflix, and the Amazon-owned podcast network Wondery. Meta, too, with its heavy focus on XR technology, is reported to be developing new spatial audio tools that go beyond the current level of support offered through its Oculus and other XR platforms.

All this recent activity is spurred partly by the emergence of head-tracking technology in headphones and ear pods, as well as XR headgear, which is essential to synchronizing auditory experiences with users’ movements. The Times notes head tracking is supported by the latest iterations of Apple’s AirPods and is also slated for wide adoption in Android-based gear.

The use of immersive spatial audio lends realism to users’ engagements with content delivered over one-way HTTP streaming infrastructure. Another potential dimension to immersive spatial audio would be the support of multiple users as they verbally interact with each other in real time.

This type of interaction requires a cost-effective way to activate a high level of auditory realism on WebRTC-based streaming platforms. That challenge has now been met by Red5 Pro with the development of an immersive spatial audio solution for use on its WebRTC-based Experience Delivery Network (XDN) platform. The exciting technology will enable fully immersive real-time auditory experiences in chat rooms, watch parties, fast action games, collaborative work environments, and other scenarios that rely on real-time communications, even in crowded social scenarios. The ability to embellish real-time chat and video communications running on WebRTC platforms with truly immersive audio experiences lends a powerful new appeal to such apps. And, of course, it represents a major step toward support for the fully immersive six-degrees-of-freedom (6DoF) experiences that WebRTC platforms need to provide in their role as the streaming linchpins to networked uses of XR technologies. (The intrinsic tie-in between WebRTC streaming and XR is explored at length in this Red5 Pro blog.)

Spatial Audio Technology

Immersive spatial audio is the latest advance in a decades-long evolution of sound engineering that began when stereo recording was introduced in the 1930s and progressed in the analog era to the emergence of multi-speaker surround sound systems in the ’70s. Subsequently, an outpouring of digital innovations brought ambient sound propagation at high orders of complexity to everything from concert halls, theaters, and living rooms to virtual online spaces.

The jump to implementation of immersive spatial audio technology in boxed and one-way streamed content relies on technical solutions that vary in detail but largely follow the same prescriptions. Basically, the techniques make it possible to mix sounds in 3D space and then associate those mixes with specific stationary sound sources and objects that move around in the scene either in timed sequences or as a result of interactions with users.

The most widely used approach to enabling this dynamic object-based mixing of sound is known as Ambisonics. Ambisonics was invented in the ’70s but little used until VR began to ramp up ten years ago, at which point patent expirations opened the door to royalty-free usage. Google’s Resonance Audio is based on Ambisonics. In addition, audio production applications like Reaper and SuperCollider allow producers to use the Ambisonic Toolkit as a plug-in to create immersive spatial audio experiences.

What’s known as the Ambisonics B-format encapsulates signal sets in ways that are somewhat analogous to a three-dimensional extension of horizontally spaced audio streams in a surround-sound stereo system. With Ambisonics, the signal sets can be manipulated in sync with the spatial positions of the objects they’re attached to in relation to any given user.

B-format signal sets represent the full sphere of sound arriving at a point in virtual space, much as a soundfield microphone would capture it in real space, with higher orders adding channels and finer spatial resolution. Sound signals can be partitioned into channels relating to height, depth, reflection, reverberation, and other parameters, allowing an Ambisonics panner (a specialized software-based audio encoder) to persistently configure them to create a realistic audio experience for each user.
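As a rough illustration of what a panner does at the lowest (first) order, the sketch below encodes one mono sample into four B-format channels using the traditional weighting: an omnidirectional channel W plus three directional channels X, Y, and Z. The function name and the angle convention (azimuth measured counterclockwise from straight ahead, elevation measured upward) are assumptions made for this example. Production encoders typically work at higher orders and add reflections and reverberation.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into first-order B-format (W, X, Y, Z).

    Hypothetical helper for illustration only.
    azimuth_deg: 0 = straight ahead, positive = toward the listener's left.
    elevation_deg: 0 = ear level, positive = above.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)   # front-back component
    y = sample * math.sin(az) * math.cos(el)   # left-right component
    z = sample * math.sin(el)                  # up-down component
    return (w, x, y, z)
```

A source straight ahead at ear level lands entirely in W and X, while a source directly to the left lands in W and Y. That is why moving an object around the listener only requires recomputing these weights, not re-recording the sound.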

In addition, the platform employs what are known as head-related transfer functions (HRTFs) to create spatially accurate virtual sounds in 3D space based on each user’s psycho-acoustical characteristics. HRTFs make this possible by accounting for a listener’s head size, ear shape, and angular position, which together determine the difference in the time it takes a sound to reach the left and right ears and the difference in loudness experienced by each ear.
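To give a feel for the time difference involved, the sketch below approximates it with Woodworth's spherical-head formula, a textbook simplification rather than a measured HRTF. The head radius, speed of sound, sample rate, and function names are all assumptions chosen for this example. Real HRTFs also capture the frequency-dependent loudness difference and ear-shape effects, which this sketch leaves out.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 degrees C
HEAD_RADIUS_M = 0.0875      # assumed average adult head radius


def interaural_time_difference(azimuth_deg, head_radius_m=HEAD_RADIUS_M):
    """Approximate the interaural time difference in seconds.

    Uses Woodworth's spherical-head model: (r / c) * (theta + sin(theta)).
    azimuth_deg is measured from straight ahead, positive toward the left
    ear, and is clamped to the frontal range [-90, 90].
    A positive result means the sound reaches the left ear first.
    """
    clamped = max(-90.0, min(90.0, azimuth_deg))
    theta = math.radians(abs(clamped))
    itd = (head_radius_m / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))
    return math.copysign(itd, clamped)


def itd_in_samples(azimuth_deg, sample_rate_hz=48000):
    """Express the delay for the far ear as a whole number of samples."""
    return round(abs(interaural_time_difference(azimuth_deg)) * sample_rate_hz)
```

With these assumed constants, a source directly to one side arrives at the far ear about two-thirds of a millisecond late, a delay of a few dozen samples at common audio sample rates.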

An Optimal Spatial Audio Solution for WebRTC Streaming

Now that this level of audio realism is gaining traction across a vast range of use cases in the one-way HTTP streaming domain, it’s clear the time has come to bring immersive spatial audio experiences into two-way communications streamed over WebRTC infrastructure. In fact, significant steps in this direction have already been taken by Agora, a provider of WebRTC-based infrastructure supporting applications offered in software-as-a-service (SaaS) mode. For example, Chalkboard, an app specializing in social settings for sports bettors, has created a virtual sports bar experience on the Agora platform where conversations between users are amplified against conversational background noise.

Red5 Pro is taking immersive spatial audio to a new level of versatility and scalability in WebRTC-based communications. Its ability to do so stems from unique characteristics of XDN architecture that can be employed to maintain real-time configurations of realistic audio experiences in 3D space no matter how many users are engaged or how complex the ambient environment might be.

XDN-based immersive spatial audio will raise users’ personal conversational sound levels against background conversations and other sounds wherever anyone strikes up a conversation while moving around in a virtual space. Other sounds, too, will be perceived in accord with the user’s location at each moment. And the solution will be able to replicate shifting sound patterns resulting from surface obstructions, noise bursts, or other audio-impacting conditions in virtual space.
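The behavior described above can be pictured with a simple gain model. This is a minimal sketch, not the XDN implementation: it uses a generic inverse-distance rolloff (the same shape as the "inverse" distance model in the Web Audio API) and then boosts whichever source the listener is conversing with. The function names, the `partner_boost` parameter, and the default values are hypothetical.

```python
import math


def inverse_distance_gain(distance_m, ref_distance_m=1.0, rolloff=1.0):
    """Gain that is 1.0 inside ref_distance_m and falls off beyond it."""
    d = max(distance_m, ref_distance_m)
    return ref_distance_m / (ref_distance_m + rolloff * (d - ref_distance_m))


def listener_gains(listener_pos, sources, partner_id=None, partner_boost=2.0):
    """Return a gain per source for one listener.

    listener_pos: (x, y, z) position of the listener in meters.
    sources: dict mapping a source id to its (x, y, z) position.
    partner_id: the source the listener is talking with, if any.
    partner_boost: assumed multiplier that lifts the conversation
        above the surrounding background sound.
    """
    gains = {}
    for source_id, position in sources.items():
        gain = inverse_distance_gain(math.dist(listener_pos, position))
        if source_id == partner_id:
            gain *= partner_boost
        gains[source_id] = gain
    return gains
```

Recomputing these gains whenever a user moves is cheap, since it involves one distance calculation per audible source rather than any audio processing.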

Red5 Pro’s development of a unique immersive spatial audio system for WebRTC-based real-time interactive streaming (RTIS) exemplifies what can be accomplished in a developer-first environment that allows customers to create a solution precisely customized to fit their use cases. This contrasts with the limited DevOps flexibility enabled by most WebRTC platforms, which focus on pre-baked solutions offered in SaaS mode.

In this case, Red5 Pro has worked with a customer to create an immersive spatial audio solution that sets new standards for realism that can now be met by other entities on the XDN platform. Of course, those who want to exploit these capabilities will need to incorporate what has already been accomplished with Ambisonics or other spatial audio systems into their applications software stacks.

With such 3D dynamic sound mixing capabilities in operation, the role of XDN infrastructure is to deliver a continuously accurate audio experience to each user in real time as they communicate with other people moving around in a virtual space. This requires persistent delivery of a dedicated audio stream to each user that incorporates the updated mix of sounds dictated by where they are and who they are communicating with from one split second to the next.

One of the key challenges in the development process was the need to avoid having to continuously remix all the sounds in the virtual space for each user’s audio stream, which would incur prohibitive processing costs and unacceptable latency. While the specific solution is proprietary, we can report that we are able to support shared mixes of ambient sound that can be applied individually without remixing to realistically reflect each user’s position in the virtual space. Notably, these ambient sound mixes even include audio generated by dynamically moving non-player characters (NPCs), which each user hears based on their proximity to the NPCs.
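The Red5 Pro approach is proprietary and is not shown here. As a general illustration of why a shared mix can be adapted per listener without remixing, the sketch below uses a well-known Ambisonics property: a first-order B-format frame can be rotated to match each listener's head orientation with a few multiplications and then decoded to stereo. All function names are hypothetical, the decode uses two virtual cardioid microphones pointing left and right, and the sketch covers orientation only, not a listener's position in the space.

```python
import math


def rotate_yaw(frame, listener_yaw_deg):
    """Rotate a B-format frame (W, X, Y, Z) into a listener's frame.

    listener_yaw_deg: the direction the listener faces, measured
    counterclockwise from the scene's forward axis.
    W and Z are unaffected by a rotation about the vertical axis.
    """
    w, x, y, z = frame
    phi = math.radians(listener_yaw_deg)
    x_rot = x * math.cos(phi) + y * math.sin(phi)
    y_rot = -x * math.sin(phi) + y * math.cos(phi)
    return (w, x_rot, y_rot, z)


def decode_stereo(frame):
    """Decode to (left, right) with virtual cardioids at +90 and -90 degrees."""
    w, _x, y, _z = frame
    mid = math.sqrt(2.0) * w  # undo the 1/sqrt(2) weighting on W
    return (0.5 * (mid + y), 0.5 * (mid - y))


def render_for_listeners(shared_frame, yaw_by_user):
    """Derive each user's stereo frame from one shared B-format frame."""
    return {
        user: decode_stereo(rotate_yaw(shared_frame, yaw))
        for user, yaw in yaw_by_user.items()
    }
```

In this sketch the ambient scene is mixed once into `shared_frame`, and each listener's output costs only a rotation and a decode. For a source on the scene's left, a listener facing forward hears it in the left channel, while a listener who has turned around hears it in the right.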

At the same time, against this background of shifting ambient sounds, we incorporate into users’ audio streams support for any direct one-on-one communications they might have as they move around the space. Thus, we have a practical solution that creates a uniquely realistic immersive experience of WebRTC-based communications in three-dimensional space.

A key element in the XDN toolset that makes this possible is Red5 Pro’s Cauldron transcoding engine, which, as explained in this blog, supports the blending of personally directed stream segments with the primary streams in real time. These segments, which we call Brews, can be added in the transcoding process without incurring delays usually caused by the need to translate coding to the languages understood by processors.


Immersive spatial audio is the latest example of the innovations on the XDN platform that are transforming RTIS use cases. Personalization of user experiences is occurring with graphics overlays, multi-camera viewing options, targeted advertising, presentations of purchasing options in gameplay and other e-commerce settings, and much else.

Everything happens in real time with no limits on scalability. The XDN architecture enables the orchestration of a hierarchy of Origin, Relay, and Edge Nodes in support of multi-directional streaming across any number of endpoints, with end-to-end latency ranging from under 400ms to as low as 50ms. The real-time infrastructure can be instantiated to operate with fail-safe redundancy at global distances across multiple cloud platforms.

To learn more about the immersive spatial audio solution or other aspects of XDN technology, contact us or feel free to schedule a call.
