Google Meet is a widely used video conferencing solution. As we here at Red5 Pro are quite interested in all things live streaming, our System Architect, Davide, decided to take a look under the covers to see how it works. Specifically, he wanted to explore how Meet handles the audio channels. To be clear, Davide’s approach was pure black-box reverse engineering of Google Meet. He didn’t have access to their backend or their source code, nor did he decompile anything to find out how it works. He used tools like Chrome’s WebRTC internals to observe how the system behaves and used deductive reasoning to infer how it might be working.
TLDR – It appears that, throughout the call, Google Meet sends only the audio of the three loudest participants, using three separate WebRTC audio tracks (A, B, and C).
Before we get too far, what is mix-minus?
Mix-minus is an audio engineering term for how a mixed audio feed is returned to a remote participant. Sending a feed over the internet introduces a slight delay between when something is said and when it is heard. Since each participant’s audio is mixed together with all the other audio feeds, there is the potential for the speaker’s own delayed audio to come back to them and get caught up in their microphone. At worst, this causes an ear-splitting feedback loop; at best, it causes a very distracting echo.
The process of mix-minus subtracts the current speaker’s audio input from the complete audio feed they receive. This is also known as a “clean feed”. Essentially, it is a method for ensuring that no one on the call gets any painful or annoying echoes. While there are more straightforward applications of mix-minus, such as a telephone caller dialing in to a radio station, implementing mix-minus for a conference call is a little more difficult considering all the extra inputs from the other callers involved.
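To make the idea concrete, here is a minimal sketch of mix-minus for a small conference. It is only an illustration of the general technique, not Google Meet's code: the participant names and integer "samples" are made up, and real audio mixing would also have to deal with codecs, timing, and clipping.

```python
from typing import Dict, List


def mix_minus(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Return, for each participant, the sum of everyone else's samples."""
    if not frames:
        return {}
    # Full mix: sample-by-sample sum of every participant's input.
    full_mix = [sum(samples) for samples in zip(*frames.values())]
    # Clean feed: the full mix minus the participant's own contribution.
    return {
        name: [mixed - own for mixed, own in zip(full_mix, samples)]
        for name, samples in frames.items()
    }


# Hypothetical one-frame inputs (two samples each) for three callers.
frames = {"alice": [1, 2], "bob": [10, 20], "carol": [100, 200]}
feeds = mix_minus(frames)
print(feeds["alice"])  # [110, 220]: bob + carol, without alice's own audio
```

Each caller ends up with a different feed, which is why the conference case needs one clean feed per participant rather than a single shared mix.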
This brings us to our next question: how does Google Meet implement mix-minus with WebRTC?
The test involved creating a conference with seven participants in Google Chrome, each in a separate tab. Davide then monitored the sessions with Chrome’s WebRTC internals. All the participants were initially muted, and then one participant was unmuted and instructed to talk. As that participant spoke, WebRTC internals showed data arriving on only one of the three audio tracks. The same result was obtained with the other participants speaking one at a time: each participant was mapped to exactly one of the three tracks, and whenever a participant was the only one talking, their corresponding audio track was the one receiving data.
Then two participants that had previously been mapped individually to the same audio track were unmuted. For example, Participant 1 spoke alone and was mapped to track A, then Participant 2 spoke alone and was mapped to track A as well. When both were unmuted and spoke at the same time, two tracks (A and B) received data simultaneously. After that, each of the two spoke individually again: when Participant 1 was talking alone, data appeared only on track A, while for Participant 2 it appeared only on track B. Therefore, the Meet platform had remapped one of the participants to a different track.
Repeating the same test with three participants that were initially mapped to the same track replicated the same result with the three participants being remapped to different audio tracks. Additional testing revealed that when two participants are mapped to different tracks, they remain on different tracks the whole time. This suggests that despite multiple participants being mapped to an audio track, the audio track itself can only transport the audio of a single participant. When both participants try to use the same audio track, one is remapped to a different track. That further suggests that Google Meet does not send a mixed stream to the participants.
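The observed behavior can be mimicked with a toy allocator. This is purely a hypothetical model built to reproduce the two-participant experiment, not a description of Meet's actual logic; the `TrackAllocator` class, the default track, and the participant names are all assumptions.

```python
from typing import Dict, List, Tuple


class TrackAllocator:
    """Toy model: each track carries at most one speaker at a time."""

    def __init__(self, tracks: Tuple[str, ...] = ("A", "B", "C")) -> None:
        self.tracks = tracks
        self.mapping: Dict[str, str] = {}  # participant -> last assigned track

    def assign(self, speakers: List[str]) -> Dict[str, str]:
        in_use: Dict[str, str] = {}  # track -> speaker for this moment
        for speaker in speakers:
            # Assumption: a speaker with no history defaults to the first track.
            track = self.mapping.get(speaker, self.tracks[0])
            if track in in_use:
                free = [t for t in self.tracks if t not in in_use]
                if not free:
                    continue  # more speakers than tracks: not carried
                track = free[0]  # conflict: remap to a free track
            in_use[track] = speaker
            self.mapping[speaker] = track  # the remapping persists
        return {speaker: track for track, speaker in in_use.items()}


alloc = TrackAllocator()
print(alloc.assign(["p1"]))        # {'p1': 'A'}
print(alloc.assign(["p2"]))        # {'p2': 'A'}
print(alloc.assign(["p1", "p2"]))  # {'p1': 'A', 'p2': 'B'}
print(alloc.assign(["p2"]))        # {'p2': 'B'}
```

The last call shows the key observation: once the conflict has forced a remapping, the second participant stays on the new track even when talking alone.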
The three audio tracks (A, B, and C) seem to be shared between participants, even though each participant may see them under different names. In fact, after running a few tests that forced remappings, it was apparent that once the tracks of one participant were renamed to match, the remappings looked identical to those of another participant.
Figure 1: High-level diagram of the architecture.
Figure 1 shows a high-level diagram of the architecture that Google Meet may be using. The presenters’ audio streams are fed to a Detector that determines the four loudest tracks, along with their levels, and feeds their packets to an Audio Engine. When a presenter subscribes to a conference, they receive the three loudest tracks from the Audio Engine. If one of these tracks belongs to that presenter, it is swapped with the remaining (fourth-loudest) one. In this way, each presenter gets a mix-minus audio stream.
As an example, suppose there are six presenters, presenters 1, 2, 3, and 4 are talking, and the three "loudest" are 1, 2, and 3. The presenters would then get the following streams:
- Presenter 1: 4, 2, 3
- Presenter 2: 1, 4, 3
- Presenter 3: 1, 2, 4
- Presenter 4: 1, 2, 3
- Presenter 5: 1, 2, 3
- Presenter 6: 1, 2, 3
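The selection in this example can be sketched in a few lines. This is a guess at the logic based on the black-box observations, not Google Meet's implementation; the `select_streams` function and the audio level values are hypothetical.

```python
from typing import Dict, List


def select_streams(levels: Dict[int, float], n_tracks: int = 3) -> Dict[int, List[int]]:
    """For each presenter, pick the loudest streams excluding their own."""
    # Detector: rank presenters from loudest to quietest.
    ranked = sorted(levels, key=levels.get, reverse=True)
    top = ranked[:n_tracks]
    # The next-loudest stream is kept as a spare for the swap.
    spare = ranked[n_tracks] if len(ranked) > n_tracks else None

    result: Dict[int, List[int]] = {}
    for presenter in levels:
        streams = []
        for stream in top:
            if stream != presenter:
                streams.append(stream)
            elif spare is not None:
                streams.append(spare)  # swap own stream for the spare
        result[presenter] = streams
    return result


# Made-up levels: presenters 1 to 4 are talking, 5 and 6 are silent.
levels = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.4, 5: 0.0, 6: 0.0}
for presenter, streams in select_streams(levels).items():
    print(presenter, streams)
```

With these levels the output matches the list above: presenters 1, 2, and 3 each get stream 4 in place of their own, and everyone else gets 1, 2, 3.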
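Alternatively, instead of sending three separate audio tracks to each presenter, it would be possible to send a single track carrying the appropriate mix, selected from four mixes generated on the server side: one mix of the three loudest tracks, plus three mixes in which one of those tracks is replaced by the fourth.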
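A rough sketch of that single-track alternative follows. It assumes the server sums integer samples into four mixes and hands each presenter the one that omits their own audio; the function names, the `DEFAULT` key, and the sample values are illustrative only.

```python
from typing import Dict, List

DEFAULT = "default"  # hypothetical key for the mix of the three loudest


def build_mixes(frames: Dict[int, List[int]], top3: List[int], fourth: int) -> dict:
    """Build the four server-side mixes from per-speaker sample frames."""

    def mix(speakers: List[int]) -> List[int]:
        # Sample-by-sample sum of the chosen speakers.
        return [sum(vals) for vals in zip(*(frames[s] for s in speakers))]

    mixes = {DEFAULT: mix(top3)}
    for speaker in top3:
        # Variant for this speaker: their audio replaced by the fourth loudest.
        mixes[speaker] = mix([fourth if s == speaker else s for s in top3])
    return mixes


def mix_for(presenter: int, mixes: dict) -> List[int]:
    """Pick the one mix that does not contain the presenter's own audio."""
    return mixes.get(presenter, mixes[DEFAULT])


# Made-up frames chosen so each speaker's contribution is easy to spot.
frames = {1: [1, 1], 2: [10, 10], 3: [100, 100], 4: [1000, 1000]}
mixes = build_mixes(frames, top3=[1, 2, 3], fourth=4)
print(mix_for(1, mixes))  # [1110, 1110]: speakers 4 + 2 + 3
print(mix_for(5, mixes))  # [111, 111]: speakers 1 + 2 + 3
```

The trade-off is that the server does the mixing work once per variant, while each client receives and decodes one track instead of three.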
We wanted to undertake this examination in order to study how other solutions have effectively prevented feedback loops in their application. We are considering integrating a similar approach to our own product. Do you have other ideas or suggestions for how to implement mix-minus with WebRTC? Let us know. Message email@example.com or schedule a call.