Building a Fail-Safe Real-Time Streaming Infrastructure


One of the great benefits distributors should expect from real-time streaming on an open standards-based cloud platform is fail-safe performance enabled by automated cross-cloud redundancy and other persistent-quality mechanisms.

Robust performance has become a given of public cloud services, but all too often, failures still occur, whether as a result of network disruptions, server malfunctions or datacenter-wide breakdowns induced by power outages and other causes. Such occurrences are especially damaging in real-time live streaming situations, where, without recourse to instantaneous recovery, much of the value proposition is lost.

Fortunately, thanks to ecosystem-wide adherence to open standards, it’s now possible to build a truly fail-safe real-time streaming infrastructure. The key is a real-time streaming platform that neutralizes these failure points not only by switching to alternative resources under the control of a given cloud service provider but also by automatically enabling instant recourse to other cloud operators’ resources.

Of course, there are other aspects to achieving fail-safe performance that a real-time streaming platform must support. This discussion will cover some of those bases as well.


Automating Redundancy

When it comes to the fundamental requirements of ironclad resiliency, there are several things to look for to determine whether a real-time streaming platform can tap resources as needed in cross-cloud configurations as well as within the footprint of a given cloud operator.

The starting point is the set of mechanisms that underpin creation of a real-time streaming infrastructure. The streaming platform anchoring that infrastructure must be able to run on any combination of standards-compliant public and private clouds with automated scaling of resources on an as-needed basis across core, midway and edge locations. One good example of this kind of infrastructure is Red5 Pro’s cross cloud distribution system.

Cross-cloud automated scaling is possible if the platform can translate the commands of its operations system (OS) into the API calls of whichever Infrastructure-as-a-Service (IaaS) cloud operators a distributor chooses to engage. This allows the OS to balance load across the entire infrastructure and to spin resource instances up or down in response to real-time variations in traffic volume, without manual intervention.
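
To make the translation idea concrete, here is a minimal Python sketch. It assumes a hypothetical `CloudController` adapter with one subclass per IaaS provider; the class names, the `reconcile` function and the viewers-per-node figure are all illustrative and are not taken from any real platform's API.

```python
import math
from abc import ABC, abstractmethod


class CloudController(ABC):
    """Hypothetical adapter: one subclass per IaaS provider's API."""

    @abstractmethod
    def launch(self, count):
        """Ask the provider to start `count` streaming instances."""

    @abstractmethod
    def terminate(self, count):
        """Ask the provider to stop `count` streaming instances."""


class RecordingController(CloudController):
    """Stand-in that records calls instead of contacting a real cloud."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    def launch(self, count):
        self.calls.append(("launch", count))

    def terminate(self, count):
        self.calls.append(("terminate", count))


def desired_nodes(viewers, per_node, minimum=1):
    """Nodes needed for the current audience, never below `minimum`."""
    return max(minimum, math.ceil(viewers / per_node))


def reconcile(controller, running, viewers, per_node):
    """Issue the provider-neutral command that closes the capacity gap."""
    target = desired_nodes(viewers, per_node)
    if target > running:
        controller.launch(target - running)
    elif target < running:
        controller.terminate(running - target)
    return target
```

The scaling logic only ever talks to the abstract interface, so supporting another cloud in this sketch means writing one more subclass rather than touching `reconcile`.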

With these attributes in place to support automatic scaling, the real-time streaming platform is well suited to support the automated redundancy that’s essential to fail-safe operations. With persistent performance monitoring of all engaged resources, the platform can instantaneously shift processing from a malfunctioning component within a node to another appliance in that node, or, in the event of the entire node going offline, move the processing to another node with no disruption to the flow or increase in latency.
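
The two-level failover described here (another appliance in the same node first, then another node) can be sketched as follows. This is an illustration under assumed names: `Instance`, `Node` and `pick_target` are hypothetical, and the health flags stand in for whatever the platform's monitoring actually reports.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class Node:
    name: str
    instances: list = field(default_factory=list)

    def first_healthy(self) -> Optional[Instance]:
        """Return the first instance in this node that is still healthy."""
        return next((i for i in self.instances if i.healthy), None)


def pick_target(nodes, preferred):
    """Choose where a stream should be processed.

    The preferred node is tried first, so a failed appliance is replaced
    by a sibling in the same node; other nodes are used only when the
    preferred node has no healthy instance left.
    """
    # False sorts before True, so the preferred node comes first and the
    # remaining nodes keep their original order (sorted() is stable).
    ordered = sorted(nodes, key=lambda n: n.name != preferred)
    for node in ordered:
        instance = node.first_healthy()
        if instance is not None:
            return node.name, instance.name
    return None  # nothing healthy anywhere in the infrastructure
```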


Maximizing Cross-Cloud Options

Cloud resources available for failover redundancy should include all nodes that can be accessed for use with the real-time streaming infrastructure, including any that have been contracted for backup purposes from alternative IaaS providers. These backup arrangements might be with cloud providers whose APIs have been pre-integrated for interaction with the streaming platform’s OS.

But there will be instances when a distributor wants to leverage other cloud resources, possibly to cut costs or because there’s a dearth of primary or backup resources from IaaS operators that have been pre-integrated with the platform. Indeed, the ability to expand cloud options to the broadest possible extent is the ultimate guarantor of fail-safe performance.

This requires a real-time streaming platform that can spin up new resources on the fly across multiple clouds, regardless of whether a given IaaS provider uses container-based or hypervisor-based virtualization. There’s a way to do this through interaction with the Terraform open-source multi-cloud toolset provided by HashiCorp, which opens a global reservoir of over 200 cloud resources for use in fail-safe redundancy as well as all other aspects of real-time streaming operations.

Terraform makes this possible by describing IaaS resources in a high-level configuration syntax, with each cloud operator’s APIs abstracted behind a Terraform provider specific to that operator. By leveraging those providers, a real-time streaming platform can manage any combination of contractually available Terraform-compatible IaaS resources as holistically integrated components of the live streaming infrastructure.

In the case of container-based cloud instances, the real-time streaming platform’s Terraform controller interfaces with the Terraform Kubernetes Provider, which performs the abstractions necessary for conformance with the containerized environment. If the IaaS provider relies on virtual machines running on hypervisors, the real-time streaming platform interfaces with the Terraform OpenStack Provider.
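
As a rough illustration of that choice, the Python sketch below picks a provider by virtualization type and emits a `required_providers` block in Terraform's JSON configuration syntax. The mapping keys (`container`, `hypervisor`) and the function names are assumptions made for this example; the source addresses are the public registry names of the two providers, and a real configuration would also need provider settings and resources that are omitted here.

```python
import json

# Virtualization type -> (local provider name, registry source address).
PROVIDER_SOURCES = {
    "container": ("kubernetes", "hashicorp/kubernetes"),
    "hypervisor": ("openstack", "terraform-provider-openstack/openstack"),
}


def required_providers(virtualization):
    """Build the `terraform` block that declares the matching provider."""
    if virtualization not in PROVIDER_SOURCES:
        raise ValueError(f"unsupported virtualization type: {virtualization!r}")
    local_name, source = PROVIDER_SOURCES[virtualization]
    return {"terraform": {"required_providers": {local_name: {"source": source}}}}


def to_tf_json(config):
    """Serialize a configuration dict for writing to a .tf.json file."""
    return json.dumps(config, indent=2, sort_keys=True)
```

Calling `to_tf_json(required_providers("container"))` yields text that could be saved as, say, `providers.tf.json` before running `terraform init`.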


Minimizing the Impact of Packet Losses in Real-Time Streaming

Beyond the redundancy essential to meeting the most rigorous SLA stipulations, there are some other capabilities distributors should look for when seeking fail-safe performance from a real-time streaming platform. For example, when the platform uses WebRTC with RTP (Real-Time Transport Protocol) as the transport layer for real-time streaming, it relies on UDP (User Datagram Protocol) rather than TCP (Transmission Control Protocol). UDP does not retransmit lost packets, so there needs to be a separate means of compensating for dropped packets, a job that TCP was designed to handle on its own.

RTP, by avoiding the buffering and retransmission modes employed with TCP, enables the sub-half-second latencies essential to real-time streaming while providing timing information with packet sequence numbering that allows receivers to reconstruct the original sequence. WebRTC can take advantage of this information to compensate for UDP packet losses through Packet Loss Concealment (PLC), Forward Error Correction (FEC), negative acknowledgment (NACK) messaging, or some combination of these.
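
A short Python sketch shows how sequence numbers let a receiver restore packet order and spot gaps. RTP sequence numbers are 16-bit values that wrap around, so the comparison is done modulo 65536; the function names and the `(seq, payload)` tuple layout are illustrative choices, not part of any WebRTC API.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16 bits wide and wrap around


def seq_distance(a, b):
    """Forward distance from sequence number a to b, allowing for wraparound."""
    return (b - a) % SEQ_MOD


def reorder(packets, first_seq):
    """Sort (seq, payload) tuples into send order, starting at first_seq."""
    return sorted(packets, key=lambda p: seq_distance(first_seq, p[0]))


def missing_seqs(received, first_seq, last_seq):
    """List the sequence numbers between first_seq and last_seq never seen."""
    have = set(received)
    count = seq_distance(first_seq, last_seq) + 1
    expected = ((first_seq + i) % SEQ_MOD for i in range(count))
    return [seq for seq in expected if seq not in have]
```

The list returned by `missing_seqs` is the kind of information a receiver would need before it could ask for a retransmission or trigger concealment.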

NACK has the advantage of replacing the most essential dropped packets through retransmission while ignoring others, whereas PLC creates replacements for all dropped packets based on processing that estimates what should have been in the missing packet. A well-designed implementation of NACK only activates the process when the network is experiencing high rates of packet loss.

In such instances, the streaming platform caches each second’s worth of packets as the server transmits them to a receiver. That gives the server time to receive a retransmission request for any dropped packets that the client, via processes supported through the browser, determines to be essential. This maximizes quality of the streamed content with minimum use of processing power while limiting the attendant minor increases in latency to the most troublesome instances.
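
The server-side cache can be sketched in a few lines of Python. This is a simplified model, not Red5 Pro's implementation: the `RetransmitCache` class, its method names and the explicit `now` timestamps are assumptions chosen to keep the example self-contained and deterministic.

```python
from collections import OrderedDict


class RetransmitCache:
    """Keep recently sent packets so NACKed ones can be resent."""

    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self._packets = OrderedDict()  # seq -> (sent_at, payload), oldest first

    def store(self, seq, payload, now):
        """Remember a packet at send time and drop anything too old."""
        self._packets.pop(seq, None)  # re-insert so ordering follows send time
        self._packets[seq] = (now, payload)
        self._evict(now)

    def _evict(self, now):
        """Discard packets sent more than `window` seconds before `now`."""
        while self._packets:
            _, (sent_at, _) = next(iter(self._packets.items()))
            if now - sent_at <= self.window:
                break
            self._packets.popitem(last=False)

    def handle_nack(self, seqs, now):
        """Return (seq, payload) pairs for requested packets still cached."""
        self._evict(now)
        return [(s, self._packets[s][1]) for s in seqs if s in self._packets]
```

A request for a packet that has already aged out of the window simply returns nothing, which is what bounds the extra latency retransmission can introduce.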


Load Testing

Another important element of the real-time streaming process is load testing, which allows the platform to automatically assess how many concurrent connections a streaming system’s architecture can support. This helps ensure that resources are allocated precisely throughout the lifespan of the real-time streaming infrastructure.

A powerful approach to load testing in real-time streaming scenarios has been adapted for use with WebRTC, RTSP (Real-Time Streaming Protocol) and RTMP (Real-Time Messaging Protocol) from the HTTP-based “Bees with Machine Guns” system originally developed by the Chicago Tribune for load testing its website operations. These Bees tools leverage AWS to spin up Elastic Compute Cloud (EC2) instances to bombard application servers much like a denial-of-service (DoS) attack does. The volume of attacks is controlled by the number of “subscribing” pseudo clients that request video to be streamed from relevant servers at any one time. Red5 Pro took this core load testing suite and added support for the above streaming protocols; the resulting tools are free to use in your own projects.
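
The ramp-up logic behind such a test can be outlined in Python. This sketch does not use the Bees tools or AWS: `probe` is a hypothetical callback that would launch the given number of pseudo subscribers and report whether the server still met its quality target, and the default step sizes are arbitrary.

```python
def find_capacity(probe, start=100, step=100, limit=10_000):
    """Raise the subscriber count in steps until the probe fails.

    `probe(n)` should return True when the system under test handled n
    concurrent subscribers acceptably. The return value is the highest
    level that passed, or 0 if even the first level failed.
    """
    last_ok = 0
    n = start
    while n <= limit:
        if not probe(n):
            break
        last_ok = n
        n += step
    return last_ok
```

For instance, with a stand-in probe such as `lambda n: n <= 750`, the function reports 700 as the last level that passed.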


Fallback to ABR Streaming

Another contribution to fail-safe live streaming performance involves a fallback mechanism that instantaneously shifts streaming to a conventional CDN in any situation where there is no immediate alternative within the instantiated real-time streaming infrastructure for reaching users impacted by a node failure. This, of course, will impose higher latency on the viewing experiences of affected users, but it prevents a complete disruption of service.
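
In outline, the fallback decision looks like the Python sketch below. The function name, the dictionary shape and the example URL are assumptions for illustration; a real platform would make this decision from live monitoring data rather than a static list.

```python
from typing import Optional, Sequence


def select_delivery(healthy_edges: Sequence[str], hls_url: Optional[str]) -> dict:
    """Prefer a real-time edge node; fall back to CDN delivery otherwise."""
    if healthy_edges:
        return {"protocol": "webrtc", "endpoint": healthy_edges[0]}
    if hls_url:
        # Higher latency than the real-time path, but service continues.
        return {"protocol": "hls", "endpoint": hls_url}
    raise RuntimeError("no delivery path available")
```

With no healthy edges, `select_delivery([], "https://cdn.example.com/live/stream.m3u8")` returns the HLS endpoint instead of failing outright.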


Delivering on the Fail-Safe Performance Mandate

All of the capabilities essential to building a fail-safe real-time streaming infrastructure are baked into the Red5 Pro real-time streaming platform.

Fully automated, end-to-end redundancy is coordinated through the Red5 Pro Stream Manager, which manages all node clusters across all cloud participants in the real-time streaming infrastructure through controllers designed to work with each IaaS provider’s APIs. Instantiation of cross-cloud redundancy is facilitated through pre-integrations with AWS, Microsoft Azure and Google Cloud Platform and interactions with the Terraform open-source multi-cloud toolset via Red5 Pro’s Terraform Cloud Controller. In addition, the Red5 Pro Stream Manager can be manually integrated to work with the APIs of any cloud provider that isn’t integrated with Terraform.

As to the other facets of performance assurance:

  • Red5 Pro employs NACK as described above.
  • The adaptations of the Bees tools for real-time streaming were performed by Red5 Pro.
  • And, in the case of fallback to conventional TCP streaming as a last resort in the event of edge node failure, Red5 Pro’s platform reverts to adaptive bitrate formatting for streaming over HLS.

The net result is a real-time streaming platform that not only supports transmission of live content at 200-400 ms latency to any number of users anywhere in the world but also does so with the fail-safe persistence that is essential to the success of anyone whose business depends on delivering live content in real time.

To learn more about the fail-safe performance mechanisms used with Red5 Pro, contact info@red5.net or schedule a call.