
Troubleshooting Red5 Pro Autoscaling


The following describes log settings that can be modified on your instances to troubleshoot different problems.

NOTE: If you change <configuration> at the top of the red5pro/conf/logback.xml file to <configuration scan="true" scanPeriod="60 seconds">, you can edit logging levels without restarting the server. Keep in mind that debug logging adds overhead to your servers.
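
The change described in the note can also be scripted. The following is a minimal sketch, assuming the file still contains the bare <configuration> opening tag; the function name enable_logback_scan is hypothetical, and the file path is passed in as an argument.

```shell
#!/bin/bash
# Hypothetical helper: switch the bare <configuration> tag to the scanning variant.
# -i.bak edits the file in place and keeps a backup copy with a .bak suffix.
enable_logback_scan() {
    local file="$1"
    sed -i.bak 's|<configuration>|<configuration scan="true" scanPeriod="60 seconds">|' "$file"
}
```

For example: enable_logback_scan /usr/local/red5pro/conf/logback.xml (adjust the path to your installation).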

Stream Manager

The following loggers can be modified or added to your red5pro/conf/logback.xml file to troubleshoot autoscaling issues from the Stream Manager side:

  • <logger name="com.red5pro.services.streammanager" - logging for all stream manager operations, including broadcast/subscribe requests, API calls, scale out/in operations, and websocket proxy. Set level="INFO" first, and switch to level="DEBUG" only if INFO doesn't return the information you are looking for.
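
Stepping a logger from INFO to DEBUG can be done by hand or with a small script. The sketch below assumes the logger entry already exists in logback.xml as a single tag of the form <logger name="..." level="..."/>; the function name set_logger_level is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: change the level of an existing logger entry in logback.xml.
# Arguments: path to logback.xml, logger name, new level (for example INFO or DEBUG).
set_logger_level() {
    local file="$1" name="$2" level="$3"
    # Escape the dots so the logger name is matched literally, not as a regex wildcard.
    local pattern="${name//./\\.}"
    # Rewrite only the level attribute; a backup is kept with a .bak suffix.
    sed -E -i.bak "s#(<logger name=\"${pattern}\" level=\")[A-Za-z]+(\")#\1${level}\2#" "$file"
}
```

For example: set_logger_level red5pro/conf/logback.xml com.red5pro.services.streammanager INFO, then repeat with DEBUG if more detail is needed. With scanning enabled, the new level is picked up without a restart.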

Troubleshooting specific cloud platforms:

  • AWS cloud controller: <logger name="com.red5pro.services.cloud.aws.component.AWSInstanceController"
  • AWS cloud API: <logger name="com.amazonaws"
  • Google Cloud controller: <logger name="com.red5pro.services.cloud.google.component.ComputeInstanceController"
  • Azure cloud controller: <logger name="com.red5pro.services.cloud.microsoft.component.AzureComputeController"
  • Simulated cloud controller: <logger name="com.red5pro.services.simulatedcloud.generic.component.GenericSimulatedCloudController"

Terraform

Terraform is only involved in the deployment and removal of nodes. Terraform service logging is written to the /usr/local/red5service/red5.log file (or wherever your Terraform service is installed) and should include useful information about any problems that Terraform encounters while trying to deploy (or terminate) an instance (e.g., if you created a disk image for a lower instance type than you specified in your launch configuration policy).

A single Terraform server can only perform one action at a time, so it is important to make sure that one action has completed before initiating a second. For this reason, when replacing a nodegroup it is best to:

  1. Create the new nodegroup
  2. Check the nodegroup nodes' statuses, and wait until they all come back as inservice
  3. Delete the original nodegroup
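
Step 2 can be automated by polling the node list until every node is in service. The following is a rough sketch, not a definitive implementation: it assumes that each entry in the node list response carries a "state" field whose value is "inservice" once the node is ready, and it requires curl and jq. The function names are hypothetical.

```shell
#!/bin/bash
# Succeeds only when the node list JSON on stdin is a non-empty array and
# every node reports state "inservice" (the "state" field name is an assumption).
all_inservice() {
    jq -e 'length > 0 and all(.[]; .state == "inservice")' > /dev/null
}

# Poll the given node list URL every 10 seconds until all nodes are in service.
wait_for_nodegroup() {
    local url="$1"
    until curl --silent "$url" | all_inservice
    do
        sleep 10
    done
}
```

For example: wait_for_nodegroup "https://<your-streammanager-url>/streammanager/api/4.0/admin/nodegroup/<new-nodegroup-id>/node?accessToken=<streammanager-api-token>", and delete the original nodegroup only after the function returns.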

Logging on Nodes

During your development phase, it is recommended that you set the conf/logback.xml file to use <configuration scan="true" scanPeriod="60 seconds">. This allows you to modify logging levels on individual nodes without having to create a new disk image.

To troubleshoot node-to-streammanager or intra-node communication, modify or add the following logging entries:

<logger name="com.red5pro.cluster.plugin" level="DEBUG"/>
<logger name="com.red5pro.cluster.plugin.ClusterPlugin" level="DEBUG"/>
<logger name="com.red5pro.clustering.autoscale" level="DEBUG"/>

For troubleshooting transcoding and ABR subscribing, modify the following entries as well:

for WebRTC ABR:

<logger name="com.red5pro.webrtc.stream.FlashToRTCTransformerStream" level="DEBUG"/>
<logger name="com.red5pro.webrtc.stream.RTCBroadcastStream" level="DEBUG"/>

and for RTSP ABR:

<logger name="com.red5pro.rtsp.RTSPMinaConnection" level="DEBUG"/>

Other Tips

Nodegroup Log Collection

Copy the following into a file, then make that file executable.

#!/bin/bash
# Collect the Red5 Pro logs from every node in a nodegroup.
SM_DOMAIN='<your-streammanager-url>'
API_VERSION='4.0'
NODE_GROUP='<nodegroup-id>'
API_PASS='<streammanager-api-token>'
PATH_TO_SSH_KEY='<full/path/to.ssh-key>'
SSH_USER='<username-for-ssh-into-nodes>'

# Print a timestamp prefix without a trailing newline.
log() {
    echo -n "[$(date '+%Y-%m-%d %H:%M:%S')]"
}
# Print a timestamped INFO message.
log_i() {
    log
    echo "[INFO] $*"
}

pids=()
current_time=$(date '+%m%d_%H%M%S')
log_dir="./logs_${current_time}"
log_i "Create log folder ${log_dir}"
mkdir -p "${log_dir}"

# Ask the stream manager for the nodes in the nodegroup.
result=$(curl --silent "https://${SM_DOMAIN}/streammanager/api/${API_VERSION}/admin/nodegroup/${NODE_GROUP}/node?accessToken=${API_PASS}")

# Read one "role address" pair per node and download that node's logs in the background.
while read -r role address
do
    [ -n "$address" ] || continue
    log_i "Start download logs from $role with IP: $address"
    mkdir -p "${log_dir}/${role}_${address}"
    scp -o StrictHostKeyChecking=no -C -r -i "$PATH_TO_SSH_KEY" "$SSH_USER@$address:/usr/local/red5pro/log/*" "${log_dir}/${role}_${address}/" &
    pids+=($!)
    sleep 0.2
done < <(echo "$result" | jq -r '.[] | [.role, .address] | join(" ")')

# Wait for every background download to finish.
for pid in "${pids[@]}"
do
    wait "$pid"
done

This bash shell script can be run from a terminal session and will copy the logs from all of the nodes in whichever nodegroup you specify. You will need to modify the following values to run the script:

  • SM_DOMAIN - the domain URL of the stream manager
  • API_VERSION - stream manager API version (currently default is 4.0)
  • NODE_GROUP - the id of the nodegroup from which to pull the logs; to get the active nodegroups run the list nodegroups API call
  • API_PASS - stream manager access token
  • PATH_TO_SSH_KEY - full path to the SSH key used to access the nodes
  • SSH_USER - the username for SSH access to the nodes; in general this is root for DigitalOcean and ubuntu for AWS, Azure and GCP

System requirements: the script uses jq to parse the API response. Install it if you don't have it already (for example, brew install jq with Homebrew).
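
Before running the script, it can help to confirm that the tools it calls are installed. The following is a small sketch; the function name missing_tools is hypothetical, and the tool list (curl, jq, scp) is taken from the commands the script above invokes.

```shell
#!/bin/bash
# Hypothetical helper: print each named tool that is not found on the PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    local missing=0 tool
    for tool in "$@"
    do
        if ! command -v "$tool" > /dev/null 2>&1
        then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example: missing_tools curl jq scp prints nothing and succeeds when all three are available.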

Rolling Logs

It is strongly recommended that your servers are configured to use rolling logs so you don't run the risk of filling up a server with huge log files.
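
One quick way to check whether a server is already set up for rolling logs is to look for logback's rolling appender class in its configuration. This is only a sketch: it assumes rolling is configured through the standard ch.qos.logback.core.rolling.RollingFileAppender class, and the function name uses_rolling_logs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given logback.xml mentions
# logback's standard rolling file appender class.
uses_rolling_logs() {
    grep -F -q 'ch.qos.logback.core.rolling.RollingFileAppender' "$1"
}
```

For example: uses_rolling_logs /usr/local/red5pro/conf/logback.xml || echo "rolling logs are not configured" (adjust the path to your installation).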

Retrieving logs from nodes removed from nodegroups

The instancecontroller.deleteDeadGroupNodesOnCleanUp setting in stream manager/WEB-INF/red5-web.properties is set to true by default. If you set it to false, the stream manager should stop your VMs but not terminate them (this is not supported by the Terraform cloud controller). This allows you to retrieve the logs from nodes that have been removed from a nodegroup, with the caveat that you need to set the logback append setting on the node images to true. It is false by default, which means that the logs are overwritten when the instance is started.

<appender class="ch.qos.logback.core.FileAppender" name="FILE">
    <file>log/red5.log</file>
    <append>true</append>
    <encoder>
        <pattern>%d{ISO8601} [%thread] %-5level %logger{35} - %msg%n</pattern>
    </encoder>
</appender>

Instance Quotas

Most cloud platforms have regional and/or global quotas for VMs (especially with new accounts). Make sure that your account has sufficient allowance in the region(s) where you plan to deploy your autoscaling solution to allow for growth.

Cluster Communication

If you have origins and edges across multiple regions, you may want to increase the edge reporting frequency to improve communication. On the edge/origin disk image, modify the {red5pro}/conf/cluster.xml file, changing proxyPingInterval from the default 10000 (10 seconds) to as low as 4000 (4 seconds): <property name="proxyPingInterval" value="4000" />

Stream Manager Red5 Pro Service Failure

If the red5pro jsvc service is failing to start on your stream manager, try verifying that your stream manager has access to the database.