About the Real-Time Communication Interface

The Web Real-Time Communication (WebRTC) standard supports sending real-time video, audio, and arbitrary data between two peers. Amazon supports WebRTC to enable real-time streaming of audio, video, and (optionally) arbitrary data between Alexa and your Smart Home device. Alexa sends commands to your device over the Real-Time Communication (RTC) data channel, and your device responds and reports state back over the same channel. To enable audio and video, you implement the Alexa.RTCSessionController interface in your Alexa skill. To add data channel support for cameras that support pan, tilt, and zoom, you also implement the Alexa.RangeController interface.

Users can communicate remotely with your devices by using a Fire TV or any Echo device, such as an Echo Dot, Echo Plus, Echo Show, or Echo Spot. Users can also view live feeds from a camera in the Alexa app.

Alexa.RTCSessionController signaling

The following sequence diagram shows the WebRTC signaling protocol between Alexa and your Smart Home skill.

Diagram showing order of directives and events for RTCSessionController communication using InitiateSessionWithOffer, AnswerGenerated, SessionConnected, and SessionDisconnected.

Session Description Protocol Offer/Answer format

The RTCSessionController interface uses the Session Description Protocol (SDP) to negotiate session capabilities between peers.

Offer/answer exchange example

Each media track has a set of Interactive Connectivity Establishment (ICE) candidates. The example shows only candidates of type host. If your devices aren't routed through a public gateway, also include either server-reflexive candidates (gathered by using STUN) or relay candidates (gathered by using TURN).

v=0
o=- 3747690900 3747690900 IN IP4 0.0.0.0
s=a 2 z
c=IN IP4 0.0.0.0
t=0 0
a=group:BUNDLE audio0 video0
m=audio 1 RTP/SAVPF 96 0
a=candidate:1 1 UDP 2013266430 xxx.xxx.xxx.xxx 8620 typ host
a=candidate:2 1 TCP 1010827775 xxx.xxx.xxx.xxx 45351 typ host tcptype passive
a=candidate:3 2 UDP 2013266429 xxx.xxx.xxx.xxx 50066 typ host
a=candidate:4 2 TCP 1010827774 xxx.xxx.xxx.xxx 65157 typ host tcptype passive
a=candidate:5 2 TCP 1015022078 xxx.xxx.xxx.xxx 9 typ host tcptype active
a=candidate:6 1 TCP 1015022079 xxx.xxx.xxx.xxx 9 typ host tcptype active
a=setup:actpass
a=extmap:3 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=rtpmap:96 opus/48000/2
a=rtcp:9 IN IP4 0.0.0.0
a=rtcp-mux
a=sendrecv
a=mid:audio0
a=ssrc:118039096 cname:user2571875795@host-433aaf59
a=ice-ufrag:AGVf
a=ice-pwd:h3JAYGhIaQ/Nvyaz9dLoz9
a=fingerprint:sha-256 34:D4:54:17:0C:95:2A:79:FF:72:10:21:E9:6E:F3:77:86:2F:8D:6C:33:45:BA:14:1D:43:01:D7:CD:0A:1A:84
m=video 1 RTP/SAVPF 99
b=AS:500
a=candidate:4 1 UDP 2013266430 xxx.xxx.xxx.xxx 8620 typ host
a=candidate:5 1 TCP 1015022079 xxx.xxx.xxx.xxx 9 typ host tcptype active
a=candidate:4 2 UDP 2013266429 xxx.xxx.xxx.xxx 50066 typ host
a=candidate:6 1 TCP 1010827775 xxx.xxx.xxx.xxx 45351 typ host tcptype passive
a=candidate:5 2 TCP 1015022078 xxx.xxx.xxx.xxx 9 typ host tcptype active
a=candidate:6 2 TCP 1010827774 xxx.xxx.xxx.xxx 65157 typ host tcptype passive
a=setup:actpass
a=extmap:3 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=rtpmap:99 H264/90000
a=rtcp:9 IN IP4 0.0.0.0
a=rtcp-mux
a=sendrecv
a=mid:video0
a=rtcp-fb:99 nack
a=rtcp-fb:99 nack pli
a=rtcp-fb:99 ccm fir
a=ssrc:3643559644 cname:user2571875795@host-433aaf59
a=ice-ufrag:AGVf
a=ice-pwd:h3JAYGhIaQ/Nvyaz9dLoz9
a=fingerprint:sha-256 34:D4:54:17:0C:95:2A:79:FF:72:10:21:E9:6E:F3:77:86:2F:8D:6C:33:45:BA:14:1D:43:01:D7:CD:0A:1A:84
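To see how the candidate lines in this example break down, the following Python sketch parses a=candidate attributes into their fields (foundation, component, transport, priority, address, port, and candidate type), following the ICE candidate attribute grammar in RFC 8839. The IceCandidate class and both functions are hypothetical helpers for illustration, not part of any Alexa or WebRTC API. The sketch extracts only the tcptype extension and skips the others, such as raddr and rport.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IceCandidate:
    foundation: str
    component: int          # 1 = RTP, 2 = RTCP
    transport: str          # "UDP" or "TCP"
    priority: int
    address: str
    port: int
    cand_type: str          # "host", "srflx", "prflx", or "relay"
    tcptype: Optional[str] = None  # "active", "passive", or "so" (TCP only)


def parse_candidate(line: str) -> IceCandidate:
    """Parse a single 'a=candidate:' attribute line from an SDP."""
    prefix = "a=candidate:"
    if not line.startswith(prefix):
        raise ValueError(f"not a candidate line: {line!r}")
    fields = line[len(prefix):].split()
    if len(fields) < 8 or fields[6] != "typ":
        raise ValueError(f"malformed candidate line: {line!r}")
    candidate = IceCandidate(
        foundation=fields[0],
        component=int(fields[1]),
        transport=fields[2].upper(),
        priority=int(fields[3]),
        address=fields[4],
        port=int(fields[5]),
        cand_type=fields[7],
    )
    # Anything after the type is a sequence of name/value pairs,
    # for example "raddr", "rport", or "tcptype".
    extras = fields[8:]
    for name, value in zip(extras[0::2], extras[1::2]):
        if name == "tcptype":
            candidate.tcptype = value
    return candidate


def candidates_in(sdp: str) -> List[IceCandidate]:
    """Return every ICE candidate in an SDP, in document order."""
    return [parse_candidate(line.strip()) for line in sdp.splitlines()
            if line.strip().startswith("a=candidate:")]
```

For example, you could run candidates_in on your SDP answer to count its candidates before sending it. Fewer candidates make the six-second answer deadline easier to meet.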

Supported communication types

The RTCSessionController interface supports half-duplex or full-duplex audio communication. For an audio-only scenario, such as an Echo Plus connecting to a front door intercom, audio communication must be two-way. For an audio and video scenario, such as an Echo Show connecting to a front door camera, Alexa supports one-way video and either one-way or two-way audio.

  • Half-duplex communication allows users to communicate in two directions, but not simultaneously. For example:
    • A walkie-talkie
    • A push-to-talk door intercom
      If your device doesn't have acoustic echo cancellation support, choose half duplex.
  • Full-duplex communication allows users to communicate in two directions simultaneously. For example:
    • A telephone
    • A telephone door intercom
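If your device has no acoustic echo cancellation, the guidance above is to choose half duplex. You signal that choice with the isFullDuplexAudioSupported flag in the discovery response, which the service level requirements in this section describe. The following Python sketch derives that flag from a hypothetical has_echo_cancellation setting. It assumes the flag sits in the capability's configuration object; verify the placement against the Alexa.RTCSessionController reference before relying on it.

```python
def rtc_session_capability(has_echo_cancellation: bool) -> dict:
    """Build a discovery capability entry for Alexa.RTCSessionController.

    Full duplex is declared only when the device has acoustic echo
    cancellation; otherwise the device falls back to half duplex
    (push to talk).
    """
    return {
        "type": "AlexaInterface",
        "interface": "Alexa.RTCSessionController",
        "version": "3",
        # Assumption: isFullDuplexAudioSupported lives in "configuration".
        "configuration": {
            "isFullDuplexAudioSupported": has_echo_cancellation,
        },
    }
```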

Supported resolutions

The supported resolutions are 480p to 1080p.

Prerequisites and service level requirements

Low latency is critical to an optimal user experience. To use the RTCSessionController interface, you must meet the following requirements:

  • Alexa requires your device to support live streaming for at least one minute.

  • When your skill receives an offer, you must respond with an SDP answer within six seconds.

  • Your device or platform must be WebRTC-compliant, or support the suite of protocols used by WebRTC, including all of the following resiliency mechanisms:
    • Negative acknowledgment (NACK)
    • Picture loss indication (PLI)
    • Full intra-request (FIR)
    • Receiver estimated maximum bitrate (REMB)
  • For resource considerations, you must support bundling and rtcp-mux. Bundling sends audio and video over the same connection, and rtcp-mux sends RTP and RTCP over the same port. Both reduce the number of open sockets.

  • To support full duplex communication, your device must employ effective algorithms for acoustic echo cancellation (AEC) and noise suppression.

  • To support half duplex communication, you can use the Push to Talk feature in the typical live view scenario. Declare isFullDuplexAudioSupported as false in the discovery response.

  • To support video, you must use one of the following video codecs:
    • H264 (up to High profile, level 4.1)
  • To support audio, you must use one of the following audio codecs:
    • Opus (preferred codec)
    • PCMU/PCMA (G.711)
    • AAC-LC, HE-AAC
  • For ICE candidates, you can use either UDP or TCP, but you must use IPv4. Trickle ICE isn't supported, so you must gather all ICE candidates up front and send them in the SDP answer.
  • Minimize the number of ICE candidates to allow your device to respond within six seconds.
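Several of these requirements can be checked mechanically before your device sends its answer. The following Python sketch, built around a hypothetical check_answer helper, flags the following problems:

  • A missing BUNDLE group
  • A media section without rtcp-mux or without candidates
  • A non-IPv4 candidate address
  • An a=ice-options:trickle line

The sketch doesn't check codecs, resiliency mechanisms, or timing. It also treats the SDP as plain text instead of using a full SDP parser.

```python
import ipaddress
from typing import List


def check_answer(sdp: str) -> List[str]:
    """Return a list of problems found in an SDP answer (empty if none)."""
    problems: List[str] = []
    lines = [l.strip() for l in sdp.splitlines() if l.strip()]
    if not any(l.startswith("a=group:BUNDLE") for l in lines):
        problems.append("missing a=group:BUNDLE")
    if any(l.startswith("a=ice-options:") and "trickle" in l for l in lines):
        problems.append("answer advertises trickle ICE")

    # Split the SDP into media sections; each one starts with an m= line.
    sections: List[List[str]] = []
    for l in lines:
        if l.startswith("m="):
            sections.append([l])
        elif sections:
            sections[-1].append(l)

    for sec in sections:
        kind = sec[0][2:].split()[0]  # "audio" or "video"
        if "a=rtcp-mux" not in sec:
            problems.append(f"{kind}: missing a=rtcp-mux")
        candidates = [l for l in sec if l.startswith("a=candidate:")]
        if not candidates:
            problems.append(f"{kind}: no ICE candidates")
        for c in candidates:
            address = c.split()[4]  # connection-address field
            try:
                ipaddress.IPv4Address(address)
            except ValueError:
                problems.append(f"{kind}: non-IPv4 candidate {address}")
    return problems
```

The example SDP earlier in this section masks its addresses as xxx.xxx.xxx.xxx, so this check reports those placeholders as non-IPv4. Run the check on a real answer instead.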

Cameras that support pan, tilt, and zoom

To enable an Alexa user to use the pan, tilt, and zoom features, your device must implement the RTCSessionController and the RangeController interfaces. Alexa uses the Alexa.RangeController directives to request that your camera do the following:

  • Pan – Rotate the camera on the horizontal plane, left and right
  • Tilt – Rotate the camera on the vertical plane, up and down
  • Zoom – Change the view to see a smaller area with more detail (zoom in) or more area with less detail (zoom out).

Specify the camera range

A camera can implement all or any subset of pan, tilt, and zoom. To specify which properties the camera supports, add an Alexa.RangeController capability for each one and set its instance field to Camera.Pan, Camera.Tilt, or Camera.Zoom.

For each instance, specify the minimum and maximum range values that your camera supports. For pan and tilt, specify ranges as a percentage of the field of view (FOV) of your camera. The range represents the number of times the FOV fits in the total range of motion. For example, if your camera has a 90-degree horizontal FOV and can rotate 360 degrees in total, the range of motion is 360 / 90 = 4 FOVs, or 400%. You can define your total supported range as 0 to 400 or as -200 to 200. If you use -200 to 200, Alexa can use zero for straight ahead. For zoom, specify the range from 0 to the maximum zoom, in percent.
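The arithmetic above can be sketched in Python as follows. The supported_range function is a hypothetical helper that converts a FOV and a total range of motion into minimumValue and maximumValue for a pan or tilt instance. The range is either centered on zero or starts at zero. The sketch rounds to whole percentages, on the assumption that precision is 1, as in the discovery example in this section.

```python
from typing import Tuple


def supported_range(fov_degrees: float, total_degrees: float,
                    centered: bool = True) -> Tuple[int, int]:
    """Return (minimumValue, maximumValue) for a pan or tilt instance.

    The span is the number of FOVs that fit into the total range of
    motion, expressed as a percentage.
    """
    if fov_degrees <= 0 or total_degrees < 0:
        raise ValueError("FOV must be positive and total range non-negative")
    span = round(total_degrees / fov_degrees * 100)
    if centered:
        # Zero means straight ahead; split the span evenly on both sides.
        return -(span // 2), span - span // 2
    return 0, span
```

For the camera described above, with a 90-degree FOV and 360 degrees of rotation, supported_range(90, 360) returns (-200, 200), and supported_range(90, 360, centered=False) returns (0, 400).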

The following diagram shows the camera field of view before and after the user asks Alexa to pan to the right. Here, the camera has a 90-degree FOV and can rotate 360 degrees.

After the request, the camera rotates 90 degrees to the right.
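Range values are percentages of the FOV, so converting between a range value and a physical rotation takes a single multiplication. In the following Python sketch, a 90-degree rotation on a camera with a 90-degree FOV corresponds to a range value change of 100. The function names are hypothetical. The sketch assumes that positive pan values mean right, as the Far Left and Far Right presets in the discovery example in this section suggest.

```python
def range_to_degrees(range_value: float, fov_degrees: float) -> float:
    """Degrees of rotation for a pan or tilt range value (percent of FOV)."""
    return range_value / 100 * fov_degrees


def degrees_to_range(degrees: float, fov_degrees: float) -> float:
    """Inverse conversion: a rotation in degrees as a percentage of the FOV."""
    return degrees / fov_degrees * 100
```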

The following example shows the Alexa.RangeController interface for a camera that supports pan. Send these properties in the discovery response. For details about the properties, see Alexa.RangeController.

{
    "type": "AlexaInterface",
    "interface": "Alexa.RangeController",
    "version": "3",
    "instance": "Camera.Pan",
    "capabilityResources": {
        "friendlyNames": [{
                "@type": "text",
                "value": {
                    "text": "Camera Pan",
                    "locale": "en-US"
                }
            },
            {
                "@type": "text",
                "value": {
                    "text": "Camera Rotation",
                    "locale": "en-US"
                }
            },
            {
                "@type": "text",
                "value": {
                    "text": "Rotation",
                    "locale": "en-US"
                }
            }
        ]
    },
    "properties": {
        "supported": [{
            "name": "rangeValue"
        }],
        "retrievable": true,
        "proactivelyReported": true
    },
    "configuration": {
        "supportedRange": {
            "minimumValue": -200,
            "maximumValue": 200,
            "precision": 1
        },
        "presets": [{
                "rangeValue": -200,
                "presetResources": {
                    "friendlyNames": [{
                        "@type": "text",
                        "value": {
                            "text": "Far Left",
                            "locale": "en-US"
                        }
                    }]
                }
            },
            {
                "rangeValue": 0,
                "presetResources": {
                    "friendlyNames": [{
                        "@type": "text",
                        "value": {
                            "text": "Center",
                            "locale": "en-US"
                        }
                    }]
                }
            },
            {
                "rangeValue": 200,
                "presetResources": {
                    "friendlyNames": [{
                        "@type": "text",
                        "value": {
                            "text": "Far Right",
                            "locale": "en-US"
                        }
                    }]
                }
            }
        ]
    }
}
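Pan, tilt, and zoom are each a separate Alexa.RangeController instance, so you might generate these capability objects instead of writing each one by hand. The following Python sketch builds the same structure as the JSON example above for any instance. The range_capability helper is hypothetical, and omitting presets when none are given is an illustrative choice.

```python
from typing import Dict, List


def range_capability(instance: str, names: List[str], minimum: int, maximum: int,
                     presets: Dict[int, str], locale: str = "en-US") -> dict:
    """Build one Alexa.RangeController discovery capability."""
    def friendly_names(texts: List[str]) -> List[dict]:
        return [{"@type": "text", "value": {"text": t, "locale": locale}}
                for t in texts]

    configuration: dict = {
        "supportedRange": {"minimumValue": minimum,
                           "maximumValue": maximum,
                           "precision": 1},
    }
    if presets:
        # One preset per named range value, in ascending order.
        configuration["presets"] = [
            {"rangeValue": value,
             "presetResources": {"friendlyNames": friendly_names([label])}}
            for value, label in sorted(presets.items())
        ]
    return {
        "type": "AlexaInterface",
        "interface": "Alexa.RangeController",
        "version": "3",
        "instance": instance,
        "capabilityResources": {"friendlyNames": friendly_names(names)},
        "properties": {
            "supported": [{"name": "rangeValue"}],
            "retrievable": True,
            "proactivelyReported": True,
        },
        "configuration": configuration,
    }
```

For example, range_capability("Camera.Zoom", ["Zoom"], 0, 400, {}) produces a zoom instance without presets. The 0 to 400 zoom range is a made-up value.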

Respond to pan, tilt, and zoom directives

To request pan, tilt, and zoom, Alexa sends a SetRangeValue or AdjustRangeValue directive to your device over the RTC data channel. Respond immediately, without waiting for the movement to complete. After the motion completes, partially completes, or fails, send an asynchronous Alexa.ChangeReport event with the current position of the camera. Send the Alexa.ChangeReport event to Alexa over both the RTC data channel and the Alexa gateway.

  • If the camera can fulfill the requested motion, respond with an Alexa.Response and include the final position. Here, the final position is the requested position.
  • If the camera can partially fulfill the requested motion, respond with an Alexa.Response and include the final position. For example, if the request contains a range that's outside of the camera range, the final position is the furthest extent the camera can move.
  • If the camera knows it can't fulfill the request, respond with an Alexa.ErrorResponse.
  • After any completed full or partial position change, whether due to a request from Alexa or from an external source, send an Alexa.ChangeReport event with the current position of the camera.
  • If the requested motion fails following an Alexa.Response, send an Alexa.ChangeReport event with the current position of the camera.
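The response rules above can be summarized as a small decision function. The following Python sketch uses a hypothetical Axis type and plan_move helper to decide whether to send an Alexa.Response with the final position or an Alexa.ErrorResponse. It makes these simplifications:

  • It clamps an out-of-range target to the nearest limit, which counts as a partial fulfillment.
  • It treats an unknown instance as a request the camera can't fulfill.
  • It ignores rangeValueDeltaDefault.

Moving the motor and sending the later Alexa.ChangeReport event are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Axis:
    minimum: float   # supportedRange minimumValue
    maximum: float   # supportedRange maximumValue
    position: float  # current rangeValue


def plan_move(axes: Dict[str, Axis], directive: dict) -> Tuple[str, Optional[float]]:
    """Decide how to answer a SetRangeValue or AdjustRangeValue directive.

    Returns ("Response", final_position) if the camera can fully or partially
    fulfill the request, or ("ErrorResponse", None) if it can't.
    """
    header = directive["directive"]["header"]
    payload = directive["directive"]["payload"]
    axis = axes.get(header.get("instance", ""))
    if axis is None:
        # For example, Camera.Tilt on a camera that only pans.
        return "ErrorResponse", None
    if header["name"] == "SetRangeValue":
        target = payload["rangeValue"]
    elif header["name"] == "AdjustRangeValue":
        # rangeValueDeltaDefault is ignored in this sketch.
        target = axis.position + payload["rangeValueDelta"]
    else:
        return "ErrorResponse", None
    # Clamp to the supported range: an out-of-range request is partially
    # fulfilled by stopping at the nearest limit.
    final = max(axis.minimum, min(axis.maximum, target))
    return "Response", final
```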

Examples

The following examples show the request and response payloads for pan and tilt requests.

Pan to center request

{
  "directive": {
    "header": {
      "namespace": "Alexa.RangeController",
      "instance": "Camera.Pan",
      "name": "SetRangeValue",
      "messageId": "<message id>",
      "correlationToken": "<an opaque correlation token>",
      "payloadVersion": "3"
    },
    "endpoint": {
      "scope": {
        "type": "BearerToken",
        "token": "<an OAuth2 bearer token>"
      },
      "endpointId": "<endpoint id>",
      "cookie": {}
    },
    "payload": {
      "rangeValue": 0
    }
  }
}

Pan to center response

{
  "event": {
    "header": {
      "namespace": "Alexa",
      "name": "Response",
      "messageId": "<message id>",
      "correlationToken": "<an opaque correlation token>",
      "payloadVersion": "3"
    },
    "endpoint": {
      "scope": {
        "type": "BearerToken",
        "token": "<an OAuth2 bearer token>"
      },
      "endpointId": "<endpoint id>"
    },
    "payload": {}
  },
  "context": {
    "properties": [
      {
        "namespace": "Alexa.RangeController",
        "instance": "Camera.Pan",
        "name": "rangeValue",
        "value": 0,
        "timeOfSample": "2017-02-03T16:20:50.52Z",
        "uncertaintyInMilliseconds": 0
      }
    ]
  }
}
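The following Python sketch builds a response shaped like the example above from an incoming directive and the final position. The range_response helper is hypothetical. It copies the correlation token, scope, endpoint ID, and instance from the directive, and it stamps timeOfSample with the current UTC time in ISO 8601 format.

```python
import uuid
from datetime import datetime, timezone


def range_response(directive: dict, final_position: float) -> dict:
    """Build an Alexa.Response reporting the camera's final position."""
    d = directive["directive"]
    time_of_sample = (datetime.now(timezone.utc)
                      .isoformat(timespec="milliseconds")
                      .replace("+00:00", "Z"))
    return {
        "event": {
            "header": {
                "namespace": "Alexa",
                "name": "Response",
                "messageId": str(uuid.uuid4()),
                "correlationToken": d["header"]["correlationToken"],
                "payloadVersion": "3",
            },
            "endpoint": {
                "scope": d["endpoint"]["scope"],
                "endpointId": d["endpoint"]["endpointId"],
            },
            "payload": {},
        },
        "context": {
            "properties": [{
                "namespace": "Alexa.RangeController",
                "instance": d["header"]["instance"],
                "name": "rangeValue",
                "value": final_position,
                "timeOfSample": time_of_sample,
                "uncertaintyInMilliseconds": 0,
            }]
        },
    }
```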

Tilt down 20% request

{
  "directive": {
    "header": {
      "namespace": "Alexa.RangeController",
      "instance": "Camera.Tilt",
      "name": "AdjustRangeValue",
      "messageId": "<message id>",
      "correlationToken": "<an opaque correlation token>",
      "payloadVersion": "3"
    },
    "endpoint": {
      "scope": {
        "type": "BearerToken",
        "token": "<an OAuth2 bearer token>"
      },
      "endpointId": "<endpoint id>",
      "cookie": {}
    },
    "payload": {
      "rangeValueDelta": -20,
      "rangeValueDeltaDefault": false
    }
  }
}

Tilt not supported response

{
  "event": {
    "header": {
      "namespace": "Alexa",
      "name": "ErrorResponse",
      "messageId": "<message id>",
      "correlationToken": "<an opaque correlation token>",
      "payloadVersion": "3"
    },
    "endpoint": {
      "endpointId": "<endpoint id>"
    },
    "payload": {
      "type": "INVALID_VALUE",
      "message": "Camera doesn't support tilt."
    }
  }
}
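The examples don't include an Alexa.ChangeReport event, which you send after the camera finishes moving. The following Python sketch builds one with the camera's current position, using the hypothetical change_report helper. It assumes the standard Alexa ChangeReport layout, with changed properties under payload.change. It also assumes a cause type such as VOICE_INTERACTION for Alexa-initiated motion or PHYSICAL_INTERACTION for motion from another source. The sketch leaves context.properties empty, although a real report would list the endpoint's other, unchanged properties there. Sending the event over the RTC data channel and the Alexa gateway isn't shown.

```python
import uuid
from datetime import datetime, timezone


def change_report(endpoint_id: str, token: str, instance: str,
                  position: float, cause: str = "VOICE_INTERACTION") -> dict:
    """Build an Alexa.ChangeReport carrying the camera's current position."""
    now = (datetime.now(timezone.utc)
           .isoformat(timespec="milliseconds")
           .replace("+00:00", "Z"))
    changed = {
        "namespace": "Alexa.RangeController",
        "instance": instance,
        "name": "rangeValue",
        "value": position,
        "timeOfSample": now,
        "uncertaintyInMilliseconds": 0,
    }
    return {
        "event": {
            "header": {
                "namespace": "Alexa",
                "name": "ChangeReport",
                "messageId": str(uuid.uuid4()),
                "payloadVersion": "3",
            },
            "endpoint": {
                "scope": {"type": "BearerToken", "token": token},
                "endpointId": endpoint_id,
            },
            "payload": {"change": {"cause": {"type": cause},
                                   "properties": [changed]}},
        },
        # Unchanged properties would go here; left empty in this sketch.
        "context": {"properties": []},
    }
```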