Customers have now connected over 300 million devices to Alexa. With this rapid adoption, we have developed mechanisms and best practices to improve the availability of partner skills and to ensure that we deliver a reliable backend infrastructure that customers can depend on.
Alexa Smart Home (SH) developers are now required to attain 99.93% availability, which supports a positive and reliable SH experience for customers, while developers benefit from more customer engagement and brand protection. As providers of SH capabilities and services, both Amazon and developers share the responsibility for, and the value of, providing near real-time device status and control for our millions of customers’ Alexa-enabled devices across the world. As we expand our supported devices and capabilities, we must continue our obsession with improving the end-user experience. Amazon is dedicated to maintaining our current level of support and issue mitigation strategies while also expanding in areas of availability, efficiency, and operational excellence.
We have a responsibility to our shared customers to provide highly available services and well-executed operations strategies to mitigate any failures. Our goal is to ensure developers meet 99.93% availability across all devices, which equates to roughly 6 hours, 8 minutes of downtime per year (or ~31 minutes of downtime per month). This goal is essential to keeping Alexa skills available and enabling customers to use Alexa to control their ubiquitous devices at any time. For context, availability is defined as the total time available for use divided by the total time.
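To make the arithmetic behind the target concrete, here is a minimal Python sketch of the availability definition above and the downtime budget it implies. The function names are illustrative only, and the calculation assumes a 365-day year split into 12 equal months.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(target_pct: float, period_minutes: float) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    return (1 - target_pct / 100) * period_minutes


def availability_pct(uptime_minutes: float, total_minutes: float) -> float:
    """Availability = time available for use / total time, as a percentage."""
    return 100 * uptime_minutes / total_minutes


yearly = allowed_downtime_minutes(99.93, MINUTES_PER_YEAR)
monthly = yearly / 12  # assumption: 12 equal months

print(f"Per year:  {int(yearly // 60)} h {yearly % 60:.0f} min")
print(f"Per month: {monthly:.0f} min")
```

Under these assumptions the budget works out to about 368 minutes per year, which matches the 6 hours, 8 minutes (about 31 minutes per month) quoted above.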
The first step to improving availability is to share our knowledge with developers, specifically about how enhancing skill and device quality contributes to keeping cloud services highly available. Together, we can provide highly available services along with a unified and efficient method for shared resolution of both Amazon-instigated and developer-instigated outage events. Developers must have as much visibility as Alexa can provide, which includes the insights Alexa has gained over the years in areas where developers may not be able to see as clearly. Insights such as control success rate, latency, reporting state, and change reporting are included in the Alexa Developer Console. As additional details become necessary, they will be added to the console.
A three-step recommendation for a highly available infrastructure for Alexa
Works with Alexa (WWA) currently has program guidelines for high availability (outage reduction, low latency, and low skill invocation failures). Alexa will begin tracking more specific skill and device quality guidelines in 2023 to ensure quality remains high while limiting customer impact from outage events. Below are three recommendations to ensure your infrastructure is highly available.
Step 1: Resiliency and Disaster Recovery (DR) are Key for Business Continuity and a Positive CX
Resiliency refers to an organization’s ability to maintain acceptable service levels through severe disruptions or outages. Gartner (2020) reported that 85% of senior IT leaders strongly associate resilience with their roadmap [4]. Several factors play a role in a highly resilient infrastructure, including communication, Standard Operating Procedures (SOPs), and monitoring (automation, KPIs) [3]. Other factors include the need to scale and back up services in case of an outage.
- Plan for Disaster Recovery (DR). Take advantage of your cloud service provider's backup and redundancy features, and use KPIs to limit downtime and data loss, both of which take significant time to recover from and negatively impact the customer experience (CX). Use an active-active strategy for highly critical services affecting your customers [2]. If you are using AWS, best practices for a DR plan are outlined in the AWS Well-Architected documentation. If you are using another cloud service provider, please reach out to your representative regarding the DR plan they have available.
- Utilize KPIs and monitor progress on a continuous basis [2]. Most of our developers focus on availability and reliability KPIs such as uptime. However, additional KPIs may be used to improve outage response times and limit customer impact. Below are a few KPIs along with their corresponding descriptions for your reference.
  - Uptime - The percentage of time your services are available and functional, usually expressed as a figure such as 99.93%. This is the most important metric pertaining to availability for Alexa partners.
  - Mean Time to Detection/Discovery (MTTD) - The average time between an outage occurring and operations being alerted
  - Mean Time to Acknowledge (MTTA) - The average time between operations being alerted and operations acknowledging the event
  - Mean Time to Recovery/Resolve/Repair (MTTR) - The average time between an outage occurring and recovery of the service
  - Mean Time Between Failures (MTBF) - The average time between failure events
  - Incidents over time - The average number of incidents that occur over a set time (e.g., per quarter, half year, or full year)
  - Service Level Agreement and Objective (SLA/SLO) - Promises (internal or external) of a specified level of availability, defined by agreed-upon collected metrics
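As an illustration of how the time-based KPIs above relate to each other, the following Python sketch computes them from a list of incident records. The `Incident` record and its field names are hypothetical, and MTBF is taken here as the average gap between one recovery and the next failure, which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:  # hypothetical record; field names are assumptions
    started: datetime       # when the outage began
    detected: datetime      # when operations was alerted
    acknowledged: datetime  # when operations acknowledged the alert
    resolved: datetime      # when the service recovered


def _mean_minutes(deltas):
    """Average a collection of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_kpis(incidents):
    """Compute MTTD, MTTA, MTTR, and MTBF (in minutes) from incident records."""
    ordered = sorted(incidents, key=lambda i: i.started)
    kpis = {
        "mttd": _mean_minutes([i.detected - i.started for i in ordered]),
        "mtta": _mean_minutes([i.acknowledged - i.detected for i in ordered]),
        "mttr": _mean_minutes([i.resolved - i.started for i in ordered]),
    }
    # MTBF (assumed convention): gap between a recovery and the next failure.
    gaps = [b.started - a.resolved for a, b in zip(ordered, ordered[1:])]
    kpis["mtbf"] = _mean_minutes(gaps) if gaps else None
    return kpis


t0 = datetime(2023, 1, 1, 12, 0)
t1 = t0 + timedelta(days=10)
sample = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=6),
             t0 + timedelta(minutes=30)),
    Incident(t1, t1 + timedelta(minutes=2), t1 + timedelta(minutes=5),
             t1 + timedelta(minutes=20)),
]
print(incident_kpis(sample))
```

Tracking these values per quarter or per year also gives you the "incidents over time" view, since the same records can simply be counted per period.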
Step 2: Operational excellence, monitoring and incident management
The difference between having a general SOP to mitigate outages and having an efficient SOP is the time it takes to recover from an outage. A prolonged recovery period or extended outage has a negative effect on customer trust. This may result in fewer brand purchases and negative reviews of existing services and products. Below are recommendations to ensure you have defined an efficient standard operating procedure (SOP) to help mitigate outages should they occur.
- Ensure you have 24x7 support available to respond to outages. For most partners, this includes an operations team that is capable of addressing an outage within a few minutes and of reverting the changes that caused it. Customers expect Alexa skills to work around the clock, and if your skill is supported across multiple regions (North America, Europe, and Far East), customers expect the skill and device to always be online and available for use.
- Create an email alias to notify the appropriate team(s) about the outage. Let’s face it, turnover and transition between teams within organizations is not going away. However, your security teams can configure their directory server to address those gaps by using email aliases (i.e., user groups) without having to manage each individual email address. This builds in automation as new employees join the company or existing employees switch teams. Alexa will request aliases for three main purposes:
  - On-call Email Alias. This alias would be used when Alexa detects a service outage that is preventing customers from using your Alexa skill. The email from Alexa to this alias will be informative and will request that someone on the alias respond to acknowledge the incident.
  - Escalation Email Alias. This alias would be used by Alexa to escalate to business leaders to ensure customer impact is acknowledged and attended to as a priority. This alias would only be used if there is no timely response from the on-call email alias, keeping in mind that timely recovery from the event is critical to the customer experience.
  - Technical Contact Alias. This alias would be used when Alexa sees a regression in a developer’s service. Regression signals may come from latency, account linking, discovery, control, etc.
- Automate alarms to notify you of a regression should there be an internal service issue or outage. For instance, signals of a service outage may include a control success rate that drops below 70%, a latency spike lasting more than 10 minutes, or a drop in the account linking completion rate. These are all signals that you can monitor on a continuous basis. Monitoring performance indicators and configuring alarms enables proactive measures that limit outage duration or prevent outages altogether.
- Conduct a post-mortem to determine the root cause and take action on it. This is an essential step to reducing outages and preventing them from recurring. Many companies hold a board review with a list of questions that are discussed to ensure proper mechanisms, such as alarms and communication channels, are available for a more effective resolution moving forward. This is also a key activity for continuous improvement and helps prevent repeated outage events.
- Practice responding to outage incidents [2, 3]. You can plan outage simulations, performance testing, and regression testing throughout the year to practice incident response to ensure that your 24x7 support team is responsive, alarms are working, and that you are incorporating lessons learned from past outages to continuously improve your response time. This exercise would focus on key areas of improvement throughout the outage cycle, including internal communication within your company and external communication (partners, customers, etc.) while ensuring the outage is minimized.
- Have a 30-day minimum log retention policy. A shorter retention period narrows the window in which you can identify recurring events. Use messageIDs to diagnose specific events that map to directives sent by Alexa.
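The alarm conditions mentioned above (a control success rate below 70%, or a latency spike lasting more than 10 minutes) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a monitoring product: the `MinuteSample` record, the p90 latency field, and the `limit_ms` value are all hypothetical, and in practice you would wire equivalent rules into whatever alarming service your cloud provider offers.

```python
from dataclasses import dataclass


@dataclass
class MinuteSample:  # hypothetical per-minute metrics record
    directives: int        # control directives received in this minute
    successes: int         # directives that completed successfully
    p90_latency_ms: float  # assumed latency statistic for the minute


def control_success_alarm(samples, threshold=0.70):
    """Alarm when the success rate over the window falls below the threshold."""
    total = sum(s.directives for s in samples)
    if total == 0:
        return False  # no traffic in the window, nothing to evaluate
    return sum(s.successes for s in samples) / total < threshold


def latency_spike_alarm(samples, limit_ms, sustained_minutes=10):
    """Alarm when latency stays above limit_ms for more than sustained_minutes in a row."""
    run = 0
    for s in samples:
        run = run + 1 if s.p90_latency_ms > limit_ms else 0
        if run > sustained_minutes:
            return True
    return False


healthy = [MinuteSample(100, 98, 400.0) for _ in range(15)]
degraded = [MinuteSample(100, 60, 2500.0) for _ in range(11)]

print(control_success_alarm(healthy), control_success_alarm(degraded))
print(latency_spike_alarm(healthy, limit_ms=1000.0),
      latency_spike_alarm(degraded, limit_ms=1000.0))
```

Requiring the latency condition to hold for consecutive minutes keeps a single slow minute from paging the on-call alias, while a sustained regression still triggers the alarm.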
Step 3: Alexa device setup experience (discovery and account linking)
Device setup includes device discovery and account linking and is the first step in the customer journey when controlling a device with Alexa. A good setup experience motivates customers to keep using the device. While this section focuses primarily on the setup experience, best practices for other areas of skill quality, such as latency, state and change reports, and control, are covered in separate blog posts for quick reference should you run into any issues with your device or skill performance.
- Set up App-to-App Account Linking. Alexa supports four integrations for account linking (3P to Alexa for iOS and Android, and Alexa to 3P for iOS and Android). The 3P to Alexa integrations (iOS and Android) are a requirement to obtain a WWA badge, as they deliver the best CX.
- Review the App-to-App Account Linking best practices blog. The blog includes sections for CX recommendations and error message handling with links to the technical documentation. An optimal CX makes it easier for customers to self-service when they are going through setup or if they have account linking issues.
- Review the App-to-App Account Linking Troubleshooting guide. This guide includes common issues reported by customers and includes corresponding solutions. As a first step, review this guide to troubleshoot App-to-App Account Linking issues.
- Ensure your OAuth server is highly available. Continuously monitor the server, set up alarms for OAuth server errors, and ensure there is redundancy or multi-region failover.
- Monitor and automate account linking. Use a test account to ensure there are no broken links in your Alexa integration that would prevent customers from account linking.
- Review the discovery technical documentation. Proactively manage endpoints to keep Alexa up to date with the customer’s device status by reviewing the Alexa Discovery APIs and Send Events in the Alexa Event Gateway documentation. Also, gracefully handle errors by reviewing the Alexa.ErrorResponse API documentation.
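To illustrate what handling errors gracefully can look like in a skill handler, here is a hedged Python sketch that builds an error event for a directive that could not be completed. The message shape and the `ENDPOINT_UNREACHABLE` error type reflect our understanding of the Smart Home message format and should be verified against the Alexa.ErrorResponse API documentation; `build_error_response` is a hypothetical helper, not part of any Alexa SDK.

```python
import uuid


def build_error_response(directive: dict, error_type: str, message: str) -> dict:
    """Build an Alexa.ErrorResponse-style event for a directive that failed.

    Assumption: the field layout below follows the Smart Home message format;
    check it against the Alexa.ErrorResponse documentation before relying on it.
    """
    request_header = directive["directive"]["header"]
    header = {
        "namespace": "Alexa",
        "name": "ErrorResponse",
        "messageId": str(uuid.uuid4()),  # new unique ID for this response
        "payloadVersion": "3",
    }
    # Echo the correlation token only when the directive carried one.
    if "correlationToken" in request_header:
        header["correlationToken"] = request_header["correlationToken"]
    return {
        "event": {
            "header": header,
            "endpoint": {
                "endpointId": directive["directive"]["endpoint"]["endpointId"]
            },
            "payload": {"type": error_type, "message": message},
        }
    }


# Example directive with placeholder values (all identifiers are made up).
example_directive = {
    "directive": {
        "header": {
            "namespace": "Alexa.PowerController",
            "name": "TurnOn",
            "messageId": "example-message-id",
            "correlationToken": "example-token",
            "payloadVersion": "3",
        },
        "endpoint": {"endpointId": "example-endpoint"},
        "payload": {},
    }
}
response = build_error_response(
    example_directive, "ENDPOINT_UNREACHABLE", "Device did not respond."
)
print(response["event"]["payload"])
```

Logging both the incoming directive's messageId and the one generated for the response makes it easier to trace a specific failure later.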
The best experience for customers
A DR strategy is the first step to creating and maintaining a resilient infrastructure. A DR strategy provides customers with a consistent CX, creates higher customer engagement and satisfaction, and protects your brand from poor experiences. You’ll gain critical insights to improve your products and services over time, along with the visibility needed to react to outages. The DR strategy should include monitoring and alarms throughout your infrastructure, in addition to SOPs that make the appropriate teams clear about their roles and responsibilities. You may opt to implement continuous improvement strategies by focusing on KPIs such as uptime, MTTD, and MTTR while including post-mortems. Conduct outage simulations or testing following changes to your infrastructure (upgrades, deployments, etc.) to ensure response levels are optimal for reducing downtime. Communicate the overall availability strategy internally to ensure there is buy-in across teams. Reach out to your Amazon representatives at AWS and Alexa to ensure there is alignment on KPIs and initiatives, as they may affect your certification status and overall skill quality. Internal and external communication will enable us to operationalize and enforce the minimum levels of support required to provide the best experience for shared customers.
Additional resources for AWS and non-AWS managed cloud architectures can be found on the AWS Well-Architected resource page (https://aws.amazon.com/architecture/well-architected). This cloud-agnostic resource and tool can provide guidance and options for improving design efficiency, performance, and redundancy, and for reducing overall deployment costs.
1. Amazon Web Services. (n.d.). Availability. AWS Well-Architected. Amazon Web Services, Inc. Retrieved November 28, 2022, from https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html
2. Amazon Web Services. (n.d.). AWS Well-Architected. Amazon Web Services, Inc. Retrieved November 28, 2022, from https://aws.amazon.com/architecture/well-architected
3. Kamin, D. A. (2017). Exploring Security, Privacy, and Reliability Strategies to Enable the Adoption of IoT. Walden Dissertations and Doctoral Studies. 4382. https://scholarworks.waldenu.edu/dissertations/4382
4. Witty, R. & Young, P. (2020). Resilience Is an Urgent C-Suite Priority. Gartner. https://www.gartner.com