Is it us or them: Using flow data to monitor your providers.

Posted February 28, 2019


Run a production service on a major cloud provider at scale for long enough and you are bound to run into what I like to call the “us or them” issue: your application is down and your health checks are failing, but you don’t know whether your developers broke something or whether your cloud provider’s infrastructure is failing.

Cloud providers are pretty good at eventually owning up to major issues. If an entire availability zone or service falls on its face, there are a number of nice dashboards that providers offer to inform you. But what about the gray failure modes in the infrastructure itself, such as noisy neighbors, overloaded top-of-rack switches, or rebooting hypervisors? Or a third-party service that becomes slow or unreachable in the specific parts of the cloud infrastructure where you happen to be running? These types of failures happen frequently enough to impact your service, but they are isolated enough that your cloud provider isn’t exactly advertising them on a dashboard when they do.

Of course, if you’ve architected your application correctly, it can survive some of these issues and degrade gracefully while the problem is occurring. And you can sit back and cross your fingers that things will come back to normal eventually. In the worst case, you can fail away from the misbehaving systems and hope the weather is better in some other part of the cloud infrastructure.

Debasis Das, Senior Software Engineer at Twilio, recently gave a great talk on this phenomenon at the Chaos Engineering Meetup.

Debasis pointed out in his talk that Twilio is able to achieve four nines (99.99%) of uptime in this way, a budget of 52.6 minutes of downtime per year. He also noted that getting to five nines would mean only 5.26 minutes of downtime per year and would require pretty radical changes to their operations! One of those changes would be a means of knowing where their providers are having problems in real time so they can make adjustments quickly.
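For context, those budgets fall straight out of the availability arithmetic. Here is a quick sketch of the math:

```python
# Yearly downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    budget = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: {budget:.2f} minutes of downtime per year")
# four nines -> 52.56 minutes (~52.6), five nines -> 5.26 minutes
```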

This is one of the areas where flow data is particularly useful. Since it’s comprehensive and extremely granular, it can answer a number of questions about the cloud infrastructure by observing its interactions with your application. Are packet drops or errors spiking between specific services? Is the issue occurring only within a specific subset of instances or containers? Is response time spiking to a third-party cloud service? Is it happening only in a particular availability zone?
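To make that concrete, here is a rough sketch of the kind of slicing flow data enables. The record fields, service names, and numbers are purely hypothetical, not any particular product’s schema:

```python
from collections import defaultdict

# Hypothetical flow records: each summarizes traffic between two services
# in one zone over a short interval.
flows = [
    {"src": "voice-signaling", "dst": "carrier-gw", "zone": "ap-southeast-2b",
     "packets": 90_000, "drops": 1_800},
    {"src": "voice-signaling", "dst": "carrier-gw", "zone": "us-east-1d",
     "packets": 88_000, "drops": 12},
]

# Drop rate per (service pair, zone) quickly shows whether a problem is global
# or confined to one slice of the infrastructure.
stats = defaultdict(lambda: [0, 0])  # key -> [drops, packets]
for f in flows:
    key = (f["src"], f["dst"], f["zone"])
    stats[key][0] += f["drops"]
    stats[key][1] += f["packets"]

for key, (drops, packets) in stats.items():
    print(key, f"drop rate = {drops / packets:.2%}")
```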

In Debasis’s case, he was able to use flow data to make inferences about the behavior of the underlying cloud while dealing with an issue in Twilio’s voice signaling application (a primarily UDP application using SIP). This application is responsible for connecting calls, so it’s always mission critical. Twilio uses end-to-end testers to directly measure the reliability of their API, but when problems are discovered, flow data helps them tell a more detailed story.

Let’s walk through one example Debasis shared. He described an issue from a few months back that affected their service in the Asia-Pacific region. Looking at the flow data for the service in Figure 1, it’s possible to see a small but discernible increase in transmitted traffic.

Figure 1: Graph of UDP traffic received [RX] and sent [TX] for the service across all zones.

Since the flow data captures the connections between every pair of instances, it’s possible to dissect it further and group it by availability zone. In Figure 2, it is clear that the issue directly affected only us-east-1d to ap-southeast-2b traffic.

Figure 2: Graph of UDP traffic received [RX] and sent [TX] for the service between availability zones.
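A minimal sketch of that grouping, again with made-up records and numbers: a zone pair whose transmitted bytes climb while its received bytes stay flat is a strong hint that traffic is being lost somewhere on that path.

```python
from collections import defaultdict

# Hypothetical per-connection flow records: (src_zone, dst_zone, tx_bytes, rx_bytes).
flows = [
    ("ap-southeast-2b", "us-east-1d", 5_200_000,   900_000),  # imbalanced pair
    ("ap-southeast-2b", "us-east-1e", 4_800_000, 4_700_000),
    ("us-east-1e",      "us-east-1d", 3_900_000, 3_850_000),
]

totals = defaultdict(lambda: [0, 0])  # (src_zone, dst_zone) -> [tx, rx]
for src_zone, dst_zone, tx, rx in flows:
    totals[(src_zone, dst_zone)][0] += tx
    totals[(src_zone, dst_zone)][1] += rx

# A large TX/RX imbalance on one zone pair, and only that pair, points at the path itself.
for (src, dst), (tx, rx) in totals.items():
    print(f"{src} -> {dst}: tx={tx} rx={rx} ratio={tx / max(rx, 1):.2f}")
```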

At this point, it’s useful to break the flow data down to the instance level. Figure 3 shows the flow graph broken down by traffic from us-east-1d to specific instances within ap-southeast-2b. From this data, it’s possible to see that a single instance was the primary source of the additional transmitted traffic, and that it was unable to reach us-east-1d during the incident. Interestingly, it’s also possible to tell that this particular instance had no problem reaching other availability zones (like us-east-1e) and that other instances in ap-southeast-2b could reach us-east-1d. The problem was isolated not just to a particular instance but to the path from that instance to another part of the service running in a different geography.

Figure 3: Graph of UDP traffic received [RX] and sent [TX] for the service to specific instances in ap-southeast-2b.
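The instance-level view is the same aggregation with a finer key. In this sketch (the instance IDs and numbers are invented for illustration), filtering to traffic headed for us-east-1d and grouping by the ap-southeast-2b source instance surfaces the single outlier that transmits heavily but receives almost nothing back:

```python
from collections import defaultdict

# Hypothetical per-connection records: (instance, src_zone, dst_zone, tx_bytes, rx_bytes).
flows = [
    ("i-0aaa", "ap-southeast-2b", "us-east-1d", 2_400_000,         0),  # the outlier
    ("i-0aaa", "ap-southeast-2b", "us-east-1e", 1_100_000, 1_050_000),  # same instance, healthy path
    ("i-0bbb", "ap-southeast-2b", "us-east-1d", 1_000_000,   980_000),  # peer instance, healthy path
]

# Group by source instance for traffic headed to us-east-1d only.
per_instance = defaultdict(lambda: [0, 0])  # instance -> [tx, rx]
for instance, src_zone, dst_zone, tx, rx in flows:
    if dst_zone == "us-east-1d":
        per_instance[instance][0] += tx
        per_instance[instance][1] += rx

# An instance transmitting heavily but receiving nothing back is the likely culprit.
for instance, (tx, rx) in per_instance.items():
    print(f"{instance}: tx={tx} rx={rx}")
```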

With this information, it seemed likely that they were suffering from a networking issue within the cloud itself. It also made it possible to craft a targeted response, relocating or restarting the troubled instance rather than draining one (or more) availability zones and prolonging the issue. The data also provides specific details that can be shared with the cloud provider to accelerate their troubleshooting.

We were thrilled to see Debasis share this war story. It’s a great example of how flow data can help you drill into really complex problems in your applications and develop insights into problems in your cloud provider as well.