An interesting thing I learned about cloud downtime tracking from Niall Murphy

Posted on 20/12/2022 (updated 08/04/2023) by Bernard Pietraga


One year ago, I was doing a gig for a startup in San Francisco. During this time, the SRE team I was part of had the opportunity to meet and chat with Niall Richard Murphy, thanks to a meeting set up by JJ Tang. I managed to pick up some of his ideas about infrastructure, and one of his answers stuck in my mind. The question was: do cloud providers use social media to find out whether they are down?

Niall Richard Murphy, instigator of the Site Reliability Engineering books, made an impact on the world of computing. His publications created a whole new role in the departments of large corporations. They also attracted the attention of start-ups looking to build more robust systems. I’m a site reliability engineer. Niall’s trends and inventions put food on my table.

How do cloud providers know when they are down?

On the surface, this sounds like a simple question. Properly monitor the infrastructure, set up alerts and voilà! We have updates on the status page. In reality, interpreting signals is hard work. It is easy to get lost in the noise.

The week before I had the chance to meet the expert I mentioned, we had an outage. Our cloud-managed clusters were going haywire, and I remember the whole team of experts struggling to figure out what was wrong. The configurations were fine. We were not deploying any disruptive changes. Even if we had, they would have been reverted automatically. We were tearing our hair out. It took until one of the engineers said, "oh, our cloud provider is down", pointing to a tweet from some engineering portal. We went to the official status page and everything was green, like a Christmas tree. We went to the down detector site and it was the same story, nothing to confirm the outage. But the messages on social media kept coming, and each one made it more certain. It took half an hour for the official sites to confirm it. How come they didn’t know?

When I had a chance, I asked Niall:
Do cloud providers use social media for readiness signals?

He has worked at Google, Amazon and Microsoft, seen it all, and seemed the best person in the room to answer this question. Here is a paraphrase of his answer.

Cloud providers constantly monitor social media for information about downtime. It is one of the most reliable methods. All major providers scrape data for further processing. Some even use it for ML or NLP analysis.

This is an interesting perspective. Rather than relying on metrics alone, providers treat people as down detectors for their services. Crowdsourced data from social media is fed into big data pipelines that are designed to take human chatter and output information about the topics being discussed. That output is what gets the attention of the engineers working for these big companies.
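To make the idea a bit more concrete, here is a minimal sketch of what the very first stage of such a pipeline could look like. It is my own toy illustration, not how any cloud provider actually does it: the `Post` type, the `OUTAGE_WORDS` list, both function names, and the "ExampleCloud" service are all made up, and a real system would pull posts from a platform API and likely use NLP models rather than naive keyword matching.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical keyword list; naive substring matching is an assumption made
# for brevity (it would also match words such as "download").
OUTAGE_WORDS = ("down", "outage", "unavailable", "not working")


@dataclass
class Post:
    minute: int  # minutes since the start of the observation window
    text: str


def complaints_per_minute(posts, service):
    """Count posts per minute that mention the service and an outage keyword."""
    counts = Counter()
    for post in posts:
        text = post.text.lower()
        if service.lower() in text and any(word in text for word in OUTAGE_WORDS):
            counts[post.minute] += 1
    return counts


def suspicious_minutes(counts, baseline=1.0, factor=3.0):
    """Return the minutes whose complaint count exceeds factor * baseline."""
    return sorted(m for m, c in counts.items() if c > factor * baseline)


# Made-up sample data: quiet chatter, then a burst of complaints at minute 5.
sample_posts = [
    Post(0, "ExampleCloud is great"),
    Post(1, "Is ExampleCloud down?"),
    Post(5, "ExampleCloud outage again"),
    Post(5, "examplecloud is down for us"),
    Post(5, "ExampleCloud console unavailable"),
    Post(5, "ExampleCloud not working in my region"),
]

counts = complaints_per_minute(sample_posts, "ExampleCloud")
print(suspicious_minutes(counts))
```

A single complaint at minute 1 stays below the threshold, while the burst at minute 5 is flagged. The fixed `baseline` and `factor` are arbitrary; the hard part in practice is separating a real spike from everyday noise.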

It is not surprising that cloud providers monitor social media platforms for outages or other issues that may affect their products or services. For example, if a cloud provider notices a large number of users reporting problems with a particular service on social media, they may investigate the issue and work to resolve it as quickly as possible. Interestingly, Niall pointed out that this is one of the most reliable ways to tell if part of the infrastructure has actually failed.

It’s worth noting that in addition to monitoring social media platforms, cloud providers often use a range of tools and techniques to monitor the performance and availability of their products and services. These may include monitoring tools that track the performance of their services in real time, as well as tools that allow them to track and analyse user behaviour and usage patterns. Still, the good old "let’s see what people are saying" is a decent way to tell whether things are going sideways.

It reminds me of HFT shops and other boutique quantitative trading firms scraping data from social media sites to gauge market sentiment or find alpha. It has become the norm. Without it, market participants are likely to lose their edge over the market.

The reductionist picture presented here doesn’t do justice to the hurdles of monitoring the inner workings of the cloud. The point is to show that the human factor still plays a role. Even if not in the way we expect. Think about it the next time you express your frustration about the cloud on social media.

Sidenote: Both Niall Murphy (Stanza) and JJ Tang (Rootly) work on products that improve the SRE experience.