In software engineering, the details really do matter.
Our world was quickly reminded of that by the recent AWS outage, which was caused by human error. A small yet powerful command was, unfortunately, entered incorrectly:
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.
And the world came to a screeching halt and organizations lost hundreds of millions of dollars in revenue.
I deeply empathize with AWS and my heart goes out to the team. Although I’ve never messed up at that level or scale, I personally have messed up production environments in the past and even thinking about it makes my stomach tie in knots!
The challenge, of course, is that our work involves so many details, and they come at us from many different sources. Some of these sources pump out data every single minute, while others report only once a week.
The timing, the inconsistency, and the pace at which these systems send us data don’t make any of them less mission-critical! But, as you know, judging how important they are and turning them into prioritized actions and decisions can be difficult.
As such, we’re “sweating the small stuff,” as they say, and taking care to catalogue all the data that comes into an EngOps universe. Nothing, at this point in time, can be thrown away until we identify it, categorize it, and share it intelligently with our end-users.
Deployments are one such example:
So, for starters, adding the AWS location, the operating system, and the average CPU and memory usage seems like “table stakes”:
Going a step further, adding “Total Network In / Out” in another iteration seemed like another opportunity to provide fine-grained views:
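For readers who think in code, here is a rough sketch of what one deployment record with those fields might look like. It is purely illustrative and not our actual data model: the `DeploymentSnapshot` class, the `summarize` helper, the field names, and the shape of the raw samples are all assumptions made up for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DeploymentSnapshot:
    """Hypothetical per-deployment record; every field name is illustrative."""
    aws_location: str              # e.g. an AWS region name
    operating_system: str
    avg_cpu_percent: float
    avg_memory_percent: float
    total_network_in_bytes: int
    total_network_out_bytes: int


def summarize(aws_location, operating_system, samples):
    """Collapse raw samples into a single snapshot.

    Each sample is assumed to be a dict with the keys
    "cpu", "memory", "net_in", and "net_out".
    """
    if not samples:
        raise ValueError("need at least one sample")
    return DeploymentSnapshot(
        aws_location=aws_location,
        operating_system=operating_system,
        # CPU and memory are averaged across samples.
        avg_cpu_percent=mean(s["cpu"] for s in samples),
        avg_memory_percent=mean(s["memory"] for s in samples),
        # Network traffic is summed to give the "total in / out" figures.
        total_network_in_bytes=sum(s["net_in"] for s in samples),
        total_network_out_bytes=sum(s["net_out"] for s in samples),
    )
```

Keeping every field on the record, even the ones a given viewer may never look at, mirrors the idea of capturing everything first and deciding what to show later.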
Will all of these make sense for all engineering individuals and/or teams? Not necessarily. And higher-ups or senior leadership do not necessarily need to see these types of details either.
But, at this point, we want to do our due diligence and capture and show everything before we start building in things like “intelligent views” based on roles, teams, and business / operating units.
We’ll continue to share our iterations as we build them! We’re moving fast and we couldn’t be more excited about where we’re headed.
Also published on Medium.