In the last decade, there hasn't been a bigger or more misunderstood buzzword in the development and operations worlds than "DevOps." Countless companies have "DevOps Engineer" roles, and the title rarely means the same thing at any two of them. In many companies, a DevOps engineer is little more than a Linux system administrator who works alongside a development team. In others, it is just the fancy new title given to system administrators whose jobs don't really change at all. In still others, DevOps engineers are software developers or system operators who learn to write Puppet manifests or Chef cookbooks, and that becomes their job description. None of these are really DevOps.
DevOps is not a role or a job title, but rather a guiding set of principles aimed at helping development, product, and operations teams be more agile. When I joined Merit in January of 2019, there were no dedicated DevOps, infrastructure, or Site Reliability Engineers (SREs) on the team. The infrastructure had been built by software developers, and all of the manifests that declared the infrastructure were stored in version control. While there was a gap in the team's operations knowledge, Merit was already running as a DevOps shop, in the loosest sense of the term.
In the year that has passed since then, my main focus has been to implement a culture that embraces DevOps and the pillars it prescribes, while improving and maturing our infrastructure so that we can handle the scale of being the world's premier verified identity provider. I will share some of these learnings so that you can implement them in your own organizations.
At Merit, we have been lucky enough that, for the most part, organizational silos did not and still do not exist. As we've matured and formed our own infrastructure team, we have had to take special care to make sure that silos do not form. We did this in three ways: forming an infrastructure team, cohabitating infrastructure and platform code, and having the development teams take ownership of infrastructure changes.
Before we started forming the infrastructure team, I invited any engineers who wished to learn more about Merit's infrastructure, and about the DevOps/SRE disciplines in general, to join me for a session discussing the DevOps pillars, SRE fundamentals, tooling, and more. Next, we addressed the fact that our infrastructure manifests lived in a Version Control System (VCS) repository separate from the monorepo that holds our platform code, which created an artificial divide between the platform and the infrastructure. That divide made it more difficult to maintain both in parallel, so we have slowly been strangling the static manifests out of the infrastructure repository and migrating them into the monorepo, typically adding a bit of automation in the process.
Lastly, rather than having a team that lords over the infrastructure, we asked that all development teams take ownership of the infrastructure changes needed for the successful rollout of new features. To help track and audit these changes, we've implemented an RFC (Request for Comments) process that all infrastructure changes must go through, giving the infrastructure and product teams visibility into the changes happening around them.
One of my mentors once said to me, "It doesn't matter if you screw something up. What matters is that when you fail hard, you recover quickly." At Merit, we already had a blameless post-mortem process, with a focus on action items and remediation, and we have carried it forward to this day.
We have leveraged Kubernetes to provide high levels of redundancy and resiliency in our systems. If we accept that any given service will fail at some point, then we can avoid an outage by making sure that the service, and indeed every service, is redundant. We've also put quality gates on all of our code check-ins, requiring not only that linters and builds pass, but also that unit, integration, and end-to-end tests do. We run a "health check" pipeline against production regularly, and we require that design docs or RFCs be drafted before features are designed and released. This has allowed us to treat failure as just another obstacle to be overcome, and to have a contingency plan ready when it inevitably happens.
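To make that concrete, the sketch below shows roughly how a redundant service can be declared when Kubernetes resources are managed through Terraform. It is a minimal illustration rather than our actual manifests: the service name, image, port, and replica count are all placeholders.

```hcl
resource "kubernetes_deployment" "api" {
  metadata {
    name   = "api"
    labels = { app = "api" }
  }

  spec {
    # Multiple replicas, so losing any single pod (or node) doesn't cause an outage.
    replicas = 3

    selector {
      match_labels = { app = "api" }
    }

    # Never take a pod out of service until its replacement is ready, which also
    # keeps routine deploys and rollbacks free of downtime.
    strategy {
      type = "RollingUpdate"
      rolling_update {
        max_surge       = "1"
        max_unavailable = "0"
      }
    }

    template {
      metadata {
        labels = { app = "api" }
      }
      spec {
        container {
          name  = "api"
          image = "registry.example.com/api:1.2.3" # placeholder image

          # Only route traffic to pods that pass their health check.
          readiness_probe {
            http_get {
              path = "/healthz"
              port = "8080"
            }
          }
        }
      }
    }
  }
}
```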
Implementing gradual change should be ingrained in your culture, and it should inform decisions throughout the software development lifecycle. Over the course of the last year, we have moved from releasing once a week to twice a week, and we are currently building the confidence to do daily deployments with zero downtime. This will allow us to quickly see when a faulty build has been pushed to production and roll it back without interrupting service.
This might seem counterintuitive at first, but it is methodical. Things sometimes break unexpectedly, so migrating resources gradually reduces both our risk and the collateral damage when a failure does occur.
Tooling, automation, and maintaining your infrastructure as code are force multipliers for any development organization. A common motto among SREs is to "automate the year's work away," and at Merit, one of our core values is that "a rising tide lifts all boats." With this belief in mind, we view our infrastructure engineers, and the tooling and automation that we leverage, as force multipliers: any tool we adopt should either reduce toil or increase the efficiency of our engineering team.
While we always kept our static Kubernetes manifests in our infrastructure repository, they weren't really code: they could not be linted for anything beyond structure, they could not be tested, and they could not be modularized, so the repository held many repeated lines of "code" (once for testing, once for staging, once for production, and so on). As we've migrated our infrastructure to our monorepo, we've also created Terraform modules that can be imported and reused across all of our environments. Our Prometheus and Alertmanager configs are written in Terraform's DSL and rendered to YAML using Terraform's `yamlencode` function. Configurations for our servers are templated out, with variables interpolated at deploy time.
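As an illustration of what that modularization buys us, here is a hypothetical layout (the module path, input names, and values are made up for the example, not taken from our codebase): a single module instantiated once per environment, plus a server config rendered from a template at deploy time.

```hcl
variable "staging_image_tag"    { type = string }
variable "production_image_tag" { type = string }

# One module, reused per environment, instead of copy-pasted manifests.
module "api_staging" {
  source      = "./modules/service" # illustrative module path
  environment = "staging"
  replicas    = 2
  image_tag   = var.staging_image_tag
}

module "api_production" {
  source      = "./modules/service"
  environment = "production"
  replicas    = 5
  image_tag   = var.production_image_tag
}

# A server config rendered from a template, with values interpolated at deploy
# time rather than hard-coded once per environment.
output "rendered_app_config" {
  value = templatefile("${path.module}/templates/app.conf.tpl", {
    environment = "staging"
    log_level   = "info"
  })
}
```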
At Merit, we have put considerable effort into fulfilling the three pillars of observability: monitoring, logging, and tracing. We deployed Prometheus six months ago, and today it monitors not only our Kubernetes clusters and our application, but also our Redis servers, our databases, our logging stack, and even our HTTPS endpoints and SSL certificates via the Blackbox Exporter. For logging, we standardized on Elasticsearch as our log aggregator, with Fluentd handling log parsing and routing. Fluentd is also responsible for pushing logs to our data warehouse, allowing our product teams to join log data against the other datasets there. For tracing, we implemented the OpenTracing specification in our application and use Zipkin to capture and aggregate traces. This level of visibility allows us to pinpoint problematic functions in our codebase and make meaningful, intentional improvements.
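To give a flavor of the Blackbox Exporter piece, the snippet below sketches a Prometheus scrape job written in Terraform's DSL and rendered to YAML with `yamlencode`. The targets and exporter address are placeholders rather than our real configuration; the point is that the exporter's metrics, such as `probe_success` and `probe_ssl_earliest_cert_expiry`, are what make endpoint availability and certificate expiry alertable.

```hcl
locals {
  blackbox_scrape_config = yamlencode([
    {
      job_name     = "blackbox-https"
      metrics_path = "/probe"
      params       = { module = ["http_2xx"] }
      static_configs = [
        { targets = ["https://app.example.com", "https://api.example.com"] }
      ]
      relabel_configs = [
        # The probed URL becomes a query parameter passed to the exporter...
        { source_labels = ["__address__"], target_label = "__param_target" },
        { source_labels = ["__param_target"], target_label = "instance" },
        # ...while Prometheus actually scrapes the exporter itself.
        { target_label = "__address__", replacement = "blackbox-exporter:9115" },
      ]
    }
  ])
}
```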
Finally, everything coalesces through Prometheus's Alertmanager and Grafana into PagerDuty, which handles the "last mile" delivery of alerts to the appropriate teams. This lets us take a proactive approach to reliability, fixing problems before our customers even notice them.
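Here is a simplified sketch of that routing, again as HCL rendered to YAML: alerts are grouped and handed to per-team PagerDuty receivers. The team names and variables are placeholders, and a real routing tree is more involved.

```hcl
variable "infra_pagerduty_key"    { type = string }
variable "platform_pagerduty_key" { type = string }

locals {
  alertmanager_config = yamlencode({
    route = {
      receiver = "pagerduty-infra" # default receiver
      group_by = ["alertname", "namespace"]
      routes = [
        # Alerts labeled for the platform team page that team directly.
        { match = { team = "platform" }, receiver = "pagerduty-platform" },
      ]
    }
    receivers = [
      {
        name              = "pagerduty-infra"
        pagerduty_configs = [{ routing_key = var.infra_pagerduty_key }]
      },
      {
        name              = "pagerduty-platform"
        pagerduty_configs = [{ routing_key = var.platform_pagerduty_key }]
      },
    ]
  })
}
```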
Everything we have done up to this point has really been about getting our infrastructure to table stakes. Only now can I say that our infrastructure has begun to mature, and we still have a long way to go. We've learned a lot on this journey, and I hope that sharing these learnings helps lay the groundwork for your team to make DevOps part of its culture.