Your brain on DevOps: Terraform, Spinnaker, and Psychologically Safe Science
Mark Zuckerberg’s famous 2012 “Code Wins Arguments” mantra ain’t wrong. But, if you’re anything like me, you’d rather lose ALL the arguments while your team thrives and outcomes soar. In 2020, “every company is a software company” took on a new meaning mid-pandemic. I’d like to outline an org-friendly new Hacker Way that incorporates all of the software development lifecycle (SDLC) learnings of the past two decades. I’ll call it “safe science,” and aim to answer two questions:
- How has psychological safety influenced the development of software infrastructure?
- Beyond all the buzzwords, why does DevOps work?
Across the SDLC, the best decisions and most innovative leaps happen when we center our process around working prototypes. Often painfully, software engineers discover the shortfalls of interacting with imaginary software, and learn to experiment with hands-on tools. While Agile processes such as Scrum have popularized these concepts, the earliest version I know comes from Andy Hunt and Dave Thomas’s classic The Pragmatic Programmer: From Journeyman to Master. Their “tracer bullet development” approach stressed creating prototypes to validate hypotheses. They advocated providing clear, traceable roadmaps to software solutions.
But, “code wins arguments” has clear limitations. We’ve tried living by this myth of meritocracy. While from Zuckerberg’s point of view, “hacker culture is extremely open and meritocratic,” a truly open culture for all requires another essential ingredient: psychological safety.
After all, what kind of environment do we want to show off our working code in? Facebook engineers harnessed dopamine release in users with their seminal invention of the “Like” button. How can we ensure that developers who create working code and give productive feedback on it have the same dopamine-fueled reward response, and thus learn to repeat that behavior?
First, we can recognize that psychological safety is crucial to innovation and the development of high-quality, profitable software. Google’s re:Work project identifies it as, “far and away the most important” dynamic to predict team success:
“Individuals on teams with higher psychological safety are less likely to leave Google, they’re more likely to harness the power of diverse ideas from their teammates, they bring in more revenue, and they’re rated as effective twice as often by executives.”
Researchers have positively associated psychological safety at work with learning behavior, job performance and satisfaction, and productive conflict. In profiling high-performance technology organizations, Accelerate’s authors use Ron Westrum’s typology of organizational culture (30) to investigate culture in the orgs surveyed for the State of DevOps Report. They find that those with “generative culture” enjoy better safety outcomes, along with both stronger software delivery performance and overall performance (37).
Organizations with generative culture exhibit high cooperation and strong information flow. They promote risk-sharing and inter-team cooperation. They embrace novel ideas and problem-solving approaches. Instead of blaming, they treat failures as investigative opportunities. This echoes the original definition of the term “psychological safety,” coined in 1999 by Amy Edmondson as “a shared belief that the team is safe for interpersonal risk taking.”
Psychological Safety in the Brain
Why does this matter? Science again! Neuroscience, more precisely. Attachment and relationship theorists have studied these phenomena extensively. As we feel stressed by a perceived threat, like a reprimand or a combative teammate, our amygdalae signal our dorsal vagus nerves to enact C.Y.A. mode (the Fight/Flight/Freeze response). John Gottman calls this “flooding.” As Stan Tatkin puts it, “the ambassadors go offline” as the limbic system mutes or hijacks input from the highest-functioning, expensive-to-run parts of the brain. The brain regions that do perspective-taking, analytical reasoning, and creativity literally stop working. Tatkin describes those “ambassadors” in his book, Wired for Love.
At work, this might show up with fewer fireworks, but with the same neurological processes. Edmondson calls the work version of our survival response the “Anxiety Zone,” and has found that, in each moment we operate in this fearful or conservative “impression management” mode, “we rob ourselves and our colleagues of small moments of learning, and we don’t innovate.”
I repeat, what kind of environment do we want to show off our working code in? Not one that activates the amygdala, as this drastically undercuts talent; imagine paying the best and brightest only to deactivate their smartest brain regions as they clock in. Thus, return on talent investment is best realized in a psychologically safe environment.
The On-Demand Difference
What does a software development team with a “safe science” culture of both psychological safety and working prototyping look like, exactly? The Accelerate authors explain that, in high-performing organizations, “teams can deploy to production (or to end users) on demand, throughout the software delivery lifecycle.” (47)
Author Nicole Forsgren explains in a DevOps Chat that their analysis describes organizations with “elite” performance scores as those doing on-demand deployment:
“Some of these companies are deploying thousands of times a day … [but] when I say elite [performers], I’m not saying elite is thousands, I’m saying on demand and four [deployments a day] seems fine. Lead time for changes, it’s less than a day.”
Why is the ability to deploy quickly on demand more significant than the frequency or velocity of actual deployments?
For developers, that ability takes shape in software tooling. The best tools engage our reward systems (“It works!”). They link self-directed learning with psychological safety by empowering developers to self-serve and create software-defined resources. Over time, the leverage such tools create is compounded as engineers abstract and automate their real work. Early VM-based examples like Vagrant offered new experimentation opportunities. The success of Jenkins, still strongly adopted worldwide, triggered a wave of self-service CI tools that revolutionized development. Infrastructure as Code (IaC) tools expanded those early capabilities by allowing us to more easily manage and reproduce them at scale through programmatic execution. This paved the way for immutable infrastructure.
The Safe Science Stack
More recently, an open source stack has emerged that comprehensively realizes on-demand deployment capability. I dub this the “safe science” stack. On Reddit, a software engineer at Delivery Hero, a global leader in food delivery services, calls out the stack his team depends on:
“We use Terraform, Helm, and Spinnaker. Terraform’s main purpose is to provision resources (e.g., create and change infrastructure). Helm is to deploy and upgrade services (e.g., changes in environment variables or resource limits). Finally, we use Spinnaker to manage service deployments for several environments (e.g., staging and live) using pipelines.”
Mingliang Liu and Prashant Murthy share their key justifications for using Terraform as they tell their story of building HBase in the public cloud to support Salesforce’s big data workloads:
“For both mutable and immutable deployments of BigData stack in public cloud, we need to codify our infrastructure, so it can be versioned, reviewed, and audited. At Salesforce, Terraform is widely used to safely and predictably create, change, and improve cloud resources. We chose it because it is open source, advocates declarative configuration files, and supports multiple public cloud platforms.”
In a follow-up post, the authors demonstrate their use of Spinnaker pipelines to manage workloads which deploy stateless and stateful applications to both VM-based and Kubernetes infrastructure.
This stack and its variations receive frequent message board mentions and conference talk shout-outs. The special “on-demand” quality Forsgren highlights is what gives these addictive, dopamine-triggering tools their staying power. We can understand “on demand” in several senses:
- Developers can use the tool to directly deploy real resources and rapidly prototype.
- Developers can decide which public cloud to connect the tool to based on preference and use case, or leverage a private cloud.
- Developers can augment or fix the tool on-demand because it is open source.
1. Rapidly Deploy Real Resources
Terraform, for example, with its empowering plan stage and brain-friendly colorized logs, has that certain ‘je ne sais quoi’ that says, “I’m safe enough to play with, and powerful enough to realize your idea and impress your team with a real prototype!”
Such tools seem to consistently facilitate dopamine release in developers in a way that’s difficult to argue with. In my days at Puppet, this made HashiCorp the elephant in the room. Terraform brilliantly encapsulates elements of on-demand deployment, immutable infrastructure, cloud utilization, and repeatable software-defined processes in a familiar command line interface. Using such a tool is rewarding! As one developer writes, “From using it sparsely just a few years back, we’ve now reached a stage where every single component of all our environments is managed using Terraform. Hard to believe? Here’s all our infrastructure code to prove it!”
Boom. Because the tool’s artifacts are portable IaC, I can leverage this developer’s code, experiment with it on-demand, and start building prototypes to solve my problems. This tool spread like wildfire because it answers the call to realize psychological safety with productive action.
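To make the appeal of a plan stage concrete, here is a minimal Python sketch of the general "preview before you change anything" idea. It is not Terraform's actual algorithm or data model; the `Change` type, the `plan` function, and the resource names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    action: str  # "create", "update", or "delete"
    name: str    # hypothetical resource name


def plan(current: dict, desired: dict) -> list:
    """Compare current and desired state and list the changes, without applying any."""
    changes = []
    # Resources that are declared but do not exist yet.
    for name in sorted(desired.keys() - current.keys()):
        changes.append(Change("create", name))
    # Resources that exist but whose settings differ from the declaration.
    for name in sorted(desired.keys() & current.keys()):
        if desired[name] != current[name]:
            changes.append(Change("update", name))
    # Resources that exist but are no longer declared.
    for name in sorted(current.keys() - desired.keys()):
        changes.append(Change("delete", name))
    return changes


# Hypothetical state: what is running now versus what the code declares.
current = {"web": {"size": "small"}, "cache": {"size": "small"}}
desired = {"web": {"size": "large"}, "db": {"size": "medium"}}

planned = plan(current, desired)
for change in planned:
    print(f"{change.action}: {change.name}")
```

Running the sketch prints `create: db`, `update: web`, and `delete: cache`, and leaves `current` untouched. That side-effect-free preview is the part that makes experimenting feel safe: you see the consequences before anything real changes.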
2. Enable the Multi-cloud
Tools that give users the agility to manipulate infrastructure in multiple public clouds, or avoid lock-in as they build in one provider, can easily gain a foothold. The vendor lock-in fears of today echo early criticism of Microsoft, disagreement over the Xen hypervisor framework, and the perennial dispute over proprietary versus open source software.
In the case of Terraform, critics have derided the requirement to write distinct HCL for different providers, and recommend using vendor solutions like AWS CloudFormation or Azure Resource Manager to provision infrastructure native to that cloud provider. Embrace of the multi-cloud inevitably complicates workflows. But still, customers continue to ask for multi-vendor resiliency, so much so that open source projects like Terraform, Spinnaker, and Kubernetes attract integration investment from cloud providers who need business from those leery of lock-in. AWS even tacitly acknowledged this demand by deeming the ability to write custom providers for CloudFormation as table stakes; because Terraform was often able to support new AWS service workflows before CloudFormation could provide early access to them, it was critical for AWS to enable users to extend their solution to remain competitive.
In a panel on Site Reliability Engineering (SRE), when talking about ethical implications of IaaS and SaaS, SRE Craig Sebenik touched on the thorny ethics around open source and multi-cloud:
“One of the other ethical considerations to weigh is the R&D costs … The big cloud providers’ customers want more and more services to be offered. The providers offer them, which starts competing with the original creators. One [example], is Amazon recently announced its managed Kafka service yet Confluent, the original creator of Kafka, has its own service. They’re competing. AWS is using Confluent/LinkedIn’s R&D over the past decade to create a service. If you’re an AWS customer, even though it might be easier to use the hosted Kafka from AWS, is it more ethical to use Confluent?”
Terraform and Spinnaker increasingly level the playing field by encouraging integration investment beyond the most popular U.S.-based public clouds, and both accommodate deployments to Tencent Cloud, Oracle Cloud, Huawei Cloud, DC/OS, Cloud Foundry, and Alibaba Cloud.
3. Benefit Downstream with Open Source
Ethical concerns like this one fuel the movement away from proprietary leadership in software innovation, and towards open source projects as the seat of power and innovation. On the flipside of Sebenik’s remark, large companies like Amazon and Google also directly fund R&D for open source projects in hopes of capturing value from that innovation. They must do so in response to customer demand for interoperability. The best open source projects attract multiple technology industry stakeholders, as well as end-user-company stakeholders, such as large finance and media companies invested in digital transformation.
This level of activity attracts myriad small-scale contributors and tinkerers to expand project use cases. I spoke with a DevOps engineer from a media company who likes to give back by contributing to open source. In 2020 he contributed to Spinnaker, and if he contributes to another open source project this year, he says, it will be Terraform or Kubernetes. He’s pushed a few small patches to Terraform in the past, including one that triggers a diff on plan execution when a provider is coded incorrectly.
This developer’s ideas for Terraform improvements come from his observation of frequently occurring plan log errors. He’s doing the kind of collective learning-by-doing that happens in open source, and can result in incredibly mature and empowering features that rapidly improve as they’re used. Engineers notice this momentum, and it builds hopeful trust around vibrant OSS projects: “It’s nice seeing things advance so quickly; I know I’ll miss improvements and bugfixes in Spinnaker if I’m not looking at every release,” he says.
Psychological Safety & Tooling in the Enterprise
Compared to established companies, startups enjoy a faster and lighter path to DevOps adoption as they scale. They need not address the technical debt that builds in an organization over time. Still, cultural debt accumulates quickly in any company that can’t create safety. Azimo, a money-transfer startup-turned-profitable company serving over 2 million customers, leverages Terraform as part of a “risk-friendly workplace.” A tech lead from Azimo recently shared his perspective on psychological safety:
Risk-friendly workspace is not only about people and communication. To improve psychological safety, you can also introduce some techniques directly in your tech stack. First — automate testing and deployment processes. When you decide to introduce a new solution like code architecture or replace a 3rd party integration, your tests will be there to ensure that your product still works. Your tech stack also has to evolve — to adapt to new processes, bigger teams, greater complexity. Solutions that you picked at the beginning won’t be enough … At Azimo, we could migrate RxJava to RxJava2 (hundreds of files) within one 2-weeks sprint, without harming our customers.
It simply wouldn’t be possible without proper automation … and controlled rollout process. Often product delivery is delayed by very (very very) cautious software engineers who multiply edge cases and possible failures in their heads. But sometimes you won’t see a full picture until your product reaches end-users. Controlled rollout process and remote feature flags will give your engineering team more control “afterward”. They can sleep well, knowing that even when they fail, it won’t affect hundreds of thousands of users at once. To make it even better, you need to know that when something fails, this information comes to you first. With feature flags, you will give your engineering team control, but with proper data, you will give them the freedom to make decisions more autonomously.
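The controlled rollout the Azimo tech lead describes can be sketched in a few lines of Python. This is only an illustration of percentage-based feature flagging under assumed names; the flag name, the user IDs, and both functions are hypothetical, and it does not represent Azimo's implementation or any particular feature-flag service.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag only for users whose bucket falls below the rollout percentage."""
    return rollout_bucket(flag, user_id) < rollout_percent


# Hypothetical user population and flag name.
users = [f"user-{n}" for n in range(1000)]
exposed_at_5 = {u for u in users if is_enabled("new-transfer-flow", u, 5)}
exposed_at_50 = {u for u in users if is_enabled("new-transfer-flow", u, 50)}

# Widening the rollout only adds users; nobody already exposed is dropped.
print(exposed_at_5 <= exposed_at_50)
```

Because each user's bucket is derived from a hash rather than a random draw, the same user sees the same behavior on every request, and setting the percentage back to 0 switches the feature off for everyone at once. That is the "control afterward" the quote refers to.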
The software delivery industry, and increasingly, open source projects, strive to answer the question: How can large, established organizations deploy on-demand and promote psychological safety, while still safeguarding the business? Is it possible?
The people I spoke to do just that. They use safe science tooling at enterprises to deliver services with a wide variety of use cases, including:
- Prototyping infrastructure and delivery for a new video streaming service concept to validate it in the marketplace
- Engineering technology control solutions on a security team for a large international bank with thousands of software developers
- Provisioning infrastructure for SaaS and managed-hosting commercial software product offerings at scale
- Automating a schedule to stand up and tear down virtual classroom environments for software development and DevOps education
Large Organizations, Unique Requirements
Enterprise technology leadership at these organizations seeks a balance between mitigating the worst outcomes and encouraging innovative solutions. In terms of innovation, we’re beginning to quantify the impact of the psychologically unsafe SDLC. Edmondson explores this at three large organizations (Wells Fargo, Volkswagen, and Nokia) in Chapter 3 of The Fearless Organization, “Avoidable Failure”:
“Handicapped by a culture of fear,” and thus “with Nokia’s senior executives in the dark about where the company and its technology really stood, the company simply could not learn fast enough to survive.” (66)
Making potential opportunity cost for large organizations clear, Forsgren summarizes findings from the 2019 Accelerate State of DevOps Report:
For the first time, we found evidence that enterprise organizations (those with more than 5,000 employees) are lower performers than those with fewer than 5,000 employees. Heavyweight process and controls, as well as tightly coupled architectures, are some of the reasons that result in slower speed and the associated instability.
Accelerate ties that instability to psychological safety when it discusses blame-ridden failure inquiry: “Accident investigations that stop at ‘human error’ are not just bad but dangerous. Human error should, instead, be the start of the investigation.” (39) Finger-pointing evolves into learning when investigators can assume that the path to production has been architected with safety, visibility, and compliance in mind.
While minimizing negative reinforcement does increase psychological safety, an ounce of prevention may be worth a pound of cure in this case. Cloud Operations engineers at Salesforce describe how they model architecture recommendations, and use Spinnaker pipelines to create consistency across environments with similar security requirements. By templatizing pipelines, the business can ensure that the same logging and monitoring mechanisms are used for like infrastructure. From this homogeneity comes manageability at scale, and a path that guards developers against blame-worthy failures.
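As a rough illustration of what templatizing buys you, the Python sketch below builds pipeline definitions from one shared template so that every environment gets the same logging and monitoring stages. The stage names and the dictionary layout are hypothetical; this is not Spinnaker's pipeline schema or Salesforce's actual template.

```python
from copy import deepcopy

# Hypothetical stages that every pipeline must carry, whatever it deploys.
REQUIRED_STAGES = [
    {"name": "configure-logging", "type": "logging"},
    {"name": "configure-monitoring", "type": "monitoring"},
]


def render_pipeline(app: str, environment: str, deploy_stages: list) -> dict:
    """Build a pipeline definition that always begins with the required stages."""
    # Copy the shared stages so that editing one pipeline cannot alter the template.
    stages = deepcopy(REQUIRED_STAGES) + deepcopy(deploy_stages)
    return {"application": app, "environment": environment, "stages": stages}


staging = render_pipeline("hbase", "staging", [{"name": "deploy", "type": "deploy"}])
live = render_pipeline(
    "hbase",
    "live",
    [{"name": "canary", "type": "deploy"}, {"name": "deploy", "type": "deploy"}],
)

for pipeline in (staging, live):
    print(pipeline["environment"], [stage["name"] for stage in pipeline["stages"]])
```

The deploy stages differ per environment, but nobody has to remember to add logging or monitoring. The template guarantees it, which is where the protection against blame-worthy omissions comes from.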
Codifying policy and automating role-based access to computing infrastructure deemed “safe” promotes psychological safety. In enterprise environments, creating this safe path to production hinges on the ability to build regulatory compliance, access control, and monitoring visuals into delivery automation. Instead of putting employee brains into the less-innovative “Anxiety Zone” to enforce policy, safe science software encodes compliance and safety into delivery tooling.
Thus, to leverage Terraform, DevOps Engineers and their managers, particularly in banking and health care organizations, want to use RBAC to limit which roles can deploy what, and where. Organizations using Terraform with Spinnaker often seek to “reduce blast radius” and ensure that each pipeline can only make changes to specific infrastructure. Spinnaker’s automated canary deployment policies use the same principle, and target a small group of endpoints first with software updates to minimize failure impact.
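One way to picture "reducing blast radius" is a pre-flight check that rejects any planned change outside the resources a pipeline is allowed to touch. The Python sketch below assumes a hypothetical mapping of pipeline names to resource path patterns; it is not a Spinnaker or Terraform feature, just an illustration of the principle.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping: which resource paths each pipeline may change.
PIPELINE_SCOPES = {
    "payments-staging": ["staging/payments/*"],
    "payments-live": ["live/payments/*"],
}


def out_of_scope(pipeline: str, planned_resources: list) -> list:
    """Return the planned resources that fall outside the pipeline's allowed scope."""
    allowed = PIPELINE_SCOPES.get(pipeline, [])
    return [
        resource
        for resource in planned_resources
        if not any(fnmatchcase(resource, pattern) for pattern in allowed)
    ]


def authorize(pipeline: str, planned_resources: list) -> None:
    """Refuse to proceed if the plan would touch anything outside the pipeline's scope."""
    violations = out_of_scope(pipeline, planned_resources)
    if violations:
        raise PermissionError(f"{pipeline} may not change: {', '.join(violations)}")


# A staging pipeline whose plan accidentally includes a live resource.
planned = ["staging/payments/queue", "live/payments/db"]
print(out_of_scope("payments-staging", planned))
```

The check reports `live/payments/db` as out of scope, and `authorize` would raise before anything is applied. An unknown pipeline has no allowed patterns, so everything it plans is rejected by default.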
Large-scale Spinnaker users may have even more complicated authentication needs. Technology leadership at a large broadcasting company points to dynamic identity configuration as a necessary ingredient for faster development cycles. Triggering hundreds or thousands of Terraform plans in parallel from each pipeline, this organization not only needs autoscaling support, but also broad access control through integration with IAM roles, Docker repositories, an identity manager like Okta or Google SSO, and GitHub, Kubernetes, and cloud provider accounts. Portable abstraction around this access also promotes psychologically safe infrastructure. Tools like Vault allow developers to authenticate across the toolchain, and obtain secrets programmatically. This unlocks the ability to build dynamic resources, responsive to developer workflows to the point of consistent achievement of software delivery SLAs (service-level agreements).
SREs, Putting it All Together
These ingredients, integrated with safe science tools and platforms, allow enterprises to deliver on a promise of psychological safety. Site Reliability Engineers (SREs) stitch tools like Terraform and Spinnaker into a landscape that codifies the software delivery culture. They build psychological safety for developers as they set up the path to production. SREs understand the true challenges and opportunities in evolving an organization’s software delivery game. They protect business interests by optimizing availability and performance with delivery tooling, monitoring, and process. Meanwhile, in partnership with CI and other DevOps practices, they lower the barriers to success for application developers.
Now more than ever, lowering the barrier to success and empowering experimentation are vital. To keep our companies and societies alive during challenging economic times, we must squeeze all the juice from the lemon, so to speak. Avoid actual squeezing at all costs. Instead, practice safe science.