Minimizing Tech Debt With IaC

Rob Schoening

Ways to Eliminate Remediation and Rework for IaC

“A rolling stone gathers no moss.” Publilius Syrus

Summary

As organizations adopt Infrastructure-as-Code (IaC) platforms – like Terraform, CloudFormation, and Kubernetes – to manage cloud infrastructure, tech debt can accumulate quickly. With the right approach, we can save ourselves from much of the time and toil involved with security remediation.

Use of IaC Across Teams and the Potential for Tech Debt

The phrase “technical debt” originated when developer Ward Cunningham described the need for budgeting resources to refactor a product that had been delivered. Now, tech debt is widely accepted as the unwanted technical side-effects created as the result of rushed development.

For better or worse, modern application development values speed and innovation over security. Compounding this, security problems tend to be found late in the development process, after services have been developed and deployed to customers. At this stage, problems that would have been trivial to solve at the outset can require unplanned remediation work.

As organizations adopt IaC as the preferred mechanism to provision and manage cloud infrastructure, there is a high potential for technical debt to be created along the way. There are a few reasons for this:

  1. The number of cloud resources continues to grow. More cloud services. More cloud spend. This is obvious, but the trend is not changing.
  2. There are simply more people authoring IaC. As organizations restructure their development teams into decentralized models that value freedom and responsibility, more teams and individuals end up authoring and defining infrastructure. The approach needed to deliver quality software with a small team differs from the approach needed to deliver the same level of quality across a large organization.
  3. IaC authors are increasingly not infrastructure experts. Application development teams are authoring their own IaC. On balance, this is a good thing for the industry because it removes gatekeeping bottlenecks of centralized infrastructure and platform teams from the delivery process. However, there is substantial tribal knowledge (“don’t do that…bad things will happen”) that becomes difficult to replicate. It is not reasonable to expect everyone to be an expert on infrastructure, much less infrastructure security.
  4. IaC is not particularly enjoyable to author or review. Unlike application code, which tends to be a source of pride for developers, IaC is more like a necessary evil to get to the desired outcome. With changes, it’s hard to reason about the impact in code review. Is this change going to result in a security remediation project 90 days from now? That is often a very difficult question to answer. Often it is simply ignored.

As an industry, we need to acknowledge these changes and accept that we need a new approach to enable organizations to use IaC safely and efficiently at scale.

Terraform by Stack Overflow

IaC has found success because it balances efficiency with repeatability. IaC is committed to Git so that change can be managed like any other code change. That much is great.

But the actual process of authoring declarative configuration, like Terraform, CloudFormation, and Kubernetes, is a dramatically different experience from authoring application code.

Developers with some expertise in their language of choice can typically author working code with little more than a basic text editor.

IaC domain-specific languages are intricate and require constant reference to documentation or other working code, usually the latter. Few people remember which stanzas of a Kubernetes manifest accept security-critical configuration. So what do they do? They copy from prior code – if it exists – and resort to Google and Stack Overflow if it doesn’t.

The reality: Your infrastructure is a patchwork of copy-and-paste of unknown origin. Do you trust that?

A Rolling Stone Gathers No Moss

With IaC, it is easy to omit configuration, and that omission leads to security tech debt. Problems with application code are typically visible as “bad code” – things that you can see. IaC is different.

Most IaC security mistakes are mistakes of omission. It’s very easy to say “Looks good to me” and be factually correct about that statement. But the problems lurk in the code that wasn’t written.

Take, for example, Elasticsearch and encryption at rest. If you are provisioning an Elasticsearch cluster in a console, you’ll be presented with options for encryption at rest. It’s difficult to miss the options that are available in the console. If you chose not to provision with encryption at rest, it was very likely a conscious choice.

Here are the relevant options from the AWS Console:

[Screenshot: encryption-at-rest options in the AWS Console]

By contrast, a minimal stanza to create an Elasticsearch cluster looks like the following in Terraform.
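As a rough sketch, assuming the AWS provider's aws_elasticsearch_domain resource (the domain name, version, and sizing are illustrative placeholders), such a stanza might be:

```hcl
# Hypothetical minimal Elasticsearch domain; all values are placeholders.
resource "aws_elasticsearch_domain" "example" {
  domain_name           = "example"
  elasticsearch_version = "7.10"

  cluster_config {
    instance_type = "m5.large.elasticsearch"
  }

  # EBS storage for the data nodes.
  ebs_options {
    ebs_enabled = true
    volume_size = 10
  }
}
```

Nothing in it mentions encryption, and nothing flags the absence.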

Unless you are working from a template that has all the options available or you consult the documentation, you simply don’t know what you don’t know.

The encryption-at-rest setting isn’t visible to you because you haven’t written it yet! You didn’t even know that you needed it.

You apply this configuration, find that the cluster works, and move on.

What isn’t always obvious is that, once deployed, stateful systems carry immense tech-debt inertia. Encryption at rest can’t be added after the fact. Once the cluster persists data, fixing it means recreating the cluster and reloading the data. This involves coordination, if not some service interruption.

So once it’s deployed to the cloud, you are acquiring tech debt. If you fix it before it is live, it is nothing more than a boring security defect that is trivial to fix. A rolling stone gathers no moss.
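To make “trivial to fix” concrete: caught before the first apply, the change is a handful of lines. The sketch below assumes the AWS provider's aws_elasticsearch_domain resource and uses placeholder values throughout; it is one plausible shape of the fix, not a complete hardening guide.

```hcl
# Hypothetical Elasticsearch domain with encryption at rest declared up front.
resource "aws_elasticsearch_domain" "example" {
  domain_name           = "example"
  elasticsearch_version = "7.10"

  cluster_config {
    instance_type = "m5.large.elasticsearch"
  }

  ebs_options {
    ebs_enabled = true
    volume_size = 10
  }

  # The setting that is easy to omit: encrypt stored data.
  # With no kms_key_id given, the provider falls back to the AWS-managed key.
  encrypt_at_rest {
    enabled = true
  }
}
```

A reviewer can approve these lines in seconds; the same change after data has landed is a migration.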

The Solution: Test IaC Like You Test Application Code

It is hard to remember now, but a decade ago, there was often strong resistance to testing application code with unit and integration tests in a continuous integration environment. The value of running unit and integration tests in CI was often dismissed unless it could be shown that it could completely displace other forms of testing.

This was the wrong way to look at the problem. In reality, you achieve a positive ROI on unit and integration testing almost immediately by reducing the volume of defects that flow downstream to other, more labor-intensive testing processes.

You don’t have to be perfect, just sufficiently better so that you don’t waste everyone’s time with broken builds and broken code. The same applies to IaC: we now have the chance to take this approach with Terraform, Kubernetes, CloudFormation, and other IaC systems.

This is where Soluble enters the picture. Our platform, Soluble Fusion, orchestrates static IaC assessments early in development. We schedule static analysis of your IaC as part of your standard deployment process. As each commit is pushed and each Pull Request (PR) is opened, Soluble performs a security assessment and pushes the result back to your Git provider, where developers can see it along with clear guidance on how to fix the issues without leaving their workflows.

The result:

  • IaC authors get fast feedback on their changes as they make them
  • Developers can fix security issues before the security team sees them or knows that they exist
  • Development teams look good to security because they deliver fewer problems, not because they are responsive to remediation
  • Security review is policy-driven and not subject to the whims and graces of the code review staff
  • IaC code review workload is minimized so highly skilled individuals can focus their efforts on higher-value work
  • The vulnerable attack surface shrinks because fewer out-of-policy configurations are pushed live
  • Security remediation activity costs go down because there are simply fewer items to remediate

So if you don’t do these things, what happens?

Mistakes are made. Configuration vulnerabilities are typically identified days, weeks, or months after the change. By the time they are caught, the authors have moved on to other tasks. Finding and allocating the right resources to do the necessary remediation is now a small project. Those projects, unless deemed critical, sink to the bottom of the backlog to become technical debt that accumulates over time. Unless the service is retired early, the risk and cost to fix only increase with time.

Are you using IaC, and do you want to minimize tech debt and save your future self from the drudgery of remediation work? We can help. Connect to the platform to get started, or contact us for a demo.