Introducing BOOM for DevSecOps
Baseline Objectives and Optimization Measurements
"The winners are usually the guys who get 5% fewer of their planes shot down, or use 5% less fuel, or get 5% more nutrition into their infantry at 95% of the cost." – Jordan Ellenberg, How Not To Be Wrong
TL;DR BOOM – Baseline Objectives and Optimization Measurements – is a framework for measuring DevSecOps capabilities. Its cornerstone is a set of six security measures: basic counts, burndown rates, arrival rates, departure rates, survival rates, and escape rates.
You can also watch the video of my recent talk: Kubernetes Security: Doing More With Less in Uncertain Times from the Silicon Valley ISSA August meeting to learn more about this framework that helps you optimize your security program for fast moving software development teams.
Bullet Holes And Bombers
During WWII, the US Navy continually lost bombers to anti-aircraft artillery. In hopes of building a more resilient aircraft, officers suggested examining all planes returning from battle. They concluded that reinforcements should go above the wings and tail section, where there were more bullet holes (see image above).
That approach seemed reasonable to everyone…except for one statistician by the name of Abraham Wald. Wald believed that the secret to building a more anti-aircraft-proof bomber lay in the planes that DIDN’T make it back. It turns out Wald was right. Planes that had damage in low impact areas (as seen above) survived. Planes that didn’t make it back had bullet holes in critical areas, where pilots sat and where the engines were located.
What was the problem here? The Navy officers were using the wrong object of measurement. If they had followed through with their plans, armor would have gone everywhere but where it was needed. The engines and cockpit would have stayed exposed, leading to a colossal waste of life and money.
We discuss the problems of having the wrong object of measurement in our first book. It is a frequent and costly error that can plague whole industries. How could battle-hardened military experts be so wrong about something so important?
The Ease of Self-Deception
We confuse what’s easy to measure, like bullet holes in returned planes, with what’s important to measure, like bullet holes in downed planes. We deceive ourselves by measuring only what is obvious, forgetting the goal of our measurement. Wald’s statistics training prepared him to catch this form of self-deception. As security people, we have to ask: are we using the right objects of measurement in security? Have we been deceived into focusing on the wrong things year after year? And am I the only one having these thoughts?
The BOOM Origin Story
I know for a fact that many security practitioners out there have similar questions. And when I say many, I mean thousands (based on what I can tell from our book’s reception). We all share an intuition that there is more than meets the eye in the security operations, optimization, and risk management game. That intuition (and frustration) led me to co-author a book dedicated to security measurement. That book led to the BOOM Framework, a second book on metrics, and lots of (100+) consulting engagements.
What I heard in my engagements, and what BOOM seeks to answer, are questions like:
- Am I remediating security risk fast enough?
- Am I creating more or less attack surface over time?
- Am I reducing the time to live of threats and vulnerabilities?
- Am I focused on the right risks?
BOOM makes answering these questions simpler.
BOOM stands for Baseline Objectives and Optimization Measurements. The purpose of BOOM is to make your security programs stronger. It does that by helping you measure, monitor and optimize your security capabilities. And while BOOM was born out of a need to measure modern DevSecOps productivity, it can be applied to traditional security environments.
There are two approaches at play in BOOM: Basic and Advanced. Or if you like: Deterministic and Probabilistic. Most people should start with Basic, which uses simple arithmetic to generate counts and averages.
“BOOM Advanced” uses what will look like “Data Science” to most readers, although the correct term is “Predictive Analytics.” Our first book is a predictive analytics book for making decisions when you have near zero data. BOOM’s advanced methods use predictive analytics to help you make better decisions with small and large data alike. The focus of those measures is forecasts and capability optimization.
Baselines And Objectives
The first component of BOOM is baselines - the fundamental measurements behind the BOOM framework. These measures baseline your capabilities over time, ultimately showing if your capabilities are optimizing (improving), scaling (keeping up) or degrading.
These baselines are used to both define and measure the second component of BOOM - objectives. Objectives define the capability “outcome” you seek to improve. An example objective might be: Reduce the time to live of customer facing exploitable vulnerabilities. You would use one or more baselines to support this objective.
In the table below, you can see how we might assemble the six basic baselines to support this objective. While this example uses all six baseline metrics, BOOM is designed to give you the flexibility to bring in only the baselines you need based on the data you have. It would be a waste of time and effort to shoehorn in every baseline for every objective. In many cases, you may only need one or two baselines to satisfy your measurement and optimization outcomes. Or, you may only have access to enough data to support a few measures. The goal is to start, not to be perfect.
The Six Baselines Defined
The first baseline above is a current count of critical vulnerabilities. For example, “As of today, you have 30 open critical vulnerabilities.” Counts – like all of the baselines – have a trend component. Trends look backwards in time, be it a week, month, or longer. In this case, the current critical vulnerability count is trended against the previous week’s count.
Counts can also have time dimensions. For example, showing the count of critical vulnerabilities older than 24 hours.
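A count baseline with a time dimension can be sketched in a few lines. This is a minimal illustration with hypothetical record layout and data, not Soluble's actual implementation:

```python
from datetime import datetime, timedelta

def count_open(vulns, now, min_age_hours=0):
    """Count open vulnerabilities, optionally only those older than min_age_hours."""
    cutoff = now - timedelta(hours=min_age_hours)
    return sum(1 for opened, closed in vulns if opened <= cutoff and closed is None)

# Hypothetical records: (opened, closed); closed is None while still open.
now = datetime(2020, 9, 1)
vulns = [
    (datetime(2020, 8, 30), None),
    (datetime(2020, 8, 31, 12, 0), None),
    (datetime(2020, 8, 1), datetime(2020, 8, 15)),
]
current = count_open(vulns, now)                  # all currently open
aged = count_open(vulns, now, min_age_hours=24)   # open for more than 24 hours
```

Running the same function over last week's snapshot gives you the trend component.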
Burndown is the ratio of risks removed to risks added. For critical customer facing risks, you would expect that ratio to trend toward 100% over time. Imagine that in month one, you had 100 new critical vulnerabilities added (or grandfathered in from previous months) and 50 remediated. That is a 50% burndown rate. Next month there are 10 new and 25 fixed. The overall burndown efficiency for the two months is 68% (75/110), with a positive trend of 18 points.
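The arithmetic above is simple enough to sketch directly, using the numbers from the worked example:

```python
def burndown(added, removed):
    """Ratio of risks removed to risks added, as a percentage."""
    return 100.0 * removed / added if added else 100.0

# The two months from the example above.
m1 = burndown(100, 50)                  # month one: 50%
overall = burndown(100 + 10, 50 + 25)   # cumulative: 75/110, about 68%
trend = overall - m1                    # roughly +18 points
```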
Advanced baselines support forecasts. Forecasts predict future rates using available risk indicators. Indicators in this case might include things like count of new container images, counts of packages, image age, and other risk predictors. Forecasts and optimization are discussed in more detail in upcoming articles.
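As a toy illustration of forecasting from a risk indicator, here is an ordinary least squares fit of new-vulnerability arrivals against the count of new container images shipped. The data and the single-indicator choice are hypothetical; a real forecast would use more indicators and more history:

```python
def ols_fit(xs, ys):
    """Slope and intercept of a simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical history: images shipped per month vs. new critical vulns.
images = [5, 10, 15, 20]
new_vulns = [12, 22, 31, 41]
slope, intercept = ols_fit(images, new_vulns)
forecast = intercept + slope * 25   # expected arrivals if 25 images ship next month
```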
The Soluble Fusion platform automatically establishes and tracks these baseline metrics for your team. The screenshot below shows the count, time to live, and a burndown baseline. While only a handful of baselines are present in this example, it demonstrates how they can work together to tell a more complete story. The critical vulnerability rate is unchanged over the previous week. Despite that, the average age of vulnerabilities is increasing. Lastly, the burndown rate is zero.
While these are baselines without a specific objective, taken together, they indicate a potentially underperforming capability. The Soluble platform is designed to help your team track your capabilities, and guide both your security and engineering teams toward constant improvement. To get a jumpstart, you can engage directly with our team of cloud native security experts to gain rapid visibility into your Kubernetes security posture, and establish the baselines and metrics that will drive the optimization of your security practices.
Arrival And Departure
Basic baselines for arrivals and departures provide a simple total count of events (be it threats or vulnerabilities) for each month, with the trend measured against the previous month. They take the raw count of critical vulnerabilities (in this case) that are popping on and off in a given month for any reason, be it planned remediation or some other change-inducing phenomenon. In cloud native environments, the state of infrastructure changes dynamically and with high velocity. This is particularly true in Kubernetes-based environments.
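A basic arrival/departure baseline is just two counts per period. A minimal sketch with hypothetical month-keyed records:

```python
def arrivals_departures(events, month):
    """events: (opened_month, closed_month or None). Counts for one month."""
    arrivals = sum(1 for opened, _ in events if opened == month)
    departures = sum(1 for _, closed in events if closed == month)
    return arrivals, departures

# Hypothetical critical vulnerabilities keyed by month number.
events = [(1, 2), (1, None), (2, 3), (2, 2), (3, None)]
arr, dep = arrivals_departures(events, month=2)
prev_arr, prev_dep = arrivals_departures(events, month=1)  # for the trend
```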
Advanced predictive analytics methods for risk arrival and departure rates forecast the ebb and flow of your attack surface and your capabilities’ overall control of it. They also weigh the value of risks and other environmental factors in controlling arrival and departure rates.
Basic baselines for survival analysis focus on average time to live (TTL) of vulnerabilities and threats. This can be seen at play in the Soluble example above. Rates are compared against the previous time period’s average TTL to create a trend.
Advanced: survival analysis, TTL analysis, and engineering failure analysis are all related methods for measuring event survival times. In this case, they measure the complete range of survival times as a probabilistic curve (seen below). The goal of these measurements is to make it easy to get more complete answers about risk survival time. For example: “50% of critical vulnerabilities live for 48 hours or longer, 10% live for two weeks or longer, and 1% live for one year or longer.” You may see an improvement in the average TTL of critical vulnerabilities while the 1% tail keeps growing in age. Pure averages would obscure your capabilities’ true performance.
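Statements like "50% live for 48 hours or longer" can be read straight off the empirical survival distribution. A minimal sketch with hypothetical TTLs; note it ignores censoring (still-open vulnerabilities), which a real survival analysis would handle with a Kaplan-Meier estimator:

```python
import math

def survival_quantile(ttls_hours, fraction):
    """Longest t such that at least `fraction` of events lived t hours or longer."""
    s = sorted(ttls_hours, reverse=True)
    k = math.ceil(fraction * len(s))   # the k-th longest-lived event
    return s[k - 1]

# Hypothetical TTLs (hours) for ten closed critical vulnerabilities.
ttls = [2, 4, 8, 16, 24, 48, 72, 96, 336, 8760]
median_ttl = survival_quantile(ttls, 0.5)   # 50% lived this long or longer
tail_ttl = survival_quantile(ttls, 0.2)     # 20% lived this long or longer
```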
Below is an example of survival analysis that shows quarter-over-quarter trending. There is a lot going on in this chart. In Q2, 80% of threats lived for 20 days or longer. In Q3, it was 72%. The colored dotted lines surrounding the solid lines represent the uncertainty on the measures.
The goal of escape rates is to measure risk movement. For example, did vulnerabilities escape (move) from pre-production into production? Or did a vulnerability escape from inside the organization to outside it?
You can think of movement as involving two dimensions: location and time. Location, as stated, can be pre-production to production, inside to outside, east to west, and so on. These can be difficult to count if your data does not include location-based context. Time-based escape rates are easier: they measure the rate at which things escape some SLA-based context. “We have a 10% critical vuln SLA escape rate.”
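A time-based SLA escape rate is a one-line calculation. A sketch with hypothetical TTLs and a hypothetical seven-day SLA:

```python
def sla_escape_rate(ttls_hours, sla_hours):
    """Fraction of vulnerabilities that outlived their SLA."""
    return sum(1 for t in ttls_hours if t > sla_hours) / len(ttls_hours)

# Hypothetical TTLs (hours); assume a 7-day (168h) SLA for criticals.
ttls = [4, 12, 20, 30, 50, 70, 90, 110, 150, 400]
rate = sla_escape_rate(ttls, sla_hours=168)   # "10% SLA escape rate"
```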
Advanced: the approach below can be thought of as a single scrum team’s critical vulnerability escapes. It could be time- and/or location-based. The green balls are vulnerabilities that were remediated before movement. The red ones are vulnerabilities that were known and not fixed before movement. Arrows represent movement.
The assumption is that you are uncertain about the total number of vulnerabilities and the rate of escapes. This uncertainty about the true rate can be effectively measured and reflected as a probability distribution, as seen in the graphs above. The middle line on the curve shows the average expected escape rate, and the outer lines show the range of our uncertainty about the true value. Month over month, data accumulates and the rate starts to baseline. Note how the distribution is less spread out (more peaked) in the third month. This increased precision is a function of more data.
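The narrowing-distribution behavior falls out of a standard beta-binomial conjugate update. This is a generic sketch of that idea with hypothetical monthly counts, not necessarily the exact model behind the charts:

```python
import math

def beta_mean_sd(escapes, caught, prior_a=1.0, prior_b=1.0):
    """Posterior mean and spread of an escape rate: Beta prior + binomial counts."""
    a, b = prior_a + escapes, prior_b + caught
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

# Hypothetical monthly (escaped, caught) counts, accumulated over time.
months = [(2, 8), (3, 17), (4, 36)]
e = c = 0
spreads = []
for esc, caught in months:
    e, c = e + esc, c + caught
    mean, sd = beta_mean_sd(e, c)
    spreads.append(sd)
# The spread shrinks each month: the distribution gets more peaked as data grows.
```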
More advanced approaches can help you measure across scrum teams, even when the data may be small. Below is escape rate data on 50 scrum teams. The vertical red line is the overall average escape rate, which is roughly 10%.
The black dots are the average escape rates per team. The horizontal bars express the uncertainty about the average given the amount of data per team. The very first team at the top of the chart has an expected average around 20%. But, given that the amount of data is somewhat sparse, it may be as low as 16% or as high as 25%. When the data is sparse, the uncertainty is spread out; the more data, the less the uncertainty.
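The widening of a team's bar under sparse data can be illustrated with a simple per-team posterior interval. This is a rough normal-approximation sketch with hypothetical teams, not the partial-pooling model a full cross-team analysis would use:

```python
import math

def rate_interval(escapes, total, z=1.64):
    """Approximate 90% band for a team's escape rate from a Beta(1,1) posterior."""
    a, b = 1 + escapes, 1 + (total - escapes)
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, max(0.0, mean - z * sd), min(1.0, mean + z * sd)

# Two hypothetical teams with the same observed 20% rate, different data volume.
sparse = rate_interval(4, 20)     # few releases: wide band
dense = rate_interval(40, 200)    # many releases: tight band
sparse_width = sparse[2] - sparse[1]
dense_width = dense[2] - dense[1]
```

The sparse team's band is several times wider, which is exactly what the horizontal bars in the chart express.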
In the next articles, I will focus on Optimization and Measurements – the third and fourth letters in the BOOM framework. Optimization is a method for figuring out what matters most in controlling an objective. Measurements define the various analytical methods used in BOOM. I will shield you from all the mathy complexity and focus on the concepts behind them.
In subsequent articles, I will continue to unpack each section of the BOOM framework, leading up to the publication of my next book, “The Metrics Manifesto: Confronting Security With Data”, which will outline the method in excruciating detail alongside code samples that put the ideas into practice. As always, if you find this topic interesting and would like to learn more about how we use these methods at Soluble, or would like to engage with our services for BOOM help, don’t hesitate to contact us!