The first post discussed why traditional security concepts are insufficient for the cloud era. This post introduces Security Chaos Engineering (SCE) as an alternative approach to assessing cloud security. We first define the term and discuss the background of its development, then give a general overview of the process and some example use cases.
Chaos Engineering (CE) is “experimenting on a distributed system to build confidence in its capability to withstand turbulent conditions in production”. This means that instead of waiting for systems to fail, developers should proactively introduce failures through experiments under controlled conditions. From the CE perspective, every system will fail at some point. A service failure can therefore be considered normal system behavior and thus an opportunity to learn and to improve the system. CE focuses on the availability of systems and aims to minimize downtime for customers while increasing resilience.
We define SCE as the “discipline of instrumentation, identification and remediation of failure within security controls through proactive experimentation to build confidence in the system’s ability to defend against malicious conditions in production”.
It extends CE by focusing not only on the availability of a system but also on its integrity and confidentiality. Hence, this approach also helps to ensure, for example, the integrity and confidentiality of credit card data stored in the system. SCE experiments follow the scientific method. Unlike tests, their goal is not to validate what is already known to be true or false. Rather, they aim to gain new insights into the current state of systems.
We summarize the process of implementing SCE experiments as follows:
- Document steady state
The steady state defines the normal operational state of the system, including a definition of all relevant variables and factors.
- Design hypothesis
Potential faults are codified as hypotheses to verify that security tools, controls, and attributes behave as expected. Each hypothesis is formulated as follows: “In the event of X, we believe that our system will respond with Y.” Each fault is represented by one hypothesis.
- Contain blast radius
Before implementing the experiment, the company must consider potential adverse impacts it could have. To minimize those impacts, it is advisable to start with small, simple experiments and increase the maturity step by step.
- Develop fallback plans
A fallback plan includes the system configuration that the company can restore if an experiment fails. Knowing that a rollback is possible increases confidence in both the system and the experiment.
- Code and execute experiments
The experiments are coded and selected for execution based on their criticality for the company. After each experiment, the steady state is reestablished.
- Analyze the outcome of the experiments
The outcome of each experiment is categorized as prevented, remediated, or detected. An alert at the end of the experiment contains detailed contextual information that helps to understand the conditions leading to the observed outcome.
- Automate experiments for continual use
Because systems and infrastructure change quickly, continuous experimentation is necessary. Hence, as a final step, the experiments are automated and executed regularly. Developers then do not have to run experiments manually and can focus on their regular workload.
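The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the `Experiment` class, the `run_experiment` function, and the toy in-memory "system" (a set of open ports) are hypothetical names introduced here. Only the hypothesis template and the three outcome categories come from the process described above. A return value of `None` from `observe` is our own convention for "no control reacted".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    """The three outcome categories named in the process description."""
    PREVENTED = "prevented"
    REMEDIATED = "remediated"
    DETECTED = "detected"


@dataclass
class Experiment:
    """Hypothetical container for one experiment (one fault, one hypothesis)."""
    name: str
    event: str                                 # the "X" of the hypothesis
    expected_response: str                     # the "Y" of the hypothesis
    read_state: Callable[[], object]           # snapshot of the relevant variables
    inject: Callable[[], None]                 # introduces the fault
    observe: Callable[[], Optional[Outcome]]   # None = no control reacted (assumption)
    rollback: Callable[[], None]               # the fallback plan

    @property
    def hypothesis(self) -> str:
        return (f"In the event of {self.event}, we believe that our "
                f"system will respond with {self.expected_response}.")


def run_experiment(exp: Experiment) -> dict:
    steady_state = exp.read_state()            # document the steady state
    try:
        exp.inject()                           # execute the experiment
        outcome = exp.observe()                # analyze the outcome
    finally:
        exp.rollback()                         # always apply the fallback plan
    return {
        "experiment": exp.name,
        "hypothesis": exp.hypothesis,
        "outcome": outcome.value if outcome else None,
        "steady_state_restored": exp.read_state() == steady_state,
    }


# Toy system: a set of open ports plus a control that closes port 8080.
open_ports = {443}


def toy_control() -> Optional[Outcome]:
    if 8080 in open_ports:
        open_ports.discard(8080)
        return Outcome.REMEDIATED
    return None


demo = Experiment(
    name="unauthorized-port-change",
    event="an unauthorized port being opened",
    expected_response="closing the port",
    read_state=lambda: frozenset(open_ports),
    inject=lambda: open_ports.add(8080),
    observe=toy_control,
    rollback=lambda: open_ports.discard(8080),
)

report = run_experiment(demo)
print(report)
```

Keeping the blast radius small corresponds here to what `inject` is allowed to touch. For continual use, a function like `run_experiment` could be called from whatever scheduler or pipeline the team already operates.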
At the CE conference Conf42 in 2021, Aaron Rinehart and David Lavezzo introduced an example of SCE, a misconfigured port injection.
The experiment performs an unauthorized port change in an AWS EC2 security group. The expectation was that the firewall would immediately detect and block this change. However, this happened in only 60% of the cases. The cloud-native configuration management, on the other hand, caught and blocked the change almost every time. Both also sent usable log data to the security logging tool, and an alert was raised and sent to the Security Operations Center (SOC). However, the SOC did not know how to handle this alert because the metadata needed to attribute it to the originating account was missing. Besides this experiment, further examples of potential experiments include disabling noncritical roles and functions in an API, removing resource segmentation, overwriting building logs, deleting log files, and modifying boot files.
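A port experiment of this kind could be sketched roughly as follows. This is not the code shown in the talk. It assumes that `ec2` is a boto3 EC2 client (created with `boto3.client("ec2")`) and uses its `authorize_security_group_ingress`, `describe_security_groups`, and `revoke_security_group_ingress` calls. The helper names, the polling approach, and the two result strings are our own assumptions. The sketch only checks whether the change gets reverted within a time window.

```python
import time


def port_rule(port: int, cidr: str = "0.0.0.0/0") -> dict:
    """Ingress rule that opens a single TCP port (illustrative helper)."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


def port_is_open(ec2, group_id: str, port: int) -> bool:
    """Check whether the security group currently allows the port."""
    groups = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"]
    return any(
        perm.get("FromPort") == port and perm.get("ToPort") == port
        for group in groups
        for perm in group.get("IpPermissions", [])
    )


def inject_unauthorized_port(ec2, group_id: str, port: int,
                             wait_seconds: float = 60.0,
                             poll_seconds: float = 5.0,
                             sleep=time.sleep) -> str:
    """Open a port, wait for a control to revert it, then clean up."""
    rule = port_rule(port)
    ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=[rule])
    try:
        waited = 0.0
        while waited < wait_seconds:
            if not port_is_open(ec2, group_id, port):
                return "remediated"       # a control removed the rule
            sleep(poll_seconds)
            waited += poll_seconds
        return "not remediated"           # the rule survived the whole window
    finally:
        # Fallback plan: never leave the injected rule behind.
        if port_is_open(ec2, group_id, port):
            ec2.revoke_security_group_ingress(GroupId=group_id,
                                              IpPermissions=[rule])
```

If a preventive control rejects the change outright, boto3 raises a `botocore.exceptions.ClientError` on the authorize call. A fuller version would catch that and record the outcome as prevented. Whether an alert with sufficient metadata reached the SOC, which was the actual finding of the experiment, has to be verified separately in the logging and alerting tools.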
To sum up, Security Chaos Engineering uses proactive experimentation to check whether a system behaves as expected, in order to ensure its availability as well as its integrity and confidentiality. The goal is to discover unknown vulnerabilities in a system. Thus, a prerequisite for establishing SCE is that the system is already considered as secure as possible. Even though SCE matches the security demands of the cloud, it should not replace traditional security approaches but complement them. We briefly summarize SCE and important traditional approaches to clarify the differences between the two concepts.
If you are interested in more details on how A&B security experts can help establish a Security Chaos Engineering culture in your company, have a look at our SCE Program or contact us at Alice&Bob.Company!
Resources used and interesting content on this topic:
- Rinehart, Aaron, and Nwatu, Charles – Security Chaos Engineering: A new paradigm for cybersecurity (2018) (https://opensource.com/article/18/1/new-paradigm-cybersecurity, last accessed 13.06.2022)
- Rinehart, Aaron, and Kelly Shortridge – Security Chaos Engineering (2020)
- Rinehart, Aaron – Security Chaos Engineering: How to Security Differently (2021) (https://www.verica.io/blog/security-chaos-engineering-how-to-security-differently/, last accessed 13.06.2022)
- Basiri, Ali, et al. “Automating chaos experiments in production.” 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 2019
- Basiri, Ali, et al. “Chaos engineering.” IEEE Software 33.3 (2016): 35-41.
- Torkura, Kennedy A., et al. “Security chaos engineering for cloud services: Work in progress.” 2019 IEEE 18th International Symposium on Network Computing and Applications (NCA). IEEE, 2019
- Torkura, Kennedy A., et al. “Cloudstrike: Chaos engineering for security and resiliency in cloud infrastructure.” IEEE Access 8 (2020): 123044-123060.
- Torkura, Kennedy A., et al. “Continuous auditing and threat detection in multi-cloud infrastructure.” Computers & Security 102 (2021): 102124.
- Combs, Veronica (2021): Security chaos engineering helps you find weak links in your cyber defenses before attackers do (https://www.techrepublic.com/article/security-chaos-engineering-helps-you-find-weak-links-in-your-cyber-defenses-before-attackers-do/, last accessed 14.06.2022)
- Podjarny, Guy, and Rinehart, Aaron: Security Chaos Engineering – What is it and why should you care? (https://www.devseccon.com/the-secure-developer-podcast/ep-67-security-chaos-engineering-what-is-it-and-why-should-you-care, last accessed 14.06.2022)