New Delhi, April 8 -- As AI-assisted code generation compresses software development cycles, engineering teams are shipping larger volumes of code faster than traditional quality assurance processes can handle. The resulting gap between deployment speed and production reliability is pushing enterprises to rethink how they approach testing. In a conversation with TechCircle, Umasankar Mukkura, founder of Chaos Native and VP of Product at Harness, spoke about why resilience testing is becoming a mandatory gate in modern software delivery pipelines, and where chaos engineering fits into that shift.

Edited Excerpts:

Why is the traditional approach to testing no longer sufficient for modern, distributed systems?

Testing comes after the code, and code is no longer being written the way it used to be. It is now being written with the help of LLMs (large language models), and that is where the major shift is happening. Code is being generated at around ten times the speed and ten times the volume. So if you are testing with a fixed number of people over a fixed amount of time, that model no longer holds; you are receiving ten times the volume.

Software delivery has to be made AI-native or AI-enabled, too. There are many types of testing. Functional testing can be handled to some extent by the code itself, but there is a lot more that needs to happen before the code reaches production. It needs to be secure, resilient, and reliable. You need to bring in AI agents to help fast-track testing across all these dimensions. The question we focus on at Harness is: how do you use AI to deliver the code you write as quickly and securely as possible into production, and then manage its reliability once it is there?

What are the biggest gaps today between how fast engineering teams ship software and how reliably those systems perform in production?

Even before the AI shift, the industry had already moved to a cloud-native world: Kubernetes-based systems with a growing number of microservices delivering at faster and faster cycles. The number of APIs you need to manage and the number of dependencies have multiplied. This is where testing beyond traditional quality-based testing becomes necessary, because you often do not know who you depend on or how quickly those dependencies are changing. The unknown unknowns become too many.

That is why chaos engineering is particularly useful here. In chaos engineering, you do not need to know the precise list of APIs. You introduce a fault and ask: is my system resilient? You increase the load and ask the same question. You simulate a disaster. The resilience is tested in a methodical way using AI, regardless of how fast code is arriving or how many dependencies have changed.
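The loop described above, verify a steady state, introduce a fault, and ask whether the system is still resilient, can be sketched in a few lines of Python. This is a deliberately toy model for illustration: the service dictionary, the error-rate threshold, and the retry behaviour are all hypothetical, not the API of LitmusChaos or any real chaos platform, which drive the same steps against live infrastructure.

```python
def steady_state(service):
    """Steady-state hypothesis: the service is up and mostly error-free."""
    return service["healthy"] and service["error_rate"] < 0.05

def inject_latency_fault(service):
    """Hypothetical fault: a slow dependency causes transient errors."""
    service["error_rate"] += 0.10

def run_experiment(service):
    """Verify steady state, inject the fault, then ask: is the system resilient?"""
    if not steady_state(service):
        return "aborted: not steady before the fault"
    inject_latency_fault(service)
    if service["retries"] > 0:
        # A service configured to retry absorbs the transient errors.
        service["error_rate"] = 0.01
    return "resilient" if steady_state(service) else "not resilient"

print(run_experiment({"healthy": True, "error_rate": 0.01, "retries": 3}))  # -> resilient
print(run_experiment({"healthy": True, "error_rate": 0.01, "retries": 0}))  # -> not resilient
```

The point of the pattern is that the experiment never needs the full list of APIs or dependencies; it only needs a steady-state check and a fault to inject.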

Modern testing is about chaos testing, better load testing, and disaster recovery (DR) testing. We group all of these under what we call resilience testing.

Chaos engineering has been around for some time, but what limitations has the industry encountered that are pushing teams toward this broader resilience testing model?

The word chaos tends to be received with some resistance; people say they already have enough chaos, so why would they introduce more? But in the last five years, the industry has genuinely understood the need for chaos engineering. Eight or nine years ago, I spent a lot of time explaining what chaos engineering is and why it matters. Today, most of the industry understands the concept. The challenge has shifted from introduction to scale.

The question now is: how do you get 70 percent of your applications through chaos testing? That requires a change in approach, away from deep infrastructure-based chaos testing and toward application-based chaos testing, which requires fewer permissions to introduce faults. You can be more precise about mimicking a fault without having to take down a large network segment or load balancer.

AI has helped teams move toward an application chaos model that they are more comfortable adopting. For example, throwing specific exceptions into a given module rather than taking down part of the network to recreate an incident. This approach of application chaos has helped organisations move faster along the path of adopting chaos engineering.

How have cloud-native architectures, microservices, and distributed APIs changed the nature of failures that teams need to anticipate?

The distributed and cloud-native nature of Kubernetes means that failures are introduced naturally by design. In earlier systems, you had to physically switch off hardware to simulate the deletion of a machine. In cloud-native architecture, pod deletes, where a container instance is removed and respun elsewhere, happen as a normal part of system reconciliation. When traffic on one node increases, workloads shift to a larger node. A pod is deleted and spun up on the other side.

That is exactly what warranted chaos engineering. Around five to seven years ago, I created an open-source project called LitmusChaos specifically for cloud-native systems. It was donated to the CNCF (Cloud Native Computing Foundation), where it is hosted as an incubating project, and thousands of organisations are now using it. Chaos engineering is a natural requirement for cloud-native systems because pods keep deleting and workloads keep moving across hardware without users realising it.

Once organisations understood that chaos engineering was necessary, the next question became how to do it more efficiently. That is when AI and enterprise-grade resilience testing platforms became important: discovering all microservices, which often number in the hundreds, and automatically generating chaos experiments for the most common scenarios, so teams can select and run the relevant ones rather than spending time figuring out what their microservices are and how to build experiments around them. AI can help with all of those optimisation layers.

Are we seeing new categories of failures today that did not exist in pre-cloud systems?

Yes, definitely. Application chaos has evolved considerably, and configuration changes themselves are now treated as a form of chaos. For example, if a Lambda function is supposed to have a timeout of five seconds, what happens if it is set to two seconds or ten seconds? That configuration change is itself a chaos scenario.

There is also a new category emerging around AI agents, which are becoming more widespread and operate based on prompts. You can introduce chaos into an AI agent by mutating the query, giving it a twisted or malformed input, and observing whether the agent crashes or goes into a degraded state. Prompt engineering as a source of chaos is a new fault type that I expect to appear in the market in the near term. Most chaos platforms will likely support this.

Do you think engineering teams are unintentionally introducing more risk into production systems as AI accelerates code generation and deployment?

Yes, that is precisely why the outer loop of testing, testing beyond what happens inside the development environment, is so important. It is entirely unintentional. There is enormous pressure on engineering teams to operate at ten times the speed, and when you generate that volume of code, resilience risks can creep into production without anyone intending it.

That is where resilience testing needs to function as a gate before code reaches production, and it has to be automated. It cannot be manual. Use as much AI and automation as possible to minimise the chances of resilience risks reaching production. Otherwise, the equation becomes counter-productive: you are efficient at generating code, but you may not be as efficient as you once were at keeping your systems resilient. A small failure could trigger a major outage weeks later.
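An automated gate of this kind is, at its core, a simple promotion decision over the resilience-test results. The result format and the pass-rate threshold here are assumptions for illustration, not a Harness API; in a real pipeline this check would sit alongside the security scan and block the deploy stage.

```python
def resilience_gate(results, required_pass_rate=1.0):
    """Block promotion unless enough resilience checks passed.
    `results` maps experiment name -> 'pass' or 'fail' (assumed format)."""
    total = len(results)
    passed = sum(1 for outcome in results.values() if outcome == "pass")
    ok = total > 0 and passed / total >= required_pass_rate
    return {"promote": ok, "passed": passed, "total": total}

checks = {"pod-delete": "pass", "latency-injection": "pass", "zone-failover": "fail"}
print(resilience_gate(checks))  # -> {'promote': False, 'passed': 2, 'total': 3}
```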

How does continuous resilience validation fit into AI-powered development pipelines? Is it a core control layer?

It is a core control layer. In my view, it is an expected and mandated gate in the software delivery cycle, much like security scanning. You do not release code without checking for vulnerabilities; resilience testing should carry the same status.

Within continuous resilience, there are essentially three types of testing. First, testing resilience against small, expected failures. Second, testing against heavy load and a combination of load and chaos. Third, DR (disaster recovery) testing, not just for data centre outages, but for smaller-scale failures such as availability zone or regional failures, which cloud providers can experience momentarily, given how many regions they now operate.

AI agents are themselves becoming part of making this process more efficient. Rather than running hundreds of tests in every pipeline cycle, AI can identify which chaos experiments are relevant to the specific changes introduced in a given deployment and run only those. That optimisation is important because running everything every time can be time-consuming and counterproductive.
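Selecting only the experiments relevant to a given deployment reduces, in essence, to mapping the changed services (and their direct dependents) onto an experiment catalogue. The catalogue structure below is illustrative, not any platform's schema, and a real system would derive the dependency graph from service discovery rather than a hand-written map.

```python
def select_experiments(changed_services, catalog):
    """Pick only the experiments targeting services touched by this
    deployment, plus their direct dependents."""
    impacted = set(changed_services)
    for svc, meta in catalog.items():
        if impacted & set(meta["depends_on"]):
            impacted.add(svc)  # svc directly depends on a changed service
    return sorted(
        exp
        for svc in impacted
        for exp in catalog.get(svc, {}).get("experiments", [])
    )

catalog = {
    "payments": {"depends_on": ["ledger"], "experiments": ["payments-pod-delete"]},
    "ledger": {"depends_on": [], "experiments": ["ledger-latency"]},
    "search": {"depends_on": [], "experiments": ["search-cpu-hog"]},
}
print(select_experiments(["ledger"], catalog))
# -> ['ledger-latency', 'payments-pod-delete']
```

A change to `ledger` pulls in the experiments for `ledger` itself and for `payments`, which depends on it, while leaving the unrelated `search` experiments out of the pipeline run.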

At Harness, the resilience testing product includes AI agents called AI Reliability Agents, which are designed to optimise the chaos testing process itself.

How has the open-source ecosystem shaped the evolution of chaos engineering and resilience practices?

Open source is central to this ecosystem. As the founder of LitmusChaos, I can say that open source has played a major role. I wear two hats: co-maintainer of LitmusChaos, and VP of Product at Harness, focused on helping organisations do chaos testing more efficiently.

Many organisations are still sceptical about chaos testing and are not prepared to allocate a budget for enterprise tools. That is where open-source tools like LitmusChaos serve an important function. SREs (site reliability engineers) or QA engineers can introduce basic chaos testing using Litmus into their pipelines and demonstrate value internally. Once they have done that, the questions shift to scale, security, governance, service discovery, and AI enablement, and that is when enterprise tooling becomes relevant.

LitmusChaos is being used by around 2,000 organisations, which would not have been achievable in the same timeframe through commercial channels alone. Open source handles the initial adoption and spread; enterprise tools help organisations scale, govern, and go faster.

Are enterprises still building in-house resilience tooling, or is the market consolidating around platforms?

I would say 10 to 20 percent of organisations will continue to build and run their own chaos tooling, primarily because they operate without a dedicated budget for this. That is unlikely to change entirely. But as organisations grow larger and recognise the financial risk of underinvesting in resilience testing, they tend to look for platform consolidation: a software delivery platform where security, faster delivery, and resilience and reliability are all addressed together.

Larger organisations are actively looking for that kind of consolidation. Smaller or unbudgeted organisations remain tied to open-source tools, and that will continue. Every new startup, for example, finds it far easier to download Litmus, add it to their pipelines, and operate that way for the first six to twelve months without going through procurement cycles.

I am also seeing banks that have been running their own approach for three to four years now, recognising that moving to AI-based resilience testing would take them two to three years to build internally, and choosing instead to work with a platform like Harness where the capability already exists.

Will the term 'chaos engineering' fade away, replaced by resilience testing?

It will remain a core discipline. Resilience testing, as I see it, is a holistic category that encompasses two to three distinct types of testing. One is chaos testing. Another is load testing: not just occasional performance or scaling tests, but load testing mandated in every pipeline cycle because of the volume of code changes. There is always the possibility that new code will fail under load. And then there is the combination: what happens when high load and a small failure coincide? Can a zone failure under load trigger a disaster declaration?

Chaos, load, and DR testing together are what we now term resilience testing. Two quarters ago, I renamed the product from Harness Chaos Engineering to Harness Resilience Testing, which now covers chaos, load, and DR. Resilience testing is a broader discipline than chaos alone.

Do you see a future where systems become self-healing and autonomously resilient, and what role do platforms like Harness play in that?

Self-healing and autonomous resilience are separate disciplines from continuous resilience testing. They are also typically managed by different people. Self-healing is primarily the responsibility of developers and platform engineering teams. The efficiency of self-healing will improve over time.

But resilience testing as a discipline will need to grow in importance over the next three to five years, because there is a significant amount of resilience debt accumulating as AI-generated code enters production without adequate testing. That debt needs to be addressed with every pipeline cycle. I do not think resilience testing will fade away; if anything, it will become more important in the near term.

Beyond three to five years, I cannot predict with confidence. But in the next two to three years, resilience testing is going to receive considerably more attention and investment.

Published by HT Digital Content Services with permission from TechCircle.