Previously in the series, we saw that AI tools have limitations, and we discussed how to maximize our productivity by using them to write code - without shooting ourselves in the foot.
In this post, we’ll cover one more piece of the puzzle: how they can (or can’t) help us to refactor.
This post was heavily influenced by thoughtworks’ excellent podcast episode Refactoring with AI.
The Goal of Refactoring
If we want to boil it down, refactoring has a simple goal: to make it easier to change the code.
People usually equate this with improving code quality. Most of the time, this is indeed what we want to achieve. Higher-quality code is easier to understand, reason about, and, as a result, modify.
However, there is a case where we decrease code quality during refactoring.
Sometimes, we need to restructure the code so that we can introduce new behavior later. For example, suppose we have a Dog class:
class Dog {
    String communicate() {
        return "woof";
    }

    String move() {
        return "run";
    }
}
We realize that we want to have other types of pets, so we introduce a Pet superclass:
abstract class Pet {
    abstract String communicate();

    abstract String move();
}

class Dog extends Pet {
    @Override
    String communicate() {
        return "woof";
    }

    @Override
    String move() {
        return "run";
    }
}
The code became unnecessarily complicated for the current feature set. However, it's now much easier to introduce cats to the codebase:
class Cat extends Pet {
    @Override
    String communicate() {
        return "meow";
    }

    @Override
    String move() {
        return "climb";
    }
}
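To see the payoff, here is a minimal usage sketch. Only Pet, Dog, and Cat come from the examples above; the demo class itself is hypothetical. Any code written against the Pet abstraction keeps working unchanged as we add new pet types.

import java.util.List;

class PetDemo {
    public static void main(String[] args) {
        // This code only knows about the Pet abstraction, so adding Cat
        // (or any future pet) requires no changes here.
        List<Pet> pets = List.of(new Dog(), new Cat());
        for (Pet pet : pets) {
            System.out.println(pet.communicate() + " / " + pet.move());
        }
    }
}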
In the rest of the article, we’ll focus on code quality improvement because that’s the primary motivation for refactoring.
Measuring Code Quality
We want to quantify code quality objectively. Otherwise, it'd be hard to tell whether a refactoring actually improved it.
CodeScene has a metric called code health. Adam Tornhill, the CTO and founder of the company, explains it in the following way:
The interesting thing with code quality is that there's not a single metric that can capture a multifaceted concept like that. I mean, no matter what metric you come up with, there's always a way around it, or there's always a counter case. What we've been working on for the past six to seven years is to develop the code health metric.
The idea with code health is that, instead of looking at a single metric, you look at a bunch of metrics that complement each other. What we did was that we looked at 25 metrics that we know from research that they correlate with an increased challenge in understanding the code. The code becomes harder to understand.
What we do is, basically, we take these metrics, there are 25 of them, stick them as probes into the code, and pull them out and see what did we find. Then, you can always weigh these different metrics together, and you can categorize code as being either healthy or unhealthy. My experience is definitely that when code is unhealthy, when it's of poor quality, it's never a single thing. It's always a combination of factors.
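CodeScene's actual probes and weights aren't spelled out here, so the following is only an illustrative sketch of the idea: several complementary metrics are weighed together into a single healthy/unhealthy verdict. The probe names, weights, and threshold are made up for the example.

import java.util.Map;

class CodeHealthSketch {
    // Combine several probe scores (each normalized to 0.0-1.0, where 1.0
    // means "no problem found") into one weighted health score.
    static double health(Map<String, Double> probes, Map<String, Double> weights) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (var entry : probes.entrySet()) {
            double weight = weights.getOrDefault(entry.getKey(), 1.0);
            weightedSum += weight * entry.getValue();
            totalWeight += weight;
        }
        return totalWeight == 0 ? 1.0 : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Hypothetical probe results for one file; real tools use many more.
        Map<String, Double> probes = Map.of(
                "lowComplexity", 0.4,
                "shortFunctions", 0.9,
                "fewArguments", 0.8);
        // Give the complexity probe extra weight, purely for illustration.
        Map<String, Double> weights = Map.of("lowComplexity", 2.0);

        double score = health(probes, weights);
        // Arbitrary cutoff, just to show the healthy/unhealthy categorization.
        System.out.println(score >= 0.7 ? "healthy" : "unhealthy");
    }
}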
Improving Code Quality With AI
Now that we can measure code quality, we can benchmark AI tools’ capabilities. Fortunately, Adam did the heavy lifting for us:
Basically, the challenge we were looking at was that we have a tool that is capable of identifying bad code, prioritizing it, but then of course, you need to act on that data. You need to do something with the code, improve it. This is a really hard task, so we thought that maybe generative AI can help us with that. What we started out with was a data lake with more than 100,000 examples of poor code. We also had a ground truth, because we had unit tests that covered all these code samples.
We knew that the code does, at least, what the test says it does. What we then did was that we benchmarked a bunch of AI services like OpenAI, Google, LLaMA from Facebook, and instructed them to refactor the code. What we found was quite dramatic: in 30% of the cases, the AI failed to improve the code. Its code health didn't improve, it just wrote the code in a different way, but the biggest drop off was when it came to correctness, because in two-thirds of the cases, the AI actually broke the tests, meaning it's not a refactoring, it has actually changed the behavior. I find it a little bit depressing that in two-thirds of the cases, the AI won't be able to refactor the code.
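The exact pipeline CodeScene used isn't described in detail, but the shape of such a benchmark is easy to picture. The sketch below is only an illustration; every hook (refactorWithAi, testsPass, codeHealth) is a hypothetical stand-in for an AI service, a test runner, and a code-health analyzer.

class RefactoringBenchmarkSketch {
    // Hypothetical hooks; none of these are real APIs.
    interface Tools {
        String refactorWithAi(String source);
        boolean testsPass(String source);
        double codeHealth(String source);
    }

    static void benchmark(Iterable<String> samples, Tools tools) {
        int improved = 0, broken = 0, total = 0;
        for (String original : samples) {
            String refactored = tools.refactorWithAi(original);
            total++;
            if (!tools.testsPass(refactored)) {
                broken++;     // behavior changed: not a refactoring at all
            } else if (tools.codeHealth(refactored) > tools.codeHealth(original)) {
                improved++;   // correct and measurably healthier
            }
        }
        System.out.printf("total=%d improved=%d broke tests=%d%n",
                total, improved, broken);
    }
}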
What does that mean for us? Does it mean we can't use AI to refactor the code?
Fortunately, that's not the case. Even though we can't blindly trust AI to improve our code, we can still use it in a controlled fashion:
- Detection: Static code analyzers are very efficient but have limited capabilities. They identify bad practices as patterns, and some problems are hard to capture that way. AI tools can deal with more complex cases - especially if they work with the abstract syntax tree instead of the raw source code.
- Suggestions: Once we detect a problem, we can suggest solutions. Generative AI can shine in that, too.
- Localized refactoring: If we restrict the scope of the refactoring, tools have to work with a much less complex problem space. Less complexity means less room for errors - namely, breaking the tests.
The commonality in these techniques is that we have control over what is happening in the codebase - which is crucial for today’s AI tools.
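To make that control concrete, here is a minimal sketch of such a workflow, assuming hypothetical hooks for detection, suggestion, and verification; a change is only kept if it is localized and the tests still pass.

class ControlledAiWorkflowSketch {
    // Hypothetical hooks standing in for a static analyzer, an AI assistant,
    // and a test runner; none of these are real APIs.
    interface Assistant {
        boolean looksSmelly(String method);              // detection
        String suggestRefactoring(String method);        // suggestion
        boolean testsStillPass(String candidateMethod);  // verification
    }

    // Works on one method at a time (localized refactoring) and keeps the
    // original whenever the AI's suggestion breaks the tests - the human
    // and the test suite stay in control.
    static String improve(String method, Assistant assistant) {
        if (!assistant.looksSmelly(method)) {
            return method;   // nothing to do
        }
        String candidate = assistant.suggestRefactoring(method);
        return assistant.testsStillPass(candidate) ? candidate : method;
    }
}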
Conclusion
The tools available today can't reliably improve code quality - or even ensure correctness. For these reasons, tests and manual supervision are still essential when we use AI tools.
We should take Adam’s results with a grain of salt, though. The tools he tried weren’t optimized for refactoring. We saw that code quality and correctness are quantifiable through code health and the number of failing tests. Once AI tool creators optimize their products toward these metrics, we can expect them to improve - significantly. I can’t wait to see it.