Piloting GitHub Copilot

As a software engineer, I have been skeptical about AI-assisted coding. The hype cycle is off-putting, giving the impression that many generative AI applications are solutions in search of problems. That, combined with the obscene resource usage required to train and run these models, has kept me from using them; instead I have criticised them from the sidelines like the curmudgeon I am.

Yet part of me fears that I am resisting modernisation, rejecting AI tools like GitHub Copilot out of hand because of firmly held yet untested beliefs about their lack of value.

So when the company I work for decided to roll out a pilot program for Copilot, I jumped at the chance to use it for a month. Would my initial skepticism and hostility be justified? Or have I been missing out? (Hint: you're going to get a boring and nuanced answer).

What is Copilot?

By now, most developers will know what Copilot is and what it does. But just in case, let me recap quickly. Copilot is a plugin you can install into your IDE (I use IntelliJ), which provides two key capabilities: in-IDE code suggestions, and a context-aware chat feature that you can ask questions about your codebase or ask to generate code for you.

I'll talk about these in turn.

Code suggestions

Code suggestions have been hit and miss for me. At times, Copilot has confabulated functions, variables, and libraries that do not exist. Other times, it seemed to read my mind and did exactly what I wanted to do, how I wanted to do it.

This is a problem for me. I have no desire to become a code reviewer for Copilot; that would be more taxing, slower, and less fun than writing the code myself. So unless Copilot can consistently provide me with useful suggestions, I will not use it. For the same reason, I prefer not to use Copilot to generate code when I do not have a clear idea in my head of what I want it to generate. I want it to output code that is already in my head, rather than replace my thinking. Otherwise I am relegated to reviewing its output after all, which doesn't appeal to me.

Additionally, using Copilot to generate code without having a conceptual model in mind is risky: if I cannot explain what the generated code does, I cannot stand behind it. That feels like an abdication of responsibility, which I don't think software engineers should accept.

That said, I found that there are a few ways in which Copilot enhanced the way I work and which did not make me feel uncomfortable.

Context

Copilot becomes much better at generating code once it has seen "examples" of what I want it to generate. For instance, during a relatively sweeping refactor, after I have changed code in similar ways in a few places myself, Copilot will start suggesting the same changes (adapted to the individual contexts) elsewhere, as in the sketch below. This is a serious time saver: I am making Copilot do grunt work for me that I would otherwise have to do myself, while knowing exactly what output I want from it. There is no ambiguity: I can see at a glance whether Copilot has done what I want, in which case I accept the suggestion. If not, I ignore it.
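
To make that concrete, here is a minimal, hypothetical Kotlin sketch (all names are invented for illustration): suppose I am widening entity ids from Int to Long across a codebase.

    // Hypothetical refactor: widening entity ids from Int to Long.
    // The old shape, repeated across the codebase:
    //   fun findOrder(orderId: Int): Order
    // After I have changed a few of these by hand...
    fun findOrder(orderId: Long): Order = TODO()
    fun findInvoice(invoiceId: Long): Invoice = TODO()

    // ...Copilot starts offering the analogous edit at the remaining
    // declarations and call sites, adapted to each local name:
    fun findCustomer(customerId: Long): Customer = TODO()

    class Order
    class Invoice
    class Customer

The suggestions are only mechanical pattern-following, but for this kind of repetitive change that is exactly what I want.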

Test coverage

I want comprehensive test coverage: good enough to give me confidence that something will complain if I accept a suggestion with a subtle mistake. I need to be able to blindly press tab when a suggestion "looks about right" and have a high degree of confidence that my test suite will save me from having accepted buggy code. Again, I want to avoid becoming a code reviewer for Copilot; my tests should be my reviewer.

This is why I am very careful about how I use Copilot to generate tests. Before the tests can speed up my development of implementation code, I need to trust that they are testing the right things. I much prefer writing tests myself to establish that trust, so that I can then go wild with suggestions in implementation code.

In terms of TDD cycles, I want to do the "red" (writing a failing test) and the "refactor" steps myself, but I might let Copilot loose on the "green" step, where the goal is simply to make the test pass. This is the inverse of how I often see Copilot used: many people want to offload the boring work of writing tests and do the implementation themselves. That is not for me.
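
A small sketch of that division of labour, with hypothetical names (slugify is invented for illustration, and the test assumes the kotlin.test library):

    import kotlin.test.Test
    import kotlin.test.assertEquals

    // The "red" step: I write this test myself, so I trust what it asserts.
    class SlugifyTest {
        @Test
        fun `builds a url slug from a title`() {
            assertEquals("hello-world", slugify("Hello, World!"))
        }
    }

    // The "green" step: with the test in place, I can accept a Copilot
    // suggestion along these lines as soon as the test passes.
    fun slugify(title: String): String =
        title.lowercase()
            .replace(Regex("[^a-z0-9]+"), "-")
            .trim('-')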

Type safety

Finally, I need a codebase that uses typing effectively. For example, Copilot is generally pretty good at generating test data. It can create data classes for mocked return values or expected results, and it is mostly accurate. But every so often it will get things mixed up. It might put values in the wrong place (e.g., pass two entity UUIDs into the wrong positional arguments), or it might pass in Integers instead of Longs.

I do not want to have to be vigilant about these things. I find it harder to detect these issues in code review than to simply avoid them by writing the code myself and looking at the function signatures. So, in order for me to cede some control to Copilot, I need to trust that my IDE will show a squiggly line when generated code does not adhere to the types.

Sometimes this means reaching for branded types: if a function takes six UUIDs for six different entities, I might create a newtype wrapping the UUID type for each of them, so that I have increased type safety (see the sketch below). Sure, this is a bit of additional work if the codebase doesn't already do it. But perhaps this is another good practice that becomes more important when using AI assistants?
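
In Kotlin, value classes give you this kind of newtype cheaply. A minimal sketch with invented entity names:

    import java.util.UUID

    // Hypothetical newtypes: zero-cost wrappers that turn swapped
    // arguments into a compile error rather than a subtle test-data bug.
    @JvmInline
    value class UserId(val value: UUID)

    @JvmInline
    value class OrderId(val value: UUID)

    fun linkOrderToUser(userId: UserId, orderId: OrderId) { /* ... */ }

    fun main() {
        val userId = UserId(UUID.randomUUID())
        val orderId = OrderId(UUID.randomUUID())
        linkOrderToUser(userId, orderId)     // compiles
        // linkOrderToUser(orderId, userId)  // rejected by the compiler,
        // even when the call was generated by Copilot
    }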

So on balance, do I find Copilot's suggestions useful? Yes, I do. Used in the way described, they speed up my development in specific circumstances. Now, I personally find the guardrails I have to put in place (test coverage and type safety) important anyway, so this isn't really a downside. But I am sure opinions here will vary.

Chatting with Copilot

In addition to code suggestions, the Copilot plugin also provides chat functionality. You can ask Copilot to generate code for you, to explain things about your codebase or the libraries you are using, and to provide examples of how to use certain pieces of software.

I found myself using this feature much less than code suggestions.

As an engineer, I am used to reading documentation and source code to understand how things work. It makes me uncomfortable to outsource this information-gathering to an AI assistant. I am responsible for the technical decisions that I make and the ways in which I introduce and use software in my projects, and if I end up asking Copilot to tell me how to use something or even what to use, then I will have to validate this information myself. That sounds like double work, in most cases.

I understand that other engineers find Copilot chat helpful for bouncing ideas back and forth about design patterns or architectural decisions, but I personally found Copilot's answers too generic to be useful. It would often just spit out regurgitated lists of pros and cons for particular architectural choices and then tell me that I'd have to "look at the specific system" to make the decision that makes the most sense.

While true, being told this by an LLM isn't particularly helpful. I'd rather read a book on the topic or talk to a colleague who can be an actual sounding board or devil's advocate.

Where I did feel the chat feature came in handy was when I wanted it to generate code, specifically in two scenarios.

Docstrings

I have found the chat feature useful for generating docstrings for functions that are already written. When variables are well named and the logic of a function is easy to follow, Copilot does a good job of creating documentation for me. It might need very light editing, but more often than not the docstring can be accepted verbatim. This is a nice time saver and makes it easier to ensure that all my public interfaces are documented.
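
To give a sense of the output: for a small, clearly named function like the hypothetical one below, the KDoc block is the kind of thing I can usually accept as-is (an illustration, not verbatim Copilot output).

    /**
     * Calculates the gross total of an order, including VAT.
     *
     * @param netAmounts the net line-item amounts for the order
     * @param vatRate the VAT rate to apply, e.g. 0.21 for 21%
     * @return the gross total, rounded to two decimal places
     */
    fun grossTotal(netAmounts: List<Double>, vatRate: Double): Double {
        val net = netAmounts.sum()
        return Math.round(net * (1 + vatRate) * 100) / 100.0
    }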

Test data transformation

I also found the chat helpful for doing data transformations for me. Let me explain with an example. I was testing a piece of functionality and had set up my test case: mocking out a dependency and providing it with some mocked data to return. I just needed to write my assertions.

Now, the expected data was a list of about 24 data classes. Normally I'd spend a few minutes writing them out by hand, or let the test fail, paste the actual data from the failure's output into my test, and edit each line into the required data class. Instead, I just pasted the output from the test failure into the Copilot chat and asked it to transform it into the required data classes for me. It did this accurately and saved me at least a couple of minutes of menial work.
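
Compressed into a sketch with hypothetical types (LedgerEntry is invented; the real data looked different), the exchange was roughly:

    // Hypothetical target type for the assertion:
    data class LedgerEntry(val account: String, val amountCents: Long)

    // Pasted into the chat: the raw values from the test failure output,
    //   LedgerEntry(account=cash, amountCents=1250)
    //   LedgerEntry(account=fees, amountCents=-30)
    //   ...and so on, for roughly 24 lines.

    // What the chat handed back, ready to paste into the assertion:
    val expected = listOf(
        LedgerEntry(account = "cash", amountCents = 1_250L),
        LedgerEntry(account = "fees", amountCents = -30L),
        // ...one line per remaining entry
    )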


So, what do I think of Copilot after a month of using it? I think it is a useful tool that can speed up my development workflow, but only in specific circumstances. I need to have a clear idea of what I want Copilot to generate. I need to have a codebase that I trust will detect subtle errors, so that I can accept suggestions without having to become a code reviewer.

I am on the fence about whether Copilot's utility justifies its significant resource requirements. Copilot is not a silver bullet. For me, it falls short of "supercharging developer productivity", as touted by GitHub itself and many other proponents of AI-based everything.

Sure, it makes me a bit faster. Sure, it spares me some grunt work that I'd rather not do. But is that worth spending a small country's worth of energy on training and running these models? I am not sure. Perhaps advances in open source models can assuage these concerns.

That said, I think I am a little less skeptical than I was before. Yes, the hype cycle is annoying. And there are significant problems around licensing and resource use that many people gloss over. It doesn't help that all things AI have now been co-opted by the web3 and crypto bros. But LLM-based coding assistants do seem to have actual utility, and I can see this utility increasing as they are developed further.

Tags: #engineering #Copilot #ai #llm