Evaluating GitHub Copilot for Business

How we evaluated the impact of GitHub Copilot for 3 months

Tobias Deekens
Principal Engineer, Frontend Architecture, commercetools
Published 12 September 2023

GitHub announced its launch of GitHub Copilot for Business in February of this year. This announcement immediately caught our attention and interest, and engineers across the organization shared their desire to use this product. After aligning internally on an adoption strategy, we decided to evaluate GitHub Copilot for three months to learn how it can help us be more productive. This blog post describes our path to evaluating and adopting GitHub Copilot.

Evaluating GitHub Copilot for Business

We at commercetools use various programming languages and tools to build our different products — from Scala to TypeScript and PHP to Go and Rust. We value making educated choices about technology decisions so we can pick tools that make us more productive. Furthermore, as our company grows, we want to retain our collaborative mindset, which is ingrained into our company values. We see the breadth of tools built with generative AI having a huge impact on collaboration in software engineering and are eager to embed them into our daily routines.

Why evaluate first and not just adopt?

You may be wondering why we evaluated an omnipresent and successful product such as GitHub Copilot for three months instead of just adopting it for all engineers. That's because, as a company policy, commercetools believes in a pragmatic approach to the adoption of AI. AI is a widely supported initiative, but decisions about its usage are made bottom-up: we want our teams to evaluate and decide how AI can enable productivity and functionality. In this case, the engineering department investigated GitHub Copilot just as we would any other new tool. In doing so, we involved those who would actually be using and affected by the tool in order to get their pragmatic opinions.

Moreover, the incoming flux of new and enhanced products backed by generative AI makes it all the more important to thoroughly evaluate each tool's impact individually, as well as how the tools can be used together. Is it, for example, valuable to use Replit Ghostwriter, Codeium, CodeComplete and GitHub Copilot in tandem, or should we rather complement one of them with something different such as Mintlify or Wrap? This is a question one can only answer by exploring such tools in practice, not just by scanning their marketing websites.

To make an informed adoption decision, we wanted to clearly understand the expected and actual impact of GitHub Copilot across our engineering organization. This included Frontend Engineers, Backend Engineers, Site Reliability Engineers, Test Automation Engineers and those working on documentation.

How we evaluated GitHub Copilot

Having decided that we wanted to perform a controlled evaluation, we first settled on a meaningful duration: three months spanning two quarters felt ideal. Throughout this time, we hoped to get a comprehensive picture of the engineering cycle, including the end of a quarter, when teams often roll out new functionality across our products.

Having settled on a duration, we needed a sample size. With 150 engineers, running an evaluation with just five to ten of them can easily lead to skewed results. We therefore aimed for 30 to 35 engineers to join the group of evaluators, a representation of 20 to 25% of the engineering organization. Lastly, we wanted to involve as many disciplines as possible to get a heterogeneous group using different tools and languages.

We were now ready to share our plans through an internal blog post. In it, we described the process and linked a Google Form through which anybody interested could sign up. After a week, 34 people across the organization had done so. This roughly matched our desired sample size, and we luckily didn't have to adjust our pool of evaluators retroactively. Everybody was then added to an email list and a Slack channel to share updates. To grant access, all members were added to a dedicated team on GitHub, giving them access to GitHub Copilot.
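
As a side note for anybody wanting to automate that last step: the sketch below shows one possible way to populate such a team programmatically through the GitHub REST API. It is only an illustration under assumptions, not our actual tooling; the organization name, team slug and usernames are placeholders, and the team additionally needs to be granted access in the organization's GitHub Copilot settings.

```typescript
// Minimal sketch: adding evaluators to a dedicated GitHub team via the REST API.
// Assumes Node 18+ (global fetch) and a token with admin:org scope; the org name,
// team slug and usernames are placeholders.
const ORG = "example-org";
const TEAM_SLUG = "copilot-evaluators";

async function addEvaluator(username: string): Promise<void> {
  // https://docs.github.com/rest/teams/members#add-or-update-team-membership-for-a-user
  const response = await fetch(
    `https://api.github.com/orgs/${ORG}/teams/${TEAM_SLUG}/memberships/${username}`,
    {
      method: "PUT",
      headers: {
        Accept: "application/vnd.github+json",
        Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
        "X-GitHub-Api-Version": "2022-11-28",
      },
      body: JSON.stringify({ role: "member" }),
    },
  );
  if (!response.ok) {
    throw new Error(`Adding ${username} failed with status ${response.status}`);
  }
}

async function main(): Promise<void> {
  // Placeholder usernames standing in for the sign-ups collected via the form.
  for (const username of ["evaluator-one", "evaluator-two"]) {
    await addEvaluator(username);
  }
}

main().catch(console.error);
```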

With all of this set up, we got out of everybody's way and just let them do their work and use GitHub Copilot in the process. Only after a week did we briefly check in to ensure that everybody had successfully installed and integrated GitHub Copilot into their editor of choice. Over the following weeks, people shared their impressions and code samples in Slack or on pull requests while we remained in the background preparing a larger final survey.

Throughout the evaluation, we remained in touch with GitHub in the background, and they shared interesting statistics with us, such as our average code acceptance rate. Moreover, we managed to get GitHub Copilot Chat for the last two weeks of our evaluation, which gave us a peek at a more collaborative future for GitHub Copilot. We are excited to see where GitHub Copilot and its different offerings take us.

The results and outcome

We anticipated GitHub Copilot to be convenient to integrate into daily workflows and easy to use. We hoped its suggestions would be useful across programming languages and would not get in people's way. Throughout our evaluation, none of these expectations were disappointed, but we also noticed room for improvement, and the quality of suggestions varied a lot depending on the type of work somebody was performing.

In more detail, our main survey turned out to be 15 questions long and focused on three key areas:

  1. Was GitHub Copilot used continuously?

  2. Does GitHub Copilot make us more productive?

  3. Does GitHub Copilot pose any major risks or downsides for us?

It writes release notes for me! This is the best thing ever!
An evaluator after getting an early-morning coffee

Around these three key areas, we drilled deeper with questions such as:

  1. How often did you use GitHub Copilot during our trial?

  2. Did your usage of GitHub Copilot change during the three months?

  3. How often did you have to adjust the suggestions by GitHub Copilot?

  4. In what tasks did you see your biggest productivity gains?

  5. Should we evaluate other tools using generative AI this year to improve our productivity?

There was a daily wrestle between GitHub Copilot and regular IntelliSense.
An evaluator watching an everyday struggle

Having asked all these questions, these were the main takeaways:

  • 57% used GitHub Copilot every day; everybody else used it every other day.

  • 95% stated that GitHub Copilot makes them more productive.

  • 63% claimed that their usage increased over time.

  • 67% stated that the suggestions were helpful.

  • 82% stated that the suggestions were rarely problematic.

  • 60% claimed that GitHub Copilot was sufficient as an AI coding assistant.

  • 80% do not expect other tools to be significantly better.

  • 100% would like to continue using GitHub Copilot.

At times GitHub Copilot seems asleep with many VS Code windows open. Then it yells 50 lines of code at you.
An evaluator being hit by a rapid suggestion burst

In addition to these numbers, we also gathered more qualitative insights into where GitHub Copilot shines and where it failed to impress (an illustrative sketch follows the list):

  • Succeeds at writing tests (72%).

  • Helps in refactoring code (42%).

  • Shines in autocompletion, boilerplate and scaffolding (~60%).

  • Struggles with complex business logic (82%).

  • Is not powerful when code context matters (43%).

  • Should be considered carefully with performance or security-related topics (27%).

  • Is not helpful with highly specialized or modern frameworks (14%).
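
To give a flavor of the first points, here is a made-up sketch of the kind of repetitive test and boilerplate code that evaluators reported as working well. The function and test cases are invented for this post, and Vitest is only an assumed test runner; after the first case is written, the remaining ones typically need little more than accepting the suggestion.

```typescript
import { describe, expect, it } from "vitest";

// Hypothetical helper, not taken from our code base.
function centAmountToPrice(centAmount: number, currencyCode: string): string {
  return `${(centAmount / 100).toFixed(2)} ${currencyCode}`;
}

describe("centAmountToPrice", () => {
  it("formats a regular amount", () => {
    expect(centAmountToPrice(1099, "EUR")).toBe("10.99 EUR");
  });

  it("formats a zero amount", () => {
    expect(centAmountToPrice(0, "EUR")).toBe("0.00 EUR");
  });

  it("formats an amount in another currency", () => {
    expect(centAmountToPrice(250, "USD")).toBe("2.50 USD");
  });
});
```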

GitHub Copilot is exactly smart enough to at times be dangerous too.
An evaluator after GitHub Copilot suggested loading 40,000 entities from a database one by one
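
To make that anecdote more concrete, here is a simplified, hypothetical sketch of the pattern in question. The entity type and data-access helpers are made up for illustration and stubbed in memory; in the real scenario each call would hit the database.

```typescript
type Entity = { id: string };

// Hypothetical data-access helpers, stubbed in memory purely for illustration;
// in the real scenario each call would be a database round trip.
const store = new Map<string, Entity>();
async function fetchEntityById(id: string): Promise<Entity> {
  return store.get(id) ?? { id };
}
async function fetchEntitiesByIds(ids: string[]): Promise<Entity[]> {
  return ids.map((id) => store.get(id) ?? { id });
}

// The shape of the risky suggestion: one query per entity, so 40,000 ids
// mean 40,000 sequential round trips.
async function loadOneByOne(ids: string[]): Promise<Entity[]> {
  const entities: Entity[] = [];
  for (const id of ids) {
    entities.push(await fetchEntityById(id));
  }
  return entities;
}

// What a reviewer would ask for instead: batched lookups, e.g. 500 ids per query.
async function loadInBatches(ids: string[], batchSize = 500): Promise<Entity[]> {
  const entities: Entity[] = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    entities.push(...(await fetchEntitiesByIds(ids.slice(i, i + batchSize))));
  }
  return entities;
}
```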

As we evaluated GitHub Copilot over a longer period, we also saw areas for improvement:

  • You can’t give it feedback on a suggestion yet.

  • It can’t be configured to not work in certain folders or situations.

  • It doesn’t work very well across file boundaries.

  • Homogeneous refactoring across a larger code base doesn't work well with it.

Those are a lot of numbers, but they helped us understand the usefulness of GitHub Copilot across our organization and led us to the decision to adopt the tool. Once enabled, GitHub Copilot was used continuously, and usage even increased over time. Suggestions were often accepted and of good quality. Users were able to embed it easily into their existing work environments and got huge productivity gains out of it, which is why we will continue to roll it out more widely across our organization in the coming weeks.

To read more from Tobias Deekens, check out his blog post 3 years of sustaining Open Source through our donation program. And if you're interested in how we work at commercetools, we might have just the right position for you on our Careers page.

Tobias Deekens
Principal Engineer, Frontend Architecture, commercetools

Tobias Deekens is a retired basketball player and lousy guitarist, as well as a developer, avid teacher and spontaneous speaker with strong experience in frontend development and architecture. He feels great joy in mentoring while working with diverse teams in agile environments.
