Results from Claude Code Pilot at Varjo

Pyry Haulos

This post was originally shared on LinkedIn. I’ve expanded it here with additional context from the pilot.

We recently concluded a pilot of Claude Code at Varjo. There’s a wide range of claims being made about AI coding tools right now, so I wanted to share what we actually observed.

Quantitative and qualitative results

Working on a large, complex C++ codebase, our pilot participants showed a measurable increase in commit frequency and code output compared to a control group. We compared participants against both their own historical baseline and non-participating peers over the same period.
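The comparison described above is essentially a difference-in-differences: each group's change from its own historical baseline, then the gap between those changes. As a rough sketch with made-up numbers (the post does not disclose the actual figures or metric definitions), it looks like this:

```python
# Illustrative only: hypothetical weekly commit counts, NOT pilot data.
# Each group is compared against its own baseline, and the pilot group's
# change is then compared against the control group's change.

pilot_before, pilot_during = 6.0, 8.1      # avg commits/week, pilot group (assumed)
control_before, control_during = 5.8, 6.0  # avg commits/week, control group (assumed)

pilot_change = pilot_during - pilot_before
control_change = control_during - control_before

# The control group's change absorbs period effects (release cycles, holidays),
# so the remaining gap is the effect attributable to the tool.
effect = pilot_change - control_change

print(f"pilot change: {pilot_change:+.1f} commits/week")
print(f"control change: {control_change:+.1f} commits/week")
print(f"estimated effect: {effect:+.1f} commits/week")
```

Comparing against both baselines matters: a raw before/after comparison on the pilot group alone would conflate the tool's effect with whatever else changed over the period.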

These quantitative results were consistent with the qualitative feedback. Engineers reported that the tool helped them work with unfamiliar technologies faster and described a shift in their day-to-day work from writing code alone to something closer to a pair-programming dynamic — directing and reviewing rather than typing from scratch.

One thing the data made clear: effective adoption takes weeks or months, not days. This is not a tool you install and immediately see a step change in output. The engineers who reported the most benefit were those who invested time in developing prompting discipline and integrating the tool into their existing workflow.

Most valuable use cases

The largest reported time savings came from tasks involving unfamiliar technology, high complexity, or high volumes of repetitive work:

  • Working with unfamiliar technology or codebases. Engineers used the tool to prototype and build with technologies they hadn’t previously worked with — in some cases tackling tasks that had been deferred because the ramp-up cost was considered too high.
  • Test writing, refactoring, and boilerplate. Generating comprehensive tests, performing complex refactoring, and scaffolding new code were consistently cited as areas where the tool reduced time spent on work that is important but often deprioritized.
  • Enabling contributions during fragmented time. Team leads and others with limited hands-on coding time reported that the tool’s ability to hold context across interruptions allowed them to make meaningful contributions in shorter work sessions.
  • Debugging. The tool was effective at identifying certain classes of bugs — particularly those that require methodical analysis across a large codebase.

Challenges and limitations

The most common frustration was unreliable output: the tool sometimes made errors, claimed tasks were complete when they were not, or accepted incorrect premises without pushback. This means effective use requires breaking work into small, verifiable steps and checking results at each stage. That overhead is real, and it changes the nature of the work — you spend less time writing code and more time directing and reviewing it.

The recommendation from the pilot was that structured prompting and incremental verification are not optional. They are the difference between productive use and wasted time.

ROI

Despite using API pricing, the pilot delivered a positive ROI. We estimated the time-savings threshold needed for a positive return using average internal engineering cost, and the development metrics indicated we exceeded it.
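The break-even arithmetic is simple: divide the tool's cost per engineer by the fully loaded engineering cost per hour to get the hours of saved time needed each month. The figures below are assumptions for illustration, not Varjo's actual costs:

```python
# Illustrative break-even sketch with ASSUMED figures, not real internal numbers.
eng_cost_per_hour = 75.0     # assumed fully loaded engineering cost, EUR/hour
tool_cost_per_month = 300.0  # assumed per-engineer API spend, EUR/month

# Hours an engineer must save per month for the tool to pay for itself.
break_even_hours = tool_cost_per_month / eng_cost_per_hour

# Expressed against ~160 working hours/month, this is the productivity
# threshold the development metrics need to clear.
break_even_pct = break_even_hours / 160 * 100

print(f"break-even: {break_even_hours:.1f} hours saved per engineer per month")
print(f"as a productivity threshold: {break_even_pct:.1f}%")
```

Under these assumed numbers the threshold is a few hours a month, i.e. a low single-digit productivity gain, which is why even modest measured improvements can clear it.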

Whether the specific metrics we tracked — primarily commit-based — fully capture delivered business value is a fair question. But even where the increased output consisted of improved test coverage or supporting infrastructure rather than direct feature work, I consider that a net positive. Only a fraction of engineering effort goes toward new feature implementation; acceleration in tests, tooling, and infrastructure contributes directly to codebase health and long-term velocity.

Our pilot ran on models prior to Sonnet 4.5. As both the models and our usage patterns improve, I expect the economics to continue shifting in favor of broader adoption.

What we did next

Based on the results, we expanded access to the full R&D team and ran internal workshops to share the prompting and workflow practices that emerged during the pilot.