Evaluation is the new moat
Model access is commoditized. The defensible asset is your rubric, your golden dataset, and the discipline of running both against every change. That's where domain expertise compounds.
Source · Synthesis from frontier-lab evaluation practice, 2026
“If two teams have the same model and the same data, the team with a sharper rubric ships better outputs — every release, forever.”
Strategic implications
1. Treat evaluation as a first-class product surface, not a QA afterthought. Version it, publish it, defend it.
2. Domain experts become rubric authors. The org chart should make space for them next to product and engineering.
3. Open-source your prompts and skills; keep your evals private. The asymmetry compounds.
Pick the highest-stakes prompt in your codebase. Write 20 graded examples for it this week. That's your moat starter pack.
Personal opinion, not analysis. Dated above; revisit if the conditions change.
Why the moat moved
In 2023, the moat was model access. Closed-weight frontier models were gated behind enterprise sales motions, and the ability to even call the API was a competitive edge.
That edge is gone. Frontier-class capability is now a commodity input — open weights, multiple vendors, falling per-token prices. If your competitive story still rests on "we use the best model," you don't have a story.
The moat moved one layer up: to the rubric.
What a rubric actually is
A rubric is a structured way of asking "is this output good?" applied consistently across thousands of generations. Concretely:
- A set of evaluation criteria, weighted, written by domain experts.
- A golden dataset of inputs paired with expected output characteristics.
- Tooling that runs the rubric against every prompt change, model change, or skill change.
- A culture that treats failing rubric runs as build failures.
Two teams with the same model and the same data will produce outputs of different quality because their rubrics differ. The team whose rubric catches more failure modes ships better outputs every release, and that gap compounds.
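To make the four ingredients above concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a reference implementation: the names `Criterion`, `GoldenExample`, `score`, and `run_rubric`, the two toy criteria, the stand-in `fake_model`, and the 0.8 pass threshold are all hypothetical. Real criteria would be written by domain experts and would usually be richer than a substring or length check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One weighted rubric criterion (hypothetical structure)."""
    name: str
    weight: float
    # Takes (model output, expected characteristics) and returns pass/fail.
    check: Callable[[str, dict], bool]


@dataclass(frozen=True)
class GoldenExample:
    """An input paired with expected output characteristics."""
    prompt_input: str
    expected: dict


def score(output: str, expected: dict, rubric: list[Criterion]) -> float:
    """Weighted fraction of criteria that the output satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output, expected))
    return earned / total


def run_rubric(generate, rubric, golden, threshold=0.8):
    """Score every golden example; return (mean score, passed threshold?)."""
    scores = [score(generate(ex.prompt_input), ex.expected, rubric) for ex in golden]
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold


# Toy rubric: mentioning the required term matters three times as much as length.
RUBRIC = [
    Criterion("mentions_required_term", 3.0,
              lambda out, exp: exp["must_mention"].lower() in out.lower()),
    Criterion("within_length", 1.0,
              lambda out, exp: len(out) <= exp["max_chars"]),
]

GOLDEN = [
    GoldenExample("How do refunds work?", {"must_mention": "refund", "max_chars": 200}),
    GoldenExample("Where is my invoice?", {"must_mention": "invoice", "max_chars": 200}),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same answer."""
    return "Our refund policy allows returns within 30 days."


mean_score, passed = run_rubric(fake_model, RUBRIC, GOLDEN)
# Treat a failing run like a build failure: CI would pass this to sys.exit().
exit_code = 0 if passed else 1
print(f"mean={mean_score:.3f} passed={passed} exit_code={exit_code}")
```

In this toy run the stand-in model answers the refund question well but ignores the invoice question, so the mean falls below the threshold and the run would fail the build. The design choice worth noting is that the rubric and golden set are plain data, so they can be versioned and reviewed alongside the prompts they grade.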
What this implies for hiring and org design
Domain experts who can write good rubrics become as load-bearing as engineers. They sit between product and ML — they decide what "good" means.
In a startup that takes this seriously, the org chart changes:
- A "Head of Evals" role appears, reporting to engineering or to product.
- ML engineers spend more time on eval infra than on prompt tuning.
- PMs partner with domain experts to ship rubric updates and announce them in release notes.
Counter-argument
The honest pushback: rubrics are expensive to build, brittle to maintain, and often reflect today's failure modes rather than tomorrow's. A team that over-invests in rubrics can ossify around them, missing capability shifts that invalidate their eval set.
The mitigation isn't to skip rubrics. It's to version them like code, retire old criteria deliberately, and treat rubric-set drift (how much the criteria and their weights change from one version to the next) as a first-class metric.
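One way to put a number on rubric-set drift is sketched below. The definition is an assumption made for illustration, not an established metric: each rubric version is a mapping from criterion name to weight, and drift is half the sum of absolute differences between the normalized weights, which gives 0.0 for identical rubrics and 1.0 for rubrics that share no criteria. The function names and the example criteria are hypothetical.

```python
def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights so they sum to 1, making versions comparable."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def rubric_drift(old: dict[str, float], new: dict[str, float]) -> float:
    """Half the L1 distance between normalized weights, in [0, 1]."""
    o, n = normalize(old), normalize(new)
    names = set(o) | set(n)
    return 0.5 * sum(abs(o.get(k, 0.0) - n.get(k, 0.0)) for k in names)


def criteria_changes(old: dict[str, float], new: dict[str, float]):
    """Return (retired, added) criterion names between two versions."""
    retired = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    return retired, added


# Two hypothetical rubric versions: one criterion retired, one added.
RUBRIC_V1 = {"accuracy": 2.0, "tone": 1.0, "cites_source": 1.0}
RUBRIC_V2 = {"accuracy": 2.0, "tone": 1.0, "handles_tool_calls": 1.0}

drift = rubric_drift(RUBRIC_V1, RUBRIC_V2)
retired, added = criteria_changes(RUBRIC_V1, RUBRIC_V2)
print(f"drift={drift:.2f} retired={retired} added={added}")
```

Tracked per release, a value stuck near zero could hint that the rubric is ossifying, while a sudden jump flags a change that deserves review. Where to set either alarm is a judgment call this sketch does not make.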