Methodology · v0.3 · last updated Apr 15, 2026
How I evaluate agent skills
The rubric I apply to every skill published on startupengineering.io. Public so you can verify my work, critique my method, or apply it yourself.
Purpose
Startup engineers are being asked to adopt agent skills without good information about whether they work. Marketing content overpromises. Raw GitHub repos underexplain. I run a consistent evaluation process across every skill I publish, so readers can compare skills against a common standard and see exactly how I reached my conclusions.
This rubric is not definitive. It captures what I currently think matters when evaluating a skill for startup engineering contexts. It will evolve as I learn what the rubric misses. Every version is dated and preserved, so any evaluation you read on this site can be traced back to the exact methodology that produced it.
What the rubric covers
The rubric has two parts. The first evaluates the skill file as an artifact: is it well written, well structured, and likely to guide an agent correctly? The second evaluates the skill empirically: does an agent following this skill actually produce good behavior on real scenarios?
A skill can pass the first part and fail the second. A well-written skill that no agent follows correctly is not useful. A poorly written skill that produces correct behavior is not reproducible; it probably works by accident. Both parts matter, and they're reported separately so readers can weight them as they see fit.
Part 1: Spec compliance
Eight criteria drawn directly from the agentskills.io best practices. Each is a binary pass/fail with cited evidence. The source of these criteria is not me — it's the official standard. My contribution is rigorous, reproducible application.
Provides defaults, not menus
What this checks: Does the skill recommend one specific default approach when multiple are possible, rather than listing options and leaving the agent to choose?
How I grade it: Pass if the skill names a default with optional fallbacks. Fail if the skill presents multiple options as equally valid without specifying which to prefer.
Example evidence: A skill that says "Use pdfplumber for text extraction. For scanned PDFs, use pdf2image with pytesseract instead" passes. A skill that says "You can use pypdf, pdfplumber, PyMuPDF, or pdf2image" fails.
Favors procedures over declarations
What this checks: Does the skill tell the agent how to approach a class of problems, rather than what to produce for a specific instance?
How I grade it: Pass if the skill's process section contains generalizable steps that work across varied inputs. Fail if the skill hardcodes specific answers that only apply to one scenario.
Example evidence: A skill that says "Join tables using the _id foreign key convention, apply filters from the user's request, aggregate numeric columns" passes. A skill that says "Join orders to customers on customer_id, filter where region = EMEA, sum the amount column" fails.
Has a gotchas section
What this checks: Does the skill include environment-specific facts that defy reasonable assumptions — the non-obvious corrections the agent would miss without being told?
How I grade it: Pass if the skill has a gotchas section (or an equivalently named one) with at least one concrete, testable item. Fail if the skill omits this section or fills it with generic advice ("handle errors appropriately").
Example evidence: "The users table uses soft deletes. Queries must include WHERE deleted_at IS NULL" is a gotcha that earns a pass. "Be careful with the database" is too vague to pass.
Uses templates for output format
What this checks: Does the skill provide a concrete template when specific output format matters, rather than describing the format in prose?
How I grade it: Pass if the skill includes a literal output template (typically a markdown code block showing the desired structure). Fail if the skill describes the output only in prose.
Example evidence: A skill that shows a template with placeholders like "## Executive summary\n[One-paragraph overview of key findings]" passes. A skill that says "Write a report with an executive summary section, followed by findings and recommendations" fails.
Sized under 500 lines
What this checks: Is the SKILL.md file within the recommended 500-line limit from the agentskills.io specification?
How I grade it: Pass if under 500 lines. Fail if over. Count includes all lines of the primary SKILL.md file, excluding optional references in separate files.
Example evidence: A 312-line file passes. A 650-line file fails.
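The size check is simple enough to sketch in a few lines. This is a minimal illustration, not the tooling used for the published evaluations. The function names are hypothetical, and the handling of a file with exactly 500 lines is an assumption: the rule above says "under" passes and "over" fails, so the sketch treats the boundary as within the limit.

```python
def count_lines(skill_text: str) -> int:
    """Count every line of the primary SKILL.md text, blank lines included."""
    return len(skill_text.splitlines())


def passes_size_limit(skill_text: str, limit: int = 500) -> bool:
    """Return True when the file is within the line limit.

    Assumption: a file of exactly `limit` lines passes. The rubric text
    does not state how the boundary case is graded.
    """
    return count_lines(skill_text) <= limit


# Usage: only the primary SKILL.md is counted, not files under references/.
short_skill = "\n".join(["step"] * 312)
long_skill = "\n".join(["step"] * 650)
print(passes_size_limit(short_skill))  # True
print(passes_size_limit(long_skill))   # False
```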
Specificity calibrated to fragility
What this checks: Does the skill match the specificity of each instruction to the fragility of the task it covers — prescriptive where sequence matters, flexible where multiple approaches work?
How I grade it: Pass if the skill's instructions show deliberate calibration. Fail if the skill is uniformly prescriptive (micromanaging flexible tasks) or uniformly flexible (offering guidance where strict steps are needed).
Example evidence: A skill that specifies an exact migration command with explicit flags, but leaves code style as general guidelines, passes. A skill that prescribes exact code style down to variable naming passes only if the project genuinely requires that strictness.
Has a rationalizations table
What this checks: Does the skill include a table of common rationalizations (excuses agents might use to skip steps) paired with counter-arguments?
How I grade it: Pass if the skill has a rationalizations table with at least two entries, each pairing an excuse with a rebuttal. Fail if the skill omits this or only lists excuses without counter-arguments.
Example evidence: A table with rows like "'I'll add tests later' → Tests written after implementation tend to test the code's behavior rather than its requirements" passes. A bulleted list of warnings without paired rebuttals fails.
Uses progressive disclosure for large references
What this checks: When the skill has extensive reference material, does it split that material into separate files that load only when needed?
How I grade it: Pass if the skill is short (under 200 lines) or uses a references/ directory for longer reference material. Fail if the skill is over 300 lines with all content inline and no references/ split.
Example evidence: A 200-line SKILL.md with "Read references/api-errors.md if the API returns a non-200 status code" passes. A 400-line SKILL.md with all error handling inline fails.
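The thresholds above can be written as a small decision function. This is a sketch under stated assumptions: the function name and its two inputs are hypothetical, and the rule as written does not say how to grade an inline-only file between 200 and 300 lines, so the sketch returns None for that range rather than guessing.

```python
from typing import Optional


def grade_progressive_disclosure(line_count: int, has_references_dir: bool) -> Optional[bool]:
    """Apply the stated thresholds: True = pass, False = fail, None = not covered.

    Assumption: `has_references_dir` means the skill actually moves its longer
    reference material into references/, not merely that the directory exists.
    """
    if line_count < 200 or has_references_dir:
        return True
    if line_count > 300:
        return False
    # 200 to 300 lines, all inline: the written rule gives no verdict here.
    return None


print(grade_progressive_disclosure(200, True))   # True
print(grade_progressive_disclosure(400, False))  # False
```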
Scoring for Part 1
Each criterion is binary: pass or fail. No partial credit. If I'm uncertain whether a criterion passes, I look for additional evidence; if still uncertain, I mark it fail and note the uncertainty in the evaluation's findings section.
The summary is a count out of 8: "Passes 7 of 8 spec criteria." No weighting between criteria. I don't assert that one criterion matters more than another. Readers can weight them according to their own priorities.
A skill with a low spec compliance score isn't necessarily a bad skill. Non-compliance often means the author was writing before the best practices crystallized, or that their judgment diverged from the standard for specific reasons. The score is information, not a verdict.
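The Part 1 summary can be sketched as an unweighted count over the eight criteria. The code below is an illustration only. The function name and the use of None to represent "still uncertain" are assumptions made for the example; the criterion names are taken from the headings above.

```python
from typing import Mapping, Optional

# The eight Part 1 criteria, named after their headings in this rubric.
CRITERIA = (
    "Provides defaults, not menus",
    "Favors procedures over declarations",
    "Has a gotchas section",
    "Uses templates for output format",
    "Sized under 500 lines",
    "Specificity calibrated to fragility",
    "Has a rationalizations table",
    "Uses progressive disclosure for large references",
)


def spec_summary(results: Mapping[str, Optional[bool]]) -> str:
    """Build the one-line summary from per-criterion results.

    True = pass, False = fail, None = uncertain (hypothetical encoding).
    Uncertain criteria count as fail, and there is no partial credit or weighting.
    """
    missing = [name for name in CRITERIA if name not in results]
    if missing:
        raise ValueError(f"ungraded criteria: {missing}")
    passed = sum(1 for name in CRITERIA if results[name] is True)
    return f"Passes {passed} of {len(CRITERIA)} spec criteria."


# Usage: seven clear passes and one criterion left uncertain.
results = {name: True for name in CRITERIA}
results["Has a rationalizations table"] = None
print(spec_summary(results))  # Passes 7 of 8 spec criteria.
```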
Part 2: Empirical evaluation
Does an agent following this skill actually produce the behavior the skill describes? I run the skill against a set of scenarios on multiple agents, record what happens, and publish the results as a matrix.
Scenario design
Each skill evaluation uses 12 scenarios spanning four required categories:
Happy path scenarios test that the skill produces correct behavior on typical inputs. At least 2 scenarios per evaluation.
Red flag trigger scenarios test that the skill refuses or redirects when its stated red flags are hit. At least 3 scenarios per evaluation, each testing a distinct red flag.
Edge case scenarios test that the skill handles unusual but plausible inputs. At least 3 scenarios per evaluation.
False activation scenarios test that the skill correctly declines when the user's request doesn't match the skill's scope. At least 1 scenario per evaluation.
The remaining scenarios fill out the test set with skill-specific cases. Every scenario is reproducible: given the same inputs, another researcher running the same agent should get similar results.
Agent selection
Each evaluation runs the scenarios against at least 4 agents. The default agent set as of v0.3 is Claude Sonnet 4.5, Claude Opus 4.7, GPT-5, and Gemini 2.5. Versions and release dates are documented in every evaluation's methodology footer.
I pick agents based on actual usage among startup teams, not completeness. I may add agents (Codex, Llama-based agents, open-source runtimes) when their adoption warrants it. The agent set can change between evaluations, but each individual evaluation uses a fixed set named in its methodology.
Running the scenarios
Each scenario is run 3 times on each agent. I record the majority outcome. If all three runs produce different outcomes, I mark the result partial and note the variance in the trace.
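The three-run aggregation can be sketched as follows. With three runs and three possible grades, either one grade appears at least twice or all three runs differ. The function name and the returned variance flag are hypothetical, added only for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def aggregate_runs(outcomes: Sequence[str]) -> Tuple[str, bool]:
    """Collapse three graded runs into one cell result.

    Returns (result, high_variance). When all three runs differ, the result
    is "partial" and high_variance is True so the variance can be noted.
    """
    if len(outcomes) != 3:
        raise ValueError("expected exactly 3 runs per scenario and agent")
    outcome, count = Counter(outcomes).most_common(1)[0]
    if count >= 2:
        return outcome, False
    return "partial", True


print(aggregate_runs(["pass", "pass", "fail"]))     # ('pass', False)
print(aggregate_runs(["pass", "partial", "fail"]))  # ('partial', True)
```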
Agent temperature is set to 0.3 unless the skill explicitly requires a different setting. System prompt is minimal — just enough to load the skill. I don't engineer the prompts to improve results, because I'm testing the skill's behavior, not my ability to prompt-engineer around it.
Each run is logged: the input, the agent's response, the reasoning trace if available. Traces are preserved so readers can inspect any individual run.
Grading a run
Each run is graded pass, partial, or fail:
Pass: The agent followed the skill's procedure and produced the intended outcome.
Partial: The agent partially followed the procedure or produced a partially correct outcome. Specifically: missed one step, flagged an issue at the wrong severity, or deviated slightly from the skill's output format without losing substance.
Fail: The agent violated the skill's procedure or produced an incorrect outcome. Specifically: skipped a required step, missed a red flag the skill explicitly names, or produced output that contradicts the skill's template.
I grade traces myself. I don't use automated grading, because the distinction between partial and fail often requires judgment. My grading rules are documented here, but the application is human, which means it's fallible. Evaluations are open to correction — if you think I graded a run wrong, file an issue on the content repo.
Reporting results
Results are published as a scenario × agent matrix. Each cell shows pass, partial, or fail for a specific scenario run against a specific agent. Hovering a cell reveals the trace.
Summary percentages per agent are computed by scoring each scenario (pass = 1.0, partial = 0.5, fail = 0), summing the scores, and dividing by the scenario count. These summaries are convenient but compressed. Readers who want precision should look at the matrix cells, not the percentages.
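As a worked example of the per-agent summary, the sketch below scores one agent's column of the matrix. The function name is hypothetical and the sample results are invented for illustration; they are not taken from any published evaluation.

```python
from typing import Sequence

# Score assigned to each cell of the matrix, as stated in the rubric.
SCORES = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def summary_percentage(results: Sequence[str]) -> float:
    """Average the per-scenario scores for one agent, as a percentage."""
    if not results:
        raise ValueError("no scenario results to summarize")
    return 100.0 * sum(SCORES[r] for r in results) / len(results)


# Illustrative column: 8 passes, 2 partials, 2 fails across 12 scenarios.
# (8 * 1.0 + 2 * 0.5 + 2 * 0) / 12 = 9 / 12 = 75%
column = ["pass"] * 8 + ["partial"] * 2 + ["fail"] * 2
print(summary_percentage(column))  # 75.0
```

The same 75% could also come from 6 passes and 6 partials, which is why the matrix cells carry more information than the percentage.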
What the rubric does not measure
This rubric is deliberately limited. There are things it doesn't measure, and you should know what those are before relying on the results.
Long-term production stability. I test skills in isolated scenarios, not in sustained production use. A skill that passes my evaluation may still fail under real-world conditions I didn't test: production traffic patterns, interaction with other skills, unusual inputs over time.
Agent-agnostic portability. I test specific agents, not all possible agents. A skill that passes on Claude may fail on a model I didn't test.
Domain-specific correctness. My scenarios are generic. A code-review skill tested on a generic PR might behave differently on a 10,000-line codebase with unusual conventions.
Cost and latency. I don't currently measure token cost, latency, or resource usage. These matter for production decisions and are on the roadmap but not yet part of the rubric.
Team-fit and adoption friction. My rubric doesn't address whether your team will actually use the skill correctly, whether it conflicts with your existing review culture, or whether it requires training. These are real constraints I can't evaluate from outside your team.
My evaluations give you information about how a skill behaves under my test conditions. Whether that information applies to your situation is your judgment call.
Version history
- v0.3 · Apr 15, 2026 · Current
  - Added false activation as a required scenario category
  - Clarified partial vs fail grading rules
  - Added section on what the rubric does not measure
Contribute
This rubric is public and version-controlled. You can:
- File an issue if you think a criterion is unclear, missing, or wrongly weighted. Link: github.com/selvaganapathyc/startupengineering-content/issues
- Propose a change via pull request on the rubric document. Significant changes become the next version, with credit in the version history.
- Apply the rubric yourself. If you evaluate a skill using this rubric and publish your results, I'll link to your evaluation from the skill's page on startupengineering.io.
The rubric gets better by being used and critiqued. Your input is welcome.
Citing the rubric
If you reference the rubric or apply it to your own evaluations, please cite the specific version:
Ganapathy, S. (2026). Rubric for evaluating agent skills for startup engineering contexts, v0.3. startupengineering.io/rubric.
Attribution matters because the rubric evolves. An evaluation done against v0.3 differs from one against v0.5 in ways that affect comparability. Naming the version makes the difference traceable.