Claude Skills V2 — A Skill System Evolved with Benchmarking and Automated Evaluation

A breakdown of the key changes in Claude Code Skills V2. A built-in benchmarking system measures skill effectiveness numerically, and Skill Creator now automates everything from test case generation to iterative improvement. New frontmatter options and improved implicit triggering are also covered.

Overview

Anthropic has announced a major update to Claude Code Skills. The most prominent change is the introduction of a built-in benchmarking system. You can now quantify whether a skill actually improves output quality through A/B testing, and Skill Creator V2 automates the entire lifecycle from test case generation through iterative improvement. New frontmatter options also provide fine-grained control over how skills execute.

Two Skill Categories: Capability Uplift vs. Inquiry Preference

Anthropic has formally divided skills into two categories.

Capability Uplift Skills

Skills that enable the model to do something it fundamentally cannot do on its own. Specific API call patterns and external tool integrations fall here. This type of skill may become unnecessary as the model improves — once the model absorbs the capability itself, the skill is redundant.

Inquiry Preference Skills

Skills that enforce a user’s specific workflow or preferences. Examples: “always respond in Korean,” “follow the security checklist on every PR review.” This type will never be deprecated, because it captures requirements that are inherently user-specific, regardless of how powerful the model becomes.

This classification matters because of the benchmarking system described next. Capability Uplift skills can be retired based on benchmark results when the model has absorbed the underlying capability.

Benchmarking System: Proving a Skill’s Value with Data

This is V2’s flagship feature — a built-in evaluation system that quantitatively measures whether a skill actually improves output quality.

How It Works

Multi-agent support lets A/B tests run in parallel: one agent with the skill and one without perform the same task, and both outputs are scored against the same evaluation criteria.

Example Auto-Generated Evaluation Criteria

Seven criteria Skill Creator automatically generated for a social media post generation skill:

| # | Criterion | Description |
| --- | --- | --- |
| 1 | Platform coverage | Was a post generated for every specified platform? |
| 2 | Language match | Was it written in the requested language? |
| 3 | X character limit | Does the X (Twitter) post respect the character limit? |
| 4 | Hashtags | Were appropriate hashtags included? |
| 5 | Factual content | Is the content factually consistent with the source material? |
| 6 | Tone differentiation | Is the tone appropriately differentiated per platform? |
| 7 | Tone compliance | Does it follow the specified tone guidelines? |

If scores differ meaningfully with and without the skill, the skill has value. If scores are similar, the model already has the capability and the skill is unnecessary.
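
The comparison step can be sketched in a few lines. This is an illustrative reconstruction, not Anthropic's implementation: the criteria names mirror the table above, and `pass_rate`, `skill_verdict`, and the 0.1 lift threshold are hypothetical choices.

```python
# Hypothetical sketch of the A/B comparison: score each agent's output
# against the same binary criteria, then compare pass rates.
CRITERIA = [
    "platform_coverage",
    "language_match",
    "x_char_limit",
    "hashtags",
    "factual_content",
    "tone_differentiation",
    "tone_compliance",
]

def pass_rate(results: dict) -> float:
    """Fraction of the criteria the output satisfies."""
    return sum(bool(results[c]) for c in CRITERIA) / len(CRITERIA)

def skill_verdict(with_skill: dict, without_skill: dict,
                  threshold: float = 0.1) -> str:
    """If the skill's lift over the baseline exceeds the threshold, keep it."""
    lift = pass_rate(with_skill) - pass_rate(without_skill)
    if lift > threshold:
        return "keep: skill adds measurable value"
    return "retire: model already handles this"
```

If the two agents score nearly identically, the verdict falls out automatically: the skill is pure prompt overhead.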

Skill Creator V2: Automate the Full Lifecycle

With Skill Creator upgraded to V2, it goes beyond simple generation to automate the entire skill lifecycle.

Installation and Usage

  1. Run /plugin
  2. Search for “skill creator skill” and install
  3. Describe the desired skill in natural language
  4. Automatic: skill generation → test case generation → benchmark execution → result review

The Automated Loop

Improving existing skills is also supported. Hand an existing skill to Skill Creator and it benchmarks current performance, identifies areas for improvement, and optimizes iteratively.
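
The shape of that loop can be sketched as follows. This is structure only, under assumed stand-ins: `run_benchmark` and `generate_variant` are hypothetical callables representing the benchmark run and the model's proposed revision, and the target score and round budget are invented for illustration.

```python
# Illustrative sketch of a benchmark-driven improvement loop.
# run_benchmark and generate_variant are hypothetical stand-ins,
# not part of any documented Skill Creator API.
def improve_skill(skill, run_benchmark, generate_variant,
                  target: float = 0.9, max_rounds: int = 5):
    """Benchmark, revise, and re-benchmark until the target score or budget."""
    best, best_score = skill, run_benchmark(skill)
    for _ in range(max_rounds):
        if best_score >= target:
            break                                # good enough: stop iterating
        candidate = generate_variant(best)       # model proposes a revision
        score = run_benchmark(candidate)         # A/B-style evaluation
        if score > best_score:                   # keep only strict improvements
            best, best_score = candidate, score
    return best, best_score
```

The key property is that every revision is gated by the benchmark, so the loop can only move a skill's measured score upward.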

Built-in progressive disclosure guidance walks users through skill creation step by step, making it accessible even for those without prior skill-writing experience.

Improved Implicit Triggering

Previous versions had reliability issues with implicit triggers (auto-execution without a slash command). V2 has the Skill Creator perform description optimization alongside skill generation, significantly improving implicit triggering accuracy. The skill’s description is automatically refined to communicate more clearly to the model when to invoke it.

New Frontmatter Options

New frontmatter options in V2 enable fine-grained control over skill behavior.

| Option | Description |
| --- | --- |
| `user_invocable: false` | Only the model can trigger the skill; users cannot invoke it directly |
| `user_enable: false` | Users cannot invoke the skill via slash command |
| `allow_tools` | Restrict which tools the skill can use |
| `model` | Specify the model the skill runs with |
| `context: fork` | Run the skill in a sub-agent |
| `agents` | Define sub-agents (requires `context: fork`) |
| `hooks` | Define per-skill hooks in YAML format |
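
As a sketch of how several of these options might combine, here is a hypothetical skill frontmatter. The keys come from the table above, but the value shapes, especially for `agents`, and the skill and tool names are assumptions, not a documented schema.

```yaml
---
name: security-review            # hypothetical skill name
description: Apply the team security checklist on every PR review
model: claude-sonnet-4-5         # illustrative value; pins the model the skill runs with
allow_tools:                     # restrict the skill to read-only tools
  - Read
  - Grep
context: fork                    # execute in a sub-agent, off the main context
agents:                          # assumed shape; requires context: fork
  - name: reviewer
    prompt: Apply the security checklist and report findings
---
```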

The context: fork + agents combination is particularly interesting. It delegates skill execution to a separate sub-agent, so the skill works independently without contaminating the main context. The benchmarking system’s multi-agent A/B test also runs on this foundation.

user_invocable: false is useful for creating “background skills” that aren’t exposed to users and are invoked internally by the model based on its own judgment.
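
A minimal background skill might look like the following. Again, the name and description are invented for illustration; only the `user_invocable` key comes from the option table.

```yaml
---
name: style-guard                # hypothetical background skill
description: Invoke when generated text drifts from the team style guide
user_invocable: false            # hidden from users; only the model triggers it
---
```

Because the description is the trigger surface for implicit invocation, a background skill like this leans entirely on the description optimization discussed above.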

Insights

The core of this V2 update is that the effectiveness of a skill can now be measured objectively.

Until now, skills operated on the assumption that “adding a skill will make things better.” With built-in benchmarking, you can finally determine with data whether a skill actually improves output quality, or whether you’re adding unnecessary prompt overhead on top of something the model already handles well.

The Capability Uplift vs. Inquiry Preference classification is equally practical. Instead of treating all skills identically, it provides a framework for distinguishing skills that should naturally be retired as the model advances from skills that should be maintained permanently.

Skill Creator V2 automating the generation-evaluation-improvement loop dramatically lowers the barrier to entry. Skill writing used to be squarely in the domain of prompt engineering. Now you just describe what you want, and an optimized, benchmark-validated skill comes out the other end. The skill ecosystem is set to grow rapidly in both quantity and quality.
