Escaping Skill Hell: A Governance Playbook for AI Agent Skills

We are moving very quickly from framework hell into skill hell.
The first wave of AI agent adoption was about excitement: install a coding assistant, wire in a few prompts, add some slash commands, and watch the agent move faster than your backlog. But the second wave is where IT leaders, FinOps practitioners, platform teams, and tenant administrators start asking the uncomfortable questions:
- Why is every request getting more expensive?
- Why does the agent behave differently for different teams?
- Who approved this skill to run automatically?
- Why are we loading 50 tiny instructions when the user only asked for one small task?
- How do we scale this without turning our AI platform into a junk drawer?
That is the real enterprise problem. Not whether agents can use skills. They can. The problem is whether your organization can govern skills as reusable business capabilities instead of letting them become another unmanaged automation layer.
Matt Pocock’s mattpocock/skills repository is useful because it brings discipline to this chaos. It treats skills as small, composable, practical workflows rather than giant magical frameworks. The repository describes skills as tools for “real engineering,” designed to stay small, adaptable, and composable rather than taking over the entire process. Anthropic’s Agent Skills documentation also reinforces the core architectural idea: skills are filesystem-based bundles of instructions, metadata, and optional resources that can be loaded progressively instead of dumping everything into the model upfront.
That matters for governance.
A skill is not just a prompt. A skill is a unit of operational behavior. Once you see it that way, you can manage it like any other enterprise capability: with ownership, lifecycle, routing, cost controls, and auditability.
This article gives you a mental model for escaping skill hell using four levers:
- Trigger: decide who or what is allowed to invoke the skill.
- Structure: keep the core instruction small and move reference material behind gates.
- Steering: use shared vocabulary to make agent behavior predictable.
- Pruning: delete everything that does not change outcomes.
The technical audience may call this prompt architecture. I would call it AI operating discipline.
One important scope note: I am deliberately removing any promotional references to AI Hero or external courses. The useful implementation reference here is the public mattpocock/skills repository and, specifically, the /writing-great-skills skill pattern inside that repository.
The Big Mental Model: Skills Are Like Corporate Apps
Think about how enterprise IT manages SaaS applications.
You would never give every employee every app, every permission, every workflow, and every dataset by default. You would define access groups, roles, policies, ownership, lifecycle, and cost allocation.
AI skills deserve the same thinking.
| Enterprise app governance | AI skill governance |
|---|---|
| App is available in the catalog | Skill is available in the agent workspace |
| User must be assigned access | User or model must be allowed to invoke the skill |
| App has an owner | Skill has a business or platform owner |
| App consumes license or usage cost | Skill consumes tokens, tool calls, and review time |
| App has lifecycle management | Skill needs versioning, pruning, and retirement |
| App has risk classification | Skill needs autonomy and data-sensitivity classification |
The mistake many teams make is treating skills like harmless markdown files.
They are not harmless. They shape agent behavior. They consume context. They can trigger tools. They can influence architectural decisions. And if left unmanaged, they create a confusing operating environment where nobody knows which instruction actually caused which behavior.
That is skill hell.
What Is Skill Hell?
Skill hell is what happens when an organization installs too many agent skills without a governance model.
It usually starts innocently:
- One team adds a planning skill.
- Another team adds a code review skill.
- A product team adds a PRD skill.
- A platform team adds a deployment checklist.
- A security team adds secure coding instructions.
- Someone keeps all the old versions because “we might need them later.”
Then the agent starts behaving like a new employee who attended 47 onboarding sessions and remembered the wrong five.

The symptoms are predictable.
| Symptom | What it feels like | Governance root cause |
|---|---|---|
| Rising token usage | Every request feels heavier than it should | Too much always-visible metadata or instruction text |
| Inconsistent execution | The agent sometimes uses the right skill and sometimes ignores it | Overreliance on autonomous model invocation |
| Conflicting behavior | One skill says “ask first,” another says “act immediately” | No skill design standards or ownership |
| Bloated workflows | Simple tasks trigger enterprise-sized plans | Main skill files contain too much reference material |
| Poor auditability | Nobody knows why the agent made a decision | Skills are unmanaged and unversioned |
| Stale guidance | Old templates keep showing up | No retirement or pruning process |
The goal is not to eliminate skills. The goal is to make skills intentional.
Source Coverage Map
The original knowledge source behind this article had a very specific checklist. The final version keeps the full structure, but reframes the material for governance, FinOps, and tenant administration.
| Original concept | Preserved in this article as |
|---|---|
| Skill hell as the modern version of framework or tutorial hell | The governance failure mode where unmanaged skills create cost, confusion, and inconsistent execution |
| Four-part framework | Trigger, Structure, Steering, and Pruning remain the backbone of the article |
| Model-invoked vs. user-invoked skills | Reframed as automatic doors vs. badge readers, with cost and risk implications |
| Context load vs. cognitive load | Reframed as a budget and operating model tradeoff |
| Steps vs. reference | Reframed as procedure vs. reference material, with a skill content classification model |
| Context pointers and branch-specific references | Reframed as progressive disclosure and conditional references |
| Leading words such as “vertical slice” | Reframed as policy labels that compress intent and steer behavior |
| Forcing the leg work | Reframed as splitting discovery, planning, execution, and review |
| DRY, sediment, and no-ops | Reframed as measurable pruning targets |
| /writing-great-skills | Included in the next steps as an audit tool from the mattpocock/skills repository |
Directional Cost Intuition: The Hidden Tax of Context Bloat
Token pricing varies by model, provider, region, contract, caching strategy, and product surface. Treat the math below as a directional planning aid, not a quote.
The key point is simple: every token you load unnecessarily is a small tax. At enterprise scale, small taxes become budget lines.
Imagine the following:
- You have 60 model-invoked skills.
- Each skill exposes a 120-token description or routing hint.
- That is roughly 7,200 tokens of always-visible skill metadata before the user has even asked anything meaningful.
- Your platform handles 100,000 agent requests per month.
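That creates roughly 720 million input tokens of skill metadata per month (7,200 tokens × 100,000 requests).
At a directional input price of $2 to $5 per million input tokens, that can represent roughly $1,440 to $3,600 per month spent on metadata alone.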
Again, this is not a hard quote. It is a way to build intuition.
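The arithmetic in this example can be reproduced with a short script, which makes it easy to substitute your own skill counts and request volumes. This is a minimal sketch: the constants come from the scenario above, the function names are illustrative, and the flat per-million price ignores caching, tiering, and contract discounts.

```python
# Directional estimate only: the figures come from the example scenario,
# and the price range is a planning assumption, not a quote.
SKILL_COUNT = 60
TOKENS_PER_DESCRIPTION = 120
REQUESTS_PER_MONTH = 100_000
PRICE_RANGE_USD_PER_MILLION = (2.0, 5.0)


def monthly_metadata_tokens(skills: int, tokens_each: int, requests: int) -> int:
    """Tokens of always-visible skill metadata sent per month."""
    return skills * tokens_each * requests


def monthly_cost_usd(tokens: int, price_per_million: float) -> float:
    """Input-token cost at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


tokens = monthly_metadata_tokens(SKILL_COUNT, TOKENS_PER_DESCRIPTION, REQUESTS_PER_MONTH)
low, high = (monthly_cost_usd(tokens, p) for p in PRICE_RANGE_USD_PER_MILLION)
print(f"{tokens:,} tokens/month -> ${low:,.0f} to ${high:,.0f} per month")
```

Halving the number of model-invoked skills, or halving the description length, halves this line item, which is why short descriptions and a restricted autonomous set are the first controls to reach for.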
And this example only counts skill metadata. It does not include:
- the user’s actual request,
- conversation history,
- retrieved documents,
- tool outputs,
- generated responses,
- retries,
- failed runs,
- evaluation runs,
- or human review time.
The FinOps lesson: context is inventory. If you carry too much of it into every request, you pay storage rent in the form of tokens.
| Design choice | Cost intuition | Governance implication |
|---|---|---|
| Always-visible skill descriptions | Small per request, large at scale | Keep descriptions short and limited to approved model-invoked skills |
| Large skill bodies loaded upfront | High context cost and higher confusion risk | Move reference material into separate files loaded only when needed |
| Autonomous skill selection | Convenient, but can trigger wrong workflows | Use only for high-confidence, low-risk patterns |
| User-invoked slash commands | Lower autonomous risk, but users must know what to call | Best for expensive, sensitive, or specialized workflows |
| Prompt caching | Can reduce repeated-input cost where supported | Useful for stable, repeated platform instructions, but not a substitute for pruning |
The best cost control is not a cheaper model. It is not sending unnecessary tokens in the first place.
Lever 1: Trigger: Decide Who Gets to Pull the Fire Alarm
The first governance decision is not what the skill says.
It is how the skill gets invoked.
Skills generally fall into two practical activation patterns:
- User-invoked skills: a person explicitly calls the skill, often through a slash command.
- Model-invoked skills: the agent decides when a skill is relevant, usually based on skill metadata such as the description.
Both are useful. Both are dangerous when used lazily.
| Dimension | User-invoked skill | Model-invoked skill |
|---|---|---|
| Trigger | Human explicitly calls it | Agent decides based on the task |
| Predictability | High | Medium to low depending on descriptions and task ambiguity |
| Cognitive load | Higher for users | Lower for users |
| Context load | Lower if not exposed broadly | Higher if many skill hints are always discoverable |
| Best for | Costly, sensitive, specialized, approval-heavy workflows | Frequent, low-risk, high-confidence workflows |
| Failure mode | Users forget it exists | Agent invokes the wrong skill or ignores the right one |
| Governance posture | Safer default | Needs stronger review and monitoring |
The mental model: a model-invoked skill is like an automatic door; a user-invoked skill is like a badge reader.

Automatic doors are great for the lobby. They are not great for the data center.
Rule of Thumb
Default to user-invoked skills for anything that is costly, risky, customer-facing, security-sensitive, or architecturally significant.
Use model-invoked skills only when:
- the task pattern is easy to identify,
- the blast radius is low,
- the skill is small,
- the description is precise,
- and the cost of accidental invocation is acceptable.
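The rule of thumb above can be written down as a small review helper so that every skill gets the same questions. This is a sketch under stated assumptions: `SkillProfile` and its fields are hypothetical review answers, not fields from any real skill format, and the function only recommends a mode; it does not configure an agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillProfile:
    # Hypothetical review answers; none of these names come from a real skill schema.
    easy_to_identify: bool
    low_blast_radius: bool
    small: bool
    precise_description: bool
    accidental_invocation_acceptable: bool
    # Costly, risky, customer-facing, security-sensitive, or architecturally significant.
    high_stakes: bool


def recommended_invocation(profile: SkillProfile) -> str:
    """Return 'user-invoked' unless every condition for model invocation holds."""
    if profile.high_stakes:
        return "user-invoked"
    conditions = (
        profile.easy_to_identify,
        profile.low_blast_radius,
        profile.small,
        profile.precise_description,
        profile.accidental_invocation_acceptable,
    )
    return "model-invoked" if all(conditions) else "user-invoked"
```

The design choice is that user invocation is the fallback: a single unmet condition keeps the skill behind the badge reader.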
Anthropic’s Claude Code documentation notes that skills can be used when relevant or invoked directly with /skill-name, and that skill body content loads only when used. Its documentation also describes metadata fields that influence visibility and automatic invocation behavior, including controls such as whether a skill appears as a slash command and whether model invocation is disabled.
For tenant administrators and platform owners, the key is not the specific YAML field. The key is the operating policy:
Do not let every skill become autonomous just because autonomy feels modern.
Autonomy without routing discipline is how you get expensive randomness.
Lever 2: Structure: Keep the Main Thread Clean
A well-designed skill has two layers:
- The procedure: what the agent must do.
- The reference material: templates, examples, policies, glossaries, checklists, and supporting knowledge.
Most bad skills fail because they mix these together.
They become 400-line instruction dumps that try to cover every branch, exception, template, example, and philosophical preference in one file. That feels thorough. It is usually just expensive.
The better approach is progressive disclosure: load the smallest useful instruction first, then pull extra reference material only when the branch actually needs it. Anthropic describes this pattern directly in its Agent Skills documentation: skills can contain metadata, instructions, and optional resources, with information loaded in stages as needed rather than consuming context upfront.
The Restaurant Menu Analogy
A good skill is like a restaurant menu.
The main menu should show the dishes, not the full supplier contract, oven manual, kitchen rota, and allergen database.
If the customer orders the pasta, the kitchen can pull the pasta recipe. If they order dessert, the kitchen can pull the dessert recipe. But you do not put every recipe in front of every customer every time.
Skills should work the same way.
| Skill component | Should live in main skill file? | Why |
|---|---|---|
| Purpose | Yes | The agent needs to know what the skill is for |
| Invocation boundary | Yes | The agent needs to know when not to use it |
| Core steps | Yes | This is the operating procedure |
| One or two critical rules | Yes | High-signal constraints belong upfront |
| Long templates | No | Load only when needed |
| Detailed examples | Usually no | Useful as references, noisy as default context |
| Glossaries | Usually no | Keep behind a pointer unless required every time |
| Edge-case policy | No | Move to branch-specific references |
| Historical rationale | No | Archive it elsewhere unless it changes behavior |
Example: PRD Skill
A skill that creates a Product Requirements Document should not carry the entire PRD template, every example PRD, and every product philosophy note in the main instruction file.
The main skill should say:
- clarify the customer and business objective,
- identify decision gaps,
- confirm constraints,
- write the PRD using the approved template,
- and only then load the template reference.
That gives you the behavior without dragging the whole policy binder into every interaction.
A more concrete example from the original source is a /to-prd skill. Its procedural steps might be simple: find the relevant context, confirm the important test seams with the user, and write the PRD. The reference material should sit elsewhere: the definition of a test seam, the approved PRD template, example language, and any formatting rules.
The same pattern applies to branching workflows. A /domain-modeling skill may sometimes update a local glossary such as context.md, and other times create an Architectural Decision Record. Those branches should not force every run to load every glossary rule and every ADR template. Instead, the main skill should use a context pointer: a short conditional line that names the reference to read, and tells the agent to read it only when that branch is taken.
That is the essence of good skill architecture: the core path stays clean, and the branch pays the context cost only when the branch is actually taken.
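To make the cost effect of branch-specific references concrete, here is a minimal sketch of conditional loading. It only simulates the behaviour: in practice the agent follows the pointer itself. The branch names and reference paths are hypothetical, and the layout assumes a main instruction file named `SKILL.md` inside the skill directory.

```python
from pathlib import Path

# Hypothetical layout: branch name -> reference file that only that branch needs.
BRANCH_REFERENCES = {
    "update-glossary": "references/glossary-rules.md",
    "record-decision": "references/adr-template.md",
}


def build_context(skill_dir: Path, branch: str) -> str:
    """Return the core instructions plus the reference for the chosen branch only."""
    parts = [(skill_dir / "SKILL.md").read_text(encoding="utf-8")]
    reference = BRANCH_REFERENCES.get(branch)
    if reference is not None:
        # Only the branch that is actually taken pays for its reference material.
        parts.append((skill_dir / reference).read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```

A run that takes neither branch loads only the core file, so the ADR template and glossary rules cost nothing on that request.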
Governance Move: Classify Skill Content
Tenant administrators and platform owners should classify skill content into three buckets.
| Content class | Description | Governance action |
|---|---|---|
| Always-needed instructions | Required for every execution | Keep short and inside the main skill |
| Branch-specific references | Needed only for certain outputs | Put in separate files and reference conditionally |
| Rare or historical material | Useful occasionally, but not operationally critical | Archive, link externally, or remove |
This is where FinOps and architecture meet. Good structure reduces cost, improves reliability, and makes skills easier to audit.
Lever 3: Steering: Make the Agent Follow the Operating Model
If you have ever watched an agent ignore a clear instruction, you know the pain.
You wrote:
Ask clarifying questions before writing the plan.
The agent replied:
Great. Here is the complete implementation plan.
That is a steering problem.
The fix is not always more text. Often, the fix is better language.
Use Leading Words
A leading word is a compact phrase that carries a lot of operational meaning.
For software teams, “vertical slice” is a great example. Instead of writing five paragraphs explaining that the agent should build one end-to-end path through the system before expanding horizontally, use the phrase vertical slice repeatedly and deliberately.
Why does this work? Because strong domain language compresses intent. In agent environments that expose planning or reasoning summaries, you will often see the chosen vocabulary show up in the model’s planning language. That is the point: the phrase becomes a steering handle, not just a nice label.
For business and IT audiences, think of leading words as policy labels.
| Weak instruction | Stronger leading word or phrase | Why it works better |
|---|---|---|
| Do not build everything at once | Vertical slice | Encodes delivery sequence and scope control |
| Ask better questions first | Interrogate assumptions | Signals a stronger discovery behavior |
| Do not make weird architecture choices | Respect the domain model | Anchors output to known business concepts |
| Keep costs reasonable | Cost-aware execution | Frames cost as a design constraint |
| Do not overuse tools | Tool-minimal path | Gives the agent a routing preference |
| Avoid risky autonomous actions | Human approval gate | Creates a clear control point |
This is not just writing style. It is behavior design.
Force the Leg Work by Splitting the Skill
Agents often rush to the final deliverable because they are optimized to be helpful. Unfortunately, “helpful” can become “prematurely confident.”
If you ask one skill to perform discovery, challenge assumptions, design the architecture, write the plan, generate issues, and draft the rollout email, do not be surprised when it cuts corners.
Split the workflow.
| Phase | Skill behavior | Governance value |
|---|---|---|
| Discovery | Ask hard questions, identify gaps, clarify business objective | Reduces rework and bad assumptions |
| Planning | Produce the PRD, implementation plan, or architecture note | Creates a reviewable artifact |
| Execution | Generate code, configuration, or operational steps | Keeps action separate from planning |
| Review | Validate against standards, cost, risk, and policy | Adds control before rollout |
This is the same governance pattern we use in enterprise change management:
- assess,
- plan,
- implement,
- validate.
The AI version should not be different simply because the actor is a model.
Practical Example: Safe Rollout of an Agent Skill
Here is a simple rollout path for enterprise teams.
| Stage | What to do | Administrative lever |
|---|---|---|
| 1. Sandbox | Test the skill with synthetic or non-sensitive scenarios | Isolated workspace or pilot group |
| 2. Named pilot | Enable for a small group of expert users | User-invoked only |
| 3. Cost baseline | Measure average input tokens, output tokens, retries, and tool calls | FinOps dashboard or usage export |
| 4. Behavior review | Compare outputs against expected patterns | Human review checklist |
| 5. Limited production | Enable for more users, but keep autonomous invocation disabled | Controlled rollout group |
| 6. Autonomous consideration | Allow model invocation only if pattern is low-risk and high-confidence | Approval gate from platform owner |
| 7. Lifecycle review | Reassess after 30 to 60 days | Retire, prune, or promote |
The strategic point: do not move a skill from “useful” to “automatic” without evidence.
Lever 4: Pruning: The Deletion Test
Once a skill works, your next job is to make it smaller.
This feels counterintuitive. Most teams add more instructions every time something goes wrong. A bad output appears, someone adds another rule, and the markdown file grows like sediment at the bottom of a lake.
That is how skills become slow, expensive, and contradictory.
A mature skill governance program uses the deletion test.

If removing an instruction does not change the output quality, delete it.
Three Things to Prune
| Prune target | What it looks like | What to do |
|---|---|---|
| Repetition | The same rule appears in five skills | Move it to one shared reference or platform policy |
| Sediment | Old edge cases, stale wording, abandoned preferences | Remove or archive |
| No-ops | Instructions that sound good but do not alter behavior | Delete after testing |
A classic no-op is this kind of sentence:
Write a clear, detailed, high-quality response.
Or, in a developer workflow:
Write a detailed, descriptive commit message.
The agent was probably going to try that anyway. Delete the sentence, run the same task again, and compare the output. If the quality does not change, the instruction was not a control. It was token decoration.
If the instruction does not create a measurable behavior difference, it is not governance. It is decoration.
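The deletion test can be run as a simple loop once you have any repeatable way to score outputs. The sketch below assumes a hypothetical `score` callable that runs the same task with a given instruction set and returns a quality number; how you build that evaluator (rubric, reviewer ratings, automated checks) is outside this example.

```python
from typing import Callable, Sequence


def deletion_test(
    instructions: Sequence[str],
    score: Callable[[Sequence[str]], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return instructions whose individual removal does not lower the score beyond tolerance."""
    baseline = score(instructions)
    removable = []
    for i, instruction in enumerate(instructions):
        # Re-run the task with exactly one instruction removed.
        without = [*instructions[:i], *instructions[i + 1:]]
        if score(without) >= baseline - tolerance:
            removable.append(instruction)
    return removable
```

Each instruction is tested on its own, so two rules that cover for each other may both look removable. Delete one candidate at a time and re-run before deleting the next.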
What to Measure
For FinOps and platform teams, pruning should not be subjective. Track the operational signals.
| Metric | Why it matters |
|---|---|
| Average input tokens per run | Shows whether skills are carrying too much context |
| Average output tokens per run | Reveals verbosity and runaway generation |
| Retry rate | Indicates unclear instructions or poor routing |
| Human correction rate | Shows whether the skill is useful in practice |
| Tool-call count | Helps identify over-automation |
| Skill invocation rate | Shows whether users or models actually use the skill |
| Stale skill count | Measures governance hygiene |
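Most of these signals can be derived from whatever run logs your platform already exports. The following is a sketch only: the `Run` record and its field names are hypothetical, and a real export will need mapping into this shape.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    # Hypothetical log record; field names are illustrative, not a real export format.
    skill: str
    input_tokens: int
    output_tokens: int
    tool_calls: int
    retried: bool
    human_corrected: bool


def summarize(runs: list[Run]) -> dict[str, float]:
    """Aggregate the pruning signals for a non-empty list of runs of one skill."""
    n = len(runs)
    return {
        "avg_input_tokens": mean(r.input_tokens for r in runs),
        "avg_output_tokens": mean(r.output_tokens for r in runs),
        "avg_tool_calls": mean(r.tool_calls for r in runs),
        "retry_rate": sum(r.retried for r in runs) / n,
        "human_correction_rate": sum(r.human_corrected for r in runs) / n,
    }
```

Comparing the summary before and after a pruning change gives the deletion test a number to argue with instead of an opinion.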
You do not need a perfect evaluation system on day one. But you do need a habit of asking:
Is this skill still earning its place in the platform?
A Decision Guide for IT Leaders and Tenant Administrators
Use this quick guide when reviewing a new AI agent skill.
| Question | If yes | If no |
|---|---|---|
| Does the skill support a clear business process? | Assign an owner and evaluate it | Do not onboard it yet |
| Could it affect customer data, security, architecture, or production systems? | Keep it user-invoked and approval-gated | Consider lighter controls |
| Is the task frequent and low-risk? | Consider model invocation after testing | Keep manual invocation |
| Is the main skill file short and procedural? | Good candidate for pilot | Refactor before rollout |
| Does it include long templates or examples inline? | Move references behind conditional pointers | Keep as is |
| Can you measure usage and cost? | Pilot with baselines | Add observability before scale |
| Does it have a retirement path? | Add to lifecycle review | Define one before approval |
The most important question is not “Can the agent use this?”
The better question is:
Should this behavior become part of our AI operating model?
The Governance Model: Treat Skills Like a Product Catalog
For enterprise adoption, I recommend managing skills as a catalog.
Each skill should have a simple record.
| Field | Example |
|---|---|
| Skill name | /to-prd |
| Business purpose | Generate a product requirements document from clarified requirements |
| Owner | Product platform team |
| Invocation mode | User-invoked by default |
| Autonomy level | Low, medium, or high |
| Data sensitivity | Public, internal, confidential, regulated |
| Cost profile | Low, medium, or high expected token/tool usage |
| Dependencies | Templates, glossary, ADR format |
| Review cycle | Every 60 days |
| Retirement criteria | Low usage, high correction rate, superseded by another skill |
This does not need to be heavy bureaucracy. A simple markdown catalog or internal wiki page is enough to start.
What matters is that skills get owners and lifecycle management.
Without that, your AI platform becomes an unmanaged collection of clever prompts.
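If the catalog later outgrows a wiki page, the same record translates directly into a small data structure. This is a minimal sketch: the field names mirror the table above, while the types, defaults, and the `review_due` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class SkillRecord:
    # Field names follow the catalog table; types and defaults are assumptions.
    name: str
    business_purpose: str
    owner: str
    invocation_mode: str = "user-invoked"
    autonomy_level: str = "low"
    data_sensitivity: str = "internal"
    cost_profile: str = "low"
    dependencies: list[str] = field(default_factory=list)
    retirement_criteria: list[str] = field(default_factory=list)
    review_cycle_days: int = 60
    last_reviewed: date = field(default_factory=date.today)

    def review_due(self, today: date) -> bool:
        """True once the review cycle has elapsed since the last review."""
        return today >= self.last_reviewed + timedelta(days=self.review_cycle_days)
```

Defaulting `invocation_mode` to user-invoked keeps the safer posture unless someone records an explicit decision to change it.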
Legacy vs. Modern Skill Architecture
| Legacy skill design | Modern governed skill design |
|---|---|
| Big instruction files | Small procedural files |
| Every skill is autonomous | Invocation mode is risk-based |
| Templates are embedded everywhere | Templates are referenced conditionally |
| No cost model | Directional token and tool-call baseline |
| No ownership | Named business or platform owner |
| No retirement | Review and pruning cycle |
| More instructions after every failure | Test, measure, then prune |
| Developer convenience is the only goal | Business value, governance, and reliability matter equally |
This is the transition organizations need to make.
Agent skills are not just developer toys. They are becoming part of the enterprise automation fabric.
Practical Next Step: Audit Your Existing Skills
You do not need to invent the evaluation framework from scratch.
The mattpocock/skills repository includes a writing-great-skills skill under the productivity skills folder. Use it as a structured audit lens for your own internal skill catalog. The point is not to copy every pattern blindly. The point is to ask better governance questions:
- Trigger: Is this skill user-invoked or model-invoked, and is that appropriate for its risk level?
- Structure: Is the main skill file mostly procedural, or is it carrying too much reference material?
- Steering: Does it use strong leading words that compress intent?
- Pruning: Which instructions are duplicated, stale, or no-ops?
- Ownership: Who approves changes to this skill?
- Cost profile: What is the average token and tool-call footprint per execution?
- Retirement rule: When should this skill be merged, archived, or deleted?
For enterprise teams, I would run this audit quarterly for shared platform skills and monthly for high-volume autonomous workflows.
Key Takeaways
- Skill hell is a governance failure, not a prompt-writing failure. Too many unmanaged skills create cost, confusion, and inconsistent behavior.
- Context is inventory. If you carry unnecessary instructions into every request, you pay for them repeatedly.
- Default to user-invoked skills for high-risk or high-cost workflows. Autonomy should be earned through evidence, not granted by enthusiasm.
- Keep the main skill file small. Put long templates, examples, and edge cases behind conditional references.
- Use leading words to steer behavior. Strong domain language often works better than long explanations.
- Split discovery, planning, execution, and review. Do not ask one skill to do the whole change-management lifecycle in one breath.
- Prune aggressively. If an instruction does not change behavior, it is token waste.
- Manage skills as a catalog. Ownership, lifecycle, risk classification, and cost visibility are what turn clever prompts into an enterprise capability.
Final Opinion: The Future Belongs to Small, Governed Skills
The winning enterprise AI platforms will not be the ones with the most skills.
They will be the ones with the clearest operating model.
The best skills are small. They are opinionated. They have boundaries. They do one job well. They load reference material only when needed. They are easy to audit, easy to retire, and easy to explain to a business owner.
That is how you escape skill hell.
Treat your AI instructions like code. Treat your skills like products. Treat your context window like a budget.
And above all: stop confusing “more automation” with “better governance.”
Sources and Validation Notes
The strategic guidance in this article was validated against the following public sources as of 2026-07-04:
- Matt Pocock’s mattpocock/skills repository describes the skills as small, adaptable, composable workflows for real engineering rather than monolithic process frameworks: https://github.com/mattpocock/skills
- The repository includes a writing-great-skills productivity skill that can be used as a reference pattern for evaluating and improving skill design: https://github.com/mattpocock/skills/tree/main/skills/productivity/writing-great-skills
- Anthropic’s Agent Skills documentation describes skills as modular, filesystem-based capabilities with instructions, metadata, and optional resources that can be loaded progressively: https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
- Anthropic’s Claude Code skills documentation states that skills can be used automatically when relevant or directly invoked with /skill-name, and that the skill body loads only when used: https://code.claude.com/docs/en/skills
- Anthropic’s public pricing page was used only to validate the general pricing model of per-million-token input/output billing and caching concepts. The directional math in this article is intentionally illustrative and should not be treated as contractual pricing: https://platform.claude.com/docs/en/about-claude/pricing


