Open-source · Claude Code

The development pipeline that trades speed for rigor

STRUT is a spec-first, TDD-enforced orchestration layer for Claude Code. Instead of generating apps, it processes changes through a structured Read Truth → Process Change → Update Truth cycle, with human gates at the points where judgment matters.

Get started · See the architecture
S: Source-anchored
T: Test-driven
R: Research-grounded
U: User-gated
T: Traceable
The idea

AI coding tools are fast. That's the problem.

The METR randomized controlled trial found experienced developers believed AI made them 24% faster, but were actually 19% slower. The bottleneck wasn't generation speed. It was intent clarity, spec quality, and review rigor.

STRUT is designed around that finding. Every change goes through classification, specification, test-first implementation, fail-fast review, and build verification, with human gates at the two points where judgment can't be automated: spec approval and PR review.

Architecture

Three phases, 19 components, zero trust by default

8 orchestrator skills, 10 worker agents, 1 bash script. All inter-agent communication passes through structured JSON file contracts in .pipeline/. Orchestrators read only status fields, never content.
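As a concrete sketch of that contract pattern: the file name and fields below are illustrative assumptions, not STRUT's actual schema, but they show the status-only read an orchestrator performs.

```shell
# Hypothetical contract file -- the name and fields are assumptions,
# not STRUT's actual schema.
mkdir -p .pipeline
cat > .pipeline/spec-review.json <<'EOF'
{"status": "pass", "iterations": 2, "content": "full review body the orchestrator never reads"}
EOF

# Orchestrator-style read: extract only the status field, ignore content.
status=$(python3 -c "import json; print(json.load(open('.pipeline/spec-review.json'))['status'])")
echo "$status"
```

An orchestrator that branches on the status field alone keeps worker output out of its own context window, which is the point of the contract.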

Phase 1
Read Truth
  • Repo impact scan
  • Classification
  • Modifier assignment
Phase 2
Process Change
  • Derive intent
  • Spec write & review cycle
    • Human gate: spec approval
  • Write tests → write code
  • Scope + criteria review
  • Build check
    • Human gate: PR review
Phase 3
Update Truth
  • Knowledge capture
  • Decision log proposals
  • Friction surface reporting
Honest trade-offs

What STRUT is not

Not a code generator

v0, Bolt, and Lovable generate whole apps fast. STRUT assumes you already have a codebase and processes individual changes through a rigorous pipeline. Different tool, different job.

Not framework-specific

STRUT orchestrates your build, lint, typecheck, and test commands. It has no opinions about your language, framework, or test runner.

Not fast for simple changes

Adding a button doesn't need a 19-component pipeline. STRUT is for changes where getting it wrong costs more than getting it slow.

Not magic

The pipeline surfaces problems. It doesn't fix your architecture. You still need to make judgment calls at every gate. That's the point.

Classification

Ceremony scales to risk

Scan evidence sets two independent binary modifiers for every change. The pipeline adds components only when the risk warrants them.

trust

ON when auth, RLS, schema, encryption, or data immutability detected

Adds MUST NEVER constraints, negative criteria, security review by Opus, mandatory knowledge capture, and describe-flow traceability.

decompose

ON when change crosses 2+ architectural boundaries

Adds task breakdown (up to 5), per-task TDD loop, and a human gate after task 1 to verify the approach before remaining tasks proceed.

Four combinations: standard (both OFF), trust-only, decompose-only, guarded-decompose (both ON, which adds adversarial spec review).
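A minimal sketch of that routing, assuming hypothetical on/off flag values and a classify helper that STRUT itself may not expose this way:

```shell
# Illustrative only: map the two binary modifiers to the four pipeline shapes.
classify() {
  local trust=$1 decompose=$2
  if [ "$trust" = on ] && [ "$decompose" = on ]; then
    echo guarded-decompose   # both ON: adds adversarial spec review
  elif [ "$trust" = on ]; then
    echo trust-only
  elif [ "$decompose" = on ]; then
    echo decompose-only
  else
    echo standard
  fi
}

classify on on    # guarded-decompose
classify off off  # standard
```

Because the modifiers are independent booleans, adding a third modifier doubles the combination count without touching the existing branches, which is what makes the scheme extensible.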

Evidence base

Research-grounded, not vibes-grounded

Every architectural decision maps to tiered citations. The full rationale is in architectural-decisions.md and research-index.md in the repo.

R Peer-reviewed or methodologically transparent research
I Industry reports (stated findings, less transparent methodology)
D Our design judgment, stated explicitly
The speed paradox
R METR RCT (2025) Randomized controlled trial. 16 experienced developers, 246 real tasks. Developers believed AI made them 24% faster but were actually 19% slower. Root cause: prompting AI before clarifying intent.
R Faros AI (2025) Company-wide metrics. Individual developers completed 21% more tasks and merged 98% more PRs, but review time increased 91%. Net organizational improvement: 0%.
I Bain & Company (2025) Coding is 25-35% of idea-to-launch. Speeding up code generation does little if specification and review remain the bottleneck.
Instruction following degrades at scale
R Curse of Instructions (Harada et al., ICLR 2025) Compliance degrades exponentially: success_all = success_individual^N. Grounds STRUT's criteria caps and context budgets.
R AGENTIF (Qi et al., NeurIPS 2025) 707 instructions from 50 real-world agentic applications. Both constraint count and instruction length independently degrade compliance. Tool and condition constraints are the hardest types.
R RECAST (ICLR 2026) Confirms the degradation pattern across models and constraint complexity levels, including the most recent generation.
R CooperBench (2025) Isolated single-task agents succeed ~50% of the time vs. 25% when two agents collaborate. Validates STRUT's one-agent-per-task isolation.
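The compounding in that first finding is easy to illustrate. With hypothetical numbers (a 98% per-instruction compliance rate, N values standing in for criteria counts, neither taken from STRUT or the paper), success_all = p^N falls fast:

```shell
# Hypothetical compliance rate p; the N values are illustrative, not STRUT's caps.
p=0.98
for n in 5 20 50; do
  awk -v p="$p" -v n="$n" 'BEGIN { printf "N=%2d: overall success ~= %.2f\n", n, p^n }'
done
```

At these made-up numbers, fifty simultaneous constraints drop overall compliance to roughly a third, which is the shape of result the criteria caps are defending against.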
AI code quality
R CodeRabbit (2025) 470 PRs analyzed. AI code has 1.7× more issues per PR (10.83 vs 6.45), security vulnerabilities 1.5-2× more frequent, and senior engineers spend 3.6× longer reviewing it.
R GitClear (2024) 211M lines of code analyzed. 8× increase in duplicated code blocks 2020-2024. Copy-paste surpassed refactoring for the first time.
R MSR (2026) ~110,000 open-source PRs. Agent-generated code shows measurably more churn than human-authored code.
R Carnegie Mellon Cursor study (2025) 807 repositories. Code complexity increased 40% after AI tool adoption.
I Apiiro (2025) 322% more privilege escalation paths in AI-generated code vs. human-written. Why STRUT's trust modifier exists.
I Veracode (2025) 45% of AI-generated code introduces OWASP Top 10 vulnerabilities.
I Ox Security (2025) AI code is "highly functional but systematically lacking in architectural judgment."
Review is the bottleneck
I Cognition/Devin (2026) PR volume now exceeds review capacity, leading to rubber-stamp approvals. Why STRUT pre-filters with automated review before human gates.
I ACM Queue (2025) The developer role is shifting from writer to navigator/reviewer. Review is the work, not overhead.
TDD is more critical with AI
R DORA Report (2025) TDD is "more critical than ever" with AI-assisted development. AI amplifies existing practices, good or bad.
R Sol-Ver (Lin et al., 2025) Self-play between test generator and code generator yields 19.63% improvement. The verifier drives quality in the solver.
R AlphaCode (Li et al., Science 2022) Generate-and-filter paradigm. One million candidate programs per problem, 99% filtered through test execution. Tests are the selection mechanism.
R AlphaEvolve (Novikov et al., 2025) Continuous evolutionary optimization of codebases with automated test suites as the fitness function.
R UTBoost (ACL 2025) 40.9% of SWE-bench tasks were affected by test quality issues. Test quality is load-bearing for AI code evaluation.
Specification quality
I GitHub Agent Config Analysis (2026) Teams covering fewer than 4 specification areas saw AI output quality drop below human baseline.
D Our design judgment Where external evidence doesn't exist, we say so. Every [D]-tier decision in the architecture doc is owned as a judgment call, not dressed up as research.
Roadmap

Built for Claude Code today, designed to be AI-agnostic tomorrow

Every architectural decision in STRUT is tagged with its platform coupling. The patterns (spec-first, file contracts, modifier-based risk routing, TDD enforcement) are grounded in LLM research that applies to any model. The current implementation uses Claude Code's skill/agent system. That's a starting point, not a ceiling.

Done
Phase 0: Foundation
Read Truth pipeline built and tested. Classification system with two-modifier architecture. Universal constraints model. Full design documentation with research citations.
run-read-truth · truth-classify · truth-repo-impact-scan · git-tool · universal-constraints · dual-model test runner
Done
Phase 1: Spec refinement pipeline
The spec cycle: derive intent from scan evidence, write a structured spec, review it for quality and testability, loop until it passes or hits the iteration cap.
spec-derive-intent · spec-write · spec-review · run-spec-refinement
Done
Phases 2-5: Full delivery loop
Implementation core (TDD cycle), review chain, build verification, knowledge capture. End-to-end pipeline from change description to merged PR with traceable decisions.
impl-write-tests · impl-write-code · run-review-chain · run-build-check · update-capture · run-strut (entry point)
Next
Additional modifiers
The two-modifier system (trust, decompose) is designed to be extensible. New modifiers can add pipeline steps for specific risk categories without touching the core path. Candidates emerge from observed friction in real pipeline runs.
performance modifier · accessibility modifier · migration modifier · community-contributed modifiers
Future
AI-agnostic abstraction layer
Replace Claude Code-specific orchestrators with a generic workflow engine that reads the same registry files and dispatches to different LLM providers. Agent bodies (markdown task instructions) are already portable. The orchestration glue is the only Claude-specific layer.
multi-provider dispatch · generic workflow engine · model-agnostic frontmatter · team scaling · protocol evolution engine · test wisdom directory
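To make the portability claim concrete, here is a speculative sketch of what that dispatch seam could look like. Everything in it (the function, the provider names, the CLI invocations) is an assumption; no such interface exists in STRUT today.

```shell
# Speculative sketch: route the same portable agent body (markdown instructions)
# to different providers. All names and flags below are assumptions.
dispatch() {
  local provider=$1 agent_body=$2
  case "$provider" in
    claude) echo "would run: claude -p \"\$(cat $agent_body)\"" ;;
    other)  echo "would run: other-llm-cli --prompt-file $agent_body" ;;
    *)      echo "unknown provider: $provider" >&2; return 1 ;;
  esac
}

dispatch claude agents/spec-write.md
```

The agent body stays a plain markdown file either way; only the one-line invocation changes per provider, which is the sense in which the orchestration glue is the sole Claude-specific layer.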
Get started

Three commands, your existing codebase

STRUT integrates into any project with a working build/test pipeline. Claude handles the mechanical setup; you provide the domain knowledge.

# Clone into your project
git clone https://github.com/bretbuilds/strut .claude

# Tell Claude your stack and let it wire up the integration
# "My stack is Next.js + Supabase. Do the Section A integration steps."

# Run the pipeline on a change
/run-strut "Add rate limiting to the auth endpoint"

# Or step through it: pause after every agent, inspect the output, continue or abort
/run-strut --step "Add rate limiting to the auth endpoint"

Step mode is useful for your first few pipeline runs. At each pause you see the completed agent, its output file, and what runs next. Type continue or abort at each step. The flag is per-invocation, so omitting it on a resume returns to normal flow.