Open-source · Claude Code

The development pipeline that trades speed for rigor

STRUT is a spec-first, TDD-enforced orchestration layer for Claude Code. Instead of generating apps, it processes changes through a structured Read Truth → Process Change → Update Truth cycle, with human gates at the points where judgment matters.

Get started · See the architecture
S: Source-anchored
T: Test-driven
R: Research-grounded
U: User-gated
T: Traceable
The idea

AI coding tools are fast. That's the problem.

The METR randomized controlled trial found experienced developers believed AI made them 24% faster, but were actually 19% slower. The bottleneck wasn't generation speed. It was intent clarity, spec quality, and review rigor.

STRUT is designed around that finding. Every change goes through classification, specification, test-first implementation, fail-fast review, and build verification, with human gates at the two points where judgment can't be automated: spec approval and PR review.

Architecture

Three phases, 19 components, zero trust by default

8 orchestrator skills, 10 worker agents, 1 bash script. All inter-agent communication passes through structured JSON file contracts in .pipeline/. Orchestrators read only status fields, never content.
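As a concrete sketch of that contract pattern: the file name and fields below are illustrative assumptions, not STRUT's actual schema, but they show the status-only read an orchestrator performs.

```shell
# Hypothetical contract file -- the name and fields are assumptions,
# not STRUT's actual schema.
mkdir -p .pipeline
cat > .pipeline/spec-review.json <<'EOF'
{"status": "pass", "iterations": 2, "content": "full review body the orchestrator never reads"}
EOF

# Orchestrator-style read: extract only the status field, ignore content.
status=$(python3 -c "import json; print(json.load(open('.pipeline/spec-review.json'))['status'])")
echo "$status"
```

An orchestrator that branches on the status field alone keeps worker output out of its own context window, which is the point of the contract.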

Phase 1
Read Truth
  • Repo impact scan
  • Classification
  • Modifier assignment
Phase 2
Process Change
  • Derive intent
  • Spec write & review cycle
    • Human gate: spec approval
  • Write tests → write code
  • Scope + criteria review
  • Build check
    • Human gate: PR review
Phase 3
Update Truth
  • Knowledge capture
  • Decision log proposals
  • Friction surface reporting
Honest trade-offs

What STRUT is not

Not a code generator

v0, Bolt, and Lovable generate whole apps fast. STRUT assumes you already have a codebase and processes individual changes through a rigorous pipeline. Different tool, different job.

Not framework-specific

STRUT orchestrates your build, lint, typecheck, and test commands. It has no opinions about your language, framework, or test runner.

Not fast for simple changes

Adding a button doesn't need a 19-component pipeline. STRUT is for changes where getting it wrong costs more than getting it slow.

Not magic

The pipeline surfaces problems. It doesn't fix your architecture. You still need to make judgment calls at every gate. That's the point.

Classification

Ceremony scales to risk

Scan evidence sets two independent binary modifiers for every change. The pipeline adds components only when the risk warrants them.

trust

ON when auth, RLS, schema, encryption, or data immutability detected

Adds MUST NEVER constraints, negative criteria, security review by Opus, mandatory knowledge capture, and describe-flow traceability.

decompose

ON when change crosses 2+ architectural boundaries

Adds task breakdown (up to 5), per-task TDD loop, and a human gate after task 1 to verify the approach before remaining tasks proceed.

Four combinations: standard (both OFF), trust-only, decompose-only, guarded-decompose (both ON, which adds adversarial spec review).
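A minimal sketch of that routing, assuming hypothetical on/off flag values and a classify helper that STRUT itself may not expose this way:

```shell
# Illustrative only: map the two binary modifiers to the four pipeline shapes.
classify() {
  local trust=$1 decompose=$2
  if [ "$trust" = on ] && [ "$decompose" = on ]; then
    echo guarded-decompose   # both ON: adds adversarial spec review
  elif [ "$trust" = on ]; then
    echo trust-only
  elif [ "$decompose" = on ]; then
    echo decompose-only
  else
    echo standard
  fi
}

classify on on    # guarded-decompose
classify off off  # standard
```

Because the modifiers are independent booleans, adding a third modifier doubles the combination count without touching the existing branches, which is what makes the scheme extensible.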

Evidence base

Research-grounded, not vibes-grounded

Every architectural decision maps to tiered citations. The full rationale is in architectural-decisions.md and research-index.md in the repo.

R Peer-reviewed or methodologically transparent research
I Industry reports (stated findings, less transparent methodology)
D Our design judgment, stated explicitly
The speed paradox
R METR RCT (2025) Randomized controlled trial. 16 experienced developers, 246 real tasks. Developers believed AI made them 24% faster but were actually 19% slower. Root cause: prompting AI before clarifying intent.
R Faros AI (2025) Company-wide metrics. Individual developers completed 21% more tasks and merged 98% more PRs, but review time increased 91%. Net organizational improvement: 0%.
I Bain & Company (2025) Coding is 25-35% of idea-to-launch. Speeding up code generation does little if specification and review remain the bottleneck.
Instruction following degrades at scale
R Curse of Instructions (Harada et al., ICLR 2025) Compliance degrades exponentially: success_all = success_individual^N. Grounds STRUT's criteria caps and context budgets.
R AGENTIF (Qi et al., NeurIPS 2025) 707 instructions from 50 real-world agentic applications. Both constraint count and instruction length independently degrade compliance. Tool and condition constraints are the hardest types.
R RECAST (ICLR 2026) Confirms the degradation pattern across models and constraint complexity levels, including the most recent generation.
R CooperBench (2025) Isolated single-task agents succeed ~50% of the time vs. 25% when two agents collaborate. Validates STRUT's one-agent-per-task isolation.
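The compounding in that first finding is easy to illustrate. With hypothetical numbers (a 98% per-instruction compliance rate, N values standing in for criteria counts, neither taken from STRUT or the paper), success_all = p^N falls fast:

```shell
# Hypothetical compliance rate p; the N values are illustrative, not STRUT's caps.
p=0.98
for n in 5 20 50; do
  awk -v p="$p" -v n="$n" 'BEGIN { printf "N=%2d: overall success ~= %.2f\n", n, p^n }'
done
```

At these made-up numbers, fifty simultaneous constraints drop overall compliance to roughly a third, which is the shape of result the criteria caps are defending against.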
AI code quality
R CodeRabbit (2025) 470 PRs analyzed. AI code has 1.7× more issues per PR (10.83 vs 6.45), security vulnerabilities 1.5-2× more frequent, and senior engineers spend 3.6× longer reviewing it.
R GitClear (2024) 211M lines of code analyzed. 8× increase in duplicated code blocks 2020-2024. Copy-paste surpassed refactoring for the first time.
R MSR (2026) ~110,000 open-source PRs. Agent-generated code shows measurably more churn than human-authored code.
R Carnegie Mellon Cursor study (2025) 807 repositories. Code complexity increased 40% after AI tool adoption.
I Apiiro (2025) 322% more privilege escalation paths in AI-generated code vs. human-written. Why STRUT's trust modifier exists.
I Veracode (2025) 45% of AI-generated code introduces OWASP Top 10 vulnerabilities.
I Ox Security (2025) AI code is "highly functional but systematically lacking in architectural judgment."
Review is the bottleneck
I Cognition/Devin (2026) PR volume now exceeds review capacity, leading to rubber-stamp approvals. Why STRUT pre-filters with automated review before human gates.
I ACM Queue (2025) The developer role is shifting from writer to navigator/reviewer. Review is the work, not overhead.
TDD is more critical with AI
R DORA Report (2025) TDD is "more critical than ever" with AI-assisted development. AI amplifies existing practices, good or bad.
R Sol-Ver (Lin et al., 2025) Self-play between test generator and code generator yields 19.63% improvement. The verifier drives quality in the solver.
R AlphaCode (Li et al., Science 2022) Generate-and-filter paradigm. One million candidate programs per problem, 99% filtered through test execution. Tests are the selection mechanism.
R AlphaEvolve (Novikov et al., 2025) Continuous evolutionary optimization of codebases with automated test suites as the fitness function.
R UTBoost (ACL 2025) 40.9% of SWE-bench tasks were affected by test quality issues. Test quality is load-bearing for AI code evaluation.
Specification quality
I GitHub Agent Config Analysis (2026) Teams covering fewer than 4 specification areas saw AI output quality drop below human baseline.
D Our design judgment Where external evidence doesn't exist, we say so. Every [D]-tier decision in the architecture doc is owned as a judgment call, not dressed up as research.
Roadmap

Built for Claude Code today, designed to be AI-agnostic tomorrow

Every architectural decision in STRUT is tagged with its platform coupling. The patterns (spec-first, file contracts, modifier-based risk routing, TDD enforcement) are grounded in LLM research that applies to any model. The current implementation uses Claude Code's skill/agent system. That's a starting point, not a ceiling.

Done
Phase 0: Foundation
Read Truth pipeline built and tested. Classification system with two-modifier architecture. Universal constraints model. Full design documentation with research citations.
run-read-truth · truth-classify · truth-repo-impact-scan · git-tool · universal-constraints · dual-model test runner
Done
Phase 1: Spec refinement pipeline
The spec cycle: derive intent from scan evidence, write a structured spec, review it for quality and testability, loop until it passes or hits the iteration cap.
spec-derive-intent · spec-write · spec-review · run-spec-refinement
Done
Phases 2-5: Full delivery loop
Implementation core (TDD cycle), review chain, build verification, knowledge capture. End-to-end pipeline from change description to merged PR with traceable decisions.
impl-write-tests · impl-write-code · run-review-chain · run-build-check · update-capture · run-strut (entry point)
Next
Additional modifiers
The two-modifier system (trust, decompose) is designed to be extensible. New modifiers can add pipeline steps for specific risk categories without touching the core path. Candidates emerge from observed friction in real pipeline runs.
performance modifier · accessibility modifier · migration modifier · community-contributed modifiers
Future
AI-agnostic abstraction layer
Replace Claude Code-specific orchestrators with a generic workflow engine that reads the same registry files and dispatches to different LLM providers. Agent bodies (markdown task instructions) are already portable. The orchestration glue is the only Claude-specific layer.
multi-provider dispatch · generic workflow engine · model-agnostic frontmatter · team scaling · protocol evolution engine · test wisdom directory
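To make the portability claim concrete, here is a speculative sketch of what that dispatch seam could look like. Everything in it (the function, the provider names, the CLI invocations) is an assumption; no such interface exists in STRUT today.

```shell
# Speculative sketch: route the same portable agent body (markdown instructions)
# to different providers. All names and flags below are assumptions.
dispatch() {
  local provider=$1 agent_body=$2
  case "$provider" in
    claude) echo "would run: claude -p \"\$(cat $agent_body)\"" ;;
    other)  echo "would run: other-llm-cli --prompt-file $agent_body" ;;
    *)      echo "unknown provider: $provider" >&2; return 1 ;;
  esac
}

dispatch claude agents/spec-write.md
```

The agent body stays a plain markdown file either way; only the one-line invocation changes per provider, which is the sense in which the orchestration glue is the sole Claude-specific layer.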
Get started

Three commands, your existing codebase

STRUT integrates into any project with a working build/test pipeline. Claude handles the mechanical setup; you provide the domain knowledge.

# Clone into your project
git clone https://github.com/bretbuilds/strut .claude

# Tell Claude your stack and let it wire up the integration
# "My stack is Next.js + Supabase. Do the Section A integration steps."

# Run the pipeline on a change
/run-strut "Add rate limiting to the auth endpoint"

# Or step through it: pause after every agent, inspect the output, continue or abort
/run-strut --step "Add rate limiting to the auth endpoint"

Step mode is useful for your first few pipeline runs. At each pause you see the completed agent, its output file, and what runs next. Type continue or abort at each step. The flag is per-invocation, so omitting it on a resume returns to normal flow.