What Linc Architect does
Linc Architect is an AI agent purpose-built for enterprise process work. It ingests the inputs a senior consultant would receive at the start of a transformation engagement — stakeholder interview recordings, SOP documents, exported process maps, regulatory frameworks, system extracts — and produces three categories of structured output: a reconstructed current-state workflow hierarchy, a portfolio of ROI-quantified improvement opportunities, and a future-state redesign that honors stated operating constraints.
The benchmark below evaluates Architect against the strongest publicly-available frontier models from Anthropic, OpenAI, and Google on those same three task types. The question we set out to answer wasn't “can a frontier model do this?” — they all can, partially. The question was “is the output shippable to a leadership audience without rework?” That's the bar a senior consultant or process-excellence team holds the work to in production.
The problem nobody benchmarks
Most public LLM benchmarks measure what frontier labs care about — reasoning (GPQA), code (SWE-bench), math (AIME), exam knowledge (MMLU). Enterprise process-excellence work is a different shape: stakeholder transcripts + SOPs + PDFs + Excel → structured workflow + ROI-quantified opportunities + a future-state design that respects the customer's stack and constraints. The question buyers actually ask isn't “can a frontier model do this?” It's “is the output shippable without manual rework?”
Shippability is the dimension nobody benchmarks. So we did.
What we benchmarked
The benchmark covers three distinct task types that map directly to how customers use process mining in pilots. Each receives different inputs and produces a different artifact:
Task 1
Workflow Replication
“Does the agent rebuild the workflow we already documented?”
5 cases · 102 gold steps
Task 2
Opportunity Discovery
“What should we change about this workflow?”
5 cases · ROI-quantified portfolios
Task 3
To-Be Process Design
“Synthesize a redesign that respects our stack and headcount budget.”
3 cases · audited golds
Inputs / outputs by task
Workflow Replication
In: stakeholder interview transcripts, SOP documents, and optional PDFs / Excel extracts. Out: a structured workflow hierarchy where every step carries a name, description, inputs, outputs, dependency edges, stakeholder owner, and an SOP-vs-practice status tag (documented / partial / gap / inferred).
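One way to picture a single record in that hierarchy is the sketch below. The class and field names (`WorkflowStep`, `depends_on`, `children`) and the sample steps are illustrative assumptions, not Architect's actual schema; only the list of attributes and the four status tags come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # The four SOP-vs-practice tags named in the task description.
    DOCUMENTED = "documented"
    PARTIAL = "partial"
    GAP = "gap"
    INFERRED = "inferred"


@dataclass
class WorkflowStep:
    # Hypothetical field names; the real output schema is not specified here.
    name: str
    description: str
    inputs: list[str]
    outputs: list[str]
    owner: str
    status: Status
    depends_on: list[str] = field(default_factory=list)  # names of upstream steps
    children: list["WorkflowStep"] = field(default_factory=list)  # sub-steps


def dependency_edges(step: WorkflowStep) -> list[tuple[str, str]]:
    """Flatten the hierarchy into (upstream, downstream) edges."""
    edges = [(dep, step.name) for dep in step.depends_on]
    for child in step.children:
        edges.extend(dependency_edges(child))
    return edges


# Invented example data, for illustration only.
approve = WorkflowStep(
    "Approve requisition", "Manager sign-off", ["requisition"],
    ["approved requisition"], "Budget owner", Status.PARTIAL,
    depends_on=["Raise requisition"],
)
root = WorkflowStep(
    "Requisition", "Request-to-approve phase", [], [],
    "Procurement lead", Status.DOCUMENTED, children=[approve],
)
```

Keeping dependencies as names rather than object references makes the edge list easy to compare against a gold graph, which is what a dependency-graph F1 needs.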
Opportunity Discovery
In: the same source materials plus any leadership-stated constraints (out-of-scope items, systems that must not be replaced). Out: an ROI-quantified portfolio of process improvements with dependencies, phasing, and stack compatibility.
To-Be Process Design
In: the current-state workflow hierarchy, a leadership brief listing improvement priorities and seven hard operating constraints, and the source materials. Out: a future-state workflow hierarchy where each step is tagged retained, modified, added, or deprecated, with a source-grounded rationale and a traceable link to a named improvement opportunity.
Audited gold standards
Ground truth in process-mining benchmarks is a known attack surface: buyers reasonably ask “did you shape the gold to flatter your system?” Our gold standards are derived from the real engagement material (see About the data) and audited, not generated from scratch by AI. The audit uses a 5-agent deliberative panel (four Round-1 panelists plus a Round-3 synthesizer) as a structured cross-check: each candidate gold standard is critiqued from multiple perspectives, and unresolved disagreements are explicitly flagged for human review before the gold is accepted as ground truth.
- Round 1 — 4 independent drafts: Process Engineer (Claude Opus 4.7), Domain Expert (Claude Opus 4.7 + WebSearch on APQC PCF / ITIL / industry frameworks), Skeptic (Claude Sonnet 4.6), Researcher (Claude Sonnet 4.6 + WebFetch on cited references).
- Round 2 — Cross-critiques: each panelist critiques the other 3 drafts with severity tagging (blocker / important / nit). 12 critique docs.
- Round 3 — Synthesis: Claude Opus 4.7 reads all 4 drafts + 12 critiques + source material, produces the final gold + an audit log listing every resolved disagreement, every unresolved blocker, and the constraint-compliance walkthrough.
Every audit artifact — Round-1 candidate drafts, Round-2 cross-critiques, Round-3 synthesis log with unresolved blockers, the constraint-compliance walkthrough, and the human-reviewer summary — is preserved per case. During pilot engagement, customers can read the audit trail line by line and verify the gold reflects engagement reality rather than model self-consistency.
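The document counts in the three rounds follow directly from the panel size. A small sketch of the arithmetic, using the Round-1 role names; the variable names are illustrative and the source material read in Round 3 is not counted:

```python
from itertools import permutations

# Role names as listed for Round 1; the code layout itself is illustrative.
PANELISTS = ["Process Engineer", "Domain Expert", "Skeptic", "Researcher"]

# Round 2: every ordered (critic, author) pair where critic != author.
critique_pairs = list(permutations(PANELISTS, 2))

# Round 3: the synthesizer reads all drafts plus all critiques.
synthesis_inputs = len(PANELISTS) + len(critique_pairs)

print(len(critique_pairs))  # 12
print(synthesis_inputs)     # 16
```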
Where this maps in DMAIC
For readers fluent in Lean Six Sigma: the three task types align cleanly with the first four phases of DMAIC. Control — the fifth phase — is intentionally outside the benchmark scope; it's owned by the customer's continuous-improvement function and isn't something AI tooling should automate.
| DMAIC phase | Question | Benchmark coverage |
|---|---|---|
| Define | What is the workflow and its scope? | Replication: reconstruct the as-is hierarchy from interviews, SOPs, and documents. |
| Measure | What are the baseline metrics and SOP-vs-practice gaps? | Replication: status tagging (documented / partial / gap / inferred), dependency mapping, stakeholder ownership. |
| Analyze | Root causes and waste categories? | Opportunity Discovery: ROI-quantified portfolio of improvements, each grounded in source evidence. |
| Improve | Countermeasures and future state? | To-Be Design: future-state workflow with retained / modified / added / deprecated tags and source-grounded rationale. |
| Control | Sustainment, monitoring, governance? | Not in scope. Owned by the customer's continuous-improvement function. Process governance, statistical process control, and audit-grade sustainment remain human-owned for good reason. |
The benchmark covers what AI can credibly automate inside a DMAIC engagement: the work that traditionally consumes the first three to six weeks of a Black Belt project before the team gets to designing countermeasures. Control phase outputs — SPC charts, audit-grade documentation, governance routines — stay with the customer's CI function.
Result 1 — Workflow Replication
5 cases × 5 systems (Architect plus 4 frontier-model baselines from Anthropic, OpenAI, and Google) = 25 test cells. The shippable threshold: step recall ≥ 0.80 AND precision ≥ 0.90 AND dependency-graph F1 ≥ 0.50 — “would a senior reviewer accept this without rework?”
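The threshold can be read as a simple predicate over three scores. The sketch below assumes exact matching of step names and dependency edges, which is a simplification (how the benchmark matches a predicted step to a gold step is not described here), and the sample steps are invented:

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two sets."""
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def is_shippable(step_recall: float, step_precision: float, dep_f1: float) -> bool:
    # Thresholds taken from the text: all three conditions must hold.
    return step_recall >= 0.80 and step_precision >= 0.90 and dep_f1 >= 0.50


# Invented example: one gold step and one gold edge are missed, nothing is hallucinated.
gold_steps = {"raise", "approve", "order", "receive", "pay"}
pred_steps = {"raise", "approve", "order", "receive"}
gold_edges = {("raise", "approve"), ("approve", "order"),
              ("order", "receive"), ("receive", "pay")}
pred_edges = {("raise", "approve"), ("approve", "order"), ("order", "receive")}

step_p, step_r, _ = precision_recall_f1(pred_steps, gold_steps)
_, _, edge_f1 = precision_recall_f1(pred_edges, gold_edges)
shippable = is_shippable(step_r, step_p, edge_f1)
```

Because the conditions are joined with AND, high recall cannot compensate for hallucinated steps: a run with perfect recall but precision below 0.90 still fails.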
Across all 5 cases, Architect captured 92 of 102 gold steps (90% recall) with zero hallucinations. No baseline recovered more than 72 (GPT-5 and Claude Haiku 4.5); the strongest baseline by composite, Claude Opus 4.7, caught 71, and the weakest caught 62 (Gemini 3.1 Pro).
We also ran the three consumer chat apps (the raw paste-into-Claude.ai / ChatGPT / Gemini.app experience, no API scaffolding) on the same cases — four runs in total, since we tested Claude.ai with both Sonnet 4.6 and Opus 4.7. Claude.ai Sonnet 4.6 ships on 3/5 — the best of that cohort. ChatGPT GPT-5, Claude.ai Opus 4.7, and Gemini.app 3.1 Pro each ship on 0-1 of 5.
Result 2 — Opportunity Discovery
5 cases × 5 systems (Architect plus 4 frontier-model baselines) = 25 test cells. Valid yield = opportunities that pass every quality gate (groundedness, specificity, ROI plausibility, constraint compliance).
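Valid yield is a count under an all-gates-pass rule. A minimal sketch, assuming each opportunity arrives with a pass/fail flag per gate; the dictionary keys and sample data are hypothetical, and how each gate is actually evaluated is not covered here:

```python
# The four quality gates named in the text; key spellings are illustrative.
GATES = ("groundedness", "specificity", "roi_plausibility", "constraint_compliance")


def valid_yield(opportunities: list[dict[str, bool]]) -> int:
    """Count opportunities that pass every gate; a missing gate counts as a fail."""
    return sum(all(opp.get(gate, False) for gate in GATES) for opp in opportunities)


# Invented example data.
portfolio = [
    {"groundedness": True, "specificity": True,
     "roi_plausibility": True, "constraint_compliance": True},
    {"groundedness": True, "specificity": False,
     "roi_plausibility": True, "constraint_compliance": True},
    {"groundedness": True, "specificity": True, "roi_plausibility": True},
]
yield_count = valid_yield(portfolio)
```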
Architect produces 2.2× more valid opportunities than the strongest baseline (Claude Opus 4.7 at 40). Coverage of leadership-flagged improvement themes: Architect 1.00, Haiku 0.77, Gemini 0.49, Opus 0.44, GPT-5 0.24. Architect catches every leadership-flagged theme on every case.
Result 3 — To-Be Process Design
To-be design is the synthesis step in process transformation: take the workflow as it runs today, take the leadership team's stated improvement priorities and operating constraints, and produce a coherent future-state workflow. Each step in the redesign is tagged retained, modified, added, or deprecated, with a source-grounded rationale and a traceable link back to a named opportunity.
This is the artifact consulting firms charge six figures and multi-month engagements to produce by hand. It's also the artifact where AI tooling fails most visibly: a future-state that violates a stated constraint or proposes ungrounded changes destroys leadership trust on first read.
3 cases × 5 systems (Architect plus Claude Opus 4.7, Claude Haiku 4.5, GPT-5, and Gemini 3.1 Pro) = 15 test cells. Each case has a panel-deliberated future-state gold and a buyer brief with seven hard constraints (no platform replacements, no headcount additions, stay on the existing stack, Phase-1 deliverable inside 12 weeks, no external consulting, no third-party data exports, no catalog redesign or supplier consolidation). Scoring is six programmatic dimensions: change-set coverage, retained-step preservation, change-rationale grounding, opportunity traceability, brief engagement, and structural coherence.
Composite score (mean of 3 cases)
Architect leads the strongest frontier baseline by 32.0 composite points (0.906 vs Claude Opus 4.7 at 0.586). The gap to the second-strongest baseline (Claude Haiku 4.5 at 0.485) is 42 points.
Per-dimension breakdown
| System | Composite | Change-set ★ | Retained | Grounding | Brief engagement † | Shippable |
|---|---|---|---|---|---|---|
| Linc Architect | 0.906 | 0.74 | 0.98 | 1.00 | 0.87 | 3 / 3 |
| Claude Opus 4.7 | 0.586 | 0.42 | 0.55 | 0.50 | 0.36 | 0 / 3 |
| Claude Haiku 4.5 | 0.485 | 0.32 | 0.69 | 0.34 | 0.20 | 0 / 3 |
| Gemini 3.1 Pro | 0.401 | 0.06 | 0.88 | 0.08 | 0.00 | 0 / 3 |
| GPT-5 | 0.349 | 0.00 | 0.74 | 0.02 | 0.00 | 0 / 3 |
★ Change-set coverage measures the fraction of gold non-retained changes the system captured with both correct change-type AND a source-grounded rationale. A change tag without a traceable rationale isn't useful in practice — the implementation team can't audit a rationale like “improve efficiency” or “follow industry standards” against the customer's actual environment. Systems that propose many changes without grounding (GPT-5, Gemini) score near zero even when they tag gold steps with the right change-type.
† Brief engagement measures the fraction of non-retained steps whose rationale explicitly cites a numbered leadership priority from the brief (e.g., “Brief priority #1”, “Leadership priority 3”). The metric separates systems that structurally engage with the brief's stated priorities from systems that propose changes without naming what they're meant to address. GPT-5 and Gemini score zero — they never reference brief priorities by name, even when their changes coincidentally address one.
Shippable means a redesign clears the strict bar: change-set coverage of at least 0.60, retained preservation of at least 0.80, and zero violations of any hard constraint stated in the brief. Every system in the cohort passes the constraint check, so that column is omitted from the table. Architect clears the bar on all three cases; no baseline clears it on any.
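The bar is applied per case, so the table's cohort means cannot be plugged in directly. The sketch below shows the per-case check with made-up scores; the `CaseScore` type and its field names are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class CaseScore:
    # Hypothetical container for one system's scores on one case.
    change_set_coverage: float
    retained_preservation: float
    constraint_violations: int


def is_shippable(score: CaseScore) -> bool:
    # All three conditions from the definition above must hold.
    return (
        score.change_set_coverage >= 0.60
        and score.retained_preservation >= 0.80
        and score.constraint_violations == 0
    )


def shippable_count(scores: list[CaseScore]) -> int:
    return sum(is_shippable(s) for s in scores)


# Invented per-case scores, not benchmark results.
example = [
    CaseScore(0.72, 0.95, 0),  # clears the bar
    CaseScore(0.45, 0.90, 0),  # fails on change-set coverage
    CaseScore(0.70, 0.60, 0),  # fails on retained preservation
]
cleared = shippable_count(example)
```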
The trade-off pattern across the cohort. Single-pass baselines propose more changes per case (typically 13-25), largely by tagging steps that should stay stable as modified. Architect proposes fewer changes (11-19 per case), but every one is traceable to a source quote or a stated leadership priority. The result: a redesign with 11 grounded changes is materially more shippable than one with 20 changes where half are ungrounded best-practice filler and 30-50% relabel stable steps as modified.
Per-case results
| Case | Architect lead over strongest baseline (composite) |
|---|---|
| Procurement P2P | +0.364 |
| Engineering → Ops handoff | +0.204 |
| HR onboarding | +0.392 |
| Mean across 3 cases | +0.320 |
Claude Opus 4.7 is the strongest baseline on every case in the cohort and by mean composite. Architect's per-case lead ranges from +20 to +39 composite points.
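The mean gap and the point range can be checked from the three per-case figures. A short sketch of the arithmetic; the variable names are illustrative:

```python
from statistics import mean

# Per-case composite gaps as reported above.
gaps = {
    "Procurement P2P": 0.364,
    "Engineering → Ops handoff": 0.204,
    "HR onboarding": 0.392,
}

mean_gap = round(mean(gaps.values()), 3)
points = {case: round(gap * 100) for case, gap in gaps.items()}

print(mean_gap)                                     # 0.32
print(min(points.values()), max(points.values()))  # 20 39
```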
Where the 32-point gap actually comes from
Grounding and brief engagement. Architect grounded every change in a specific source quote or stated improvement priority across all three cases: 100% of proposed changes carry a traceable rationale. The strongest baseline (Claude Opus 4.7) grounds half of its changes; GPT-5 grounds 2%. Brief engagement, meaning whether the rationale explicitly cites a numbered leadership priority, shows the sharpest gap: Architect 87%, Opus 36%, Haiku 20%, and GPT-5 and Gemini both at 0%. GPT-5 and Gemini never reference the brief's priorities by name, even when their changes are tangentially aligned with one.
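Brief engagement lends itself to a pattern check over rationale text. The sketch below is one plausible implementation, not the benchmark's scorer: the regular expression is an assumption that covers phrasings such as “Brief priority #1” and “Leadership priority 3”, and the step records are invented.

```python
import re

# Assumed citation pattern; the benchmark's actual matcher is not published here.
PRIORITY_REF = re.compile(r"\b(?:brief|leadership)\s+priority\s+#?\d+", re.IGNORECASE)


def brief_engagement(steps: list[dict[str, str]]) -> float:
    """Fraction of non-retained steps whose rationale cites a numbered priority."""
    changed = [s for s in steps if s["tag"] != "retained"]
    if not changed:
        return 0.0
    cited = sum(bool(PRIORITY_REF.search(s["rationale"])) for s in changed)
    return cited / len(changed)


# Invented example data.
steps = [
    {"tag": "retained", "rationale": ""},
    {"tag": "modified", "rationale": "Brief priority #1: cut approval cycle time."},
    {"tag": "added", "rationale": "Addresses leadership priority 3 (audit trail)."},
    {"tag": "deprecated", "rationale": "Improve efficiency."},
]
score = brief_engagement(steps)
```

A change that happens to serve a priority but never names it scores zero under this rule, which is the behavior described for GPT-5 and Gemini.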
Change-set capture (grounding-gated). Of the gold non-retained changes per case, Architect captures 74% with both correct change-type and source-grounded rationale. Best baseline (Opus) captures 42%. GPT-5 and Gemini score effectively zero — they propose plenty of changes but almost none survive grounding gating. A change without a traceable rationale isn't a change the implementation team can act on.
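The grounding gate can be sketched as a two-condition match per gold change. This assumes steps are matched by a shared identifier and that grounding has already been judged as a boolean; both are simplifications, and the identifiers and data below are hypothetical.

```python
def change_set_coverage(
    gold: dict[str, str],
    system: dict[str, tuple[str, bool]],
) -> float:
    """Fraction of gold non-retained changes captured with the right tag AND grounding.

    gold maps step id -> change tag; system maps step id -> (change tag, grounded?).
    """
    gold_changes = {sid: tag for sid, tag in gold.items() if tag != "retained"}
    if not gold_changes:
        return 1.0
    captured = 0
    for sid, tag in gold_changes.items():
        sys_tag, grounded = system.get(sid, (None, False))
        if sys_tag == tag and grounded:
            captured += 1
    return captured / len(gold_changes)


# Invented example data.
gold = {"s1": "retained", "s2": "modified", "s3": "added",
        "s4": "deprecated", "s5": "modified"}
system = {
    "s2": ("modified", True),    # right tag, grounded: counts
    "s3": ("added", False),      # right tag, no grounding: gated out
    "s4": ("deprecated", True),  # counts
    "s5": ("added", True),       # wrong tag: does not count
}
coverage = change_set_coverage(gold, system)
```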
Retained-step preservation. Across the cohort, baselines pointlessly refactor 12 to 45 percent of the steps the brief didn't ask to change (the strongest baseline, Claude Opus 4.7, sits at the high end at 45%). Every spurious modification is operational debt the implementation team pays in change-management and retraining time. Architect's “anchored evolution, not from-scratch” prompt structure corrects this; single-pass prompts don't.
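Read this way, the spurious-modification rate is the complement of the Retained column. The sketch below assumes preservation is the share of gold-retained steps that the system also tags retained, which is an interpretation of the metric rather than its published definition; the two sample scores are the Retained values reported for Claude Opus 4.7 and Gemini 3.1 Pro.

```python
def retained_preservation(gold: dict[str, str], system: dict[str, str]) -> float:
    """Share of gold-retained steps the system also left tagged as retained."""
    retained = [sid for sid, tag in gold.items() if tag == "retained"]
    if not retained:
        return 1.0
    kept = sum(system.get(sid) == "retained" for sid in retained)
    return kept / len(retained)


# Complement of the reported Retained scores gives the spurious-change rate.
reported = {"Claude Opus 4.7": 0.55, "Gemini 3.1 Pro": 0.88}
spurious = {name: round(1 - score, 2) for name, score in reported.items()}

# Invented example: one of four stable steps is needlessly modified.
gold = {"s1": "retained", "s2": "retained", "s3": "retained",
        "s4": "retained", "s5": "modified"}
system = {"s1": "retained", "s2": "modified", "s3": "retained",
          "s4": "retained", "s5": "modified"}
preservation = retained_preservation(gold, system)
```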
Why this matters in practice
When leadership asks “give us a redesign of our procurement process that respects our stack and our headcount budget,” they're asking for to-be design — not a list of improvements and not a current-state map. Across three cases with realistic operating constraints, Linc Architect is the only system in our cohort that produces a result the process-excellence team can take to executive review without rework on grounding, brief engagement, and retained-step preservation simultaneously.
Explore the actual outputs
For readers who want to inspect the underlying artifacts: the buyer brief, the current-state hierarchy, and every system's full to-be output are below. Switch system tabs to compare what each model produced, filter by change type to slice retentions and modifications, and see exactly where each system grounded a change.