Capability Assessment Independent — Q2 2026
Stilta is the strongest product the Lab has benchmarked on high-volume prior art retrieval. The recall advantage over general-purpose models is real and measurable. The more important question for enterprise IP teams and investors is whether that recall edge holds as retrieval tasks become more interpretive and the agent reasoning layer is asked to do more than surface evidence.
1. Where the product leads
On prior art retrieval — the task the product is primarily designed for — Stilta's recall performance is materially above general-purpose frontier models. The Lab's CaliperIP-v1 battery corroborates the directional claim in Stilta's internal benchmarking: coordinated agent swarms across a purpose-built corpus outperform single-pass LLM retrieval on high-volume search tasks. The structural advantage is the corpus architecture — 180M patents plus scientific and web sources — combined with parallel agent reasoning that allows multiple invalidity theories to be pursued simultaneously.
- Prior art retrieval recall of 81.4% on CaliperIP-v1, versus 63.2% for GPT-5.4 running equivalent single-pass retrieval — an 18.2-point gap consistent with the vendor's internal 3x recall claim in directional terms.
- Claim chart generation scores 74.8% on citation precision — above the 61% category average for patent AI tools in the benchmark set.
- Agent parallelism provides a measurable speed advantage: median task completion 4.1x faster than sequential LLM approaches on matched multi-theory invalidity inputs.
2. The interpretive layer question
Retrieval recall and interpretive legal judgment are distinct capabilities. Stilta's agent architecture is well-optimised for the first; the second is where its lead over general-purpose frontier models disappears. On tasks requiring legal reasoning beyond evidence surfacing (assessing claim scope, evaluating obviousness arguments, drafting infringement opinions), performance drops toward category average and, on the benchmarked L3 and L4 tasks, falls below the GPT-5.4 baseline. This reflects the state of the category, not a weakness unique to Stilta. The risk is that enterprise buyers conflate retrieval strength with end-to-end legal capability.
- L1–L2 gap (retrieval and claim charting): Stilta leads the frontier baseline by 18.2 points on retrieval recall and 6.7 points on citation precision. Durable structural advantage from corpus and agent architecture.
- L3–L4 gap (invalidity theory development, claim scope interpretation and obviousness): the lead reverses. Stilta trails the frontier baseline by 3.7 points at L3 and 6.7 points at L4 as tasks become more reasoning-intensive.
- L5 tasks (infringement opinion, FTO judgment): not yet benchmarked. Domain expert ground truth construction in progress for Q3 2026.
3. Decision implication
For IP teams with high-volume prior art workloads — particularly in litigation support and portfolio management — Stilta's retrieval performance represents a genuine capability advantage not replicable by pointing a general-purpose model at a patent database. For enterprise buyers evaluating end-to-end litigation support, the question is where in the workflow AI-generated output requires attorney review. The current benchmark suggests the retrieval and charting layer can be deployed with high confidence; the interpretive layer warrants a supervised deployment model until further data is available.
4. What the data does not yet cover
- L5 infringement opinion and FTO judgment tasks are outside the current benchmark scope — the highest-stakes outputs in the product's claimed workflow and the most important capability test for enterprise legal teams.
- Legal admissibility of AI-generated claim charts has not been independently validated. This depends on jurisdiction, filing context, and attorney supervision standards that vary by case.
- Practitioner panel for IP attorneys is at n=9. Two additional cycles required before signal is statistically stable. Current figures are directional only.
- Corpus coverage on non-US patent jurisdictions (EPO, JPO, CNIPA) has not been independently verified at the claimed scale.
Benchmark Scorecard vs. GPT-5.4 baseline — 312 tasks — CaliperIP-v1
| Task | Level | Stilta | Frontier (GPT-5.4) | Delta |
|---|---|---|---|---|
| Prior art retrieval: recall | L1 | 81.4 | 63.2 | +18.2 |
| Claim chart generation: citation precision | L2 | 74.8 | 68.1 | +6.7 |
| Invalidity theory development | L3 | 71.2 | 74.9 | −3.7 |
| Claim scope interpretation and obviousness | L4 | 63.4 | 70.1 | −6.7 |
| Infringement opinion and FTO judgment | L5 | -- | -- | pending |
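The deltas in the scorecard can be recomputed directly from the two score columns. The following is a minimal sketch, not the Lab's tooling: the `SCORECARD` layout and the function names are illustrative assumptions, and only the four benchmarked levels are included because L5 has no scores yet.

```python
# Scores copied from the scorecard above as (Stilta, GPT-5.4 baseline).
# The dict layout and names are illustrative, not part of the Lab's pipeline.
SCORECARD = {
    "L1": (81.4, 63.2),  # prior art retrieval, recall
    "L2": (74.8, 68.1),  # claim chart generation, citation precision
    "L3": (71.2, 74.9),  # invalidity theory development
    "L4": (63.4, 70.1),  # claim scope interpretation and obviousness
}


def deltas(scorecard):
    """Return Stilta minus baseline per level, rounded to one decimal."""
    return {
        level: round(stilta - frontier, 1)
        for level, (stilta, frontier) in scorecard.items()
    }


def leaders(scorecard):
    """Name which side leads at each level, based on the sign of the delta."""
    return {
        level: "Stilta" if d > 0 else "Frontier" if d < 0 else "Tie"
        for level, d in deltas(scorecard).items()
    }


for level, delta in deltas(SCORECARD).items():
    print(f"{level}: {delta:+.1f} ({leaders(SCORECARD)[level]} leads)")
```

The sign change between L2 and L3 is the point at which the product's retrieval lead gives way to a frontier lead on reasoning-heavy tasks.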
Vendor Claim Verification (source: stilta.com and public statements)
"Roughly three times the recall of general-purpose models like ChatGPT, Claude, and Perplexity on invalidity tasks"
Verdict: partial
The directional claim is supported. CaliperIP-v1 shows Stilta's retrieval recall at 81.4% versus 63.2% for GPT-5.4 on equivalent prior art tasks — a ratio of approximately 1.3x on recall, not 3x. The 3x figure likely reflects a different metric (possibly volume of relevant references surfaced rather than precision-adjusted recall) or a different task definition than the Lab's benchmark. The recall advantage is real and material; the specific multiplier requires clarification of the underlying metric definition.
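The difference between a point gap and a ratio is easy to check by hand. The sketch below uses only the two recall figures reported above; the function name is an illustrative assumption, and the final calculation simply asks what baseline recall a literal 3x multiplier would imply at Stilta's measured score.

```python
def recall_comparison(product_recall, baseline_recall):
    """Return (gap in percentage points, ratio of product to baseline)."""
    gap_points = round(product_recall - baseline_recall, 1)
    ratio = round(product_recall / baseline_recall, 2)
    return gap_points, ratio


STILTA_RECALL = 81.4    # CaliperIP-v1, prior art retrieval
BASELINE_RECALL = 63.2  # GPT-5.4, equivalent single-pass retrieval

gap, ratio = recall_comparison(STILTA_RECALL, BASELINE_RECALL)
print(f"gap: {gap} points, ratio: {ratio}x")

# Baseline recall that a literal 3x claim would require at 81.4% recall.
implied_baseline = round(STILTA_RECALL / 3, 1)
print(f"3x would imply a baseline near {implied_baseline}%")
```

On these inputs the ratio is about 1.29x, and a 3x multiplier would need a baseline near 27%, well below the 63.2% the Lab measured. That is consistent with the vendor figure resting on a different metric or task definition.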
"Agents reason in parallel and converge the way a room full of specialists would, but at a scale no human team can match"
Verdict: partial
Parallel agent execution on multi-theory tasks is corroborated by the architecture and consistent with benchmark timing data. The "room full of specialists" framing is a product metaphor rather than a testable capability claim. What the benchmark does support: multi-theory invalidity tasks complete at 4.1x the speed of sequential LLM approaches on matched inputs. The specialist judgment analogy holds at the retrieval layer; it is less well supported at the interpretive reasoning layer, where L3–L4 scores fall 3.7 to 6.7 points below the frontier baseline.
"Output has legal admissibility — report and claim charts with precise references to each piece of evidence"
Verdict: not independently tested
Citation precision in claim chart generation scores 74.8% in the benchmark — above category average and consistent with the precision framing. Legal admissibility, however, is a jurisdictional and procedural determination that cannot be established through capability benchmarking alone. It depends on the supervising attorney's review process, filing jurisdiction, and court standards. The Lab treats this as a commercial representation outside the scope of the current benchmark.
Frontier intelligence
Current frontier (GPT-5.4): 70.1, weighted avg across patent retrieval and reasoning tasks
Frontier velocity: +2.8 pts / qtr on legal reasoning tasks (steady)
Retrieval parity risk: Low (6+ qtrs); corpus architecture provides structural moat
The frontier's retrieval gap with Stilta is structural, not just capability-based — it reflects corpus access, not model intelligence. As frontier models gain native access to patent databases, this moat will compress. The interpretive reasoning layer is where frontier parity arrives soonest.
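One way to sanity-check a "6+ qtrs" horizon is a simple linear extrapolation. The report does not say how that figure was derived, so the sketch below is an assumption throughout: it applies the +2.8 pts / qtr velocity (reported for legal reasoning tasks) to the 18.2-point L1 retrieval gap, holds Stilta's score flat, and ignores the structural corpus-access argument made above. The function name is hypothetical.

```python
import math


def quarters_to_parity(gap_points, frontier_velocity, product_velocity=0.0):
    """Linear estimate of quarters until the baseline closes a score gap.

    Assumes both velocities stay constant. Returns math.inf when the gap
    is not closing at all.
    """
    closing_rate = frontier_velocity - product_velocity
    if closing_rate <= 0:
        return math.inf
    return gap_points / closing_rate


L1_GAP = 18.2            # retrieval recall gap from the scorecard
FRONTIER_VELOCITY = 2.8  # pts / qtr; reported for legal reasoning tasks

estimate = quarters_to_parity(L1_GAP, FRONTIER_VELOCITY)
print(f"{estimate:.1f} quarters under a linear, static-product assumption")
```

Under these assumptions the estimate lands at about 6.5 quarters. Any improvement on Stilta's side lengthens the horizon, and native patent-database access for frontier models could shorten it faster than a linear model suggests.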
Practitioner signal (n=9, IP attorneys and patent agents; early)
Output acceptance rate: 68% (early data)
Verify before use: 81% (early data)
Workflow abandonment: 6% (early data)
Trust trajectory: Early, building
Top correction type: Claim scope interpretation
High verification rate (81%) is consistent with legal professional norms — not distrust, but professional duty. Abandonment rate of 6% is low for a first-cycle panel on a new product category.
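To see why a panel of nine yields directional figures only, it helps to put an interval around one of the rates. The sketch below computes a 95% Wilson score interval for the 68% output acceptance rate. It assumes the rate can be treated as a simple binomial proportion over n=9 independent respondents, which the report does not state; the function name is illustrative.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed over n trials.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return centre - half_width, centre + half_width


# Assumption: 68% acceptance treated as a proportion over the n=9 panel.
low, high = wilson_interval(0.68, 9)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

Under that assumption the interval runs from roughly 0.37 to 0.89, wide enough that the panel figures cannot yet separate a strong result from a middling one.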
Score trajectory: Stilta weighted avg, retrieval tasks (higher bar = stronger performance vs. frontier)
[Chart: Q2 2026 = 78.2; prior quarters = --]
Methodology
Dataset: CaliperIP-v1, 312 tasks
Baseline: GPT-5.4 (Apr 2026)
Scoring L1-L2: Recall F1 + citation precision
Scoring L3-L4: LLM-as-judge + patent attorney review
Ground truth: Expert-constructed, kappa 0.81
Run date: 14 May 2026
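For readers less familiar with the L1-L2 scoring terms, the sketch below shows the standard set-based definitions of recall, precision, and F1 for a single retrieval task. It is not the Lab's scoring code: the function name and the reference identifiers are hypothetical, and the report does not specify how per-task scores are aggregated across the 312 tasks.

```python
def retrieval_metrics(retrieved, relevant):
    """Return (recall, precision, f1) for one retrieval task.

    retrieved: references the system surfaced
    relevant:  references in the expert-constructed ground truth
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return recall, precision, f1


# Hypothetical identifiers, for illustration only.
ground_truth = {"REF-A", "REF-B", "REF-C", "REF-D"}
system_output = ["REF-A", "REF-B", "REF-X"]

recall, precision, f1 = retrieval_metrics(system_output, ground_truth)
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

In this toy case the system finds two of four relevant references (recall 0.50) with one false hit among three results (precision 0.67), which is the trade-off the L1 and L2 scores measure separately.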