Stilta -- Intelligence Profile -- The Caliper Lab

Intelligence Profile

Stilta

Agentic AI platform for patent litigation workflows. Deploys coordinated agent swarms across 180 million patents, 250 million scientific publications, and archived web sources to surface prior art, detect infringement, and support freedom-to-operate analysis. Built for IP teams and law firms where accuracy carries legal consequence.

Patent Litigation AI · Agentic Prior Art Search · Infringement Analysis · FTO Workflows · YC W26 · a16z-backed

Initial coverage

Q2 2026 — Run #1 · 312 tasks — CaliperIP-v1

Coverage note: This is an initial profile based on Run #1 of the CaliperIP-v1 task battery. Patent claim evaluation requires specialist legal ground truth construction; the current dataset covers prior art retrieval and claim chart generation. Infringement opinion and FTO judgment tasks are scheduled for Q3 2026 once domain expert review is complete.


Capability Assessment (Independent, Q2 2026)

Stilta is the strongest product the Lab has benchmarked on high-volume prior art retrieval. The recall advantage over general-purpose models is real and measurable. The more important question for enterprise IP teams and investors is whether that recall edge holds as retrieval tasks become more interpretive and the agent reasoning layer is asked to do more than surface evidence.

Where the product leads

On prior art retrieval — the task the product is primarily designed for — Stilta's recall performance is materially above general-purpose frontier models. The Lab's CaliperIP-v1 battery corroborates the directional claim in Stilta's internal benchmarking: coordinated agent swarms across a purpose-built corpus outperform single-pass LLM retrieval on high-volume search tasks. The structural advantage is the corpus architecture — 180M patents plus scientific and web sources — combined with parallel agent reasoning that allows multiple invalidity theories to be pursued simultaneously.

Prior art retrieval recall of 81.4% on CaliperIP-v1, versus 63.2% for GPT-5.4 running equivalent single-pass retrieval: an 18.2-point gap (a ratio of roughly 1.3x) that supports the vendor's internal recall claim directionally, though not its 3x multiplier.
Claim chart generation scores 74.8% on citation precision — above the 61% category average for patent AI tools in the benchmark set.
Agent parallelism provides a measurable speed advantage: median task completion 4.1x faster than sequential LLM approaches on matched multi-theory invalidity inputs.
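
The two headline metrics above can be illustrated with their textbook set-based definitions. This is a minimal sketch only: the Lab's exact scoring rules are not given in this profile, and the function names and reference identifiers below are hypothetical.

```python
def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Share of ground-truth relevant references that were retrieved."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(retrieved & relevant) / len(relevant)


def citation_precision(cited: set[str], supporting: set[str]) -> float:
    """Share of cited references that actually support the charted claim element."""
    if not cited:
        raise ValueError("cited set must not be empty")
    return len(cited & supporting) / len(cited)


# Hypothetical identifiers, for illustration only (not benchmark data).
relevant_refs = {"REF-A", "REF-B", "REF-C", "REF-D", "REF-E"}
retrieved_refs = {"REF-A", "REF-B", "REF-C", "REF-D", "REF-X"}
example_recall = recall(retrieved_refs, relevant_refs)  # 4 of 5 relevant found

cited_refs = {"REF-A", "REF-B", "REF-C", "REF-X"}
supporting_refs = {"REF-A", "REF-B", "REF-C"}
example_precision = citation_precision(cited_refs, supporting_refs)  # 3 of 4 cited support

print(example_recall, example_precision)
```

Recall penalises relevant references that were missed; citation precision penalises citations that do not support the claim element. A tool can score well on one and poorly on the other, which is why the scorecard reports them separately.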

The interpretive layer question

Retrieval recall and interpretive legal judgment are distinct capabilities. Stilta's agent architecture is well-optimised for the first; on the second, its lead over general-purpose frontier models disappears. On tasks requiring legal reasoning beyond evidence surfacing (developing invalidity theories, assessing claim scope, evaluating obviousness arguments), performance drops toward category average and falls 3.7 to 6.7 points below the GPT-5.4 baseline on the L3 and L4 tasks scored so far; infringement opinion drafting has not yet been benchmarked. This reflects the state of the category, not a weakness unique to Stilta. The risk is that enterprise buyers conflate retrieval strength with end-to-end legal capability.

L1–L2 gap (retrieval and claim charting): Stilta leads the frontier baseline by 18.2 points on retrieval recall and 6.7 points on citation precision. Durable structural advantage from corpus and agent architecture.
L3–L4 gap (invalidity theory development, claim scope interpretation and obviousness): the lead reverses, with Stilta trailing the frontier baseline by 3.7 points on L3 and 6.7 points on L4. Frontier models overtake as tasks become more reasoning-intensive.
L5 tasks (infringement opinion, FTO judgment): not yet benchmarked. Domain expert ground truth construction in progress for Q3 2026.

Decision implication

For IP teams with high-volume prior art workloads — particularly in litigation support and portfolio management — Stilta's retrieval performance represents a genuine capability advantage not replicable by pointing a general-purpose model at a patent database. For enterprise buyers evaluating end-to-end litigation support, the question is where in the workflow AI-generated output requires attorney review. The current benchmark suggests the retrieval and charting layer can be deployed with high confidence; the interpretive layer warrants a supervised deployment model until further data is available.

What the data does not yet cover

L5 infringement opinion and FTO judgment tasks are outside the current benchmark scope — the highest-stakes outputs in the product's claimed workflow and the most important capability test for enterprise legal teams.
Legal admissibility of AI-generated claim charts has not been independently validated. This depends on jurisdiction, filing context, and attorney supervision standards that vary by case.
Practitioner panel for IP attorneys is at n=9. Two additional cycles required before signal is statistically stable. Current figures are directional only.
Corpus coverage on non-US patent jurisdictions (EPO, JPO, CNIPA) has not been independently verified at the claimed scale.

Benchmark Scorecard vs. GPT-5.4 baseline — 312 tasks — CaliperIP-v1

Level and task (metric): Stilta vs. Frontier (GPT-5.4), gap in points

L1 Prior art retrieval (recall): 81.4 vs. 63.2 (+18.2)
L2 Claim chart generation (citation precision): 74.8 vs. 68.1 (+6.7)
L3 Invalidity theory development: 71.2 vs. 74.9 (−3.7)
L4 Claim scope interpretation and obviousness: 63.4 vs. 70.1 (−6.7)
L5 Infringement opinion and FTO judgment: -- vs. -- (pending)
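
The per-level gaps in the scorecard can be recomputed directly from the published scores. The sketch below assumes nothing beyond the four scored rows; the dictionary layout and the `gap` helper are illustrative names, not part of the Lab's tooling.

```python
# (Stilta score, GPT-5.4 score) per task level, copied from the scorecard.
scores = {
    "L1": (81.4, 63.2),  # prior art retrieval, recall
    "L2": (74.8, 68.1),  # claim chart generation, citation precision
    "L3": (71.2, 74.9),  # invalidity theory development
    "L4": (63.4, 70.1),  # claim scope interpretation and obviousness
}


def gap(stilta: float, frontier: float) -> float:
    """Stilta minus frontier, in points; positive means Stilta leads."""
    return round(stilta - frontier, 1)


gaps = {level: gap(*pair) for level, pair in scores.items()}

for level, points in gaps.items():
    leader = "Stilta leads" if points > 0 else "frontier leads"
    print(f"{level}: {points:+.1f} pts ({leader})")
```

The sign change between L2 and L3 is the point to notice: the advantage does not merely shrink on reasoning-heavy tasks, it reverses.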

Vendor Claim Verification Source: stilta.com and public statements

"Roughly three times the recall of general-purpose models like ChatGPT, Claude, and Perplexity on invalidity tasks"

Partial. The directional claim is supported. CaliperIP-v1 shows Stilta's retrieval recall at 81.4% versus 63.2% for GPT-5.4 on equivalent prior art tasks, a ratio of approximately 1.3x on recall, not 3x. The 3x figure likely reflects a different metric (possibly volume of relevant references surfaced rather than precision-adjusted recall) or a different task definition than the Lab's benchmark. The recall advantage is real and material; the specific multiplier requires clarification of the underlying metric definition.
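
The arithmetic behind this verdict is short enough to check by hand. The sketch below uses only the two recall figures from the benchmark; the variable names are illustrative.

```python
stilta_recall = 81.4    # percent, CaliperIP-v1
frontier_recall = 63.2  # percent, GPT-5.4 on equivalent tasks

# Observed multiplier on recall.
recall_ratio = stilta_recall / frontier_recall

# Baseline recall that a literal 3x multiplier would require,
# holding Stilta's measured recall fixed.
implied_baseline_for_3x = stilta_recall / 3

print(f"observed ratio: {recall_ratio:.2f}x")
print(f"baseline recall implied by 3x: {implied_baseline_for_3x:.1f}%")
```

A literal 3x multiplier at Stilta's measured recall would require a general-purpose baseline near 27%, well below the 63.2% observed here, which is why a different metric or task definition is the more plausible explanation.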

"Agents reason in parallel and converge the way a room full of specialists would, but at a scale no human team can match"

Partial. Parallel agent execution on multi-theory tasks is corroborated by the architecture and consistent with benchmark timing data. The "room full of specialists" framing is a product metaphor rather than a testable capability claim. What the benchmark does support: multi-theory invalidity tasks complete at 4.1x the speed of sequential LLM approaches on matched inputs. The specialist judgment analogy holds at the retrieval layer; it is less well supported at the interpretive reasoning layer, where Stilta scores below the frontier baseline on L3 and L4 tasks.

"Output has legal admissibility — report and claim charts with precise references to each piece of evidence"

Not independently tested. Citation precision in claim chart generation scores 74.8% in the benchmark, above category average and consistent with the precision framing. Legal admissibility, however, is a jurisdictional and procedural determination that cannot be established through capability benchmarking alone. It depends on the supervising attorney's review process, filing jurisdiction, and court standards. The Lab treats this as a commercial representation outside the scope of the current benchmark.

Frontier intelligence

Current frontier (GPT-5.4): 70.1 (weighted avg, patent retrieval and reasoning tasks)
Frontier velocity: +2.8 pts / qtr (legal reasoning tasks, steady)
Retrieval parity risk: Low, 6+ qtrs (corpus architecture provides structural moat)

The frontier's retrieval gap with Stilta is structural, not just capability-based — it reflects corpus access, not model intelligence. As frontier models gain native access to patent databases, this moat will compress. The interpretive reasoning layer is where frontier parity arrives soonest.
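
A back-of-envelope check shows how a horizon of six or more quarters could follow from the published figures. This is a sketch under loud assumptions, none of them stated by the Lab: that the +2.8 pts / qtr velocity (reported for legal reasoning tasks) also applies to retrieval, that improvement is linear, and that Stilta's own score stays flat.

```python
retrieval_gap_pts = 18.2             # L1 recall gap from the scorecard
frontier_velocity_pts_per_qtr = 2.8  # reported for legal reasoning tasks (assumed to transfer)


def quarters_to_parity(gap_pts: float, velocity_pts_per_qtr: float) -> float:
    """Quarters until a linearly improving baseline closes a fixed gap."""
    if velocity_pts_per_qtr <= 0:
        raise ValueError("velocity must be positive for the gap to close")
    return gap_pts / velocity_pts_per_qtr


quarters = quarters_to_parity(retrieval_gap_pts, frontier_velocity_pts_per_qtr)
print(f"{quarters:.1f} quarters to retrieval parity under these assumptions")
```

Because the gap is attributed to corpus access rather than model capability, a linear capability trend is a weak predictor; a single change in frontier access to patent databases could compress the timeline faster than this estimate suggests.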

Practitioner signal n=9 — IP attorneys and patent agents (early)

Output acceptance rate: 68% (early data)
Verify before use: 81% (early data)
Workflow abandonment: 6% (early data)
Trust trajectory: Early, building
Top correction type: Claim scope interpretation

High verification rate (81%) is consistent with legal professional norms — not distrust, but professional duty. Abandonment rate of 6% is low for a first-cycle panel on a new product category.
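
To see why a panel of nine yields directional figures only, consider the uncertainty around a proportion at that sample size. The sketch below computes a Wilson score interval for a hypothetical 6 of 9 respondents. The count is an assumption made purely for illustration: the profile does not state whether the reported rates are measured per respondent or per output.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical count: 6 of 9 panelists, close to the reported 68% rate.
low, high = wilson_interval(6, 9)
print(f"95% interval: {low:.0%} to {high:.0%}")
```

Under that assumption the interval spans roughly 35% to 88%, wide enough that the point estimate cannot yet separate a strong result from a middling one.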

Score trajectory: Stilta weighted avg, retrieval tasks

Prior qtrs: --
Q2 2026: 78.2

Methodology

Dataset: CaliperIP-v1, 312 tasks
Baseline: GPT-5.4 (Apr 2026)
Scoring L1-L2: Recall F1 + citation precision
Scoring L3-L4: LLM-as-judge + patent attorney review
Ground truth: Expert-constructed, kappa 0.81
Run date: 14 May 2026
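
The kappa figure reported for ground truth is an inter-annotator agreement statistic. The profile does not say which variant was used, so the sketch below assumes Cohen's kappa for two raters and uses hypothetical relevance labels, not the Lab's annotation data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = reference is relevant prior art, 0 = not relevant.
expert_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
expert_2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
kappa = cohens_kappa(expert_1, expert_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the raters disagree on one item in ten, giving 90% raw agreement but a kappa of 0.80 once chance agreement is removed, which is why kappa is the stricter figure to report.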

Representative profile for discussion: all scores and findings are illustrative, based on the Lab's published methodology applied to Stilta's publicly stated capabilities and available product documentation.
L5 infringement and FTO tasks are outside current benchmark scope; findings reflect retrieval and reasoning tasks only.
Full benchmark data will be published upon completion of the formal evaluation programme. — thecaliperlab.com