Building Production-Grade RAG in Clarity 3.0

The engineering decisions that moved accuracy, faithfulness, and latency in a financial-intelligence RAG system

10 min read · Published Dec 10, 2025

Approximately 5,000 publicly traded companies report quarterly earnings, generating 20,000 earnings calls annually—representing 20,000 hours of strategic commentary (833 days of continuous listening, or 2.28 years). Each call contains two distinct but interconnected data streams: structured financial results (revenue, margins, guidance) and unstructured narrative commentary (management insights, market positioning, risk factors, strategic priorities). Traditionally, investors manually sift through both quantitative metrics and qualitative discourse to build investment theses—a process that's time-intensive and doesn't scale.

This massive corpus of financial data and strategic commentary can be transformed into embeddings and semantically analyzed at scale. Clarity 3.0 demonstrates this approach: it's an enterprise-grade RAG application that has ingested 200+ MegaCap Tech earnings calls—both the structured financials and full transcript narratives—parsed, cleaned, and converted into vector embeddings stored in a Pinecone database. Users can query both dimensions of the data: "What were NVIDIA's data center revenues last quarter?" (quantitative) or "How is management describing competitive positioning in AI chips?" (qualitative narrative analysis).
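
To make the ingestion path concrete, here is a minimal sketch of how a transcript can be chunked, embedded, and upserted into Pinecone. The embedding model, index name, chunk sizes, and metadata schema are assumptions for illustration; Clarity's actual pipeline is not shown in this article.

```python
# Minimal ingestion sketch: chunk a transcript, embed it, and upsert into Pinecone.
# Assumptions (not from the article): OpenAI text-embedding-3-small, an index
# named "earnings-calls", and a simple fixed-size character chunker.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("earnings-calls")  # hypothetical index name

def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    """Split a transcript into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def ingest_transcript(ticker: str, quarter: str, transcript: str) -> None:
    chunks = chunk(transcript)
    # Embed all chunks in one request; each embedding becomes one vector.
    resp = openai_client.embeddings.create(
        model="text-embedding-3-small", input=chunks
    )
    vectors = [
        {
            "id": f"{ticker}-{quarter}-{i}",
            "values": item.embedding,
            # Metadata lets later queries filter by company and period.
            "metadata": {"ticker": ticker, "quarter": quarter, "text": chunks[i]},
        }
        for i, item in enumerate(resp.data)
    ]
    index.upsert(vectors=vectors)
```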

The goal is simple: save time and leverage AI to write semantically robust queries that retrieve precise information from the vector database—whether that's specific financial metrics, thematic commentary patterns, or cross-company competitive analysis—in seconds rather than hours of manual review.

The app's priorities are, in order: accuracy first, meaning the right context is retrieved from 200+ earnings calls and correctly incorporated into generated responses; then speed. I built comprehensive evaluation systems to continuously monitor retrieval quality and response accuracy, catching hallucinations or misattributed data before they reach users. But latency matters too: enterprise users expect sub-3-second responses, not 30-second waits. The rest of this article covers the engineering decisions, and the trade-offs, that balance these competing demands: maximizing retrieval precision while maintaining production-grade response times.

Current performance

Latest evaluation run (December 10, 2025)

Metric          Value
Relevance       82.1%
Faithfulness    89.5%
Accuracy        79.5%
Avg latency     8.8s

Baseline → current

Metric          Baseline    Current    Change
Relevance       77.3%       82.1%      +4.8pp
Faithfulness    77.0%       89.5%      +12.5pp
Accuracy        65.0%       79.5%      +14.5pp
Latency         19.5s       8.8s       55% faster

Note: The bigger win wasn’t “number go up.” It was learning which questions fail and why, and fixing the system bottlenecks upstream of generation.

Focus on “production-grade”

For Clarity, “production‑grade” didn’t mean scale. It meant the system behaves predictably under real usage, even when the query is ambiguous, the data is incomplete, or retrieval returns weak evidence. In practice, that translated to three requirements: quality is measurable, performance is measurable, and the answer is explainable (so you can verify it without trusting the model’s tone).

Measurable quality: relevance, faithfulness, and accuracy are scored against a fixed golden dataset. The key nuance is that faithfulness and accuracy fail differently: you can be grounded but wrong (retrieved the wrong quarter), or correct by accident (guessed right with no evidence).

Measurable performance: time to first token (TTFT) and total latency are tracked per run. I treat latency like a product metric because the UX changes completely once responses cross perceptual thresholds (roughly ~2s, ~8s, ~15s); only the first of those still feels instant.

Operational trust: the UI shows what was retrieved and which tools ran — so you can audit the evidence behind the answer. A good answer is one you can inspect: which quarter, which segment, which transcript chunk, which tool output.

The core realization is simple: RAG quality is mostly decided before generation. If retrieval is wrong, prompting can’t save you — the model can only synthesize what’s on the desk.

Example queries:

These are the kinds of prompts Clarity is designed for — each one maps to a specific retrieval path or failure mode.

  • Exact metric (structured JSON lane): “AAPL latest quarter revenue and gross margin”
  • Trend (multi-quarter structured): “NVDA data center revenue trend over the last 4 quarters”
  • Strategy narrative (transcripts): “How is Google monetizing AI? Focus on Search + Cloud.”
  • Executive commentary (hard mode): “What did AMD’s CEO say about AI demand in the latest call?”
  • Cross-company comparison (hard mode): “Compare MSFT vs GOOGL cloud growth and margins — latest quarter for each.”

Best practice: include a ticker and timeframe for numbers, and include a topic focus for strategy questions. If the data isn’t available, Clarity should say so rather than guessing.

The evaluation loop:

Before making “clever” retrieval changes, I built evaluation infrastructure: a golden dataset, repeatable runs, and strategy versioning so each change was testable and reversible. The goal was to stop debating improvements and start shipping the ones that moved specific metrics.
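
As an illustration of what "repeatable runs" means here, the sketch below runs a fixed golden dataset against one tagged strategy version and aggregates the three scores plus latency. The dataset format and the judge functions (`ask_clarity`, `judge_relevance`, `judge_faithfulness`, `judge_accuracy`) are hypothetical stand-ins, not Clarity's actual harness.

```python
# Evaluation-harness sketch (illustrative): fixed golden dataset, tagged strategy,
# per-category accuracy breakdown. Judge functions are hypothetical stand-ins.
import json, time, statistics
from collections import defaultdict

def run_eval(golden_path: str, strategy_version: str) -> dict:
    """Run every golden question against one strategy version and aggregate scores."""
    with open(golden_path) as f:
        golden = json.load(f)  # [{"question", "category", "expected"}, ...]

    rows, latencies = [], []
    for case in golden:
        start = time.perf_counter()
        answer, contexts = ask_clarity(case["question"], strategy=strategy_version)
        latencies.append(time.perf_counter() - start)
        rows.append({
            "category": case["category"],
            "relevance": judge_relevance(case["question"], contexts),
            "faithfulness": judge_faithfulness(answer, contexts),  # grounded in evidence?
            "accuracy": judge_accuracy(answer, case["expected"]),  # actually correct?
        })

    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(row["accuracy"])

    return {
        "strategy": strategy_version,
        "avg_latency_s": round(statistics.mean(latencies), 1),
        "relevance": round(100 * statistics.mean(r["relevance"] for r in rows), 1),
        "faithfulness": round(100 * statistics.mean(r["faithfulness"] for r in rows), 1),
        "accuracy": round(100 * statistics.mean(r["accuracy"] for r in rows), 1),
        "accuracy_by_category": {
            cat: round(100 * statistics.mean(vals), 1) for cat, vals in by_category.items()
        },
    }
```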

What the breakdown revealed

  • Unanswerable: mostly 80–100% accuracy (refuses to hallucinate)
  • Financial: 85–100% accuracy (when the metric exists)
  • Strategy: 75–100% accuracy (narrative questions)
  • Market/comparison: 30–60% accuracy (hardest category)
  • Executive/guidance: 20–60% accuracy (needs better chunking + attribution)

Why this mattered: without a fixed dataset and repeatable runs, “improvements” are just anecdotes. The eval dashboard became the source of truth that made it possible to actually iterate.

Baseline failure modes: why prompting didn't help

In early versions, most failures traced back to evidence selection, not generation. When the model was shown the wrong type of context — or no context at all — it did what models do: it tried to be helpful. In finance, “helpful guessing” is the enemy.

Wrong content type: strategy questions retrieved financial context → the model “fills in” narrative → faithfulness collapses.

Coverage gaps: missing embeddings = no relevant context → plausible-sounding nonsense.

Exactness misses: dense embeddings blur Q3 vs Q2; “close” semantically is wrong factually.

Fiscal calendar traps: “latest quarter” differs across companies; comparisons silently misalign.

Engineering decisions:

I didn’t “prompt my way” to better results. The improvements came from fixing failure modes in the order they actually hurt users: stop making up numbers, stop retrieving the wrong kind of context, stop confusing quarters, then make it fast and debuggable.

1) Separate numbers from narrative

Early on, numeric questions behaved like a trap. If you asked “AAPL latest quarter gross margin,” the system might retrieve a transcript paragraph that talks about margins but doesn’t contain the number — and the model would guess anyway.

The fix was to split evidence into two lanes: structured financial JSON for metrics, and transcript chunks for narrative. Numbers come from deterministic tool output, not from text.

Result: financial questions became reliable (when the data exists) because the model is no longer asked to “extract” precise numbers from messy prose. The cost is maintaining two retrieval paths, but it’s worth it.
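
A sketch of the two-lane idea, under the assumption that quarterly financials live in per-ticker JSON records: numeric questions resolve through a deterministic lookup, while narrative questions go to transcript retrieval. Function and field names (`FINANCIALS`, `retrieve_transcript_chunks`, `generate_answer`) are illustrative, not Clarity's actual schema.

```python
# Two evidence lanes (illustrative): deterministic JSON lookup for metrics,
# vector retrieval for narrative. Field and function names are assumptions.

def get_metric(financials: dict, ticker: str, quarter: str, metric: str):
    """Structured lane: read the number directly from parsed financial JSON.
    Returns None instead of guessing when the metric is missing."""
    record = financials.get(ticker, {}).get(quarter, {})
    return record.get(metric)  # e.g. metric = "gross_margin"

def answer_numeric_question(ticker: str, quarter: str, metric: str) -> str:
    value = get_metric(FINANCIALS, ticker, quarter, metric)  # FINANCIALS: assumed data store
    if value is None:
        return f"{metric} for {ticker} {quarter} is not in the dataset."
    # The LLM is only asked to phrase an answer around a tool-provided number,
    # never to extract the number from prose.
    return f"{ticker} {quarter} {metric}: {value}"

def answer_narrative_question(question: str) -> str:
    # Narrative lane: transcript chunks from the vector index become the
    # evidence the model synthesizes from.
    chunks = retrieve_transcript_chunks(question, top_k=8)  # hypothetical retriever
    return generate_answer(question, evidence=chunks)       # hypothetical generator
```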

2) Add hybrid retrieval for exact terms

Dense retrieval is great for “what does this mean?” and surprisingly bad for “which quarter did you mean?” Dense embeddings happily treat Q2 and Q3 as “close,” which is fatal in finance.

Hybrid search (dense + sparse/BM25-style) fixes that by letting exact tokens matter again — tickers, product names, and period strings like “Q3 FY2025.” Example: “NVDA Blackwell demand” should still hit semantically relevant text, but “Q3 FY2025 gross margin” must match the exact period.

The trade-off is index migration and re-embedding work. The payoff is fewer “adjacent quarter” retrievals and fewer confident-but-wrong answers.
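
One common way to implement hybrid retrieval is to fuse the dense and sparse rankings at query time, for example with reciprocal rank fusion. The sketch below shows that fusion step; whether Clarity uses Pinecone's sparse-dense vectors or a separate BM25 index, and how it weights the two, is not specified in the article, and `dense_search` / `sparse_search` are hypothetical.

```python
# Hybrid retrieval sketch: fuse a dense (semantic) ranking with a sparse
# (BM25-style, exact-token) ranking using reciprocal rank fusion (RRF).

def rrf_fuse(dense_hits: list[str], sparse_hits: list[str], k: int = 60) -> list[str]:
    """Each hit list is ordered best-first; each doc id scores 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for hits in (dense_hits, sparse_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, top_k: int = 8) -> list[str]:
    dense_hits = dense_search(query, top_k=50)    # embeddings: good for meaning
    sparse_hits = sparse_search(query, top_k=50)  # exact tokens like "Q3 FY2025" or "NVDA"
    return rrf_fuse(dense_hits, sparse_hits)[:top_k]
```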

3) Teach the system what “latest” means (per company)

“Latest quarter” sounds simple until you compare companies with different fiscal calendars. Without fiscal-year intelligence, “Compare NVDA vs AMD latest quarter” can silently compare mismatched periods.

Clarity resolves “latest” per ticker using the most recent available quarter for that company and surfaces the risk when quarters don’t align. That turns a silent accuracy bug into an explicit, fixable behavior.
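
A minimal sketch of per-ticker "latest" resolution, assuming each company's available quarters are keyed by period end date; the alignment warning is the important part. Data structures and the 45-day threshold are illustrative assumptions.

```python
# "Latest quarter" resolution sketch: resolve per ticker, then flag when the
# resolved periods don't line up across companies. Data shapes are assumed.
from datetime import date

def latest_quarter(available: dict[str, dict[str, date]], ticker: str) -> str:
    """Return the most recently ended fiscal quarter label for one ticker,
    e.g. 'Q3 FY2025'. `available` maps ticker -> {quarter_label: period_end}."""
    quarters = available[ticker]
    return max(quarters, key=quarters.get)

def resolve_comparison(available: dict[str, dict[str, date]], tickers: list[str]) -> dict:
    resolved = {t: latest_quarter(available, t) for t in tickers}
    ends = {t: available[t][q] for t, q in resolved.items()}
    # Surface the risk instead of silently comparing mismatched periods.
    max_gap_days = (max(ends.values()) - min(ends.values())).days
    aligned = max_gap_days <= 45  # illustrative tolerance
    return {
        "quarters": resolved,
        "aligned": aligned,
        "warning": None if aligned else
            f"Fiscal periods end up to {max_gap_days} days apart; compare with care.",
    }
```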

4) Harden structured data extraction (because “structured” isn’t)

Even the financial JSON isn’t perfectly consistent. Margin and EPS fields drift across sources and quarters, so a single hard-coded path produces false “not found” results.

The fix was explicit fallback chains for key metrics. It’s not glamorous, but it converts brittle failures into robust retrieval.
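
For illustration, a fallback chain can be as simple as trying a list of candidate field names, then a derived computation, before declaring a metric missing. The field names below are assumptions about how source data drifts, not Clarity's real schema.

```python
# Fallback-chain sketch for a "structured" metric whose field name drifts
# across sources and quarters. Field names here are illustrative.

GROSS_MARGIN_CHAIN = ("gross_margin", "grossMargin", "gross_margin_pct")

def get_gross_margin(record: dict):
    # 1) Try known field-name variants in order.
    for key in GROSS_MARGIN_CHAIN:
        if record.get(key) is not None:
            return record[key]
    # 2) Derive it when the components exist.
    revenue, gross_profit = record.get("revenue"), record.get("gross_profit")
    if revenue and gross_profit is not None:
        return gross_profit / revenue
    # 3) Only now report "not found" - a real data gap, not a brittle lookup miss.
    return None
```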

5) Treat latency as two separate problems

Real latency is reducing wasted work: fewer unnecessary tool loops, faster retrieval, and an LLM choice that fits interactive UX. Perceived latency is making the system legible while it works.

That’s why Clarity streams status (“analyzing / searching / generating”), tool start/results, and end-of-run metrics. People tolerate 10 seconds when they can see what’s happening — they abandon after 3 seconds of a blank screen.
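
A sketch of the perceived-latency side: the backend emits status events as it works, so the UI can show progress instead of a blank screen. The event names, async-generator shape, and helper functions (`analyze_query`, `generate_answer`) are illustrative, not Clarity's actual streaming API.

```python
# Perceived-latency sketch: stream newline-delimited JSON status events while
# the real work runs. Helper functions are hypothetical stand-ins.
import json
from typing import AsyncIterator

async def answer_stream(question: str) -> AsyncIterator[str]:
    def event(kind: str, **data) -> str:
        return json.dumps({"event": kind, **data}) + "\n"

    yield event("status", stage="analyzing")
    plan = await analyze_query(question)            # hypothetical: intent + entities + tool plan

    yield event("status", stage="searching")
    for tool in plan.tools:
        yield event("tool_start", name=tool.name)
        result = await tool.run()
        yield event("tool_result", name=tool.name, summary=result.summary)

    yield event("status", stage="generating")
    async for token in generate_answer(question, plan.evidence):  # hypothetical LLM stream
        yield event("token", text=token)

    yield event("done", metrics=plan.metrics)       # end-of-run metrics for the UI
```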

6) Route queries by intent (instead of one-size-fits-all)

“How is Google monetizing AI?” and “AAPL Q3 FY2025 gross margin” should not use the same retrieval behavior. One is thematic; the other is exact.

The system routes by intent: precision/hybrid for exact terms, dense for narrative strategy, and deeper retrieval patterns for multi-part asks. It adds branching, but it avoids “wrong kind of context” failures.
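
A simplified version of the routing step, assuming three intents: exact metric, narrative, and comparison. In practice the classifier could be a rules-plus-LLM hybrid; here a few regex heuristics stand in for it, and `dense_search` / `per_ticker_retrieval` are hypothetical.

```python
# Intent-routing sketch: pick a retrieval behavior per query type.
# The regex heuristics and route names are illustrative only.
import re

METRIC_TERMS = r"revenue|margin|eps|guidance|operating income"
PERIOD_TERMS = r"q[1-4]\s*(fy)?\s*\d{4}|latest quarter|last \d+ quarters"

def route(query: str) -> str:
    q = query.lower()
    if " vs " in q or q.startswith("compare"):
        return "comparison"           # multi-entity: deeper, per-ticker retrieval
    if re.search(METRIC_TERMS, q) and re.search(PERIOD_TERMS, q):
        return "exact_metric"         # precision/hybrid lane + structured JSON tool
    return "narrative"                # dense retrieval over transcript chunks

def retrieve(query: str):
    return {
        "exact_metric": lambda: hybrid_search(query, top_k=5),
        "narrative":    lambda: dense_search(query, top_k=10),
        "comparison":   lambda: per_ticker_retrieval(query),   # hypothetical deeper pattern
    }[route(query)]()
```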

Closing thought

The lesson isn’t “hybrid search wins” or “use model X.” It’s this: RAG is systems engineering.

Quality comes from coverage, retrieval, grounding rules, and evaluation discipline. Once those are strong, the model becomes what it should be: a synthesis engine — not a guesser.