Problem
There is no shortage of information in sensitive domains. The real problem is deciding what is worth including, what can be trusted, what needs more context, and what should be blocked from reaching a reader at all.
I wanted to build something that did more than summarize. I wanted it to retrieve source material, generate a digest grounded in those sources, and then evaluate the result before delivery.
Product Approach
The system pulls in current source material, generates digest content from it, and then runs multiple checks before anything is treated as acceptable. The core idea is simple: do not trust a model's first pass just because it sounds polished.
This public version focuses on the architecture rather than the original domain. The interesting part for me was the combination of retrieval, grounding, evaluation, cost-aware personalization, and delivery orchestration.
Quality Model
1. Retrieval
Fetch current source material first so the model is working from evidence instead of a blank prompt.
2. Grounded generation
Generate shared and cohort-specific digest content from the available sources and foundational knowledge.
3. Layered evals
Run automated checks, a rubric-based evaluator, and an audit trail so no single pass is responsible for all quality judgment.
4. Delivery gate
Only deliver when the quality bar is met. Otherwise block the run and make the failure inspectable.
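The four stages above can be sketched as a small orchestration function. This is a minimal illustration, not the project's actual code: `run_digest`, `RunResult`, and the stage callables (`retrieve`, `generate`, `evaluators`, `deliver`) are hypothetical names, and blocking the run when retrieval returns nothing is an assumption about how a blank prompt would be avoided.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RunResult:
    delivered: bool
    digest: Optional[str]
    failures: List[str]  # kept on the result so a blocked run stays inspectable


def run_digest(
    retrieve: Callable[[], List[str]],
    generate: Callable[[List[str]], str],
    evaluators: List[Callable[[str, List[str]], List[str]]],
    deliver: Callable[[str], None],
) -> RunResult:
    """Hypothetical pipeline: retrieve, generate, evaluate, then gate delivery."""
    # 1. Retrieval: gather evidence before any generation happens.
    sources = retrieve()
    if not sources:
        # Assumption: with no sources, block instead of generating from nothing.
        return RunResult(False, None, ["no sources retrieved"])

    # 2. Grounded generation: the generator only sees retrieved material.
    digest = generate(sources)

    # 3. Layered evals: each evaluator returns a list of failure messages.
    failures: List[str] = []
    for check in evaluators:
        failures.extend(check(digest, sources))

    # 4. Delivery gate: deliver only when every layer came back clean.
    if failures:
        return RunResult(False, digest, failures)
    deliver(digest)
    return RunResult(True, digest, [])
```

Passing the stages in as callables keeps the gate logic independent of any particular model or retrieval backend, so the control flow can be tested with stubs.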
Example eval artifacts
These are adapted from real eval outputs, with the original topic and source details stripped out.
Ship gate
PASS only if:
- automated checks pass
- overall rubric >= 3.0
- no dimension below 2
- no blocking audit findings
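The gate conditions above translate fairly directly into code. The sketch below assumes the overall rubric score is the mean of the per-dimension scores, which the excerpt does not state; `ship_gate` and `GateResult` are hypothetical names, and the thresholds are the ones listed above.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class GateResult:
    passed: bool
    reasons: List[str] = field(default_factory=list)


def ship_gate(
    automated_ok: bool,
    rubric_scores: Dict[str, float],
    blocking_findings: List[str],
    min_overall: float = 3.0,
    min_dimension: float = 2,
) -> GateResult:
    """Return PASS only if every gate condition holds; collect all failures."""
    reasons: List[str] = []

    if not automated_ok:
        reasons.append("automated checks failed")

    if not rubric_scores:
        reasons.append("no rubric scores")
    else:
        # Assumption: "overall rubric" is the mean of the dimension scores.
        overall = mean(rubric_scores.values())
        if overall < min_overall:
            reasons.append(f"overall rubric {overall:.2f} < {min_overall}")
        low = sorted(name for name, s in rubric_scores.items() if s < min_dimension)
        if low:
            reasons.append("dimension below minimum: " + ", ".join(low))

    if blocking_findings:
        reasons.append(f"{len(blocking_findings)} blocking audit finding(s)")

    return GateResult(passed=not reasons, reasons=reasons)
```

Collecting every failed condition, instead of returning at the first one, is a design choice that keeps a blocked run easy to inspect.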
Rubric excerpt
{
  "accessibility": 4,
  "tone": 4,
  "biggest_risk": "Some sections may feel too clinical for a worried reader."
}
Audit excerpt
used_source: "Recent study on exercise and mobility"
faithful_to_source: true
omitted_source: "Early-stage molecular paper"
likely_reason: "Too technical for a general audience"
What I Learned
Source grounding changes more than just accuracy. It changes tone, confidence, and how much the system reaches for filler when it does not know enough. Evals are also easy to overestimate. A rubric that gives everything a 5 is not comforting; it is a sign that the evaluator itself needs calibration.
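One rough way to notice an evaluator that hands out top marks too easily is to track how often it returns the maximum score. The sketch below is only an illustration: `saturation_rate`, `needs_calibration`, the 1-to-5 scale, and the 0.9 threshold are all assumptions, not values taken from the project.

```python
from typing import Dict, List


def saturation_rate(score_history: List[Dict[str, int]], max_score: int = 5) -> float:
    """Fraction of all recorded dimension scores that sit at the maximum."""
    scores = [s for run in score_history for s in run.values()]
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= max_score) / len(scores)


def needs_calibration(
    score_history: List[Dict[str, int]],
    max_score: int = 5,
    threshold: float = 0.9,  # assumed cutoff; tune against known-bad golden cases
) -> bool:
    """Flag the evaluator when nearly every score is the top score."""
    return saturation_rate(score_history, max_score) >= threshold
```

A high saturation rate does not prove the evaluator is wrong, but it is a cheap signal that the rubric should be checked against cases that are known to deserve low scores.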
I also started to think more carefully about which guardrails belong in prompts and which should live in the product architecture. If a rule really matters, I increasingly prefer to enforce it structurally instead of hoping the model remembers.
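As a hedged example of a structurally enforced rule, consider requiring that every digest item point at a source that was actually retrieved. This specific rule and the names `DigestItem` and `unsupported_items` are hypothetical; the point is only that the check runs in code after generation, regardless of what the prompt asked for.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DigestItem:
    text: str
    source_ids: Tuple[str, ...]  # hypothetical: IDs of the sources this item relies on


def unsupported_items(
    items: Iterable[DigestItem], retrieved_ids: Iterable[str]
) -> List[DigestItem]:
    """Return items that cite nothing, or cite a source that was never retrieved."""
    allowed = set(retrieved_ids)
    return [
        item
        for item in items
        if not item.source_ids or not set(item.source_ids) <= allowed
    ]
```

A non-empty result could be treated as a blocking finding, so the rule holds even when the model ignores the corresponding instruction in the prompt.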
Where It Could Go
The next step is a more fully genericized public version with synthetic data, a safer demo topic, and visible evaluation artifacts. I would especially like to make the golden cases, rubric calibration, and refusal logic easier to inspect from the outside, because that is where a lot of the product judgment really lives.