Project

Health Updates Briefing

A system for turning health research, news, and education into a clearer weekly update, and where I started taking evals and trust much more seriously.

Problem

There is no shortage of information in sensitive domains. The real problem is deciding what is worth including, what can be trusted, what needs more context, and what should be blocked from reaching a reader at all.

I wanted to build something that did more than summarize. I wanted it to retrieve source material, generate a digest grounded in those sources, and then evaluate the result before delivery.

Product Approach

The system pulls in current source material, uses that to generate digest content, and then runs multiple checks before anything is treated as acceptable. The core idea is simple: do not trust a model's first pass just because it sounds polished.

This public version focuses on the architecture rather than the original domain. The interesting part for me was the combination of retrieval, grounding, evaluation, cost-aware personalization, and delivery orchestration.

Quality Model

1

Retrieval

Fetch current source material first so the model is working from evidence instead of a blank prompt.

2

Grounded generation

Generate shared and cohort-specific digest content from the available sources and foundational knowledge.

3

Layered evals

Run automated checks, a rubric-based evaluator, and an audit trail so no single pass is responsible for all quality judgment.

4

Delivery gate

Only deliver when the quality bar is met. Otherwise block the run and make the failure inspectable.

Example eval artifacts

These are adapted from real eval outputs, with the original topic and source details stripped out.

Ship gate

PASS only if:
- automated checks pass
- overall rubric >= 3.0
- no dimension below 2
- no blocking audit findings

Rubric excerpt

{
  "accessibility": 4,
  "tone": 4,
  "biggest_risk":
    "Some sections may feel
     too clinical for a worried
     reader."
}

Audit excerpt

used_source:
  "Recent study on exercise
   and mobility"
faithful_to_source: true

omitted_source:
  "Early-stage molecular paper"
likely_reason:
  "Too technical for a
   general audience"

What I Learned

Source grounding changes more than just accuracy. It changes tone, confidence, and how much the system reaches for filler when it does not know enough. Evals are also easy to overestimate. A rubric that gives everything a 5 is not comforting; it is a sign that the evaluator itself needs calibration.

I also started to think more carefully about which guardrails belong in prompts and which should live in the product architecture. If a rule really matters, I increasingly prefer to enforce it structurally instead of hoping the model remembers.

Where It Could Go

The next step is a more fully genericized public version with synthetic data, a safer demo topic, and visible evaluation artifacts. I would especially like to make the golden cases, rubric calibration, and refusal logic easier to inspect from the outside, because that is where a lot of the product judgment really lives.