Reliability Open-source

A memory-correct reliability layer for an AI product team

Their assistant was storing users' questions as facts. A drop-in gate took memory precision from about 35% to 100% on their labeled test set.

[Figure: memory reliability gate dashboard showing speech-act classification, app.crowork.ai/memory]
  • ~35% → 100%: precision on labeled test set
  • Drop-in: no rebuild required
  • Public: tooling is open-source

The problem

An AI product team had a subtle, corrosive bug. Their assistant's memory was storing users' questions as if they were facts. Ask it "is the office open on Friday?" and it might later "remember" that the office is open on Friday — a fact it was never told.

Over time this poisoned answers with confident, invented information. It is one of the most damaging failure modes an AI product can have, because it erodes the one thing the product is supposed to provide: trust.

Our approach

The fix was not a bigger model or a longer prompt — it was a gate at the moment of storage. Before anything is written to memory, classify what kind of statement it is. A question is not a fact. A hypothetical is not a fact. Only genuine assertions of fact should be stored.
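
The classification step described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual classifier: the cue lists, the `SpeechAct` enum, and the function names are all assumptions made for the example, and a production gate would more likely rely on a trained classifier or a model call than on keyword rules.

```python
import re
from enum import Enum


class SpeechAct(Enum):
    ASSERTION = "assertion"
    QUESTION = "question"
    HYPOTHETICAL = "hypothetical"


# Hypothetical cue lists, for illustration only. Keyword rules like these
# will misclassify some messages; they stand in for a real classifier.
_QUESTION_OPENERS = re.compile(
    r"^(is|are|was|were|do|does|did|can|could|will|would|should|"
    r"who|what|when|where|why|how)\b",
    re.IGNORECASE,
)
_HYPOTHETICAL_CUES = re.compile(
    r"\b(if|suppose|supposing|imagine|hypothetically|assuming)\b",
    re.IGNORECASE,
)


def classify_speech_act(message: str) -> SpeechAct:
    """Label a message as a question, a hypothetical, or an assertion."""
    text = message.strip()
    if text.endswith("?") or _QUESTION_OPENERS.match(text):
        return SpeechAct.QUESTION
    if _HYPOTHETICAL_CUES.search(text):
        return SpeechAct.HYPOTHETICAL
    return SpeechAct.ASSERTION


def should_store(message: str) -> bool:
    """Only genuine assertions are allowed through to memory."""
    return classify_speech_act(message) is SpeechAct.ASSERTION
```

With these assumed rules, `should_store("is the office open on Friday?")` is false, while `should_store("The office is open on Friday.")` is true.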

And even then, one mention should not be treated as certainty — so we added a trust model where confidence builds with corroboration instead of being assumed on first contact.
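
One way to picture "confidence builds with corroboration" is a score that rises with each independent source. The sketch below is an assumption, not the trust model used in the engagement: the formula `1 - (1 - p)^n`, the per-mention weight of 0.5, the 0.8 threshold, and the names `FactRecord` and `TrustLedger` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FactRecord:
    text: str
    sources: Set[str] = field(default_factory=set)

    def trust(self, per_mention: float = 0.5) -> float:
        # Assumed formula: each distinct source independently supports the
        # fact with weight `per_mention`, so trust = 1 - (1 - p)^n.
        return 1.0 - (1.0 - per_mention) ** len(self.sources)


class TrustLedger:
    """Tracks how many distinct sources have asserted each fact."""

    def __init__(self) -> None:
        self._facts: Dict[str, FactRecord] = {}

    @staticmethod
    def _key(fact: str) -> str:
        # Crude normalization (lowercase, collapsed whitespace) so trivially
        # different spellings of the same sentence share one record.
        return " ".join(fact.lower().split())

    def observe(self, fact: str, source_id: str) -> float:
        """Record one mention and return the updated trust score."""
        record = self._facts.setdefault(self._key(fact), FactRecord(text=fact))
        record.sources.add(source_id)  # a repeated source adds nothing
        return record.trust()

    def is_established(self, fact: str, threshold: float = 0.8) -> bool:
        record = self._facts.get(self._key(fact))
        return record is not None and record.trust() >= threshold
```

Under these assumed numbers, a single mention scores 0.5, a second independent source raises it to 0.75, and a third to 0.875, which is the first point at which the fact clears the 0.8 threshold. Repeating the same source does not move the score.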

How the gate works

Incoming message → Classify speech act

  • Genuine fact ✓ → stored with trust score
  • Question / hypothetical → filtered, never stored
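
The flow above can be wired together as a small gate in front of the memory store. This is a sketch under stated assumptions: `MemoryGate`, its fields, the label strings, the starting trust score of 0.5, and `toy_classifier` are all hypothetical, and the classifier is deliberately pluggable so that a real one can replace the toy rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MemoryGate:
    # Any callable returning "fact", "question", or "hypothetical".
    classify: Callable[[str], str]
    initial_trust: float = 0.5  # assumed starting score for a first mention
    stored: Dict[str, float] = field(default_factory=dict)
    filtered: List[str] = field(default_factory=list)

    def handle(self, message: str) -> bool:
        """Return True if the message was written to memory."""
        if self.classify(message) != "fact":
            self.filtered.append(message)  # kept for auditing, never stored
            return False
        # First mention gets the initial score; an existing score is kept.
        self.stored.setdefault(message, self.initial_trust)
        return True


def toy_classifier(message: str) -> str:
    """Stand-in rules for the demo; not a realistic classifier."""
    text = message.strip().lower()
    if text.endswith("?"):
        return "question"
    if text.startswith(("if ", "suppose ", "what if ")):
        return "hypothetical"
    return "fact"


gate = MemoryGate(classify=toy_classifier)
gate.handle("Is the office open on Friday?")   # filtered
gate.handle("The office is open on Friday.")   # stored with trust score
```

Keeping the filtered messages in a separate list is a design choice for the sketch: it makes it possible to review what the gate rejected without letting any of it reach memory.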

What we built

A drop-in reliability layer that sits in front of memory — the same layer we ship underneath every AI worker we build.

  • Classifies every incoming message before anything is stored
  • Stores only genuine facts — questions and hypotheticals are filtered out
  • Applies a trust model — a single mention is not treated as established truth
  • Drops into an existing assistant without rebuilding it
  • Validated against the team's own labeled test set

Open-source tooling

We open-source the reliability tooling behind this work: the approach is public, while the engagement itself stays anonymous.

The result

On their labeled test set, precision went from roughly 35% to 100% — the layer stopped writing non-facts into memory. The assistant's answers stopped drifting toward confidently invented information.
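
For readers who want the metric spelled out: memory precision here means the share of stored items that are genuine facts, i.e. true positives divided by everything that was stored. The sketch below assumes a labeled test set represented as `(was_stored, is_genuine_fact)` pairs; the function name and that representation are illustrative, and the sample data in the usage note is made up rather than taken from the team's test set.

```python
from typing import Iterable, Optional, Tuple


def memory_precision(results: Iterable[Tuple[bool, bool]]) -> Optional[float]:
    """Precision = stored genuine facts / all stored items.

    `results` holds (was_stored, is_genuine_fact) pairs, one per labeled
    test message. Returns None when nothing was stored, since precision
    is undefined in that case.
    """
    true_positives = 0
    false_positives = 0
    for was_stored, is_fact in results:
        if was_stored and is_fact:
            true_positives += 1
        elif was_stored:
            false_positives += 1
    stored = true_positives + false_positives
    return true_positives / stored if stored else None
```

As a toy illustration with invented labels, a memory that stores all four of `[(True, True), (True, False), (True, False), (True, False)]` scores 0.25, while a gated run that stores only the genuine fact, `[(True, True), (False, False), (False, False), (False, False)]`, scores 1.0.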

How we think about it

Reliability is the whole game for an AI product. A clever assistant that quietly corrupts its own memory is worse than a simple one that does not, because users learn they cannot trust it.

We would rather an AI say "I do not know" than confidently make something up — so we build the gate that makes that the default. This is the same layer we ship underneath every AI worker we build.

Related

  • Case study: Inbox-to-tasks assistant for a busy operator
  • Case study: A custom AI employee for live operations monitoring
  • All case studies

Is your AI product storing the right things?

A free audit will tell you whether your assistant's memory is reliable — and what a drop-in gate would actually cost.
