Hybrid Scoring

When you call memory.retrieve, the response includes both a score and a similarity, and they’re different by design. similarity is the raw cosine similarity between the query and the memory’s embedding. score is the composite Mnemexa actually ranks by: a weighted blend of four signals.

The four factors

Factor | Weight | What it measures
Similarity | 0.55 | Cosine similarity between the query and the memory’s stored embedding.
Recency | 0.20 | How recently the memory was stored or accessed.
Importance | 0.15 | The LLM-assigned importance score from when the memory was written.
Frequency | 0.10 | How often the memory has been retrieved historically.

score = 0.55 × similarity
      + 0.20 × recency
      + 0.15 × importance
      + 0.10 × frequency

Each factor is normalized to [0, 1] before weighting. Because the four weights sum to 1.0 (0.55 + 0.20 + 0.15 + 0.10), score is also in [0, 1].

These weights are tuned and may shift over time as retrieval quality is measured at scale. The relative ordering (similarity dominant, frequency lightest) is stable; the exact values are subject to refinement.
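A minimal Python sketch of the formula, assuming each factor has already been normalized to [0, 1] upstream. WEIGHTS and composite_score are illustrative names, not part of the Mnemexa API. The check on the weight sum makes the [0, 1] bound explicit.

```python
# Documented weights. The docs note the exact values may be refined over time.
WEIGHTS = {
    "similarity": 0.55,
    "recency": 0.20,
    "importance": 0.15,
    "frequency": 0.10,
}


def composite_score(similarity: float, recency: float,
                    importance: float, frequency: float) -> float:
    """Weighted blend of the four factors, each expected in [0, 1]."""
    factors = {
        "similarity": similarity,
        "recency": recency,
        "importance": importance,
        "frequency": frequency,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1], got {value}")
    return sum(WEIGHTS[name] * value for name, value in factors.items())


# Because the weights sum to 1.0, the score can never leave [0, 1].
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

print(round(composite_score(0.70, 0.95, 0.60, 0.80), 3))  # 0.745
```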

Why each factor matters

Similarity (0.55) — the semantic anchor

Without similarity, you’d retrieve random memories regardless of the query, which is why it carries more than half the weight. But pure cosine is brittle: two memories can be equally close to the query while one of them is three months stale, low-importance, or never read. The other factors correct for that.
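For reference, cosine similarity is the dot product of two embedding vectors divided by the product of their lengths. The plain-Python sketch below computes it for made-up toy vectors; the function and variable names are illustrative. It also shows the brittleness described above: two memories with identical embeddings get identical cosine whatever their age, so only the other factors can tell them apart. Raw cosine ranges from -1 to 1; how Mnemexa maps it into [0, 1] is not specified on this page.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two vectors divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional embeddings; real embedding models emit far longer vectors.
query = [0.2, 0.9, 0.1]
written_last_week = [0.3, 0.8, 0.1]
written_three_months_ago = [0.3, 0.8, 0.1]  # same content, much older

# Cosine sees no difference between the two; only the other factors can.
print(cosine_similarity(query, written_last_week)
      == cosine_similarity(query, written_three_months_ago))  # True
```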

Recency (0.20) — defends against staleness

A memory written six months ago and never touched ranks below a memory written last week, even at the same cosine. This matters for AI agent memory because workflows and preferences change: “client prefers async standups” from January is less useful than “client switched to sync standups in March”.

The recency factor uses a smooth decay curve, not a hard cutoff. Old memories aren’t excluded — they’re just outranked by fresher ones at similar cosine.
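The page does not say which decay curve is used. As one illustration of smooth decay without a cutoff, the sketch below uses exponential decay with a hypothetical 90-day half-life. Both the curve shape and the half-life are assumptions, and recency_factor is an illustrative name.

```python
# Hypothetical half-life. Mnemexa does not document the shape or
# parameters of its decay curve; exponential decay is an assumption.
HALF_LIFE_DAYS = 90.0


def recency_factor(age_days: float, half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Smooth exponential decay from 1.0 (brand new) toward 0.0.

    age_days is assumed to count from the later of storage or last access.
    The value halves every half-life and shrinks gradually instead of
    hitting a hard cutoff, so old memories are down-weighted, not excluded.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return 0.5 ** (age_days / half_life_days)


for days in (0, 7, 90, 180):
    print(days, round(recency_factor(days), 3))
```

A longer half-life flattens the curve, so older memories keep more of their recency credit; a shorter one makes the ranking favor fresh memories more aggressively.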

Importance (0.15) — defends against noise

Low-importance memories (greetings, small acknowledgments, transient debugging context) get stored when they slip through the noise filter, but they shouldn’t outrank high-signal facts. The importance weight ensures that “client uses Postgres 15 with pgvector” (high-importance) wins over “client said thanks for the help” (low-importance) at similar cosine.

Frequency (0.10) — defends against orphans

A memory that gets retrieved frequently is — empirically — more useful than one that has sat untouched since storage. The frequency factor rewards proven-useful memories.

Frequency carries the lightest weight because it’s the most game-able and the most subject to feedback loops. Bumping it higher would let popular memories dominate at the expense of newer, relevant ones.
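The weights also bound how much each secondary factor can matter. Every factor lies in [0, 1], so a full lead in one factor (1.0 versus 0.0) is worth at most that factor’s weight, and dividing by the similarity weight converts it into an equivalent cosine gap. The arithmetic below uses only the documented weights; break_even_similarity_gap is an illustrative helper.

```python
# Documented weights.
W_SIMILARITY = 0.55
OTHER_WEIGHTS = {"recency": 0.20, "importance": 0.15, "frequency": 0.10}


def break_even_similarity_gap(weight: float, factor_lead: float = 1.0) -> float:
    """Cosine gap exactly offset by a lead of `factor_lead` in one factor,
    with every other factor held equal."""
    return weight * factor_lead / W_SIMILARITY


for name, weight in OTHER_WEIGHTS.items():
    # factor_lead=1.0 is the largest possible lead: 0.0 versus 1.0.
    print(f"{name}: {break_even_similarity_gap(weight):.3f}")
# recency: 0.364
# importance: 0.273
# frequency: 0.182
```

So even a maximally retrieved memory loses to a never-retrieved one whose cosine is more than about 0.18 higher, all else equal. That ceiling is what keeps popular memories from dominating.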

How score and similarity diverge

Some realistic patterns:

Memory | Similarity | Recency | Importance | Frequency | Final score
Stored 7 days ago, retrieved 12 times, importance 6 | 0.70 | 0.95 | 0.60 | 0.80 | 0.745
Stored 6 months ago, never retrieved, importance 3 | 0.80 | 0.30 | 0.30 | 0.00 | 0.545

The second memory has higher raw cosine but the first wins the ranking — exactly what you’d want for an agent that needs to reflect current reality, not the archive.
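To reproduce the comparison, the sketch below plugs both rows of the table into the documented formula and sorts them once by similarity and once by score. Candidate is an illustrative stand-in, not a Mnemexa type.

```python
from dataclasses import dataclass

WEIGHTS = {"similarity": 0.55, "recency": 0.20, "importance": 0.15, "frequency": 0.10}


@dataclass
class Candidate:
    label: str
    similarity: float
    recency: float
    importance: float
    frequency: float

    @property
    def score(self) -> float:
        # Composite score: weighted sum of the four normalized factors.
        return (WEIGHTS["similarity"] * self.similarity
                + WEIGHTS["recency"] * self.recency
                + WEIGHTS["importance"] * self.importance
                + WEIGHTS["frequency"] * self.frequency)


candidates = [
    Candidate("7 days old, retrieved 12x", 0.70, 0.95, 0.60, 0.80),
    Candidate("6 months old, never retrieved", 0.80, 0.30, 0.30, 0.00),
]

by_similarity = sorted(candidates, key=lambda c: c.similarity, reverse=True)
by_score = sorted(candidates, key=lambda c: c.score, reverse=True)

print("by similarity:", [c.label for c in by_similarity])
print("by score:     ", [(c.label, round(c.score, 3)) for c in by_score])
```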

When the weights shift

The retrieval pipeline supports a recency_mode flag internally (boost / dampen / neutral) that adjusts the recency weight by ±0.08 depending on the query context. This isn’t exposed on the public API today — the default neutral mode applies. The flag exists so future product features (e.g. a “historical search” mode) can dial recency down without changing the response shape.
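Beyond the ±0.08 shift, the page does not say how the other weights respond when recency_mode changes. The sketch below is one plausible reading, not the actual pipeline: it shifts the recency weight and then rescales all four weights proportionally so they still sum to 1.0, which keeps score in [0, 1]. The rescaling step and the weights_for_mode helper are assumptions; the pipeline could instead take the 0.08 from a single other factor.

```python
from typing import Dict

BASE_WEIGHTS = {"similarity": 0.55, "recency": 0.20, "importance": 0.15, "frequency": 0.10}

# The documented shift; how the other weights respond is not documented.
RECENCY_SHIFT = {"boost": 0.08, "dampen": -0.08, "neutral": 0.0}


def weights_for_mode(mode: str) -> Dict[str, float]:
    """Hypothetical: shift the recency weight, then rescale all weights to sum to 1.0."""
    if mode not in RECENCY_SHIFT:
        raise ValueError(f"unknown recency_mode: {mode!r}")
    weights = dict(BASE_WEIGHTS)
    weights["recency"] += RECENCY_SHIFT[mode]
    # Assumption: proportional rescaling keeps the composite score in [0, 1].
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


for mode in ("boost", "neutral", "dampen"):
    print(mode, round(weights_for_mode(mode)["recency"], 3))
```

Under this assumption, boost raises the effective recency weight to about 0.26 rather than 0.28, because rescaling spreads the extra weight across all four factors.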

What you control

  • top_k — how many results to return.
  • min_score — a floor on the cosine similarity, applied before re-ranking; despite the name, it filters on similarity rather than on the composite score. Defaults to 0.35. Set it higher if you only want strong matches.
  • The query text itself — phrasing matters. Specific queries get specific matches.
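The two numeric knobs act in a fixed order: the min_score floor is applied to cosine similarity first, and only the survivors are re-ranked by score and cut to top_k. The sketch below simulates that order locally. rank and Memory are hypothetical names, not the Mnemexa client API, and treating the floor as inclusive (>=) is an assumption. The third memory shows why the floor matters: its composite score would beat the “thanks” memory, but its cosine is below 0.35, so it never reaches re-ranking.

```python
from dataclasses import dataclass
from typing import List

WEIGHTS = {"similarity": 0.55, "recency": 0.20, "importance": 0.15, "frequency": 0.10}


@dataclass
class Memory:
    text: str
    similarity: float
    recency: float
    importance: float
    frequency: float

    def score(self) -> float:
        # Composite score: weighted sum of the four normalized factors.
        return (WEIGHTS["similarity"] * self.similarity
                + WEIGHTS["recency"] * self.recency
                + WEIGHTS["importance"] * self.importance
                + WEIGHTS["frequency"] * self.frequency)


def rank(memories: List[Memory], top_k: int = 5, min_score: float = 0.35) -> List[Memory]:
    """Illustrative ordering: similarity floor first, then composite re-ranking."""
    # min_score is compared against raw similarity, not the composite score.
    eligible = [m for m in memories if m.similarity >= min_score]
    eligible.sort(key=lambda m: m.score(), reverse=True)
    return eligible[:top_k]


memories = [
    Memory("client said thanks for the help", 0.60, 0.95, 0.10, 0.00),
    Memory("client uses Postgres 15 with pgvector", 0.62, 0.90, 0.90, 0.40),
    # Strong on every secondary factor, but below the similarity floor.
    Memory("unrelated but popular note", 0.20, 1.00, 1.00, 1.00),
]

for m in rank(memories, top_k=5):
    print(f"{m.score():.3f}  {m.text}")
```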

That’s the whole knob surface. Mnemexa intentionally keeps the ranking opaque-but-explainable: you get score and similarity back so you can see why a memory ranked where it did, but you don’t tune the weights yourself. The weights are tuned by the people who maintain the system, against the workloads it sees.