supabrain

Context Allocation Across Time

The problem is not longer context. The problem is deciding what deserves context, when, and in what form.

Project overview

supabrain is a research project, not a product. It examines whether retrieval-augmented language-model agents need a per-memory utility signal that is distinct from the retrieval-similarity score the embedder already provides. The project ran across seven iterations between 2025 and 2026 — multi-resolution memory tiers, ELO-style hygiene, composite scoring, an abandoned real-corpus audit, query-conditional micro-tests, end-to-end evaluation on MemoryAgentBench, and finally a LoRA specialist trained on counterfactually ablated labels.

The project's primary contribution is methodological: counterfactual ablation as a per-memory utility signal that is non-circular with respect to cosine similarity by three structural arguments and is empirically separated from cosine across four large-scale runs. The signal makes it possible to ask "did this memory matter for this answer?" without the label-leakage failure modes that quietly trap most LLM-judge and rank-position baselines.

Phase 7 closed the build track (Pfad 1) under the project's pre-registered Rule C: all three triggers — MAB K=10 regression on 3/4 datasets, LoCoMo Multi-Hop F1 = 17.0% against A-MEM's published 45.85%, and a learning-pattern verdict of "MAB-K=5-locked surface-pattern detector" — fired independently. The broader hypothesis remains untested under operationalizations we did not run; the paper is the close-out of one operationalization, not a falsification of the thesis.

Paper

Counterfactual Ablation for Memory-Utility Evaluation:
A Pre-Registered Case Study in Specialist Re-ranking

Max Jürschik — Kill The Dragon GmbH, Vienna

Context allocation across time — not context length — is the central memory problem for retrieval-augmented language-model agents. The paper's methodological contribution is counterfactual ablation as a per-memory utility signal: remove each retrieved memory in turn and label it by the resulting change in answerer correctness. The construction is non-circular by three structural arguments, with Spearman correlations from −0.024 to +0.161 across four large-scale runs — three within-pipeline on MemoryAgentBench and LoCoMo, one substrate-independent on LoCoMo Multi-Hop whose confidence interval spans zero and which we treat as binding. We exercise the signal on one operationalization of the hypothesis that context-allocation requires per-memory utility distinct from cosine, and report a documented dissolution as the case study.

A 1.5B-parameter LoRA specialist trained on these labels produced point-estimate gains of +8/+7/+4/+5 substring-exact-match over vanilla retrieval at K = 5 on MemoryAgentBench. Five rigor layers tighten this result. Paired-bootstrap 95% confidence intervals leave two strictly significant cells. K-normalization to the published comparator depth leaves 1/4 datasets within ±2pp, on partial data. BM25 sparse retrieval beats the specialist by +13 to +22pp on three of four datasets, reframing the K = 5 gains as "less suboptimal than BGE cosine alone" rather than competitive. Cross-substrate transfer to LoCoMo Multi-Hop returns F1 17.0% against a published 45.85% (A-MEM, Xu et al. 2025), but a prompt-control shows the specialist contributes +13pp over vanilla cosine on the same prompt — the residual gap is pipeline-attributable, not total cross-substrate failure. Learning-pattern probes score memory-equals-query at 100% above zero and fail label discrimination on a held-out validation sample. What survives: counterfactual ablation as a non-circular outcome signal and the rigor-dissolution discipline with pre-registered ADRs anchored to public git history. The broader hypothesis remains untested under operationalizations we did not run.

Methodology in brief

Memory-system evaluation in the published literature relies primarily on per-question task correctness on a downstream benchmark, which is gameable through configuration choices, and on LLM-judge scores of per-memory usefulness, which on extractive substrates collapse to topic match — what cosine similarity already measures. The project history catalogues three distinct forms of this outcome-signal circularity: synthetic-construction (utility assigned as an algebraic function of the similarity score), LLM-judge (a frozen judge that re-encodes cosine in prompt-disguise), and rank-position-synthesis (Win/Draw/Loss labels built from the rank at which a memory was retrieved). A specialist trained on labels from any of the three recovers cosine rather than learning beyond it.

Counterfactual ablation is one response. For each retrieved memory in a top-K context, recompute the answer with that memory removed and label the memory by the resulting change in correctness. The construction is non-circular with respect to the embedder's similarity score by three independent structural arguments (no similarity score in the label; no LLM judge; rank-independence within the retrieved set), and is empirically separated from cosine on every substrate where the signal applies. Across four large-scale runs — three within-pipeline on MemoryAgentBench and LoCoMo (n = 5,910; n = 1,000; n = 1,665), one substrate-independent on LoCoMo Multi-Hop (n = 175, ρ = −0.024 with CI spanning zero) — the rank correlation between counterfactual utility and BGE cosine ranged from −0.024 to +0.161. Rank-variance unexplained by linear cosine correlation was at least 96% in every row.

Bar chart: vanilla BGE cosine, v6 specialist, and BM25 SEM scores at K=5 and K=10 across four MemoryAgentBench datasets. BM25 dominates the specialist on three of four cells at both retrieval depths.
Figure 1. BM25 sparse retrieval vs vanilla BGE cosine vs the Phase 6 specialist on MemoryAgentBench, K = 5 and K = 10.
Vertical cascade visualizing the five rigor layers: bootstrap CIs, K-normalization, BM25 baseline, cross-substrate transfer with prompt control, and learning-pattern probes. Each layer narrows the Phase 6 apparent +8/+7/+4/+5 advantage further. A bottom band describes what survives the cascade.
Figure 2. The rigor-dissolution cascade. The Phase 6 specialist's apparent K = 5 advantage narrows or disappears at each of the five rigor layers.

Iteration history

Code and data

The source repository at github.com/ktdmax/supabrain contains the full project history: planning documents, decision-rule ADRs with timestamps anchored to commits, all generated training data, the LoRA adapter weights from Phase 6, the four-rigor-layer evaluation outputs, the BM25 and prompt-control measurements added in revision, and the figure generation scripts. The pre-registration chain is auditable via git log progress/v{6,7}/decisions.md.

Bulk training data and Modal-CUDA evaluation outputs are available on request; reach out at the contact address below.

Contact

Max Jürschik
Kill The Dragon GmbH, Vienna
Email
max@killthedragon.com