Why “negative vectors” can't delete data in FAISS – but weighted kernels can
The fix for machine unlearning in vector databases turns out to be conceptually simple, but it requires changing the semantics of retrieval.
Standard FAISS-style indices store vectors and compute:
argmaxᵢ ⟨q, vᵢ⟩
If you insert −v, nothing happens. It’s just another point. The original vector is still maximally similar to itself and remains rank-1.
This isn’t a bug—it’s a consequence of selection-based retrieval.
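As a concrete illustration, here is a brute-force sketch of the selection semantics in plain Rust (argmax_ip and the flat scan are illustrative stand-ins, not FAISS’s actual API):

    // Brute-force analogue of a flat inner-product index.
    fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }

    // Selection: return the index of the vector with the largest inner product.
    fn argmax_ip(index: &[Vec<f32>], q: &[f32]) -> Option<usize> {
        index
            .iter()
            .enumerate()
            .map(|(i, v)| (i, dot(v, q)))
            .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .map(|(i, _)| i)
    }

    fn main() {
        let v = vec![1.0_f32, 0.0];
        let neg_v: Vec<f32> = v.iter().map(|x| -x).collect();
        let mut index = vec![v.clone(), vec![0.0, 1.0]];
        index.push(neg_v); // attempted "deletion" by inserting -v
        // v still scores highest against itself; -v just sits at the bottom of the ranking.
        assert_eq!(argmax_ip(&index, &v), Some(0));
    }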
If instead you store (vector, weight) pairs and evaluate:
φ(q) = Σᵢ wᵢ · K(q, vᵢ)
you get a different object entirely: a field, not a selection. Now inserting the same vector with w = −1 causes destructive interference. The contribution cancels. The attractor disappears.
Deletion becomes O(1) append-only (add the inverse), not a structural rebuild.
FAISS-style: Vec<Vec<f32>> → argmax (selection)
Weighted form: Vec<(Vec<f32>, f32)> → Σ (field)
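A minimal sketch of the field semantics, assuming a Gaussian kernel with a fixed bandwidth sigma (the names Field, kernel, and phi are illustrative, not from any particular library):

    // (vector, weight) pairs; the index is an append-only log.
    type Field = Vec<(Vec<f32>, f32)>;

    // Gaussian kernel K(q, v) = exp(-||q - v||^2 / (2 * sigma^2)).
    fn kernel(q: &[f32], v: &[f32], sigma: f32) -> f32 {
        let d2: f32 = q.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
        (-d2 / (2.0 * sigma * sigma)).exp()
    }

    // phi(q) = sum_i w_i * K(q, v_i): a field, not a selection.
    fn phi(field: &Field, q: &[f32], sigma: f32) -> f32 {
        field.iter().map(|(v, w)| w * kernel(q, v, sigma)).sum()
    }

    fn main() {
        let sigma = 0.5;
        let v = vec![1.0_f32, 0.0];
        let mut field: Field = vec![(v.clone(), 1.0)];
        assert!(phi(&field, &v, sigma) > 0.9); // attractor present

        field.push((v.clone(), -1.0)); // O(1) append-only deletion: the inverse element
        assert!(phi(&field, &v, sigma).abs() < 1e-6); // exact cancellation: phi -> 0
    }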
We validated this on 100k vectors:
• FAISS: target stays rank-1 after “deletion”
• Field-based model: exact cancellation (φ → 0), target unretrievable
The deeper point is that this isn’t a trick; it’s a semantic separation.
• FAISS implements a selection operator over discrete points.
• The weighted version implements a field operator where vectors act as kernels in a continuous potential.
• Retrieval becomes gradient ascent to local maxima (sketched below).
• Deletion becomes destructive interference that removes attractors.
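To make the gradient-ascent bullet concrete, here is a sketch on the same Gaussian field (grad_phi, retrieve, the step size eta, and the fixed iteration count are all illustrative choices, not a prescribed algorithm):

    // Field and kernel as in the previous sketch.
    type Field = Vec<(Vec<f32>, f32)>;

    fn kernel(q: &[f32], v: &[f32], sigma: f32) -> f32 {
        let d2: f32 = q.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
        (-d2 / (2.0 * sigma * sigma)).exp()
    }

    // For the Gaussian kernel, grad_q phi(q) = sum_i w_i * K(q, v_i) * (v_i - q) / sigma^2.
    fn grad_phi(field: &Field, q: &[f32], sigma: f32) -> Vec<f32> {
        let mut g = vec![0.0_f32; q.len()];
        for (v, w) in field {
            let k = w * kernel(q, v, sigma);
            for (gj, (vj, qj)) in g.iter_mut().zip(v.iter().zip(q)) {
                *gj += k * (vj - qj) / (sigma * sigma);
            }
        }
        g
    }

    // Retrieval: ascend phi from the query toward the nearest surviving attractor.
    fn retrieve(field: &Field, q0: &[f32], sigma: f32, eta: f32, steps: usize) -> Vec<f32> {
        let mut q = q0.to_vec();
        for _ in 0..steps {
            let g = grad_phi(field, &q, sigma);
            for (qj, gj) in q.iter_mut().zip(&g) {
                *qj += eta * gj;
            }
        }
        q
    }

    fn main() {
        let sigma = 0.5;
        let field: Field = vec![(vec![1.0, 0.0], 1.0), (vec![0.0, 1.0], 1.0)];
        // A query near (1, 0) climbs to that attractor.
        let q = retrieve(&field, &[0.8, 0.1], sigma, 0.1, 200);
        assert!((q[0] - 1.0).abs() < 0.05 && q[1].abs() < 0.05);
    }

A cancelled vector contributes nothing to the field, so it is no longer a local maximum and gradient ascent can never land on it.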
This shifts deletion from structural (modify index, rebuild, filter) to algebraic (append an inverse element). You get append-only logs, reversible unlearning, and auditable deletion records. The negative weight is the proof.
Implication: current vector DBs can’t guarantee GDPR/CCPA erasure without reconstruction. Field-based retrieval can—provably.
Paper with proofs: https://github.com/nikitph/bloomin/blob/master/negative-vect...
That makes sense, but how do you efficiently evaluate the Gaussian-kernel-based approach (“operator-based data structures (OBDS)”)? Presumably you want to do it in a way that keeps a dynamically updating data structure, instead of computing a low-rank approximation to the kernel, etc.? In my understanding, the upsides of the kNN-based approaches are fast querying and the ability to dynamically insert additional vectors?