I built TurboQuant+ (https://github.com/TheTom/llama-cpp-turboquant), the llama.cpp implementation of this paper with extensions: asymmetric K/V compression, boundary layer protection, sparse V dequant, and, as of this week, weight compression (TQ4_1S) that shrinks models 28-42% on disk with minimal quality loss. 5k+ stars, 50+ community testers across Metal, CUDA, and AMD HIP.
Cool to see the same WHT + Lloyd-Max math applied to vector search. The data-oblivious codebook property is exactly what makes it work for online KV cache compression too. No calibration, no training, just quantize and go.
I built a Python implementation of Google's TurboQuant paper (ICLR 2026) for vector search. The key thing that makes this different from PQ and other quantization methods: it's fully data-oblivious. The codebook is derived from math (not trained on your data), so you can add vectors online without ever rebuilding the index. Each vector encodes independently in ~4ms at d=1536.
The repo reproduces the benchmarks from Section 4.4 of the paper — recall@1@k on GloVe (d=200) and OpenAI embeddings (d=1536, d=3072). At 4-bit on d=1536, you get 0.967 recall@1@1 with 8x compression. At 2-bit, 0.862 recall@1@1 with ~16x compression.
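For anyone curious what "codebook derived from math" means in practice, here's a minimal NumPy sketch of the idea: a random-sign flip plus a Walsh-Hadamard rotation makes each coordinate approximately Gaussian, so a fixed Lloyd-Max codebook for N(0, 1) works on any data with no training. The function names and the 2-bit codebook constants are my own illustration, not the repo's actual API:

```python
import numpy as np

def hadamard(n):
    # Naive Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

# Classic 4-level (2-bit) Lloyd-Max quantizer levels for N(0, 1),
# approx ±0.4528 and ±1.510 -- fixed, independent of the data.
LEVELS_2BIT = np.array([-1.510, -0.4528, 0.4528, 1.510])

def encode(x, signs, H):
    # Random signs + Hadamard spread the energy so every coordinate
    # looks roughly Gaussian, matching the fixed codebook.
    z = H @ (signs * x)
    scale = np.linalg.norm(z) / np.sqrt(len(z))  # one scale per vector
    codes = np.abs(z[:, None] / scale - LEVELS_2BIT[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale  # 2 bits/coord + one float

def decode(codes, scale, signs, H):
    z = LEVELS_2BIT[codes] * scale
    return signs * (H.T @ z)  # H is orthogonal, signs are self-inverse

rng = np.random.default_rng(0)
d = 64
H = hadamard(d)
signs = rng.choice([-1.0, 1.0], size=d)  # shared across all vectors

x = rng.standard_normal(d)
codes, scale = encode(x, signs, H)
x_hat = decode(codes, scale, signs, H)
cos = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
```

Because `signs` and `H` are the only shared state and neither depends on the data, each vector encodes independently and can be added online without touching the rest of the index.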
If anyone is running local LLMs and wants to try it: https://github.com/TheTom/turboquant_plus/blob/main/docs/get...
Paper: https://arxiv.org/abs/2504.19874