Binary Quantization

  • Binary Quantization

    A readable explanation of the (relatively new) technique of Binary Quantization applied to LLM embeddings. It’s pretty amazing that this compression technique can work without destroying search recall and accuracy, but it seems it does!

    Using BQ will reduce your memory consumption and improve retrieval speeds by up to 40x […] Binary quantization (BQ) converts any vector embedding of floating point numbers into a vector of binary or boolean values. […] All [vector floating point] numbers greater than zero are marked as 1. If it’s zero or less, they become 0. The benefit of reducing the vector embeddings to binary values is that boolean operations are very fast and need significantly less CPU instructions. […] One of the reasons vector search still works with such a high compression rate is that these large vectors are over-parameterized for retrieval. This is because they are designed for ranking, clustering, and similar use cases, which typically need more information encoded in the vector.
    https://www.elastic.co/search-labs/blog/rabitq-explainer-101 is a good maths-heavy explanation of the Elastic implementation using RaBitQ. See also some results from HuggingFace, https://huggingface.co/blog/embedding-quantization .

    (tags: embedding llm ai algorithms data-structures compression quantization binary-quantization quantisation rabitq search recall vectors vector-search)
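The quoted description boils down to two steps: threshold each float component at zero to get a bit vector, then compare vectors with cheap boolean operations (XOR plus a popcount gives Hamming distance). A minimal NumPy sketch of that idea, for illustration only — the function names are mine, and real implementations like RaBitQ add extra machinery (random rotations, rescoring) on top:

```python
import numpy as np

def binary_quantize(vec: np.ndarray) -> np.ndarray:
    """Components > 0 become 1, zero or less become 0; pack 8 bits/byte."""
    return np.packbits(vec > 0, axis=-1)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """XOR the packed bytes, then count the differing bits (popcount)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Toy 8-dimensional "embeddings" (real ones are hundreds of dims wide)
q = binary_quantize(np.array([0.5, -0.2, 0.0, 1.3, -4.0, 0.1, 2.0, -0.7]))
d = binary_quantize(np.array([-0.5, 0.2, 0.0, 1.3, -4.0, 0.1, 2.0, -0.7]))
print(hamming_distance(q, d))  # differs in the first two components -> 2
```

This also makes the 40x figure plausible: a 1024-dimensional float32 embedding is 4096 bytes, while the packed binary version is 128 bytes — a 32x size reduction before any speed gains from the cheaper distance computation.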