A COMPLETELY crazy (impressive) example of what the o1 model can do.

My friend Jim pointed me to a rather nerdy piece, where they described this:

"Approximate nearest neighbor search for vectors relies on indexes that are most often accessed from RAM. Therefore, storage is the factor limiting the size of the database that can be served from a machine. Lossy vector compression, i.e., embedding quantization, has been applied extensively to reduce the size of indexes. However, for inverted file and graph-based indices, auxiliary data such as vector ids and links (edges) can represent most of the storage cost. We introduce and evaluate lossless compression schemes for these cases. These approaches are based on asymmetric numeral systems or wavelet trees that exploit the fact that the ordering of ids is irrelevant within the data structures. In some settings, we are able to compress the vector ids by a factor 7, with no impact on accuracy or search runtime. On billion-scale datasets, this results in a reduction of 30% of the index size. Furthermore, we show that for some datasets, these methods can also compress the quantized vector codes losslessly, by exploiting sub-optimalities in the original quantization algorithm. The source code for our approach available at this https URL."

I then chose o1 as the model, pasted the text in, and hit return. Brace yourself. This is long and pretty wild:

Below is a high-level explanation and summary of the ideas presented in the passage. (Because you provided the text yourself, I’ll treat it as user-provided content and focus on discussing and clarifying it.)


What problem is being addressed?

  • Approximate Nearest Neighbor (ANN) search in high-dimensional spaces (e.g., for embedding-based search) often relies on specialized data structures such as inverted indexes or graph-based structures like HNSW graphs.

  • These data structures are typically stored in RAM to enable very fast query times. However, RAM is expensive and finite—thus storage usage becomes a critical bottleneck for large-scale ANN systems.

  • Lossy compression methods (like product quantization, scalar quantization, etc.) have received a lot of attention for compressing the embedding vectors themselves. But the paper/passage highlights that auxiliary data—such as storing the identifiers of vectors (often called “doc IDs,” “vector IDs,” or “label IDs”) or the edges in a graph-based structure—can consume more memory than the compressed vectors.

  • Reducing this auxiliary storage overhead without sacrificing search accuracy or speed is the main motivation.


Key idea: Lossless compression of “auxiliary” data

  1. Vector IDs in an inverted file or graph-based index:

    • If you think of an inverted list, it’s essentially a list of IDs of vectors that fall into a particular “bucket” or “cell.”
    • Likewise, in a graph-based index (like HNSW), each node is connected to a small set of other nodes, and those connections are stored as lists of IDs.
    • The ordering of these IDs generally doesn’t matter for correctness. What matters is which IDs appear, but not in what order.
  2. Compression strategies:

    • Asymmetric Numeral Systems (ANS): A family of entropy coding methods (similar in spirit to arithmetic coding or Huffman coding, but often faster in practice). ANS can compress sequences of symbols based on their probability distributions.
    • Wavelet trees: A compact data structure that, among other things, can encode permutations or subsets in a way that supports fast rank/select queries. Wavelet trees exploit patterns in sequences to compress them.

    The passage mentions that these two approaches exploit the irrelevance of ordering to reduce redundancy. Essentially, if the index structure just needs to know “which IDs are in each bucket,” you can store them in a data structure that is more compact than a simple sorted or unsorted list (see the sketch after this list).

  3. Compression results:

    • The passage states that vector IDs can sometimes be reduced by a factor of 7(!) with no accuracy or runtime penalty.
    • On billion-scale datasets, this yields a 30% index size reduction—a substantial saving given how large billion-scale indexes can be.
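
To make the order-irrelevance idea concrete, here is a minimal Python sketch (my own illustration, not the paper's actual codec: zlib stands in for ANS and the bucket sizes are made up) showing how much smaller a bucket of vector IDs can get once you are free to reorder it:

```python
import random
import struct
import zlib

# Hypothetical inverted-list bucket: 10,000 vector IDs drawn from a
# collection of 10 million vectors (illustrative numbers only).
random.seed(0)
ids = random.sample(range(10_000_000), 10_000)

# Baseline: store every ID as a raw 32-bit integer.
raw = struct.pack(f"{len(ids)}I", *ids)

# Order-insensitive trick: sort the IDs, keep only the gaps between
# consecutive IDs, and entropy-code the gaps. zlib is just a stand-in
# for a real entropy coder such as ANS.
sorted_ids = sorted(ids)
gaps = [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]
compressed = zlib.compress(struct.pack(f"{len(gaps)}I", *gaps), level=9)

print(f"raw IDs:              {len(raw):6d} bytes")
print(f"sorted + gaps + zlib: {len(compressed):6d} bytes")
```

Sorting makes the gaps small and highly compressible; the ANS- and wavelet-tree-based codecs described in the passage exploit the same degree of freedom, but in ways that preserve fast access during search.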

Secondary finding: Losslessly compressing quantized vector codes

  • The primary motivation was to compress the auxiliary data (IDs, links). However, the authors/passage observed that some existing quantization pipelines are not always optimal.
  • They found opportunities to “re-compress” the quantized representation itself (e.g., product quantization codes) without losing any information—i.e., completely lossless—by exploiting sub-optimalities in how the quantization or encoding was done originally.

Why this matters in practice

  1. Reduced memory footprint: With storage cut down by 30% (or more, depending on the data structure), the same machine can hold a larger index, effectively handling bigger datasets or leaving more RAM free for other processes.

  2. No compromise on accuracy or latency: Many compression tricks can hurt search accuracy or speed. This approach is purely lossless with respect to the structure (i.e., no additional recall drop) and, when done right, should not introduce overhead that slows down search.

  3. Easy integration: Because it’s lossless, these compression methods often fit neatly on top of existing ANN indices. The key challenge is to ensure the software still provides fast access to the neighbor lists or inverted lists during queries.


Where to find more details

  • The passage indicates that source code is available at a specified URL (redacted here as “this https URL”). If you have that link, you can likely explore:

    • Implementation details
    • Benchmarks
    • Possibly instructions on how to integrate the code into popular ANN frameworks.
  • If you are already using libraries like Faiss, Annoy, or HNSW:

    • Some of these compression ideas might be integrated into future releases (Faiss frequently experiments with new compression techniques).
    • Or you might adapt the provided reference implementation in a custom pipeline.

In summary

  • Goal: Shrink the memory usage of large-scale ANN indices by compressing the IDs/links, in addition to the usual vector quantization.
  • Methods: Primarily revolve around asymmetric numeral systems (ANS) and wavelet trees, which exploit the fact that ordering of IDs is not crucial.
  • Results: Large compression gains (up to 7× for the IDs, roughly a 30% reduction in total index size on billion-scale datasets), with no loss in accuracy or slowdown in query time.
  • Bonus: Identified ways to losslessly compress even the quantized embeddings themselves, taking advantage of sub-optimal encodings in existing methods.

Overall, it’s a straightforward yet powerful idea: treat the list of IDs or edges as a candidate for advanced entropy-based compression. When a system doesn’t require a particular ordering, it opens the door to more flexible and efficient encodings, bringing down RAM usage considerably.

explain wavelet trees, please.

A wavelet tree is a clever data structure used to represent and compress sequences (often strings or arrays of integers) while still allowing fast queries about the sequence—particularly rank, select, and access queries. These operations are valuable in text indexing, suffix arrays, and various compressed data structures.

Below is an overview of what wavelet trees are, how they’re built, and why they’re useful.


1. The Key Ideas Behind Wavelet Trees

  1. Hierarchical decomposition of the alphabet:

    • Imagine you have a sequence S of length n with symbols drawn from an alphabet Σ.
    • A wavelet tree recursively splits Σ into (usually) two subsets at each node, turning your sequence into a binary structure.
    • At each node, you store a bitvector that indicates, for each position in the subsequence, whether the symbol belongs to the “left” subset or the “right” subset.
  2. Self-contained structure:

    • You can think of each level of the wavelet tree as further distinguishing symbols by splitting them into smaller subsets.
    • Because each node focuses on a binary partition (left subset vs. right subset), the wavelet tree can store where (bit positions) each symbol “went” without storing the raw symbols again.
    • The final leaves of the wavelet tree effectively hold only one symbol (or a small subset of symbols).
  3. Supporting fast queries:

    • From the bitvectors (and partial sums or rank data structures on these bitvectors), you can quickly figure out:
      • How many symbols up to position i are equal to (or less than) a certain value (rank queries),
      • Where the k-th occurrence of a symbol is (select),
      • The value at a given position (random access).
    • The queries run in time proportional to O(log |Σ|) or O(log n) in many typical implementations.

2. Example: Building a Wavelet Tree (Step-by-Step)

Consider a simple sequence of characters:

S = “banana”.

The alphabet here is Σ = {a, b, n}.

Step 1: Decide on a partition scheme

One simple approach is to sort the alphabet and split it in half. Suppose we order Σ as {a, b, n}. A common split might be:

  • Left side: {a, b}
  • Right side: {n}

(Alternatively, you might do a balanced partition by frequency or by dividing the alphabet into two roughly equal halves in sorted order. For the sake of example, we’ll just use {a, b} vs. {n}.)

Step 2: Build the root bitvector

  • We traverse the sequence S character by character.
  • For each character, we output a 0 if it belongs to the left subset ({a, b}) and a 1 if it belongs to the right subset ({n}).

Sequence S = [b, a, n, a, n, a]

  • b → left subset {a, b} → 0
  • a → left subset {a, b} → 0
  • n → right subset {n} → 1
  • a → left subset {a, b} → 0
  • n → right subset {n} → 1
  • a → left subset {a, b} → 0

So the root bitvector is: [0, 0, 1, 0, 1, 0].

Step 3: Split the sequence into two subsequences

  • Left child gets the subsequence of symbols that produced 0 in the root bitvector, in the same order they appear in S.
  • Right child gets the subsequence that produced 1.

So:

  • Left child’s subsequence = [b, a, a, a]
  • Right child’s subsequence = [n, n]

Now we build the wavelet tree recursively for each child.

Left child (symbols in {a, b}):

We have the subsequence [b, a, a, a] with an alphabet {a, b}. We can split it again. Suppose we define:

  • Left subset: {a}
  • Right subset: {b}

Bitvector at left child:

  • b → right subset → 1
  • a → left subset → 0
  • a → left subset → 0
  • a → left subset → 0

So the bitvector is [1, 0, 0, 0].

Then:

  • Left child’s child: symbols [a, a, a]
  • Right child’s child: symbols [b]

At this point, each of these “grandchildren” is dealing with only one symbol in its alphabet, so you typically store no further bitvector. They become leaves.

Right child (symbols in {n}):

For the subsequence [n, n], the alphabet is a single symbol {n}. No further partition is needed; this is a leaf.

Hence, we have a complete wavelet tree:

  1. The root node (split {a, b} vs. {n} with bitvector [0, 0, 1, 0, 1, 0]).
  2. Left child (split {a} vs. {b} with bitvector [1, 0, 0, 0]).
  3. The left child of that node is [a, a, a] (no further splitting).
  4. The right child of that node is [b].
  5. The right child of the root is [n, n].
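
If it helps to see the recursion concretely, here is a small, purely illustrative Python sketch (my own, not from the passage) that reproduces the bitvectors from the walkthrough above; a real implementation would use packed bitvectors with rank/select support instead of Python lists:

```python
def build(seq, alphabet):
    """Build a toy wavelet-tree node as a nested dict (educational only)."""
    if len(alphabet) <= 1:
        # Leaf: only one symbol remains, so no bitvector is needed.
        return {"symbol": alphabet[0], "count": len(seq)}
    # Split the sorted alphabet; left half maps to bit 0, right half to bit 1.
    half = (len(alphabet) + 1) // 2
    left_alpha, right_alpha = alphabet[:half], alphabet[half:]
    left_set = set(left_alpha)
    return {
        "bits": [0 if c in left_set else 1 for c in seq],
        "left": build([c for c in seq if c in left_set], left_alpha),
        "right": build([c for c in seq if c not in left_set], right_alpha),
    }

tree = build(list("banana"), sorted(set("banana")))
print(tree["bits"])          # [0, 0, 1, 0, 1, 0]  (root: {a, b} vs. {n})
print(tree["left"]["bits"])  # [1, 0, 0, 0]        (left child: {a} vs. {b})
```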

3. Queries on the Wavelet Tree

Thanks to the bitvectors and rank/select data structures, you can answer queries quickly:

  1. Access(i): What is the symbol at position i in S?

    • You start at the root.
    • Check whether the bit at position i is 0 or 1.
    • If it is 0, you move to the left child and convert i to the rank of that 0 (i.e., how many 0s appear up to position i in the root bitvector).
    • If it is 1, you go right and convert i to the rank of that 1 up to position i.
    • Repeating this at each level leads you to the leaf that tells you the actual symbol.
  2. Rank(c, i): How many symbols equal c up to position i in S?

    • Again, you start at the root and see whether c belongs to the left or the right subset.
    • If it’s left, you count how many 0s appear up to position i, then move down to the left child, adjusting i to be that rank.
    • If it’s right, you count how many 1s appear up to position i, move to the right child, and so forth.
    • You continue until you reach the leaf that represents symbol c. The final rank you get is the total number of occurrences of c in the prefix of length i.
  3. Select(c, k): Where is the k-th occurrence of c in S?

    • You locate the leaf corresponding to c.
    • Then you “climb up” the wavelet tree (or you can do a top-down approach with rank queries) to figure out which position in the parent bitvector corresponds to the k-th occurrence.
    • By reversing that process up the tree, you eventually find the position in the root bitvector, which is the position in S.
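
And here is a matching toy implementation of access and rank (again my own sketch, not production code: the `count` calls on bitvector prefixes would be replaced by O(1) rank structures in a real library, and select is left out for brevity):

```python
class ToyWaveletTree:
    """Pointer-based wavelet tree for illustration; no succinct bitvectors."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None  # leaf node: a single symbol
            return
        half = (len(self.alphabet) + 1) // 2
        self.left_set = set(self.alphabet[:half])
        self.bits = [0 if c in self.left_set else 1 for c in seq]
        self.left = ToyWaveletTree(
            [c for c in seq if c in self.left_set], self.alphabet[:half])
        self.right = ToyWaveletTree(
            [c for c in seq if c not in self.left_set], self.alphabet[half:])

    def access(self, i):
        """Return the symbol at position i (0-based)."""
        if self.bits is None:
            return self.alphabet[0]
        if self.bits[i] == 0:
            # New position = number of 0s strictly before position i.
            return self.left.access(self.bits[:i].count(0))
        return self.right.access(self.bits[:i].count(1))

    def rank(self, c, i):
        """Count occurrences of symbol c among the first i positions.

        Assumes c actually occurs in the tree's alphabet.
        """
        if self.bits is None:
            return i
        if c in self.left_set:
            return self.left.rank(c, self.bits[:i].count(0))
        return self.right.rank(c, self.bits[:i].count(1))


wt = ToyWaveletTree(list("banana"))
print(wt.access(2))     # 'n'
print(wt.rank("a", 6))  # 3 ('a' occurs three times in "banana")
```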

4. Why Wavelet Trees Help in Compression

  • Compact representation: Storing the sequence as hierarchical bitvectors can be more efficient than storing the raw symbols (especially if your alphabet is large but has skewed frequencies or if you use entropy-compressed bitvectors).
  • No repeated storage of full symbols: At each node, you don’t store the symbols themselves; you store a bitvector of 0/1 splits. The actual sub-sequences are implied.
  • Fast rank/select: Once you have rank/select structures for each node’s bitvector, queries run in O(log |Σ|) time (or even faster with some advanced implementations).

In practice, wavelet trees often use additional compression techniques on the bitvectors themselves (like run-length encoding, Huffman-shaped wavelet trees, or asymmetric numeral systems (ANS)) to exploit patterns and frequencies in the data.


5. Usage in ANN Index Compression

In an approximate nearest neighbor index, sometimes you just need to store sets of IDs or permutations of IDs. A wavelet tree can be used to store the IDs in a compressed form while still giving you the ability to quickly answer queries like:

  • “Which IDs appear in a certain partition?”
  • “What is the k-th ID in this posting list?”
  • “How many IDs less than X appear up to position i?”

Wavelet trees can reduce memory usage significantly (compared to naive arrays or linked structures) and still provide near-random-access functionality, which is important for fast nearest-neighbor search lookups.


6. Further Reading

  • Original paper: Wavelet trees were introduced by Grossi, Gupta, and Vitter in the paper titled “High-order entropy-compressed text indexes” in 2003.
  • Textbook coverage: Many texts on compressed data structures (e.g., Gonzalo Navarro’s “Compact Data Structures”) have entire chapters on wavelet trees, giving detailed proofs and examples.
  • Practical implementations:
    • The Succinct Data Structure Library (SDSL) provides a clean implementation in C++.
    • There are also Python, Java, and other language-specific libraries that implement wavelet trees, typically under the umbrella of “succinct” or “compressed” data structures.

In Summary

A wavelet tree represents a sequence in a hierarchical bit-partition manner. It’s powerful because it:

  1. Compresses (often near entropy bound, with proper bitvector encoding),
  2. Provides fast queries (rank, select, access),
  3. Is relatively straightforward to construct with a recursive or divide-and-conquer approach,
  4. Is widely used in text indexing, bioinformatics, and more recently in ANN structures for memory-efficient representation of IDs or edges.

They’re one of the fundamental tools in the “succinct data structures” toolkit, bridging the gap between compactness and random-access query performance.

is this of relevance in AI, eg. in LLM's?

Short answer: Wavelet trees can be relevant in AI workflows, but they are not a typical component of standard large language model (LLM) pipelines. They are, however, very relevant for compressed data structures in fields like text indexing, bioinformatics, or large-scale approximate nearest neighbor (ANN) search, which can appear in certain AI-related applications—especially those involving large embedding databases or custom retrieval systems.

Below is a bit more detail on why wavelet trees aren’t part of most LLM training or inference pipelines, yet how they might show up in certain AI contexts.


1. Typical LLM Pipelines Don’t Use Wavelet Trees

  1. Tokenization: LLMs usually work with byte-pair encoding (BPE), SentencePiece, or WordPiece tokenizers. These tokenizers rely on frequency-based subword merges or probabilistic splitting—not wavelet trees.

    • There’s no direct reason to use wavelet trees in these tokenizers, because BPE merges subwords in a greedy fashion, and the final lexicon is stored as a simple vocabulary plus merges or a finite-state transducer, not as a wavelet tree.
  2. Model Architecture: Transformer-based LLMs don’t typically perform symbolic rank/select or substring queries internally. They learn continuous embeddings and handle sequence context with attention layers.

    • Wavelet trees are essentially a discrete data structure, useful for tasks where you need fast queries on symbol positions or counts.
  3. Training Data Storage: Large language models do read enormous text corpora for training. But that data is usually stored in standard binary file formats (e.g. text files, TFRecords, or other chunked formats).

    • While wavelet trees could compress the text in a way that still allows partial random access, most LLM training pipelines (for now) do not require complex substring rank/select queries. Typically, you just read large chunks of text sequentially or in random chunks, which is more straightforward with a standard compressed format (gzip, zstd, etc.).

2. Where Wavelet Trees Can Be Relevant in AI

  1. Large-Scale Approximate Nearest Neighbor (ANN) Search:

    • AI systems often rely on embedding-based retrieval for tasks like semantic search, recommendation, and retrieval-augmented generation.
    • Some ANN structures (e.g., HNSW, inverted file indexes) can store large sets of IDs or adjacency lists. In certain advanced setups, you could use a wavelet tree to compress the ID lists or adjacency pointers while retaining relatively fast access.
    • This is where wavelet trees (and other “succinct data structures”) can directly reduce memory usage, especially at massive scale (billions of vectors).
  2. Specialized Text Search & Indexing:

    • If your LLM application involves full-text search with complex queries—say, advanced substring queries or storing suffix arrays or suffix automata—then wavelet trees might come into play for specialized indexing.
    • For instance, wavelet trees power FM-indexes and compressed suffix arrays used in advanced text search and bioinformatics. If your AI pipeline does deep text analytics or substring-based retrieval, wavelet trees can be part of that indexing layer.
  3. Compressed Knowledge Stores:

    • Some “knowledge representation” frameworks might store large knowledge graphs or knowledge bases in a compressed format. If queries involve enumerating subsets of edges or adjacency, wavelet-tree-like structures can help.

3. Potential for Future Use

As LLMs and AI systems grow to include bigger and bigger knowledge corpora or embedding-based retrieval systems, efficient memory usage becomes more critical. Tools like wavelet trees are part of a general family of “succinct data structures” that might appear in:

  • Retrieval-Augmented Generation (RAG) systems that have to store huge corpora of vector embeddings and quickly retrieve them.
  • Online/offline hybrid setups where you might keep some data in compressed form (e.g., on disk) yet still need partial or random access.

While they’re not mainstream in day-to-day LLM training or inference, you may see more wavelet-tree-inspired methods in advanced compression for AI infrastructure—especially as vector databases and ANN indexes become essential building blocks.


4. Takeaways

  • Wavelet trees: Great for compactly storing and querying large discrete sequences.
  • LLMs: Typically do not require discrete rank/select queries, so wavelet trees aren’t a built-in component.
  • Relevance to AI: Most likely to show up in embedding search or specialized indexing scenarios, rather than in the core LLM architecture or tokenizer pipelines.

If you’re building large-scale ANN systems or compressed text indexes to feed LLMs (or to augment them with retrieval capabilities), wavelet trees and other succinct data structures can be helpful. Otherwise, they’re a more specialized tool that won’t appear in most LLM workflows by default.

please write three easy questions for me that can check whether I have understood the most important points about this topic. Thanks.

Here are three simple questions that can help test your understanding of wavelet trees and their applications:

  1. What is the fundamental purpose of a wavelet tree?
    (Hint: Think about how a sequence is stored and which kinds of queries the structure supports.)

  2. What kind of information is stored in each node of a wavelet tree?
    (Hint: Consider the split into 0/1 bitvectors and how the symbols are distributed.)

  3. Why can wavelet trees be useful in a system that handles very large amounts of data (for example, an ANN index)?
    (Hint: Think of terms like “compression” and “fast access”.)

Good luck!
