Learning discrete representations of continuous data is a fundamental strategy behind many recent modelling techniques.
For example, much of the latest work in audio modelling uses an auto-encoder with a quantization step as the bottleneck to create audio “tokens”. These tokens work just like the text tokens a BPE tokenizer produces for language models, and accordingly can be modelled in exactly the same way.
Most quantization strategies used today are actually pretty simple, but I find that both their implementations and their mathematical descriptions are often deceptively complex.
For now I’ll focus on how these tokenisers work in the forward pass, for inference; training them is a follow-up.
## Vector Quantization
Vector Quantization defines a “codebook”, which is just a mapping from `codebook_size` integers to a set of vectors. The `codebook_size` is just like the `vocab_size` of a language model’s tokenizer, and we can think of each “code” as a “token”. I’ll refer to them this way from here on.
```python
import torch
import torch.nn as nn

codebook = nn.Embedding(vocab_size, embedding_dim)
```
To turn a token into its vector representation, we just index into it.
```python
token = torch.tensor([5])
vector = codebook(token)
```
To encode a vector, we simply find the closest vector in our codebook. Most commonly this uses the Euclidean distance as the measure of “closeness”.
```python
distances = torch.cdist(vector, codebook.weight, p=2)
token = distances.argmin(dim=-1)
```
That’s all!
Most papers use a `vocab_size` on the order of `1024-4096`. Larger codebooks are generally more expressive but need extra tricks to optimise.
As there is a discrete number of tokens to represent, codebooks are often measured in `bits` per code; for example, `vocab_size = 1024` is `== 2**10`, so 10 bits per code.
One common theme in many approaches to improving convergence is to use a very small `embedding_dim`, with linear layers to downscale and upscale around the bottleneck, e.g. `codebook_embedding_dim=32` even when the rest of the model operates with `embedding_dim=1024`. Intuitively, this lower dimensional embedding space might shrink the gap in expressivity between the tokenised and vector representations.
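As a rough sketch of what that looks like (the layer names and sizes here are illustrative, not taken from a specific paper):
```python
import torch
import torch.nn as nn

model_dim, codebook_dim, vocab_size = 1024, 32, 4096
project_down = nn.Linear(model_dim, codebook_dim)  # into the low-dimensional bottleneck
codebook = nn.Embedding(vocab_size, codebook_dim)  # codes live in the small space
project_up = nn.Linear(codebook_dim, model_dim)    # back out to the model's width

x = torch.randn(1, model_dim)
z = project_down(x)
token = torch.cdist(z, codebook.weight, p=2).argmin(dim=-1)
x_hat = project_up(codebook(token))
```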
Another approach, seen in the Descript paper[^1], is to normalise both the input vectors and the codebook embeddings, reducing the distance calculation to a cosine similarity.
```python
import torch.nn.functional as F

distances = -(F.normalize(vector, dim=1, p=2) @ F.normalize(codebook.weight, dim=1, p=2).T)
token = distances.argmin(dim=-1)
```
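The snippets below use a small `Quantizer` wrapper with `encode`/`decode` methods. It isn’t defined in any particular paper; here’s a minimal sketch pulling the pieces above together (Euclidean distance, operating on a single 1-D vector):
```python
import torch
import torch.nn as nn

class Quantizer(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        self.codebook = nn.Embedding(vocab_size, embedding_dim)

    def encode(self, vector: torch.Tensor) -> torch.Tensor:
        # nearest codebook entry by Euclidean distance (assumes a single 1-D vector)
        distances = torch.cdist(vector.unsqueeze(0), self.codebook.weight, p=2)
        return distances.argmin(dim=-1).squeeze(0)

    def decode(self, token) -> torch.Tensor:
        # accepts either a python int or a tensor token id
        return self.codebook(torch.as_tensor(token))
```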
## Residual Vector Quantization
To further the expressivity of the learnt tokens, “RVQ” employs a layered stack of quantizers.
```python
quantizers = [Quantizer(vocab_size, embedding_dim) for _ in range(n_quantizers)]
```
The idea here is for each layer to attempt to make up for the deficits in the previous layer’s representation.
```python
vector = torch.randn(embedding_dim)
tokens = []
for quantizer in quantizers:
    token = quantizer.encode(vector)
    tokens.append(token)
    token_vector = quantizer.decode(token)
    vector = vector - token_vector  # carry the residual on to the next layer
```
When we compute `token_vector`, the codebook vector closest to our input, there will inherently be some quantisation error left over, which the subsequent layers attempt to make up for. Decoding simply sums the contribution of each layer:
```python
tokens = [5, 13, 66, 3]
vector = torch.zeros(embedding_dim)
for token, quantizer in zip(tokens, quantizers):
    vector += quantizer.decode(token)
```
## Group Vector Quantization
Another strategy, employed by Group Vector Quantization, is to break the input vector up into groups and encode each group separately.
```python
quantizers = [Quantizer(vocab_size, embedding_dim // 2), Quantizer(vocab_size, embedding_dim // 2)]
vector_parts = vector.chunk(2)  # (vector[:embedding_dim // 2], vector[embedding_dim // 2:])
tokens = []
for vector_part, quantizer in zip(vector_parts, quantizers):
    tokens.append(quantizer.encode(vector_part))
```
This chunking strategy gives a larger implicit combined codebook, increasing expressivity, *and* reduces the size of each vector to be encoded, which improves stability, as we discussed earlier.
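Decoding just reverses the split: decode each token with its own quantizer and concatenate the parts back together (a sketch reusing the same assumed `Quantizer` interface):
```python
decoded_parts = [quantizer.decode(token) for quantizer, token in zip(quantizers, tokens)]
vector = torch.cat(decoded_parts, dim=-1)  # back to the full embedding_dim
```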
## Product Quantization
Product Quantization also uses multiple codebooks. In essence, it takes the vocabularies of multiple smaller codebooks and combinatorially multiplies them together, creating one giant combined codebook.
Any splitting strategy could be used, such as the grouping or residual schemes above. Once combined, the shared codebook has size `prod(vocab_sizes)`.
In practice this is generally done with very small codebooks, e.g. 4 tiny codebooks of size `8` yield a combined codebook of size `8 ** 4 = 4096`, more in line with the sizes of normal vector quantization. Optimising these tiny codebooks rather than one large combined codebook is potentially more stable.
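To make the combination concrete, the per-codebook tokens can be packed into a single combined id with mixed-radix arithmetic; a small sketch (the specific numbers are just for illustration):
```python
vocab_sizes = [8, 8, 8, 8]  # four tiny codebooks
tokens = [3, 7, 0, 5]       # one token from each codebook

combined = 0
for token, size in zip(tokens, vocab_sizes):
    combined = combined * size + token  # mixed-radix packing
# combined is a single id in range(8 ** 4) == range(4096)
```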
While the multi-codebook strategies above do *implicitly* define a codebook of size `vocab_size ** n_codebooks`, the codebooks are not combined: each is modelled individually in downstream tasks, since a language model with `1024 ** 8 == 2 ** 80` tokens would be infeasible, even if the effective bitrate is identical.
## Lookup-Free Quantization
LFQ takes a novel approach which, while technically quite different from normal Vector Quantization, can easily be explained in the same terms we’ve already covered.
LFQ operates over a vector of size `1` (a scalar), and has a vocab size of only two tokens.
```python
lfq_part = nn.Embedding(num_embeddings=2, embedding_dim=1)
```
Like normal, when encoding we find the token closest to our input vector, and when decoding perform the inverse of this.
Unlike normal Vector Quantization, its codebook vectors are always fixed rather than learnt.
```python
lfq_part.weight.data = torch.tensor([[-1.0], [1.0]])
```
We then employ the Group Quantization scheme from above, using say `n_codebooks = 16`. As each input vector is split down to single components, our combined input embedding size is now `embedding_dim == n_codebooks`.
```python
def lfq(embedding_dim: int = 16):
    return [Quantizer(vocab_size=2, embedding_dim=1) for _ in range(embedding_dim)]
```
Finally, we construct a combined codebook using Product Quantization, giving us `codebook_size = 2 ** n_codebooks`, for example `2 ** 16 = 65536`.
As you can see, LFQ is simply a specific arrangement of the above components; however, to simplify its implementation we can consider it differently.
While you could still perform a `cdist` against each single-component codebook, because the codebook is fixed at `[[-1.0], [1.0]]` the operation simplifies to:
```python
quantized = [-1.0 if component < 0.0 else 1.0 for component in vector]
```
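The Product Quantization step likewise reduces to packing the sign bits into a single integer; a minimal sketch (the variable names are mine):
```python
bits = (vector > 0).long()                  # one bit per component
powers = 2 ** torch.arange(vector.numel())  # binary place values
token = (bits * powers).sum()               # a single id in range(2 ** n_codebooks)
```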
## Finite Scalar Quantization
FSQ is a slight generalisation of LFQ.
Each codebook still operates only on a single component vector (i.e. a scalar).
Rather than its implicit codebooks having only two tokens, each codebook can have `n` tokens. These tokens are equally spaced; for example `n=5` gives a codebook of `[-1, -0.5, 0.0, 0.5, 1.0]`.
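For example, the equally spaced levels can be constructed with `torch.linspace`:
```python
n = 5
levels = torch.linspace(-1.0, 1.0, n)  # tensor([-1.0, -0.5, 0.0, 0.5, 1.0])
```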
Conceptually its distance function differs slightly too, in essence applying a `tanh` to the input vector before computing the distance.
It then splits a larger input vector up into components and distributes them to these individual quantizers in a Group Vector Quantization like scheme, just like LFQ, and similarly computes a combined codebook multiplicatively with Product Quantization.
In implementation, it rescales the input vector and codebook such that the full range of levels `[-1, -0.5, 0.0, 0.5, 1.0]` maps to the integers `[0, 1, 2, 3, 4]`, and is then able to use `vector.round` as its (very fast) distance function.
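Putting that together, here’s a minimal sketch of the FSQ forward pass for a single component (an illustration of the idea, not the paper’s exact implementation):
```python
import torch

def fsq_quantize(x: torch.Tensor, n_levels: int = 5) -> torch.Tensor:
    # squash into [-1, 1], rescale to [0, n_levels - 1], round to the nearest level, map back
    z = torch.tanh(x)
    scaled = (z + 1.0) / 2.0 * (n_levels - 1)
    return scaled.round() / (n_levels - 1) * 2.0 - 1.0
```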
[^1]: Kumar et al., 2023. “High-Fidelity Audio Compression with Improved RVQGAN” (the Descript Audio Codec, DAC).