Add more choices to quantization tool. Post processing after sim_anneal(). (optimizer.py/ext_quant.cpp) #712

Open · wants to merge 17 commits into base: master
Conversation

Originalimoc

The main idea behind this change is that I sometimes found same-sized GPTQ models performing better on certain tasks. I figured this might indicate some overfitting that error calculations wouldn't reveal when optimizing bit distribution across layers.

The new code mainly discourages low-bpw layers when the target bpw sits in the middle of the 2.0~8.0 range (roughly 4.0 to 6.0). The improvement on the perplexity impact test is quite large on some models:
[image: perplexity impact test results]

But the result is not consistent across different damping parameters (it's a black box, full of surprises), though it is almost always better than the original sim_anneal(). So I included multiple modes at the beginning of sim_anneal() to select from. A new parameter could be added to optimize.py so users can try each mode until they get the best result.
Also, I'm not sure whether lower perplexity on the same model is always better...
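To illustrate the idea (a rough sketch only; the helper name, the option layout, and the 4.0 bpw threshold are my assumptions here, not the exact code in this PR), a post-processing pass after sim_anneal() that discourages very low-bpw layers in the mid-range case could look like this:

# Sketch of a hypothetical post-processing pass after sim_anneal().
# solution: chosen option index per layer group
# options[g]: list of (bpw, cost_bits, expected_err) tuples for group g
def discourage_low_bpw(solution, options, target_bpw, min_bpw=4.0):
    if not (4.0 <= target_bpw <= 6.0):
        return solution  # only nudge the mid-range case
    for g, choice in enumerate(solution):
        bpw, _, _ = options[g][choice]
        if bpw >= min_bpw:
            continue
        # swap to the cheapest option at or above min_bpw for this group;
        # a real pass would also have to free bits elsewhere to keep the
        # total cost within the target budget
        candidates = [i for i, (b, c, e) in enumerate(options[g]) if b >= min_bpw]
        if candidates:
            solution[g] = min(candidates, key=lambda i: options[g][i][1])
    return solution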

6.5bpw OG:
 -- sum(log(err)): -852.326775
 -- max(err): 0.003952
calibration perplexity (quant): 8.0247
5.43bpw:
v1:
 -- sum(log(err)): -833.710759
 -- max(err): 0.005545
calibration perplexity (quant): 8.0294
v2:
 -- sum(log(err)): -865.617083
 -- max(err): 0.006786
calibration perplexity (quant): DNF
 -- sum(log(err)): -866.199803
 -- max(err): 0.005706
 -- sum(log(err)): -840.236110
 -- max(err): 0.005603
 -- sum(log(err)): -839.939039
 -- max(err): 0.005954
+1: try to avoid <4 bpw layer(rng)
 -- sum(log(err)): -840.398832
 -- max(err): 0.006020
+1: try to avoid <4 bpw layer(rng)
 -- sum(log(err)): -840.932717
 -- max(err): 0.005954
+1: try to avoid <4 bpw layer(rng)
72B:
 -- sum(log(err)): -884.396164
 -- max(err): 0.017426
vs OGB:
 -- sum(log(err)): -842.360744
 -- max(err): 0.018692
72B.E:
 -- sum(log(err)): -880.320887
 -- max(err): 0.019905
calibration perplexity (quant): 11.1816
72B.E:
 -- sum(log(err)): -878.606786
 -- max(err): 0.018033
 -- sum(log(err)): -877.778402
 -- max(err): 0.017617
 -- sum(log(err)): -878.246132
 -- max(err): 0.017426
remove layer position
parameterize on a scale of higher min(bpw) or lower exp. error
@turboderp (Member)

I wouldn't really trust the perplexity test done during quantization since it uses a very small dataset, which ends up (the way the default calibration data is constructed and then truncated for the test) being about 64k tokens of Wikipedia articles. The inclusion of multilingual and random data in the calibration dataset does a lot to regularize the model and prevent errors that otherwise only show up on Chinese text, or on long contexts, or some other situation that the quantizer wouldn't directly optimize for, but this isn't reflected in that number. It's really more of just a sanity check than a benchmark.

This regularization is also one of the reasons, at least as far as I can tell, that GPTQ models sometimes do better than EXL2 models of an equivalent bitrate, because they're calibrated to exactly the wikitext test set, with a low damping factor, which ends up becoming essentially a per-linear-layer finetuning pass on the test data. Sometimes it even scores better than the base model on WikiText-2, which of course is a red flag. So definitely be careful with perplexity as a metric.

That said, the spreadsheet you provided shows an improvement on sum(log(err)) as well, which is what the optimizer is explicitly optimizing for. So that is actually very encouraging.

To elaborate, the error metric is the relative Frobenius (Euclidean) norm of the difference between the hidden state after the original vs. the quantized layer, which is intended to be a measure of how much noise has been introduced by quantization. Now, the assumption is that this noise is multiplicative, and I'm not entirely sure about that, but it seems to be roughly correct based on my testing. So the objective of the optimizer becomes to minimize the product of expected errors (or the sum of their logs, for numerical stability reasons).
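As a concrete restatement of that metric (a minimal sketch, not the quantizer's actual code): the per-layer error is the relative Frobenius norm of the hidden-state difference, and the optimizer's objective is the sum of the logs of those errors:

import math
import torch

def relative_frobenius_error(h_orig: torch.Tensor, h_quant: torch.Tensor) -> float:
    # ||H_quant - H_orig||_F / ||H_orig||_F: relative noise added by quantizing one layer
    return ((h_quant - h_orig).norm() / h_orig.norm()).item()

def objective(errors):
    # Under a multiplicative noise model, minimizing the product of expected errors
    # is the same as minimizing the sum of their logs, which is numerically safer.
    return sum(math.log(e) for e in errors)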

But even distilled down to that, it's still an NP-hard multiple-choice knapsack problem. For the large number of groups and options in say a 72B model, with very large integer costs and floating-point values, I don't think there exists an exact solution that could be run in any reasonable amount of time. So I settled on simulated annealing which reliably finds a "good" solution but probably never an optimal one.
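For readers unfamiliar with the setup, here is a self-contained sketch of simulated annealing over a multiple-choice knapsack (pick one bitrate option per layer group, keep total bit cost under the budget, minimize the summed log error); it mirrors the idea rather than the actual optimizer.py implementation:

import math
import random

def sim_anneal_sketch(options, budget, steps=100_000, t0=1.0, t1=1e-3):
    # options[g]: list of (cost_bits, log_err) pairs for group g.
    # Pick one option per group, keep total cost <= budget, minimize sum of log errors.
    n = len(options)
    # start from the cheapest option per group (assumed to fit the budget)
    choice = [min(range(len(options[g])), key=lambda i: options[g][i][0]) for g in range(n)]
    cost = sum(options[g][choice[g]][0] for g in range(n))
    score = sum(options[g][choice[g]][1] for g in range(n))

    for step in range(steps):
        t = t0 * (t1 / t0) ** (step / steps)   # geometric cooling schedule
        g = random.randrange(n)
        i = random.randrange(len(options[g]))
        d_cost = options[g][i][0] - options[g][choice[g]][0]
        d_score = options[g][i][1] - options[g][choice[g]][1]
        if cost + d_cost > budget:
            continue                            # would bust the bit budget
        # accept improvements always, worse moves with Boltzmann probability
        if d_score <= 0 or random.random() < math.exp(-d_score / t):
            choice[g], cost, score = i, cost + d_cost, score + d_score
    return choice, cost, score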

All that is to say, lower perplexity alone isn't a solid indication that you're finding better quantization strategies, but the lower values of sum(log(err)) definitely are, based on the existing assumptions. And the fact that it correlates with better perplexity validates those assumptions a little bit.

I'm a little preoccupied exploring whole new quantization schemes for the next version, which may or may not use this kind of optimization. But just based on your numbers this is still worth including for the current format, I think. I'll have some time to look it over over the weekend. In the meantime, you could try the model_diff.py to map out the hidden state error layer by layer and test top-K agreement and KL divergence with the original model.

python model_diff.py -ma /path/to/original_model -mb /path/to/quantized_model -ed /path/to/wikitext-test2.parquet
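For reference, this is roughly the kind of comparison model_diff.py reports; the sketch below (function names and tensor shapes are my own, not the script's API) computes top-K agreement and mean KL divergence from two sets of logits:

import torch
import torch.nn.functional as F

def topk_agreement(logits_a: torch.Tensor, logits_b: torch.Tensor, k: int = 5) -> float:
    # Fraction of positions where model B's top-1 token appears in model A's top-k.
    top_a = logits_a.topk(k, dim=-1).indices            # (tokens, k)
    top_b = logits_b.argmax(dim=-1, keepdim=True)       # (tokens, 1)
    return (top_a == top_b).any(dim=-1).float().mean().item()

def mean_kl_divergence(logits_a: torch.Tensor, logits_b: torch.Tensor) -> float:
    # KL(P_A || P_B) averaged over token positions, computed in log space.
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()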

@Originalimoc (Author)

1/ The rate of error change vs. bpw is higher on self_attn. I wrote an optimizer solely for minimizing sum(log(err)); it quickly decides to give 8.0 bpw to all self_attn layers and 2.0 bpw to all mlp layers, and indeed the resulting sum(log(err)) is very low.
2/

This regularization is also one of the reasons, at least as far as I can tell, that GPTQ models sometimes do better than EXL2 models of an equivalent bitrate, because they're calibrated to exactly the wikitext test set, with a low damping factor, which ends up becoming essentially a per-linear-layer finetuning pass on the test data. Sometimes it even scores better than the base model on WikiText-2, which of course is a red flag. So definitely be careful with perplexity as a metric.

What I noticed was not the perplexity of a GPTQ model; I actually never measured one. I just used one of the same size, and it consistently gave better results on some of the same questions.
3/

Now, the assumption is that this noise is multiplicative, and I'm not entirely sure about that, but it seems to be roughly correct based on my testing.

Does this mean deeper layers can somewhat "correct" the model's own errors? And is this the reason the bit distribution always looks like this (bigger bpw towards the end layers)? But SHOULD it? (Version 3-5-5 tried to reverse this bias and it's quite good on some models (see the xlsx below); 3-5-6 removed that portion of the code because I'm not sure whether later layers really contain more information or not.)
[image: per-layer bit distribution]

4/ More testing, ranked by a weighted combination of abs(perplexity difference) and sum(log(err)); green marks my chosen version of each model:
[image: weighted ranking of candidate versions]
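A minimal sketch of that kind of weighted ranking (the weights and field names here are illustrative assumptions, not anything from this PR):

def rank_versions(candidates, w_ppl=1.0, w_err=1.0):
    # candidates: list of dicts with "name", "ppl_diff" (quantized minus original
    # perplexity) and "sum_log_err"; lower is better for both, so sort ascending
    # by the weighted combination.
    return sorted(candidates,
                  key=lambda c: w_ppl * abs(c["ppl_diff"]) + w_err * c["sum_log_err"])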
