Add more choices to quantization tool. Post processing after sim_anneal(). (optimizer.py/ext_quant.cpp) #712

Open · wants to merge 17 commits into base: master
Conversation

Originalimoc

The main idea behind this change is that I sometimes found same-sized GPTQ models performing better on certain tasks. I figured this might indicate some overfitting that error calculations wouldn't reveal when optimizing bit distribution across layers.

The new code mainly discourages low-bpw layers when the target bpw sits in the middle of the 2.0~8.0 range (roughly 4.0 to 6.0). The improvement on the perplexity impact test is quite large on some models:
[image: perplexity impact test results]

But the result is not consistent across different damping parameters (it's a black box, full of surprises), though it is almost always better than the original sim_anneal(). So I included multiple modes at the beginning of sim_anneal() to select from. A new parameter could be added to optimize.py so users can try each mode until they get the best result.
Also, I'm not sure whether lower perplexity on the same model is always better...
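To illustrate the idea (a rough sketch only; the helper name, the option layout, and the 4.0 bpw threshold are my assumptions here, not the exact code in this PR), a post-processing pass after sim_anneal() that discourages very low-bpw layers in the mid-range case could look like this:

# Sketch of a hypothetical post-processing pass after sim_anneal().
# solution: chosen option index per layer group
# options[g]: list of (bpw, cost_bits, expected_err) tuples for group g
def discourage_low_bpw(solution, options, target_bpw, min_bpw=4.0):
    if not (4.0 <= target_bpw <= 6.0):
        return solution  # only nudge the mid-range case
    for g, choice in enumerate(solution):
        bpw, _, _ = options[g][choice]
        if bpw >= min_bpw:
            continue
        # swap to the cheapest option at or above min_bpw for this group;
        # a real pass would also have to free bits elsewhere to keep the
        # total cost within the target budget
        candidates = [i for i, (b, c, e) in enumerate(options[g]) if b >= min_bpw]
        if candidates:
            solution[g] = min(candidates, key=lambda i: options[g][i][1])
    return solution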

6.5bpw OG:
 -- sum(log(err)): -852.326775
 -- max(err): 0.003952
calibration perplexity (quant): 8.0247
5.43bpw:
v1:
 -- sum(log(err)): -833.710759
 -- max(err): 0.005545
calibration perplexity (quant): 8.0294
v2:
 -- sum(log(err)): -865.617083
 -- max(err): 0.006786
calibration perplexity (quant): DNF
 -- sum(log(err)): -866.199803
 -- max(err): 0.005706
 -- sum(log(err)): -840.236110
 -- max(err): 0.005603
 -- sum(log(err)): -839.939039
 -- max(err): 0.005954
+1: try to avoid <4 bpw layer(rng)
 -- sum(log(err)): -840.398832
 -- max(err): 0.006020
+1: try to avoid <4 bpw layer(rng)
 -- sum(log(err)): -840.932717
 -- max(err): 0.005954
+1: try to avoid <4 bpw layer(rng)
72B:
 -- sum(log(err)): -884.396164
 -- max(err): 0.017426
vs OGB:
 -- sum(log(err)): -842.360744
 -- max(err): 0.018692
72B.E:
 -- sum(log(err)): -880.320887
 -- max(err): 0.019905
calibration perplexity (quant): 11.1816
72B.E:
 -- sum(log(err)): -878.606786
 -- max(err): 0.018033
 -- sum(log(err)): -877.778402
 -- max(err): 0.017617
 -- sum(log(err)): -878.246132
 -- max(err): 0.017426
remove layer position
parameterize on a scale of higher min(bpw) or lower exp. error
@turboderp (Member)

I wouldn't really trust the perplexity test done during quantization since it uses a very small dataset, which ends up (the way the default calibration data is constructed and then truncated for the test) being about 64k tokens of Wikipedia articles. The inclusion of multilingual and random data in the calibration dataset does a lot to regularize the model and prevent errors that otherwise only show up on Chinese text, or on long contexts, or some other situation that the quantizer wouldn't directly optimize for, but this isn't reflected in that number. It's really more of just a sanity check than a benchmark.

This regularization is also one of the reasons, at least as far as I can tell, that GPTQ models sometimes do better than EXL2 models of an equivalent bitrate, because they're calibrated to exactly the wikitext test set, with a low damping factor, which ends up becoming essentially a per-linear-layer finetuning pass on the test data. Sometimes it even scores better than the base model on WikiText-2, which of course is a red flag. So definitely be careful with perplexity as a metric.

That said, the spreadsheet you provided shows an improvement on sum(log(err)) as well, which is what the optimizer is explicitly optimizing for. So that is actually very encouraging.

To elaborate, the error metric is the relative Frobenius (Euclidean) norm of the difference between the hidden state after the original vs. the quantized layer, which is intended to be a measure of how much noise has been introduced by quantization. Now, the assumption is that this noise is multiplicative, and I'm not entirely sure about that, but it seems to be roughly correct based on my testing. So the objective of the optimizer becomes to minimize the product of expected errors (or the sum of their logs, for numerical stability reasons).
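As a concrete restatement of that metric (a minimal sketch, not the quantizer's actual code): the per-layer error is the relative Frobenius norm of the hidden-state difference, and the optimizer's objective is the sum of the logs of those errors:

import math
import torch

def relative_frobenius_error(h_orig: torch.Tensor, h_quant: torch.Tensor) -> float:
    # ||H_quant - H_orig||_F / ||H_orig||_F: relative noise added by quantizing one layer
    return ((h_quant - h_orig).norm() / h_orig.norm()).item()

def objective(errors):
    # Under a multiplicative noise model, minimizing the product of expected errors
    # is the same as minimizing the sum of their logs, which is numerically safer.
    return sum(math.log(e) for e in errors)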

But even distilled down to that, it's still an NP-hard multiple-choice knapsack problem. For the large number of groups and options in say a 72B model, with very large integer costs and floating-point values, I don't think there exists an exact solution that could be run in any reasonable amount of time. So I settled on simulated annealing which reliably finds a "good" solution but probably never an optimal one.
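For readers unfamiliar with the setup, here is a self-contained sketch of simulated annealing over a multiple-choice knapsack (pick one bitrate option per layer group, keep total bit cost under the budget, minimize the summed log error); it mirrors the idea rather than the actual optimizer.py implementation:

import math
import random

def sim_anneal_sketch(options, budget, steps=100_000, t0=1.0, t1=1e-3):
    # options[g]: list of (cost_bits, log_err) pairs for group g.
    # Pick one option per group, keep total cost <= budget, minimize sum of log errors.
    n = len(options)
    # start from the cheapest option per group (assumed to fit the budget)
    choice = [min(range(len(options[g])), key=lambda i: options[g][i][0]) for g in range(n)]
    cost = sum(options[g][choice[g]][0] for g in range(n))
    score = sum(options[g][choice[g]][1] for g in range(n))

    for step in range(steps):
        t = t0 * (t1 / t0) ** (step / steps)   # geometric cooling schedule
        g = random.randrange(n)
        i = random.randrange(len(options[g]))
        d_cost = options[g][i][0] - options[g][choice[g]][0]
        d_score = options[g][i][1] - options[g][choice[g]][1]
        if cost + d_cost > budget:
            continue                            # would bust the bit budget
        # accept improvements always, worse moves with Boltzmann probability
        if d_score <= 0 or random.random() < math.exp(-d_score / t):
            choice[g], cost, score = i, cost + d_cost, score + d_score
    return choice, cost, score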

All that is to say, lower perplexity alone isn't a solid indication that you're finding better quantization strategies, but the lower values of sum(log(err)) definitely are, based on the existing assumptions. And the fact that it correlates with better perplexity validates those assumptions a little bit.

I'm a little preoccupied exploring whole new quantization schemes for the next version, which may or may not use this kind of optimization. But just based on your numbers this is still worth including for the current format, I think. I'll have some time to look it over over the weekend. In the meantime, you could try the model_diff.py to map out the hidden state error layer by layer and test top-K agreement and KL divergence with the original model.

python model_diff.py -ma /path/to/original_model -mb /path/to/quantized_model -ed /path/to/wikitext-test2.parquet
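For reference, this is roughly the kind of comparison model_diff.py reports; the sketch below (function names and tensor shapes are my own, not the script's API) computes top-K agreement and mean KL divergence from two sets of logits:

import torch
import torch.nn.functional as F

def topk_agreement(logits_a: torch.Tensor, logits_b: torch.Tensor, k: int = 5) -> float:
    # Fraction of positions where model B's top-1 token appears in model A's top-k.
    top_a = logits_a.topk(k, dim=-1).indices            # (tokens, k)
    top_b = logits_b.argmax(dim=-1, keepdim=True)       # (tokens, 1)
    return (top_a == top_b).any(dim=-1).float().mean().item()

def mean_kl_divergence(logits_a: torch.Tensor, logits_b: torch.Tensor) -> float:
    # KL(P_A || P_B) averaged over token positions, computed in log space.
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()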

@Originalimoc (Author)

1/ The rate of error change vs. bpw is higher on self_attn. I wrote an optimizer solely for minimizing sum(log(err)); it quickly decides to give 8.0 bpw to all self_attn layers and 2.0 bpw to all mlp layers, and indeed the resulting sum(log(err)) is very low.
2/

This regularization is also one of the reasons, at least as far as I can tell, that GPTQ models sometimes do better than EXL2 models of an equivalent bitrate, because they're calibrated to exactly the wikitext test set, with a low damping factor, which ends up becoming essentially a per-linear-layer finetuning pass on the test data. Sometimes it even scores better than the base model on WikiText-2, which of course is a red flag. So definitely be careful with perplexity as a metric.

What I noticed was not the perplexity of a GPTQ model; I actually never measured one. I just used one of the same size, and it consistently gave better results on some of the same questions.
3/

Now, the assumption is that this noise is multiplicative, and I'm not entirely sure about that, but it seems to be roughly correct based on my testing.

Does this mean deeper layers can somewhat "correct" the model's own errors? And is this the reason the bit distribution always looks like this (bigger bpw towards the end layers)? But SHOULD it? (Version 3-5-5 tried to reverse this bias and it's quite good on some models (see the xlsx below); 3-5-6 removed that portion of the code because I'm not sure whether later layers really contain more information or not.)
[image: per-layer bit distribution]

4/ More testing, ranked by a weighted combination of abs(perplexity difference) and sum(log(err)); green marks my chosen version of each model:
[image: weighted ranking of candidate versions]
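A minimal sketch of that kind of weighted ranking (the weights and field names here are illustrative assumptions, not anything from this PR):

def rank_versions(candidates, w_ppl=1.0, w_err=1.0):
    # candidates: list of dicts with "name", "ppl_diff" (quantized minus original
    # perplexity) and "sum_log_err"; lower is better for both, so sort ascending
    # by the weighted combination.
    return sorted(candidates,
                  key=lambda c: w_ppl * abs(c["ppl_diff"]) + w_err * c["sum_log_err"])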
