Add more choices to quantization tool. Post processing after sim_anneal(). (optimizer.py/ext_quant.cpp) #712
Conversation
6.5bpw OG:
-- sum(log(err)): -852.326775 -- max(err): 0.003952
calibration perplexity (quant): 8.0247
5.43bpw v1:
-- sum(log(err)): -833.710759 -- max(err): 0.005545
calibration perplexity (quant): 8.0294
5.43bpw v2:
-- sum(log(err)): -865.617083 -- max(err): 0.006786
calibration perplexity (quant): DNF
-- sum(log(err)): -866.199803 -- max(err): 0.005706
-- sum(log(err)): -840.236110 -- max(err): 0.005603
-- sum(log(err)): -839.939039 -- max(err): 0.005954 +1: try to avoid <4 bpw layer(rng)
-- sum(log(err)): -840.398832 -- max(err): 0.006020 +1: try to avoid <4 bpw layer(rng)
-- sum(log(err)): -840.932717 -- max(err): 0.005954 +1: try to avoid <4 bpw layer(rng)
72B: -- sum(log(err)): -884.396164 -- max(err): 0.017426 vs OGB: -- sum(log(err)): -842.360744 -- max(err): 0.018692
72B.E: -- sum(log(err)): -880.320887 -- max(err): 0.019905 calibration perplexity (quant): 11.1816
72B.E: -- sum(log(err)): -878.606786 -- max(err): 0.018033
-- sum(log(err)): -877.778402 -- max(err): 0.017617
-- sum(log(err)): -878.246132 -- max(err): 0.017426
remove layer position
parameterize on a scale of higher min(bpw) or lower exp. error
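A hedged sketch of what the "parameterize on a scale of higher min(bpw) or lower exp. error" item could look like in practice: a single knob that blends the usual sum(log(err)) objective with a soft penalty on layers that fall below a minimum bpw. The function name, the knob, and the penalty shape are illustrative assumptions, not the code in this PR.

```python
import math

def blended_cost(errors: list[float],   # expected error per layer for a candidate allocation
                 bpws: list[float],     # resulting bits-per-weight per layer
                 scale: float,          # 0.0 = pure sum(log(err)), larger = favor higher min(bpw)
                 min_bpw: float = 4.0) -> float:
    # Standard objective: sum of log expected errors (lower is better).
    log_err = sum(math.log(e) for e in errors)
    # Soft penalty: how far each layer falls below the minimum bpw.
    penalty = sum(max(0.0, min_bpw - b) for b in bpws)
    return log_err + scale * penalty
```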
I wouldn't really trust the perplexity test done during quantization since it uses a very small dataset, which ends up (the way the default calibration data is constructed and then truncated for the test) being about 64k tokens of Wikipedia articles. The inclusion of multilingual and random data in the calibration dataset does a lot to regularize the model and prevent errors that otherwise only show up on Chinese text, or on long contexts, or some other situation that the quantizer wouldn't directly optimize for, but this isn't reflected in that number. It's really more of just a sanity check than a benchmark.

This regularization is also one of the reasons, at least as far as I can tell, that GPTQ models sometimes do better than EXL2 models of an equivalent bitrate: they're calibrated to exactly the wikitext test set, with a low damping factor, which ends up becoming essentially a per-linear-layer finetuning pass on the test data. Sometimes a GPTQ model even scores better than the base model on WikiText-2, which of course is a red flag. So definitely be careful with perplexity as a metric.

That said, the spreadsheet you provided shows an improvement in sum(log(err)) as well, which is what the optimizer is explicitly optimizing for. So that is actually very encouraging.

To elaborate, the error metric is the relative Frobenius (Euclidean) norm of the difference between the hidden state after the original vs. the quantized layer, which is intended to be a measure of how much noise has been introduced by quantization. Now, the assumption is that this noise is multiplicative, and I'm not entirely sure about that, but it seems to be roughly correct based on my testing. So the objective of the optimizer becomes to minimize the product of expected errors (or the sum of their logs, for numerical stability). But even distilled down to that, it's still an NP-hard multiple-choice knapsack problem. For the large number of groups and options in, say, a 72B model, with very large integer costs and floating-point values, I don't think there exists an exact solution that could run in any reasonable amount of time. So I settled on simulated annealing, which reliably finds a "good" solution but probably never an optimal one.

All that is to say, lower perplexity alone isn't a solid indication that you're finding better quantization strategies, but the lower values of sum(log(err)) definitely are, based on the existing assumptions. And the fact that it correlates with better perplexity validates those assumptions a little bit.

I'm a little preoccupied exploring whole new quantization schemes for the next version, which may or may not use this kind of optimization. But just based on your numbers, this is still worth including for the current format, I think. I'll have some time to look it over over the weekend. In the meantime, you could try model_diff.py to map out the hidden-state error layer by layer and test top-K agreement and KL divergence with the original model:

`python model_diff.py -ma /path/to/original_model -mb /path/to/quantized_model -ed /path/to/wikitext-test2.parquet`
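For readers less familiar with the internals, here is a minimal, self-contained sketch of the two ideas described above: the relative-Frobenius-norm error metric, and a sim_anneal-style search that minimizes sum(log(err)) subject to a total bit budget. The names (`layer_error`, `anneal_bit_allocation`) are hypothetical, and the cooling schedule and constraint handling are simplified compared to the real optimizer in the repo.

```python
import math
import random
import torch

def layer_error(h_orig: torch.Tensor, h_quant: torch.Tensor) -> float:
    # Relative Frobenius norm of the difference between the hidden state
    # produced by the original layer and by the quantized layer.
    return (torch.linalg.norm(h_quant - h_orig) / torch.linalg.norm(h_orig)).item()

# Each layer has several quantization options: (total bits, expected error).
Option = tuple[int, float]

def anneal_bit_allocation(options: list[list[Option]],
                          max_total_bits: int,
                          steps: int = 100_000,
                          t0: float = 1.0,
                          t1: float = 1e-3) -> tuple[list[int], float]:
    # Start from the cheapest option per layer (assumed to fit the budget).
    choice = [min(range(len(opts)), key=lambda i: opts[i][0]) for opts in options]
    total_bits = sum(options[l][c][0] for l, c in enumerate(choice))
    # Objective: sum of log(err) over layers (log of the multiplicative noise model).
    cost = sum(math.log(options[l][c][1]) for l, c in enumerate(choice))

    for step in range(steps):
        temp = t0 * (t1 / t0) ** (step / steps)        # geometric cooling schedule
        l = random.randrange(len(options))             # pick a random layer
        new_c = random.randrange(len(options[l]))      # and a random option for it
        old_c = choice[l]
        new_bits = total_bits - options[l][old_c][0] + options[l][new_c][0]
        if new_bits > max_total_bits:
            continue                                   # keep the knapsack constraint
        delta = math.log(options[l][new_c][1]) - math.log(options[l][old_c][1])
        # Accept improvements always, regressions with Boltzmann probability.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            choice[l], total_bits, cost = new_c, new_bits, cost + delta

    return choice, cost
```

Moves that would exceed the bit budget are simply rejected here, which keeps the sketch short; the real optimizer can afford more careful constraint handling.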
The main idea behind this change is that I sometimes found same-sized GPTQ models performing better on certain tasks. I figured this might indicate some overfitting that error calculations wouldn't reveal when optimizing bit distribution across layers.
The new code mainly discourages low-bpw layers when the target bpw sits in the middle of the 2.0~8.0 range, roughly 4.0 to 6.0. The improvement in the perplexity test results is quite large on some models.
The results are not consistent across different damping parameters (it's a black box, so no surprise), but they are almost always better than the original sim_anneal(). So I included multiple modes at the beginning of sim_anneal() for selection. A new parameter could be added to optimize.py so users can try each mode until they get the best result.
Also, I'm not sure whether lower perplexity on the same model is always better, though...
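To make the proposal concrete, here is a hedged sketch of one way the post-processing step described above could work: after sim_anneal() returns a per-layer allocation, bump layers that landed below ~4 bpw up to the cheapest option that clears that floor, but only when the overall target bpw is in the 4.0 to 6.0 range and the total bit budget still holds. The function name, thresholds, and data layout are illustrative assumptions, not the PR's actual code.

```python
def discourage_low_bpw(choice: list[int],
                       options: list[list[tuple[int, float]]],  # per layer: (total bits, expected error)
                       weights: list[int],                      # parameter count per layer
                       target_bpw: float,
                       max_total_bits: int,
                       min_bpw: float = 4.0) -> list[int]:
    # Only intervene when the overall target sits in the middle of the 2.0-8.0 range.
    if not (4.0 <= target_bpw <= 6.0):
        return choice
    total_bits = sum(options[l][c][0] for l, c in enumerate(choice))
    # Visit layers from lowest to highest effective bpw.
    order = sorted(range(len(choice)),
                   key=lambda l: options[l][choice[l]][0] / weights[l])
    for l in order:
        if options[l][choice[l]][0] / weights[l] >= min_bpw:
            break                                      # remaining layers already clear the floor
        # Cheapest option for this layer that reaches min_bpw.
        candidates = [i for i, (bits, _) in enumerate(options[l])
                      if bits / weights[l] >= min_bpw]
        if not candidates:
            continue
        best = min(candidates, key=lambda i: options[l][i][0])
        new_total = total_bits - options[l][choice[l]][0] + options[l][best][0]
        if new_total <= max_total_bits:                # only if the bit budget still holds
            choice[l], total_bits = best, new_total
    return choice
```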