handle partially quantized models #76

Merged · 2 commits · May 28, 2024

Conversation

davidkoski (Collaborator) commented May 20, 2024:

Note: this isn't directly usable yet, as it requires ml-explore/mlx-swift#73 (I was using a local checkout for development). This will need an update to the mlx-swift version after #73 is merged.

- fixes for #53, #71, #69, #74
- in order to test the models:
	- I added a default prompt of an appropriate form
	- while working on the model configurations, I also added additional stop tokens (#74)
- fixed the repetitionPenalty code (#71)
davidkoski requested a review from awni on May 20, 2024 at 23:44
@@ -125,6 +125,8 @@ struct ContentView: View {

}
.task {
+ self.prompt = llm.modelConfiguration.defaultPrompt
davidkoski (Collaborator, Author):
Use the new default prompt (in case the model changes)

@@ -12,7 +12,7 @@ private func topPSampling(logits: MLXArray, topP: Float, temp: Float) -> MLXArra
logits = logits.asType(.float32)
}

- let probs = softMax(logits / temp, axis: -1)
+ let probs = softmax(logits / temp, axis: -1)
davidkoski (Collaborator, Author):
deprecation warning fix

if repetitionContext.shape[0] > parameters.repetitionContextSize {
- repetitionContext = repetitionContext[1...]
+ repetitionContext = repetitionContext[(-parameters.repetitionContextSize)...]
davidkoski (Collaborator, Author):
Fix for #71 -- this is a direct port of the Python code. Previously it looks like there were workarounds for the lack of full numpy-style array indexing; see take() above.
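For illustration, a minimal sketch (helper name assumed) of the fixed behavior, using the same negative-index subscript as the diff:

    import MLX

    // Keep only the most recent `contextSize` tokens, mirroring the Python
    // `context[-context_size:]` slice rather than dropping one element at a time.
    func trimRepetitionContext(_ context: MLXArray, to contextSize: Int) -> MLXArray {
        guard context.shape[0] > contextSize else { return context }
        return context[(-contextSize)...]
    }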

}
.map {
$0.last!
})
davidkoski (Collaborator, Author):
Fix for #74

@@ -12,4 +12,15 @@ public protocol LLMModel: Module {
func callAsFunction(_ inputs: MLXArray, cache: [(MLXArray, MLXArray)]?) -> (
MLXArray, [(MLXArray, MLXArray)]
)

+ /// Optionally preprocess the weights and modify / remove values as needed.
+ func sanitize(weights: [String: MLXArray]) -> [String: MLXArray]
davidkoski (Collaborator, Author):
To match the mlx_lm implementation
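A sketch of how this can stay non-invasive, assuming a protocol extension supplies a pass-through default so only models that need preprocessing override it:

    import MLX

    // Pass-through default: existing models compile unchanged; only models that
    // need to rename or drop checkpoint keys override sanitize(weights:).
    extension LLMModel {
        public func sanitize(weights: [String: MLXArray]) -> [String: MLXArray] {
            weights
        }
    }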

- quantizeIfNeeded(model: model, weights: weights, quantization: quantization)
+ quantize(model: model, groupSize: quantization.groupSize, bits: quantization.bits) {
+     path, module in
+     weights["\(path).scales"] != nil
davidkoski (Collaborator, Author):
This is the fix for loading partially quantized models -- if a layer has companion "scales" arrays then it is quantized. We don't need to test the type of the layer because the quantize() method will only convert layers that can be converted.
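Put another way (a sketch with hypothetical weight keys): a layer counts as quantized exactly when the checkpoint ships a companion "scales" array for its path.

    import MLX

    // The predicate from the diff, factored out for illustration.
    func isQuantizedLayer(path: String, weights: [String: MLXArray]) -> Bool {
        weights["\(path).scales"] != nil
    }

    // e.g. "model.layers.0.self_attn.q_proj" -> true when
    //      "model.layers.0.self_attn.q_proj.scales" exists in the checkpoint,
    //      while a plain norm layer without scales is left un-quantized.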

davidkoski (Collaborator, Author):
Note that this depends on ml-explore/mlx-swift#73


- private func quantizeIfNeeded(
-     model: LLMModel, weights: [String: MLXArray], quantization: BaseConfiguration.Quantization
- ) {
davidkoski (Collaborator, Author):
No longer needed.

Member:
Nice simplification!


public static let codeLlama13b4bit = ModelConfiguration(
id: "mlx-community/CodeLlama-13b-Instruct-hf-4bit-MLX",
overrideTokenizer: "PreTrainedTokenizer"
overrideTokenizer: "PreTrainedTokenizer",
defaultPrompt: "func sortArray(_ array: [Int]) -> String { <FILL_ME> }"
davidkoski (Collaborator, Author):
I verified each of these models and added a defaultPrompt so I don't have to hunt around each time
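A sketch of the resulting flow (initializer labels follow the diff above; other parameters omitted): each configuration carries its prompt, and the UI seeds its text field from it.

    // Declared once per model:
    let codeLlama13b4bit = ModelConfiguration(
        id: "mlx-community/CodeLlama-13b-Instruct-hf-4bit-MLX",
        overrideTokenizer: "PreTrainedTokenizer",
        defaultPrompt: "func sortArray(_ array: [Int]) -> String { <FILL_ME> }"
    )

    // Picked up by ContentView when the model loads:
    // .task { self.prompt = llm.modelConfiguration.defaultPrompt }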

Member:
Beautiful, thanks!

@@ -111,40 +130,53 @@ extension ModelConfiguration {
"<PRE> " + prompt.replacingOccurrences(of: "<FILL_ME>", with: "<SUF>") + " <MID>"
}

public static let phi4bit = ModelConfiguration(id: "mlx-community/phi-2-hf-4bit-mlx") {
prompt in
"Instruct: \(prompt)\nOutput: "
davidkoski (Collaborator, Author):
This wasn't giving good results (any more?) and isn't used on the Python side.

@@ -249,6 +271,8 @@ public struct Qwen2Configuration: Codable {
Bool.self, forKey: Qwen2Configuration.CodingKeys.ropeTraditional) ?? false
self.ropeScaling = try container.decodeIfPresent(
[String: StringOrNumber].self, forKey: Qwen2Configuration.CodingKeys.ropeScaling)
+ self.tieWordEmbeddings =
+     try container.decodeIfPresent(Bool.self, forKey: .tieWordEmbeddings) ?? false
davidkoski (Collaborator, Author):
The Qwen2 model implementation has changed slightly.
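For context, a sketch (not the exact Qwen2 code) of what tied word embeddings change: the output projection reuses the embedding weights via Embedding.asLinear, the mlx-swift addition this PR depends on, instead of a separate lm_head.

    import MLX
    import MLXNN

    // Compute output logits either through the tied embedding or a dedicated head.
    func outputLogits(
        _ hidden: MLXArray, embedTokens: Embedding, lmHead: Linear?, tieWordEmbeddings: Bool
    ) -> MLXArray {
        if tieWordEmbeddings {
            return embedTokens.asLinear(hidden)
        }
        return lmHead!(hidden)  // a separate head is expected when embeddings are untied
    }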

davidkoski (Collaborator, Author):
Note: the CI tests are failing because this depends on the mlx-swift change that needs to merge first.

solume commented May 21, 2024:
Tested it with the mlx-swift fork; this breaks for me with Libraries/LLM/Load.swift:62:13 Cannot find 'quantize' in scope
and
Libraries/LLM/Qwen2.swift:198:37 Value of type 'Embedding' has no member 'asLinear'

davidkoski (Collaborator, Author):
> Tested it with the mlx-swift fork; this breaks for me with Libraries/LLM/Load.swift:62:13 Cannot find 'quantize' in scope and Libraries/LLM/Qwen2.swift:198:37 Value of type 'Embedding' has no member 'asLinear'

That sounds like the hookup with that other branch didn't work. See:

solume commented May 25, 2024:
Was able to fix the dependencies, but I'm getting a runtime error:
libc++abi: terminating due to uncaught exception of type std::invalid_argument: [matmul] Last dimension of first input with shape (1,42,1280) must match second to last dimension of second input with shape (160,32000).

Comment on lines 208 to 210
if configuration.tieWordEmbeddings && weights["lm_head.weight"] == nil {
weights["lm_head.weight"] = nil
}
Member:
This looks kind of odd. If it's equal to nil, why set it to nil? Is that a Swift thing?

davidkoski (Collaborator, Author):
No, you are right that is odd.

davidkoski (Collaborator, Author):
Here is the original Python:

        if self.args.tie_word_embeddings:
            weights.pop("lm_head.weight", None)

I think I can just remove the && ...
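A sketch of the corrected version, mirroring the Python pop (and answering the question above: assigning nil to a Swift dictionary key removes the entry, so the extra check was redundant):

    import MLX

    // Drop lm_head.weight from the checkpoint when embeddings are tied;
    // dict[key] = nil removes the key and is a no-op if the key is absent.
    func dropTiedHead(_ weights: [String: MLXArray], tieWordEmbeddings: Bool) -> [String: MLXArray] {
        var weights = weights
        if tieWordEmbeddings {
            weights["lm_head.weight"] = nil
        }
        return weights
    }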

/// - didGenerate: visitor for the tokens as they are generated
public func generate(
promptTokens: [Int], parameters: GenerateParameters, model: LLMModel, tokenizer: Tokenizer,
configuration: ModelConfiguration,
awni (Member) commented May 28, 2024:
I'm not certain about the way this model configuration gets passed around or what the intention for it is (the naming is a bit ambiguous). Somehow we don't have a need for it in Python, so I'm wondering why you need it here. What should go in it vs. in the tokenizer/model directly?

From what I can tell it's more like the default arguments for a given model (eos token / prompt). The prompt gets handled outside this function. So here it's just for the additional eos tokens?

davidkoski (Collaborator, Author):
Yeah, just for the additional eos tokens. I can switch that to just pass in the additional eos tokens (optional). If we need more parameters in the future, I can rethink that slightly.
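A sketch of that narrower alternative (names assumed; Tokenizer here is the swift-transformers protocol, assuming its eosTokenId and decode(tokens:) members):

    import Tokenizers

    // Stop when the token is the tokenizer's EOS id or decodes to one of the
    // optional extra EOS strings supplied by the model configuration.
    func isStopToken(_ token: Int, tokenizer: Tokenizer, extraEOSTokens: Set<String>?) -> Bool {
        if token == tokenizer.eosTokenId { return true }
        guard let extra = extraEOSTokens else { return false }
        return extra.contains(tokenizer.decode(tokens: [token]))
    }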

awni (Member) left a review:
Thanks!! LGTM
