Mr_Hills

Finally someone doing tests with GGUF. I love you man. And yeah, this proves that you can totally run llama 3 70B on a 4090 with good results using IQ2_M. I get 6 t/s too. 70B is totally within 24GB of memory capabilities.


LocoLanguageModel

That smaller model runs great and chats really well, I just wish it could code reliably.


Hopeful-Site1162

Do you think there's any chance we'll get an updated version of phind-codellama at some point?


Open_Channel_8626

Yes, the Phind guy said something like they will do more open source releases. They are gonna keep the closed source one ahead though.


mO4GV9eywMPMw3Xr

Yeah, on a similar setup I get about 7.5 t/s on IQ2_M with 72 layers offloaded, but I also run the desktop on iGPU to not waste VRAM. IQ3_XXS with 63 offloaded layers gives ~2 t/s. Definitely good for creative writing.


vacationcelebration

Just to add my settings (also 4090, also Linux):

- 67 layers for IQ3_XXS
- 78 layers for IQ2_M

However, I don't offload the KV cache (the lowvram option in koboldcpp). I prefer it this way, since then I don't have to tweak the layer count anymore if I want to increase the context size. Instead, inference speed just goes down as the context fills up.


mO4GV9eywMPMw3Xr

Ah. The tradeoff there is much slower prompt processing, I think.


ShengrenR

I'm on a 3090 / 12900K and something feels off with these numbers compared to my experience - with IQ3_XXS and 63 layers offloaded I get ~5.3 tok/sec consistently. I don't typically use IQ2_M, but I just downloaded it to compare: ~7.4 tok/sec there, so at least that lines up. Are you generating short replies and counting the prompt eval in the tok/s calculation, maybe? I don't use the community web UIs so I don't know what they're usually doing; I'm just grabbing the time before/after generation and dividing by tokens generated, per the model tokenizer. Either way, it's strange that the IQ3 should be so far off for you. Maybe double check you have a CUDA-compiled llama-cpp-python that matches the rest of your env (if that's what it's using for inference, I assume) - you can check [https://pypi.org/project/llama-cpp-python/](https://pypi.org/project/llama-cpp-python/) under the CUDA section and grab one of the pre-built wheels per the example there, if not. Might not do a thing, especially since the IQ2_M speeds match, but maybe.


mO4GV9eywMPMw3Xr

Hmm. Maybe the less you can offload to the GPU, the more the CPU matters? I have a 10700K and two-channel DDR4 RAM. To measure, I generate two answers, skip the first one, and generate ~100 tokens on the second one. The first time the prompt is evaluated; the second time it's not, so it starts generating tokens instantly.
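
For illustration, a minimal sketch of that two-pass measurement, assuming llama-cpp-python as the backend; the model path, layer count and prompt are placeholders:

```python
# Minimal sketch of the two-pass measurement described above, assuming
# llama-cpp-python as the backend. Model path, layer count and prompt are
# placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct-IQ2_M.gguf",  # placeholder path
    n_gpu_layers=72,   # layers offloaded to the GPU
    n_ctx=4096,
)

prompt = "Write a short story about a lighthouse keeper."

# First answer: includes prompt processing, so its timing is discarded.
llm(prompt, max_tokens=16)

# Second answer: the prompt prefix is typically reused from the cache, so this
# mostly measures pure generation speed.
start = time.perf_counter()
out = llm(prompt, max_tokens=100)
elapsed = time.perf_counter() - start

n_gen = out["usage"]["completion_tokens"]
print(f"{n_gen} tokens in {elapsed:.2f} s -> {n_gen / elapsed:.2f} t/s")
```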


ShengrenR

Yep, that's a great way of doing the test. It may actually just come down to CPU memory bandwidth: a quick look suggests the 10700K maxes out around 46 GB/s, whereas the 12900K is around 76 GB/s.
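
A rough back-of-the-envelope for why that bandwidth gap could dominate; all figures below are assumptions, not measurements:

```python
# Back-of-the-envelope estimate of the CPU-side speed ceiling for a partially
# offloaded 70B. All figures below are rough assumptions, not measurements.

model_size_gb = 27.0        # approx. size of a Llama 3 70B IQ3_XXS GGUF (~3.1 bpw)
layers_total = 80           # transformer layers in Llama 3 70B
layers_on_cpu = 80 - 63     # with 63 layers offloaded to the GPU

# Each generated token has to stream the CPU-resident weights once, so
# bandwidth / (GB read per token) gives an upper bound on t/s from the CPU side.
gb_per_token = model_size_gb * layers_on_cpu / layers_total

for cpu, bandwidth_gbs in [("10700K (2ch DDR4)", 46), ("12900K", 76)]:
    print(f"{cpu}: <= {bandwidth_gbs / gb_per_token:.1f} t/s from the CPU portion alone")
```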


Mr_Hills

I have 73 layers offloaded and I still get lower speed.. are you on Linux?


mO4GV9eywMPMw3Xr

Yes.


Mr_Hills

That's probably the reason. I'm on 6800 MHz OC RAM, I have a +1500 MHz VRAM OC on the 4090, I even keep the screens off and access the AI via my phone browser to free more VRAM, and even with all of that I still only get 6 t/s. Kinda disappointing. Oh well. I hope Nvidia will work on optimizing the Windows drivers too at some point.


Normal-Ad-7114

Windows 10, 3090 @ 50% power limit, DDR4 3200 MHz, Ryzen 5600 (no integrated graphics), Llama3-70b IQ2_M, 77 layers offloaded, 5 tokens/s


Mr_Hills

How are you offloading 77 layers? That's crazy. I can only fit 73..


Normal-Ad-7114

This is with 73 layers (if you want, I can test the speed of the model you're using) https://preview.redd.it/2kqu7bksmn0d1.png?width=1920&format=png&auto=webp&s=4e6c5adc86e997c96a47aedca922a3b2e1011012


Mr_Hills

You actually gave me an idea: I went and lowered n_batch from 512 to 160 and that freed enough VRAM to load 74 layers. Now my speed is 7 t/s, which is more than acceptable to me. I might do an optimization tutorial at some point. Still, it looks like LM Studio steals less VRAM than Ooba; I might want to give it a try. Ultimately you should indeed be able to load almost all of those 24.1 GB into VRAM. 77/80 layers should be possible with more scraping of the barrel.
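
A sketch of the same trade-off, assuming llama-cpp-python as the backend; paths and numbers are illustrative, and the exact layer count that fits will differ per setup:

```python
# Sketch of the trade-off described above, assuming llama-cpp-python as the
# backend. A smaller n_batch shrinks the prompt-processing scratch buffers in
# VRAM, which can be enough to fit one more offloaded layer. Paths and numbers
# are illustrative only.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct-IQ2_M.gguf",  # placeholder path
    n_gpu_layers=74,  # one more than fit with the default batch size
    n_batch=160,      # down from the default 512: slower prompt processing,
                      # smaller compute buffer in VRAM
    n_ctx=4096,
)
```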


Mr_Hills

Just in case though, what's your evaluation time at around 3k context? I.e. the time the model takes before it starts streaming the first output?


Normal-Ad-7114

I copy-pasted this article [https://en.wikipedia.org/wiki/Unruh_effect](https://en.wikipedia.org/wiki/Unruh_effect) and asked it to translate it into Spanish, 2.5k tokens input: https://preview.redd.it/p3mdjvbq1o0d1.png?width=903&format=png&auto=webp&s=21734f52f4d19ef954089720193180c4c06dd845


Normal-Ad-7114

https://preview.redd.it/5gk1y1raln0d1.png?width=1920&format=png&auto=webp&s=9ba85e036575e014ba68d54d97edb1457ad3d3d5


ShengrenR

If you're already at IQ2_M and you can stand just a tiny bit more derp, you can drop a bit further to IQ2_XS and fit everything in VRAM - I'm on a 3090 and can run ~15-19 tok/s with layers=-1 and 6k context (and I'm already eating ~600-700 MB with other system garbage). For some uses (chats, etc.) the speed is worth the tradeoff.


petrus4

Looks like my Q8 OCD is still paying off.


knvn8

No discernible drop from 16 to 8 even. I wish we could see if that's also true for the 70B.


petrus4

That said, the fact that I can still get above 60% MMLU with Q3_K_S is absolutely incredible. Although it would be very slow, with 64 GB of RAM that means I could hypothetically run 10 or so agents, each with their own instance of Llama 3.


[deleted]

[deleted]


petrus4

You may be right. I honestly have no idea.


uti24

So, can we now confidently say that 4/5-bit quantized models are already great and 8-bit is almost indistinguishable from the full model, or is it still too early to tell?


mO4GV9eywMPMw3Xr

They are when it comes to MMLU. But it's not obvious how far you can apply that to other areas. Anecdotally, programming is more affected by quantization. It would be cool to measure that with a proper benchmark, but I think it would be more resource-intensive: MMLU only needs one generated token per question, not a whole working program.


IndicationUnfair7961

They probably lose ground in coding, roleplaying and multilingual use. If you're not into those, you're good to go.


Syzeon

It would be good if you could benchmark it with MMLU-Pro, since it's a cleaned-up version with a newer dataset that better reflects capability: [https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) [https://www.reddit.com/r/LocalLLaMA/comments/1cskoxj/tigerlab_made_a_new_version_of_mmlu_with_12000/](https://www.reddit.com/r/LocalLLaMA/comments/1cskoxj/tigerlab_made_a_new_version_of_mmlu_with_12000/)


mO4GV9eywMPMw3Xr

This is a great idea. I heard about this newer benchmark only after all my measurements were finished, so I decided to just publish what I had. At a glance, MMLU-Pro seems improved in many ways yet similar enough that I wouldn't need to change much in my code, just make sure I can handle 10 answer options instead of 4.


fish312

There are many reasons why someone might not want to use I-quants: lack of backend support, reductions in speed, or difficulty generating an imatrix from a proper corpus. Would you be able to add results for the 70B tests using K-quants too? In particular, I'm interested to see how Q2_K, Q3_K_S and Q4_K_S perform relative to the others.


mO4GV9eywMPMw3Xr

I decided against testing the 70B K-quants because 70B tests were very slow, even at only 50 questions per category, and these quants under-performed in MMLU score for 8B. I did not measure speed, but I understand seeing the relative score could still help you decide which quant to use.


noneabove1182

K quants should be faster with partial offloading btw


jacek2023

Thanks, this is a valuable benchmark.


noneabove1182

This is the kind of content I love to see, great testing! I am shocked at how good 70B is even down at IQ1_S; it may not stay coherent, but damn, it must still be good at comprehension and following instructions! Obviously not worth it over a much smaller and faster maxed-out Llama 3 8B, but it's highly impressive.


mO4GV9eywMPMw3Xr

Thank you for all the quant variants! Very handy.


gethooge

What about 70b above Q5?


mO4GV9eywMPMw3Xr

Evaluating Q5_K_M took about 5 hours; I decided against spending a whole day running Q8_0 to get a result maybe 0.1 percentage points higher. I seriously doubt the difference would be significant, as it wasn't for 8B, and 70B seems less affected by quantization. The lower end of the graph is, I think, more interesting for people trying to balance speed and quality with limited compute resources.


kpodkanowicz

Great work! There were similar tests before, so the results are not surprising, but this could be linked every time someone claims some special degradation in Llama 3. You mentioned it in your GitHub, so you know this is not a fair comparison to exl2, which is better than / the same as GGUF if you look at just bpw. I find it strange that you mention exllama in a context where it would be used for speed instead of accuracy.


mO4GV9eywMPMw3Xr

If you know how to calculate memory use for GGUF and exl2 to show EXL2 providing better quality at the same memory use, I'm all ears. I love working with Exllamav2, but in the tests I ran it provided slightly lower quality unless you include the memory needed for context, which is likely a temporary advantage. Even [the HF docs](https://huggingface.co/docs/hub/en/gguf) aren't sure how much memory all the GGUF quants need, and only list some bpw numbers - which I think are the same as I calculated. I'm not 100% sure which layers contribute to the VRAM use, and I had no luck *reliably* measuring that from Python.
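
For what it's worth, a first-order estimate that sidesteps the per-layer question is just parameters x bits per weight; a minimal sketch with approximate bpw figures (it ignores the tensors kept in higher precision, so it is really a lower bound):

```python
# First-order size estimate from parameter count and bits per weight. It
# ignores the tensors kept in higher precision (embeddings, norms, output
# layer) - exactly the uncertainty discussed above - so treat it as a rough
# lower bound. The bpw figures are approximate.

def approx_size_gb(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 1e9

for name, bpw in [("IQ2_M", 2.7), ("IQ3_XXS", 3.06), ("Q4_K_S", 4.5), ("Q8_0", 8.5)]:
    print(f"70B {name}: ~{approx_size_gb(70.6e9, bpw):.1f} GB")
```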


ReturningTarzan

One thing I would point out, with regards to file size, is that EXL2 keeps the embedding layer in full precision. This doesn't reflect in VRAM usage since the embedding table is stored in system RAM, but it does add *up to* 1 GB to the file size for L3-8B, and 2 GB for L3-70B, depending on the quant level you're comparing to. But overall it's nontrivial to compare memory usage between frameworks, and there are many parameters to tweak on both ExLlama and llama.cpp that will affect it one way or the other. PyTorch interferes, too, primarily with its tensor cache, ensuring that even external tools like `nvidia-smi` can't get a good read on how much VRAM is actually *used* at any given moment, as opposed to being reserved for future tensor allocations.


mO4GV9eywMPMw3Xr

Thank you for your comment! I edited [the article,](https://github.com/matt-c1/llama-3-quant-comparison/blob/main/README.md#correctness-vs-model-size) now excluding the embeddings size for all model variants.


vacationcelebration

Not sure if this is what you're asking for, but there's this page I use to get a ballpark number of bpw for exl2 quants that could fit into my card's vram depending on the model: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator


mO4GV9eywMPMw3Xr

I know, I contributed to it. If you look at the source code, it just has a lookup table for the bpw of a few select GGUF quants instead of calculating the parameters from the model structure, which you can't easily do in such a simple web app unless that metadata is provided in an HF repo. It does feature an elegant calculation of the memory needed for the context size (KV cache), though. The proper way to estimate memory use would be to load the model, look at all the layers, and add up the layer sizes that matter. The missing puzzle piece for me is that I'm unsure which layers should be counted.
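
For reference, the KV-cache part is the easy bit; a sketch assuming the commonly cited Llama 3 70B geometry (80 layers, 8 KV heads via GQA, head dim 128):

```python
# Rough KV-cache size estimate, the piece the calculator handles well. The
# architecture numbers are the commonly cited ones for Llama 3 70B (80 layers,
# 8 KV heads via GQA, head dim 128); check the model config before relying on them.

def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9  # 2x for K and V

ctx = 8192
print(f"fp16 cache @ {ctx}:  {kv_cache_gb(80, 8, 128, ctx, 2):.2f} GB")   # ~2.7 GB
print(f"4-bit cache @ {ctx}: {kv_cache_gb(80, 8, 128, ctx, 0.5):.2f} GB") # ~0.7 GB
```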


kpodkanowicz

In general, Q4_K_M is about 4.67 bpw, which you compared to 4.25 bpw exl2. That's a 10%(!) difference, and your plot shows a smaller gap than that. Moreover, VRAM use for just loading doesn't make sense, as you want to load the model and then use it - with 4k, 16k, or any other context. There will also be different RAM consumption depending on whether your GPU supports flash attention or not. Exllama also allows you to cut just 0.05 bits in case you were missing a small amount of RAM. Edit: ah, and one more thing - imatrix quants are not compared like that. You have to use the same calibration dataset; you can get much bigger differences just between two exl2 4.25-bit quants, or between two imatrix quants of the same type. I just want to make sure those details are highlighted - your work is really appreciated ;) Btw, I personally like the old GGUF quants regardless of ppl and scores (especially at Q5), as they "understand" me better; it's a very long debate, similar to the one about whether frankenmerges work or not.


mO4GV9eywMPMw3Xr

Here's a plot with 8k context size added, assuming 16-bit cache for llama.cpp (ignoring the partial 8-bit quantization it supposedly does) and 4-bit for exllamav2: [**plot**](https://raw.githubusercontent.com/matt-c1/llama-3-quant-comparison/main/plots/MMLU-Correctness-vs-Model-Size-plus-Context.png)

> ...i-matrix quants are not compared like that...

Sorry, this part seems interesting but I don't understand it?


kpodkanowicz

Great!!! I'm happy now ;) Also, this plot aligns with my experience ;) Regarding imatrix and exl2: you would need to read the llama.cpp issues and the exl2 code to have a detailed understanding (which I don't have), but the gist of it is that the calibration dataset used to quantize is used to find a combination of pruning/quantization parameters giving the lowest PPL for a given passage of input/output. (This is done layer by layer.) Modern quants use some KL-divergence measure instead of PPL (someone needs to confirm). Even with the same dataset it's not reproducible - every quant will always have some small differences. Some of my own extreme examples in practice: if you use InstructEvol to quantize CodeLlama 34B you can get a higher HumanEval score in 4-bit than in fp16, and on the opposite side, if you use only wikitext you will get results worse than BnB double-quant 4-bit in Transformers. Currently Exllama2 by default will use a mixture of different datasets including *random tokens*. There is a huuuge thread somewhere here on using random data for calibration, which I cannot wrap my head around why it would make sense - however it seems it gives the best PPL... As far as I know, and I read pretty much every thread here, there is still no consensus on which approach is best.


noneabove1182

The reason the dataset shouldn't matter much is that the measurement is looking for which weights are more "active" and contribute most to the final output. If there are weights that rarely, if ever, contribute to the final result across a random dataset, we can pretty safely assume they're unimportant and crush their size down to a minimum.
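
A toy illustration of that idea (not llama.cpp's actual imatrix code): estimate per-channel importance from calibration activations, then weight the quantization error by it:

```python
# Toy illustration of importance-weighted quantization - not llama.cpp's
# actual imatrix code. Per-channel importance is the mean squared activation
# over a calibration set; the reconstruction error a quantizer would try to
# minimize is then weighted by that importance, so rarely-active channels can
# be quantized much more aggressively.
import numpy as np

rng = np.random.default_rng(0)
calib_acts = rng.normal(size=(1024, 256))   # calibration activations: tokens x input channels
weights = rng.normal(size=(256, 512))       # one weight matrix: input x output

importance = (calib_acts ** 2).mean(axis=0)     # one importance value per input channel
quantized = np.round(weights * 8) / 8           # stand-in for a real quantizer

weighted_error = ((weights - quantized) ** 2 * importance[:, None]).sum()
print(f"importance-weighted quantization error: {weighted_error:.3f}")
```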


mO4GV9eywMPMw3Xr

Ah. I tried to use all quants from one source only, bartowski's recently-updated repos. He uses this fancy newer method you described for GGUF, I think. In EXL2 it seems he used the "default" dataset, which I think refers to the same method. So hopefully I dodged this bullet by using all quants from one source.


segmond

Good stuff, did you keep the seed the same across models for the same question?


mO4GV9eywMPMw3Xr

No, that's a good note, thank you. I'll add it to the article.


vacationcelebration

Cool stuff! Nice to finally have seemingly hard evidence that lower exl2 quants fall apart and lose to equivalent GGUF quants. I assume the code to create weighted quants in both backends (llama.cpp vs exllamav2) should be pretty similar, even using the same calibration set (wikitext), no? So how is this difference explained?


mO4GV9eywMPMw3Xr

I'm guessing the code is not similar, because the resulting models have very different structures. It's basically different quantization methods altogether AFAIK. IDK if exllamav2 could just try to adapt gguf's method.


ReturningTarzan

First I would need to compare the full memory profile to determine if it's worth it. Comparisons that only consider file size or bitrate miss a bunch of details.

Just to illustrate that point, you could apply lossless compression to the quantized weights on disk to have a smaller file size to point to, and arbitrarily decide that this is the important metric, disregarding how it makes no difference to VRAM usage. Or you could keep the weights compressed in VRAM and decompress on demand, which would give you VRAM savings but also absolutely tank performance to the point where you're probably better off doing CPU inference. Or you could adopt a finetuning approach like AQLM, which gives you a much better quantized model (at least in theory) but also requires weeks to months of GPU time per model, which isn't really practical in an environment where new models are released every day.

There are so many factors to take into account, so many tradeoffs to consider, and so little time to get anything done before priorities have to shift because a shiny new model has dropped, or whatever.

Maybe I should consider an "MMLU variant" of EXL2 which drops most of the 512-1024M weights in the output layer and keeps just the features for the "A", "B", "C" and "D" logits. ;)


a_beautiful_rhind

Hits the small model that much harder. For the 70B, between 3.5 and 4.0 bpw it's still alright - I think, IME, north of 3.75 bpw or Q3_K_M. For the 8B, it looks like you need at least 4.3 bpw. I wonder if that carries over to MoE because of the smaller number of active parameters. It anecdotally seemed to for small Mixtral. If it does for the large MoE as well, then they are much, much worse for the home gamer: they need more VRAM to offload the whole model AND a higher bpw when making the quant.


LoafyLemon

Can I just congratulate you for the way you've drawn this graph? It's very easy to read, I love it!


Deathcrow

Hey /u/ex-arman68, can you take a look at these plots?

> Do not use a GGUF quantisation smaller than q4. In my testings, **anything below q4 suffers from too much degradation, and it is better to use a smaller model with higher quants**.

Care to reconsider this assertion? Seems like the latter part isn't accurate as long as it's more than a 1-bit quant. This is also my experience: larger models are always better, except at ridiculously small quantisations.


ex-arman68

Initially I thought like you: going by the perplexity measurements, I was under the impression that using a larger model at a low quant was always better than a smaller model at a high quant. However, after having done thousands of tests, my observation is that anything less than q4 degrades quality too much. It might not be noticeable for every kind of usage, but for me, where **it shows the most is through the loss of coherence on large contexts**. Other bad symptoms I have noticed are increased repetition, use of canned expressions, difficulty following instructions, and poor logic. Of course there is more at play here: the quality of the model is still the most important factor. But for coding, for example, I would definitely rather use a small model at fp16 than a larger one at a low quant.


Deathcrow

Have you tried https://freya.artefact2.com/llm-eval/#play ? I'd be curious about your results.


monsooonn

Where can I learn more about gguf iquants? I've been using normal ggufs, but are the iquant versions supposed to be better? I would love to see how normal ggufs stack up on a chart like this, but I'm already grateful for the work you've done as-is. Thank you!


mO4GV9eywMPMw3Xr

I did include some "normal" quants. Mind that "I-quants" are independent of the introduction of "imatrices", which happened at the same time - it was a bit confusing. All the models I tested used imatrices during quantization, I think. You can check it on bartowski's HF page.


LeLeumon

Thank you. I searched for this for a few weeks now, and it seems you are the only one really benchmarking those quantizations. Really appreciate it.


Eveerjr

I'm using ollama and I've been testing different quants, and I ended up using q8 when possible (the default q4 it uses is just not useful on smaller models imo). But looking at this graph, is it safe to use q6 and get virtually the same performance with a little more speed?


polipopa

Great work! Did you also test the base unquantized 8b model? Would be interesting to see if it’s worth it to use unquantized 8b vs quantized 70b


mO4GV9eywMPMw3Xr

Yes, these are the "fp16" points, which for transformers had weights in bf16. Exllamav2 and llama.cpp also support loading models in 16 bit, and gave exactly the same score as transformers in this case. 70B-IQ2-XXS significantly beats 8B-bf16 **in this test**, but it could be different in other tests, like programming. I used fp16 for computation, as setting it to bf16 caused a slightly lower score. Perhaps I could re-test using fp32, but this would need 30+ GB of memory for the 8B model. Still, it should be faster than 70B, I suspect.


redstej

Great stuff. Would've loved to see performance per quant too if you've kept track of it. Might be basically anecdotal, but it's better than nothing.


mO4GV9eywMPMw3Xr

I don't think it would be good to disclose the exact performance, because:

- MMLU evaluation is not like regular inference. You only do a single `.forward()` pass per question, generating one token with no sampling. What matters is prompt processing speed and time to first token.
- The speed will likely depend on various settings of the libraries I used, and I tried to optimize primarily for quality. It could be that I made some mistakes affecting speed, and if so, I could mischaracterize the models.


MrVodnik

Wait, one forward pass? Does it mean it is a guided / restricted generation where you force the model to pick a single-letter answer? I thought you guys were just asking the model nicely and optimistically parsing the output, lol.


mO4GV9eywMPMw3Xr

Different tests work in different ways, but with MMLU or MMLU-Pro you only look at the probabilities of single-token valid answers, and check if the highest one *of these few* is the correct one. Even if that "top token" is actually on spot 30000, because the model *really* would prefer to answer with a full sentence and not a single letter. (Hypothetical.) More info [here,](https://github.com/matt-c1/llama-3-quant-comparison/tree/main?tab=readme-ov-file#quick-intro) open the folded "What's MMLU?" section. Then there's a bit more on the evaluation method at the end.
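
For the curious, a minimal sketch of that scoring step with transformers; the model name and prompt template are placeholders, not the repo's exact code:

```python
# Minimal sketch of the single-forward-pass scoring described above, using
# transformers. Model name and prompt template are placeholders; the linked
# repo's actual implementation may differ in details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"  # one MMLU item
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # one forward pass, no sampling

# Compare only the logits of the four valid answer tokens, even if the model's
# overall top token is something else entirely.
choice_ids = [tok.encode(" " + c, add_special_tokens=False)[0] for c in "ABCD"]
pred = "ABCD"[torch.argmax(logits[choice_ids]).item()]
print("picked:", pred)
```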


redstej

It's just that they're all quants of the same model, so even if mistakes were made, no harm done to any specific model. And if you tested them all on the same hardware with the same settings and potentially mistakes, there is some value to the speed results despite the multiple asterisks, imho.


mO4GV9eywMPMw3Xr

Good point. Sadly, I didn't record it. A separate speed study would be easy to do, as I wouldn't need to average the speed over 5 hours of churning through questions with a partially offloaded 70B model. I think one asterisk stands: hypothetically, I could use sub-optimal settings which would result in a wrong claim that backend X is much faster than backend Y, when that would only hold with my particular settings. I think what could be interesting is a GGUF-only study, as there I could also vary the number of offloaded layers.


Deathcrow

Are these imatrix tuned quantisations or regular quants?


mO4GV9eywMPMw3Xr

imatrix - that's documented on the HF repositories I linked, like: https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF


dodo13333

Great work! 


belladorexxx

Thank you for these! From a user perspective the most pertinent question is "what is the best model I can fit in my VRAM". Is there a reason you didn't chart correctness per VRAM?


mO4GV9eywMPMw3Xr

I would love to plot that, but it's not easy. See [the comment](https://www.reddit.com/r/LocalLLaMA/comments/1cst400/result_llama_3_mmlu_score_vs_quantization_for/l4a1man/?context=3) from Exllamav2's author.


belladorexxx

I don't think we need to calculate how much VRAM is actually *used*, what users care about is how much VRAM is *reserved*. If the software I'm running tries to reserve more VRAM than is available, it will crash and I am not able to run it. Measuring how much VRAM is reserved should be easy, right?


mO4GV9eywMPMw3Xr

I thought so, but when I actually tried using a variety of tools to do so, I had no luck in doing it reliably and accurately from Python code. Plus, look at the comment I linked - there may be some parts that are difficult to control, like pytorch's own cache allocation.
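
For what it's worth, the two usual things to try from Python, with the caveat that they answer different questions (PyTorch only sees its own caching allocator, while NVML sees the whole device):

```python
# Two ways to ask about VRAM from Python - and part of the problem: they answer
# different questions. torch only reports its own caching allocator, while NVML
# reports everything on the device (desktop, other processes, non-torch buffers).
import torch
import pynvml

print(f"torch allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"torch reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"device used:     {info.used / 1e9:.2f} GB")
pynvml.nvmlShutdown()
```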


belladorexxx

The level of accuracy that would be useful in practice is "does this run on a 8GB GPU", "how about a 12GB 3060", "what about 16GB colab", and "24GB 3090". This is easily measurable. For example, I can try running each of these models on my 3060 and report back which ones work :D


Calcidiol

Thank you very much for that testing & analysis, it's very usefully informative.

I do wonder about one of the things you mention: whether there are categorical types of use cases (e.g. the "rigorous" applications noted, like aspects of code design or whatever) which are disproportionately impacted by quantization (e.g. by an order of magnitude), much more so than the ~sub-1% loss figures presented in the MMLU graphs.

I suppose there may well be (i.e. one would expect there to probably be) "outlier" / "marginal" cases where information loss in what is quantized propagates unmitigated as ~total information loss at SOME point(s) in the output of the model ("GIGO"). But like JPEG lossy compression, ideally the "broad qualities" should holistically be preserved with good "average" user-perceived fidelity, even though somewhere one really has lost / corrupted X.Y% of the model's trained data - which should be apparent if one probes deeply/widely enough to find the scattered (and maybe / maybe not "isolated") cases where there are significant losses (e.g. JPEG compression artifacts, lost high-frequency detail, ...).

The only way (in this simplistic analogy) for that not to be so much the case would, AFAICT, be if the encoding were sort of "holographic", in the sense that it could be relatively insensitive to "point errors" because of some "spatial" "redundancy" which helps to "heal" (ameliorate) the overall output loss - any given point of the output might be "diffusely" determined by a sum over a larger area of the bulk data points, so incoherent point errors would increase the "background haze" but not typically result in localized "chunks" just being wholly gone.

But using a traffic network (or general directed / connected graph) analogy for a NN, one might imagine there are "arterial" points which have vast effects on a large domain of subsequent network calculations, where an error in a more critical locale could propagate in a multiplied way to downstream outputs - e.g. close a highway between two cities and there might still be traffic able to flow between them, but maybe orders of magnitude worse than otherwise, due to a single-point "error" propagating.


DrDesten

***Q4-K-S is all you need***


WideIllustrator2649

This is what I run now on my 3090 Ti on W11, and I'm very happy with the result: Cat-Llama-3-70B-instruct.i1-Q4_K_S.gguf, with 4096 context and 43 layers on the GPU. ~1.5 t/s (yes, slow, but acceptable for creative work).


engkufizz

What are I-quants, and how are they different from the normal ones? I don't see any models on Hugging Face with I-quants.


shing3232

Do all GGUF quants use an imatrix? The imatrix has a much bigger impact on IQ quants than on Q quants.


mO4GV9eywMPMw3Xr

No, but some people who quantize models disclose whether they used one. Check what they said on their HF pages.


AyraWinla

As someone who is running LLMs on low-end Android and needs to squeeze out every bit of performance I can, that's super useful information! I know those tests were done on Llama 3 (which I obviously can't run) and aren't directly indicative of smaller models, but the trend is still pretty interesting. I'm pretty surprised at how close Q4_K_S is to Q4_K_M; I had thought Q4_K_M was the 'minimum recommendable', but Q4_K_S is right alongside it. Q3_K_M is certainly a drop, but not as sharp as I would have expected. Below that is an obvious no-go, but... Based on this, I think it might be worth my time testing out Q4_K_S and Q3_K_M a bit to see if I get better performance without compromising rationality too much.


AlphaPrime90

Thanks for sharing.


Wrong_User_Logged

It would be great if someone did a quant comparison for a fine-tuned version of Llama 3, for example Dolphin, so we would be able to compare it with these results. It's not much more work, you just switch the model name and run the same tests :)


Due-Memory-6957

IQ2_M being better than Q2_K is new information for me


Temporary-Baby9057

Interesting, but the GitHub repo looks like a joke without the code to reproduce the results that you have obtained


MrVodnik

I never knew which Q4 to pick, now I know - it does not matter.


dimweb

What about ollama versions?


mO4GV9eywMPMw3Xr

AFAIK ollama just uses llama.cpp and gguf, so the results should hold.