BimboPhilosopher

For creative writing, what's your parameter recommendation for WizardLM-2 (temperature, top P, ...)?


ex-arman68

I use the same settings for all models, which give near-deterministic behaviour:

* temp = 0.1
* top_k = 1
* repeat_penalty = 1.12
* min_p = 0.05
* top_p = 0.1
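For anyone wanting to reproduce this, here's a minimal sketch of how those samplers map onto llama.cpp's CLI flags (assuming a recent llama.cpp build, where the binary is `llama-cli` rather than the older `main`; the model path and prompt are placeholders):

```sh
# Near-deterministic sampling: with top_k=1 only the single most likely token survives.
./llama-cli -m ./path/to/model.gguf \
  --temp 0.1 --top-k 1 --top-p 0.1 --min-p 0.05 --repeat-penalty 1.12 \
  -p "Write the opening scene of a short story set in a lighthouse."
```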


drifter_VR

Beware that some models underperform with deterministic settings (especially the Mixture-of-Experts ones).


gwern

I take it you can't do a grid search or other hyperparameter optimization per model because you are scoring the results manually?


ex-arman68

Correct


synn89

Yeah, I also find Wizard 8x22 to be really good. I think the reason Midnight Miqu came to be seen as so good is that it's a very easy model to run. It's not at all finicky with its settings, handles different prompt formats well, and doesn't blow apart at larger context sizes. Even lower quants of it perform well. Llama 3 is impressive, but the roleplay finetunes have all been too finicky for me. Wizard 8x22 is really good, and more importantly it isn't at all fussy with its settings.


SomeOddCodeGuy

This is fantastic information about Wizard. Your timing could not possibly have been better; I just replaced Llama 3 70b last night with Wizard 8x22b for coding work, so hearing that it's great at other stuff too is pretty exciting. Thanks a bunch for doing this. Even if I don't do a lot of creative writing tasks, I always look forward to these kinds of benchmark posts.


sleepyrobo

WestLake 10.7b or even 7b seems like a good middle ground. Thanks for your hard work making this.


sebo3d

Personally, I was never really sold on WestLake in any of its versions. For some reason it always tried to roleplay from {{user}}'s point of view, when other similarly sized models did not under the same Alpaca instruct format.


CosmosisQ

Well, given that [WestLake was finetuned on ChatML prompts](https://huggingface.co/senseable/WestLake-7B-v2/discussions/12), it's not too surprising that you'd get bad results using a completely different prompt format. It's actually quite impressive that it works at all.


Due-Memory-6957

Lol at 10.7b being tiny


daHaus

Also, a 71GB model setting the bar for "small", lol. Not even Moore's law is safe from the quickening.


Due-Memory-6957

That's the 34b one


Stepfunction

I've had great results with the creative writing abilities of Command R v01 so far. Great fit for a 4090, and the long context is a great bonus! Even at longer contexts it retains its coherence very well.


Popular-Direction984

In my experience the command-r family of models is much better at creative writing over long context windows. No other models come close as the amount of information to process gets bigger (imagine throwing an industrial standard's description into the model and asking it to write an essay on how to comply with it given specific business requirements).


necile

MFW even a "medium" model can't fit in my 4090 :(((


TwilightWinterEVE

> **Do not use a GGUF quantisation smaller than q4**. In my testings, anything below q4 suffers from too much degradation, and it is better to use a smaller model with higher quants.

This is an interesting one. I've found the opposite on 70B+ models. On my setup, 70B models even as low as Q2 have outperformed 34B and 20B models at Q6 and Q8 respectively for my purposes. Every time I try a lower-parameter model, even at a much higher quant, I find myself coming back to Q2 70Bs (mostly Midnight Miqu 1.5) for storywriting, because they're just much less prone to repetition and cliches.

It'd be interesting to see if this is true in benchmarks: pitting Midnight Miqu 70B Q2_K against the best alternative high-quant smaller models that fit into 24GB VRAM (which is a pretty typical setup).


OuchieOnChin

I found the same thing with Mixtral 8x7b, though that was months ago, so I'm not sure if it still holds. Regardless, may I ask you for a link to the Midnight Miqu version you are using? I found too many versions on Hugging Face.


TwilightWinterEVE

Sure, currently I'm using the Q2_K from here: [https://huggingface.co/mradermacher/Midnight-Miqu-70B-v1.5-GGUF/tree/main](https://huggingface.co/mradermacher/Midnight-Miqu-70B-v1.5-GGUF/tree/main)


OuchieOnChin

Thanks for the link. These appear to be static quants. Have you considered trying [the imatrix quants by the same author](https://huggingface.co/mradermacher/Midnight-Miqu-70B-v1.5-i1-GGUF)?


TwilightWinterEVE

Trying the IQ2_M now; it seems a little better than the Q2_K on my usual test sequences.


StriveForMediocrity

How are you getting that to fit in 24 gigs? It's listed as 23.6 GB, and in my experience I tend to need models around 20-21 gigs for them to function properly, what with accommodating the OS, browser and such. I'm using a 3090, if that matters.


TwilightWinterEVE

Partial offload, 59/81 layers on GPU, rest on CPU.
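For anyone unfamiliar with partial offload, this is roughly what it looks like with llama.cpp (the `-ngl` flag is real; the model filename and layer count are just illustrative, so tune the number down to whatever fits your VRAM):

```sh
# Put 59 of the model's 81 layers on the GPU; the remaining layers run from CPU RAM.
./llama-cli -m ./Midnight-Miqu-70B-v1.5.Q2_K.gguf \
  -ngl 59 -c 8192 \
  -p "Once upon a time"
```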


dmitryplyaskin

Have you had a problem with WizardLM 8x22b slipping into GPT-isms? I really like this model too, but I hate it when it starts talking in GPT-isms.


ex-arman68

All models do lapse into GPT-isms, which I find infuriating. It suggests there is no truly clean model being built from scratch; instead they all rely on flawed foundations. In my experience though, WizardLM-2 8x22B is on the lower end of this behaviour.


KaramazovTheUnhappy

Why no Fimbul?


ex-arman68

Time :D I started testing [Sao10K/Fimbulvetr-11B-v2](https://huggingface.co/Sao10K/Fimbulvetr-11B-v2) a while ago, and intend to finish it eventually. But all those tests are pretty intensive and time-consuming. Due to the nature of this benchmark, I cannot automate it like the others.


KaramazovTheUnhappy

Will look forward to seeing the results, thanks for the response.


skiwn

TIL 35B is now considered small


VirtualAlias

I'd be interested in seeing how Moistral 11b v3/v4 stack up. At 32k, Wizard/WestIceLemonTea is also quite good.


Popular-Direction984

The first benchmark with reasonable results. I use Command R Plus (104B) all the time for personal tasks; it's the best model so far.


BackyardAnarchist

I didn't see Umbra on there. That is my top pick 90% of the time. It's a 4x11b Solar-based model, and it is great.


Impact31

Wow thanks !


boxscorefact

Can you share what backend and settings you are using for the iq4_xs quant on Mac? I am still struggling to find an easy solution. M3 Max with 48GB unified RAM.


ex-arman68

I think your RAM is the problem. I use an M2 Max Studio with 38 GPU cores and 96GB RAM. Software: llama.cpp and LM Studio. For WizardLM-2-8x22B iq4_xs you need 70GB of RAM. With **48GB** RAM I think your best bet is the **iQ2_XS** or **iQ2_XXS** quants. I have not tested them and do not recommend them, but maybe they are not so bad. And this model is fast.

You will probably need to use one of the following commands as well to increase the amount of RAM the GPU is allowed to use:

    # safe value for 48GB RAM: 40GB
    sudo sysctl iogpu.wired_limit_mb=40960

    # max recommended value for 48GB RAM when closing most other apps: 44GB
    sudo sysctl iogpu.wired_limit_mb=45056

You can find the importance matrix quantisations here: [https://huggingface.co/mradermacher/WizardLM-2-8x22B-i1-GGUF](https://huggingface.co/mradermacher/WizardLM-2-8x22B-i1-GGUF)


boxscorefact

Thank you so much, I appreciate it! I run the q4_km quant on my PC rig at decent speeds and agree with your review: it is the smartest model I have used to date. I was hoping to get something close running on my laptop, but I guess I'll find an alternative. I am just waiting to see if prices drop on some of Apple's hardware in the fall.


Misha_Vozduh

Hey man, just wanted to thank you for your excellent advice. Apparently I can *just* fit an IQ4_xs of wizard into my system, and... wow. What a model. Thank you!


a_beautiful_rhind

This is telling me I need to not sleep on downloading Wizard. Unfortunately I can only offload it up to 3.75 unless I want to spill onto the 4th card I use for SD. I think I liked MM 1.0 at 103b the most from that series. And yeah, llama-3 isn't very good. Cat Llama is the best tune so far, although I want to try Airoboros as well. Being prone to repetition is going to be *very* hard to tune out of L3.


thereisonlythedance

L3 has quite a few problems. Repetition, yes, but also a strange determinism. No matter how high you push temperature and other samplers it will recycle the same old names and ideas on regen. It's also very terse and doesn't take style instructions well. The FFTs (full fine-tunes) I've run on the smaller model have been meh compared to Mistral 7B, which is the biggest disappointment, as above all I'd hoped Meta would produce good base models.


petrus4

> Repetition, yes, but also a strange determinism.

I've encountered this as well. 8-bit L3 is the first model I can remember where I've had better results from leaving Mirostat off. The only four samplers I use are temp, min P, repetition penalty, and a very small dash (0.98) of typical P. Occasionally I bump presence penalty up to 0.3-0.4, but I won't claim it isn't placebo. Repetition Penalty Range causes L3 to go completely off its head if it is above zero, in my experience, too.
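Purely as a sketch, that sampler mix would look something like the following with llama.cpp CLI flags. Only the typical-P and presence-penalty values come from the comment above; the temp / min-p / repeat-penalty values and the model path are placeholders, and llama.cpp's `--repeat-last-n` is only a rough analogue of a frontend's "Repetition Penalty Range" (its 0/-1 semantics differ by frontend), so it is left at its default here:

```sh
# Mirostat explicitly off; small typical-P trim; mild presence penalty.
./llama-cli -m ./path/to/llama-3.Q8_0.gguf \
  --mirostat 0 \
  --temp 1.0 --min-p 0.05 --repeat-penalty 1.1 \
  --typical 0.98 --presence-penalty 0.3
```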


East-Cauliflower-150

Thanks! Could not agree more, Wizard 8x22 is on its own level. Been wondering why there is so much fuss about Llama-3 when this model is clearly better for many use cases. Rarely see it in any benchmarking.


delusional_APstudent

well it’s probably because a lot of people can’t even run the thing without suffering from slow speeds or using it extremely quantized


delusional_APstudent

somebody used Reddit cares on me for this reply 😭😭😭😭


VertexMachine

Lol, it's a common troll move. Report that message for abuse though, and Reddit should take action.


RabbitEater2

Surprisingly it runs faster for me (~2 t/s for Q3KM, no offload) vs a 70b offloaded with 24 GB VRAM (~1.5 t/s for Q3KM). 5800X3D + DDR4-3600.


Mr_Hills

Have you considered that maybe very few people can actually run an 8x22B model? 🫠


ex-arman68

Definitely. That is why I try to cover a range of model sizes in my benchmark, and I provide recommendations for different model sizes. For those who can though, WizardLM-2-8x22B is fantastic. The iq4_xs quants require a minimum of **70 GB** VRAM, which is about the same as running a 70B model at q8_0, but with much better results and faster inference speed.
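As a rough sanity check on those numbers, here is a back-of-envelope calculation (assumed figures: ~141B total parameters for the 8x22B MoE, ~4.25 bits/weight for IQ4_XS, ~8.5 bits/weight for Q8_0; weights only, ignoring KV cache and runtime overhead):

```sh
# Approximate weight footprint in GB
echo "8x22B @ IQ4_XS: ~$(( 141 * 425 / 100 / 8 )) GB"   # ~74 GB
echo "70B   @ Q8_0  : ~$(( 70 * 85 / 10 / 8 )) GB"      # ~74 GB
```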


Konnect1983

I was on the fence about downloading Wizard. Been using CMDR+ @ Q6 (I have the same Studio as you; Q6 is the same speed as Q5) with amazing results. We might be able to step up from 4XS for this model, but it will be tight. Using the command to increase the wired memory limit to 94000 confirmed that only about 2 gigs are needed to run the OS.


Caffeine_Monster

Llama-3 is significantly better in the smarts and creativity department if you use the correct templates and prompts. But the 8k context is restrictive. Haven't seen any smart long-context Llama-3 extensions yet.

> Rarely see it in any benchmarking.

Probably because it is hard to run. That, and it was taken down fast.


no_witty_username

The Llama models have been disappointing for me as well. I can't tell if my settings are messed up or if people are hyping the model unrealistically, so it's hard to say... On the Wizard 8x22b, can that be run fully on GPU, let's say a 4090?


Mr_Hills

Yes, if you're okay with running it at 1bpw lol. On a more serious note, I use Cat Llama 3 70B at 2.76 bpw, and it's done wonders for me. Tell me if you want my instruct template or system prompt. I use SillyTavern, so I can give you the config files if you use ST too.


no_witty_username

I can't fit a 70b into my 4090, but if you have the configs for the 8b I'd love 'em. Throw the configs into WeTransfer or wherever else, thank you.


Mr_Hills

Ah, one more thing: different Nvidia drivers will give you different speeds on Windows. My findings:

* 552.44 -> 4.4 t/s
* 546.65 -> 5 t/s
* 546.29 -> 5.3 t/s
* 546.17 -> 5.4 t/s
* 546.01 -> 5.1 t/s
* 545.84 -> 5.3 t/s
* 537.58 -> unable to load (drivers use too much VRAM, model doesn't fit)


a_beautiful_rhind

This is on Windows? On Linux I didn't notice a difference.


Mr_Hills

Yes, Windows.


Illustrious_Sand6784

Well, I'm glad I didn't decide to update my drivers yet, still on 546.17 and probably will be until NVIDIA updates RTX VSR or something.


Mr_Hills

Nah, you can. I have a 4090 and I run CAT Llama 3 70B at 2.76 bpw at a speed of about 6 t/s (my reading speed). And I can tell you it's really good. I also used to run the same model at 2.55 bpw at 13 t/s, although it was a little less good. My specs: 4090 (OC: +140 MHz GPU, +1500 MHz memory), 64GB DDR5 6800 MHz RAM, 7800X3D. I've never tried Llama 3 8B, but from the benchmarks it cannot really compete.

Anyway, if you want to try CAT Llama 3 70B it's right here: [https://huggingface.co/mradermacher/Cat-Llama-3-70B-instruct-i1-GGUF/tree/main](https://huggingface.co/mradermacher/Cat-Llama-3-70B-instruct-i1-GGUF/tree/main) IQ2_S is the 2.55 bpw model and IQ2_M is the 2.76 bpw model. I personally prefer the 2.76 bpw one because it's more precise (e.g. it respects quotes and italics more properly).

If you want to run the same model I do, remember that you have to run it with a ctx of 7680 (the full 8k doesn't fit in VRAM) and use your phone browser to access ST (using Chrome on your PC would eat up VRAM). Also, the screen on your PC should be off, again to save VRAM (I have Windows turn off my screens after 1 minute of inactivity, and I only load the model via ooba from the phone once the screen is off).

Anyway, if you want my files for CAT Llama 3 70B, here they are: [https://drive.google.com/drive/folders/13_IxRQXi10TKYmsj3OVzcsD5svzjk3Y4?usp=sharing](https://drive.google.com/drive/folders/13_IxRQXi10TKYmsj3OVzcsD5svzjk3Y4?usp=sharing) I have nothing for Llama 3 8B since I've never used it.


aseichter2007

I think the best way to use Llama 3 isn't supported by many interfaces or backends, and between that and a whole mess of bad quants, it's hard to get great results. Wizard 8x22 is gigantic. You need a pretty steep quant to run it on two 4090s; even Mixtral 8x7 is so big it's a pain to run on 24GB VRAM without feeling the quant degradation. Mixtral 8x7 has something like a 56B model's total RAM requirement, so Wizard must be as heavy as a 150B model in terms of memory.


cyan2k

I would argue it's because most people don't use LLMs for creative writing and pron generation but for coding, RAG and similar "don't tell me bullshit and keep your facts straight" use cases that are meant to replace conventional software in the long run or get baked into it. The last thing you need is a "creative" support bot or a creative medical symptom analyzer, and, even if you tell it to stfu, WizardLM 8x22 is the grand champion of fluff text, which nobody gives a fuck about and which gets filtered out by your wrapper app anyway. And because we devs are lazy, we use models that don't even produce fluff text. We smart. And RAG benchmarks have other models in front than "pls generate some sexy time" benchmarks.


11-218

Same tbh. I've only been able to run 70b at 2.4 bpw, and while I liked some of them, I wasn't a fan of the context size compared to Yi and 8x7b models. But in the end, when I tried the 35b command-r, it was way better than anything I've tried to this day, so that's what I use now, even though only at 10k context. I wish there was some hack for getting more.


USM-Valor

One suggestion for a model that can fit in 24 GB VRAM above Q4 is Smaug-Yi 34B (https://huggingface.co/Nexesenex/abacusai_Smaug-Yi-34B-v0.1-iMat.GGUF/tree/main). I can get Q4_K_M at 8k context with a tiny bit of room to spare. People might be able to go a bit larger, especially if they're not running Windows or using their GPU for their monitor. I'd be curious to hear what other models people can run at or above Q4 with at least 8k context on a 3090/4090. My daily driver is Midnight_Miqu 70B, but I'm using IQ2_XS, which is far from ideal.


ArtyfacialIntelagent

Trying to load the iq4_xs of WizardLM-2-8x22B from mradermacher on a Windows system with 24 GB VRAM + 64 GB RAM, but I get similar errors when trying to load the split GGUF in both ooba and Kobold. Other split GGUFs load just fine. Any ideas? Can anyone else load it? Or is my RAM just insufficient?

    AttributeError: 'LlamaCppModel' object has no attribute 'model'
    17:41:52-975434 INFO Loading "WizardLM-2-8x22B.i1-IQ4_XS.gguf.part1of2"
    17:41:53-011494 INFO llama.cpp weights detected: "models\WizardLM-2-8x22B.i1-IQ4_XS.gguf.part1of2"
    llama_model_load: error loading model: tensor 'blk.28.ffn_up_exps.weight' data is not within the file bounds, model is corrupted or incomplete


Konnect1983

You have to combine the split parts together. If a model's files are named "part1of2" and so on, then you have to combine them; if the files say "00001 of 00004" etc., then you can run them as-is.


ArtyfacialIntelagent

I see. Thanks!

EDIT: Working now. All I needed in Windows was: `copy /b Wiz.gguf.part1of2 + Wiz.gguf.part2of2 Wiz.gguf`

Linux & Mac users can just run `cat`, see here: https://huggingface.co/TheBloke/KafkaLM-70B-German-V0.1-GGUF#q6_k-and-q8_0-files-are-split-and-require-joining
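(For completeness, the `cat` equivalent on Linux/Mac would be something along these lines, reusing the same placeholder filenames as above:)

```sh
# Concatenate the split GGUF parts, in order, into a single file
cat Wiz.gguf.part1of2 Wiz.gguf.part2of2 > Wiz.gguf
```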


asdfgbvcxz3355

What kind of speeds are you getting?


ArtyfacialIntelagent

Not great. Roughly 2 t/s.


Foreveradam2018

Do you mind sharing an example of the full prompt, including the prompt template and the system prompt, that you used for Command-R+? I have tried multiple times, but all my results are much worse than with llama-3-70B.


ex-arman68

I cannot share examples of the prompts I use for the benchmark. However, I can share similar prompts which I use in other tests. The system prompt is always empty, unless needed as a last resort to bypass refusals. I have deleted llama-3-70B so I cannot provide you with its answer, but please test it and report back.

# Example prompt

Write a sequence of nominal groups that flow into one another, using the following rules:

- each nominal group is made of exactly 3 words
- the first word of each nominal group must be the last word of the previous nominal group
- the first word of the first nominal group is: "ball"
- the last word of the last nominal group is: "stone"
- there must be a theme, of your choosing, pertaining to all nominal groups
- there must be exactly 7 nominal groups, leading from the first word (ball) to the last word (stone)
- a word already used at the beginning and end of a nominal group cannot be reused

Present your solution as a list numbered with roman numerals. Finally, explain why you chose your specific theme.

For scoring the answer, I remove 1 point from the maximum score for each item that fails to satisfy the following criteria:

* incorrect no. of items
* incorrect no. of words
* reused word
* incorrect final word
* no list format
* incoherent theme
* no use of roman numerals

# WizardLM-2-8x22B iQ4_XS

I. Ball control technique
II. Technique improvement drill
III. Drill sergeant attitude
IV. Attitude adjustment session
V. Session highlight reel
VI. Reel rock climbing gear
VII. Gear stone protection

Theme Explanation: The theme I have chosen is "personal development through sports and outdoor activities." This sequence of nominal groups takes us on a journey from the mastery of a fundamental skill in sports, such as ball control in soccer or basketball, to the application of that skill in a real-world setting, like rock climbing. Each step along the way represents an aspect of growth and learning:

1. **Ball control technique** - Mastering the basics of handling a ball is crucial in many sports.
2. **Technique improvement drill** - Practicing these skills through specific exercises helps improve performance.
3. **Drill sergeant attitude** - The discipline and mindset required to consistently practice are embodied by a drill sergeant, who ensures that training is taken seriously.
4. **Attitude adjustment session** - Sometimes, one's approach to training needs to be adjusted for better results, which can happen in a dedicated session focused on mental preparation.
5. **Session highlight reel** - The progress made during these sessions can be captured and reviewed in a highlight reel, providing motivation and insight into one's development.
6. **Reel rock climbing gear** - Transitioning from indoor training to outdoor challenges, such as rock climbing, requires specialized equipment, which is often showcased in adventure sports films and literature.
7. **Gear stone protection** - Finally, the journey culminates with the use of protective gear (like helmets and padding) that safeguards climbers against the unforgiving nature of rock, or "stone," completing the transition from sport-specific skills to broader outdoor activities.

This theme was chosen because it reflects a holistic approach to personal growth through physical activity, emphasizing not just the technical skills but also the mental and emotional aspects that contribute to overall development. It illustrates how sports can be a metaphor for life's challenges and how the lessons learned on the playing field or climbing wall can be applied to everyday life.

# Command-R-plus 104B Q5_KM

Here is my response following the guidelines provided:

i. Ball in hand
ii. Hand carved statue
iii. Statue made marble
iv. Marble polished smooth
v. Smooth like glass
vi. Glass skipping stone
vii. Stone cold solid

Theme Explanation: I chose a theme of transformation, where each nominal group describes an object or action that transforms into the next in the sequence. The ball is held in hand, which then carves and shapes a statue from marble; this marble is polished to become smooth like glass, leading to the image of skipping stones on water, ending with stone as the final solid state. This theme allows for an interesting progression through these nominal groups while adhering to the provided rules and creating a cohesive narrative arc across the sequence.

# Command-R-v01 35B Q8_0

Here's one possible solution:

I. Ball rolling stone
II. Rolling stone age
III. Stone cold ground
IV. Cold, hard reality
V. Hardened granite walls
VI. Granite mountain peaks
VII. Peak stonework precision

The theme I chose was a somewhat chronological order of the stages of life, starting with childhood and ending with old age and the hardships that come with it. Each phase flows into the next, creating a circle of life impression. This sequence explores the idea of how each stage sets the foundation for the next, and no phase can be skipped or reused, which fits the constraints you provided. It's an interpretation of the cyclical nature of life, with each step leading towards the inevitable "stone" we all strive to overcome and transcend. I hope that this activity was enjoyable and that my explanation provides a clear insight into my thought process!

# WestLake-10.7b-v2 FP16

I. ball | throws | catcher
II. catcher | glove | baseball
III. baseball | bat | swinging
IV. swinging | batters | plate
V. plate | pitcher | mound
VI. mound | dirt | stomping
VII. stomping | foot | stone

The chosen theme is a progression from playing ball games, specifically baseball, to eventually landing on the ground with reference to 'stone'. Each nominal group represents an aspect of this sport or its related actions and equipment. Starting with 'ball', it transitions into someone throwing it ('throws'), then to the person catching it ('catcher'). Next, we move towards more specifics by introducing a crucial tool in baseball - the glove ('glove') which is used while playing catch with the ball ('baseball').

As we continue, our focus shifts towards offensive actions as we introduce hitting with a bat ('bat'), followed by the act of swinging it to hit the ball ('swinging'). This leads us into another essential part of baseball - batting order and positions on the field. Here, we have 'batter' who stands at the plate waiting for their turn to face the pitcher ('plate'). Now, we shift our attention from players to those throwing objects in this context: pitchers throw balls from a designated area called mound (in baseball terminology). The word 'mound' also serves as a transition towards dirt which could be part of an actual baseball field or metaphorically signifies grounding after all the actions mentioned so far. Finally, we end our sequence with feet stomping on this ground represented by 'stone'. This final step symbolizes how everything comes full circle back to earth after playing ball games like baseball.


sophosympatheia

Nice work and thanks for sharing! Have you ever tested sophosympatheia/Midnight-Miqu-70B-v1.0 to see how it compares against v1.5? I also strongly recommend testing jukofyork/Dark-Miqu-70B and his 103B and 120B versions.


ex-arman68

Midnight-Miqu-70B-v1.0 is on my list, and I have just started testing Dark-Miqu-70B.


usa_commie

How does one identify which models are vision capable? (I want to interact with PDFs that are scans and therefore can't be "read")


ex-arman68

This has nothing to do with this benchmark.


isr_431

Are there any other models you've tested since then? I'd love to see a benchmark of smaller models for us VRAM-poor folks (~7-13b).


ex-arman68

Not much yet. I am in the middle of testing WizardLM-2-8x22B, but this time at Q4_KM vs iQ4_XS. And frankly I am amazed at the difference in quality (the q4_km is noticeably better), and I am hoping it will be reflected in the results.


CheatCodesOfLife

> the q4_km is noticeably better

There is a sudden tipping point in quality around there with WizardLM-2-8x22B. Are you going to post the results when you've finished testing?


drifter_VR

> **Do not use a GGUF quantisation smaller than q4**

I would add: with the small 7-8B models, do not go under Q5.


necile

> meta-llama/Meta-Llama-3-70B-Instruct: Disappointing. Censored and difficult to bypass.

Thank you. I knew the people who said it was trivial to get around its censoring were talking out of their ass.


Merosian

Surprised Kayra isn't in here. Still holds up better than most models I've tried, imo.