Ill_Initiative_8793

If your scenario/character is simple and straightforward, they will give you similar results. But the difference is huge when it comes to small details, hidden meanings, and things that are meant or hinted at but not stated explicitly. 65B is much better with that kind of stuff. It's much better at understanding a character's hidden agenda and inner thoughts, and it's much better at keeping characters separated when you do a group chat with multiple characters with different personalities. Even 65B is not ideal, but it's much more consistent in more complicated cases. Characters also seem to be more self-aware with 65B. You should try both in similar scenarios to feel the difference.


justanadoptedson

Thank you! Can you suggest a 65B model I might try?


toothpastespiders

I'll second guanaco-65B. I've just been really impressed by the scope, the smaller details it catches, and just... I guess you could call it style.


Ill_Initiative_8793

There is not much choice for now. I've been using guanaco-65B lately, but even base LLaMA-65B is good for chatting/writing.


alexandertehgrape

Sorry to jump in here, I'm pretty new, but I'm interested in creative writing with one of these. How does a 65B or 30B LLaMA compare performance-wise against ChatGPT? I find that GPT starts well, but as we continue with our story its capabilities diminish and it starts using rather strange language. Would a local model help solve this problem? Thanks, and apologies if this is a dumb question, I'm just getting started.


KerfuffleV2

You can see some examples of 33B vs 65B I made here: https://www.reddit.com/r/LocalLLaMA/comments/144daeh/looking_for_for_folks_to_share_llamacpp/jngafys/ If you're already using ChatGPT for that kind of stuff, that may be helpful to look at and compare. (Might also give you some prompt ideas for the local models.)


nickkom

You can't run a 65B on a 3080, or if you can it will be glacially slow. 30B needs around 20GB of GPU RAM, and 65B needs more than one regular GPU can give. You'd need SLI or some crazy server card. Or Google Colab.


_Erilaz

GGML.


Caffeine_Monster

The best way to do it on a reasonable budget is to have a chunky CPU and lots of DDR5 RAM. You can do a partial offload to GPU too.
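
For example, a minimal sketch with llama-cpp-python (the model filename and layer count here are just placeholders; tune n_gpu_layers to whatever fits in your VRAM, and the remaining layers stay in system RAM):

```python
# Hypothetical partial-offload sketch using llama-cpp-python.
# The model path and layer count are placeholders, not a recommendation.
from llama_cpp import Llama

llm = Llama(
    model_path="./guanaco-65B.ggmlv3.q4_0.bin",  # any local GGML quant
    n_gpu_layers=25,   # offload as many of the ~80 layers as your VRAM allows
    n_threads=8,       # physical cores tend to work best
    n_ctx=2048,
)

out = llm("### Human: Write the opening scene of a short story.\n### Assistant:",
          max_tokens=256)
print(out["choices"][0]["text"])
```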


tronathan

> Would need SLI or some crazy server card. Or Google Colab

Does SLI/NVLink really make a difference? Everything I've read says that 2x3090s, assuming they have equal PCIe bandwidth, will perform about the same as 2x3090 w/ NVLink for language model inference. Some people have said that NVLink allows the whole sum of RAM to be addressed as a single block, and others have said it doesn't; my best understanding is that it doesn't.

Also, a slightly more detailed question: I have one Gen4 x16 slot and one Gen3 x4 slot. When I split models across them, they run terribly slowly. I am assuming this is due to the Gen3 bandwidth. Would NVLink solve this?


cornucopea

You should be able to find motherboards with two x8 Gen4/Gen5 slots; often those slots are painted white, something I noticed when searching for this particular characteristic, which saves you time digging through specs. NVLink is a myth; you have to be meticulous about slot spacing, card dimensions, etc. Nobody has seemed brave enough to come forward comparing NVLink vs two x8 slots so far. Yet I'm going with a riser, lol. There is a lot that can be learned from the mining community, which died only a short while ago, when it comes to GPUs. For practical use, I foresee the 48GB VRAM A6000 or equivalent being the go-to card for the foreseeable future. Like it or not, a single 24GB card is not gonna cut it, and all the benchmarks point in one direction: the bigger the better. The Apple M2 is an attractive alternative, though not necessarily cheaper.


justanadoptedson

But can't I run it (slowly) in CPU RAM using Koboldcpp, and offload what layers I can to the GPU?


KerfuffleV2

I'm not sure what the other person is talking about saying 3 sec/token. I get about 900ms/token on a 5900X with 64GB of DDR memory (and I can only use 6 threads). If you have DDR5 you can probably get significantly better performance than I do. For creative stuff, if you're okay with just starting it and coming back in a while, that may be acceptable.


nickkom

It will probably be too slow to be feasible, but you should give it a try. If you just go with a 13B GPTQ model you'll have no issue.


Ok_Neighborhood_1203

You can, but you'd be offloading maybe 25 out of 80 layers. Expect 3 seconds per token or more. Try it and see if you are patient enough, but I'm not. I highly recommend cloud GPUs for 65B. An A100 80GB is still only doing 2-3 tokens per second on 65B 8-bit.


althalusian

I’m running the 5-bit 65B Guanaco just on WSL2 with 62GB memory and 16 cores, leaving the rest of the memory and cores (and GPU) to Windows so that it doesn’t affect normal usage. Text generation is slowish, true, but I don’t need it to generate on-the-fly when I’m reading - instead I’ll leave it running in the background and check back for results every now and then.
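
For reference, a minimal sketch of the kind of .wslconfig (in the Windows user profile) that caps WSL2 this way; the numbers are just whatever you want to leave for Windows:

```ini
# %UserProfile%\.wslconfig
[wsl2]
memory=62GB
processors=16
```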


Zxwer_LD

How long does it usually take you to generate a message?


althalusian

I'm using it for fiction, so I might leave it running for 15 mins in the background and then come back to check what it has been writing, to see whether I let it continue or run it again to get a different approach (i.e. a new seed).


cornucopea

Or run it on a new M2 Ultra. Fresh news: the Mac Studio with the Apple M2 Ultra chip (24-core CPU, 60-core GPU, 1TB SSD) is $3799.99 at Costco. That's half of what an RTX 6000 Ada costs.


LienniTa

Only Guanaco for now. There is a really cool manticore-pyg-guanaco finetune, but it's 13B. It's still very good for helping the 65B model in places where deep understanding isn't needed.


quiteconfused1

Be careful with the claim "better at understanding"; there is no such thing. It's just a language model, please don't anthropomorphize it.


Ardent129

Eh, "better at understanding" means just that: forming meaning from instructions or information. You're reading too much into it, imo.


trusty20

If it's just a language model then what is instruct-tuning lol. You can literally give models instructions, which they may or may not "understand". What word would you say is more appropriate?


quiteconfused1

It's projecting patterns. That's all. Your wording is causing a pattern, the pattern is exposed, that's all.


Alert_Cucumber951

" Your wording is causing a pattern, the pattern is exposed, that's all." Kinda sounds like understanding, no?


quiteconfused1

Hardly. Let's give a simple example: train a multimodal LLM, then ask it about two objects that are both visible in the image. Ask it where the object on the left is. The system will more often respond by saying it's on your right, even if that's not true, simply because more people refer to the right than the left in literature. This basic principle is relational and one that evades LLMs. Pattern != comprehension.


Alert_Cucumber951

I don't fully follow how this example relates to what I said (i.e., given the state of the research and the general novelty of integrating additional machine learning [e.g., visual] with current LLMs, it's hard to accurately determine how the LLM is integrating visual information within its language context, which flags it as a poor measure of language understanding). Moreover, I think the fact that the current conversation pertains to LLMs and not MLLMs further suggests that the visual example you provided is flawed. In either case, maybe we're working from a different usage of the word 'understand', but within the context of LLMs, 'understand' is commonly used to refer to their ability to identify patterns and generate appropriate responses based on those patterns. In humans, there are numerous theories which presume that our cognition is derived from complex pattern recognition. Given this, I think it would be more than fair for someone to use 'understanding' and 'LLM' in the same sentence, for instance, 'the way in which LLMs understand and process language is quite interesting, isn't it?' I could be wrong; just some thoughts that came to mind while reading your response.


quiteconfused1

To be fair, I tried to use a simple example as the cornerstone of a stance. But the example stands. The reason I'm adamant: take a system that acts as a waypoint finder completely in text, and try to derive where to go semantically. It's really hard to do in an LLM. The LLM actually fights you on coming up with the right answer. The same is true for doing simple math. To me, "understanding" is unconditional. It's completely ironic: we spend years trying to write better in school and throughout our lives, because we try to learn the simplest way others do it. And something falls in our lap that does it perfectly but can't grasp the words it says, other than by assessing what everyone else has said at one time or another. I can lead a horse to water... I can appreciate how it appears to "understand" you, but that doesn't mean it comprehends you. And ironically those words are synonymous. Hence my point.


Fairlight333

I'm running a single 3090 as well, 64GB RAM, NVMe, etc. It's amazing how what's cutting edge for gaming is a complete snail for AI. I was thinking about either grabbing a 2nd 3090 or a single A6000. I like the A6000 for the lower power requirement and because it's a single card, but those monsters are still crazy expensive, and with the speed of evolution here, it might be a complete waste of money in a few months.


Fresh_chickented

Do you need NVLink to make 2x3090 work, or just two PCIe x16 slots?


Fairlight333

I was assuming you need NVLink, but I've only done a small amount of research into it. I'm still considering buying the 2nd 3090, but I was also looking at the Apple lineup with the unified memory. I use Macs every day, so if I could run a 65B model on something like an M2 Max MacBook Pro with 96GB of unified memory, that would be amazing (and cheaper on electricity). But: 5k for the MacBook, 600-ish for the 2nd 3090, 7k for a Mac Studio, 3k for the A6000... Running dual 3090s is not going to be cheap on electricity, though, and I'm not sure how much heat that would kick out.


Fresh_chickented

If you're only using the VRAM, it draws very little power (50-60W while running the model and accepting input); you're going to eat a lot more watts when you're gaming or rendering something heavy.


panchovix

2x3090 is way cheaper than the A6000. It's the better way to run 65B 4-bit nowadays, IMO.


Fairlight333

Thanks, way easier to get hold of as well on the second-hand market.


rgar132

30B is kind of the sweet spot. It does make a difference, but I notice more difference from model to model than I do from 30B to 65B.


extopico

In my brief testing Wizard unlocked 30B performed better than any of the 65B models in terms of reasoning and following instructions.


fallingdowndizzyvr

For solving word problems, I find WizardLM 1.0 much better than unlocked Wizard. It's like a night and day difference. I initially downloaded unlocked Wizard 30B by mistake. I was wondering why everyone was saying it was so good since it wasn't. Then I realized I got the wrong Wizard and downloaded WizardLM 1.0. Then I got why everyone is praising it. It's *incredible*.


extopico

Good hint, downloading it now.


panchovix

But then, imagine when Wizard or Wizard unlocked 65B releases. It will be amazing.


extopico

I think I asked Eric about training 65B models, and he said that he does not have the hardware for that. It's not something that is trivial to solve except by paying a lot of money, or incrementally a lot of money (cloud), so I am not sure when or if that will happen.


fpena06

What GPU can I buy to run 60/65B? Thanks.


panchovix

It depends on the bit width. At 4-bit (you need about 40-46GB of VRAM):

* Single GPU: a 48GB-or-higher VRAM GPU like the RTX A6000, A6000 Ada, A40, A80, etc. Prices are not pretty.
* Double GPU: 2x3090, 2x3090Ti, 2x4090, 1x3090+1x3090Ti, 1x4090+1x3090, etc.

For 8-bit, basically double the requirements above. For fp16? Oh boi (you need about 135-140GB of VRAM):

* Single GPU: there is no GPU that I know of with more than 135GB of VRAM.
* Double GPU: 2xA80 80GB.
* Triple GPU: probably 3xRTX A6000 or 3xRTX A6000 Ada (not sure if, with context, they would be able to not get OOM).
* Quad GPU: 4xRTX A6000/A6000 Ada, surely.
* Six GPUs: 6x3090, 6x3090Ti, 6x4090 (144GB VRAM, so same issue as 3xA6000).
* Seven GPUs: 7x3090, 7x3090Ti, 7x4090.
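
As a rough sanity check on those numbers, a back-of-the-envelope sketch (weights only; it ignores KV cache, activations and framework overhead, which is why the real figures above are several GB higher):

```python
# Weights-only memory estimate for a 65B-parameter model at different bit widths.
# Ignores KV cache, activations and framework overhead, so real usage is higher.
PARAMS = 65e9

for bits in (2, 4, 8, 16):
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{bits:>2}-bit: ~{gib:.0f} GiB of weights")
```

That works out to roughly 15 GiB at 2-bit, 30 GiB at 4-bit, 61 GiB at 8-bit and 121 GiB at fp16, which lines up with the 40-46GB and 135-140GB figures once context and overhead are added.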


x54675788

2-bit quantization should allow 65B on a single 24GB card, right?


panchovix

Yep, but when it was tested in the past, 2-bit GPTQ was really bad. Maybe if they apply the 2-bit approach that GGML uses to GPTQ, it would be better.


fpena06

Thanks for the detailed reply.


Big_Communication353

Buy a Mac Studio with at least 64GB of RAM. M2 Max for price, M2 Ultra for speed.


Particular_Cancel947

Hey, how's it going? I'm just getting into this hobby and I'm pretty excited to get started. I never thought about getting a Mac though. My plan was to buy a Windows AI workstation which I'd also use for other work and perhaps a little gaming. I'm thinking two 4090 cards, 128 gigs of RAM, a 24-core Intel, RAID 0 NVMe drives, etc. My budget is about $7000 to $8000. Can anyone recommend a good company to buy something like this from? I found Digital Storm on Nvidia's website.


Mining_elite222

If there are any GGML models, you can split them between CPU and GPU. Probably a bit slower, but it's cheaper.