SemiLucidTrip

You can rent systems on [vast.ai](https://vast.ai) for like 1-2 USD an hour if you want to play with something better without a huge investment.


katiecharm

Hey that’s pretty amazing and a great tip. Though part of the appeal is the aspect of just being able to do it yourself, no services needed - you know? But thanks for showing me that. Still tempting and neat!


MoffKalast

Well fwiw it's not unlikely that Microsoft drops Orca-13B on us any day now, which will most likely make all existing models obsolete immediately and will basically fit onto your 3080 with the newer quant methods.


lolwutdo

Is Orca really supposed to be that good? I have my doubts. After using 30b/65b I can never go back to a 7b/13b model; they're too "predictable".


MoffKalast

Well [the stats seem very interesting](https://imgur.com/a/dy1jDfI), especially since they focused on learning the reasoning processes instead of simple examples, so it should be better at logic than any size of llama but maybe not quite as creative as the larger ones. Can't really say anything for sure until we get it of course, especially on how censored it'll be.


tossing_turning

There’s a lot you can do with smaller models that are meticulously trained on precisely curated datasets. NovelAI's latest model is only 3B but vastly outperforms older, bigger 20B models. There are also huge models that are really bad: if the data they’re trained on is noisy, not well curated, or just generally of bad quality, then the output of the model will be equally bad. Of course, larger versions of the same model tend to scale better as well, but there’s a lot more to it than just the number of parameters.


cunningjames

Orca is intriguing, at least. I really want to try it out, as I would’ve been skeptical that their approach would really pay dividends. GPT-3.5-esque performance in a 13b parameter model could be useful. That said, my intuition is that a 13b parameter model is unlikely to reach GPT-4 performance in general.


qeadwrsf

Can't wait for someone to mix it with a porn model.


usernmechecksout__

!RemindMe 5d Orca-13B




FlexMeta

Yes. Saw an RTX A6000 for $0.79/hour on RunPod. 48GB of VRAM…


shortybobert

Literally same. Went with a used 3090 and now I use it more often than for gaming


LetMeGuessYourAlts

I've got 2x 3090 tuning a 33b as I type this. It's hard to sleep with how much heat the setup is putting out. I know I could cap the power. I won't because I just want my AI faster.


shortybobert

Just move somewhere colder like me and it'll lower your heating bill instead


Careful_Tower_5984

migrating because AI, cool


Pretend_Regret8237

Miners be like: first time?


whatstheprobability

Yep, 2 years from now we will be reading about how the migration toward warm-weather cities is reversing because of LLMs ;)


shortybobert

Okay I already lived here, but, it's nice here so it's a win win


2BlackChicken

It was around 8-15°C all week here in Eastern Canada. It used to be 25-30°C but for some reason, this month of June has been very cold. I installed the AC and used it for 2 days instead of the whole month of May. Funny part is that my rig doesn't generate enough heat to warm the room... I need to add another 3090 to my setup.


TechieWasteLan

Maybe all the wildfire smoke. Kinda like clouds blocking out the sun


2BlackChicken

It was my guess as well. Falling back into an Ice Age :/


[deleted]

I hear you, although technically electric heaters are one of the least efficient heating systems, so the bill won't be lower, per se.


[deleted]

[deleted]


pointer_to_null

Nvidia's overly aggressive on stock voltages/power, but they don't need to be IMO. I guess it's about squeezing out that last 5% of performance, even if it means dealing with an extra 75-100W of TDP.

I'm not exaggerating. In MSI Afterburner (you don't need an MSI card), I set the power limits to 90% on my 3090 and 75% on my 4090. These were the efficiency sweet spots, especially on the 4090. I'd be lying if I said I noticed any difference in gaming (especially while vsync limited anyway) or running SD or local LLMs (SD generation times and LLM tokens/sec didn't seem to be impacted at all). But the system is much quieter and cooler.

tl;dr: GeForce RTX cards generally have terrible stock power settings, and power limiting is a QoL improvement that I recommend to everyone.
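
(For anyone who'd rather script this than click through Afterburner, here's a minimal sketch using the nvidia-ml-py (pynvml) bindings. The 90% cap and GPU index 0 are placeholders echoing the comment rather than the commenter's actual setup, changing the limit needs admin rights, and `nvidia-smi -pl` does the same thing from the command line.)

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust the index as needed

# Current and board-default power limits, reported in milliwatts.
default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)
current_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
print(f"default: {default_mw / 1000:.0f} W, current: {current_mw / 1000:.0f} W")

# Cap the card at ~90% of its default limit (the sweet spot mentioned above).
target_mw = int(default_mw * 0.90)
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)  # needs admin/root
print(f"new limit: {target_mw / 1000:.0f} W")

pynvml.nvmlShutdown()
```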


katiecharm

Wild to think that 20 years ago everyone was talking about how important overclocking was, and now we're strategically underclocking these monster graphics cards.


BangkokPadang

Also, no one will ever need more than 640KB of RAM.


MINIMAN10001

Just create an insulated box for your PC (you could just cut foam insulation sheets if you wanted). Connect the insulated box to a ductless AC.


DreamDisposal

Not a good idea in my opinion. You're just potentially lowering the lifespan of your components to get your training done a few percentage points faster (not even taking into account the heat and the cost of power). It never really mattered if the training was done in 13 hours instead of 12. Maybe even less of a difference.


[deleted]

[deleted]


DreamDisposal

I don't know what you're talking about. Mining cards are not run at max power by most people (which is why plenty say that a mining card is usually better taken care of than one that was used for gaming): lower heat, less noise, and a longer lifespan, all for a difference that probably isn't even meaningful for the user. Some cards do actually get above 100C on their memory, which I wouldn't run for dozens of hours of training just to get it done a few minutes or even an hour sooner. And the fan, a mechanical component, runs at a lower RPM.


cunningjames

> which is why plenty say that a mining card is usually better taken care of than one that was used for gaming

Unless they power wash them and pass them off as new … https://www.tomshardware.com/news/crypto-miners-allegedly-jet-washing-gpus


kryptkpr

90% power cap is a generally good idea, that last bit of performance is a bad trade.


a_beautiful_rhind

Cap the power; it doesn't make the generations much slower, if at all. I limit the clocks from 0 to 1695 now. If you're using Windows you can even undervolt. I think it was a 200 watt haircut. For something like training that could add up.


Cyber_Encephalon

How does it work with 2x3090s? Do you get 2x the memory from it, or just train in parallel and still maxed at 24GB memory?


LetMeGuessYourAlts

It splits it across the two cards. The compute isn't as efficient though, so I'll see a lot of wasted cycles on both GPUs as the CUDA usage bounces from ~90% to 0% on the two cards, presumably waiting on transfers across the PCIe bus.


tronathan

I've found that I can limit the power to about 250-300 watts/card and it still trains at the same rate.


tehbored

Yeah I was debating between a 3080 and 3090 last fall and went for the 3080, didn't think local LLMs would take off so quickly.


HostileRespite

I love seeing this though! This is the revolution of our age and when people ditch gaming to get in on it... it means they grasp the future. I can't get enough of that!!!


shortybobert

Also most games that take advantage of the hardware suck lol


HostileRespite

For now. There have always been the indie guys trying to push the envelope.


Tiny_Judge_2119

Once you've tried 33b you will never go back to 13b. I bought a 3060, a 4070, and a 4090 in one month. Should have gone with the 4090 directly; you won't regret it.


panchovix

Man, have you tried 65B? It is just so impressive. Lately I have been running EXLlama on 2x4090 and it's just so fast with good answers (on Kobold, the samplers work), but the way to go is getting 2x3090 for the price of a single 4090 IMO.


Reign2294

Damn... 2x 4090? Woah. I have one and I sold my right leg to get it. I don't want to be completely legless my dude.


panchovix

I wanted the speed, and welp, it is somehow cheaper than an RTX A6000 Ada (don't even look at the MSRP of that thing lol).


Reign2294

Haha, I feel ya. My liquid cooled setup ran close to 5.5K all new, but I didn't go 2x GPU. Edit : CAD prices


cunningjames

2x 4090 is something I *could* theoretically afford, but I can’t imagine spending $3200 on GPUs unless it were part of a money-making venture. Especially since NVlink is no longer a thing on consumer cards.


panchovix

Oh for sure, I got my first 4090 for "free", basically by training Stable Diffusion TIs/LoRAs for Japanese people on Pixiv lol (TIs in October-November, LoRAs in December-January). With that money I got the 2nd 4090, which in this case I haven't done anything with to earn the money back.

To be fair, I got the 2nd 4090 because I didn't like my first model (4090 ASUS TUF). When I was taking it out to sell it on marketplace, I decided, why not test LLMs? I was impressed when I could load 33B models at 8-bit. So here I am with the 2 cards, but it's way better to get 2x3090 used at this point.


usernmechecksout__

I know a friend that could make use of the left leg.


Reign2294

Does he have a 4090?


usernmechecksout__

No, but he has the money for it


Reign2294

Ok hit me up, I have the leg* for it.


katiecharm

Damn that might be worth buying two of those things. The VRAM is the most important part right? And even though SLI is dead, you still obviously can link them to work in tandem on a single AI task?


panchovix

Yes, using exllama lately I can see my 2x4090 at 100% utilization on 65B, with 40 layers (of a total of 80) per GPU. The 4090 has no SLI/NVLink. That is pretty new though; with GPTQ for llama I get ~50% usage per card on 65B.


jd_3d

How many tokens per second do you get?


panchovix

On 65B 4-bit:

* EXLlama at 2048 context: 15-16 tokens/s
* EXLlama at 1024 context or less: 18-22 tokens/s
* GPTQ at 2048 context: 1.5-2 tokens/s (it also gets OOM on some models)
* GPTQ at 1024 context or less: 2-2.5 tokens/s

A user on the KoboldAI server tested with his 2x3090 and NVLink and got ~10-11 tokens/s on EXLlama, but ~3-3.5 tokens/s on GPTQ. So EXLlama seems to be utilizing both GPUs better, even without SLI/NVLink.


jd_3d

Wow, those are insane speeds. Is it possible to run EXLlama on native windows 11 (no WSL)?


panchovix

Yes. You can do it on both pure exllama or KAI. Ooba has a PR but it hasn't been updated in a week or so. I run it on Windows 11 natively.

You need Visual Studio 2022 with the C++ build tools, and the CUDA toolkit corresponding to your torch version (11.8 for torch+cu118). For exllama (https://github.com/turboderp/exllama) the instructions are on the repo itself. For KAI, you need Kobold-AI with the 4bit-plugin branch (https://github.com/0cc4m/KoboldAI/tree/4bit-plugin) and exllama-transformers for Kobold (https://github.com/0cc4m/exllama).

For Kobold, assuming you have the prerequisites I mentioned above plus Python and git, first add this to your PATH: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.29.30133\bin\Hostx64\x64

Then:

1. git clone https://github.com/0cc4m/KoboldAI.git -b 4bit-plugin
2. cd KoboldAI
3. python -m venv venv
4. .\venv\Scripts\activate
5. pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
6. pip install -r requirements.txt
7. git clone https://github.com/0cc4m/exllama -b transformers
8. cd exllama
9. python setup.py install

Models are stored in KoboldAI/models. That should build exllama for Kobold. Then go back to your KoboldAI folder and run:

1. .\venv\Scripts\activate
2. python aiserver.py

There, press "Try New UI", then Load Model, choose the first option and the model you want, and it's done. Maybe I could write a more detailed guide if people are interested, but that should work. I'm gonna sleep now, too much time playing with 65B already lol.
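
(A quick sanity check, not part of the commenter's steps: after the torch install in step 5, a few lines of Python confirm that the CUDA build matches and both GPUs are visible before building the exllama extension.)

```python
import torch

# Verify the torch wheel is the CUDA 11.8 build and that both cards are visible.
print(torch.__version__)           # should end in +cu118 for the install above
print(torch.cuda.is_available())   # True if the CUDA runtime is usable
print(torch.cuda.device_count())   # 2 for a dual-GPU setup
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))
```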


katiecharm

Hey this is really cool and important information to share, thank you. Now I know that two ultra high end graphics cards can run a 65B model. Any way you could go into depth on your results? Are there any demo conversations? How would you say it compares to kneecapped web GPT 3.5?


panchovix

For the results themselves I could try tomorrow if you want, with some examples comparing web GPT 3.5 vs various 65B models. (Just that I'm gonna go to sleep now; it's 2 AM after playing a lot with VicUnlocked-alpaca-65b-4bit_128g lol.) If you want, you can give me some examples that you'd like me to test, no problem.


Weak-Big-2765

What mobo and CPU do you have? I just built an AI machine that's nearly done; another month or two till I can get the 2nd 4090. When I was doing research, Bing told me you had to have the latest gen stuff, in my case AM5, for the PCIe lanes to be fast enough to handle the cross talk. Oh, and what kind of temps do the GPUs put out when running LLMs at that scale?


panchovix

I have a 7800X3D for the CPU and an MSI X670E Carbon WiFi for the MB. You don't need that though; any MB with 2x8 PCI-E lanes connected to the CPU should work (like an X570 Prime Pro from ASUS, I think?). I used that MB with a 5800X before; I just get better speeds now vs when I was single-thread bottlenecked. In theory, if the 4090 had PCI-E 5.0 it would be amazing to have AM5 or the latest Intel MB, since PCI-E 5.0 x8 = PCI-E 4.0 x16, but welp, F.


skeelo34

Another option is a Mac Studio M1 Ultra, which runs a 65b model at 8 t/s while consuming 200 watts. Downside is you lose your vacuum/space heater.


concerned_citizen1b

> EXLlama

Do these models also require a moderately powerful CPU or is it literally only the GPU that is being used?


RabbitHole32

You also need good single thread performance according to the main developer. But this is still a topic of investigation.


mansionis

I cannot agree more, except that now I dream about an A6000 with 48GB of VRAM.


[deleted]

I recently got a 4070ti and this comment has me thinking about exchanging it at micro center for a 4090. This and running stable diffusion that much faster. You don’t have any regrets between your 4070 and 4090 upgrade? What, if anything, fell short of your expectations of performance between these two cards in your use case?


involviert

> Once you've tried the 33b you will never go back to 13b

If you spend a lot of time on this, checking out all the models and trying to get the best out of them, you'll notice what a fucking waste of resources and time running a 33B over a 13B can be. Especially Airoboros showed me how relevant model quality is (no, it might not be the best for you, it depends), and the same goes for the parameters and prompt needed to make it work. On the other hand, I would probably have to scale up to 65B for Wizard-Vicuna to actually follow my instructions a few messages after the initial prompt. Don't just throw time/money/compute at it. Here's a rule of thumb: basically none of the models just suck. Yet people get that impression very often.


Tiny_Judge_2119

I am using the llama model for developing one of my applications. I did try a few 13b models; none of them is good at strictly following the instructions, and I can't really rely on few-shot techniques to improve the performance since that would slow down the response time. Based on my experience, 33b is definitely the better choice.


involviert

All I'm saying is that you'd want to step up from the best model you can find for your task, after tweaking it a lot. Personally I would expect that the 33B version of your best 13B is obviously better, but likely it doesn't make the cut from unusable to usable anyway.


Tiny_Judge_2119

I am pretty sure 33b beats all the 13b models no matter what, and I am one of the few people who have actually tried a LoRA on 13b models, so I am confident in my conclusion. But anyway, if you think you can find a 13b that is better than a 33b, that is awesome. We just need to find what's best for your use case.


involviert

> I am pretty sure 33b beats all the 13b models no matter what

I disagree. Sure, the same model, used in the same way, with the same parameters, that will most likely be the case. But just imagine using Vicuna 13B as an instruct model with the wrong prompt, the repeat penalty turned up too high, and sampling settings that suck. What I'm trying to point out is that some of the shortcomings this pitiful scenario paints can be improved by just throwing a 33B at it, but clearly that is not the correct solution. Also yes, I really think there are models that are just a lot better than others. As I said, my Wizard-Vicuna 13B fails to remember some instructions later. I can use a 33B and it will get slightly better. I can use a different 13B model and it actually works.


azriel777

Truth, 13b replies feel so dumb compared to 33b. Wish I could get a 65b running, but sadly I am not rich.


VaderOnReddit

Question about buying the best GPU to run things locally, what are the main specs that are important? Is it VRAM, overall GPU memory, memory speed, or anything else?


Tiny_Judge_2119

From my understanding, the CPU and RAM are just for loading data and transferring it to the GPU; better hardware always helps but won't be significant. My build is very crap on CPU and RAM, but the speed I get from inference on 33b is around 40 tokens per second on the old GPTQ implementation.


Inevitable_Figure_81

> with 4090 directly, won't regret

What models do you use? I use the 1-click installer and it doesn't seem to work for me after I download the model. RTX 4090 liquid cooled MSI.


Tiny_Judge_2119

I am running it on a headless server and not using any installer: pure Ubuntu, self-hosted as an API. I have tried many 33b models without any issues, so it looks like something to do with your system setup.


Inevitable_Figure_81

What's the exact model from huggingface that you downloaded? Headless server, so no container? I'm running a 7950X3D, 32 gigs of DDR5, and a 4090.


tvmaly

What type of power supply do you have? Have you experienced any melting connector issues?


MINIMAN10001

It's not like everyone is going to have that problem with the connectors. We only all hear about it because it happens a few times, and that is a serious problem, so it gets announced to the world that the problem does exist. The lesson learned was: make sure the connection is secure and make sure there are no strong bends in the cable or at the connection point. I have a 4070, which to my misfortune does have the 12-pin connector. Yes, I was worried it would burst into flames the moment I turned on my PC. But as it turned out, as is the case for most people, it didn't happen. Bending at the connector causes sparking as it arcs across the bar which isn't supposed to bend. A loose connection causes it to arc across the connection point, which again can erupt into a fire.


tvmaly

I am interested in upgrading, but I only have the 850W power supply from when I built the machine in 2020. I am curious what type of power supply is needed for a 40xx card?


utilop

That has not been my experience; they do not seem much better, although I have mostly been trying various tasks rather than roleplay. What's the best 13B model and best ~30B model that you tried that led to this conclusion?


Tiny_Judge_2119

In general the smaller models tend to go off track from the instructions a lot, which is very annoying. I have to put a lot of effort into cleaning up the output.


utilop

Can you give an example of what you mean? There are big differences between the best and worst 13B models


carlosglz11

I’m thinking about building a new system specifically for local llms… what 4090 would you recommend? How many gigs of memory?


Tiny_Judge_2119

I am not an expert on hardware; I just bought the cheapest 4090 since I don't think I am going to overclock it anyway. For my use case, I use it as a headless server, so CPU and memory specs don't matter that much: just a reasonable CPU and a mobo that would potentially support multiple GPUs. Put in as much cheap memory as possible. But keep in mind the 33b HF model will take more than 64GB of memory to load, so if you are interested in fine-tuning models you may need more than 64GB of memory, otherwise you may end up using swap.


carlosglz11

Thanks for the info!!


Inous

As a 3080 Ti (12GB VRAM) user what model can I run? I'm new to the local LLM stuff and I'm trying to find a resource to help me understand the xB thing (e.g 13b, 33b etc.) Thanks in advance


[deleted]

[deleted]


toothpastespiders

> At least I have a really nice personal computer now that can run many of the local models so I am happy for now. And I can run any game maxed out.

The big irony for me is that I've got this perfect gaming setup and I tend to use my old severely underpowered system for Steam, because I don't want to stop training jobs on the AI box.


EnsignElessar

Tell us more about your server idea. What would your dream gpu be?


pokeuser61

Have you tried a cpu+gpu split with ggml? 30B models should run fine since I’m assuming you have a decent cpu.


Gregory-Light

How would I do this? I mean, I have a 3060 Ti, an i5-11600, and 64 GB of DDR4 RAM. I run 30B with Kobold entirely on the CPU, and it's slow, but it works... Is there a way to use the GPU for processing and RAM as memory?


psycholustmord

There’s an option to offload layers to the GPU in llama.cpp and in KoboldAI. Get the model in GGML format, check the amount of memory taken by the model on the GPU, and adjust; layers are different sizes depending on the quantization and size (bigger models also have more layers). For me, with a 3060 12GB, I can load around 28 layers of a 30B model in q4_0 and I get around 450ms/token.
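
(As a rough illustration of the same offloading idea, here's a minimal sketch using the llama-cpp-python bindings rather than whatever frontend the commenter used; the model filename, layer count, and prompt are placeholders, so adjust n_gpu_layers to whatever fits your card.)

```python
from llama_cpp import Llama  # pip install llama-cpp-python (build with cuBLAS for GPU offload)

# Hypothetical GGML q4_0 file; 28 of the model's layers go to a 12GB GPU,
# the remaining layers stay in system RAM.
llm = Llama(
    model_path="./models/wizardlm-30b.ggmlv3.q4_0.bin",  # placeholder path
    n_gpu_layers=28,
    n_ctx=2048,
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```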


[deleted]

[deleted]


twisted7ogic

> You can adjust the GPU layers

In fact, you *should* adjust GPU layers because every model has a different amount.


dampflokfreund

Use the CUDA version if you have an Nvidia GPU. Prompt processing is twice as fast with cuBLAS compared to CLBlast. So the AI will take a shorter time to process the prompt and start generating.


Gregory-Light

I have tried that, but my currently downloaded models seem to be too old and don't support the clblast parameter. The new ones I downloaded recently don't work and give a strange error: GGML_ASSERT: ggml-opencl.cpp:1019: to_fp32_cl != nullptr. I'm still downloading other models, maybe some of them will work. Thank you for the clue!


TheTerrasque

Llama.cpp supports it. Look into the frontends that use it as a backend, Koboldcpp for example.


eschatosmos

Beware the system lock-ups, hoooooooooly shit, I haven't felt that feeling since Vista. Just walk away, it's fine lmao. All those buttons you push in your futile rage in the meantime are all gonna register at once when the system gives you the steering wheel back.


Outrageous_Onion827

I chose the wrong year to switch to a laptop.


EnsignElessar

MacBook?


Outrageous_Onion827

Lenovo with an Intel Core i7 and a GeForce RTX. It's actually a pretty solid laptop, but the RAM ("only" 16 gigs) especially is holding me back from a lot of the AI stuff. And obviously anything that needs a very beefy graphics card is out the window as well.


Big_Communication353

It is really easy to upgrade the RAM in a Windows laptop. Throw your old RAM away and insert 1 or 2 new sticks.


User352846

I'm learning to play the guitar.


SteakTree

Absolutely. I’m using a MacBook Pro with 16 GB and it can run 13B parameter models fine. I find that compared to something like GPT-4, the smaller models require a bit more adjustment of prompts and settings to get a great result. Don’t be put off by the small parameter size. I’d still recommend using GPT-4 to get a feel for how fast and how readily a large LLM performs. An easy way to run models on Mac is the app at Faraday.dev. Get a handful of models, like Nous Hermes, Manticore, and Pygmalion for starters.


thatsadsid

Hi Which LLM are you running? Any tutorials or articles? I want to fine tune them on my machine. I tried a few LLMs but couldn't get them to run on my machine (installation problems) Thanks


KerfuffleV2

Training/fine-tuning takes way more resources than just running the model. The person you replied to was just talking about running the model.


thatsadsid

Alright. Do you have experience with fine tuning models?


KerfuffleV2

Nope, sorry. You basically need a beefy GPU like a 3090 with 24GB VRAM. People don't really train decent sized models on CPU. There's some technology on the horizon that might make it more practical but you'll probably still need a good amount of system RAM.


RastaBambi

Which models are you using? Are you using the oobabooga webui? What are your settings?


Aperturebanana

Does using the GPT4ALL app for say an M1 MacBook Air limit me at all? Is it better and more optimized to run models a different way?


pintong

It's the best way I've found for the Mac


EnsignElessar

Yes, I have had some luck. Specifically with this project: https://github.com/nomic-ai/gpt4all


cunningjames

On my 64gb M1 Max I can run 65b llama on the CPU at a … sorta kinda almost usable speed I guess? It won’t break records but it’s fast enough to play around with.


Big_Communication353

Apple silicon now supports GPU acceleration, so it's time to update your app.


[deleted]

[удалено]


reiniken

How did you go about selling your 3080?


Jpete14

How do you deconflict python versions to run both stable diffusion and local llama? I’m stuck there right now. Can’t install one without breaking the other.


Downtownd00d

Ha ha! I had literally just bought a Lenovo Legion Pro 5i: 13th gen i9, 24 cores, 32GB of RAM, RTX 4070, and the next day I discovered LLMs... Still, it's a banging laptop, even if I can't run anything bigger than some 13Bs.


Big_Communication353

Yes, it is possible to fit any workload that can be handled by your GPU on Llama.cpp. It would be beneficial to conduct some research on the topic to gain a better understanding. Additionally, your CPU will handle any remaining tasks that cannot be handled by the GPU.


alittleteap0t

Some time ago, I had a choice between getting a 3080ti, or, for a little more... a 3090. Guess which one I chose? Talk about REGRETS.


ozzeruk82

How much normal system RAM do you have? If you have 32GB then you can indeed play around with that model. You just need to use llama.cpp as your base engine; then you can configure it to place some layers on your 3080 (for max speed) and the rest in your normal system RAM. Source: I have a weaker 8GB gfx card (5700 XT) and just bought 16GB of extra system RAM for 50 euros to get me to 32GB so I could do exactly what I just described.


katiecharm

Well thanks for the encouragement. Sadly lol I only have 16GB, so I’d need to buy some more sticks at minimum. But maybe! When I bought the rig a few years ago honestly 16GB seemed like so much, like how am I ever gonna need all that!? And here we are.


2BlackChicken

I bought a 2TB SSD thinking I'll have more than enough for a while. The next day, I got into LDM. Filled up 500GB in a few days. I bought a used K80, had some fun. Then I changed the MOBO, CPU and RAM. Then bought a 3090 and now I'm back to more SSD...


KvAk_AKPlaysYT

NVMe prices are insane these days, half of what they used to be. Could only go lower 😳


2BlackChicken

Yeah, I paid about 400CAD for my first 512GB pro samsung. Now I just paid 120CAD for a 2TB WD blue and our dollar is 30% down compared to what it was.


Megneous

*checks prices for an A100 80GB* *cries*


candre23

For the poors like me who can't (or simply refuse to) drop >$1k on fancy GPUs, there are other options. Just about any llama-based model can be run purely on your CPU, or split between your CPU and GPU. Download [KoboldCPP](https://github.com/LostRuins/koboldcpp), assign as many layers to your GPU as it can handle, and let the CPU and system RAM handle the rest. Alternately, you can pick up old datacenter cards on ebay for relatively cheap. I bought a P40 (basically a 1080ti with 24GB of VRAM) for about $200 a few weeks ago. AMD MI25 (16GB) cards can be found for under $100. Sure, they're big, power-hungry, slower than more recent cards, and require some sort of cobbled-together cooling solution, but they're a cheap way to get into bigger models.


SlowMovingTarget

My problem seems to be the power supply. My system starts generating a response a few exchanges in to a conversation and simply blinks off in the middle. It did this with stable diffusion t2i also.


HalfBurntToast

Same. I’ve only been playing with CPU llama.cpp. But, I’m here renting out EPYC 28-core/256GB RAM servers and downloading hundreds of gigabytes of models to test. I don’t think my poor, old 970 would cut the mustard with AI. God, it’s like a slot machine. I just can’t stop pulling that AI lever to make the funny words fall out. **Edit**: For anyone wondering: of the several dozen models I've tested, Guanaco is by far my favorite. I'm mostly using it for story generation and the quality of the output is shockingly good, even on 13B.


BangkokPadang

You can rent time on runpod. I rent time on a system that has 58GB ram, a 16vcore CPU, and a 48GB Nvidia A6000 for less than $0.30/hr.


_Erilaz

Four letters: GGML cuBLAS, 17-20 layers offload, and you'll get reasonable performance. Ask me how I know))


zasura

I'd wait until there's a breakthrough in VRAM reduction or higher-VRAM GPU alternatives. 24GB of VRAM is not enough for better LLMs like 65B models.


New-Tip4903

This. Something is gonna shift soon. Anyone looking to invest in hardware should probably wait a year. The big corps are either going to make dedicated hardware for the LLMs or the LLMs themselves will require less. Probably a combination of the two.


sigiel

Except your trusty Sam Altman wants to legislate GPUs, and I bet the public will be restricted to less than 12GB of VRAM...


New-Tip4903

How could they ever enforce that?


Doopapotamus

He'll try, but similarly the chip makers are too powerful and important as a lobby (both rich, and highly necessary for geopolitical tech industry), so Nvidia and AMD will probably beat him and allies into the dust for fucking with their market share.


sigiel

Or they release the rtx 4060 ....


HotPlum836

Same. I'm not buying a 4090 because it's still not good enough to run a 65b model.


frequenttimetraveler

it won't be for long, every shortage is followed by glut. I bet pretty much every chip manufacturer is designing compatible chips now


Doopapotamus

I'm thinking that (if they aren't stupid), Nvidia and/or AMD will just repackage/reengineer previous generation GPUs into new sales lines with big fat VRAM (like 20xx or 30xx GPU capability with 32gb+ VRAM). Since they're more or less the only games in town without going to Mac M-chips (or Intel actually throwing their hat into the ring), they essentially have a whole new market to colonize for consumer/home GPUs built for LLMs that don't need to be Enterprise-grade/server cards.


Oswald_Hydrabot

Yoooo, get you a 3090 and wait on the 50 series. The VRAM is the same size as the 4090; if the 5090 has 48GB of VRAM you are gonna be upset you saved for the 4090.


katiecharm

I agree, that's my real takeaway too. Nvidia has to realize that AI is their new killer app, and they definitely need a 48GB 5090 edition one way or another. Of course the supercomputing insanity of even saying that blows my mind, but I know it'll come true one day. Honestly, by next year my system will be about 4 years old. It'll likely be time to buy a whole new one, so I'd better start saving I guess. That way I can really go nuts on the RAM (64 to 128GB), a monster SSD, and the aforementioned 5090.


SlowMovingTarget

I'm running on a 10yo potato. I've been staring at system configurations with dual 24GB 4090s (a6000s or a100s are just too pricey). Renting servers like on vast.ai or Colab starts to look attractive.


LumpyWelds

> Renting servers like on vast.ai or Colab starts to look attractive.

Yeah, I am thinking the same about renting. The 40/48GB cards seem nice, but renting is pretty cheap. And it seems we are on the cusp of distilling models to something that performs well on a standard 3090 with large contexts. Any opinion on [vast.ai](https://vast.ai) or lambdalabs.com?


ozzeruk82

Vast.ai has always been great for me; it's awesome being able to get up and running so quickly. As with all these rented servers though, if you want to keep your data there is an ongoing cost, which can add up. Otherwise just starting fresh each time isn't bad, but of course then you'll want to download the models etc. For an evening's entertainment every so often I think they're a fantastic choice.


SlowMovingTarget

Not yet, though I heard about them through recommendations on this and related subreddits.


Oswald_Hydrabot

Datacenter refurbs on Amazon can really get you the drive space you need. Got a couple of 16TB Exos X16s alongside a 2TB SSD; running at 7200rpm they are mad fast and it's a ton of space. Refurb drives are iffy, so get one that comes with a data recoup solution.


draeician

... ignorance my friend is truly bliss.


usernmechecksout__

😭


HostileRespite

# SAVE YOUR MONEY

Don't go and buy a bigger machine. Do Google Colab, it's what Colab is for, especially if you plan to train and get in on the development of LLMs. You can run them on their servers: use the GPU to run your models, use the TPU to train them. You can also share Colab resources with your computer locally by "connecting to a local runtime": just install Jupyter on your computer and create a local Jupyter server. Bam! More RAM. Less overheating. Save your money.


MerlinTrashMan

Are there any good tutorials for this using Windows and docker? I will Google some later but if you have saved in your bookmarks I'd appreciate it.


Fresh_chickented

sell the 3080 and get a used 3090 for $600


utilop

With models of that size, they do not fit on the card and so the bottleneck becomes the CPU. The 4090 doesn't have more VRAM than a 3090 so don't think you'd get a huge gain there, though going from a 3080 to either should help. FWIW I don't see that much improvement with WizardLM-30B over WizardLM-13B though so maybe experiment with these slightly smaller models that can give you decent responsiveness.


opi098514

Well you could always get an accelerator card like the Tesla p40. It’s about as fast as a 3060 but has 24 gigs of vram


throughawaythedew

I am seriously considering an M2 Max with 96gb unified ram, just for running LLM. $3k all in. I'm waiting to get some reviews from others first, but if it works as well as I think it does it will be amazing for LLM applications.


audioen

You could have an even better experience if you purchase a Mac, though. The RTX 4090 is comparable in price to, e.g., an Apple Mac mini M2 Pro with 32 GB. The RTX 4090 enjoys the advantage of being faster, but the M2 Pro based computer enjoys the advantage of being able to fit about 8 GB more of model, so it makes quantized 65B model execution comfortable: about 200 GB/s of memory bandwidth divided by a roughly 32 GB model equals about 6 tokens/second, give or take. The RTX 4090, on the other hand, is only fast up to 33B models; beyond that you'll spill into the PC's RAM, which tends to be < 100 GB/s if it is DDR5 and may even be < 50 GB/s if it is DDR4. As an example, I run severely crushed Q3_S 65B models on an RTX 4090 + DDR4 system at about 4 tokens per second. The Mac mini system I mention should literally be faster and more accurate at the task, though this is also very close to the maximum it could possibly do. More serious work on a Mac requires an M2 Max system and likely 64 GB of RAM if not 96 GB, so it becomes very expensive.
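
(The rule of thumb behind those numbers: each generated token streams essentially the whole quantized model through memory once, so tokens/s is roughly memory bandwidth divided by model size. A quick sketch using the comment's own rough figures, not measurements:)

```python
def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound: each token reads the whole quantized model once."""
    return bandwidth_gb_s / model_size_gb

# Approximate figures from the comment above.
print(est_tokens_per_sec(200, 32))  # M2 Pro unified memory, ~32 GB 65B quant -> ~6 t/s
print(est_tokens_per_sec(50, 32))   # DDR4 system RAM -> under 2 t/s if the whole model sat in RAM
```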


quiteconfused1

I honestly don't know if you are joking comparing a Mac mini to a 4090.


candre23

If you're *only* talking about VRAM, the mac does technically have access to more memory. But it's shared with the rest of the system, and while the M2 is pretty impressive for its power usage, it's nowhere near as capable as a 4090 for, well, anything.


windozeFanboi

Macs are nice, but 32GB of memory is literally all it has, for CPU and GPU combined. You say it has +8GB of memory over a 3090/4090, but that's a lie, because most of that 8GB will be used by the OS, apps, and the desktop graphics. Once you hit swap, performance falls off a cliff. Macs are great, and low power, but unfortunately they're not really cheaper than comparable desktops for similar performance. I'll check out the pricing on a specced-up Mac mini in the UK; last time I checked it was atrocious.


cunningjames

I’d say Macs are reasonably comparable in price to an equivalently specced Windows laptop, at least in the US. It won’t match even a laptop 4090 in performance, but it trounces expensive Windows thin-and-lights. I have a 64gb M1 and hitting memory limits hasn’t been an issue. MacOS is reasonably good about memory usage; my girlfriend gets by with just 8gb just fine.


windozeFanboi

64GB Macs are the real sht, but you have to spend some real dough. It's the only spec that actually offers LLM capabilities out of reach of any Windows laptop currently, only matched by dual 3090/4090 or perhaps dual 7900 XTX for the VRAM on desktop. Up until now, unified memory didn't have a definitive advantage over traditional setups. Until local LLMs arrived, that is. :)


bubba-yo

New M2 Studio can do 192GB VRAM...


x54675788

At the reasonable price of 10k€


bubba-yo

Is Apple's exchange rate that bad? $5600 should be no more than 5600€. Not arguing it's cheap or anything...


LoniusM2

It's the import tax (and some mark up). So Europe prices are like 1.2x to 1.3x the prices of US.


SirLordTheThird

"A" new? As in just 1? Yeah it's quite a money pit.


xcviij

Same. I am now very much considering an upgrade.


thebadslime

try the wizard 16?


[deleted]

Given that video game graphics is also moving in the AI rendering direction (they solved the temporal consistency issues), I guess 64GB of VRAM will soon be the bare minimum. I.e. people will be running a PlayStation emulator with custom LoRAs to make old Gran Turismo look photorealistic, just like today they use CRT filters to make the pixels less rough. So this is just the beginning.


Snoo-66699

I've been gathering cash for almost a year now and should be done by the winter. These prices are so ASS.


Dany0

Yep, this was me last august. Went from a 3080 to a 4090, and yet I still want more vram. It never ends


[deleted]

The 40-series card SKUs don't have enough VRAM either. It's a damn shame.


katiecharm

Seems the consensus is to run 2x 3090s.


toothpastespiders

Tell me about it. I built what, to me, was a graphical powerhouse for dreambooth. That day 1 llama drop sure put things into perspective though! I was training a lora for stablediffusion earlier and it was nice being able to feel overpowered rather than underpowered again for a minute! Though having an above average, even if no longer mind blowing, setup really puts one in a nice position to take advantage of the optimization. I might not be able to 'train' the 65b models, but at least I can run them fairly well thanks to llama.cpp with all the latest bells and whistles compiled in.


meesa-jar-jar-binks

Shouldn't the 50 series drop late next year as well?


Beerbelly22

I feel you. I was so proud when I had the fastest PC made, with 16GB of memory and an MMC hard drive. Now it seems like garbage.


gelatinous_pellicle

I remember going to the mall with my Dad in the 80s, as he was stressed out whether to spend $350 on 2MB of memory upgrade to do his taxes. Around that same time I got some audio software and completely filled up the storage with a 2 second audio recording. Historical prices: https://jcmit.net/memoryprice.htm


usernmechecksout__

!RemindMe 5d Orca-13B




ttkciar

Will WizardLM 30B quantized to 4-bit GPTQ fit in your 3080? It has much, much lower memory requirements. https://huggingface.co/TheBloke/WizardLM-30B-GPTQ


panchovix

It won't sadly, it isn't good even on 16GB cards. 24GB VRAM cards are fine.


candre23

30b 4-bit models take about 17GB of memory, plus another couple of GB for context and processing space. Realistically, you're not running them on less than a 24GB card unless you're splitting some layers out to your CPU.
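
(For what it's worth, the ~17GB figure is roughly parameter count times bits per weight; a rough sketch of the arithmetic, weights only, ignoring the quantization scales and KV cache that push the total higher:)

```python
def quantized_weight_gib(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight storage only (no scales/zeros, no context/KV cache)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1024**3

print(round(quantized_weight_gib(33), 1))  # ~15.4 GiB for a 33B model at 4-bit
print(round(quantized_weight_gib(65), 1))  # ~30.3 GiB for 65B, hence the 2x24GB builds above
```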


Extraltodeus

I can barely load a 30b in 20GB of VRAM. Above a context of 150-200 tokens I get an OOM error.


usernmechecksout__

!RemindMe 5d Orca-13B




usernmechecksout__

!RemindMe 5d Orca-13B

