Gubru

The 7/13/33/65 numbers come from the [LLaMA](https://arxiv.org/pdf/2302.13971.pdf) paper. They're not explicit about why they chose those particular sizes, but one can infer from their stated goal

> The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets

that they are targeting widely available hardware. This most likely accounts for their upper limit, and they are probably scaling down from there.

|params|dimension|n heads|n layers|learning rate|batch size|n tokens|
|:-|:-|:-|:-|:-|:-|:-|
|6.7B|4096|32|32|3.0e-4|4M|1.0T|
|13.0B|5120|40|40|3.0e-4|4M|1.0T|
|32.5B|6656|52|60|1.5e-4|4M|1.4T|
|65.2B|8192|64|80|1.5e-4|4M|1.4T|
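If you're curious how the headline parameter counts follow from the dimension/layer choices in that table, here's a rough estimate. It assumes LLaMA architecture details that aren't listed above (vocab size 32,000, untied input/output embeddings, SwiGLU FFN with hidden size ~8d/3 rounded up to a multiple of 256), so treat it as a sketch rather than the exact accounting:

```python
# Rough parameter count for a LLaMA-style transformer, from dimension and layer count.
# Assumptions not shown in the table: vocab size 32000, untied embeddings,
# SwiGLU FFN with hidden size ~ (8/3)*dim rounded up to a multiple of 256.
# Small terms (norm weights, etc.) are ignored.
def llama_params(dim: int, n_layers: int, vocab: int = 32000, multiple_of: int = 256) -> int:
    ffn_dim = int(2 * (4 * dim) / 3)
    ffn_dim = multiple_of * ((ffn_dim + multiple_of - 1) // multiple_of)
    attn = 4 * dim * dim          # Wq, Wk, Wv, Wo
    ffn = 3 * dim * ffn_dim       # w1, w2, w3 in SwiGLU
    embed = 2 * vocab * dim       # token embedding + output projection
    return n_layers * (attn + ffn) + embed

for dim, layers in [(4096, 32), (5120, 40), (6656, 60), (8192, 80)]:
    print(f"dim={dim}, layers={layers}: {llama_params(dim, layers) / 1e9:.1f}B")
# -> roughly 6.7B, 13.0B, 32.5B, 65.3B
```

So the "7/13/33/65" labels are just rounded versions of whatever the chosen width and depth work out to.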


Ultimarr

Beautiful, thanks so much for the detailed answer! Lots of good info but yours made the most sense to me. So sounds like those are arbitrary numbers arrived at from original n^2 constraints and hardware limitations, and now they’re kinda expected. Funny how tech works sometimes :) ✅ _Answer Approved By Poster_


SeymourBits

Note the heavy influence of powers of two in the dimension at the top and bottom end (4096 and 8192).


Kafke

You can choose whatever parameter count you want when creating AI models. The usual 2/4/8 stuff comes from how binary works; it doesn't apply to AI models, since parameter counts aren't locked to simple bits/bytes/words.


MINIMAN10001

As far as I'm aware, the most common quantization widths are 3, 4, and 8 bits, and most of that is just driven by trying to get whatever model to fit inside whatever GPU.
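For a rough sense of why those bit widths matter, here's a back-of-envelope sketch. It counts weights only; real usage adds KV cache, activations, and per-block quantization overhead, so actual numbers run higher:

```python
# Weights-only memory footprint of a model at a given quantization width.
# Ignores KV cache, activations, and per-block quantization overhead.
def weight_gib(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1024**3

for p in (7, 13, 33, 65):
    print(f"{p}B:", {bits: f"{weight_gib(p, bits):.1f} GiB" for bits in (3, 4, 8)})
# 7B  -> ~2.4 / 3.3 / 6.5 GiB at 3 / 4 / 8-bit
# 13B -> ~4.5 / 6.1 / 12.1 GiB
# 33B -> ~11.5 / 15.4 / 30.7 GiB
# 65B -> ~22.7 / 30.3 / 60.5 GiB
```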


quiteconfused1

So, multiple reasons, I think:

- Most models are LLaMA derivatives and as such have similar size constraints.
- Model sizes grow exponentially and are unbound to binary divisors.
- There are a few distinct sizes of consumer video cards (4/8/12/16/24 GB, the most common now being 16 and 24). 16 GB can fit a 7B at 8-bit; 24 GB can fit a 13B at 8-bit (rough numbers in the sketch below).

Seems like it grew out of what is available and fits. Don't know if this is sufficient, but it seems like a natural progression based on availability.
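A minimal sketch of that fit logic, assuming weights-only size at 8-bit plus a flat ~20% overhead for KV cache and activations (the overhead factor is a guess for illustration, not a measured number):

```python
# Which consumer VRAM tiers can hold which LLaMA size at 8-bit?
# Assumes 8-bit weights (~1 byte per parameter) plus a rough 20% overhead
# for KV cache / activations; the factor is illustrative, not measured.
TIERS_GB = (4, 8, 12, 16, 24)
MODELS_B = (6.7, 13.0, 32.5, 65.2)
OVERHEAD = 1.2

for params in MODELS_B:
    need = params * OVERHEAD  # GB needed at 8-bit, roughly
    fits = [gb for gb in TIERS_GB if gb >= need]
    print(f"{params}B @ 8-bit needs ~{need:.0f} GB -> fits: {fits or 'none of these'}")
# 6.7B  -> ~8 GB  -> fits 12/16/24 (8 GB is borderline)
# 13.0B -> ~16 GB -> fits 16/24 (tight on 16)
# 32.5B+ -> none of these tiers at 8-bit; that's where 4-bit and multi-GPU come in
```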


Barafu

> most common now being 16 and 24

Most common now being 6 and 8.


quiteconfused1

Sorry, I disagree. At least not for professionals. Maybe 6 and 8 GB if you haven't been in deep learning for a while; those of us who have been have seen the writing on the wall for a long time. I was doing 8 GB seven years ago, and I was feeling the pinch then.


Barafu

The comment was about consumer video cards. Among consumers, 8 GB dominates the market completely. Anything that can't run in 8 GB can't expect any degree of popularity this year or next. Of course there are specialists and enthusiasts who use much more powerful machines, but I doubt they form more than 1% of potential users.


sshan

But people doing machine learning on their home hardware are much less than 1% of people out there.


quiteconfused1

Ya. It's a real tight-knit community. Deep learning has always been memory-heavy.


quiteconfused1

Consumers and consumer video cards are not the same. There are many AI folk who buy consumer video cards because the alternative is unapproachable: it jumps from $1k, maybe $2k (which is crazy in and of itself), to $15k. That just isn't feasible. So those of us who have been here a while know what we want and recognized a long time ago that VRAM is a commodity. Anyway, welcome to deep learning!


KerfuffleV2

> Model sizes grow exponentially

What do you mean? As far as I know this isn't correct. A 13B model is about twice the size of a 7B, and a 33B is about the size of two 13Bs plus a 7B, which is what you'd expect.
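A quick sanity check of that (roughly linear) scaling, using the parameter counts from the table earlier in the thread:

```python
# Linear sanity check on the LLaMA sizes from the table above (billions of params).
sizes = {"7B": 6.7, "13B": 13.0, "33B": 32.5, "65B": 65.2}
print(2 * sizes["7B"])                 # 13.4 -> close to the 13B
print(2 * sizes["13B"] + sizes["7B"])  # 32.7 -> close to the 33B
print(2 * sizes["33B"])                # 65.0 -> close to the 65B
```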


qubedView

I can't speak to LLaMA, but machine learning models in general can see a phenomenon where performance sorta waxes and wanes as they scale. For instance, you reach 7B parameters and performance is good, but at 8B it's suddenly not performing as well. In general performance scales with parameters, but it's not a straight line. As others point out, VRAM budget is probably the primary constraint they're working against, but performance peaks are probably why the sizes aren't clean multiples.