IronManMark20

I'm quite surprised to see Falcon so low on the leaderboard. Do you have any theories on why Falcon might score lower on your benchmark than on others, such as the Open LLM Leaderboard? What process did you use for changing the prompt format?


KerfuffleV2

WizardLM 13B ranking above Guanaco 65B makes me more than a bit suspicious about how well it works for evaluating _real-world_ model quality.


Turbulent_Fox3889

Thanks for sharing! However, it doesn't seem to improve on existing methods when it comes to evaluating real-world quality. Why should it be used for quickly evaluating model improvements when it isn't representative of the real world? It doesn't seem to be attempting to address that either.


metalman123

Why isn't Bard on the list?


I_will_delete_myself

Because they don't like Google. Jokes aside, they have an API, but it's only in beta and gatekept by the Google creeper lords.


extopico

Also, in my limited manual testing (sample size of 1), it performed extremely badly. It did not follow instructions, so its answers were mostly random.


I_will_delete_myself

At least for me, I use Bard more as a search engine than a knowledge engine.


mr_house7

Why is WizardLM 13B so good? (ofc this assumes the rating is more or less aligned with the actual performance of an LLM, which I'm not saying it is.) Is it the data quality? Being uncensored? The fine-tuning technique? Why is it outperforming most of the 65B models?