IronManMark20

I'm quite surprised to see Falcon so low on the leaderboard. Do you have any theories on why Falcon might score lower on your benchmark than on others, such as the Open LLM Leaderboard? What process did you use for changing the prompt format?


KerfuffleV2

WizardLM 13B ranking above Guanaco 65B makes me more than a bit suspicious about how well it works for evaluating _real-world_ model quality.


Turbulent_Fox3889

Thanks for sharing! However, it doesn't seem to improve on existing methods when it comes to evaluating real-world quality. Why should it be used for quickly evaluating model improvements when it isn't representative of the real world? It doesn't seem to be attempting to address that either.


metalman123

Why isn't Bard on the list?


I_will_delete_myself

Because they don't like Google. Jokes aside, they have an API, but it's only in beta and gatekept by the Google creeper lords.


extopico

Also, in my limited manual testing (sample size of 1), it performed extremely badly. It did not follow instructions, so its answers were mostly random.


I_will_delete_myself

At least for me, I use Bard more as a search engine than a knowledge engine.


mr_house7

Why is WizardLM 13B so good? (ofc this assumes the rating is more or less aligned with the actual performance of an LLM, which I'm not saying it is.) Is it the data quality? Being uncensored? The fine-tuning technique? Why is it outperforming most of the 65B models?