LLMFAO: Large Language Model Feedback Analysis and Optimization
This is a minimalistic large language model (LLM) leaderboard built from human and machine feedback on pairwise comparisons of model responses, covering a carefully selected set of 13 prompts and 59 different models.
When you see the outputs of two different systems for the same query, you can determine the better one with a fairly short instruction. I decided to give pairwise comparisons a try on the data kindly provided by the llmonitor.com team.
I asked carefully chosen crowd annotators to evaluate every pair and determine the winner; if both models performed similarly well or similarly poorly, the pair was marked as a tie. Each pair was evaluated by five different annotators according to this instruction, and 124 annotators took part in total. I also asked GPT-3.5 Turbo Instruct and GPT-4 to do the same using a shorter evaluation prompt, but I subjectively found the human judgements to be superior.
A more detailed description of this study is available at https://evalovernite.substack.com/p/llmfao-human-ranking.
The datasets and code are available at https://huggingface.co/datasets/dustalov/llmfao and https://github.com/dustalov/llmfao under open-source licenses.
The pairwise comparisons are transformed into scores using the Evalica library.
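For illustration, here is a minimal sketch of what an aggregate helper along these lines might look like on top of Evalica's Bradley-Terry routine. The CSV column names (left, right, winner), the outcome encoding, and the normalization step are assumptions made for this sketch, not the notebook's actual code.

import pandas as pd
from evalica import Winner, bradley_terry

def aggregate(path: str) -> pd.DataFrame:
    # Hypothetical CSV layout: one comparison per row with "left", "right",
    # and "winner" columns; the released files may use different names.
    df = pd.read_csv(path)
    winners = df["winner"].map({"left": Winner.X, "right": Winner.Y, "tie": Winner.Draw})

    # Fit a Bradley-Terry model over the pairwise comparisons.
    result = bradley_terry(df["left"], df["right"], winners)

    # Normalize the scores to sum to one (an assumption for this sketch).
    scores = (result.scores / result.scores.sum()).sort_values(ascending=False)

    # Count how many comparisons each model took part in.
    pairs = pd.concat([df["left"], df["right"]]).value_counts()

    out = pd.DataFrame({"score": scores, "pairs": pairs.reindex(scores.index)})
    out["rank"] = out["score"].rank(ascending=False, method="min").astype(int)
    return out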
Human Judgements
df_crowd = aggregate("crowd-comparisons.csv")
df_crowd
model | score | pairs | rank |
---|---|---|---|
GPT 4 | 0.041215 | 158 | 1 |
Platypus-2 Instruct (70B) | 0.029233 | 159 | 2 |
command | 0.028852 | 322 | 3 |
ReMM SLERP L2 13B | 0.027150 | 153 | 4 |
LLaMA-2-Chat (70B) | 0.026384 | 161 | 5 |
Claude v1 | 0.026259 | 160 | 6 |
GPT 3.5 Turbo | 0.025871 | 366 | 7 |
Jurassic 2 Mid | 0.025849 | 175 | 8 |
Jurassic 2 Ultra | 0.025310 | 165 | 9 |
command-nightly | 0.025225 | 169 | 10 |
Mythalion 13B | 0.024030 | 143 | 11 |
GPT 3.5 Turbo (16k) | 0.023981 | 381 | 12 |
Falcon Instruct (40B) | 0.023752 | 348 | 13 |
GPT-NeoXT-Chat-Base (20B) | 0.023268 | 160 | 14 |
Chronos Hermes (13B) | 0.023229 | 163 | 15 |
Claude v2 | 0.022926 | 167 | 16 |
Claude Instant v1 | 0.022809 | 163 | 17 |
MPT-Chat (7B) | 0.022213 | 174 | 18 |
LLaMA-2-Chat (7B) | 0.021364 | 324 | 19 |
LLaMA 2 SFT v10 (70B) | 0.020693 | 167 | 20 |
Claude v1.2 | 0.019842 | 275 | 21 |
Guanaco (65B) | 0.018083 | 187 | 22 |
Pythia-Chat-Base (7B) | 0.017827 | 147 | 23 |
MythoMax-L2 (13B) | 0.017503 | 173 | 24 |
PaLM 2 Bison (Code Chat) | 0.017395 | 156 | 25 |
LLaMA-2-Chat (13B) | 0.017348 | 157 | 26 |
Guanaco (13B) | 0.017318 | 161 | 27 |
Alpaca (7B) | 0.016575 | 166 | 28 |
Luminous Supreme Control | 0.016541 | 147 | 29 |
Guanaco (33B) | 0.016505 | 345 | 30 |
Vicuna v1.5 (13B) | 0.016458 | 161 | 31 |
Jurassic 2 Light | 0.015635 | 368 | 32 |
Luminous Base Control | 0.015556 | 121 | 33 |
Qwen-Chat (7B) | 0.015487 | 161 | 34 |
MPT-Chat (30B) | 0.015332 | 164 | 35 |
Vicuna v1.3 (13B) | 0.015278 | 167 | 36 |
RedPajama-INCITE Chat (7B) | 0.014452 | 159 | 37 |
Falcon Instruct (7B) | 0.013655 | 150 | 38 |
command-light | 0.013632 | 547 | 39 |
Luminous Extended Control | 0.013155 | 126 | 40 |
Vicuna v1.3 (7B) | 0.011941 | 159 | 41 |
Weaver 12k | 0.011844 | 2762 | 42 |
PaLM 2 Bison | 0.011222 | 321 | 43 |
Luminous Base | 0.010406 | 550 | 44 |
RedPajama-INCITE Chat (3B) | 0.010148 | 242 | 45 |
Code Llama Instruct (34B) | 0.010096 | 254 | 46 |
Code Llama Instruct (13B) | 0.009999 | 315 | 47 |
Airoboros L2 70B | 0.009754 | 325 | 48 |
Dolly v2 (12B) | 0.009161 | 1003 | 49 |
StarCoderChat Alpha (16B) | 0.008508 | 533 | 50 |
Open-Assistant Pythia SFT-4 (12B) | 0.008371 | 428 | 51 |
Luminous Extended | 0.008072 | 728 | 52 |
Luminous Supreme | 0.007237 | 369 | 53 |
Code Llama Instruct (7B) | 0.007230 | 297 | 54 |
Open-Assistant StableLM SFT-7 (7B) | 0.006986 | 390 | 55 |
Koala (13B) | 0.006894 | 264 | 56 |
Dolly v2 (7B) | 0.006343 | 216 | 57 |
Vicuna-FastChat-T5 (3B) | 0.006304 | 251 | 58 |
Dolly v2 (3B) | 0.006294 | 239 | 59 |
pairwise(df_crowd)
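The pairwise helper above presumably visualizes the head-to-head picture implied by these scores. As a rough sketch of one way to do that (not necessarily what the notebook's helper does): under the Bradley-Terry model, the probability that model i beats model j is s_i / (s_i + s_j), which yields a win-probability matrix.

import pandas as pd

def pairwise(df: pd.DataFrame) -> pd.DataFrame:
    # Bradley-Terry implied win probabilities: p(i beats j) = s_i / (s_i + s_j).
    s = df["score"].to_numpy()
    win_prob = s[:, None] / (s[:, None] + s[None, :])
    return pd.DataFrame(win_prob, index=df.index, columns=df.index)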
Evaluation with GPT-4
df_gpt4 = aggregate("gpt4-crowd-comparisons.csv")
df_gpt4
model | score | pairs | rank |
---|---|---|---|
GPT 3.5 Turbo | 0.192920 | 90 | 1 |
GPT 3.5 Turbo (16k) | 0.182214 | 89 | 2 |
Airoboros L2 70B | 0.113152 | 75 | 3 |
Claude v1.2 | 0.100809 | 66 | 4 |
LLaMA-2-Chat (7B) | 0.047854 | 78 | 5 |
GPT 4 | 0.036313 | 39 | 6 |
Claude v1 | 0.028917 | 39 | 7 |
LLaMA-2-Chat (70B) | 0.027002 | 39 | 8 |
Claude Instant v1 | 0.017927 | 39 | 9 |
Mythalion 13B | 0.016102 | 36 | 10 |
Claude v2 | 0.015999 | 39 | 11 |
LLaMA 2 SFT v10 (70B) | 0.015298 | 39 | 12 |
ReMM SLERP L2 13B | 0.015095 | 39 | 13 |
Platypus-2 Instruct (70B) | 0.014798 | 39 | 14 |
command | 0.012952 | 79 | 15 |
Guanaco (65B) | 0.012044 | 42 | 16 |
command-nightly | 0.011518 | 39 | 17 |
MythoMax-L2 (13B) | 0.009588 | 39 | 18 |
PaLM 2 Bison | 0.009217 | 75 | 19 |
Jurassic 2 Mid | 0.007445 | 39 | 20 |
PaLM 2 Bison (Code Chat) | 0.007409 | 39 | 21 |
MPT-Chat (30B) | 0.007137 | 39 | 22 |
Falcon Instruct (40B) | 0.007106 | 85 | 23 |
LLaMA-2-Chat (13B) | 0.006734 | 39 | 24 |
Vicuna v1.5 (13B) | 0.006050 | 39 | 25 |
Vicuna v1.3 (13B) | 0.004837 | 39 | 26 |
Luminous Extended Control | 0.004672 | 30 | 27 |
Chronos Hermes (13B) | 0.004635 | 39 | 28 |
Jurassic 2 Ultra | 0.004062 | 39 | 29 |
Guanaco (13B) | 0.003941 | 39 | 30 |
Code Llama Instruct (7B) | 0.003637 | 69 | 31 |
Qwen-Chat (7B) | 0.003572 | 39 | 32 |
MPT-Chat (7B) | 0.003561 | 39 | 33 |
GPT-NeoXT-Chat-Base (20B) | 0.003227 | 39 | 34 |
Luminous Supreme Control | 0.003062 | 33 | 35 |
Code Llama Instruct (34B) | 0.002829 | 60 | 36 |
Dolly v2 (7B) | 0.002821 | 53 | 37 |
Luminous Base Control | 0.002795 | 30 | 38 |
StarCoderChat Alpha (16B) | 0.002740 | 129 | 39 |
Vicuna v1.3 (7B) | 0.002725 | 39 | 40 |
Falcon Instruct (7B) | 0.002668 | 39 | 41 |
Pythia-Chat-Base (7B) | 0.002473 | 39 | 42 |
Alpaca (7B) | 0.002334 | 39 | 43 |
RedPajama-INCITE Chat (7B) | 0.002206 | 39 | 44 |
Weaver 12k | 0.001951 | 664 | 45 |
Code Llama Instruct (13B) | 0.001636 | 78 | 46 |
Dolly v2 (3B) | 0.001487 | 58 | 47 |
Jurassic 2 Light | 0.001462 | 88 | 48 |
Koala (13B) | 0.001286 | 61 | 49 |
command-light | 0.001212 | 129 | 50 |
Vicuna-FastChat-T5 (3B) | 0.001192 | 60 | 51 |
Guanaco (33B) | 0.001133 | 85 | 52 |
Open-Assistant StableLM SFT-7 (7B) | 0.001073 | 88 | 53 |
RedPajama-INCITE Chat (3B) | 0.000960 | 56 | 54 |
Open-Assistant Pythia SFT-4 (12B) | 0.000575 | 105 | 55 |
Dolly v2 (12B) | 0.000569 | 234 | 56 |
Luminous Base | 0.000485 | 136 | 57 |
Luminous Supreme | 0.000384 | 87 | 58 |
Luminous Extended | 0.000198 | 177 | 59 |
pairwise(df_gpt4)
Evaluation with GPT-3
df_gpt3 = aggregate("gpt3-crowd-comparisons.csv")
df_gpt3
model | score | pairs | rank |
---|---|---|---|
command | 0.045338 | 79 | 1 |
GPT 3.5 Turbo (16k) | 0.041943 | 89 | 2 |
GPT 3.5 Turbo | 0.037131 | 90 | 3 |
LLaMA-2-Chat (7B) | 0.036827 | 78 | 4 |
LLaMA-2-Chat (70B) | 0.033971 | 39 | 5 |
ReMM SLERP L2 13B | 0.033579 | 39 | 6 |
MythoMax-L2 (13B) | 0.032547 | 39 | 7 |
Claude v1.2 | 0.031428 | 66 | 8 |
GPT 4 | 0.029375 | 39 | 9 |
GPT-NeoXT-Chat-Base (20B) | 0.026620 | 39 | 10 |
PaLM 2 Bison (Code Chat) | 0.026151 | 39 | 11 |
Weaver 12k | 0.020631 | 664 | 12 |
Falcon Instruct (40B) | 0.020577 | 85 | 13 |
Claude v2 | 0.020364 | 39 | 14 |
Claude Instant v1 | 0.020276 | 39 | 15 |
Airoboros L2 70B | 0.019754 | 75 | 16 |
StarCoderChat Alpha (16B) | 0.018511 | 129 | 17 |
Guanaco (33B) | 0.018430 | 85 | 18 |
Platypus-2 Instruct (70B) | 0.017478 | 39 | 19 |
Claude v1 | 0.017469 | 39 | 20 |
Jurassic 2 Ultra | 0.017469 | 39 | 20 |
MPT-Chat (30B) | 0.017349 | 39 | 22 |
Mythalion 13B | 0.017199 | 36 | 23 |
LLaMA 2 SFT v10 (70B) | 0.017117 | 39 | 24 |
Guanaco (65B) | 0.017003 | 42 | 25 |
Jurassic 2 Mid | 0.016576 | 39 | 26 |
Dolly v2 (12B) | 0.016302 | 234 | 27 |
command-light | 0.015918 | 129 | 28 |
Vicuna v1.5 (13B) | 0.015845 | 39 | 29 |
LLaMA-2-Chat (13B) | 0.015642 | 39 | 30 |
Jurassic 2 Light | 0.014918 | 88 | 31 |
Qwen-Chat (7B) | 0.014715 | 39 | 32 |
Chronos Hermes (13B) | 0.014497 | 39 | 33 |
RedPajama-INCITE Chat (3B) | 0.014269 | 56 | 34 |
Guanaco (13B) | 0.013697 | 39 | 35 |
PaLM 2 Bison | 0.013621 | 75 | 36 |
command-nightly | 0.012650 | 39 | 37 |
Vicuna v1.3 (13B) | 0.012555 | 39 | 38 |
RedPajama-INCITE Chat (7B) | 0.011942 | 39 | 39 |
Luminous Supreme Control | 0.011921 | 33 | 40 |
Luminous Extended Control | 0.011434 | 30 | 41 |
Alpaca (7B) | 0.010622 | 39 | 42 |
Vicuna v1.3 (7B) | 0.010599 | 39 | 43 |
Code Llama Instruct (34B) | 0.010185 | 60 | 44 |
Code Llama Instruct (13B) | 0.009669 | 78 | 45 |
MPT-Chat (7B) | 0.009658 | 39 | 46 |
Pythia-Chat-Base (7B) | 0.009321 | 39 | 47 |
Luminous Supreme | 0.009269 | 87 | 48 |
Falcon Instruct (7B) | 0.008273 | 39 | 49 |
Luminous Base Control | 0.008245 | 30 | 50 |
Luminous Base | 0.008158 | 136 | 51 |
Dolly v2 (3B) | 0.007369 | 58 | 52 |
Code Llama Instruct (7B) | 0.007295 | 69 | 53 |
Open-Assistant Pythia SFT-4 (12B) | 0.006479 | 105 | 54 |
Luminous Extended | 0.006098 | 177 | 55 |
Dolly v2 (7B) | 0.005092 | 53 | 56 |
Vicuna-FastChat-T5 (3B) | 0.004321 | 60 | 57 |
Open-Assistant StableLM SFT-7 (7B) | 0.004262 | 88 | 58 |
Koala (13B) | 0.004046 | 61 | 59 |
pairwise(df_gpt3)
Correlations
df_ranks = pd.concat((df_crowd["rank"], df_gpt4["rank"], df_gpt3["rank"]), axis=1)
df_ranks.columns = ["Humans", "GPT-4", "GPT-3"] # type: ignore[assignment]
df_ranks.corr()
 | Humans | GPT-4 | GPT-3 |
---|---|---|---|
Humans | 1.000000 | 0.730918 | 0.716610 |
GPT-4 | 0.730918 | 1.000000 | 0.717136 |
GPT-3 | 0.716610 | 0.717136 | 1.000000 |
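Since df_ranks holds ranks rather than raw scores, the default Pearson correlation above is essentially Spearman's rank correlation on the underlying scores (up to tie handling). For comparison, a small sketch that computes it explicitly on the score columns:

df_scores = pd.concat((df_crowd["score"], df_gpt4["score"], df_gpt3["score"]), axis=1)
df_scores.columns = ["Humans", "GPT-4", "GPT-3"]
df_scores.corr(method="spearman")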