LLMFAO: Large Language Model Feedback Analysis and Optimization

This is a minimalistic large language model (LLM) leaderboard built from human and machine feedback on pairwise comparisons of model responses, covering a carefully selected set of 13 prompts and 59 different models.

When you see the outputs of two different systems for the same query, you can determine the better one with a much simpler instruction. I decided to give pairwise comparisons a try on the data kindly provided by the llmonitor.com team.

I asked carefully chosen crowd annotators to evaluate every pair and determine the winner; if both models performed similarly well (or similarly poorly), the pair was marked as a tie. Five different annotators evaluated each pair according to the instruction, with 124 annotators in total. I also asked GPT-3.5 Turbo Instruct and GPT-4 to do the same using a shorter evaluation prompt, but I subjectively found human performance to be superior.

A more detailed description of this study is available at https://evalovernite.substack.com/p/llmfao-human-ranking.

The datasets and code are available at https://huggingface.co/datasets/dustalov/llmfao and https://github.com/dustalov/llmfao under open-source licenses.

The pairwise comparisons are transformed into scores using the Evalica library. In the tables below, score is the estimated strength of the model, pairs is the number of comparisons in which the model appears, and rank is its resulting position on the leaderboard.
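
The aggregate helper is not shown in this excerpt, so here is a minimal sketch of how it could be implemented with Evalica, assuming a Bradley-Terry aggregation and a comparison CSV with left, right, and winner columns (the column names, the winner encoding, and the choice of Bradley-Terry are assumptions; the actual helper may differ).

import pandas as pd
from evalica import Winner, bradley_terry

def aggregate(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)  # assumed columns: left, right, winner

    # Map the textual outcomes onto Evalica's Winner enum (encoding is assumed).
    winners = df["winner"].map({"left": Winner.X, "right": Winner.Y, "tie": Winner.Draw})

    # Fit Bradley-Terry scores over the pairwise outcomes.
    result = bradley_terry(df["left"].tolist(), df["right"].tolist(), winners.tolist())
    scores = result.scores.sort_values(ascending=False)

    # Count how many comparisons each model participated in.
    pairs = pd.concat([df["left"], df["right"]]).value_counts()

    return pd.DataFrame({
        "score": scores,
        "pairs": pairs.reindex(scores.index),
        "rank": scores.rank(ascending=False, method="min").astype(int),
    })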

Human Judgements

In [4]:
df_crowd = aggregate("crowd-comparisons.csv")
df_crowd
Out[4]:
model score pairs rank
GPT 4 0.041215 158 1
Platypus-2 Instruct (70B) 0.029233 159 2
command 0.028852 322 3
ReMM SLERP L2 13B 0.027150 153 4
LLaMA-2-Chat (70B) 0.026384 161 5
Claude v1 0.026259 160 6
GPT 3.5 Turbo 0.025871 366 7
Jurassic 2 Mid 0.025849 175 8
Jurassic 2 Ultra 0.025310 165 9
command-nightly 0.025225 169 10
Mythalion 13B 0.024030 143 11
GPT 3.5 Turbo (16k) 0.023981 381 12
Falcon Instruct (40B) 0.023752 348 13
GPT-NeoXT-Chat-Base (20B) 0.023268 160 14
Chronos Hermes (13B) 0.023229 163 15
Claude v2 0.022926 167 16
Claude Instant v1 0.022809 163 17
MPT-Chat (7B) 0.022213 174 18
LLaMA-2-Chat (7B) 0.021364 324 19
LLaMA 2 SFT v10 (70B) 0.020693 167 20
Claude v1.2 0.019842 275 21
Guanaco (65B) 0.018083 187 22
Pythia-Chat-Base (7B) 0.017827 147 23
MythoMax-L2 (13B) 0.017503 173 24
PaLM 2 Bison (Code Chat) 0.017395 156 25
LLaMA-2-Chat (13B) 0.017348 157 26
Guanaco (13B) 0.017318 161 27
Alpaca (7B) 0.016575 166 28
Luminous Supreme Control 0.016541 147 29
Guanaco (33B) 0.016505 345 30
Vicuna v1.5 (13B) 0.016458 161 31
Jurassic 2 Light 0.015635 368 32
Luminous Base Control 0.015556 121 33
Qwen-Chat (7B) 0.015487 161 34
MPT-Chat (30B) 0.015332 164 35
Vicuna v1.3 (13B) 0.015278 167 36
RedPajama-INCITE Chat (7B) 0.014452 159 37
Falcon Instruct (7B) 0.013655 150 38
command-light 0.013632 547 39
Luminous Extended Control 0.013155 126 40
Vicuna v1.3 (7B) 0.011941 159 41
Weaver 12k 0.011844 2762 42
PaLM 2 Bison 0.011222 321 43
Luminous Base 0.010406 550 44
RedPajama-INCITE Chat (3B) 0.010148 242 45
Code Llama Instruct (34B) 0.010096 254 46
Code Llama Instruct (13B) 0.009999 315 47
Airoboros L2 70B 0.009754 325 48
Dolly v2 (12B) 0.009161 1003 49
StarCoderChat Alpha (16B) 0.008508 533 50
Open-Assistant Pythia SFT-4 (12B) 0.008371 428 51
Luminous Extended 0.008072 728 52
Luminous Supreme 0.007237 369 53
Code Llama Instruct (7B) 0.007230 297 54
Open-Assistant StableLM SFT-7 (7B) 0.006986 390 55
Koala (13B) 0.006894 264 56
Dolly v2 (7B) 0.006343 216 57
Vicuna-FastChat-T5 (3B) 0.006304 251 58
Dolly v2 (3B) 0.006294 239 59
In [5]:
pairwise(df_crowd)
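
The pairwise helper used above and in the later sections is likewise not shown here. Under a Bradley-Terry reading of the scores, the probability that one model beats another is the ratio of its score to the sum of both scores, which lends itself to a heat-map view. The sketch below only illustrates that idea (the function body and the heat-map layout are assumptions, not the notebook's actual implementation).

import matplotlib.pyplot as plt
import pandas as pd

def pairwise(df: pd.DataFrame) -> None:
    # P(row beats column) = s_row / (s_row + s_col) under Bradley-Terry.
    s = df["score"].to_numpy()
    probs = s[:, None] / (s[:, None] + s[None, :])

    fig, ax = plt.subplots(figsize=(12, 10))
    im = ax.imshow(probs, vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(df)))
    ax.set_xticklabels(df.index, rotation=90, fontsize=6)
    ax.set_yticks(range(len(df)))
    ax.set_yticklabels(df.index, fontsize=6)
    fig.colorbar(im, ax=ax, label="P(row beats column)")
    plt.show()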

Evaluation with GPT-4

In [6]:
df_gpt4 = aggregate("gpt4-crowd-comparisons.csv")
df_gpt4
Out[6]:
model score pairs rank
GPT 3.5 Turbo 0.192920 90 1
GPT 3.5 Turbo (16k) 0.182214 89 2
Airoboros L2 70B 0.113152 75 3
Claude v1.2 0.100809 66 4
LLaMA-2-Chat (7B) 0.047854 78 5
GPT 4 0.036313 39 6
Claude v1 0.028917 39 7
LLaMA-2-Chat (70B) 0.027002 39 8
Claude Instant v1 0.017927 39 9
Mythalion 13B 0.016102 36 10
Claude v2 0.015999 39 11
LLaMA 2 SFT v10 (70B) 0.015298 39 12
ReMM SLERP L2 13B 0.015095 39 13
Platypus-2 Instruct (70B) 0.014798 39 14
command 0.012952 79 15
Guanaco (65B) 0.012044 42 16
command-nightly 0.011518 39 17
MythoMax-L2 (13B) 0.009588 39 18
PaLM 2 Bison 0.009217 75 19
Jurassic 2 Mid 0.007445 39 20
PaLM 2 Bison (Code Chat) 0.007409 39 21
MPT-Chat (30B) 0.007137 39 22
Falcon Instruct (40B) 0.007106 85 23
LLaMA-2-Chat (13B) 0.006734 39 24
Vicuna v1.5 (13B) 0.006050 39 25
Vicuna v1.3 (13B) 0.004837 39 26
Luminous Extended Control 0.004672 30 27
Chronos Hermes (13B) 0.004635 39 28
Jurassic 2 Ultra 0.004062 39 29
Guanaco (13B) 0.003941 39 30
Code Llama Instruct (7B) 0.003637 69 31
Qwen-Chat (7B) 0.003572 39 32
MPT-Chat (7B) 0.003561 39 33
GPT-NeoXT-Chat-Base (20B) 0.003227 39 34
Luminous Supreme Control 0.003062 33 35
Code Llama Instruct (34B) 0.002829 60 36
Dolly v2 (7B) 0.002821 53 37
Luminous Base Control 0.002795 30 38
StarCoderChat Alpha (16B) 0.002740 129 39
Vicuna v1.3 (7B) 0.002725 39 40
Falcon Instruct (7B) 0.002668 39 41
Pythia-Chat-Base (7B) 0.002473 39 42
Alpaca (7B) 0.002334 39 43
RedPajama-INCITE Chat (7B) 0.002206 39 44
Weaver 12k 0.001951 664 45
Code Llama Instruct (13B) 0.001636 78 46
Dolly v2 (3B) 0.001487 58 47
Jurassic 2 Light 0.001462 88 48
Koala (13B) 0.001286 61 49
command-light 0.001212 129 50
Vicuna-FastChat-T5 (3B) 0.001192 60 51
Guanaco (33B) 0.001133 85 52
Open-Assistant StableLM SFT-7 (7B) 0.001073 88 53
RedPajama-INCITE Chat (3B) 0.000960 56 54
Open-Assistant Pythia SFT-4 (12B) 0.000575 105 55
Dolly v2 (12B) 0.000569 234 56
Luminous Base 0.000485 136 57
Luminous Supreme 0.000384 87 58
Luminous Extended 0.000198 177 59
In [7]:
pairwise(df_gpt4)

Evaluation with GPT-3

In [8]:
df_gpt3 = aggregate("gpt3-crowd-comparisons.csv")
df_gpt3
Out[8]:
model score pairs rank
command 0.045338 79 1
GPT 3.5 Turbo (16k) 0.041943 89 2
GPT 3.5 Turbo 0.037131 90 3
LLaMA-2-Chat (7B) 0.036827 78 4
LLaMA-2-Chat (70B) 0.033971 39 5
ReMM SLERP L2 13B 0.033579 39 6
MythoMax-L2 (13B) 0.032547 39 7
Claude v1.2 0.031428 66 8
GPT 4 0.029375 39 9
GPT-NeoXT-Chat-Base (20B) 0.026620 39 10
PaLM 2 Bison (Code Chat) 0.026151 39 11
Weaver 12k 0.020631 664 12
Falcon Instruct (40B) 0.020577 85 13
Claude v2 0.020364 39 14
Claude Instant v1 0.020276 39 15
Airoboros L2 70B 0.019754 75 16
StarCoderChat Alpha (16B) 0.018511 129 17
Guanaco (33B) 0.018430 85 18
Platypus-2 Instruct (70B) 0.017478 39 19
Claude v1 0.017469 39 20
Jurassic 2 Ultra 0.017469 39 20
MPT-Chat (30B) 0.017349 39 22
Mythalion 13B 0.017199 36 23
LLaMA 2 SFT v10 (70B) 0.017117 39 24
Guanaco (65B) 0.017003 42 25
Jurassic 2 Mid 0.016576 39 26
Dolly v2 (12B) 0.016302 234 27
command-light 0.015918 129 28
Vicuna v1.5 (13B) 0.015845 39 29
LLaMA-2-Chat (13B) 0.015642 39 30
Jurassic 2 Light 0.014918 88 31
Qwen-Chat (7B) 0.014715 39 32
Chronos Hermes (13B) 0.014497 39 33
RedPajama-INCITE Chat (3B) 0.014269 56 34
Guanaco (13B) 0.013697 39 35
PaLM 2 Bison 0.013621 75 36
command-nightly 0.012650 39 37
Vicuna v1.3 (13B) 0.012555 39 38
RedPajama-INCITE Chat (7B) 0.011942 39 39
Luminous Supreme Control 0.011921 33 40
Luminous Extended Control 0.011434 30 41
Alpaca (7B) 0.010622 39 42
Vicuna v1.3 (7B) 0.010599 39 43
Code Llama Instruct (34B) 0.010185 60 44
Code Llama Instruct (13B) 0.009669 78 45
MPT-Chat (7B) 0.009658 39 46
Pythia-Chat-Base (7B) 0.009321 39 47
Luminous Supreme 0.009269 87 48
Falcon Instruct (7B) 0.008273 39 49
Luminous Base Control 0.008245 30 50
Luminous Base 0.008158 136 51
Dolly v2 (3B) 0.007369 58 52
Code Llama Instruct (7B) 0.007295 69 53
Open-Assistant Pythia SFT-4 (12B) 0.006479 105 54
Luminous Extended 0.006098 177 55
Dolly v2 (7B) 0.005092 53 56
Vicuna-FastChat-T5 (3B) 0.004321 60 57
Open-Assistant StableLM SFT-7 (7B) 0.004262 88 58
Koala (13B) 0.004046 61 59
In [9]:
pairwise(df_gpt3)

Correlations

In [10]:
df_ranks = pd.concat((df_crowd["rank"], df_gpt4["rank"], df_gpt3["rank"]), axis=1)
df_ranks.columns = ["Humans", "GPT-4", "GPT-3"]  # type: ignore[assignment]
df_ranks.corr()
Out[10]:
Humans GPT-4 GPT-3
Humans 1.000000 0.730918 0.716610
GPT-4 0.730918 1.000000 0.717136
GPT-3 0.716610 0.717136 1.000000
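
Since the three columns already contain ranks, the Pearson correlation computed above is essentially a rank correlation. To make that explicit, and to handle ties in the standard way, the same comparison can be repeated with Spearman's rho or Kendall's tau, both of which pandas supports out of the box:

# Rank correlations between the human and the LLM-based rankings.
df_ranks.corr(method="spearman")  # Spearman's rho
df_ranks.corr(method="kendall")   # Kendall's tau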