LLMFAO: Large Language Model Feedback Analysis and Optimization
This is a minimalistic large language model (LLM) leaderboard built from human and machine feedback on pairwise comparisons of responses from 59 different models to a carefully selected set of 13 prompts.
When you see the outputs of two different systems for the same query, you can determine the better one with a much simpler instruction than scoring each output in isolation would require. I decided to give pairwise comparisons a try on the data kindly provided by the llmonitor.com team.
I asked carefully chosen crowd annotators to evaluate every pair and determine the winner; if both models performed similarly well or poorly, the pair was marked as a tie. Each pair was evaluated by five different annotators following the same instruction, with 124 annotators participating in total. I also asked GPT-3.5 Turbo Instruct and GPT-4 to do the same using a shorter evaluation prompt, but I subjectively found human performance to be superior.
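The notebook does not show how the five per-pair judgements were aggregated into a single outcome. The helper below is a minimal majority-vote sketch, not the actual aggregation used in the study; the left, right, and winner column names are hypothetical.

from collections import Counter

import pandas as pd

def aggregate_votes(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse several judgements per pair into one label by majority vote."""
    def majority(votes: pd.Series) -> str:
        label, freq = Counter(votes).most_common(1)[0]
        # Without a strict majority, fall back to a tie.
        return label if freq > len(votes) / 2 else 'tie'

    # winner is assumed to be one of 'left', 'right', or 'tie' (hypothetical schema).
    return df.groupby(['left', 'right'])['winner'].apply(majority).reset_index()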
A more detailed description of this study is available at https://evalovernite.substack.com/p/llmfao-human-ranking.
The datasets and code are available on GitHub at https://github.com/dustalov/llmfao under open-source licenses.
Loaded as API: https://dustalov-pair2rank.hf.space/ ✔
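The pair2rank and pairwise helpers used below are defined earlier in the notebook and call the Pair2Rank Space loaded above. The following is a minimal sketch of how such a helper could be written with gradio_client; the endpoint name, its arguments, and the shape of the returned payload are assumptions rather than the Space's documented API.

import pandas as pd
from gradio_client import Client, handle_file

# Connecting to the Space prints the 'Loaded as API' line shown above.
client = Client('https://dustalov-pair2rank.hf.space/')

def pair2rank(path: str) -> pd.DataFrame:
    # Hypothetical endpoint and return format: the real Space API may differ.
    result = client.predict(handle_file(path), api_name='/predict')
    return pd.DataFrame(result['data'], columns=result['headers']).set_index('item')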
Human Judgements
df_crowd = pair2rank('crowd-comparisons.csv')
df_crowd
item | score | pairs | rank |
---|---|---|---|
GPT 4 | 0.041049 | 158 | 1 |
Platypus-2 Instruct (70B) | 0.029227 | 159 | 2 |
command | 0.028848 | 322 | 3 |
ReMM SLERP L2 13B | 0.027151 | 153 | 4 |
LLaMA-2-Chat (70B) | 0.026391 | 161 | 5 |
Claude v1 | 0.026260 | 160 | 6 |
GPT 3.5 Turbo | 0.025873 | 366 | 7 |
Jurassic 2 Mid | 0.025850 | 175 | 8 |
Jurassic 2 Ultra | 0.025312 | 165 | 9 |
command-nightly | 0.025230 | 169 | 10 |
Mythalion 13B | 0.024035 | 143 | 11 |
GPT 3.5 Turbo (16k) | 0.023987 | 381 | 12 |
Falcon Instruct (40B) | 0.023755 | 348 | 13 |
GPT-NeoXT-Chat-Base (20B) | 0.023272 | 160 | 14 |
Chronos Hermes (13B) | 0.023233 | 163 | 15 |
Claude v2 | 0.022930 | 167 | 16 |
Claude Instant v1 | 0.022814 | 163 | 17 |
MPT-Chat (7B) | 0.022218 | 174 | 18 |
LLaMA-2-Chat (7B) | 0.021367 | 324 | 19 |
LLaMA 2 SFT v10 (70B) | 0.020699 | 167 | 20 |
Claude v1.2 | 0.019849 | 275 | 21 |
Guanaco (65B) | 0.018090 | 187 | 22 |
Pythia-Chat-Base (7B) | 0.017832 | 147 | 23 |
MythoMax-L2 (13B) | 0.017507 | 173 | 24 |
PaLM 2 Bison (Code Chat) | 0.017399 | 156 | 25 |
LLaMA-2-Chat (13B) | 0.017354 | 157 | 26 |
Guanaco (13B) | 0.017323 | 161 | 27 |
Alpaca (7B) | 0.016579 | 166 | 28 |
Luminous Supreme Control | 0.016545 | 147 | 29 |
Guanaco (33B) | 0.016508 | 345 | 30 |
Vicuna v1.5 (13B) | 0.016462 | 161 | 31 |
Jurassic 2 Light | 0.015638 | 368 | 32 |
Luminous Base Control | 0.015560 | 121 | 33 |
Qwen-Chat (7B) | 0.015491 | 161 | 34 |
MPT-Chat (30B) | 0.015336 | 164 | 35 |
Vicuna v1.3 (13B) | 0.015282 | 167 | 36 |
RedPajama-INCITE Chat (7B) | 0.014456 | 159 | 37 |
Falcon Instruct (7B) | 0.013657 | 150 | 38 |
command-light | 0.013633 | 547 | 39 |
Luminous Extended Control | 0.013158 | 126 | 40 |
Vicuna v1.3 (7B) | 0.011943 | 159 | 41 |
Weaver 12k | 0.011846 | 2762 | 42 |
PaLM 2 Bison | 0.011224 | 321 | 43 |
Luminous Base | 0.010408 | 550 | 44 |
RedPajama-INCITE Chat (3B) | 0.010150 | 242 | 45 |
Code Llama Instruct (34B) | 0.010098 | 254 | 46 |
Code Llama Instruct (13B) | 0.010001 | 315 | 47 |
Airoboros L2 70B | 0.009755 | 325 | 48 |
Dolly v2 (12B) | 0.009163 | 1003 | 49 |
StarCoderChat Alpha (16B) | 0.008509 | 533 | 50 |
Open-Assistant Pythia SFT-4 (12B) | 0.008373 | 428 | 51 |
Luminous Extended | 0.008074 | 728 | 52 |
Luminous Supreme | 0.007238 | 369 | 53 |
Code Llama Instruct (7B) | 0.007231 | 297 | 54 |
Open-Assistant StableLM SFT-7 (7B) | 0.006987 | 390 | 55 |
Koala (13B) | 0.006896 | 264 | 56 |
Dolly v2 (7B) | 0.006344 | 216 | 57 |
Vicuna-FastChat-T5 (3B) | 0.006305 | 251 | 58 |
Dolly v2 (3B) | 0.006295 | 239 | 59 |
pairwise(df_crowd)
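The pairwise helper presumably visualizes head-to-head behaviour. One plausible reconstruction, assuming the score column holds Bradley-Terry-style strengths, is the matrix of predicted win probabilities, where model i beats model j with probability s_i / (s_i + s_j):

import numpy as np
import pandas as pd

def win_probabilities(df: pd.DataFrame) -> pd.DataFrame:
    # P(i beats j) = s_i / (s_i + s_j); the diagonal is exactly 0.5.
    s = df['score'].to_numpy()
    p = s[:, np.newaxis] / (s[:, np.newaxis] + s[np.newaxis, :])
    return pd.DataFrame(p, index=df.index, columns=df.index)

For instance, under this reading of the crowd scores above, GPT 4 would be predicted to beat Dolly v2 (3B) with probability 0.041049 / (0.041049 + 0.006295) ≈ 0.87.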
Evaluation with GPT-4
df_gpt4 = pair2rank('gpt4-crowd-comparisons.csv')
df_gpt4
item | score | pairs | rank |
---|---|---|---|
GPT 3.5 Turbo | 0.120143 | 90 | 1 |
GPT 3.5 Turbo (16k) | 0.116557 | 89 | 2 |
Airoboros L2 70B | 0.090668 | 75 | 3 |
Claude v1.2 | 0.085138 | 66 | 4 |
LLaMA-2-Chat (7B) | 0.053131 | 78 | 5 |
GPT 4 | 0.041652 | 39 | 6 |
LLaMA-2-Chat (70B) | 0.033978 | 39 | 7 |
Claude v1 | 0.033438 | 39 | 8 |
Claude Instant v1 | 0.025295 | 39 | 9 |
Mythalion 13B | 0.024571 | 36 | 10 |
Claude v2 | 0.022957 | 39 | 11 |
ReMM SLERP L2 13B | 0.022642 | 39 | 12 |
LLaMA 2 SFT v10 (70B) | 0.022242 | 39 | 13 |
Platypus-2 Instruct (70B) | 0.020938 | 39 | 14 |
command | 0.019886 | 79 | 15 |
Guanaco (65B) | 0.019804 | 42 | 16 |
command-nightly | 0.018989 | 39 | 17 |
MythoMax-L2 (13B) | 0.015216 | 39 | 18 |
PaLM 2 Bison | 0.014308 | 75 | 19 |
PaLM 2 Bison (Code Chat) | 0.012014 | 39 | 20 |
Jurassic 2 Mid | 0.011836 | 39 | 21 |
MPT-Chat (30B) | 0.011676 | 39 | 22 |
LLaMA-2-Chat (13B) | 0.011275 | 39 | 23 |
Falcon Instruct (40B) | 0.011153 | 85 | 24 |
Vicuna v1.5 (13B) | 0.009947 | 39 | 25 |
Vicuna v1.3 (13B) | 0.007923 | 39 | 26 |
Luminous Extended Control | 0.007887 | 30 | 27 |
Chronos Hermes (13B) | 0.007697 | 39 | 28 |
Jurassic 2 Ultra | 0.006698 | 39 | 29 |
Guanaco (13B) | 0.006654 | 39 | 30 |
Code Llama Instruct (7B) | 0.005996 | 69 | 31 |
Qwen-Chat (7B) | 0.005967 | 39 | 32 |
MPT-Chat (7B) | 0.005901 | 39 | 33 |
GPT-NeoXT-Chat-Base (20B) | 0.005460 | 39 | 34 |
Luminous Supreme Control | 0.005073 | 33 | 35 |
Dolly v2 (7B) | 0.004762 | 53 | 36 |
Code Llama Instruct (34B) | 0.004732 | 60 | 37 |
Luminous Base Control | 0.004659 | 30 | 38 |
Vicuna v1.3 (7B) | 0.004528 | 39 | 39 |
StarCoderChat Alpha (16B) | 0.004481 | 129 | 40 |
Falcon Instruct (7B) | 0.004436 | 39 | 41 |
Pythia-Chat-Base (7B) | 0.004115 | 39 | 42 |
Alpaca (7B) | 0.003899 | 39 | 43 |
RedPajama-INCITE Chat (7B) | 0.003733 | 39 | 44 |
Weaver 12k | 0.003200 | 664 | 45 |
Code Llama Instruct (13B) | 0.002735 | 78 | 46 |
Dolly v2 (3B) | 0.002497 | 58 | 47 |
Jurassic 2 Light | 0.002419 | 88 | 48 |
Koala (13B) | 0.002158 | 61 | 49 |
command-light | 0.002001 | 129 | 50 |
Vicuna-FastChat-T5 (3B) | 0.001997 | 60 | 51 |
Guanaco (33B) | 0.001886 | 85 | 52 |
Open-Assistant StableLM SFT-7 (7B) | 0.001776 | 88 | 53 |
RedPajama-INCITE Chat (3B) | 0.001598 | 56 | 54 |
Open-Assistant Pythia SFT-4 (12B) | 0.000958 | 105 | 55 |
Dolly v2 (12B) | 0.000943 | 234 | 56 |
Luminous Base | 0.000805 | 136 | 57 |
Luminous Supreme | 0.000640 | 87 | 58 |
Luminous Extended | 0.000329 | 177 | 59 |
pairwise(df_gpt4)
Evaluation with GPT-3
df_gpt3 = pair2rank('gpt3-crowd-comparisons.csv')
df_gpt3
item | score | pairs | rank |
---|---|---|---|
command | 0.045235 | 79 | 1 |
GPT 3.5 Turbo (16k) | 0.041898 | 89 | 2 |
GPT 3.5 Turbo | 0.037120 | 90 | 3 |
LLaMA-2-Chat (7B) | 0.036816 | 78 | 4 |
LLaMA-2-Chat (70B) | 0.033973 | 39 | 5 |
ReMM SLERP L2 13B | 0.033572 | 39 | 6 |
MythoMax-L2 (13B) | 0.032544 | 39 | 7 |
Claude v1.2 | 0.031440 | 66 | 8 |
GPT 4 | 0.029380 | 39 | 9 |
GPT-NeoXT-Chat-Base (20B) | 0.026622 | 39 | 10 |
PaLM 2 Bison (Code Chat) | 0.026155 | 39 | 11 |
Weaver 12k | 0.020637 | 664 | 12 |
Falcon Instruct (40B) | 0.020581 | 85 | 13 |
Claude v2 | 0.020367 | 39 | 14 |
Claude Instant v1 | 0.020280 | 39 | 15 |
Airoboros L2 70B | 0.019759 | 75 | 16 |
StarCoderChat Alpha (16B) | 0.018517 | 129 | 17 |
Guanaco (33B) | 0.018434 | 85 | 18 |
Platypus-2 Instruct (70B) | 0.017481 | 39 | 19 |
Claude v1 | 0.017472 | 39 | 20 |
Jurassic 2 Ultra | 0.017472 | 39 | 20 |
MPT-Chat (30B) | 0.017355 | 39 | 22 |
Mythalion 13B | 0.017204 | 36 | 23 |
LLaMA 2 SFT v10 (70B) | 0.017124 | 39 | 24 |
Guanaco (65B) | 0.017010 | 42 | 25 |
Jurassic 2 Mid | 0.016579 | 39 | 26 |
Dolly v2 (12B) | 0.016305 | 234 | 27 |
command-light | 0.015920 | 129 | 28 |
Vicuna v1.5 (13B) | 0.015848 | 39 | 29 |
LLaMA-2-Chat (13B) | 0.015647 | 39 | 30 |
Jurassic 2 Light | 0.014921 | 88 | 31 |
Qwen-Chat (7B) | 0.014718 | 39 | 32 |
Chronos Hermes (13B) | 0.014500 | 39 | 33 |
RedPajama-INCITE Chat (3B) | 0.014273 | 56 | 34 |
Guanaco (13B) | 0.013701 | 39 | 35 |
PaLM 2 Bison | 0.013624 | 75 | 36 |
command-nightly | 0.012654 | 39 | 37 |
Vicuna v1.3 (13B) | 0.012557 | 39 | 38 |
RedPajama-INCITE Chat (7B) | 0.011945 | 39 | 39 |
Luminous Supreme Control | 0.011923 | 33 | 40 |
Luminous Extended Control | 0.011438 | 30 | 41 |
Alpaca (7B) | 0.010625 | 39 | 42 |
Vicuna v1.3 (7B) | 0.010601 | 39 | 43 |
Code Llama Instruct (34B) | 0.010188 | 60 | 44 |
Code Llama Instruct (13B) | 0.009672 | 78 | 45 |
MPT-Chat (7B) | 0.009660 | 39 | 46 |
Pythia-Chat-Base (7B) | 0.009323 | 39 | 47 |
Luminous Supreme | 0.009272 | 87 | 48 |
Falcon Instruct (7B) | 0.008274 | 39 | 49 |
Luminous Base Control | 0.008248 | 30 | 50 |
Luminous Base | 0.008160 | 136 | 51 |
Dolly v2 (3B) | 0.007371 | 58 | 52 |
Code Llama Instruct (7B) | 0.007297 | 69 | 53 |
Open-Assistant Pythia SFT-4 (12B) | 0.006481 | 105 | 54 |
Luminous Extended | 0.006100 | 177 | 55 |
Dolly v2 (7B) | 0.005093 | 53 | 56 |
Vicuna-FastChat-T5 (3B) | 0.004322 | 60 | 57 |
Open-Assistant StableLM SFT-7 (7B) | 0.004262 | 88 | 58 |
Koala (13B) | 0.004047 | 61 | 59 |
pairwise(df_gpt3)
Correlations
df_ranks = pd.concat((df_crowd['rank'], df_gpt4['rank'], df_gpt3['rank']), axis=1)
df_ranks.columns = ['Humans', 'GPT-4', 'GPT-3'] # type: ignore[assignment]
df_ranks.corr(method='spearman')
 | Humans | GPT-4 | GPT-3 |
---|---|---|---|
Humans | 1.000000 | 0.730041 | 0.715703 |
GPT-4 | 0.730041 | 1.000000 | 0.716463 |
GPT-3 | 0.715703 | 0.716463 | 1.000000 |
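Spearman's ρ of roughly 0.72 to 0.73 across all three pairs indicates substantial, though far from perfect, agreement between the humans and the two machine judges. With 59 ranked items, these correlations can also be checked for significance; the snippet below is a sketch using SciPy, not part of the original notebook.

from itertools import combinations

from scipy.stats import spearmanr

# Pairwise Spearman correlations with p-values over the 59 shared items.
for a, b in combinations(df_ranks.columns, 2):
    rho, p = spearmanr(df_ranks[a], df_ranks[b])
    print(f'{a} vs. {b}: rho = {rho:.3f}, p = {p:.2g}')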