(@asaf-yehudai)

JuStRank - a Hugging Face Space by ibm Discover amazing ML apps made by the community

Yes!
huggingface.co/spaces/ibm/J...

13.12.2024 13:06 👍 2 🔁 0 💬 1 📌 0

Check out our full leaderboard here:
huggingface.co/spaces/ibm/J...

13.12.2024 10:16 👍 1 🔁 0 💬 0 📌 0

Paper page - JuStRank: Benchmarking LLM Judges for System Ranking Join the discussion on this paper page

Many more details are in the paper:
huggingface.co/papers/2412....

Thanks to the amazing collaborators: Ariel Gera, Odellia Boni, @yperlitz.bsky.social, Roy Bar-Haim, and Lilach Eden, from IBM Research.

13.12.2024 10:16 👍 1 🔁 0 💬 1 📌 0

Overall, we found:
1⃣ A strong correlation between a judge's ranking ability and its decisiveness
2⃣ A negative correlation between a judge's ranking ability and its tendency toward system-specific bias

13.12.2024 10:16 👍 0 🔁 0 💬 1 📌 0

Surprisingly, we found that self-bias is less prevalent than we thought

13.12.2024 10:16 👍 0 🔁 0 💬 1 📌 0

Secondly, we define a new type of bias:

System-specific bias

This is where a judge prefers or dislikes a specific system

Our results demonstrate large biases that affect system ranking

13.12.2024 10:16 👍 0 🔁 0 💬 1 📌 0

Analyzing these figures, we found an emergent judge behavior:

We call it decisiveness!
Decisive judges prefer stronger systems even more than humans do!

We measure it based on the empirical fit

13.12.2024 10:16 👍 2 🔁 0 💬 1 📌 0

What does JuStRank tell us about general judge behavior?

For that, we turn to the system preference task
Given a pair of systems, which one is better?

We plot the gold win-rates against the judge-predicted win-rates
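
A minimal sketch of how a pairwise win-rate could be computed from a judge's per-response scores. The function name, the toy scores, and the choice to count ties as half a win are illustrative assumptions, not details taken from the paper.

```python
def win_rate(scores_a, scores_b):
    """Fraction of shared prompts on which system A outscores system B.

    Assumption for this sketch: a tie counts as half a win for each side.
    """
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both systems need scores for the same non-empty prompt set")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a, b in zip(scores_a, scores_b)
    )
    return wins / len(scores_a)


# Hypothetical judge scores for two systems on the same four prompts.
judge_scores_a = [8, 6, 7, 9]
judge_scores_b = [7, 6, 5, 9.5]

predicted_win_rate = win_rate(judge_scores_a, judge_scores_b)
```

Applying the same function to human preference data would give the gold win-rate for the pair, so each system pair contributes one point to the plot.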

13.12.2024 10:16 👍 0 🔁 0 💬 1 📌 0

With JuStRank we found:
1⃣ Smaller dedicated judges are on par with big ones
2⃣ An LLM judge's realization matters a lot
3⃣ Comparative judgment is not the best choice for most judges

🕺💃

13.12.2024 10:16 👍 0 🔁 0 💬 1 📌 0

So how did we do it?

For LLM judges, we took 4 unique realizations
➕ reward models
Each judge scored the responses of 64 systems,
and we derived a system ranking from each judge

Then we compared each ranking to Arena's gold ranking
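
The pipeline above can be sketched roughly as follows. This is only an illustration: aggregating by mean score and measuring agreement with Kendall's tau are assumptions made for the sketch (the thread does not name the aggregation or the agreement metric), and the system names and scores are made up.

```python
from itertools import combinations
from statistics import mean


def rank_systems(scores):
    """Order systems from best to worst by their mean judge score.

    scores: {system name: [judge score for each response]}
    Mean aggregation is an assumption; other aggregations are possible.
    """
    return sorted(scores, key=lambda name: mean(scores[name]), reverse=True)


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau-a between two rankings of the same systems (no ties)."""
    pos_a = {name: i for i, name in enumerate(ranking_a)}
    pos_b = {name: i for i, name in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings order the pair the same way.
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical judge scores for three systems, three responses each.
judge_scores = {
    "sys_a": [0.9, 0.8, 0.7],
    "sys_b": [0.6, 0.7, 0.5],
    "sys_c": [0.65, 0.75, 0.7],
}
gold_ranking = ["sys_a", "sys_b", "sys_c"]  # stand-in for the gold ranking

judge_ranking = rank_systems(judge_scores)
tau = kendall_tau(judge_ranking, gold_ranking)
```

A tau of 1 would mean the judge reproduces the gold ranking exactly, while values near 0 mean the judge's ordering carries little information about it.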

13.12.2024 10:16 👍 0 🔁 0 💬 1 📌 0

There are many new judge benchmarks
But most focus on evaluating the judge's ability to choose a better response

We focus on the judge's ability to choose a better system

13.12.2024 10:16 👍 0 🔁 0 💬 1 📌 0

JuStRank: Benchmarking LLM Judges for System Ranking Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such eva...

New preprint! ✨
Interested in LLM-as-a-Judge?
Want to get the best judge for ranking your system?
Our new work is just for you:
"JuStRank: Benchmarking LLM Judges for System Ranking"
🕺💃
arxiv.org/abs/2412.09569

13.12.2024 10:16 👍 9 🔁 5 💬 1 📌 1
