
LMSys Chatbot Arena

An open platform for evaluating and comparing large language models through crowdsourced battles. It lets you compare GPT-4, Claude, Gemini, and other models side by side.

Shipped November 25, 2025 · chatbot · freemium

Tags: chatbot, LLM, benchmark


Highlights

1. Chatbot

2. LLM

3. Benchmark

Stork’s verdict on LMSys Chatbot Arena

Chatbot Arena provides a dynamic Elo-style leaderboard, but its scores can be skewed by models that are optimized for particular prompt styles.

Overview

An open platform that evaluates large language models and compares them through crowdsourced battles. It lets you compare GPT-4, Claude, Gemini, and other models side by side.

How to use LMSys Chatbot Arena

LMSys Chatbot Arena provides a simple web-based interface for chatting with and evaluating large language models. Users take part in "battles", and their votes feed a dynamic leaderboard.

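1. Access the platform: Open arena.ai (formerly lmarena.ai) in a web browser.
2. Start a battle: Select 'Battle Mode' to begin an anonymous, randomized one-on-one comparison.
3. Chat with the LLMs: Use the provided chat interface to send the same prompt to two unidentified LLMs at once.
4. Evaluate the responses: Compare the two models' answers for quality, helpfulness, and relevance.
5. Vote: Vote for the better response, declare a tie, or indicate that both responses are bad.
6. View the leaderboard: Open the 'Leaderboard' section to see the dynamic Elo-style ranking of each LLM, based on users' cumulative votes.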
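To make the voting and leaderboard steps concrete, here is a minimal sketch of how pairwise votes could be turned into Elo-style ratings. It uses the textbook online Elo update, not necessarily the method the platform actually runs. The step size `K`, the starting rating, the model names, and the choice to score "both bad" the same as a tie are all assumptions made for illustration.

```python
from collections import defaultdict

K = 32            # assumed update step size (hypothetical)
INITIAL = 1000.0  # assumed starting rating (hypothetical)


def expected_score(r_a, r_b):
    """Probability that A beats B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_ratings(votes, k=K, initial=INITIAL):
    """Fold a sequence of votes into ratings.

    votes: iterable of (model_a, model_b, outcome), where outcome is
    "a" (A wins), "b" (B wins), or "tie". Treating "both bad" as a
    tie is an assumption of this sketch.
    """
    ratings = defaultdict(lambda: initial)
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, outcome in votes:
        s_a = score_for_a[outcome]
        e_a = expected_score(ratings[model_a], ratings[model_b])
        delta = k * (s_a - e_a)
        # Zero-sum update: what A gains, B loses.
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)


def leaderboard(ratings):
    """Return (model, rating) pairs sorted from highest to lowest."""
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes between placeholder models.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
]
ratings = update_ratings(votes)
for name, rating in leaderboard(ratings):
    print(f"{name}: {rating:.1f}")
```

Because each update depends on the ratings at that moment, the result of this online variant depends on the order of the votes. That is one reason a production leaderboard might prefer a fit over all votes at once.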

Pros

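+ Provides a dynamic leaderboard grounded in human preference, based on millions of real user interactions.
+ Offers anonymous, randomized one-on-one comparisons, which helps reduce bias in evaluation.
+ Continuously updated with new models and features, including multimodal support since June 2024.
+ Addresses the limits of static benchmarks by continuously taking in fresh prompts from real users.
+ Provides valuable conversation datasets and open-source infrastructure (FastChat) for research and reproducibility.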
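The anonymous, randomized pairing mentioned in the list above can be sketched as follows. This is an illustration only and not the platform's code: the function names, the slot labels "Model A" and "Model B", and the model names are hypothetical.

```python
import random


def sample_battle(models, rng=random):
    """Pick two distinct models at random and assign them to anonymous slots.

    rng.sample returns the picks in random order, so which model lands
    in which slot is randomized as well.
    """
    model_a, model_b = rng.sample(models, 2)
    return {"Model A": model_a, "Model B": model_b}


def anonymized_view(battle, responses):
    """Return responses keyed by slot label only, hiding model names."""
    return {slot: responses[model] for slot, model in battle.items()}


# Hypothetical model pool and canned responses.
models = ["model-x", "model-y", "model-z"]
responses = {
    "model-x": "Paris.",
    "model-y": "The capital is Paris.",
    "model-z": "It is Paris, France.",
}

rng = random.Random(0)  # seeded only to make the example repeatable
battle = sample_battle(models, rng)
view = anonymized_view(battle, responses)
for slot, text in view.items():
    print(f"{slot}: {text}")
```

The voter sees only the slot labels, so brand recognition and a fixed left or right position cannot systematically favor one model.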

Cons

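− Models may be optimized specifically for Arena-style prompts, leading to inflated scores that may not generalize.
− It is not a one-size-fits-all benchmark for every evaluation need; experts recommend pairing it with task-based evaluations.
− It is inherently skewed toward conversational tasks and may not accurately reflect performance on highly specialized or long, complex interactions.
− As the platform's influence grows, there are concerns about corporate influence and possible manipulation of results.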
− Anonymization reduces bias, but because model identities stay hidden until a vote is cast, it can be harder to probe the limitations of a specific model.

Similar tools

Other tools to consider

WhatLLM.org

It aggregates benchmark data, real-world pricing, and throughput metrics for a vast number of LLMs, offering a unified interface for comparison.

Unlike LMSys Chatbot Arena's crowdsourced battles, WhatLLM.org focuses on aggregating and presenting quantitative benchmark data, pricing, and speed metrics for developers and researchers to make informed decisions.

Artificial Analysis (on Stork Compare)

Provides comprehensive comparisons of leading AI chatbots based on their own detailed benchmarking of intelligence, features, context windows, and performance metrics.

While both offer comparisons, Artificial Analysis provides its own structured benchmarks and detailed metrics, whereas LMSys Chatbot Arena relies on real-time, anonymous human preference battles to generate its leaderboard.

Google LLM Comparator (on Stork Compare)

It's a web app and Python library designed for scalable analysis of side-by-side LLM evaluations with interactive visualizations, helping users understand *why* model performance differs.

Unlike the public, crowdsourced nature of LMSys Chatbot Arena, Google LLM Comparator is a tool for developers to analyze side-by-side evaluation results more deeply, focusing on identifying and understanding performance discrepancies.

OpenAI Evals (on Stork Compare)

An open-source framework that allows developers to build, run, and share custom benchmarks and evaluation tasks for LLMs, fostering community contribution to testing.

OpenAI Evals is a framework for creating and running benchmarks, offering a programmatic approach to evaluation, whereas LMSys Chatbot Arena is a user-facing platform for interactive, crowdsourced model comparisons.

Hugging Face Open LLM Leaderboard (on Stork Compare)

It provides a public, continuously updated leaderboard that ranks open-source LLMs based on standardized benchmarks, offering transparency and a central reference for model performance.

While both provide rankings, the Hugging Face Open LLM Leaderboard focuses on objective, benchmark-driven scores for open-source models, contrasting with LMSys Chatbot Arena's human-preference-based Elo rating system for a broader range of models.
