Poe bot rankings

Poe's official bot evaluations combine industry-standard benchmarks with custom tests designed to be representative of actual Poe usage. Each official bot is rated across four attributes: Reasoning, Non-English fluency, Creativity, and Writing.

| Official bots | Reasoning (score) | Non-English fluency (score) | Creativity (Elo) | Writing (Elo) |
|---|---|---|---|---|
| GPT-4 | #1 (34.0) | #1 (50.2) | #1 (1170) | #1 (1106) |
| Claude-2-100k | #2 (26.0) | #4 (39.5) | #5 (925) | #5 (963) |
| Google-PaLM | #3 (24.6) | #2 (44.1) | #6 (881) | #6 (931) |
| Claude-instant | #4 (22.9) | #5 (36.8) | #4 (929) | #4 (967) |
| GPT-3.5-Turbo | #5 (19.3) | #3 (41.5) | #2 (1084) | #2 (1024) |
| Llama-2-70b | #6 (18.8) | #6 (29.1) | #3 (1009) | #3 (1008) |

The bots in these rankings were evaluated in September 2023.

Definitions

Reasoning

How well the bot arrives at a logical conclusion by following complex prompts. This helps describe proficiency at tasks like solving math questions and working through programming challenges.
Programming
Math and science
Homework help
Simplify complex topics

How we measure it

We selected industry benchmarks that depict each LLM's "out-of-the-box" capabilities in logical deduction, game reasoning, coding, and computation. The individual datasets contributing to each of these subcategories were normalized by sample size to arrive at the final weighting scheme. Each score is a weighted average of LLM accuracy across the subcategories, averaged over 3 runs.
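
As a rough illustration of this sample-size weighting, the sketch below computes a composite score from per-dataset accuracies. The subcategory names, dataset sizes, and accuracies are invented for the example; they are not Poe's actual evaluation data.

```python
# A minimal sketch of sample-size-normalized weighting, assuming each
# dataset's accuracy has already been averaged over 3 runs. All names
# and numbers below are illustrative, not Poe's data.
subcategories = {
    # subcategory: [(dataset_sample_size, mean_accuracy_over_runs), ...]
    "logical_deduction": [(500, 0.61), (250, 0.58)],
    "game_reasoning":    [(200, 0.44)],
    "coding":            [(300, 0.47)],
    "computation":       [(450, 0.52)],
}

def composite_score(subcats: dict) -> float:
    """Weight each dataset by its sample size, then average overall."""
    total = sum(n for datasets in subcats.values() for n, _ in datasets)
    weighted = sum(n * acc for datasets in subcats.values() for n, acc in datasets)
    return weighted / total

print(f"composite reasoning score: {composite_score(subcategories):.3f}")
```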

Non-English fluency

How well the bot performs general tasks that require language-specific understanding and holds conversations in languages other than English. This helps describe non-English proficiency in languages that are commonly used by Poe users.
Language learning
Translations

How we measure it

We evaluated 14 commonly used languages: Arabic, Chinese, French, German, Indonesian, Italian, Japanese, Korean, Persian, Portuguese, Russian, Spanish, Turkish, and Vietnamese. We leveraged several reports on the distribution of online language usage to inform the weight applied to each language subcategory. Cross-referencing this weighting scheme with Poe language usage distributions revealed a similar trend on our platform. Within each language subcategory, datasets measuring simpler linguistic tasks, such as part-of-speech classification, lexicon, pronunciation, and translation, were assigned a weight of 1x; tasks requiring a deeper understanding and knowledge of the language were assigned a weight of 2x.
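
The sketch below illustrates how such a two-level scheme could combine per-language usage weights with the 1x/2x task-depth weights. The usage weights and accuracies are placeholders, not the distributions Poe actually used.

```python
# Illustrative two-level weighting: language usage weights on the
# outside, 1x/2x task-depth weights on the inside. All numbers are
# placeholders, not Poe's actual distributions.
SIMPLE, DEEP = 1.0, 2.0  # 1x for simpler linguistic tasks, 2x for deeper ones

languages = {
    # language: (usage_weight, [(task_weight, accuracy), ...])
    "Chinese":  (0.20, [(SIMPLE, 0.55), (DEEP, 0.48)]),
    "Spanish":  (0.15, [(SIMPLE, 0.62), (DEEP, 0.57)]),
    "Japanese": (0.10, [(SIMPLE, 0.58), (DEEP, 0.50)]),
}

def fluency_score(langs: dict) -> float:
    """Average task-weighted accuracy per language, then usage-weight across languages."""
    total_usage = sum(usage for usage, _ in langs.values())
    score = 0.0
    for usage, datasets in langs.values():
        weight_sum = sum(w for w, _ in datasets)
        per_language = sum(w * acc for w, acc in datasets) / weight_sum
        score += (usage / total_usage) * per_language
    return score

print(f"non-English fluency score: {fluency_score(languages):.3f}")
```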

Creativity

How well the bot can take direction on creative dimensions to produce expressive text which is well-aligned with the prompt. This helps describe proficiency at tasks like creative writing and role-play.
Fiction writing
Jokes and rhyming
Character development

How we measure it

Expert human evaluation in partnership with SurgeAI on a wide range of creative tasks, including writing stories, generating names, and role-play. The resulting rankings are converted to Elo ratings for comparability. Creativity is an emerging domain for evaluation, so we've open-sourced the prompts we used to facilitate a deeper understanding of LLM creative capabilities. See the Poe GitHub repository for more details.


Writing

How well the bot can follow prompts to generate or rephrase text in a way that is articulate and suited to the requested format. This helps describe proficiency at writing tasks such as drafting written content, proofreading, and summarization.
Copy editing
Emails
Blogs and essays
Grammar and tone
Summarization

How we measure it

Expert human evaluation in partnership with SurgeAI. Writing tasks are curated to model realistic Poe usage. Human appraisals are converted to an Elo score.


More on our approach

For the Reasoning and Non-English fluency dimensions, we began with a subset of OpenAI datasets and transformed all input messages into user messages in each LLM request to simulate the end-user experience. We replicated each test 3 times and took the mean of the LLM accuracy on each dataset to inform the resulting composite score. A general caveat with LLM evaluations is that, with enough bot-specific prompt engineering, it is possible to extract the desired responses; modifying each sample to speak to each LLM in the manner to which it best responds is a non-trivial task. To address this, we made improvements to answer recognition to ensure that bots were not penalized for varying levels of verbosity.
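
The exact answer-recognition improvements aren't specified here, but the idea can be sketched: scan the whole response for the final stated answer rather than requiring a terse, exact-match reply, so verbose bots aren't penalized. The `recognize_choice` helper below is hypothetical, not Poe's implementation.

```python
import re

# Hypothetical answer-recognition helper for multiple-choice items,
# sketching the idea described above: search the full (possibly
# verbose) response for the last stated choice instead of requiring
# an exact-match reply.
def recognize_choice(response: str, choices=("A", "B", "C", "D")) -> str | None:
    matches = [m for m in re.findall(r"\b([A-D])\b", response) if m in choices]
    return matches[-1] if matches else None

# A verbose, chain-of-thought style answer is still credited:
assert recognize_choice("Let's think step by step... so the answer is (B).") == "B"
```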

For Creativity and Writing, the Poe team identified prominent themes from independent user research survey data. These themes were validated against anonymous usage data to generate descriptions of Poe usage patterns. Using these guides, SurgeAI generated prompts reflective of Poe usage. Their human evaluators provided exhaustive rankings of the bots, with a minority of ties. We then converted the differing ranks into pairwise battles and computed Elo scores using the methodology published by LMSys.
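
As a rough sketch of that conversion (ties omitted for brevity), each per-task ranking can be expanded into one "battle" per pair of bots and scored with the standard online Elo update that LMSys describes. The K-factor and initial rating below are illustrative defaults, not the exact constants Poe used.

```python
import itertools

def elo_from_rankings(rankings, k=4.0, initial=1000.0):
    """Compute Elo scores from rankings: lists of bot names ordered best-to-worst."""
    ratings = {}
    for ranking in rankings:
        # Every higher-ranked bot "wins" one battle against each lower-ranked bot.
        for winner, loser in itertools.combinations(ranking, 2):
            rw = ratings.setdefault(winner, initial)
            rl = ratings.setdefault(loser, initial)
            # Standard Elo expected score for the winner, then online update.
            expected_win = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
            ratings[winner] = rw + k * (1.0 - expected_win)
            ratings[loser] = rl - k * (1.0 - expected_win)
    return ratings

# Two example task rankings (illustrative only):
tasks = [["GPT-4", "GPT-3.5-Turbo", "Llama-2-70b"],
         ["GPT-4", "Llama-2-70b", "GPT-3.5-Turbo"]]
print(elo_from_rankings(tasks))
```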


Chat with millions of bots on Poe

Try out any of Poe's official bots from OpenAI, Anthropic, Google, or Meta, and discover the millions of user-added bots built on top of them. Find the right bots for you, ranging from programming assistants to your favorite characters.
