Poe bot rankings

Poe's official bot evaluations combine industry-standard benchmarks with custom tests designed to be representative of actual Poe usage. Each official bot is rated across four attributes: Reasoning, Non-English fluency, Creativity, and Writing.

| Official bots | Reasoning (score) | Non-English fluency (score) | Creativity (Elo) | Writing (Elo) |
|---|---|---|---|---|
| GPT-4 | #1 (34.0) | #1 (50.2) | #1 (1170) | #1 (1106) |
| Claude-2-100k | #2 (26.0) | #4 (39.5) | #5 (925) | #5 (963) |
| Google-PaLM | #3 (24.6) | #2 (44.1) | #6 (881) | #6 (931) |
| Claude-instant | #4 (22.9) | #5 (36.8) | #4 (929) | #4 (967) |
| GPT-3.5-Turbo | #5 (19.3) | #3 (41.5) | #2 (1084) | #2 (1024) |
| Llama-2-70b | #6 (18.8) | #6 (29.1) | #3 (1009) | #3 (1008) |

The bots in these rankings were evaluated in September 2023.

Definitions

Reasoning

How well the bot arrives at a logical conclusion by following complex prompts. This helps describe proficiency at tasks like solving math questions and working through programming challenges.
Programming
Math and science
Homework help
Simplify complex topics

How we measure it

We selected industry benchmarks that depict each LLM's "out-of-the-box" capabilities in logical deduction, game reasoning, coding, and computation. The individual datasets contributing to each of these subcategories were normalized by sample size to arrive at the final weighting scheme. Each score is a weighted average of LLM accuracy across the subcategories, averaged over 3 runs.
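
As a rough illustration of this sample-size weighting, the sketch below computes a composite score from per-dataset accuracies. The subcategory names, dataset sizes, and accuracies are invented for the example; they are not Poe's actual evaluation data.

```python
# A minimal sketch of sample-size-normalized weighting, assuming each
# dataset's accuracy has already been averaged over 3 runs. All names
# and numbers below are illustrative, not Poe's data.
subcategories = {
    # subcategory: [(dataset_sample_size, mean_accuracy_over_runs), ...]
    "logical_deduction": [(500, 0.61), (250, 0.58)],
    "game_reasoning":    [(200, 0.44)],
    "coding":            [(300, 0.47)],
    "computation":       [(450, 0.52)],
}

def composite_score(subcats: dict) -> float:
    """Weight each dataset by its sample size, then average overall."""
    total = sum(n for datasets in subcats.values() for n, _ in datasets)
    weighted = sum(n * acc for datasets in subcats.values() for n, acc in datasets)
    return weighted / total

print(f"composite reasoning score: {composite_score(subcategories):.3f}")
```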

Non-English fluency

How well the bot performs general tasks that require language-specific understanding and holds conversations in languages other than English. This helps describe non-English proficiency in languages that are commonly used by Poe users.
Language learning
Translations

How we measure it

We evaluated 14 commonly used languages: Arabic, Chinese, French, German, Indonesian, Italian, Japanese, Korean, Persian, Portuguese, Russian, Spanish, Turkish, and Vietnamese. We leveraged several reports on the distribution of online language usage to inform the weight applied to each language subcategory. Cross-referencing this weighting scheme with Poe language usage distributions revealed a similar trend on our platform. Within each language subcategory, datasets measuring simpler linguistic tasks, such as part-of-speech classification, lexicon, pronunciation, and translation, were assigned a weight of 1x; tasks requiring a deeper understanding and knowledge of the language were assigned a weight of 2x.
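
The sketch below illustrates how such a two-level scheme could combine per-language usage weights with the 1x/2x task-depth weights. The usage weights and accuracies are placeholders, not the distributions Poe actually used.

```python
# Illustrative two-level weighting: language usage weights on the
# outside, 1x/2x task-depth weights on the inside. All numbers are
# placeholders, not Poe's actual distributions.
SIMPLE, DEEP = 1.0, 2.0  # 1x for simpler linguistic tasks, 2x for deeper ones

languages = {
    # language: (usage_weight, [(task_weight, accuracy), ...])
    "Chinese":  (0.20, [(SIMPLE, 0.55), (DEEP, 0.48)]),
    "Spanish":  (0.15, [(SIMPLE, 0.62), (DEEP, 0.57)]),
    "Japanese": (0.10, [(SIMPLE, 0.58), (DEEP, 0.50)]),
}

def fluency_score(langs: dict) -> float:
    """Average task-weighted accuracy per language, then usage-weight across languages."""
    total_usage = sum(usage for usage, _ in langs.values())
    score = 0.0
    for usage, datasets in langs.values():
        weight_sum = sum(w for w, _ in datasets)
        per_language = sum(w * acc for w, acc in datasets) / weight_sum
        score += (usage / total_usage) * per_language
    return score

print(f"non-English fluency score: {fluency_score(languages):.3f}")
```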

Creativity

How well the bot can take direction on creative dimensions to produce expressive text which is well-aligned with the prompt. This helps describe proficiency at tasks like creative writing and role-play.
Fiction writing
Jokes and rhyming
Character development

How we measure it

Expert human evaluation in partnership with SurgeAI on a wide range of creative tasks, including writing stories, generating names, and role-play. The resulting rankings are converted to Elo ratings for comparability. Creativity is an emerging domain for evaluation, so we've open-sourced the prompts we used to facilitate a deeper understanding of LLM creative capabilities. See the Poe GitHub repository for more details.


Writing

How well the bot can follow prompts to generate or rephrase text in a way that is articulate and suited to the requested format. This helps describe proficiency at writing tasks such as drafting written content, proofreading, and summarization.
Copy editing
Emails
Blogs and essays
Grammar and tone
Summarization

How we measure it

Expert human evaluation in partnership with SurgeAI. Writing tasks are curated to model realistic Poe usage. Human appraisals are converted to an Elo score.


More on our approach

For the Reasoning and Non-English fluency dimensions, we began with a subset of OpenAI datasets and transformed all input messages into user messages in each LLM request to simulate the end-user experience. We replicated each test 3 times and took the mean of the LLM accuracy on each dataset to inform the resulting composite score. A general caveat with LLM evaluations is that, with enough bot-specific prompt engineering, it is possible to extract the desired responses; modifying each sample to speak to each LLM in the manner to which it best responds is a non-trivial task. To address this, we made improvements to answer recognition to ensure that bots were not penalized for varying levels of verbosity.
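
The exact answer-recognition improvements aren't specified here, but the idea can be sketched: scan the whole response for the final stated answer rather than requiring a terse, exact-match reply, so verbose bots aren't penalized. The `recognize_choice` helper below is hypothetical, not Poe's implementation.

```python
import re

# Hypothetical answer-recognition helper for multiple-choice items,
# sketching the idea described above: search the full (possibly
# verbose) response for the last stated choice instead of requiring
# an exact-match reply.
def recognize_choice(response: str, choices=("A", "B", "C", "D")) -> str | None:
    matches = [m for m in re.findall(r"\b([A-D])\b", response) if m in choices]
    return matches[-1] if matches else None

# A verbose, chain-of-thought style answer is still credited:
assert recognize_choice("Let's think step by step... so the answer is (B).") == "B"
```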

For Creativity and Writing, the Poe team identified prominent themes from independent user research survey data. These themes were validated against anonymous usage data to generate descriptions of Poe usage patterns. Using these guides, SurgeAI generated prompts reflective of Poe usage. Their human evaluators provided exhaustive rankings of the bots, with a minority of ties. We then converted the differing ranks into pairwise battles and computed Elo scores using the methodology published by LMSys.
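
As a rough sketch of that conversion (ties omitted for brevity), each per-task ranking can be expanded into one "battle" per pair of bots and scored with the standard online Elo update that LMSys describes. The K-factor and initial rating below are illustrative defaults, not the exact constants Poe used.

```python
import itertools

def elo_from_rankings(rankings, k=4.0, initial=1000.0):
    """Compute Elo scores from rankings: lists of bot names ordered best-to-worst."""
    ratings = {}
    for ranking in rankings:
        # Every higher-ranked bot "wins" one battle against each lower-ranked bot.
        for winner, loser in itertools.combinations(ranking, 2):
            rw = ratings.setdefault(winner, initial)
            rl = ratings.setdefault(loser, initial)
            # Standard Elo expected score for the winner, then online update.
            expected_win = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
            ratings[winner] = rw + k * (1.0 - expected_win)
            ratings[loser] = rl - k * (1.0 - expected_win)
    return ratings

# Two example task rankings (illustrative only):
tasks = [["GPT-4", "GPT-3.5-Turbo", "Llama-2-70b"],
         ["GPT-4", "Llama-2-70b", "GPT-3.5-Turbo"]]
print(elo_from_rankings(tasks))
```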


Chat with millions of bots on Poe

Try out any of Poe's official bots from OpenAI, Anthropic, Google, or Meta, and discover the millions of user-added bots built on top of them. Find the right bots for you, ranging from programming assistants to your favorite characters.
