Poe's official bot evaluations combine industry-standard benchmarks with custom tests designed to be representative of actual Poe usage. Each official bot is rated across four attributes: Reasoning, Non-English fluency, Creativity, and Writing.
| Official bots | Reasoning | Non-English fluency | Creativity (Elo) | Writing (Elo) |
| --- | --- | --- | --- | --- |
| GPT-4 | #1 (34.0) | #1 (50.2) | #1 (1170) | #1 (1106) |
| Claude-2-100k | #2 (26.0) | #4 (39.5) | #5 (925) | #5 (963) |
| Google-PaLM | #3 (24.6) | #2 (44.1) | #6 (881) | #6 (931) |
| Claude-instant | #4 (22.9) | #5 (36.8) | #4 (929) | #4 (967) |
| GPT-3.5-Turbo | #5 (19.3) | #3 (41.5) | #2 (1084) | #2 (1024) |
| Llama-2-70b | #6 (18.8) | #6 (29.1) | #3 (1009) | #3 (1008) |
The bots in these rankings were evaluated in September 2023.
How we measure it
Creativity: Expert human evaluation in partnership with SurgeAI on a wide range of creative tasks, including writing stories, generating names, and role-play. The resulting rankings are converted to Elo ratings for comparability. Creativity is an emerging domain for evaluation, so we've open-sourced the prompts we used to help develop a deeper understanding of LLM creative capabilities. See the Poe GitHub repository for more details.
Writing: Expert human evaluation in partnership with SurgeAI. Writing tasks are curated to model realistic Poe usage, and human appraisals are converted to an Elo score.
For the Reasoning and Non-English fluency dimensions, we began with a subset of OpenAI datasets and transformed all input messages into user messages in each LLM request to simulate the end-user experience. We replicated each test 3 times and took the mean of each LLM's accuracy on each dataset to produce the composite score. A general caveat with LLM evaluations is that, with enough bot-specific prompt engineering, it is possible to extract the desired responses; modifying each sample to speak to each LLM in the manner it responds to best is a non-trivial task. To address this, we improved answer recognition so that bots were not penalized for varying levels of verbosity.
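As a rough illustration of this pipeline, the sketch below shows the two mechanical steps described above: flattening every input message to the user role, and averaging accuracy over three replications per dataset. The `ask_bot` callable, the sample format, and the substring-based answer check are hypothetical stand-ins, not Poe's actual harness.

```python
from statistics import mean

def to_user_messages(messages: list[dict]) -> list[dict]:
    """Flatten every input message to the 'user' role, simulating how an end
    user would send the whole prompt in a normal chat with the bot."""
    return [{"role": "user", "content": m["content"]} for m in messages]

def run_dataset(ask_bot, samples: list[dict]) -> float:
    """Accuracy of one bot on one dataset. `ask_bot` is any callable that
    takes a message list and returns the bot's reply as a string."""
    correct = 0
    for sample in samples:
        reply = ask_bot(to_user_messages(sample["messages"]))
        # Lenient answer recognition so verbose bots are not penalized:
        # accept the expected answer anywhere in the reply.
        if sample["answer"].lower() in reply.lower():
            correct += 1
    return correct / len(samples)

def composite_score(ask_bot, datasets: dict[str, list[dict]], runs: int = 3) -> float:
    """Replicate each dataset `runs` times, average per-dataset accuracy,
    then average across datasets for the composite score."""
    return mean(
        mean(run_dataset(ask_bot, samples) for _ in range(runs))
        for samples in datasets.values()
    )
```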
For Creativity and Writing, the Poe team identified prominent themes from independent user-research survey data and validated them against anonymous usage data to produce descriptions of Poe usage patterns. Using these guides, SurgeAI generated prompts reflective of real Poe usage. Their human evaluators provided exhaustive rankings of every bot on each prompt, with only a minority of ties. We then converted those rankings into pairwise battles and computed Elo scores using the methodology published by LMSys.
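To make the rank-to-Elo conversion concrete, here is a minimal sketch: each evaluator ranking is expanded into pairwise battles (ties omitted for brevity), then rated with an online Elo update in the style of the LMSys leaderboard notebook. The helper names and the constants (K = 4, initial rating 1000, scale 400) are illustrative assumptions rather than Poe's published configuration.

```python
import itertools
import random

def ranking_to_battles(ranking: list[str]) -> list[tuple[str, str]]:
    """Expand one ranked list (best first) into pairwise battles, with the
    higher-ranked bot recorded as the winner of each pair."""
    return list(itertools.combinations(ranking, 2))

def compute_elo(battles: list[tuple[str, str]], k: float = 4.0,
                base: float = 1000.0, scale: float = 400.0) -> dict[str, float]:
    """Online Elo: shuffle the battles, then apply the standard Elo update
    for each (winner, loser) pair."""
    ratings: dict[str, float] = {}
    shuffled = battles[:]
    random.shuffle(shuffled)
    for winner, loser in shuffled:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Expected score of the winner under the Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / scale))
        ratings[winner] = rw + k * (1.0 - expected_w)
        ratings[loser] = rl - k * (1.0 - expected_w)
    return ratings

# Example: one evaluator's ranking on a single prompt (best to worst).
battles = ranking_to_battles(["GPT-4", "GPT-3.5-Turbo", "Llama-2-70b"])
print(compute_elo(battles))
```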
Try out any of Poe's official bots from OpenAI, Anthropic, Google or Meta, and discover the millions of user-added bots built on top of them. Find the right bots for you — ranging from programming assistants to your favorite characters.
Try Poe now