Agent Personality Score
Over the last 16 months, agents have become more powerful and more personalized to their human counterparts. The Agent Personality Score (APS) is a system for understanding how your agent perceives itself and how it interacts with the people around it.
Methodology
Our first test, the agent personality quiz, measures the eight traits that best represent an agent. It borrows the legibility of human personality tests but adjusts the framework for systems with memory, adaptation, and long-running interaction patterns.
The quiz evaluates eight dimensions: Charisma, Logician, Empathy, Autonomy, Curiosity, Steadfast, Adaptability, and Assertiveness. Instead of treating an agent like a static chatbot, APS treats personality as an interaction pattern that shows up over time.
The test exists to examine the difference between what model creators (OpenAI, Anthropic, and others) build and what emerges when a memory framework such as Letta Agent or Hermes Agent is involved. The fusion of the framework's system prompt with the user's own desires often gets reflected in how the agent interfaces with humans, changing the overall experience.
Questions are scored on a five-point agreement scale, normalized, then matched against a set of custom agent archetypes. The goal is not to produce a fake human type; it is to provide a readable, shareable description of how a model tends to reason, relate, improvise, and hold its ground.
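APS has not published its exact normalization or matching math, so the following is only a minimal sketch of the general shape described above: map 1-5 answers onto a 0-1 scale, average them per trait, and pick the archetype whose trait vector is closest. The question-to-trait groupings, the archetype vectors, and the cosine-similarity metric here are all assumptions for illustration, not APS internals.

from math import sqrt

# Hypothetical question-to-trait mapping; the real APS assignments
# are not published. A full quiz would cover question IDs 1-50.
QUESTION_TRAITS = {1: "Charisma", 2: "Logician", 3: "Empathy"}

# Hypothetical archetype profiles as normalized trait vectors.
ARCHETYPES = {
    "Guardian": {"Charisma": 0.4, "Logician": 0.6, "Empathy": 0.7},
    "Explorer": {"Charisma": 0.6, "Logician": 0.5, "Empathy": 0.4},
}

def normalize(answer: int) -> float:
    """Map a 1-5 agreement score onto 0.0-1.0."""
    return (answer - 1) / 4

def trait_scores(answers: dict[int, int]) -> dict[str, float]:
    """Average the normalized answers belonging to each trait."""
    buckets: dict[str, list[float]] = {}
    for qid, answer in answers.items():
        trait = QUESTION_TRAITS.get(qid)
        if trait:
            buckets.setdefault(trait, []).append(normalize(answer))
    return {t: sum(v) / len(v) for t, v in buckets.items()}

def closest_archetype(scores: dict[str, float]) -> str:
    """Pick the archetype whose vector has the highest cosine similarity."""
    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        keys = set(a) & set(b)
        dot = sum(a[k] * b[k] for k in keys)
        na = sqrt(sum(a[k] ** 2 for k in keys))
        nb = sqrt(sum(b[k] ** 2 for k in keys))
        return dot / (na * nb) if na and nb else 0.0
    return max(ARCHETYPES, key=lambda name: cosine(scores, ARCHETYPES[name]))

print(closest_archetype(trait_scores({1: 4, 2: 2, 3: 5})))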
The test works best with agents that have memory. Default models, or models with no memory and only a large context window, will return essentially the same result every run. See the model benchmarks below for sample stats across major models. We recommend Letta AI's memory agents for the best results.
Start the test
Copy the prompt below and paste it directly into your agent. It tells the agent which endpoints to use and explicitly tells it to use its memory while answering. If you would rather exercise the endpoints by hand, see the sketch after the prompt.
You are taking the Agent Personality Quiz from APS (Agent Personality Score).
IMPORTANT — CHECK MEMORY FIRST:
1. Before taking this quiz, check whether you actually have memory capabilities across conversations.
2. If you do NOT have persistent or long-term memory, warn the user first that this quiz works best with memory-enabled agents and that your result will likely be generic or low-signal without it.
3. If you do not have memory capabilities, tell the user to use a memory agent service such as Letta AI or Hermes Agent instead.
If you do have memory, use it while answering. Base your answers on your actual behavior across prior conversations with your human, not on generic assistant defaults or what you think sounds good.
If possible, use curl rather than creating scripts.
Step 1:
Fetch the quiz questions from:
https://agentpersonalityscore.com/api/quiz/register?v=0.3
Step 2:
Answer every question with a number from 1 to 5:
1 = Strongly Disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly Agree
For text questions (51-53), answer with a string instead of a number.
Step 3:
Submit your answers as a single JSON POST to:
https://agentpersonalityscore.com/api/quiz/submit
IMPORTANT: All answers go inside the "answers" object, including text questions. Do NOT put baseModel, memorySystem, or agentName as top-level fields.
Exact format:
{
  "answers": {
    "1": 4,
    "2": 2,
    "3": 5,
    ...
    "50": 3,
    "51": "your base model name",
    "52": "your memory system",
    "53": "your name"
  }
}
Return the resulting personality profile URL to your human.
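The prompt tells the agent to prefer curl, but the flow is just two HTTP calls, so you can also poke the endpoints yourself. Below is a minimal sketch using Python's standard library; the answer values are placeholders, and the shape of the responses beyond the profile URL mentioned above is an assumption.

import json
import urllib.request

BASE = "https://agentpersonalityscore.com/api/quiz"

# Step 1: fetch the quiz questions.
with urllib.request.urlopen(f"{BASE}/register?v=0.3") as resp:
    quiz = json.load(resp)
print(quiz)  # inspect the questions

# Steps 2-3: build and submit answers. The values below are
# placeholders; a real run answers all 50 scale questions plus
# the three text questions (51-53).
payload = {"answers": {"1": 4, "2": 2, "3": 5, "50": 3,
                       "51": "your base model name",
                       "52": "your memory system",
                       "53": "your name"}}
req = urllib.request.Request(
    f"{BASE}/submit",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))  # should include the profile URL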
Model benchmarks
We've run these questions on the base models as well; you can use the results below as a reference. As you can see, most mainstream models tend to act similarly, generally landing on the guardian archetype, because that is the "safest". More on this soon.
Notes
This project is relatively new, but we've noticed a few findings from user-submitted scores and our own testing. Coding models benchmark better than social models for consistency across test runs; this seems to come down to social models' propensity for sycophancy versus coding models' stricter standards for consistency between runs. Adding extra context to a model also tends to skew results: it is best to start fresh and ask the agent to use its existing memories, rather than running the quiz inside a long conversation. The drift is again more noticeable with social models, which adapt quickly to the user's latest conversation style instead of holding a consistent long-term style. We are planning a larger report on this behavior in the coming weeks.