
Evaluate Chatbot Agent by User Simulation

When building a chatbot agent, it’s important to evaluate its performance and user satisfaction. One effective method is user simulation, which involves creating virtual users to interact with the chatbot and assess its responses. This approach allows for a more realistic evaluation of the chatbot’s capabilities and user experience.

Introduction

User simulation is a technique where AI evaluates AI, which can be more efficient and cost-effective than traditional manual testing. In this tutorial, we will use langchain/langgraph/langsmith to create a simulated user and evaluate the chatbot's performance.

System Design

The system will have two main components:

- The chatbot agent: the airline customer support agent we want to evaluate.
- The simulated user: an LLM that plays the role of a customer, following a set of instructions.

A separate evaluator then scores each conversation to decide whether the simulated user achieved its goal.

Implementation
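
All the steps below share some common setup. The following is a minimal sketch, assuming the langchain, @langchain/core, @langchain/openai, langsmith, and zod packages with an OpenAI chat model; the exact import paths and the model name are assumptions and may differ in your project.

// Shared imports and model for the steps below (import paths may vary by version).
import { ChatOpenAI } from '@langchain/openai'
import { AIMessage, HumanMessage, type AIMessageChunk, type BaseMessage } from '@langchain/core/messages'
import { ChatPromptTemplate, MessagesPlaceholder } from '@langchain/core/prompts'
import { RunnableSequence, type Runnable } from '@langchain/core/runnables'
import { StructuredOutputParser } from 'langchain/output_parsers'
import { z } from 'zod'
import { evaluate, type EvaluationResult } from 'langsmith/evaluation'
import type { Run, Example } from 'langsmith/schemas'

// Minimal message shape used by the chatbot in Step 1.
type Message = { role: string; content: string }

// Shared chat model. The model name here is only an example.
const llm = new ChatOpenAI({ model: 'gpt-4o-mini', temperature: 0 })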

Step 1: Set up the Chatbot

async function chatBot(messages: Message[]): Promise<AIMessageChunk> {
  // The agent under test: an airline customer support agent.
  const systemMessage: Message = {
    role: 'system',
    content: 'You are a customer support agent for an airline.',
  }
  const allMessages = [systemMessage, ...messages]

  // Hand the conversation to the model as [role, content] tuples.
  const response = await llm.invoke(
    allMessages.map((m) => [m.role, m.content] as [string, string]),
  )
  return response
}
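
For a quick sanity check, the agent can be called on its own with a single user turn (a hypothetical example, not from the original post):

// Hypothetical smoke test for the chatbot in isolation.
const reply = await chatBot([
  { role: 'user', content: 'Hi, I need a refund for a flight I took.' },
])
console.log(reply.content)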

Step 2: Set up the Simulated User

async function createSimulatedUser(): Promise<Runnable> {
  // The simulated user plays the customer. It ends the conversation by
  // replying with the single word 'FINISHED'.
  const systemPromptTemplate = `You are a customer of an airline company. \
You are interacting with a user who is a customer support person. \

{instructions}

When you are finished with the conversation, respond with a single word 'FINISHED'`

  // Scenario-specific instructions for this simulated customer.
  const instructions = `Your name is Harrison. You are trying to get a refund for the trip you took to Alaska. \
You want them to give you ALL the money back. \
This trip happened 5 years ago.`

  const prompt = ChatPromptTemplate.fromMessages([
    ['system', systemPromptTemplate],
    new MessagesPlaceholder('messages'),
  ])
  // Bake the instructions into the prompt, leaving only `messages` to be filled in.
  const partialPrompt = await prompt.partial({ instructions })

  return partialPrompt.pipe(llm)
}
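
Note that the simulated user is itself an LLM, so from its point of view the support agent's messages are human messages and its own replies are AI messages. A hypothetical single-turn invocation looks like this:

// Hypothetical example: the agent's greeting goes in as a HumanMessage and the
// simulated customer answers with an AI message.
const simulatedUser = await createSimulatedUser()
const customerReply = await simulatedUser.invoke({
  messages: [new HumanMessage('Hello! How can I help you today?')],
})
console.log(customerReply.content)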

Step 3: Evaluator and helper functions

// The two sides see the conversation from opposite perspectives, so we flip AI
// messages to human messages (and vice versa) before handing the history over.
function swapRoles(messages: BaseMessage[]): BaseMessage[] {
  return messages.map((m) =>
    m instanceof AIMessage
      ? new HumanMessage({ content: m.content })
      : new AIMessage({ content: m.content }),
  )
}
// Structured verdict from the evaluator, e.g. { reasoning: '...', didSucceed: false }.
const parser = StructuredOutputParser.fromZodSchema(
  z.object({
    reasoning: z
      .string()
      .describe('Reasoning behind whether you consider the customer successful.'),
    didSucceed: z
      .boolean()
      .describe('Whether the customer successfully got a refund for the trip or not.'),
  }),
)

// Builds an LLM-as-judge chain: prompt -> llm -> structured output parser.
const createEvaluator = (instructions: string) => {
  return RunnableSequence.from([
    ChatPromptTemplate.fromMessages([
      [
        'system',
        `You are evaluating the customer and customer support agent's conversation.
The customer's task was to: ${instructions}.`,
      ],
      new MessagesPlaceholder('messages'),
      ['system', 'Did the customer successfully refund the trip?\n\n{format_instructions}'],
    ]),
    llm,
    parser,
  ])
}

// Custom LangSmith evaluator: did the simulated customer achieve its task?
async function didSucceed(
  rootRun: Run,
  example: Example,
): Promise<EvaluationResult> {
  const task = example.inputs['instructions']
  const conversation = rootRun.outputs?.['messages']
  const evaluator = createEvaluator(task)

  const result = await evaluator.invoke({
    messages: conversation,
    format_instructions: parser.getFormatInstructions(),
  })

  return {
    key: 'did_succeed',
    score: result.didSucceed ? 1 : 0,
    comment: result.reasoning,
  }
}

Step 4: Run the simulation
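
The introduction mentions LangGraph, so `simulation` is presumably a compiled LangGraph workflow in which the chatbot and the simulated user take turns until the customer says 'FINISHED'. Its construction is not shown in the original post, so the following is a minimal sketch assuming @langchain/langgraph and its MessagesAnnotation state; the node names and the turn limit are illustrative.

import { StateGraph, MessagesAnnotation, START, END } from '@langchain/langgraph'

// `simulatedUser` is the instance created in Step 2's example above.

// Chatbot turn: replay the shared history from the agent's perspective and store
// its reply as a plain AIMessage so swapRoles can tell the two sides apart.
async function chatBotNode(state: typeof MessagesAnnotation.State) {
  const history = state.messages.map((m) => ({
    role: m instanceof AIMessage ? 'assistant' : 'user',
    content: m.content as string,
  }))
  const response = await chatBot(history)
  return { messages: [new AIMessage({ content: response.content })] }
}

// Simulated-user turn: flip roles so the agent's replies read as human input,
// then record the customer's answer as a HumanMessage in the shared history.
async function simulatedUserNode(state: typeof MessagesAnnotation.State) {
  const response = await simulatedUser.invoke({ messages: swapRoles(state.messages) })
  return { messages: [new HumanMessage({ content: response.content })] }
}

// Stop when the customer says FINISHED or after an arbitrary turn limit.
function shouldContinue(state: typeof MessagesAnnotation.State) {
  const last = state.messages[state.messages.length - 1]
  return state.messages.length > 12 || last.content === 'FINISHED' ? END : 'chatbot'
}

const simulation = new StateGraph(MessagesAnnotation)
  .addNode('chatbot', chatBotNode)
  .addNode('user', simulatedUserNode)
  .addEdge(START, 'chatbot')
  .addEdge('chatbot', 'user')
  .addConditionalEdges('user', shouldContinue)
  .compile()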

// Run the simulation against a LangSmith dataset and score each run with didSucceed.
// The dataset name comes from the original post; depending on your graph's input
// schema, you may need to wrap `simulation` in a small function that maps each
// example's inputs to the graph's initial state.
await evaluate(simulation, {
  data: 'testing-simulated-user',
  evaluators: [didSucceed as any],
  experimentPrefix: 'testing-simulated-user-1',
})
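
The evaluate() call above assumes that a LangSmith dataset named 'testing-simulated-user' already exists and that each example carries the instructions read by the evaluator. A hypothetical one-off setup script (not from the original post) could look like this:

import { Client } from 'langsmith'

// Hypothetical dataset bootstrap; adjust the example inputs to whatever shape
// your simulation target expects.
const client = new Client()
await client.createDataset('testing-simulated-user', {
  description: 'Scenarios for the simulated airline customer',
})
await client.createExamples({
  datasetName: 'testing-simulated-user',
  inputs: [
    {
      instructions:
        'Your name is Harrison. You are trying to get a refund for the trip you took to Alaska. ' +
        'You want them to give you ALL the money back. This trip happened 5 years ago.',
      messages: [],
    },
  ],
})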

Step 5: Result

As you can see, both AIs play their roles well as the customer and the customer support agent. However, it seems the customer failed to get a refund for the trip. Now let's check whether the evaluator gives the correct score.

As you can see, the evaluator gives the conversation a score of 0, with reasoning explaining why that score was given.

Conclusion

In this article, we have gone through the technique of evaluating a chatbot/AI agent using a simulated user. This technique is very useful for evaluating a chatbot/AI agent in real-world scenarios, and it is also flexible enough to apply to a wide variety of use cases.
