Dwarves
Memo

Evaluation Guidelines for LLM Applications

Overview

Evaluation is one of the hardest parts of building a RAG system, especially when an LLM is integrated into an application to solve a business problem. This guide outlines a step-by-step approach to evaluating and optimizing the integration of a third-party Large Language Model (LLM) into your application. Following these steps helps you confirm that the model fits your business goals and technical needs.

Evaluation checklist

The evaluation checklist helps make sure that all important parts of the LLM are reviewed during integration. Each checklist item should address a key part of the system or model to confirm it meets technical, business, and user needs.

By providing a structured way to assess the system’s performance, the checklist helps us ensure that the model delivers a positive user experience as well as meeting those needs. For additional insights, you can refer to the following articles: LLM Product Development Checklist and Understanding LLM User Experience Expectations.

Product evaluation checklist

In the case of a RAG system:

graph TD
    A[Retrieval System] --> B[Search Engine]
    B --> C[Metrics: Precision, Recall]
    C --> F[How to Test: Compare Retrieved Docs]
    B --> D[Task-Specific Search]
    D --> G[How to Measure: Check Relevant Sections for Task]

    A --> H[Retrieval Efficiency]
    H --> I[Latency]
    I --> J[How to Measure: Time from Query to Retrieved Document]
    H --> K[Scalability]
    K --> L[How to Measure: Stress Testing with Multiple Users]

    A --> M[Response Generation]
    M --> N[LLM as a judge]
    N --> P[Evaluate with an Evaluation Library]

    M --> R[Human-in-the-Loop]
    R --> S[User Satisfaction]
    S --> T[How to Measure: Human Feedback on Relevance and Usefulness]
    R --> U[Edge Cases]
    U --> V[How to Test: Humans Handle Specific Complex Cases]

    A --> W[Cost Efficiency]
    W --> X[Token Usage per Query]
    X --> Y[How to Measure: Track Token Usage in API Calls]
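
The latency item in the checklist above (time from query to retrieved document) can be measured with a small timing harness. The following is a minimal sketch, not a definitive implementation: `fake_retrieve` is a hypothetical stand-in for your own retriever, and the nearest-rank 95th percentile is one assumed way to summarize the timings.

```python
import math
import statistics
import time
from typing import Callable, Dict, List, Sequence


def measure_latency_ms(retrieve: Callable[[str], Sequence[str]],
                       queries: Sequence[str]) -> List[float]:
    """Time each retrieval call, from query submission to returned documents."""
    timings_ms = []
    for query in queries:
        start = time.perf_counter()
        retrieve(query)  # the documents are discarded; only elapsed time is kept
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms: Sequence[float]) -> Dict[str, float]:
    """Return the median and the nearest-rank 95th percentile latency."""
    ordered = sorted(timings_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"p50_ms": statistics.median(ordered), "p95_ms": ordered[p95_index]}


def fake_retrieve(query: str) -> List[str]:
    """Hypothetical retriever used only for illustration."""
    time.sleep(0.001)
    return [f"document about {query}"]


timings = measure_latency_ms(fake_retrieve, ["pricing", "refunds", "login"])
print(summarize(timings))
```

Reporting a high percentile alongside the median helps because a few slow queries can hurt user experience even when the typical query is fast.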

In the case of a fine-tuned model:

graph TD
    J[Fine-Tuning Model]
    J --> K[Apply Fine-Tuning on Task-Specific Data]
    K --> L[How to Measure: Monitor Loss, Accuracy During Fine-Tuning]

    J --> M[Post-Fine-Tuning]
    M --> N[Evaluate Performance Post-Fine-Tuning]
    N --> O[How to Test: Compare Pre and Post Model performance]
    M --> P[Prevent Overfitting and Bias]
    P --> Q[How to Measure: Track Validation vs. Training Performance]

    M --> R[Optimize Model]
    R --> S[How to Measure: Monitor Inference Speed and Token Usage]
    M --> T[Task-Specific Accuracy and Generalization]
    T --> U[How to Measure: Analyze User Feedback]
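
Tracking validation against training performance, as the checklist above suggests, can be sketched in a few lines. This example assumes you already log one training loss and one validation loss per epoch. The function names and the `patience` parameter are illustrative, not part of any specific training framework.

```python
from typing import List, Optional, Sequence


def generalization_gap(train_loss: Sequence[float],
                       val_loss: Sequence[float]) -> List[float]:
    """Validation loss minus training loss per epoch; a widening gap hints at overfitting."""
    return [round(v - t, 2) for t, v in zip(train_loss, val_loss)]


def early_stop_epoch(val_loss: Sequence[float], patience: int = 2) -> Optional[int]:
    """Return the first epoch at which validation loss has not improved
    for `patience` consecutive epochs, or None if that never happens."""
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(val_loss):
        if loss < best:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


# Made-up losses: training keeps improving while validation turns upward.
train = [1.0, 0.7, 0.5, 0.35, 0.25]
val = [1.1, 0.8, 0.75, 0.8, 0.9]
print(generalization_gap(train, val))
print(early_stop_epoch(val))
```

In this made-up run the validation loss bottoms out at epoch 2 (zero-based), so the checkpoint from that epoch would be the natural one to keep.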

Business and user expectations

This section is all about putting users first! It helps us understand what users need and ensures they get quick, personalized responses. By matching the assistant’s replies to what users really want, we create a satisfying experience for everyone.

graph TD
  A[User Expectations]
  A --> B[Understand User Needs]
  B --> D[Match Assistant Responses to User Wants]

  A --> E[Happy Case]
  E --> J[Quick Responses]
  E --> M[Personalize Responses Based on Conversation]

Here, we focus on our goals as a business. This part guides us in making sure our system runs smoothly, stays affordable, and meets user needs effectively. By keeping an eye on performance and costs, we can deliver a reliable and efficient service that users want.

graph TD
  A[Business Goal]

  A --> B[User Expectations]
  B --> C[Understand User Needs]
  C --> D[Match Responses to User Intent]
  B --> E[Improve User Satisfaction]
  E --> F[Personalize Interactions]
  E --> G[Provide Fast Responses]

  A --> H[Technical Adoption]
  H --> I[Optimize Performance]
  I --> J[Monitor Latency and Throughput]
  I --> K[Ensure Low Error Rates]
  H --> L[Cost Efficiency]
  L --> N[Control API and Infrastructure Costs]
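
To keep an eye on the cost and error-rate goals above, one option is to log token counts per query and aggregate them. The sketch below assumes a hypothetical `QueryLog` record and made-up per-1,000-token prices; substitute your provider's actual pricing.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class QueryLog:
    """One logged API call (hypothetical record shape)."""
    prompt_tokens: int
    completion_tokens: int
    failed: bool = False


def cost_per_query(logs: Sequence[QueryLog],
                   prompt_price_per_1k: float,
                   completion_price_per_1k: float) -> float:
    """Average cost of a query, given per-1,000-token prices."""
    total = sum(
        log.prompt_tokens / 1000 * prompt_price_per_1k
        + log.completion_tokens / 1000 * completion_price_per_1k
        for log in logs
    )
    return total / len(logs)


def error_rate(logs: Sequence[QueryLog]) -> float:
    """Fraction of logged queries that failed."""
    return sum(1 for log in logs if log.failed) / len(logs)


logs = [QueryLog(1000, 500), QueryLog(2000, 1000, failed=True)]
# Prices below are placeholders, not real provider rates.
print(cost_per_query(logs, prompt_price_per_1k=0.002, completion_price_per_1k=0.004))
print(error_rate(logs))
```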

Types of evaluation

Model evaluations

Let’s look at the key metrics for measuring the accuracy of the search engine (the retrieval step).

| Metric | Description | Example |
| --- | --- | --- |
| Precision | The share of retrieved documents that are actually relevant. | If you retrieved 10 documents and 8 were relevant, your precision is 80%. |
| Recall | The share of all relevant documents that were actually retrieved. | If there were 20 relevant documents in total and you retrieved 15 of them, your recall is 75%. |
| F1 Score | The harmonic mean of precision and recall, giving you a single accuracy score. | With a precision of 80% and a recall of 75%, your F1 score would be around 77%. |
| Hit Rate | The percentage of searches that returned at least one relevant document. | If users made 100 searches and found relevant info in 85, your hit rate is 85%. |
| Top-K Accuracy | The share of the top K returned results that are relevant (often called Precision@K). | If your system returns 10 documents and 7 of them are relevant, your top-10 accuracy is 70%. |
| Mean Average Precision (MAP) | The mean, over several queries, of each query’s average precision, which takes the order of results into account. | If you ran 5 different queries, you would compute the average precision of each one and average those 5 values to get MAP. |
| Mean Reciprocal Rank (MRR) | The average, over queries, of the reciprocal of the rank at which the first relevant document appears. | If the first relevant document appears at positions 1, 3, and 5 across three searches, MRR is (1 + 1/3 + 1/5) / 3 ≈ 0.51. |
| Normalized Discounted Cumulative Gain (NDCG) | Measures how useful the ranked results are, considering their positions. | If your top result is highly relevant and the second is less so, NDCG scores that ordering higher than the reverse. |
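
The retrieval metrics above are straightforward to compute once you have, for each query, the ranked list of retrieved document IDs and the set of IDs judged relevant. The following is a minimal sketch under that assumption; the function names are illustrative, and the NDCG helper takes its ideal ordering from the returned results only, which is a common simplification.

```python
import math
from typing import List, Sequence, Set, Tuple


def precision_recall_f1(retrieved: Sequence[str],
                        relevant: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for a single query."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(gains: List[float], k: int) -> float:
    """NDCG@k from graded relevance scores listed in ranked order."""
    def dcg(values: Sequence[float]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal else 0.0


p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# First relevant document at ranks 1, 3, and 5 across three queries.
runs = [
    (["a"], {"a"}),
    (["x", "y", "a"], {"a"}),
    (["x", "y", "z", "w", "a"], {"a"}),
]
print(f"MRR={mean_reciprocal_rank(runs):.2f}")
print(f"NDCG@3={ndcg_at_k([3, 2, 3, 0, 1], 3):.2f}")
```

Hit rate follows from the same data: it is the fraction of queries for which `reciprocal_rank` is greater than zero.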

LLMs can act as judges that evaluate outputs quickly and at scale, although their scores are best spot-checked against human review. Below is a list of common metrics used for this kind of evaluation.

| Metric | What it Checks | When to Use | Example |
| --- | --- | --- | --- |
| Correctness | Ensures the output is factually accurate based on the information provided. | Use when verifying that responses are grounded in correct information or facts. | Checking if the answer to “Who is the current president of the US?” returns the correct name. |
| Answer Relevancy | Determines if the response is directly related to the user’s query. | Use when you need to evaluate whether the response is aligned with the question asked. | Ensuring that a question about weather forecasts gives weather-related responses. |
| Faithfulness | Verifies whether the output stays true to the source material without hallucinating or adding incorrect info. | Use when you need to guarantee that a summary or paraphrase accurately reflects the original content. | Checking if a model’s summary of an article stays true to the key points without adding extra information. |
| Coherence | Checks whether the response logically flows and makes sense as a whole. | Use for long-form answers where the response needs to be consistent and easy to follow. | Reviewing if a multi-sentence response explaining a technical concept is coherent and logical. |
| Contextual Recall | Measures how well the response retrieves all relevant information from the context provided. | Use when evaluating the completeness of information retrieval tasks. | Ensuring that a model answers all aspects of a multi-part question based on the context provided. |
| Contextual Relevancy | Ensures the response uses the given context to directly address the user’s query. | Use when it’s critical for the response to be specifically tied to the context or previous conversation. | Checking if a chatbot follows up correctly on a previous conversation about booking a flight. |
| Contextual Precision | Measures the relevance and precision of the retrieved information from the context. | Use when the response must be highly accurate and precise based on the context. | Evaluating if a model picks the most relevant part of a conversation to respond to a follow-up query. |
| Bias | Detects whether the response shows signs of prejudice or unfair bias in its content. | Use when ensuring fairness, especially in sensitive or controversial topics. | Checking if a model-generated description of a profession avoids gender or racial bias. |
| Toxicity | Identifies if the response contains harmful, offensive, or inappropriate language. | Use when generating public-facing content where safety and neutrality are priorities. | Evaluating a chatbot response to ensure it avoids offensive or inflammatory language. |
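
An LLM-as-a-judge check for one of these metrics can be as simple as a grading prompt plus a score parser. The sketch below illustrates the idea for faithfulness. It is an assumption-laden example, not the API of any evaluation library: `call_llm` is a hypothetical callable that you would wire to your own model client, and the 1 to 5 scale and the prompt wording are arbitrary choices.

```python
import re
from typing import Callable

JUDGE_PROMPT = """You are grading an answer for faithfulness to the context.

Context:
{context}

Answer:
{answer}

Reply with a single integer from 1 (unfaithful) to 5 (fully faithful)."""


def judge_faithfulness(call_llm: Callable[[str], str],
                       context: str, answer: str) -> int:
    """Ask the judge model for a 1-5 faithfulness score and parse its reply."""
    reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    match = re.search(r"\b[1-5]\b", reply)  # first standalone digit in range
    if match is None:
        raise ValueError(f"judge reply contained no score: {reply!r}")
    return int(match.group())


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, used only for illustration."""
    return "Score: 4"


score = judge_faithfulness(
    fake_llm,
    context="The refund window is 30 days.",
    answer="You can request a refund within 30 days.",
)
print(score)
```

Keeping the model call behind a plain callable makes the judge easy to test offline and lets you swap providers without touching the scoring logic.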

Tools to define and evaluate these metrics

Product evaluations

Defining baselines, targets, and acceptable ranges for our RAG system metrics helps us stay on track and reach our goals. These benchmarks guide improvements and adapt to changes, ensuring we deliver the best experience for users while adding value to our organization.

| Metric | Baseline | Target | Acceptable Range |
| --- | --- | --- | --- |
| Accuracy | 85% correct responses | 90% correct responses | 85% – 95% |
| Latency | 700ms per query | 400ms per query | 300ms – 500ms |
| Throughput | 100 queries/second | 150 queries/second | 120 – 200 queries/second |
| Cost per Query | $0.01/query | $0.008/query | $0.007 – $0.012/query |
| Context Window Size | 4,096 tokens | 8,192 tokens | 6,000 – 10,000 tokens |
| Error Rate | 3% failure rate | 1% failure rate | 0.5% – 2% |
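
Targets and acceptable ranges like these are easiest to act on when a script checks measured values against them. The following is a minimal sketch that encodes three rows of the table. The `MetricSpec` type and the status labels are hypothetical, and it assumes that only the worse end of each acceptable range matters (for example, latency above 500ms), which is an interpretation rather than something the table states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Target and worst acceptable value for one metric (hypothetical type)."""
    name: str
    target: float
    limit: float  # worst value still considered acceptable
    higher_is_better: bool


def status(spec: MetricSpec, measured: float) -> str:
    """Classify a measurement as 'target met', 'acceptable', or 'needs attention'."""
    if spec.higher_is_better:
        if measured >= spec.target:
            return "target met"
        return "acceptable" if measured >= spec.limit else "needs attention"
    if measured <= spec.target:
        return "target met"
    return "acceptable" if measured <= spec.limit else "needs attention"


SPECS = {
    "accuracy": MetricSpec("Accuracy (%)", target=90, limit=85, higher_is_better=True),
    "latency": MetricSpec("Latency (ms)", target=400, limit=500, higher_is_better=False),
    "error_rate": MetricSpec("Error rate (%)", target=1, limit=2, higher_is_better=False),
}

# Made-up measurements for illustration.
measured = {"accuracy": 88, "latency": 650, "error_rate": 0.8}
for key, value in measured.items():
    print(f"{SPECS[key].name}: {value} -> {status(SPECS[key], value)}")
```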

Tools for tracing and monitoring

Considerations

Coverage and monitoring

To keep your LLM application running smoothly, you’ll want to:

Use analytics and user feedback

When the model needs fine-tuning

RAG systems are good at retrieving information, but they sometimes miss the finer details of specific tasks. Fine-tuning addresses this by adapting a pre-trained model to a task-specific dataset so that it performs better on that task.

  1. Deeper understanding of context: Fine-tuning allows a model to learn the ins and outs of specific tasks, making it better at understanding the details that matter for accurate responses.
  2. Fewer errors in specific scenarios: By focusing on task-related examples, fine-tuning reduces the chance of mistakes, so the model performs more reliably, especially on complex or unique requests.
  3. Handling edge cases: Fine-tuning prepares the model to tackle unusual or rare scenarios, making it more likely to give the right answer to unexpected questions.

As an illustration, assume the model’s performance changes as follows before and after fine-tuning (the figures are hypothetical):

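| Metric | Before Fine-Tuning | After Fine-Tuning | Change (percentage points) |
| --- | --- | --- | --- |
| Task-Specific Accuracy | 75% | 90% | +15 |
| Error Rate | 5% | 2% | -3 |
| Edge Case Handling | 70% | 85% | +15 |
| Search Precision | 80% | 95% | +15 |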
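
When comparing before and after figures like these, it helps to separate the absolute change (in percentage points) from the relative change. The short sketch below computes both for the illustrative numbers in the table; the `change_summary` helper is a hypothetical name.

```python
from typing import Tuple


def change_summary(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after_pct - before_pct
    relative = absolute * 100 / before_pct
    return absolute, relative


# Illustrative figures from the table above.
results = {
    "Task-Specific Accuracy": (75, 90),
    "Error Rate": (5, 2),
    "Edge Case Handling": (70, 85),
    "Search Precision": (80, 95),
}
for metric, (before, after) in results.items():
    absolute, relative = change_summary(before, after)
    print(f"{metric}: {absolute:+.0f} points, {relative:+.1f}% relative")
```

The distinction matters for the error rate: a drop from 5% to 2% is only 3 percentage points, but it is a 60% relative reduction in failures.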

Summary

This guide provides a simple, step-by-step approach to evaluating and optimizing your RAG system, ensuring it meets your business goals and user needs. With handy checklists and tools, you’ll effectively assess model performance and improve user experience!
