Dwarves
Memo
Type ESC to close search bar

LLM as a judge

With the robust growth of LLM models currently, there is a new method used to evaluate the performance of large language models (LLMs): LLM-as-a-Judge, also known as LLM-evaluators. This approach takes advantages of other advanced language models to assess the quality and effectiveness of responses generated by other LLMs.

Introduction

LLM-as-a-Judge is a powerful solution that uses LLMs to evaluate LLM responses based on any specific criteria of your choice, which means using LLMs to carry out LLM (system) evaluation. This approach offers an alternative to traditional human evaluation, which can be both costly and time-consuming. The LLM-as-a-Judge framework encompasses three main types:

Example:

prompt= """
Given the folowing question and answer, evaluate how good the answer is for the question. Use the score from 1 to 5:

Q: {{question}}
A: {{answer}}
Score:
"""

The idea is simple: give an AI language model a set of criteria and let it evaluate responses for you.

Problems

As you might expect, LLM judges are not all rainbows and sunshines. They also suffer from several drawbacks, which includes:

Improving LLM judgements

Chain-of-thought prompting

Chain-of-thought (CoT) prompting helps LLM explain their thinking step-by-step. When using this method for AI evaluators, we make them reasoning detailed instructions on how to judge, rather than vague guidelines. This approach helps the AI make more accurate and consistent evaluations. It also makes the AI’s judgments more in line with what humans would expect.

prompt= """
Decide if the following summary is consistent with the corresponding article. Note that
consistency means all information in the summary is supported by the article.

Article: [Article]
Summary: [Summary]
Explain your reasoning step by step then answer (yes or no) the question:

"""

Confining LLM judgements

Instead of giving LLMs the entire generated output to evaluate, you can consider breaking it down into more fine-grained evaluations. For example, for question-answer-generation (QAG), you can first extract all sentences in output and pass each of them through LLM with prompt = Is this sentence relevant to the input? answer yes or no only. After that, calculate the proportion of relevant sentences. This proportion becomes the “answer relevancy score.”

Using LLM judges in LLM evaluation metrics

LLM judges can be and are currently most widely used to evaluate LLM systems by incorporating it as a scorer in an LLM evaluation metric.

Fine-tuning LLM judges

Fine-tuning LLM judges can help improve their performance. This involves training the LLM on a dataset of examples where the correct score is already known. This can help the LLM learn to be more consistent and accurate in its evaluations.

Conclusion

LLM-as-a-Judge contributes a significant impact to the field of AI evaluation. By leveraging the power of advanced language models to evaluate other models, we’re entering a new era of more accurate, scalable, and insightful AI assessment. While challenges remain, such as potential biases and the need for careful prompt engineering, the benefits of this approach are clear.

As LLMs continue to evolve and improve, as well as their ability to serve as judges. The relationship between LLMs and AI evaluation is likely to become even more symbiotic, with each side benefiting from the other.

References