
Comparison Evaluators

Comparison evaluators in LangChain help you compare the outputs of two different chains or LLMs. They are useful for comparative analyses, such as A/B testing between two language models or comparing different versions of the same model. They can also be used to generate preference scores for AI-assisted reinforcement learning.

These evaluators inherit from the PairwiseStringEvaluator or LLMPairwiseStringEvaluator class, which provide a comparison interface for two strings: typically the outputs of two different prompts or models, or of two versions of the same model. In essence, a comparison evaluator evaluates a pair of strings and returns a dictionary containing the evaluation score and other relevant details.

To create a custom comparison evaluator, inherit from the PairwiseStringEvaluator or LLMPairwiseStringEvaluator abstract class exported from langchain/evaluation and override the _evaluateStringPairs method.

Here's a summary of the key methods and properties of a comparison evaluator:

  • _evaluateStringPairs: Evaluates a pair of output strings. Override this method when creating custom evaluators.
  • requiresInput: This property indicates whether this evaluator requires an input string.
  • requiresReference: This property specifies whether this evaluator requires a reference label.
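
The pattern described above can be sketched roughly as follows. To keep the example runnable on its own, it does not import from langchain/evaluation. Instead it defines a local stand-in base class, PairwiseStringEvaluatorSketch, that mirrors only the members named in the list. The argument names (prediction, predictionB, input, reference), the result shape, and the scoring convention (1 = first output preferred, 0 = second, 0.5 = tie) are assumptions made for illustration, so check the real class before relying on them.

```typescript
// Hypothetical stand-ins: NOT the real langchain/evaluation types.
interface PairArgs {
  prediction: string; // output of the first model or chain
  predictionB: string; // output of the second model or chain
  input?: string; // original prompt, if the evaluator needs it
  reference?: string; // reference label, if the evaluator needs it
}

type EvalResult = { score: number; reasoning: string };

abstract class PairwiseStringEvaluatorSketch {
  // Subclasses set these to true when they depend on the extra fields.
  requiresInput = false;
  requiresReference = false;

  // The hook a custom evaluator overrides.
  abstract _evaluateStringPairs(args: PairArgs): Promise<EvalResult>;

  // Public entry point: checks the declared requirements, then delegates.
  async evaluateStringPairs(args: PairArgs): Promise<EvalResult> {
    if (this.requiresInput && args.input === undefined) {
      throw new Error("This evaluator requires an input string.");
    }
    if (this.requiresReference && args.reference === undefined) {
      throw new Error("This evaluator requires a reference label.");
    }
    return this._evaluateStringPairs(args);
  }
}

// Toy comparison rule (an assumption for this sketch): prefer the shorter output.
function preferShorter(a: string, b: string): EvalResult {
  if (a.length === b.length) {
    return { score: 0.5, reasoning: "Both outputs have the same length." };
  }
  return a.length < b.length
    ? { score: 1, reasoning: "The first output is shorter." }
    : { score: 0, reasoning: "The second output is shorter." };
}

class ShorterOutputEvaluator extends PairwiseStringEvaluatorSketch {
  async _evaluateStringPairs({ prediction, predictionB }: PairArgs): Promise<EvalResult> {
    return preferShorter(prediction, predictionB);
  }
}

// Usage: compare two candidate answers and log the resulting dictionary.
new ShorterOutputEvaluator()
  .evaluateStringPairs({
    prediction: "Paris.",
    predictionB: "The capital of France is Paris.",
  })
  .then((result) => console.log(result));
```

Keeping the comparison rule in a plain function separate from the class makes it easy to unit test without any asynchronous plumbing. An evaluator that needs the original prompt or a reference label would set requiresInput or requiresReference to true.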

Detailed information about creating custom evaluators and the available built-in comparison evaluators is provided in the following sections.