The AI panel, or "LLM-as-a-judge", consists of having an answer assessed by several language models acting as judges. By cross-referencing models from different providers, you limit self-preference bias and obtain a more robust score than with a single model.
The principle
Rather than scoring an AI answer with a single model — which would tend to favour itself — you convene a panel of several judge models from different providers. Each one scores the answer against precise rubrics (presence, accuracy, sentiment), and the scores are then aggregated.
Why several judges?
- Anti self-preference: a model cannot favour its own answers if the panel is diverse.
- Robustness: the occasional errors of one judge are smoothed out by the others.
- Detecting hallucinations: an unverifiable claim flagged by the judges lowers the score.
Reliability and traceability
The agreement between judges is measured by an inter-rater reliability coefficient. The exact composition of the panel for each audit is tracked via a configuration fingerprint, ensuring comparability over time. This panel is at the core of the AGS; see the methodology and the demonstration.
Every question asked to ChatGPT without your name in the answer is a competitor recommended instead of you — measured across 6,820 real AI answers.