AI Metric

Inter-rater reliability (GRC)

Inter-rater reliability measures the degree of agreement between several judges scoring the same answer. High agreement means the score is stable and reproducible; low agreement signals an ambiguous answer on which the judges diverge.

Measuring the panel's agreement

When an AI panel scores an answer, you still need to know whether the judges agree with each other. Inter-rater reliability quantifies that agreement: the stronger it is, the more trustworthy the score. In the AGS, we publish this coefficient (which we call GRC internally) for each audit.
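The page does not say how the GRC coefficient is computed, so the following is only a minimal sketch of one simple way to turn a panel's scores into an agreement value. It assumes scores on a 0 to 100 scale and uses the mean pairwise gap between judges; the function name pairwise_agreement and its parameters are hypothetical, not part of the AGS.

```python
from itertools import combinations


def pairwise_agreement(scores, scale_min=0.0, scale_max=100.0):
    """Illustrative agreement value in [0, 1] (not the AGS formula).

    Computes 1 - (mean absolute gap between every pair of judges / scale width).
    1.0 means all judges gave the same score.
    """
    if len(scores) < 2:
        raise ValueError("need at least two judges")
    width = scale_max - scale_min
    gaps = [abs(a - b) for a, b in combinations(scores, 2)]
    return 1.0 - (sum(gaps) / len(gaps)) / width


# Hypothetical panels: three judges scoring one answer each time.
tight = pairwise_agreement([80, 82, 78])   # judges nearly agree
split = pairwise_agreement([20, 90, 55])   # judges diverge widely
```

Under these assumptions, the tight panel yields a value close to 1 and the split panel a much lower one, which matches the reading given above: stronger agreement, more trustworthy score.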

What is it for?

  • Confidence: a score backed by strong agreement is solid.
  • Warning signal: marked disagreement reveals an ambiguous answer or an edge case worth examining.
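As a rough illustration of the warning-signal use, the sketch below flags answers whose judges are far apart. The spread measure (highest score minus lowest), the 15-point threshold, and the name flag_for_review are all assumptions made for this example, not values taken from the AGS.

```python
def flag_for_review(panel_scores, max_spread=15.0):
    """Return the ids of answers whose judge scores span more than max_spread points.

    panel_scores maps an answer id to the list of scores given by the judges.
    The threshold is a hypothetical choice for illustration only.
    """
    return [
        answer_id
        for answer_id, scores in panel_scores.items()
        if max(scores) - min(scores) > max_spread
    ]


# Hypothetical audit: three answers, three judges each.
panel = {
    "q1": [80, 82, 78],  # spread 4: judges agree
    "q2": [20, 90, 55],  # spread 70: marked disagreement
    "q3": [60, 70, 74],  # spread 14: just under the threshold
}
flagged = flag_for_review(panel)
```

Here only "q2" is flagged, the kind of ambiguous answer or edge case that deserves a closer look.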

Transparency

Together with the confidence interval, inter-rater reliability is one of the indicators we display to show the robustness, and the limits, of each measurement. See the AGS methodology for details.
