AGS Methodology — AI Grading System | AI Labs Audit
AGS (AI Grading System) is AI Labs Audit's scoring engine. Every AI response is graded by 5 AI judges that calibrate against each other, and AGS publishes an inter-judge reliability coefficient so you know how defensible your score is.
In 30 seconds
The AGS (AI Grading System) is an open method for measuring a brand's visibility in AI responses. For each response, we grade three things — is the brand present, is what's said favourable, and is the response reliable — using a jury of several independent AIs. The final score combines these three grades in such a way that a weakness cannot be masked. Above all, it is a method: you can understand it, verify it and challenge it.
This method can be read at several levels: a 30-second definition, a diagram, an explanation of the dimensions, the formula, and finally the reproducibility appendix. A marketing director, a GEO consultant and a researcher should all be able to find their way around it, each at the level of detail that speaks to them.
The diagram
            +---------------------------------------------+
Prompt ---> |  Response from an AI model (ChatGPT, etc.)  |
            +---------------------------------------------+
                                   |
                                   v
            +---------------------------------------------+
            |    JURY = several independent AI judges     |
            |    (different providers, anti self-pref.)   |
            +---------------------------------------------+
                                   |
        +--------------------------+--------------------------+
        v                          v                          v
  P - Presence               I - Influence               Q - Quality
   (M, RP, WC)              (INF, UNQ, REL)              (SENT, ACC)
        +--------------------------+--------------------------+
                                   v
                       AGS = (P x I x Q) ^ (1/3)
                    (geometric mean: a zero on one
                     dimension collapses the score)
The geometric mean is deliberate: you cannot offset an absence of presence with good sentiment. This makes the score far harder to inflate than a conventional (arithmetic) average.
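To make the contrast with an arithmetic average concrete, here is a small illustrative sketch. The two grade profiles are hypothetical values chosen for the example, not audit data.

```python
def geometric_mean(p: float, i: float, q: float) -> float:
    """Cube root of the product of the three dimension grades (0-100 each)."""
    return (p * i * q) ** (1 / 3)


def arithmetic_mean(p: float, i: float, q: float) -> float:
    """Conventional average, shown only for comparison."""
    return (p + i + q) / 3


# Hypothetical profile: the brand is absent (P = 0), other grades are high.
absent = (0.0, 90.0, 90.0)
# Hypothetical profile: balanced, mid-range grades on all three dimensions.
balanced = (60.0, 60.0, 60.0)

print(arithmetic_mean(*absent))   # 60.0: the absence is masked
print(geometric_mean(*absent))    # 0.0: the absence collapses the score
print(round(geometric_mean(*balanced), 6))
```

With the arithmetic mean, the absent brand and the balanced brand would look identical; with the geometric mean they do not.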
What is AGS?
AGS is an open-source multi-judge scoring protocol. Relying on a single LLM to assess your brand's visibility exposes the score to that model's bias, hallucinations and drift. AGS instead queries 5 AI judges in parallel (GPT-4o, Claude Sonnet, Gemini Pro, Mistral Large, Llama 3.1) and publishes the spread between them. The smaller the spread, the more reliable the score.
The 3 evaluated dimensions
- P (Presence): does the answer mention your brand correctly, without confusion with a competitor or homonym? Measures hallucinations and attribution errors.
- I (Influence): does the answer provide useful and differentiating information about your brand, or just name it? Measures the depth of the citation.
- Q (Quality): is the answer factually correct and up-to-date? Measures information freshness and conformity to verifiable facts.
The formula (public method, proprietary weightings)
AGS = (P x I x Q) ^ (1/3), each dimension graded from 0 to 100. Each dimension is a weighted combination of sub-metrics:
- P — Presence = a combination of Mention (M), Rank / Position (RP) and Coverage (WC).
- I — Influence = a combination of Informativeness (INF), Uniqueness (UNQ) and Relevance (REL).
- Q — Quality = a combination of Sentiment (SENT) and Accuracy (ACC).
The exact weighting values are part of our proprietary methodology: they are normalised (sum = 1), calibrated and versioned, and tracked by the judge_config_hash (which guarantees that two audits sharing the same hash are strictly comparable). We publish the method — structure, geometric mean, dimensions, sub-metrics, jury, reliability — without disclosing the exact weighting, which is part of our know-how.
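The structure described above (normalised weights per dimension, then a geometric mean) can be sketched as follows. Every weight and sub-metric value below is a hypothetical placeholder: the real weightings are proprietary and are not reproduced here.

```python
from math import isclose

# HYPOTHETICAL weights, for illustration only. Each group sums to 1.
WEIGHTS = {
    "P": {"M": 0.5, "RP": 0.3, "WC": 0.2},
    "I": {"INF": 0.4, "UNQ": 0.3, "REL": 0.3},
    "Q": {"SENT": 0.5, "ACC": 0.5},
}


def dimension_score(sub_scores: dict, weights: dict) -> float:
    """Weighted combination of sub-metrics, each graded 0-100."""
    if not isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must be normalised (sum = 1)")
    return sum(weights[name] * sub_scores[name] for name in weights)


def ags(sub_scores: dict) -> float:
    """AGS = (P x I x Q) ^ (1/3), computed from per-dimension sub-metrics."""
    p, i, q = (dimension_score(sub_scores[d], WEIGHTS[d]) for d in ("P", "I", "Q"))
    return (p * i * q) ** (1 / 3)


# Hypothetical sub-metric grades for one response.
example = {
    "P": {"M": 100, "RP": 50, "WC": 40},
    "I": {"INF": 70, "UNQ": 60, "REL": 80},
    "Q": {"SENT": 65, "ACC": 90},
}

print(round(ags(example), 2))
```

With these placeholder values the dimensions come out at P = 73, I = 70 and Q = 77.5, so the final score sits between 73 and 74.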
Jury & reproducibility
Each response is graded by several judge models from different providers to prevent a model from favouring itself. As of publication, the jury combines, for example, models from OpenAI, Anthropic, Google, Mistral and DeepSeek — but models evolve: this list is an example as of today, not a fixed promise.
The actual configuration of each audit (models + weights + rubrics) is tracked by its judge_config_hash (SHA-256): that is the stable reference. Two audits sharing the same hash are strictly comparable, and any change to the jury is tracked.
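One plausible way to derive such a hash is to serialise the configuration canonically and apply SHA-256. This is a sketch only: the field names, the placeholder model identifiers and the choice of sorted-key JSON as the canonical form are assumptions, not the documented AGS implementation.

```python
import hashlib
import json


def judge_config_hash(config: dict) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the configuration.

    Sorting keys and fixing separators makes the digest independent of the
    order in which the configuration was written (an assumption of this sketch).
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical configuration: models + weights + rubric version.
config_a = {
    "models": ["judge-a", "judge-b", "judge-c"],
    "weights": {"M": 0.5, "RP": 0.3, "WC": 0.2},
    "rubric_version": "v1",
}
# Same content, different key order: should give the same hash.
config_b = {
    "rubric_version": "v1",
    "weights": {"WC": 0.2, "RP": 0.3, "M": 0.5},
    "models": ["judge-a", "judge-b", "judge-c"],
}

print(judge_config_hash(config_a) == judge_config_hash(config_b))
```

Note that the order of the model list is significant in this sketch: swapping two judges changes the hash.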
- GRC: an inter-judge reliability coefficient published for each audit (the degree of agreement between judges).
- A Wilson confidence interval is shown on presence scores: the uncertainty is displayed, not just a single figure.
- Anchor set: a panel of benchmark brands re-measured continuously detects model drift; the client's score is corrected for this drift.
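The Wilson interval mentioned in the list above has a standard closed form. The sketch below assumes presence is counted as the number of responses that mention the brand out of the number of responses collected, with z = 1.96 for a 95% level; how AGS counts trials internally is not specified here.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion, as (low, high)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical audit: the brand appears in 18 of 40 responses.
low, high = wilson_interval(18, 40)
print(f"presence = {18 / 40:.2f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and remains usable when the brand is mentioned in none or all of the responses.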
Evaluation protocol
For each audited prompt, AGS executes 5 parallel calls to the AI judges with identical instructions (zero-shot scoring). Scores are aggregated via a weighted average using each judge's declared confidence. The final result includes the average score, inter-judge standard deviation, and 95% bootstrap confidence interval.
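The aggregation step can be illustrated with a short sketch. The judge scores and declared confidences are hypothetical, and the percentile bootstrap shown is one common way to obtain such an interval; the source does not state which bootstrap variant AGS uses.

```python
import random
from statistics import stdev


def weighted_mean(scores, confidences) -> float:
    """Average of judge scores, weighted by each judge's declared confidence."""
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    return sum(s * c for s, c in zip(scores, confidences)) / total


def bootstrap_ci(scores, confidences, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the confidence-weighted mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pairs = list(zip(scores, confidences))
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(pairs) for _ in pairs]  # resample judges with replacement
        s, c = zip(*sample)
        means.append(weighted_mean(s, c))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical grades from 5 judges and their declared confidences.
scores = [72, 68, 75, 70, 65]
confidences = [0.9, 0.8, 0.7, 0.9, 0.6]

mean = weighted_mean(scores, confidences)
spread = stdev(scores)  # inter-judge (sample) standard deviation
ci_low, ci_high = bootstrap_ci(scores, confidences)
print(round(mean, 2), round(spread, 2), round(ci_low, 2), round(ci_high, 2))
```

With only five judges per prompt, a bootstrap over judges is coarse; in practice the interval becomes more informative when resampling is done across many prompts.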
Inter-judge reliability coefficient
AGS publishes the Fleiss kappa coefficient (multi-rater agreement measure) for each audit. A kappa above 0.80 indicates strong consensus among judges (highly reliable score). Between 0.60 and 0.80: moderate consensus. Below 0.60: weak consensus — the score should be interpreted with caution and the question rephrased.
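For reference, Fleiss' kappa can be computed from a table of category counts per rated item. The sketch below assumes each judge assigns one category per item (for example "brand cited" vs "not cited"); the label thresholds are the ones quoted above, and the example table is hypothetical.

```python
def fleiss_kappa(table) -> float:
    """Fleiss' kappa. table[i][j] = number of judges who put item i in category j."""
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(table[0])
    # Share of all ratings that fell in each category.
    p_j = [sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_categories)]
    # Observed agreement per item, then averaged.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in table]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    if p_e == 1.0:
        return 1.0  # only one category ever used: kappa is undefined, treated as full agreement here
    return (p_bar - p_e) / (1 - p_e)


def consensus_label(kappa: float) -> str:
    """Map kappa to the consensus bands described in the text."""
    if kappa > 0.80:
        return "strong"
    if kappa >= 0.60:
        return "moderate"
    return "weak"


# Hypothetical table: 4 prompts, 5 judges, categories = [cited, not cited].
unanimous = [[5, 0], [0, 5], [5, 0], [0, 5]]
print(fleiss_kappa(unanimous), consensus_label(fleiss_kappa(unanimous)))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a table where judges split 3 to 2 on every item can produce a negative value.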
Transparency and reproducibility
Each AGS audit produces a cryptographic hash of the prompts, raw responses, and individual scores. Recomputing this hash from the stored data shows whether any of it was altered after the audit. AGS code is open source (MIT license) on github.com/sarsator/aqa-specification, and the scoring formula is versioned and published. Any customer can verify or challenge a score.
AGS Acronyms
- GRC (Generative Response Coverage): percentage of prompts where at least one judge cites the brand.
- GIS (Generative Inclusion Score): weighted average score based on brand position in the response (first mentioned = 100%, last = 0%).
- ASR (Answer Sentiment Rating): tonality of the mention (positive/neutral/negative) on a -1 to +1 scale.
- BVI (Brand Visibility Index): composite score (GRC × GIS × ASR), from 0 to 100, that summarises the brand's overall performance across tested AIs.
- CIA (Citation Inter-judge Agreement): Fleiss kappa coefficient measuring agreement among the 5 AI judges on citation presence.
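Two of these definitions translate directly into code. The sketch below covers the coverage percentage and a per-response position score. The linear interpolation between the stated endpoints (first = 100%, last = 0%) and the handling of a single-brand response are assumptions, since the text only fixes the endpoints.

```python
def coverage_pct(citations_per_prompt) -> float:
    """Share of prompts (in %) where at least one judge reports the brand as cited.

    citations_per_prompt: one list of booleans per prompt, one boolean per judge.
    """
    if not citations_per_prompt:
        raise ValueError("at least one prompt is required")
    hits = sum(1 for judges in citations_per_prompt if any(judges))
    return 100.0 * hits / len(citations_per_prompt)


def position_score(rank: int, n_brands: int) -> float:
    """Position score for one response: first mentioned = 100, last = 0.

    rank is 0-based. Linear interpolation in between is an assumption.
    """
    if not 0 <= rank < n_brands:
        raise ValueError("rank must be in [0, n_brands)")
    if n_brands == 1:
        return 100.0  # assumption: a lone brand counts as first
    return 100.0 * (1 - rank / (n_brands - 1))


# Hypothetical data: 4 prompts, 2 judges each.
citations = [[True, False], [False, False], [False, True], [False, False]]
print(coverage_pct(citations))   # 50.0
print(position_score(2, 5))      # 50.0
```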
30 Advanced GEO Checks 2026
Sprint 15 delivered 30 new GEO/AEO signals measured passively, with no scraping that violates a site's terms of service. These signals complement the AGS scoring through the 6th category 'advanced_signals' (15% composite weight).
6 market differentiators
- A08 — Specificity score (Princeton GEO 2024) — Density of tier-1 sourced statistics (Princeton GEO KDD 2024: +27 to +40% LLM citations).
- A09 — Counter-arguments markers — No competing tool measures balanced-argumentation markers.
- A07 — Date-stamped statements
- S05 — Common Crawl inclusion
- S08 — llms.txt RFC validation
- B10 — Stack Overflow brand mentions
Module 6 — External Authority Signals
New module dedicated to external authority signals: LinkedIn, ProductHunt, G2/Capterra, Stack Overflow, GitHub, Substack/Medium.
Checks by GEO module
- SSR / Crawlability: mainEntity · QAPage · Video transcripts · Speakable · @graph @id · inLanguage · Common Crawl · llms.txt · IndexNow · ai.txt · Verifications · HTTP/3 · Brotli
- Entity Health: Wikidata · DBpedia
- Citation Readiness: Sourced stats · Balanced argumentation · Inline dating · ItemList · Dataset · Blockquote cite · Internal links · Anchor entropy · News Sitemap
- External Authority: Stack Overflow · LinkedIn · GitHub · ProductHunt · B2B reviews · Newsletter
All checks use safe_external_call (retry + cache + circuit breaker) and store their results in audits.advanced_checks_v2 (JSONB + GIN index).
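The combination of retry, cache and circuit breaker can be illustrated with a generic wrapper. This is not the actual safe_external_call: the class name GuardedCaller, its parameters and its defaults are all hypothetical, and a production version would also need cache expiry and thread safety.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because too many recent calls failed."""


class GuardedCaller:
    """Illustrative wrapper: cache first, then circuit breaker, then retries."""

    def __init__(self, max_retries: int = 3, backoff_s: float = 0.5,
                 failure_threshold: int = 5, cooldown_s: float = 60.0) -> None:
        if max_retries < 1:
            raise ValueError("max_retries must be at least 1")
        self.max_retries = max_retries
        self.backoff_s = backoff_s
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._cache: Dict[Hashable, Any] = {}
        self._failures = 0  # consecutive failures since the last success
        self._opened_at: Optional[float] = None

    def call(self, key: Hashable, fn: Callable[[], Any]) -> Any:
        if key in self._cache:
            return self._cache[key]  # cached result: no external call
        if self._opened_at is not None:
            if time.monotonic() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open, call skipped")
            self._opened_at = None  # cooldown elapsed: allow a trial call
            self._failures = 0
        last_error: Optional[Exception] = None
        for attempt in range(self.max_retries):
            try:
                result = fn()
            except Exception as exc:
                last_error = exc
                self._failures += 1
                if self._failures >= self.failure_threshold:
                    self._opened_at = time.monotonic()  # open the circuit
                    break
                if attempt < self.max_retries - 1:
                    time.sleep(self.backoff_s * 2 ** attempt)  # exponential backoff
            else:
                self._failures = 0
                self._cache[key] = result
                return result
        raise last_error
```

A check would then call something like `caller.call(("wikidata", brand), fetch_fn)`, where both the key shape and `fetch_fn` are placeholders for whatever the check actually queries.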
Full article on the blog: 30 new GEO/AEO 2026 signals — State of the art audited.
What we measure — and what we do not
What we measure
We query AI models through their official APIs, in native mode (the model's memory, no browsing) and in web mode (online search enabled), with reproducible questions. Every audit computes a cryptographic fingerprint of its configuration: two compared audits are genuinely comparable.
The default panel is aligned with the products the public actually uses (ChatGPT, Gemini, Perplexity, Claude, Mistral…), and the dashboard shows a visibility score weighted by each engine's real audience share (Statcounter / SimilarWeb sources, dated and revised).
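The audience weighting amounts to a weighted average over engines. In the sketch below the per-engine scores and the audience shares are invented placeholders, not Statcounter or SimilarWeb figures, and renormalising over the engines actually measured is an assumption.

```python
def weighted_visibility(scores: dict, audience_share: dict) -> float:
    """Visibility score weighted by each engine's audience share.

    Shares are renormalised over the engines present in both mappings,
    so they do not need to sum to 1.
    """
    engines = [e for e in scores if e in audience_share]
    total_share = sum(audience_share[e] for e in engines)
    if total_share <= 0:
        raise ValueError("no engine with a positive audience share")
    return sum(scores[e] * audience_share[e] for e in engines) / total_share


# HYPOTHETICAL per-engine scores (0-100) and audience shares.
scores_by_engine = {"ChatGPT": 80.0, "Gemini": 50.0, "Perplexity": 20.0}
shares = {"ChatGPT": 0.6, "Gemini": 0.3, "Perplexity": 0.1}

print(weighted_visibility(scores_by_engine, shares))
```

With these placeholder numbers, the weighted score is pulled toward the engine with the largest share: a plain average gives 50, the weighted one gives 65.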
What we do not measure
An answer obtained through an API can differ from the one shown in the consumer interface of the same product, because of conversation memory, proprietary instructions, geolocation or provider-side A/B tests. We measure the engine, not the personalised session of a logged-in user.
AI answers are probabilistic: the same question can produce variants. That is why we measure across dozens of questions, with confidence intervals (Wilson), rather than on a single test.
The audience shares used for weighting are third-party estimates, dated and regularly revised — not unverifiable internal figures.
Why document our limits? Because a measurement whose scope is unknown is worth nothing. Stating that scope is what makes our scores defensible in front of your clients.
Limits & variability
Measuring AI visibility is not an exact science. We document our limits and how the method accounts for them.
Read: The proof, step by step
An anonymised example showing how an AI response actually becomes an AGS score.
Read: Technical terms glossary