AGS Methodology — AI Grading System | AI Labs Audit
AGS (AI Grading System) is AI Labs Audit's scoring engine. Every AI response is graded by 5 AI judges that calibrate against each other, and AGS then publishes an inter-judge reliability coefficient so you know how defensible your score is.
What is AGS?
AGS is an open-source multi-judge scoring protocol. Relying on a single LLM to assess your brand's visibility exposes the score to that model's bias, hallucinations, and drift. AGS instead queries 5 AI judges in parallel (GPT-4o, Claude Sonnet, Gemini Pro, Mistral Large, Llama 3.1) and publishes the spread between them. The smaller the spread, the more reliable the score.
The 3 evaluated dimensions
- P (Precision): does the answer mention your brand correctly, without confusion with a competitor or homonym? Measures hallucinations and attribution errors.
- I (Informativeness): does the answer provide useful and differentiating information about your brand, or just name it? Measures the depth of the citation.
- Q (Quality): is the answer factually correct and up-to-date? Measures information freshness and conformity to verifiable facts.
Evaluation protocol
For each audited prompt, AGS executes 5 parallel calls to the AI judges with identical instructions (zero-shot scoring). Scores are aggregated as an average weighted by each judge's declared confidence. The final result includes the weighted average score, the inter-judge standard deviation, and a 95% bootstrap confidence interval.
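The aggregation step described above can be sketched as follows. This is a minimal illustration, not the published AGS formula: the function name `aggregate_scores`, the 0 to 100 score scale, the percentile bootstrap, and the number of resamples are all assumptions made for the example.

```python
import random
import statistics


def aggregate_scores(scores, confidences, n_boot=2000, seed=0):
    """Confidence-weighted mean, inter-judge std dev, and a 95% percentile
    bootstrap interval. Hypothetical helper, not the official AGS code."""
    if len(scores) != len(confidences) or not scores:
        raise ValueError("need one confidence per score, and at least one score")
    if any(c <= 0 for c in confidences):
        raise ValueError("confidences must be strictly positive")

    def weighted_mean(pairs):
        return sum(s * c for s, c in pairs) / sum(c for _, c in pairs)

    pairs = list(zip(scores, confidences))
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible

    # Resample (score, confidence) pairs with replacement, one mean per resample.
    boots = sorted(
        weighted_mean([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot)
    )
    return {
        "mean": weighted_mean(pairs),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "ci95": (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot) - 1]),
    }


# Illustrative values only: five judges, each with a declared confidence.
result = aggregate_scores([80, 85, 78, 90, 82], [0.9, 0.8, 0.7, 0.95, 0.85])
print(result)
```

With only five judges the bootstrap interval is coarse, so it is best read as a rough indication of spread rather than a precise bound.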
Inter-judge reliability coefficient
AGS publishes the Fleiss' kappa coefficient (a multi-rater agreement measure) for each audit. A kappa above 0.80 indicates strong consensus among judges (highly reliable score). Between 0.60 and 0.80, consensus is moderate. Below 0.60, consensus is weak: the score should be interpreted with caution and the question rephrased.
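For readers who want to check a published coefficient, here is a small sketch of the standard Fleiss' kappa computation together with the interpretation bands quoted above. The function names and the input layout (one row per rated item, one column per category, each cell holding the number of judges who chose that category) are assumptions for illustration, and Fleiss' kappa requires categorical ratings.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of judges who put item i in
    category j. Every item must be rated by the same number of judges."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when every rating is identical")
    return (p_bar - p_e) / (1.0 - p_e)


def interpret(kappa):
    """Bands as stated in this document."""
    if kappa > 0.80:
        return "strong consensus"
    if kappa >= 0.60:
        return "moderate consensus"
    return "weak consensus"


# Illustrative data: 4 prompts, 5 judges, categories = (cited, not cited).
ratings = [[5, 0], [4, 1], [0, 5], [5, 0]]
k = fleiss_kappa(ratings)
print(round(k, 3), interpret(k))
```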
Transparency and reproducibility
Each AGS audit produces a cryptographic hash of the prompts, raw responses, and individual scores. Because any change to the data changes the hash, this fingerprint serves as evidence that the score has not been manipulated. The AGS code is open source (MIT license) on github.com/sarsator/aqa-specification, and the scoring formula is versioned and published. Any customer can verify or challenge a score.
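As an illustration of how such a fingerprint can be recomputed, the sketch below hashes the three audit components together. The source does not state which hash function or serialisation AGS uses, so SHA-256, the canonical JSON encoding, and the name `audit_fingerprint` are assumptions chosen for the example.

```python
import hashlib
import json


def audit_fingerprint(prompts, responses, scores):
    """Hex digest over prompts, raw responses, and individual scores.
    Assumes SHA-256 over canonical JSON; the real AGS scheme may differ."""
    payload = json.dumps(
        {"prompts": prompts, "responses": responses, "scores": scores},
        sort_keys=True,          # stable key order
        separators=(",", ":"),   # no whitespace variation
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative data only.
original = audit_fingerprint(
    ["best CRM for startups?"],
    ["Example response text"],
    {"judge_1": 82, "judge_2": 79},
)
tampered = audit_fingerprint(
    ["best CRM for startups?"],
    ["Example response text"],
    {"judge_1": 95, "judge_2": 79},
)
print(original != tampered)  # a single changed score changes the digest
```

Sorting keys and fixing the separators matters here: two parties who serialise the same data differently would otherwise compute different digests.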
AGS Acronyms
- GRC (Generative Response Coverage): percentage of prompts where at least one judge cites the brand.
- GIS (Generative Inclusion Score): weighted average score based on brand position in the response (first mentioned = 100%, last = 0%).
- ASR (Answer Sentiment Rating): tonality of the mention (positive/neutral/negative) on a -1 to +1 scale.
- BVI (Brand Visibility Index): composite score (GRC × GIS × ASR), from 0 to 100, that summarises the brand's overall performance across tested AIs.
- CIA (Citation Inter-judge Agreement): Fleiss kappa coefficient measuring agreement among the 5 AI judges on citation presence.
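Two of these definitions translate directly into small functions. The sketch below shows GRC and the per-response position score behind GIS. The function names are hypothetical, and the linear interpolation between first (100%) and last (0%) position is an assumption, since only the two endpoints are given above; the GIS weighting and the BVI normalisation are not specified, so they are left out.

```python
def grc(citations):
    """Generative Response Coverage: percentage of prompts where at least one
    judge cites the brand. citations[i][j] is True if judge j cited the brand
    on prompt i."""
    if not citations:
        raise ValueError("need at least one prompt")
    covered = sum(1 for row in citations if any(row))
    return 100.0 * covered / len(citations)


def position_score(rank, n_mentions):
    """Score for the brand's position among the brands mentioned in a response.
    rank is 1-based. First = 100, last = 0; values in between are linearly
    interpolated (an assumption, not a documented rule)."""
    if not 1 <= rank <= n_mentions:
        raise ValueError("rank must be between 1 and n_mentions")
    if n_mentions == 1:
        return 100.0  # assumption: a sole mention counts as first
    return 100.0 * (n_mentions - rank) / (n_mentions - 1)


# Illustrative data: 4 prompts, 2 judges each.
print(grc([[True, False], [False, False], [False, True], [False, False]]))
print(position_score(3, 5))
```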
30 Advanced GEO Checks 2026
Sprint 15 delivered 30 new GEO/AEO signals measured passively (no scraping that violates terms of service). These signals complement AGS scoring through a 6th category, 'advanced_signals', which carries 15% of the composite weight.
6 market differentiators
- A08 — Specificity score (Princeton GEO 2024) — Density of tier-1 sourced statistics (Princeton GEO KDD 2024: +27 to +40% LLM citations).
- A09 — Counter-arguments markers — No competing tool measures balanced-argumentation markers.
- A07 — Date-stamped statements
- S05 — Common Crawl inclusion
- S08 — llms.txt RFC validation
- B10 — Stack Overflow brand mentions
Module 6 — External Authority Signals
New module dedicated to external authority signals: LinkedIn, ProductHunt, G2/Capterra, Stack Overflow, GitHub, Substack/Medium.
Checks by GEO module
- SSR / Crawlability: mainEntity · QAPage · Video transcripts · Speakable · @graph @id · inLanguage · Common Crawl · llms.txt · IndexNow · ai.txt · Verifications · HTTP/3 · Brotli
- Entity Health: Wikidata · DBpedia
- Citation Readiness: Sourced stats · Balanced argumentation · Inline dating · ItemList · Dataset · Blockquote cite · Internal links · Anchor entropy · News Sitemap
- External Authority: Stack Overflow · LinkedIn · GitHub · ProductHunt · B2B reviews · Newsletter
All checks use safe_external_call (retry + cache + circuit breaker) and store their results in audits.advanced_checks_v2 (JSONB + GIN index).
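The implementation of safe_external_call is not shown here, so the following is only a sketch of how retry, cache, and circuit breaker can be combined in one wrapper. The class name `SafeCaller`, its parameters, and the exponential backoff are all hypothetical and should not be read as the actual AGS code.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are refused because the circuit is open."""


class SafeCaller:
    """Illustrative retry + cache + circuit breaker wrapper (hypothetical)."""

    def __init__(self, max_retries=3, base_delay=0.5, failure_threshold=5,
                 cooldown=60.0, sleep=time.sleep, clock=time.monotonic):
        if max_retries < 1:
            raise ValueError("max_retries must be at least 1")
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._sleep = sleep
        self._clock = clock
        self._cache = {}
        self._failures = 0      # consecutive failures across calls
        self._opened_at = None  # time the circuit opened, or None if closed

    def call(self, key, fn):
        # Cache: a previous success for this key short-circuits everything.
        if key in self._cache:
            return self._cache[key]
        # Circuit breaker: refuse calls until the cooldown has elapsed.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown:
                raise CircuitOpenError("too many recent failures")
            self._opened_at = None
            self._failures = 0
        last_error = None
        for attempt in range(self.max_retries):
            try:
                result = fn()
            except Exception as exc:
                last_error = exc
                self._failures += 1
                if self._failures >= self.failure_threshold:
                    self._opened_at = self._clock()  # open the circuit
                    break
                if attempt < self.max_retries - 1:
                    self._sleep(self.base_delay * 2 ** attempt)  # backoff
            else:
                self._failures = 0
                self._cache[key] = result
                return result
        raise last_error
```

A check would then wrap its outbound request as `caller.call("wikidata:acme", fetch_fn)`, where both the key and `fetch_fn` are placeholders. Injecting `sleep` and `clock` keeps the wrapper testable without real delays.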
Full article on the blog: 30 new GEO/AEO 2026 signals — State of the art audited.