Methodology · Reporting integrity

How llemmy measures AI visibility (and what we don't claim)

The short version

  • We report presence as a rate, not a single yes or no, because LLMs are non-deterministic.
  • Every headline number carries its sample size (n) and a 95% Wilson confidence interval, shown right under the number in the app and in client reports.
  • A confidence interval handles noise, not bias. We name the biases it cannot fix instead of hiding them.
  • We separate branded from unbranded prompts, build the headline on earned (unbranded) prompts, and record every source each engine cited so you can open the evidence.

AI-visibility tooling has an honesty problem. A tool that tells you "17% share of voice, rank 3" as if it were a stable fact is selling precision that the underlying system cannot support. LLMs are non-deterministic: run the same prompt three times and you can get three different answers. So we built llemmy's reporting around what is actually measurable, and we are equally clear about what is not.

Presence is a rate, with a sample size and a confidence interval

Your visibility is the share of AI answers that mention your brand: answers with a mention divided by the number of answers measured (n). Because that is a proportion estimated from a finite, noisy sample, we attach a 95% Wilson score interval. We use Wilson because it stays well behaved for small samples and for rates near 0% or 100%, where the textbook normal approximation breaks down. In the app you see it inline, for example:

Visibility 42% · n=210 · 95% CI 35-49%

A wide interval is a signal to collect more answers before reading too much into the number. Share of voice (your mentions divided by all brand mentions across the same answers) gets the same treatment, with its own n. We would rather show you an honest "70%, n=18, 95% CI 47-86%" than a confident-looking "70%" that one extra noisy day could swing.
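The point that share of voice carries its own n can be sketched as follows. This is an illustration under stated assumptions, not llemmy's implementation: the brand names and answer data are invented, `share_of_voice` is a hypothetical helper, and treating each brand mention as an independent trial is a simplification (mentions inside one answer are likely correlated, so the real uncertainty may be wider).

```python
import math
from collections import Counter


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (~95% at z=1.96)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Invented data: the brands mentioned in each of five measured answers.
answers = [
    ["acme", "globex"],
    ["globex"],
    ["acme"],
    [],
    ["acme", "initech", "globex"],
]


def share_of_voice(answers, brand):
    """Return (own mentions, all brand mentions, low, high), or None if no mentions."""
    counts = Counter(b for mentioned in answers for b in mentioned)
    total = sum(counts.values())  # n here is all brand mentions, not the answer count
    if total == 0:
        return None
    low, high = wilson_interval(counts[brand], total)
    return counts[brand], total, low, high


own, total, low, high = share_of_voice(answers, "acme")
print(f"Share of voice {own / total:.0%} · n={total} · 95% CI {low:.0%}-{high:.0%}")
```

Here five answers yield seven brand mentions, so the share-of-voice denominator is 7 rather than 5, and the interval is wide enough that the number should be read as "collect more answers".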

What a confidence interval does NOT fix

This is the part most tools skip. A confidence interval tells you how stable a measurement is at a given sample size. It says nothing about whether you are measuring the right thing. A biased sample, measured more times, just produces a more confident wrong number. So here is where our numbers come from, plainly:

We treat these as limits to name, not problems to claim away. If a vendor tells you they have "solved" the gap between an API and a logged-in app, that is the overclaiming you should be skeptical of.
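The difference between noise and bias can be made concrete with a small sketch. All numbers are hypothetical: it assumes the rate you actually care about is 30% while the sampling channel systematically yields 45%, and it shows what happens to the interval as n grows.

```python
import math

TRUE_RATE = 0.30    # hypothetical rate in the population you care about
BIASED_RATE = 0.45  # hypothetical rate the sampling channel actually produces


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (~95% at z=1.96)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def biased_intervals(sample_sizes=(20, 100, 10_000)):
    """For each n, return (n, low, high, whether the interval covers TRUE_RATE)."""
    rows = []
    for n in sample_sizes:
        mentions = round(BIASED_RATE * n)  # what the biased channel would report
        low, high = wilson_interval(mentions, n)
        rows.append((n, low, high, low <= TRUE_RATE <= high))
    return rows


for n, low, high, covers in biased_intervals():
    print(f"n={n:>6}  95% CI {low:.1%}-{high:.1%}  covers true rate: {covers}")
```

At n=20 the interval is wide enough to include the true 30%. At larger n it tightens around 45% and excludes the truth: more samples made the estimate more confident, not more correct.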

Branded vs unbranded, and the evidence behind every number

Prompts that name your brand return a mention close to 100% of the time by construction, so including them would inflate your headline. llemmy classifies prompt intent and excludes branded prompts from the headline visibility score, building it on unbranded, buyer-intent prompts where a mention is earned. Branded prompts still matter, so they flow to a separate sentiment and accuracy view rather than the headline.
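A minimal sketch of why the exclusion matters, under loud assumptions: the `Answer` record, the helper names, and the data are all invented, and the substring check is a naive stand-in for whatever intent classifier llemmy actually uses.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    prompt: str
    text: str


def names_brand(text: str, brand_terms: list[str]) -> bool:
    """Naive stand-in: true if any brand term appears as a substring."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in brand_terms)


def headline_counts(answers: list[Answer], brand_terms: list[str]) -> tuple[int, int]:
    """Return (mentions, n) over unbranded prompts only."""
    unbranded = [a for a in answers if not names_brand(a.prompt, brand_terms)]
    mentions = sum(names_brand(a.text, brand_terms) for a in unbranded)
    return mentions, len(unbranded)


# Invented sample: two branded prompts, three unbranded ones.
answers = [
    Answer("Is Acme any good?", "Acme is a popular option."),
    Answer("Acme pricing", "Acme offers three plans."),
    Answer("best invoicing tools", "Consider Globex or Acme."),
    Answer("how to automate invoices", "Globex and Initech are common picks."),
    Answer("invoicing software for freelancers", "Initech is often recommended."),
]

all_mentions = sum(names_brand(a.text, ["Acme"]) for a in answers)
print(f"All prompts:    {all_mentions}/{len(answers)}")  # 3/5, inflated by branded prompts
mentions, n = headline_counts(answers, ["Acme"])
print(f"Unbranded only: {mentions}/{n}")                 # 1/3, the earned rate
```

In this toy sample the rate across all prompts is 60%, but only 33% on prompts where the mention had to be earned.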

And we record the sources each engine cited on every answer, so a number is never a black box: you can open the underlying answers and the domains behind them.
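One way to picture an evidence trail like this is a per-answer record that keeps the cited URLs next to the answer text. The sketch below is an assumption about shape, not llemmy's actual schema: the field names, engine labels, and URLs are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class AnswerRecord:
    engine: str       # which AI engine produced the answer
    prompt: str
    answer_text: str
    cited_urls: list[str] = field(default_factory=list)


def cited_domains(records: list[AnswerRecord]) -> Counter:
    """Count how often each domain is cited across the recorded answers."""
    counts = Counter()
    for record in records:
        for url in record.cited_urls:
            host = urlparse(url).hostname  # None when the URL has no host part
            if host:
                counts[host.removeprefix("www.")] += 1
    return counts


# Invented records for two hypothetical engines.
records = [
    AnswerRecord("engine-a", "best invoicing tools", "Consider Globex or Acme.",
                 ["https://www.example.com/review", "https://example.org/best-tools"]),
    AnswerRecord("engine-b", "best invoicing tools", "Globex is a common pick.",
                 ["https://example.com/pricing"]),
]

for domain, count in cited_domains(records).most_common():
    print(domain, count)
```

Keeping the raw URLs on each record, rather than only the aggregated counts, is what lets a reader trace a headline number back to the individual answers behind it.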

What we will never do

We will not present a precise, stable number without the sample size, the prompt set, the engine, and the window behind it. We will not bury the methodology, and we will not claim to have eliminated biases that no API-based tool can eliminate. Directional truth, honestly bounded, beats false precision. That is the whole point of llemmy.

Questions about any of this, or think we have a measurement wrong? We would genuinely like to hear it. Related reading: "What AI engines actually cite"; "AI share of voice, measured without fooling yourself"; and "How to track your brand across AI engines".
