The short version
- We report presence as a rate, not a single yes or no, because LLMs are non-deterministic.
- Every headline number carries its sample size (n) and a 95% Wilson confidence interval, shown right under the number in the app and in client reports.
- A confidence interval handles noise, not bias. We name the biases it cannot fix instead of hiding them.
- We separate branded from unbranded prompts, build the headline on earned (unbranded) prompts, and record every source each engine cited so you can open the evidence.
AI-visibility tooling has an honesty problem. A tool that tells you "17% share of voice, rank 3" as if it were a stable fact is selling precision that the underlying system cannot support. LLMs are non-deterministic: run the same prompt three times and you can get three different answers. So we built llemmy's reporting around what is actually measurable, and we are equally clear about what is not.
Presence is a rate, with a sample size and a confidence interval
Your visibility is the share of AI answers that mention your brand: mentions divided by the number of answers measured (n). Because that is a proportion estimated from a finite, noisy sample, we attach a 95% Wilson score interval. We use Wilson rather than the textbook normal approximation because it stays well behaved for small samples and for rates close to 0% or 100%, which is exactly where the normal approximation breaks down. In the app you see it inline, for example:
Visibility 42% · n=210 · 95% CI 35-49%
A wide interval is a signal to collect more answers before reading too much into the number. Share of voice (your mentions divided by all brand mentions across the same answers) gets the same treatment, with its own n. We would rather show you an honest "70%, n=20, 95% CI 48-85%" than a confident-looking "70%" that one extra noisy day could swing.
What a confidence interval does NOT fix
This is the part most tools skip. A confidence interval tells you how stable a measurement is at a given sample size. It says nothing about whether you are measuring the right thing. A biased sample, measured more times, just produces a more confident wrong number. So here is where our numbers come from, plainly:
- API vs the real app. We query the official model APIs for ChatGPT, Claude, Gemini and Perplexity, and read Google AI Overviews from the search results page. The model API is not identical to a logged-in consumer app with memory and personalization. We report the model's answer to a defined prompt, not a claim about what each individual user sees.
- Personalization. We query from a consistent context, so our numbers are not personalized to any one user's history. That is a feature for comparability and a limit for realism, and we are not going to pretend otherwise.
- Model drift. We capture the exact resolved model version on every answer, so drift is visible in your data. We do not silently restate history when a provider ships a new model.
- Geography. You can set a location context per project; by default queries are global. Visibility is only as local as what you configure.
We treat these as limits to name, not problems to claim away. If a vendor tells you they have "solved" the gap between an API and a logged-in app, that is the overclaiming you should be skeptical of.
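To make the noise-versus-bias distinction concrete, here is a small Python sketch. Every number in it is hypothetical: it assumes a measurement setup that keeps producing a 30% mention rate while the rate real users would see is 20%, then computes the Wilson interval at growing sample sizes. The names `RATE_USERS_SEE`, `MEASURED_RATE` and `wilson_interval` are illustrative only.

```python
import math


def wilson_interval(mentions: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    p = mentions / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


RATE_USERS_SEE = 0.20  # hypothetical: what the audience you care about would get
MEASURED_RATE = 0.30   # hypothetical: what the biased setup keeps returning

rows = []
for n in (50, 500, 5000):
    mentions = round(MEASURED_RATE * n)
    low, high = wilson_interval(mentions, n)
    covers = low <= RATE_USERS_SEE <= high
    rows.append((n, low, high, covers))
    print(f"n={n:>4}  CI {low:.1%}-{high:.1%}  width {high - low:.1%}  "
          f"includes the rate users see: {covers}")
```

At n=50 the interval is wide enough that it still includes the 20% figure. By n=500 it has tightened around 30% and excludes it. More answers bought more confidence, but in the wrong number, which is why the limits above have to be named rather than averaged away.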
Branded vs unbranded, and the evidence behind every number
Prompts that name your brand return a mention close to 100% of the time by construction, so including them would inflate your headline. llemmy classifies prompt intent and excludes branded prompts from the headline visibility score, building it on unbranded, buyer-intent prompts where a mention is earned. Branded prompts still matter, so they flow to a separate sentiment and accuracy view rather than the headline.
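The split can be sketched in a few lines of Python. This is an assumption-heavy illustration: how llemmy actually classifies prompt intent is not described here, so the sketch substitutes a naive substring match, and the brand names, prompts, and the `Answer`, `is_branded` and `headline_counts` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    prompt: str  # the prompt that was sent to the engine
    text: str    # the answer that came back


def contains_brand(text: str, brand_terms: list[str]) -> bool:
    """Naive check: does the text contain any brand term, ignoring case?"""
    lowered = text.lower()
    return any(term.lower() in lowered for term in brand_terms)


def is_branded(prompt: str, brand_terms: list[str]) -> bool:
    """Stand-in for intent classification (assumption: a prompt naming the brand is branded)."""
    return contains_brand(prompt, brand_terms)


def headline_counts(answers: list[Answer], brand_terms: list[str]) -> tuple[int, int]:
    """Return (mentions, n) computed over unbranded prompts only."""
    earned = [a for a in answers if not is_branded(a.prompt, brand_terms)]
    mentions = sum(contains_brand(a.text, brand_terms) for a in earned)
    return mentions, len(earned)


answers = [
    Answer("best crm for small teams", "Popular options include Acme and Globex."),
    Answer("which crm is easiest to set up", "Globex and Initech are common picks."),
    Answer("is Acme any good", "Acme is a well-reviewed CRM."),
]
mentions, n = headline_counts(answers, ["Acme"])
branded = [a for a in answers if is_branded(a.prompt, ["Acme"])]
print(f"headline: {mentions}/{n} unbranded answers, {len(branded)} branded answer held out")
```

Counting all three answers would report 2 of 3. Holding out the branded prompt reports 1 of 2, which is the earned figure.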
And we record the sources each engine cited on every answer, so a number is never a black box: you can open the underlying answers and the domains behind them.
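As a rough picture of what "evidence behind every number" could look like as data, the Python sketch below stores one record per answer, including the resolved model version and the cited URLs, then rolls the records up by domain and by model version. The schema is hypothetical (field names, engine and model labels, and URLs are placeholders), not llemmy's actual storage format.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class AnswerRecord:
    engine: str          # placeholder label, e.g. "engine-a"
    model_version: str   # exact resolved version reported with the answer
    prompt: str
    mentioned: bool      # did the answer mention the brand?
    cited_urls: tuple = ()


def cited_domains(records):
    """Count how many answers cited each domain (once per answer)."""
    counts = Counter()
    for record in records:
        counts.update({urlparse(url).netloc for url in record.cited_urls})
    return counts


def counts_by_model(records):
    """Return {(engine, model_version): (mentions, n)} so drift stays visible."""
    totals = defaultdict(lambda: [0, 0])
    for record in records:
        key = (record.engine, record.model_version)
        totals[key][0] += int(record.mentioned)
        totals[key][1] += 1
    return {key: tuple(value) for key, value in totals.items()}


records = [
    AnswerRecord("engine-a", "v1", "best crm for small teams", True,
                 ("https://example.com/review", "https://example.org/list")),
    AnswerRecord("engine-a", "v1", "which crm is easiest to set up", False,
                 ("https://example.com/pricing", "https://example.com/blog")),
    AnswerRecord("engine-a", "v2", "best crm for small teams", True),
]
print(cited_domains(records).most_common())
print(counts_by_model(records))
```

Keeping the model version on each record means a change in the rate can be read next to a change in the model, instead of being blended into one running total.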
What we will never do
We will not present a precise, stable number without the sample size, the prompt set, the engine, and the window behind it. We will not bury the methodology, and we will not claim to have eliminated biases that no API-based tool can eliminate. Directional truth, honestly bounded, beats false precision. That is the whole point of llemmy.
Questions about any of this, or think we have a measurement wrong? We would genuinely like to hear it. Related reading: "What AI engines actually cite"; "AI share of voice, measured without fooling yourself"; and "How to track your brand across AI engines".