How to measure content effectiveness in AI search

The short version

  • Baseline first. Capture where a set of prompts stands on day 0, before your content lands, or you have nothing honest to compare against.
  • Read the change against its confidence interval. A rate that ticked up is not proof. Ask whether the before and after 95% intervals have stopped overlapping. If they have separated, the change is significant; if they still overlap, it is within noise.
  • The pages that win citations are the mechanism. They show what the engine actually pulled into the answer, so you know why visibility moved (or who took the slot instead).
  • Time is the variable. Content often takes weeks to land in AI answers, so measure over elapsed time on a day-0 trend, annotated with when content shipped and when it first got cited.

Most teams "measure" content in AI search by publishing something, checking a week later whether ChatGPT mentions them, and calling it a win or a loss. That is not measurement. It is a single reading with no anchor, taken from a system that gives different answers to the same question on different days. You cannot tell whether your content did anything, because you never recorded where you started.

Measuring content effectiveness in AI search is a before-and-after problem, and it has three honest requirements: a baseline, a way to tell a real change from noise, and a view of what the engines are actually citing. This is how to do all three, and how llemmy's Campaigns feature runs the method for you.

Start with a baseline, on day 0

A campaign is a time-bound initiative built around a specific set of prompts: the buyer questions a piece of content is supposed to influence. The moment you start it, you set a start date, and that date is your baseline, day 0. Everything after is measured relative to it.

This is the step teams skip, and skipping it quietly ruins the analysis. AI answers drift on their own as models update and the web changes. If you only read a visibility number after your content publishes, you are measuring your position, not your content's effect. A rate captured before the content lands is what turns "I think this helped" into a number you can defend. Pick the prompts before you publish, capture the baseline, then ship.

Read the change against a confidence interval, not a hunch

Here is the part almost every tool gets wrong. Visibility is a rate: the share of AI answers on your tracked prompts that mention your brand, that is, mentions divided by the number of answers measured (n). Because it is a proportion estimated from a finite, non-deterministic sample, a single day's rate is noisy. So we never report it bare. Every figure carries its sample size and a 95% confidence interval, the range the true rate is very likely to sit in given how much data you have.
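
For reference, the interval can be written out. The "How llemmy does it" section names the Wilson interval; what follows is the standard textbook form of that interval, with k mentions out of n answers and z ≈ 1.96 for 95%. Treat it as a sketch of the math, not a statement of llemmy's exact implementation.

```latex
\hat{p} = \frac{k}{n}, \qquad
\mathrm{CI}_{95\%} = \frac{1}{1 + z^2/n}
\left( \hat{p} + \frac{z^2}{2n} \;\pm\; z \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n} + \frac{z^2}{4n^2}} \right),
\qquad z \approx 1.96
```

Unlike the simpler p ± z·sqrt(p(1−p)/n) form, this interval always stays between 0 and 1 and behaves sensibly at small n, which is where most short measurement windows sit.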

That changes the question from "did the number go up?" to "did it go up by more than the noise?" Concretely: compare the baseline window to the current window and look at whether their two confidence intervals still overlap.

llemmy makes this call for you and labels it. Each impact metric shows baseline versus current with both intervals, and a tag that reads significant when the intervals separate or within noise when they overlap. It is a conservative test on purpose: requiring two 95% intervals not to overlap is a stricter bar than a formal significance test at the same level, so a "significant" tag is one you can put in front of a client without a caveat about noise. A rate that jumped from 31% to 38% sounds like a win until you notice both intervals still span 30 to 40 percent on small samples, at which point the honest read is "not yet."
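
The overlap rule is easy to reproduce by hand. Below is a minimal Python sketch, assuming a Wilson interval at z = 1.96 and made-up counts; the function names `wilson_interval` and `impact_tag` are illustrative and are not llemmy's API.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile


def wilson_interval(mentions: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for the proportion mentions / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = mentions / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)


def impact_tag(base_mentions: int, base_n: int, cur_mentions: int, cur_n: int) -> str:
    """Label a change 'significant' only when the two intervals do not overlap."""
    base_lo, base_hi = wilson_interval(base_mentions, base_n)
    cur_lo, cur_hi = wilson_interval(cur_mentions, cur_n)
    separated = cur_lo > base_hi or cur_hi < base_lo
    return "significant" if separated else "within noise"


# Hypothetical counts: the same 31% -> 38% move at two sample sizes.
print(impact_tag(31, 100, 38, 100))      # within noise
print(impact_tag(310, 1000, 380, 1000))  # significant
```

The same seven-point lift flips from "within noise" to "significant" purely because n grew tenfold, which is why the sample size is reported next to every rate.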

Measure over elapsed time, because content lands slowly

Content does not hit AI answers the day you publish. Engines have to crawl, index and start retrieving the page, and how long that takes depends on the engine and the query. Anyone promising an exact number of days is guessing. So the right unit is not a promised date, it is elapsed time you actually measure.

A campaign plots a day-0-rebased trend: the X axis is days since you started, so movement is obvious and comparable regardless of when the campaign began. On top of that sits an annotatable timeline where you mark the events that explain the curve. The natural story is baseline captured, then content published, then the day a page first gets cited. When your visibility line lifts a couple of weeks after the "content published" marker and just after a page shows up in the citations, you are looking at cause and effect, not a coincidence you have to argue for.
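
Rebasing to day 0 is a small transformation. The sketch below assumes you have daily counts keyed by calendar date; the data shape, dates and function names are hypothetical, chosen only to illustrate the idea.

```python
from datetime import date


def rebase_to_day0(start: date, daily: dict[date, tuple[int, int]]) -> list[tuple[int, float]]:
    """Turn {calendar date: (mentions, answers)} into [(days since start, rate)]."""
    points = []
    for day, (mentions, answers) in sorted(daily.items()):
        if answers == 0:
            continue  # no answers measured that day, so there is no rate to plot
        points.append(((day - start).days, mentions / answers))
    return points


def annotate(start: date, events: dict[str, date]) -> dict[str, int]:
    """Place timeline events on the same days-since-start axis as the trend."""
    return {label: (when - start).days for label, when in events.items()}


# Hypothetical campaign that started on 1 July 2026.
start = date(2026, 7, 1)
daily = {
    date(2026, 6, 29): (3, 10),  # before day 0, so it gets a negative offset
    date(2026, 7, 1): (3, 10),
    date(2026, 7, 15): (5, 10),
}
events = {"content published": date(2026, 7, 3), "first citation": date(2026, 7, 14)}

print(rebase_to_day0(start, daily))  # [(-2, 0.3), (0, 0.3), (14, 0.5)]
print(annotate(start, events))       # {'content published': 2, 'first citation': 13}
```

Because both the trend and the annotations share one axis, two campaigns that started months apart can be laid over each other and compared directly.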

Which pages win citations tells you why

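Visibility going up tells you that something worked. The "which pages win citations" panel tells you why. AI answers are assembled from cited sources, and a campaign records every source the engines pulled into your tracked answers, ranked by how often each page was cited.

This is the difference between publishing and being surfaced. If the page you shipped starts appearing among the cited sources, that is the mechanism behind the lift. If a competitor's page keeps winning the citation, that page is your target.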

Citation rate gets the same treatment as visibility: it is the share of your tracked answers that cite one of your pages, reported with its sample size and a 95% interval, and tagged significant or within noise against the baseline. So "our content is getting picked up" is a claim with evidence under it, not a feeling.
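
Both the ranking and the citation rate can be sketched in a few lines of Python. This assumes each tracked answer is available as a list of cited URLs and that a page counts once per answer; the domains, data and function names below are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_cited_pages(answers: list[list[str]]) -> list[tuple[str, int]]:
    """Rank pages by the number of answers that cited them."""
    counts: Counter[str] = Counter()
    for cited_urls in answers:
        counts.update(set(cited_urls))  # assumption: count a page once per answer
    return counts.most_common()


def citation_rate(answers: list[list[str]], own_domain: str) -> tuple[int, int]:
    """Return (answers citing one of your pages, total answers)."""
    def is_own(url: str) -> bool:
        host = urlparse(url).hostname or ""
        return host == own_domain or host.endswith("." + own_domain)

    hits = sum(1 for cited in answers if any(is_own(u) for u in cited))
    return hits, len(answers)


# Hypothetical citations from three tracked answers.
answers = [
    ["https://competitor.test/review", "https://example.com/guide"],
    ["https://competitor.test/review"],
    ["https://news.test/story"],
]

print(rank_cited_pages(answers)[0])           # ('https://competitor.test/review', 2)
print(citation_rate(answers, "example.com"))  # (1, 3)
```

The pair returned by `citation_rate` is a count and a sample size, so it can be read with the same 95% interval and overlap check as visibility.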

The full picture: four metrics, one before-and-after

A campaign reads the same set of prompts across four metrics: visibility, share of voice, sentiment and citation rate. Each is shown as baseline versus current with its sample size, a 95% confidence interval, and a significant-or-within-noise tag.

Reading all four together stops the classic false win, like visibility rising while share of voice falls (the whole category got more visible, not you) or while sentiment slips (you are being mentioned more, and worse).
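
The cross-check can be made mechanical. The sketch below assumes each metric is a single number where higher is better and already carries its tag; `MetricChange` and `false_win_warnings` are hypothetical names, and the figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetricChange:
    baseline: float
    current: float
    tag: str  # "significant" or "within noise"


def false_win_warnings(metrics: dict[str, MetricChange]) -> list[str]:
    """Flag a visibility lift that coincides with a drop in another metric."""
    visibility = metrics["visibility"]
    if visibility.current <= visibility.baseline:
        return []  # no lift, so there is no win to second-guess
    warnings = []
    for name in ("share of voice", "sentiment"):
        change = metrics.get(name)
        if change is not None and change.current < change.baseline:
            warnings.append(f"visibility rose but {name} fell ({change.tag})")
    return warnings


# Hypothetical campaign readout.
metrics = {
    "visibility": MetricChange(0.31, 0.45, "significant"),
    "share of voice": MetricChange(0.22, 0.18, "within noise"),
    "sentiment": MetricChange(0.70, 0.72, "within noise"),
    "citation rate": MetricChange(0.05, 0.12, "significant"),
}

print(false_win_warnings(metrics))
# ['visibility rose but share of voice fell (within noise)']
```

The tag travels with each warning so that a drop still inside the noise is read as something to watch, not as a confirmed loss.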

Turn the finding into work

Measurement only pays off if it drives action. A campaign does not have to start from a blank page: spin one up from a content opportunity, a GEO audit finding, an onboarding step, or from scratch, with AI-drafted campaign ideas and suggested prompts to seed it. Then assign the follow-up content work to teammates as tasks tied to the campaign, so the plan, the baseline and the proof all live in one place. For agencies, that means each client campaign is a self-contained before-and-after you can hand over, with the confidence interval doing the arguing for you.

What we will not claim

A confidence interval accounts for noise, not bias. It tells you how stable a rate is at a given sample size; it does not fix the fact that we query the official model APIs and read Google AI Overviews off the search results page, which is not identical to a logged-in, personalized consumer app. We are explicit about that in how we measure. So a campaign will tell you, honestly and with its uncertainty attached, whether your content moved the model's answer to a defined prompt over time. It will not pretend to prove a personalized before-and-after for every individual user, and we would be suspicious of any tool that says it can.

How llemmy does it

Campaigns live inside a workspace. You tag the prompts a piece of content should influence, set a start date as your baseline, and llemmy measures the campaign's impact from your accumulating answers: baseline versus current on visibility, share of voice, sentiment and citation rate (each with n and a 95% Wilson interval and a significant-or-within-noise tag), a day-0-rebased trend with your annotations overlaid, the pages winning the citations, an annotatable timeline, and teammate task assignment. The brand math reuses the same engine as the dashboard headline cards, so campaign numbers reconcile with the rest of your account. Free plans run one campaign at a time; higher plans run more, up to unlimited on Enterprise. Run a free GEO audit or start free to draw your first day-0 line.

FAQ

How do you measure whether content improved your AI visibility?

Capture a baseline before the content lands, then compare the same prompts after it publishes. Read visibility as a rate with its sample size and a 95% confidence interval, not a single number. The content worked if the after rate is higher and its interval clears the baseline's. If the two intervals still overlap, the change is within sampling noise, so keep collecting answers before calling it.

Why do you need a baseline before publishing content for AI?

Without a day-0 baseline there is nothing to compare against, so any number you read afterward is unanchored. AI answers are non-deterministic and drift on their own, so a rate read only after publishing tells you where you are, not whether your content put you there. A baseline captured first turns a hope into a measurable before-and-after.

How long does content take to affect AI answers?

There is no fixed number. It depends on how engines crawl, index and retrieve, and on the query. Treat it as elapsed time you measure, not a promised date: mark when the content published, watch the day-0 trend, and note when a page first appears as a cited source. A lag of weeks is common, and a confidence interval keeps you from over-reading an early wobble.

What does knowing which pages win citations tell you?

It tells you which content the engines actually pull into answers, the difference between publishing and being surfaced. If the page you shipped starts appearing in the cited sources, that is the mechanism behind any visibility lift. If a competitor's page keeps winning the citation, that page is your target: earn a place alongside it, publish a stronger answer, or correct the record.

By the llemmy team, July 2026. Related reading: How llemmy measures AI visibility (and what we don't claim), AI share of voice, without fooling yourself, and What AI engines actually cite.
