PLAY 05

Measuring LLM Citations: The Real GEO Metric (And How to Track It)

The one number that actually predicts whether AI engines send you traffic

A DIY method to test what ChatGPT, Perplexity, and Gemini cite, no expensive tool required

How to handle the dirty secret of GEO measurement: the same prompt gives different answers

17 min read · Updated 2026 · By Shmul

KEY TAKEAWAYS

Citation share, not rank, is the metric that predicts whether AI engines send you buyers.
Define a citation precisely: named, linked, primary source, and a separate flag for cited-but-wrong.
Build a locked query set of real buyer questions, weighted toward problem and comparison stages, and stop testing your own brand name.
Answers wobble run to run, so measure citation rate across multiple runs under fixed conditions, never a single snapshot.
Track entity presence and URL presence separately, and log the third-party domains that keep getting cited.
Report share of voice over time against named rivals, then turn each gap into a content hypothesis you re-measure.

INSIDE THIS GUIDE

8 chapters. Jump to any of them.

01 Why Citation Share Is the Real GEO Metric
Rankings tell you nothing about AI engines. Citation share is what counts.

02 What Actually Counts as a Citation
A mention, a link, and being the sole source are three different things.

03 Building a Query Set That Mirrors Buyer Questions
Your measurement is only as good as the questions you test.

04 Sampling and Variance: The Same Prompt Gives Different Answers
AI answers wobble run to run. Measure like a scientist or fool yourself.

05 Attributing a Citation: Brand and Entity vs a Single URL
Models cite entities and URLs differently. Track both or miss the story.

06 Tracking Share of Voice Over Time
A single reading is a snapshot. The trend line is the truth.

07 The DIY Method: Measure Citations Without a Fancy Tool
A spreadsheet and discipline will get you most of the way.

08 Turning Findings Into Action
Measurement is worthless until it changes what you build next.

CHAPTER 02

What Actually Counts as a Citation

Before you can count anything, you have to decide what counts. This sounds pedantic. It is the whole game. If your definition of a citation is sloppy, your numbers are noise, and you will make bad decisions off them for months. I use three distinct categories, and I score them separately.

The three tiers I track

Brand mention: the model names you in prose but does not link you. Still valuable, because the name lands in the buyer's head.
Linked citation: the model names you and attaches a source link or footnote to your domain. This is the one that can drive a click.
Sole or primary source: the model leans on you as the main answer and treats competitors as secondary or omits them. This is dominance.

A mention without a link still moves buyers. Treat "named but not linked" as a real win, not a near miss.

Why split them? Because they call for different actions. If you are getting mentioned but never linked, your problem is source attribution and structure, not awareness. If you are not even getting mentioned, your problem is deeper. You are not in the model's consideration set at all. Lumping these together hides the diagnosis.

Entity vs URL

Models do not think in URLs the way Google's index does. They think in entities. ChatGPT may "know" your brand from training and name it without linking any page, while Perplexity does live retrieval and links a specific URL. So you must count two things in parallel: is the entity present, and is a URL present. I cover this split in depth in chapter five.

Example

Say you sell project management software and the prompt is "best lightweight project tools for a small agency." One model names you in a list of six with no link. Another names you plus links your comparison page. A third writes a paragraph that is basically your positioning and links only you. Same brand, three completely different citation outcomes. If your sheet records all three as "cited," you have thrown away the most useful signal you collected.

There is also the question of accuracy. The model can cite you and describe you wrong. I log a fourth flag for that: cited but inaccurate. A confident wrong description is a fire to put out, and it is invisible if you only count yes or no. This ties back to the structural fixes in how to get cited in ChatGPT, where clean, copy-ready claims reduce the odds the model garbles you.

Score three tiers plus an accuracy flag

Named, linked, primary, and a separate flag for cited-but-wrong. Four columns, not one yes/no.
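One way to keep the tiers from collapsing into a single yes/no is to record each run as a small structured record. The sketch below is only an illustration: the class name `CitationObservation`, its field names, and the `highest_tier` helper are assumptions of mine, not part of any tool. It scores the three outcomes from the project management example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationObservation:
    """One run of one prompt on one engine, scored on four columns (hypothetical schema)."""
    named: bool       # brand appears in the answer prose
    linked: bool      # a source link or footnote points at your domain
    primary: bool     # you are the main answer; rivals are secondary or absent
    inaccurate: bool  # cited, but the description of you is wrong


def highest_tier(obs: CitationObservation) -> str:
    """Return the strongest tier reached; the accuracy flag is tracked separately."""
    if obs.primary:
        return "primary"
    if obs.linked:
        return "linked"
    if obs.named:
        return "named"
    return "absent"


# The three outcomes from the project management example.
runs = [
    CitationObservation(named=True, linked=False, primary=False, inaccurate=False),
    CitationObservation(named=True, linked=True, primary=False, inaccurate=False),
    CitationObservation(named=True, linked=True, primary=True, inaccurate=False),
]
tiers = [highest_tier(r) for r in runs]
print(tiers)  # ['named', 'linked', 'primary']
```

Keeping `inaccurate` as its own field, rather than a fourth tier, means a primary-source citation that describes you wrong still shows up as both a win and a fire to put out.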

CHAPTER 03

Building a Query Set That Mirrors Buyer Questions

The single biggest mistake I see in GEO measurement is testing the wrong queries. People test the prompts that flatter them, like their own brand name, and then celebrate. A buyer who already knows your brand name is not the buyer GEO wins. You need the questions people ask before they have decided anything. That is where citation share converts to revenue.

Start from the funnel, not from keywords. AI engines get used heavily for research and comparison, so your query set should be thick with problem-stage and comparison-stage prompts. These are conversational, longer, and messier than the keywords you targeted in keyword research for classic SEO. Write them the way a person actually talks to a chatbot.

Five buckets to fill

Problem-aware: "how do I stop X from happening" with no product in mind yet.
Solution-aware: "what kind of tool solves X" where they want a category, not a brand.
Comparison: "best X for Y type of buyer" and "X vs Y" head-to-heads.
Brand-adjacent: "is [competitor] good for X" and "alternatives to [competitor]."
Decision: "is X worth it for a small team" and pricing or trust questions.

How many queries?

Enough to be stable, not so many you never refresh them. For a focused product I like a core set in the low dozens, weighted toward comparison and problem prompts because that is where AI engines actually shape buying. A sprawling thousand-prompt set looks impressive and never gets re-run, which makes it worthless for trend tracking.

Test the questions buyers ask before they know your name, not the ones they ask after.

Example

For a payroll app, a weak query set is "AcmePay reviews" and "AcmePay pricing." A strong set is "how do I run payroll for my first employee," "cheapest way to handle contractor payments," "best payroll software for a five-person company," and "alternatives to [big incumbent]." The weak set tells you nothing. The strong set tells you whether AI hands you new buyers.

PRO TIP

Mine your real demand sources for phrasing: sales call recordings, support tickets, and the People Also Ask boxes. Buyers do not phrase questions the way marketers do. Steal their exact words.

Lock the set once it is good. The whole point of measurement is comparison over time, and you cannot compare if you change the questions every month. Keep a stable core you re-run, and a small rotating set for experiments. When you do add a query, mark its start date so you do not mistake a new question for a trend change.
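A locked set with start dates can be sketched in a few lines. Everything here is illustrative: the `Query` class, the bucket labels, the `comparable_queries` helper, and the dates are assumptions chosen to mirror the five buckets and the payroll example, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical labels for the five buckets described above.
BUCKETS = {"problem", "solution", "comparison", "brand_adjacent", "decision"}


@dataclass(frozen=True)
class Query:
    prompt: str
    bucket: str
    added_on: date
    core: bool = True  # False marks the small rotating experimental set

    def __post_init__(self):
        if self.bucket not in BUCKETS:
            raise ValueError(f"unknown bucket: {self.bucket}")


def comparable_queries(queries, first_period_start):
    """Core queries that already existed when the earliest compared period began."""
    return [q for q in queries if q.core and q.added_on <= first_period_start]


query_set = [
    Query("how do I run payroll for my first employee", "problem", date(2026, 1, 5)),
    Query("best payroll software for a five-person company", "comparison", date(2026, 1, 5)),
    Query("cheapest way to handle contractor payments", "problem", date(2026, 3, 2)),
]

# A trend starting in February should ignore the query added in March.
trend_set = comparable_queries(query_set, date(2026, 2, 1))
print(len(trend_set))  # 2
```

Filtering on the start date is what stops a newly added question from showing up as a sudden jump or drop in the trend line.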

CHAPTER 04

Sampling and Variance: The Same Prompt Gives Different Answers

Here is the dirty secret nobody selling a GEO dashboard wants to say out loud. Run the same prompt twice and you can get two different answers, with different brands cited. The models are non-deterministic by design. If you test each query once, your data is a coin flip dressed up as a metric. This chapter is the difference between real measurement and theater.

Treat each query like a sample, not a fact. One run is an anecdote. The honest unit of measurement is the citation rate across multiple runs of the same prompt. If you appear in three of five runs, your presence for that query is sixty percent, not a yes and not a no. That fractional view is the truth, and it is far more useful than a single snapshot.

One run is a rumor. A citation rate across several runs is a measurement.

How I control the wobble

1. Run each core query several times, not once. More runs for high-value prompts, fewer for the long tail.
2. Fix the conditions you can: same model version, same region setting, fresh chat with no memory carried over, search mode on or off held constant.
3. Record the citation rate as a fraction, not a binary, for every query.
4. Aggregate to the engine level so you compare ChatGPT to Perplexity fairly, since their variance differs.
5. Re-run the full set on a fixed cadence so each period is measured the same way.
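The rate and engine-level roll-up in the steps above can be sketched as follows. This is a minimal example under assumptions: the run log is a plain list of `(engine, prompt, cited)` tuples, the prompts and outcomes are made up, "BigCRM" is a placeholder incumbent, and the engine figure is an unweighted mean of per-query rates (you may prefer to weight by query value).

```python
from collections import defaultdict


def citation_rates(runs):
    """Fraction of runs with a citation, keyed by (engine, prompt)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for engine, prompt, cited in runs:
        totals[(engine, prompt)] += 1
        hits[(engine, prompt)] += int(cited)
    return {key: hits[key] / totals[key] for key in totals}


def engine_rates(rates):
    """Unweighted mean of the per-query rates for each engine."""
    grouped = defaultdict(list)
    for (engine, _prompt), rate in rates.items():
        grouped[engine].append(rate)
    return {engine: sum(vals) / len(vals) for engine, vals in grouped.items()}


def log(engine, prompt, outcomes):
    """Expand a list of per-run outcomes into log rows."""
    return [(engine, prompt, cited) for cited in outcomes]


# Made-up outcomes: five runs per prompt per engine.
runs = (
    log("chatgpt", "best crm for solo consultants", [True, True, False, True, False])
    + log("chatgpt", "alternatives to BigCRM", [False, True, False, False, False])
    + log("perplexity", "best crm for solo consultants", [True, True, True, False, True])
)

rates = citation_rates(runs)
print(rates[("chatgpt", "best crm for solo consultants")])  # 0.6
print({e: round(r, 2) for e, r in engine_rates(rates).items()})
# {'chatgpt': 0.4, 'perplexity': 0.8}
```

Three appearances in five runs comes out as 0.6, which is the fractional view the chapter argues for.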

Why personalization wrecks casual testing

If you test logged into your own account, the model may carry memory, location, and past chats into the answer, and you will see yourself cited because the system knows you. That is not what a stranger sees. Test in a clean session, logged out or in a fresh context, with personalization minimized, so your numbers reflect a real prospect and not a flattering mirror.

Hold conditions constant

Same model version, same region, fresh session, same search setting. Change one input and your trend line becomes meaningless.

WATCH OUT

Do not compare a number you collected in May at one model version to a number from August at a newer version and credit the difference to your own work. The vendor shipped a new model, and that may well have moved your number more than your content did. Always log which model version produced each reading.
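A simple guard can enforce this rule when you compute period-over-period change. The sketch below assumes each reading is stamped with the conditions you held constant. The `Reading` class, the version labels ("v1", "v2"), and the share values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    period: str
    engine: str
    model_version: str  # whatever label you logged for the batch (hypothetical here)
    region: str
    search_on: bool
    share: float


def conditions(r: Reading):
    """The inputs that must match before two readings are compared."""
    return (r.engine, r.model_version, r.region, r.search_on)


def period_delta(earlier: Reading, later: Reading) -> Optional[float]:
    """Change in share, or None when the conditions differ and the delta is not attributable."""
    if conditions(earlier) != conditions(later):
        return None
    return later.share - earlier.share


may = Reading("2026-05", "chatgpt", "v1", "US", True, 0.25)
june = Reading("2026-06", "chatgpt", "v1", "US", True, 0.5)
august = Reading("2026-08", "chatgpt", "v2", "US", True, 0.75)

print(period_delta(may, june))    # 0.25
print(period_delta(may, august))  # None: the model version changed
```

Returning `None` instead of a number forces whoever reads the report to treat a version change as a break in the series rather than as a result.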

This is also why I distrust any single-run "GEO audit" screenshot. It captures one roll of the dice. The discipline of repeated sampling is unglamorous, and it is exactly what separates a number you can act on from a number that lies to you. The same rigor I apply to measuring Core Web Vitals, where field data beats a single lab run, applies here.

CHAPTER 05

Attributing a Citation: Brand and Entity vs a Single URL

You need to be precise about what got cited. Sometimes a model names your brand from memory with no link. Sometimes it links one specific page it retrieved live. Sometimes it links a page about you that you do not even own, like a third-party review. If you only watch your own URLs, you will miss most of how AI talks about you, and you will optimize the wrong things.

Split attribution into two streams from day one. Stream one is the entity: is the brand named, regardless of any link? Stream two is the URL: which specific pages get linked, yours or anyone else's? These tell different stories. Strong entity presence with weak URL presence means the model knows you but does not trust a specific page enough to send a click. That is a fixable structural problem.
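The two streams can be scored independently from the same answer. This is a rough sketch under assumptions: you have the answer text and the list of linked URLs copied out by hand, and "AcmeCRM", `acmecrm.example`, and `reviews.example` are placeholder names. Matching is exact on the hostname, so subdomains you own would need to be listed explicitly.

```python
import re
from urllib.parse import urlparse


def entity_present(answer_text, brand_names):
    """Stream one: case-insensitive whole-word match on any known brand spelling."""
    return any(
        re.search(rf"\b{re.escape(name)}\b", answer_text, re.IGNORECASE)
        for name in brand_names
    )


def split_links(urls, owned_domains):
    """Stream two: separate linked hosts into owned and third-party."""
    owned, third_party = [], []
    for url in urls:
        host = (urlparse(url).hostname or "").lower().removeprefix("www.")
        (owned if host in owned_domains else third_party).append(host)
    return owned, third_party


answer = "For solo consultants, AcmeCRM and two rivals are worth a look."
urls = [
    "https://reviews.example/best-crm",
    "https://www.acmecrm.example/pricing",
]

print(entity_present(answer, ["AcmeCRM"]))       # True
print(split_links(urls, {"acmecrm.example"}))
# (['acmecrm.example'], ['reviews.example'])
```

An answer can score `True` on the first stream and return an empty owned list on the second, which is exactly the "entity-strong, URL-weak" case.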

The model knows brands. It links pages. Measure both, because they fail in different ways.

ChatGPT vs Perplexity attribution

Engines that lean on training knowledge tend to name entities without links, especially without search mode on. Engines built around live retrieval, like Perplexity, lean toward explicit URL citations you can see and count. Gemini and Google's AI Overviews sit in between and pull from the live index heavily. So the same brand can be "entity-strong, URL-weak" in one engine and the reverse in another. One number across all engines hides this.

Third-party citations count too

Often the linked source is not your site. It is a roundup, a review, a forum thread, or a directory that the model trusts more than your homepage. That is still a citation that shapes the buyer, and it is a roadmap. If models keep citing a particular review site when they describe your category, getting represented well on that site can move your share faster than any change to your own pages. This is the off-site half of E-E-A-T doing its quiet work.

Example

You run the prompt "best CRM for solo consultants" across engines. ChatGPT names you and two rivals, no links. Perplexity links a third-party comparison article that ranks you second. Gemini links your own pricing page. If you only tracked your own domain, you would record one citation. In reality you have an entity win, a third-party-mediated win, and a direct URL win, each pointing to a different next move.

Two streams: entity and URL

Count brand presence and link presence separately, and log the linked domain even when it is not yours.

PRO TIP

Keep a running list of the non-owned domains that keep getting cited for your category. That list is a target list. Earning placement or accuracy on those sources is some of the highest-leverage GEO work there is, and it overlaps neatly with smart link building.
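If you have logged the linked sources from every run, the target list falls out of a simple tally. A minimal sketch, assuming a flat list of linked URLs across a batch; the domains are placeholders, and the `min_count` cutoff of 2 is an arbitrary choice to filter out one-off citations.

```python
from collections import Counter
from urllib.parse import urlparse

OWNED = {"acmecrm.example"}  # placeholder for your own domains


def third_party_targets(linked_urls, owned=OWNED, min_count=2):
    """Non-owned hosts cited at least min_count times, most cited first."""
    hosts = (
        (urlparse(u).hostname or "").lower().removeprefix("www.")
        for u in linked_urls
    )
    counts = Counter(h for h in hosts if h and h not in owned)
    return [(host, n) for host, n in counts.most_common() if n >= min_count]


batch_links = [
    "https://reviews.example/best-crm",
    "https://directory.example/crm",
    "https://reviews.example/crm-for-freelancers",
    "https://www.acmecrm.example/pricing",
    "https://forum.example/t/which-crm",
    "https://directory.example/crm/solo",
    "https://reviews.example/best-crm",
    "https://acmecrm.example/compare",
]

print(third_party_targets(batch_links))
# [('reviews.example', 3), ('directory.example', 2)]
```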

CHAPTER 07

The DIY Method: Measure Citations Without a Fancy Tool

You do not need a four-figure subscription to start measuring GEO. You need a clear method and the patience to run it. I built my first citation tracker in a spreadsheet, and for many businesses that is genuinely enough. Here is the exact process, start to finish, that you can run this week.

Set up the sheet

1. List your locked query set, one prompt per row, tagged by funnel bucket.
2. Add columns per engine for each run: named (yes/no), linked (yes/no), primary source (yes/no), accuracy flag, and the linked domain.
3. Add a calculated column for citation rate per query per engine, the fraction of runs you appeared in.
4. Add a competitor presence column so you can compute share, not just your own count.
5. Date and version-stamp every batch so each period is comparable.
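If you would rather keep the sheet as a CSV file than in a spreadsheet app, the columns above translate directly. This sketch is one possible layout, not a standard: the column names are my own, it stores one row per run (so the citation rate is derived later rather than stored), and the sample row is invented.

```python
import csv
import io

# One row per run; the per-query citation rate is calculated from these rows.
COLUMNS = [
    "batch_date", "model_version", "engine", "bucket", "prompt", "run",
    "named", "linked", "primary", "inaccurate", "linked_domain", "competitors_named",
]


def write_rows(rows, handle):
    """Write the header and run rows; unknown keys raise ValueError."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


sample = [{
    "batch_date": "2026-05-01",
    "model_version": "v1",           # hypothetical label
    "engine": "perplexity",
    "bucket": "comparison",
    "prompt": "best crm for solo consultants",
    "run": 1,
    "named": True,
    "linked": False,
    "primary": False,
    "inaccurate": False,
    "linked_domain": "reviews.example",
    "competitors_named": "RivalOne;RivalTwo",
}]

buffer = io.StringIO()  # swap for open("citations.csv", "w", newline="") to write a file
write_rows(sample, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

Storing raw runs rather than pre-computed rates keeps the sheet honest: anyone can recount the rows and check how a number was produced.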

The tool does not measure GEO. The method does. A disciplined spreadsheet beats a sloppy dashboard.

Run the test

1. Open each engine in a clean, logged-out or fresh session with personalization minimized.
2. Paste each prompt, run it the agreed number of times, and record every field honestly, including the runs where you do not appear.
3. Capture the linked sources verbatim, including the third-party domains, not just whether you showed up.
4. Hold conditions constant across the whole batch: same day, same model versions, same search setting.
5. Roll the per-query rates up into per-engine share and your overall share of voice.
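The final roll-up can be sketched as below. Note the assumption: I define share of voice here as your brand's mentions divided by all mentions of the tracked brands across the batch. Other definitions are possible (for example, weighting by tier or by query value), so treat this formula as one reasonable choice, not the only one. Brand names and runs are invented.

```python
from collections import Counter


def share_of_voice(runs, brands):
    """runs: one set of named brands per answer. Returns each brand's share of all tracked mentions."""
    counts = Counter()
    for named in runs:
        counts.update(b for b in brands if b in named)
    total = sum(counts.values())
    return {b: (counts[b] / total if total else 0.0) for b in brands}


tracked = ["You", "RivalOne", "RivalTwo"]
batch = [
    {"You", "RivalOne"},
    {"RivalOne", "RivalTwo"},
    {"You", "RivalOne", "RivalTwo"},
    {"RivalOne"},
]

sov = share_of_voice(batch, tracked)
print(sov)  # {'You': 0.25, 'RivalOne': 0.5, 'RivalTwo': 0.25}
```

Here you appear in half the answers but hold only a quarter of the mentions, which is why the competitor presence column matters.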

When to graduate to a tool

Dedicated GEO platforms exist that automate the runs, handle multiple engines, and chart share over time for you. They save real labor once your query set grows and you are tracking many competitors across several engines on a tight cadence. The honest test: if you are spending more time copy-pasting prompts than thinking about what the data means, buy the tool. Until then, the sheet teaches you the mechanics, and you will use any tool far better for having done it by hand first.

WATCH OUT

Respect each engine's terms of service when you test. Manual, reasonable-volume checking for your own research is one thing. Hammering an API in ways that violate terms is another. Keep it sane and within the rules.

Spreadsheet first

Locked queries, multi-run rates, four citation columns, version stamps. That sheet is a complete, honest GEO measurement system.

"I trust a humble spreadsheet I understand over a slick dashboard I cannot interrogate. If you cannot explain how a number was produced, do not bet a strategy on it."
Shmul

CHAPTER 08

Turning Findings Into Action

Data that does not change a decision is decoration. The whole reason to measure citation share is to know exactly where to point your effort next. Once you have a few periods of clean data, the patterns tell you what to fix in plain terms. Here is how I read the sheet and turn it into a work list.

Start with your biggest gaps, weighted by query value. A comparison prompt where buyers decide is worth more than a vague problem-stage one. Find the high-value queries where your citation rate is low and your competitors are high. That intersection is your priority list. You are not trying to win every prompt. You are trying to win the prompts that move money.
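That intersection can be ranked mechanically once the rates are in the sheet. A minimal sketch, assuming you assign each query a value weight yourself (the 1 to 3 scale below is arbitrary) and score it as weight times the gap between the best rival's citation rate and yours. Prompts and rates are invented.

```python
def priority_list(rows, top=3):
    """Rank prompts by value-weighted gap to the best rival; drop prompts with no gap."""
    scored = []
    for r in rows:
        gap = max(0.0, r["rival_rate"] - r["our_rate"])
        scored.append((r["value"] * gap, r["prompt"]))
    scored.sort(reverse=True)
    return [prompt for score, prompt in scored[:top] if score > 0]


rows = [
    {"prompt": "how do I track client follow-ups", "value": 1, "our_rate": 0.0, "rival_rate": 0.6},
    {"prompt": "best crm for solo consultants", "value": 3, "our_rate": 0.2, "rival_rate": 1.0},
    {"prompt": "is a crm worth it for one person", "value": 2, "our_rate": 0.8, "rival_rate": 0.4},
]

priority = priority_list(rows)
print(priority)
# ['best crm for solo consultants', 'how do I track client follow-ups']
```

The third prompt drops out because you already lead on it, even though its value weight is higher than the problem-stage query.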

Do not chase every gap. Chase the high-value queries where a rival beats you and you can realistically catch up.

Map the pattern to the fix

Not mentioned at all on a query: you lack a page that clearly addresses that question. Build it, structured for extraction, as I cover in content writing.
Mentioned but never linked: the entity is known but no page earns trust. Tighten structure, add clear claims and schema, per schema markup.
Cited but inaccurate: the model has stale or muddled facts. Publish an unambiguous, current source page and reinforce it.
A third-party domain keeps winning: go earn accurate representation on that source.
Strong on one engine, weak on another: study what the winning engine rewards and adapt, since Perplexity and ChatGPT do not weight the same things, as I explain in ranking in Perplexity.
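The first four patterns above can be expressed as a small lookup over a query's rates. This is a sketch, not a rulebook: the function name, the exact-zero checks, and the 0.5 threshold for "a third-party domain keeps winning" are my assumptions. The cross-engine pattern is left out because it needs rates from more than one engine.

```python
def diagnose(named_rate, linked_rate, inaccurate_rate, third_party_rate,
             third_party_threshold=0.5):
    """Map one query's rates on one engine to the fixes listed above."""
    actions = []
    if named_rate == 0:
        actions.append("build a page that clearly addresses the question")
    elif linked_rate == 0:
        actions.append("tighten structure, add clear claims and schema")
    if inaccurate_rate > 0:
        actions.append("publish an unambiguous, current source page")
    if third_party_rate >= third_party_threshold:
        actions.append("earn accurate representation on the cited third-party source")
    return actions


# Mentioned in most runs, never linked, sometimes described wrong,
# and a third-party roundup is linked in 60% of runs.
print(diagnose(named_rate=0.8, linked_rate=0.0, inaccurate_rate=0.2, third_party_rate=0.6))
```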

Close the loop

Every fix is a hypothesis. You changed a page to win a query, so the next measurement period is the experiment that tests it. Annotate what you shipped, re-run the set under the same conditions, and read the move. This loop, measure, fix, re-measure, is the entire discipline. Do it for two or three cycles and you will have something most of your competitors never build: an evidence-based picture of what actually moves your share.

Example

Your data shows you are absent from "best [category] for small teams" across all three engines while one rival is cited every time. You build a genuinely useful, well-structured comparison page aimed at that exact question, add schema, and earn a mention on the third-party roundup the models kept citing. Next period you measure again and watch whether your rate on that query moved. If it did, you found a repeatable play. If it did not, you learned something and you try the next lever.

Measure, fix, re-measure

Treat each content change as a hypothesis and the next measurement cycle as its test. That loop is GEO.

PRO TIP

Pair this measurement loop with the on-page fundamentals in on-page SEO and the structural work in winning AI Overviews. Citation share tells you where to aim. Those guides tell you what to build once you are aiming right.

Frequently asked

What is citation share in GEO?

It is the percentage of answers, across a set of buyer questions and repeated runs, in which an AI engine names your brand or links your page, measured against the competitors who appear alongside you. It replaces keyword rank as the core GEO metric because there is no ranked results page inside a ChatGPT or Perplexity answer to measure position on.

Why do I get different AI answers when I run the same prompt twice?

The models are non-deterministic by design, so the same prompt can return different wording and different cited brands each run. That is why you should run each query several times and record your citation rate as a fraction, not test once and treat that single answer as a fact.

Do I need a paid GEO tool to measure citations?

No. A disciplined spreadsheet with a locked query set, multi-run citation rates, and four scoring columns is a complete measurement system. Paid tools save labor once you track many competitors across several engines on a tight cadence, but doing it by hand first teaches you the mechanics so you use any tool better.

Should I count a mention if the model does not link my site?

Yes, but score it separately. A brand mention with no link still plants your name with the buyer and is a real win. A linked citation can drive a click and signals deeper trust. Track them as distinct tiers because each one points to a different fix.

How many queries should be in my GEO test set?

Enough to be stable and few enough that you will actually re-run them every period, which for a focused product usually means a core set in the low dozens weighted toward comparison and problem-stage prompts. A thousand-prompt set looks impressive, never gets re-run, and is therefore useless for tracking a trend.

How often should I re-measure citation share?

On a fixed cadence so each period is comparable, and always note which model versions produced the readings. When a vendor ships a new model your numbers can move more than your content did, so version-stamping every batch is what keeps your trend line honest.

