Measuring LLM Citations: The Real GEO Metric (And How to Track It)
The one number that actually predicts whether AI engines send you traffic
A DIY method to test what ChatGPT, Perplexity, and Gemini cite, no expensive tool required
How to handle the dirty secret of GEO measurement: the same prompt gives different answers
KEY TAKEAWAYS
- Citation share, not rank, is the metric that predicts whether AI engines send you buyers.
- Define a citation precisely: named, linked, primary source, and a separate flag for cited-but-wrong.
- Build a locked query set of real buyer questions, weighted toward problem and comparison stages, and stop testing your own brand name.
- Answers wobble run to run, so measure citation rate across multiple runs under fixed conditions, never a single snapshot.
- Track entity presence and URL presence separately, and log the third-party domains that keep getting cited.
- Report share of voice over time against named rivals, then turn each gap into a content hypothesis you re-measure.
CHAPTER 02
What Actually Counts as a Citation
Before you can count anything, you have to decide what counts. This sounds pedantic. It is the whole game. If your definition of a citation is sloppy, your numbers are noise, and you will make bad decisions off them for months. I use three distinct categories, and I score them separately.
The three tiers I track
- Brand mention: the model names you in prose but does not link you. Still valuable, because the name lands in the buyer's head.
- Linked citation: the model names you and attaches a source link or footnote to your domain. This is the one that can drive a click.
- Sole or primary source: the model leans on you as the main answer and treats competitors as secondary or omits them. This is dominance.
A mention without a link still moves buyers. Treat "named but not linked" as a real win, not a near miss.
Why split them? Because they call for different actions. If you are getting mentioned but never linked, your problem is source attribution and structure, not awareness. If you are not even getting mentioned, your problem is deeper. You are not in the model's consideration set at all. Lumping these together hides the diagnosis.
Entity vs URL
Models do not think in URLs the way Google's index does. They think in entities. ChatGPT may "know" your brand from training and name it without linking any page, while Perplexity does live retrieval and links a specific URL. So you must count two things in parallel: is the entity present, and is a URL present. I cover this split in depth in chapter five.
Example
Say you sell project management software and the prompt is "best lightweight project tools for a small agency." One model names you in a list of six with no link. Another names you plus links your comparison page. A third writes a paragraph that is basically your positioning and links only you. Same brand, three completely different citation outcomes. If your sheet records all three as "cited," you have thrown away the most useful signal you collected.
There is also the question of accuracy. The model can cite you and describe you wrong. I log a fourth flag for that: cited but inaccurate. A confident wrong description is a fire to put out, and it is invisible if you only count yes or no. This ties back to the structural fixes in how to get cited in ChatGPT, where clean, copy-ready claims reduce the odds the model garbles you.
Score three tiers plus an accuracy flag
Named, linked, primary, and a separate flag for cited-but-wrong. Four columns, not one yes/no.
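As a rough illustration of the four-column scoring above, here is a minimal Python sketch. The class name `CitationRecord`, its field names, the `engine_a`/`engine_b`/`engine_c` labels, and the `example.com` domain are all hypothetical, chosen only for this example. It records one run per row and reports the highest tier reached, so the three outcomes from the project-management example stay distinct instead of collapsing into "cited."

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CitationRecord:
    """One run of one prompt on one engine, scored on four separate columns."""
    query: str
    engine: str
    named: bool        # brand appears in the prose
    linked: bool       # a source link points at your domain
    primary: bool      # you are the main or only source
    inaccurate: bool   # cited, but described wrongly
    linked_domain: Optional[str] = None

    def tier(self) -> str:
        """Return the highest tier reached in this run."""
        if self.primary:
            return "primary"
        if self.linked:
            return "linked"
        if self.named:
            return "named"
        return "absent"


PROMPT = "best lightweight project tools for a small agency"

# The three outcomes from the example above (engine labels are placeholders).
runs = [
    CitationRecord(PROMPT, "engine_a", True, False, False, False),
    CitationRecord(PROMPT, "engine_b", True, True, False, False, "example.com"),
    CitationRecord(PROMPT, "engine_c", True, True, True, False, "example.com"),
]
tiers = [r.tier() for r in runs]
print(tiers)
```

The accuracy flag is kept outside `tier()` on purpose, since a run can be "primary" and still wrong.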
CHAPTER 03
Building a Query Set That Mirrors Buyer Questions
The single biggest mistake I see in GEO measurement is testing the wrong queries. People test the prompts that flatter them, like their own brand name, and then celebrate. A buyer who already knows your brand name is not the buyer GEO wins. You need the questions people ask before they have decided anything. That is where citation share converts to revenue.
Start from the funnel, not from keywords. AI engines get used heavily for research and comparison, so your query set should be thick with problem-stage and comparison-stage prompts. These are conversational, longer, and messier than the keywords you targeted in keyword research for classic SEO. Write them the way a person actually talks to a chatbot.
Five buckets to fill
- Problem-aware: "how do I stop X from happening" with no product in mind yet.
- Solution-aware: "what kind of tool solves X" where they want a category, not a brand.
- Comparison: "best X for Y type of buyer" and "X vs Y" head-to-heads.
- Brand-adjacent: "is [competitor] good for X" and "alternatives to [competitor]."
- Decision: "is X worth it for a small team" and pricing or trust questions.
How many queries?
Enough to be stable, not so many you never refresh them. For a focused product I like a core set in the low dozens, weighted toward comparison and problem prompts because that is where AI engines actually shape buying. A sprawling thousand-prompt set looks impressive and never gets re-run, which makes it worthless for trend tracking.
Test the questions buyers ask before they know your name, not the ones they ask after.
Example
For a payroll app, a weak query set is "AcmePay reviews" and "AcmePay pricing." A strong set is "how do I run payroll for my first employee," "cheapest way to handle contractor payments," "best payroll software for a five-person company," and "alternatives to [big incumbent]." The weak set tells you nothing. The strong set tells you whether AI hands you new buyers.
PRO TIP
Mine your real demand sources for phrasing: sales call recordings, support tickets, and the People Also Ask boxes. Buyers do not phrase questions the way marketers do. Steal their exact words.
Lock the set once it is good. The whole point of measurement is comparison over time, and you cannot compare if you change the questions every month. Keep a stable core you rerun, and a small rotating set for experiments. When you do add a query, mark its start date so you do not mistake a new question for a trend change.
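One way to keep the locked core, the rotating experiments, and the start dates straight is a small typed record per query. This is a sketch under assumptions: the `Query` class, the bucket labels, the dates, and the `comparable_queries` helper are illustrative names, not part of any tool. The prompts come from the payroll example above.

```python
from dataclasses import dataclass
from datetime import date

# Assumed labels for the five buckets described above.
BUCKETS = {"problem", "solution", "comparison", "brand_adjacent", "decision"}


@dataclass(frozen=True)
class Query:
    prompt: str
    bucket: str
    added_on: date
    core: bool = True  # False marks the small rotating experiment set

    def __post_init__(self):
        if self.bucket not in BUCKETS:
            raise ValueError(f"unknown bucket: {self.bucket}")


def comparable_queries(queries, baseline_start):
    """Core queries that already existed when the baseline period began."""
    return [q for q in queries if q.core and q.added_on <= baseline_start]


query_set = [
    Query("how do I run payroll for my first employee", "problem", date(2024, 1, 8)),
    Query("best payroll software for a five-person company", "comparison", date(2024, 1, 8)),
    Query("cheapest way to handle contractor payments", "problem", date(2024, 3, 4)),
    Query("is payroll software worth it for a small team", "decision", date(2024, 1, 8), core=False),
]

stable = comparable_queries(query_set, baseline_start=date(2024, 2, 1))
print([q.prompt for q in stable])
```

Filtering on `added_on` is what stops a newly added question from showing up as a trend change.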
CHAPTER 04
Sampling and Variance: The Same Prompt Gives Different Answers
Here is the dirty secret nobody selling a GEO dashboard wants to say out loud. Run the same prompt twice and you can get two different answers, with different brands cited. The models are non-deterministic by design. If you test each query once, your data is a coin flip dressed up as a metric. This chapter is the difference between real measurement and theater.
Treat each query like a sample, not a fact. One run is an anecdote. The honest unit of measurement is the citation rate across multiple runs of the same prompt. If you appear in three of five runs, your presence for that query is sixty percent, not a yes and not a no. That fractional view is the truth, and it is far more useful than a single snapshot.
One run is a rumor. A citation rate across several runs is a measurement.
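To show how much uncertainty a handful of runs still carries, the sketch below puts a Wilson score interval around a citation rate. Using the Wilson interval is my assumption for illustration; the text above only asks for the fraction. It also treats runs as independent, which is a simplification. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a citation rate (z = 1.96)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half


# Three appearances in five runs: the point estimate is 0.60.
low, high = wilson_interval(3, 5)
print(f"rate 0.60, interval {low:.2f} to {high:.2f}")
```

For three of five, the interval runs from roughly 23 to 88 percent. That is the practical argument for more runs on high-value prompts: sixty percent from five runs is a much looser statement than sixty percent from fifty.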
How I control the wobble
1. Run each core query several times, not once. More runs for high-value prompts, fewer for the long tail.
2. Fix the conditions you can: same model version, same region setting, fresh chat with no memory carried over, search mode on or off held constant.
3. Record the citation rate as a fraction, not a binary, for every query.
4. Aggregate to the engine level so you compare ChatGPT to Perplexity fairly, since their variance differs.
5. Re-run the full set on a fixed cadence so each period is measured the same way.
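Steps 3 and 4 above can be sketched in a few lines. This assumes each run is logged as an `(engine, query, appeared)` tuple, a layout I chose for the example. The engine-level figure here is an unweighted mean of per-query rates, which is one reasonable choice, not the only one.

```python
from collections import defaultdict


def citation_rates(runs):
    """Fraction of runs in which the brand appeared, per (engine, query)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for engine, query, appeared in runs:
        totals[(engine, query)] += 1
        hits[(engine, query)] += int(appeared)
    return {key: hits[key] / totals[key] for key in totals}


def engine_average(rates):
    """Unweighted mean of per-query rates for each engine."""
    per_engine = defaultdict(list)
    for (engine, _query), rate in rates.items():
        per_engine[engine].append(rate)
    return {engine: sum(vals) / len(vals) for engine, vals in per_engine.items()}


# Five runs per prompt; True means the brand appeared in that run.
runs = (
    [("chatgpt", "q1", a) for a in (True, True, True, False, False)]
    + [("chatgpt", "q2", a) for a in (True, False, False, False, False)]
    + [("perplexity", "q1", a) for a in (True, True, True, True, False)]
)

rates = citation_rates(runs)
by_engine = engine_average(rates)
print(rates)
print(by_engine)
```

If some prompts matter more than others, a weighted mean would be the natural next step.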
Why personalization wrecks casual testing
If you test logged into your own account, the model may carry memory, location, and past chats into the answer, and you will see yourself cited because the system knows you. That is not what a stranger sees. Test in a clean session, logged out or in a fresh context, with personalization minimized, so your numbers reflect a real prospect and not a flattering mirror.
Hold conditions constant
Same model version, same region, fresh session, same search setting. Change one input and your trend line becomes meaningless.
WATCH OUT
Do not compare a number you collected in May on one model version with a number from August on a newer version and credit the difference to your own work. The vendor shipped a new model, and that may well have moved your number more than your content did. Always log which model version produced each reading.
This is also why I distrust any single-run "GEO audit" screenshot. It captures one roll of the dice. The discipline of repeated sampling is unglamorous, and it is exactly what separates a number you can act on from a number that lies to you. The same rigor I apply to measuring Core Web Vitals, where field data beats a single lab run, applies here.
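The version-logging warning above can be enforced in code rather than left to memory. In this sketch a change in citation rate is reported only when the two readings share the same engine, model version, and search setting. The `Reading` class and the version labels `model-v1` and `model-v2` are placeholders, not real product names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    period: str
    engine: str
    model_version: str   # placeholder labels in this example
    search_mode: bool
    citation_rate: float


def rate_change(before: Reading, after: Reading) -> Optional[float]:
    """Change in citation rate, or None when conditions are not comparable."""
    comparable = (
        before.engine == after.engine
        and before.model_version == after.model_version
        and before.search_mode == after.search_mode
    )
    if not comparable:
        return None
    return after.citation_rate - before.citation_rate


may = Reading("May", "engine_a", "model-v1", True, 0.40)
june = Reading("June", "engine_a", "model-v1", True, 0.60)
august = Reading("August", "engine_a", "model-v2", True, 0.70)

print(rate_change(may, june))     # same conditions, so a delta is returned
print(rate_change(may, august))   # version changed, so None
```

Returning `None` instead of a number forces whoever reads the report to notice that the baseline was reset.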
CHAPTER 05
Attributing a Citation: Brand and Entity vs a Single URL
You need to be precise about what got cited. Sometimes a model names your brand from memory with no link. Sometimes it links one specific page it retrieved live. Sometimes it links a page about you that you do not even own, like a third-party review. If you only watch your own URLs, you will miss most of how AI talks about you, and you will optimize the wrong things.
Split attribution into two streams from day one. Stream one is the entity: is the brand named, regardless of any link? Stream two is the URL: which specific pages get linked, yours or anyone else's? These tell different stories. Strong entity presence with weak URL presence means the model knows you but does not trust a specific page enough to send a click. That is a fixable structural problem.
The model knows brands. It links pages. Measure both, because they fail in different ways.
ChatGPT vs Perplexity attribution
Engines that lean on training knowledge tend to name entities without links, especially without search mode on. Engines built around live retrieval, like Perplexity, lean toward explicit URL citations you can see and count. Gemini and Google's AI Overviews sit in between and pull from the live index heavily. So the same brand can be "entity-strong, URL-weak" in one engine and the reverse in another. One number across all engines hides this.
Third-party citations count too
Often the linked source is not your site. It is a roundup, a review, a forum thread, or a directory that the model trusts more than your homepage. That is still a citation that shapes the buyer, and it is a roadmap. If models keep citing a particular review site when they describe your category, getting represented well on that site can move your share faster than any change to your own pages. This is the off-site half of E-E-A-T doing its quiet work.
Example
You run the prompt "best CRM for solo consultants" across engines. ChatGPT names you and two rivals, no links. Perplexity links a third-party comparison article that ranks you second. Gemini links your own pricing page. If you only tracked your own domain, you would record one citation. In reality you have an entity win, a third-party-mediated win, and a direct URL win, each pointing to a different next move.
Two streams: entity and URL
Count brand presence and link presence separately, and log the linked domain even when it is not yours.
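A minimal sketch of the two streams, assuming you have already copied the answer text and its cited URLs by hand. The brand `AcmeCRM`, the domains, and the `attribute` helper are invented for illustration. The brand check is a plain case-insensitive substring match, which will miss misspellings and can false-positive on short names, so treat it as a starting point.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Hostname without a leading 'www.', or '' if the URL has no host."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def attribute(answer_text: str, cited_urls: list[str], brand: str, owned_domain: str) -> dict:
    """Score entity presence and URL presence separately for one answer."""
    domains = [domain_of(u) for u in cited_urls]
    owned = [d for d in domains if d == owned_domain or d.endswith("." + owned_domain)]
    third_party = [d for d in domains if d and d not in owned]
    return {
        "entity_present": brand.lower() in answer_text.lower(),  # naive match
        "url_present": bool(owned),
        "third_party_domains": third_party,
    }


result = attribute(
    answer_text="AcmeCRM ranks second among CRMs for solo consultants.",
    cited_urls=["https://www.reviews.example.org/best-crm-for-consultants"],
    brand="AcmeCRM",
    owned_domain="acmecrm.example",
)
print(result)
```

Here the entity stream is a win and the URL stream is not, with the third-party domain logged for follow-up.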
PRO TIP
Keep a running list of the non-owned domains that keep getting cited for your category. That list is a target list. Earning placement or accuracy on those sources is some of the highest-leverage GEO work there is, and it overlaps neatly with smart link building.
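The running list described above is easy to derive from the linked-domain column once it exists. A small sketch, with made-up domains and an assumed cutoff of two appearances:

```python
from collections import Counter


def target_list(linked_domains, owned_domain, min_count=2):
    """Non-owned domains cited at least min_count times, most frequent first."""
    counts = Counter(d for d in linked_domains if d and d != owned_domain)
    return [(domain, n) for domain, n in counts.most_common() if n >= min_count]


# Linked domains logged across one batch of runs (illustrative values).
logged = [
    "reviews.example.org", "acmecrm.example", "forum.example.net",
    "reviews.example.org", "directory.example.com", "reviews.example.org",
    "forum.example.net",
]

targets = target_list(logged, owned_domain="acmecrm.example")
print(targets)
```

The `min_count` cutoff is arbitrary; its only job is to separate domains that keep coming back from one-off citations.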
CHAPTER 07
The DIY Method: Measure Citations Without a Fancy Tool
You do not need a four-figure subscription to start measuring GEO. You need a clear method and the patience to run it. I built my first citation tracker in a spreadsheet, and for many businesses that is genuinely enough. Here is the exact process, start to finish, that you can run this week.
Set up the sheet
1. List your locked query set, one prompt per row, tagged by funnel bucket.
2. Add columns per engine for each run: named (yes/no), linked (yes/no), primary source (yes/no), accuracy flag, and the linked domain.
3. Add a calculated column for citation rate per query per engine, the fraction of runs you appeared in.
4. Add a competitor presence column so you can compute share, not just your own count.
5. Date and version-stamp every batch so each period is comparable.
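If you would rather start from a CSV than a blank spreadsheet, the layout above might look like the sketch below. The column names are my own suggestion, and this version uses one row per run in a long format rather than columns per engine, which tends to be easier to aggregate later. The sample row values are invented.

```python
import csv
import io

# Assumed column names; one row per run of one prompt on one engine.
COLUMNS = [
    "batch_date", "model_version", "engine", "bucket", "query", "run",
    "named", "linked", "primary", "inaccurate", "linked_domain",
    "competitors_named",
]


def write_rows(rows, handle):
    """Write the header and the run rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


sample = {
    "batch_date": "2024-05-06", "model_version": "model-v1", "engine": "engine_a",
    "bucket": "comparison", "query": "best payroll software for a five-person company",
    "run": 1, "named": "yes", "linked": "no", "primary": "no", "inaccurate": "no",
    "linked_domain": "", "competitors_named": "rival_a;rival_b",
}

buffer = io.StringIO()
write_rows([sample], buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

The citation-rate column from step 3 is left out deliberately, since it is calculated from these rows rather than entered by hand.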
The tool does not measure GEO. The method does. A disciplined spreadsheet beats a sloppy dashboard.
Run the test
1. Open each engine in a clean, logged-out or fresh session with personalization minimized.
2. Paste each prompt, run it the agreed number of times, and record every field honestly, including the runs where you do not appear.
3. Capture the linked sources verbatim, including the third-party domains, not just whether you showed up.
4. Hold conditions constant across the whole batch: same day, same model versions, same search setting.
5. Roll the per-query rates up into per-engine share and your overall share of voice.
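For the roll-up in the last step, here is one possible definition of share of voice: each brand's appearances divided by all tracked brands' appearances in the same batch. That definition is an assumption on my part, and the counts below are invented.

```python
def share_of_voice(appearances: dict[str, int]) -> dict[str, float]:
    """Each brand's appearances as a fraction of all tracked appearances."""
    total = sum(appearances.values())
    if total == 0:
        return {brand: 0.0 for brand in appearances}
    return {brand: count / total for brand, count in appearances.items()}


# Appearances across every run in one batch on one engine (illustrative).
batch = {"you": 6, "rival_a": 10, "rival_b": 4}
shares = share_of_voice(batch)
print(shares)
```

Run it once per engine and once across all engines, since a single blended number can hide an engine where you are nearly absent.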
When to graduate to a tool
Dedicated GEO platforms exist that automate the runs, handle multiple engines, and chart share over time for you. They save real labor once your query set grows and you are tracking many competitors across several engines on a tight cadence. The honest test: if you are spending more time copy-pasting prompts than thinking about what the data means, buy the tool. Until then, the sheet teaches you the mechanics, and you will use any tool far better for having done it by hand first.
WATCH OUT
Respect each engine's terms of service when you test. Manual, reasonable-volume checking for your own research is one thing. Hammering an API in ways that violate terms is another. Keep it sane and within the rules.
Spreadsheet first
Locked queries, multi-run rates, four citation columns, version stamps. That sheet is a complete, honest GEO measurement system.
I trust a humble spreadsheet I understand over a slick dashboard I cannot interrogate. If you cannot explain how a number was produced, do not bet a strategy on it.
Shmul
CHAPTER 08
Turning Findings Into Action
Data that does not change a decision is decoration. The whole reason to measure citation share is to know exactly where to point your effort next. Once you have a few periods of clean data, the patterns tell you what to fix in plain terms. Here is how I read the sheet and turn it into a work list.
Start with your biggest gaps, weighted by query value. A comparison prompt where buyers decide is worth more than a vague problem-stage one. Find the high-value queries where your citation rate is low and your competitors are high. That intersection is your priority list. You are not trying to win every prompt. You are trying to win the prompts that move money.
Do not chase every gap. Chase the high-value queries where a rival beats you and you can realistically catch up.
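The "high value, low own rate, high rival rate" intersection can be turned into a sortable score. The weights and the formula below (bucket weight times the rival's lead) are assumptions for illustration. The sketch does not model whether you can realistically catch up, which stays a judgment call.

```python
from dataclasses import dataclass

# Assumed weights: comparison prompts count most, per the reasoning above.
BUCKET_WEIGHT = {"comparison": 3.0, "decision": 2.0, "problem": 1.0}


@dataclass(frozen=True)
class QueryGap:
    prompt: str
    bucket: str
    own_rate: float
    best_rival_rate: float

    def score(self) -> float:
        """Bucket weight times the rival's lead; zero when you are ahead."""
        lead = max(0.0, self.best_rival_rate - self.own_rate)
        return BUCKET_WEIGHT.get(self.bucket, 1.0) * lead


def priority_list(gaps):
    """Queries where a rival leads, largest weighted gap first."""
    behind = [g for g in gaps if g.score() > 0]
    return sorted(behind, key=lambda g: g.score(), reverse=True)


gaps = [
    QueryGap("how do I stop X from happening", "problem", 0.2, 0.8),
    QueryGap("best X for small teams", "comparison", 0.0, 1.0),
    QueryGap("is X worth it for a small team", "decision", 0.6, 0.4),
]

ranked = priority_list(gaps)
print([g.prompt for g in ranked])
```

The decision-stage query drops out because you already lead it, which matches the advice not to chase every gap.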
Map the pattern to the fix
- Not mentioned at all on a query: you lack a page that clearly addresses that question. Build it, structured for extraction, as I cover in content writing.
- Mentioned but never linked: the entity is known but no page earns trust. Tighten structure, add clear claims and schema, per schema markup.
- Cited but inaccurate: the model has stale or muddled facts. Publish an unambiguous, current source page and reinforce it.
- A third-party domain keeps winning: go earn accurate representation on that source.
- Strong on one engine, weak on another: study what the winning engine rewards and adapt, since Perplexity and ChatGPT do not weight the same things, as I explain in ranking in Perplexity.
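The first four patterns in the list above can be read straight off the per-query rates. A rough sketch: the function name `diagnose` and the zero thresholds are assumptions, and the action strings paraphrase the list. The cross-engine pattern is left out because it needs rates from more than one engine.

```python
def diagnose(named_rate: float, linked_rate: float, inaccurate_rate: float,
             third_party_dominant: bool = False) -> list[str]:
    """Map one query's rates on one engine to the fixes listed above."""
    actions = []
    if named_rate == 0:
        actions.append("build a page that clearly answers this question")
    elif linked_rate == 0:
        actions.append("tighten structure, add clear claims and schema")
    if inaccurate_rate > 0:
        actions.append("publish an unambiguous, current source page")
    if third_party_dominant:
        actions.append("earn accurate representation on the cited third-party source")
    return actions


# Named in 60% of runs, never linked, described wrongly in 20% of runs.
todo = diagnose(named_rate=0.6, linked_rate=0.0, inaccurate_rate=0.2)
print(todo)
```

In practice you would likely use thresholds above zero, since a single stray link in twenty runs does not mean the linking problem is solved.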
Close the loop
Every fix is a hypothesis. You changed a page to win a query, so the next measurement period is the experiment that tests it. Annotate what you shipped, re-run the set under the same conditions, and read the move. This loop, measure, fix, re-measure, is the entire discipline. Do it for two or three cycles and you will have something most of your competitors never build: an evidence-based picture of what actually moves your share.
Example
Your data shows you are absent from "best [category] for small teams" across all three engines while one rival is cited every time. You build a genuinely useful, well-structured comparison page aimed at that exact question, add schema, and earn a mention on the third-party roundup the models kept citing. Next period you measure again and watch whether your rate on that query moved. If it did, you found a repeatable play. If it did not, you learned something and you try the next lever.
Measure, fix, re-measure
Treat each content change as a hypothesis and the next measurement cycle as its test. That loop is GEO.
PRO TIP
Pair this measurement loop with the on-page fundamentals in on-page SEO and the structural work in winning AI Overviews. Citation share tells you where to aim. Those guides tell you what to build once you are aiming right.
Frequently asked
What is citation share in GEO?
Why do I get different AI answers when I run the same prompt twice?
Do I need a paid GEO tool to measure citations?
Should I count a mention if the model does not link my site?
How many queries should be in my GEO test set?
How often should I re-measure citation share?