PLAY 05

Measuring LLM Citations: The Real GEO Metric (And How to Track It)

The one number that actually predicts whether AI engines send you traffic

A DIY method to test what ChatGPT, Perplexity, and Gemini cite, no expensive tool required

How to handle the dirty secret of GEO measurement: the same prompt gives different answers

17 min read · Updated 2026 · By Shmul

KEY TAKEAWAYS

  • Citation share, not rank, is the metric that predicts whether AI engines send you buyers.
  • Define a citation precisely: named, linked, primary source, and a separate flag for cited-but-wrong.
  • Build a locked query set of real buyer questions, weighted toward problem and comparison stages, and stop testing your own brand name.
  • Answers wobble run to run, so measure citation rate across multiple runs under fixed conditions, never a single snapshot.
  • Track entity presence and URL presence separately, and log the third-party domains that keep getting cited.
  • Report share of voice over time against named rivals, then turn each gap into a content hypothesis you re-measure.

CHAPTER 01

Why Citation Share Is the Real GEO Metric

Let me save you a year of confusion. The metric you tracked for SEO does not work for AI engines. There is no position one in ChatGPT. There is no blue link to climb toward in Perplexity. The thing you need to measure is whether the model names your brand and links your page when a real buyer asks a real question. I call that citation share, and it is the only GEO number I trust.

Here is the shift. In classic search, the question was "where do I rank?" In AI search, the question is "do I get cited, and how often, against my competitors?" Those are not the same question. A page can rank on page two of Google and still get pulled into a ChatGPT answer because the model liked one clean paragraph you wrote. The reverse happens too. You own position one and the model never mentions you.

Rankings measure where you sit on a results page. Citation share measures whether the machine repeats your name when nobody is looking at a results page at all.

Citation share is simple to define. Take a set of buyer questions. Run each one through the AI engines. Count how often your brand or your URL shows up in the answer. Divide by the total number of answers you collected. That percentage is your share. Track it against the brands that keep showing up next to you, and you have a scoreboard that actually maps to whether AI is sending you business.
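To make the arithmetic concrete, here is a minimal Python sketch of that calculation. It assumes you have pasted the answers into a list by hand. The brand AcmePay, the domain acmepay.example, the rival name, and the sample answers are all hypothetical, and plain substring matching is a stand-in for whatever matching rule you settle on.

```python
def citation_share(answers, brand_terms):
    """Fraction of answers that contain any brand term (case-insensitive)."""
    if not answers:
        return 0.0
    terms = [term.lower() for term in brand_terms]
    hits = sum(1 for answer in answers if any(t in answer.lower() for t in terms))
    return hits / len(answers)


# Hypothetical answers, pasted in by hand after running four buyer questions.
answers = [
    "For a five-person company, AcmePay and RivalPay are common picks.",
    "RivalPay is the usual choice for contractor payments.",
    "See acmepay.example/first-employee for a step-by-step checklist.",
    "Many owners start with a spreadsheet and a calendar reminder.",
]

share = citation_share(answers, ["AcmePay", "acmepay.example"])
print(f"{share:.0%}")  # 2 of 4 answers mention the brand -> 50%
```

Substring matching will miss misspellings and can catch unrelated words that happen to contain your brand name, so spot-check the hits by eye.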

Why not just track AI referral traffic?

You should track it, but it lags and it lies. Many AI answers cite you without anyone clicking through, so your influence is real while your referral number stays flat. Worse, a lot of analytics setups misattribute or drop AI referrers entirely. Citation share measures the cause. Referral traffic measures one delayed, leaky effect of that cause.

Citation share, not rank

If you take one idea from this guide, take this: in GEO you are measuring presence inside answers, not position on a page.

This connects directly to the bigger picture I lay out in what is GEO. GEO is the practice of becoming the source AI engines quote. Measurement is how you know if the practice is working. Skip measurement and you are just guessing, redesigning pages on vibes, and hoping.

PRO TIP

Before you build a single tracking sheet, get crystal clear on what a "win" looks like for your business. Being named is one win. Being linked is a bigger win. Being the only brand named is the win you actually want.

CHAPTER 02

What Actually Counts as a Citation

Before you can count anything, you have to decide what counts. This sounds pedantic. It is the whole game. If your definition of a citation is sloppy, your numbers are noise, and you will make bad decisions off them for months. I use three distinct categories, and I score them separately.

The three tiers I track

  • Brand mention: the model names you in prose but does not link you. Still valuable, because the name lands in the buyer's head.
  • Linked citation: the model names you and attaches a source link or footnote to your domain. This is the one that can drive a click.
  • Sole or primary source: the model leans on you as the main answer and treats competitors as secondary or omits them. This is dominance.

A mention without a link still moves buyers. Treat "named but not linked" as a real win, not a near miss.

Why split them? Because they call for different actions. If you are getting mentioned but never linked, your problem is source attribution and structure, not awareness. If you are not even getting mentioned, your problem is deeper. You are not in the model's consideration set at all. Lumping these together hides the diagnosis.

Entity vs URL

Models do not think in URLs the way Google's index does. They think in entities. ChatGPT may "know" your brand from training and name it without linking any page, while Perplexity does live retrieval and links a specific URL. So you must count two things in parallel: is the entity present, and is a URL present. I cover this split in depth in chapter five.

Example

Say you sell project management software and the prompt is "best lightweight project tools for a small agency." One model names you in a list of six with no link. Another names you plus links your comparison page. A third writes a paragraph that is basically your positioning and links only you. Same brand, three completely different citation outcomes. If your sheet records all three as "cited," you have thrown away the most useful signal you collected.

There is also the question of accuracy. The model can cite you and describe you wrong. I log a fourth flag for that: cited but inaccurate. A confident wrong description is a fire to put out, and it is invisible if you only count yes or no. This ties back to the structural fixes in how to get cited in ChatGPT, where clean, copy-ready claims reduce the odds the model garbles you.

Score three tiers plus an accuracy flag

Named, linked, primary, and a separate flag for cited-but-wrong. Four columns, not one yes/no.
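One way to keep those four columns honest is to give every run its own record instead of a single yes/no. The sketch below is illustrative only: the class name, field names, and the rule that the highest tier reached wins are my assumptions, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationResult:
    named: bool       # brand appears in the answer text
    linked: bool      # a source link or footnote points to your domain
    primary: bool     # you are the sole or main source in the answer
    inaccurate: bool  # cited, but the description of you is wrong

    def tier(self):
        """Highest tier reached; accuracy is tracked separately on purpose."""
        if self.primary:
            return "primary"
        if self.linked:
            return "linked"
        if self.named:
            return "mention"
        return "absent"


# The three outcomes from the project-management example in this chapter.
runs = [
    CitationResult(named=True, linked=False, primary=False, inaccurate=False),
    CitationResult(named=True, linked=True, primary=False, inaccurate=False),
    CitationResult(named=True, linked=True, primary=True, inaccurate=False),
]
print([run.tier() for run in runs])  # ['mention', 'linked', 'primary']
```

Keeping the accuracy flag out of the tier means a primary-source answer that describes you wrong still shows up as both a win and a fire to put out.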

CHAPTER 03

Building a Query Set That Mirrors Buyer Questions

The single biggest mistake I see in GEO measurement is testing the wrong queries. People test the prompts that flatter them, like their own brand name, and then celebrate. A buyer who already knows your brand name is not the buyer GEO wins. You need the questions people ask before they have decided anything. That is where citation share converts to revenue.

Start from the funnel, not from keywords. AI engines get used heavily for research and comparison, so your query set should be thick with problem-stage and comparison-stage prompts. These are conversational, longer, and messier than the keywords you targeted in keyword research for classic SEO. Write them the way a person actually talks to a chatbot.

Five buckets to fill

  • Problem-aware: "how do I stop X from happening" with no product in mind yet.
  • Solution-aware: "what kind of tool solves X" where they want a category, not a brand.
  • Comparison: "best X for Y type of buyer" and "X vs Y" head-to-heads.
  • Brand-adjacent: "is [competitor] good for X" and "alternatives to [competitor]."
  • Decision: "is X worth it for a small team" and pricing or trust questions.

How many queries?

Enough to be stable, not so many you never refresh them. For a focused product I like a core set in the low dozens, weighted toward comparison and problem prompts because that is where AI engines actually shape buying. A sprawling thousand-prompt set looks impressive and never gets re-run, which makes it worthless for trend tracking.

Test the questions buyers ask before they know your name, not the ones they ask after.

Example

For a payroll app, a weak query set is "AcmePay reviews" and "AcmePay pricing." A strong set is "how do I run payroll for my first employee," "cheapest way to handle contractor payments," "best payroll software for a five-person company," and "alternatives to [big incumbent]." The weak set tells you nothing. The strong set tells you whether AI hands you new buyers.

PRO TIP

Mine your real demand sources for phrasing: sales call recordings, support tickets, and the People Also Ask boxes. Buyers do not phrase questions the way marketers do. Steal their exact words.

Lock the set once it is good. The whole point of measurement is comparison over time, and you cannot compare if you change the questions every month. Keep a stable core you rerun, and a small rotating set for experiments. When you do add a query, mark its start date so you do not mistake a new question for a trend change.
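If you prefer to keep the locked set in code rather than a sheet, a rough sketch could look like this. The first four prompts come from the payroll example; the bucket labels I assigned, the dates, the last prompt, and every name in the code are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

BUCKETS = {"problem", "solution", "comparison", "brand_adjacent", "decision"}


@dataclass(frozen=True)
class Query:
    prompt: str
    bucket: str
    added: date        # start date, so a new query is not mistaken for a trend change
    core: bool = True  # False marks the small rotating experimental set

    def __post_init__(self):
        if self.bucket not in BUCKETS:
            raise ValueError(f"unknown bucket: {self.bucket!r}")


# A tuple, not a list, as a small nudge against casual edits to the locked set.
QUERY_SET = (
    Query("how do I run payroll for my first employee", "problem", date(2026, 1, 5)),
    Query("cheapest way to handle contractor payments", "solution", date(2026, 1, 5)),
    Query("best payroll software for a five-person company", "comparison", date(2026, 1, 5)),
    Query("alternatives to [big incumbent]", "brand_adjacent", date(2026, 1, 5)),
    Query("is payroll software worth it for a two-person shop", "decision",
          date(2026, 3, 2), core=False),
)


def core_bucket_mix(queries):
    """Count core queries per bucket to check the set leans the way you intend."""
    return Counter(q.bucket for q in queries if q.core)


print(core_bucket_mix(QUERY_SET))
```

The per-bucket count is a quick check that the core set really is weighted toward problem and comparison prompts before you start collecting data.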

CHAPTER 04

Sampling and Variance: The Same Prompt Gives Different Answers

Here is the dirty secret nobody selling a GEO dashboard wants to say out loud. Run the same prompt twice and you can get two different answers, with different brands cited. The models are non-deterministic by design. If you test each query once, your data is a coin flip dressed up as a metric. This chapter is the difference between real measurement and theater.

Treat each query like a sample, not a fact. One run is an anecdote. The honest unit of measurement is the citation rate across multiple runs of the same prompt. If you appear in three of five runs, your presence for that query is sixty percent, not a yes and not a no. That fractional view is the truth, and it is far more useful than a single snapshot.

One run is a rumor. A citation rate across several runs is a measurement.

How I control the wobble

  1. Run each core query several times, not once. More runs for high-value prompts, fewer for the long tail.
  2. Fix the conditions you can: same model version, same region setting, fresh chat with no memory carried over, search mode on or off held constant.
  3. Record the citation rate as a fraction, not a binary, for every query.
  4. Aggregate to the engine level so you compare ChatGPT to Perplexity fairly, since their variance differs.
  5. Re-run the full set on a fixed cadence so each period is measured the same way.
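Steps three and four can be sketched in a few lines of Python. This assumes you log one row per run by hand; the run outcomes below are invented, and averaging per-query rates with equal weight is just one reasonable way to aggregate to the engine level.

```python
from collections import defaultdict
from statistics import mean

Q1 = "best payroll software for a five-person company"
Q2 = "how do I run payroll for my first employee"

# One (engine, query, appeared) tuple per run; hypothetical hand-logged results.
runs = (
    [("chatgpt", Q1, hit) for hit in (True, True, False, True, False)]
    + [("chatgpt", Q2, hit) for hit in (False, False, True, False, False)]
    + [("perplexity", Q1, hit) for hit in (True, True, True, True, False)]
    + [("perplexity", Q2, hit) for hit in (True, False, False, False, False)]
)


def citation_rates(runs):
    """Map (engine, query) to the fraction of runs in which the brand appeared."""
    outcomes = defaultdict(list)
    for engine, query, appeared in runs:
        outcomes[(engine, query)].append(appeared)
    return {key: sum(hits) / len(hits) for key, hits in outcomes.items()}


def engine_rates(rates):
    """Average the per-query rates within each engine, every query weighted equally."""
    by_engine = defaultdict(list)
    for (engine, _query), rate in rates.items():
        by_engine[engine].append(rate)
    return {engine: mean(values) for engine, values in by_engine.items()}


rates = citation_rates(runs)
print(rates[("chatgpt", Q1)])  # 3 of 5 runs -> 0.6
print(engine_rates(rates))
```

If your high-value prompts get more runs than the long tail, averaging per-query rates like this keeps the heavily sampled prompts from dominating the engine-level number.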

Why personalization wrecks casual testing

If you test logged into your own account, the model may carry memory, location, and past chats into the answer, and you will see yourself cited because the system knows you. That is not what a stranger sees. Test in a clean session, logged out or in a fresh context, with personalization minimized, so your numbers reflect a real prospect and not a flattering mirror.

Hold conditions constant

Same model version, same region, fresh session, same search setting. Change one input and your trend line becomes meaningless.

WATCH OUT

Do not compare a number you collected in May at one model version to a number from August at a newer version and call the difference your doing. The vendor shipped a new model. That moved your number more than your content did. Always log which model version produced each reading.
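A small guard can catch this before you draw a trend line. The sketch below assumes you store the test conditions with every batch; the field names and the version labels are placeholders, so copy whatever label the product actually shows you.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Conditions:
    engine: str
    model_version: str  # placeholder labels below; log the real one verbatim
    region: str
    search_mode: bool
    fresh_session: bool


def changed_conditions(a, b):
    """Return the names of the settings that differ between two batches."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


may = Conditions("chatgpt", "model-version-A", "US", search_mode=True, fresh_session=True)
august = Conditions("chatgpt", "model-version-B", "US", search_mode=True, fresh_session=True)

drift = changed_conditions(may, august)
if drift:
    print("Not directly comparable, changed:", ", ".join(drift))
```

An empty list means the two readings were taken under the same logged conditions. Anything else belongs in the annotation next to the number.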

This is also why I distrust any single-run "GEO audit" screenshot. It captures one roll of the dice. The discipline of repeated sampling is unglamorous, and it is exactly what separates a number you can act on from a number that lies to you. The same rigor I apply to measuring Core Web Vitals, where field data beats a single lab run, applies here.

CHAPTER 05

Attributing a Citation: Brand and Entity vs a Single URL

You need to be precise about what got cited. Sometimes a model names your brand from memory with no link. Sometimes it links one specific page it retrieved live. Sometimes it links a page about you that you do not even own, like a third-party review. If you only watch your own URLs, you will miss most of how AI talks about you, and you will optimize the wrong things.

Split attribution into two streams from day one. Stream one is the entity: is the brand named, regardless of any link? Stream two is the URL: which specific pages get linked, yours or anyone else's? These tell different stories. Strong entity presence with weak URL presence means the model knows you but does not trust a specific page enough to send a click. That is a fixable structural problem.

The model knows brands. It links pages. Measure both, because they fail in different ways.

ChatGPT vs Perplexity attribution

Engines that lean on training knowledge tend to name entities without links, especially without search mode on. Engines built around live retrieval, like Perplexity, lean toward explicit URL citations you can see and count. Gemini and Google's AI Overviews sit in between and pull from the live index heavily. So the same brand can be "entity-strong, URL-weak" in one engine and the reverse in another. One number across all engines hides this.

Third-party citations count too

Often the linked source is not your site. It is a roundup, a review, a forum thread, or a directory that the model trusts more than your homepage. That is still a citation that shapes the buyer, and it is a roadmap. If models keep citing a particular review site when they describe your category, getting represented well on that site can move your share faster than any change to your own pages. This is the off-site half of E-E-A-T doing its quiet work.

Example

You run the prompt "best CRM for solo consultants" across engines. ChatGPT names you and two rivals, no links. Perplexity links a third-party comparison article that ranks you second. Gemini links your own pricing page. If you only tracked your own domain, you would record one citation. In reality you have an entity win, a third-party-mediated win, and a direct URL win, each pointing to a different next move.

Two streams: entity and URL

Count brand presence and link presence separately, and log the linked domain even when it is not yours.
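Here is a rough sketch of scoring both streams for a single answer. It assumes you copy the answer text and its linked URLs by hand. AcmeCRM, the domains, and the sample answers are hypothetical, and the hostname handling is deliberately simple (it strips a leading "www." and ignores other subdomains).

```python
from urllib.parse import urlparse

BRAND = "AcmeCRM"                  # hypothetical brand
OWN_DOMAINS = {"acmecrm.example"}  # hypothetical owned domain


def domain_of(url):
    """Hostname without a leading 'www.' (Python 3.9+); empty string if none."""
    return (urlparse(url).hostname or "").removeprefix("www.")


def attribute(answer_text, linked_urls):
    """Score one answer on the entity stream and the URL stream separately."""
    domains = [domain_of(url) for url in linked_urls]
    return {
        "entity": BRAND.lower() in answer_text.lower(),
        "own_url": any(d in OWN_DOMAINS for d in domains),
        "third_party": sorted({d for d in domains if d and d not in OWN_DOMAINS}),
    }


# The three outcomes from the CRM example, with made-up answer text and URLs.
chatgpt = attribute("AcmeCRM, RivalOne and RivalTwo all suit solo consultants.", [])
perplexity = attribute(
    "A recent comparison ranks AcmeCRM second.",
    ["https://www.reviewsite.example/best-crm-solo-consultants"],
)
gemini = attribute("Pricing for AcmeCRM is listed on its site.",
                   ["https://acmecrm.example/pricing"])

for name, result in (("chatgpt", chatgpt), ("perplexity", perplexity), ("gemini", gemini)):
    print(name, result)
```

Tracking only your own domain would have recorded one hit out of the three. The third-party list is what surfaces the other win.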

PRO TIP

Keep a running list of the non-owned domains that keep getting cited for your category. That list is a target list. Earning placement or accuracy on those sources is some of the highest-leverage GEO work there is, and it overlaps neatly with smart link building.
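That running list is easy to build from the links you already log. A minimal sketch, assuming a flat list of cited URLs from one batch; the URLs, the owned domain, and the cutoff of two citations are all made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def target_list(linked_urls, own_domains, min_count=2):
    """Non-owned domains cited at least min_count times, most cited first."""
    counts = Counter()
    for url in linked_urls:
        host = (urlparse(url).hostname or "").removeprefix("www.")
        if host and host not in own_domains:
            counts[host] += 1
    return [(host, n) for host, n in counts.most_common() if n >= min_count]


# Hypothetical links copied from a batch of answers.
links = [
    "https://reviewsite.example/best-crm",
    "https://www.reviewsite.example/crm-for-consultants",
    "https://forum.example/t/which-crm",
    "https://acmecrm.example/pricing",
    "https://reviewsite.example/acmecrm-review",
]
print(target_list(links, {"acmecrm.example"}))  # [('reviewsite.example', 3)]
```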

CHAPTER 06

Tracking Share of Voice Over Time

A citation count on one day tells you almost nothing. Citation share that climbs over six measurement periods while a competitor's falls tells you everything. GEO measurement is a longitudinal sport. You are looking for direction and gap, not a hero number to screenshot. Here is how I structure the trend so it actually informs decisions.

Build a share-of-voice view, not a vanity count. For each period, total the citations across your whole query set, then express your slice as a percentage of all brand citations that appeared. Do the same for your top competitors. Now you have a chart where everyone competes for a fixed pie, and your wins and losses are visible relative to the field, not in a vacuum.
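In code, the share-of-voice view is one division per brand. This is a sketch with invented counts and brand names, not output from any real tracker.

```python
def share_of_voice(citations):
    """Convert per-brand citation counts for one period into shares of the total."""
    total = sum(citations.values())
    if total == 0:
        return {brand: 0.0 for brand in citations}
    return {brand: count / total for brand, count in citations.items()}


# Hypothetical totals for one period across the whole query set.
period = {"you": 18, "RivalOne": 30, "RivalTwo": 12}
for brand, share in share_of_voice(period).items():
    print(f"{brand}: {share:.0%}")
```

Because every brand's share is taken from the same total, the shares always sum to one, which is what makes the pie fixed.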

Your citation count can rise while your share falls. Only share of voice tells you if you are winning the room.

Slice the trend so it is useful

  • By engine: your share in ChatGPT may climb while Perplexity stalls. Average them and you lose the lesson.
  • By query bucket: dominating comparison prompts but missing problem-stage prompts is a specific, fixable gap.
  • By competitor: track the two or three brands that keep appearing next to you, not the whole universe.
  • By citation tier: rising mentions but flat links means a structural job, not an awareness job.
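The first two slices can be sketched with a small grouping function. The tallies are hypothetical, and the tuple layout (engine, bucket, brand, count) is only an assumption about how you might export them from your sheet.

```python
from collections import defaultdict

# (engine, bucket, brand, citation count) for one period; invented numbers.
tallies = [
    ("chatgpt", "comparison", "you", 6),
    ("chatgpt", "comparison", "rival", 4),
    ("chatgpt", "problem", "you", 1),
    ("chatgpt", "problem", "rival", 9),
    ("perplexity", "comparison", "you", 2),
    ("perplexity", "comparison", "rival", 8),
    ("perplexity", "problem", "you", 1),
    ("perplexity", "problem", "rival", 9),
]


def share_by(tallies, dimension, brand="you"):
    """Share of voice for one brand, split by 'engine' or by 'bucket'."""
    index = {"engine": 0, "bucket": 1}[dimension]
    totals, mine = defaultdict(int), defaultdict(int)
    for row in tallies:
        key, row_brand, count = row[index], row[2], row[3]
        totals[key] += count
        if row_brand == brand:
            mine[key] += count
    return {key: mine[key] / totals[key] for key in totals}


print(share_by(tallies, "engine"))
print(share_by(tallies, "bucket"))
```

With these made-up numbers the overall share is 25 percent, which hides both the ChatGPT lead and the problem-stage gap that the two slices expose.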

What a healthy trend looks like

A healthy GEO program shows share of voice grinding upward across several measurement periods, with the gap to your nearest rival narrowing, and your linked-citation tier growing faster than bare mentions. Flat share with rising raw counts usually means the whole category got more visible and you just rode the tide. That is not a win you earned.

Example

Imagine two quarters of data. In Q1 you hold a quarter of the citations in your set and a rival holds a third. In Q2 your raw count goes up, which feels great, until you see that your share has dropped to a fifth because the rival shipped a big content push. The raw count fooled you. The share view caught it. This is hypothetical, but the trap is completely real and I see it constantly.
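To see the trap in numbers, here is the same scenario with counts I picked purely so the fractions match (a quarter and a third in Q1, a fifth in Q2). Nothing here is real data.

```python
# Hypothetical citation counts chosen to match the fractions in the example.
q1 = {"you": 15, "rival": 20, "others": 25}  # 60 total: you 25%, rival a third
q2 = {"you": 20, "rival": 50, "others": 30}  # 100 total: you 20%


def share(period, brand="you"):
    """One brand's slice of all citations logged in the period."""
    return period[brand] / sum(period.values())


count_change = q2["you"] - q1["you"]
share_change = share(q2) - share(q1)

print(f"raw count: {q1['you']} -> {q2['you']} ({count_change:+d})")
print(f"share: {share(q1):.0%} -> {share(q2):.0%}")
```

The raw count rises by five while the share falls by five points, which is exactly the pattern a count-only report would celebrate.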

PRO TIP

Annotate your trend line with what you shipped and when. New comparison page, a schema rollout, a guest placement on a cited review site. When share moves, you want to know the most likely cause, the same way you annotate deploys in any serious analytics setup.

Direction and gap beat the snapshot

Report share of voice over time against named rivals. A rising line with a shrinking gap is the real scoreboard.

CHAPTER 07

The DIY Method: Measure Citations Without a Fancy Tool

You do not need a four-figure subscription to start measuring GEO. You need a clear method and the patience to run it. I built my first citation tracker in a spreadsheet, and for many businesses that is genuinely enough. Here is the exact process, start to finish, that you can run this week.

Set up the sheet

  1. List your locked query set, one prompt per row, tagged by funnel bucket.
  2. Add columns per engine for each run: named (yes/no), linked (yes/no), primary source (yes/no), accuracy flag, and the linked domain.
  3. Add a calculated column for citation rate per query per engine, the fraction of runs you appeared in.
  4. Add a competitor presence column so you can compute share, not just your own count.
  5. Date and version-stamp every batch so each period is comparable.
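If you would rather generate the blank sheet than type it, a sketch like this produces a CSV you can open in any spreadsheet app. It uses a long layout, one row per query, engine, and run, instead of separate columns per engine, which is my own preference and not the only way to do it. The column names, engine list, and version placeholder are assumptions.

```python
import csv
import io

ENGINES = ("chatgpt", "perplexity", "gemini")
FIELDS = [
    "batch_date", "model_version", "engine", "bucket", "prompt", "run",
    "named", "linked", "primary", "inaccurate", "linked_domains", "competitors_named",
]


def blank_rows(queries, runs_per_query, batch_date, versions):
    """Yield one empty scoring row per query, engine, and run."""
    for bucket, prompt in queries:
        for engine in ENGINES:
            for run in range(1, runs_per_query + 1):
                yield {
                    "batch_date": batch_date,
                    "model_version": versions[engine],
                    "engine": engine,
                    "bucket": bucket,
                    "prompt": prompt,
                    "run": run,
                }


def write_sheet(out, rows):
    """Write the header and rows as CSV; unscored fields stay blank."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)


queries = [
    ("problem", "how do I run payroll for my first employee"),
    ("comparison", "best payroll software for a five-person company"),
]
versions = {engine: "fill-in-version-label" for engine in ENGINES}  # placeholders

buffer = io.StringIO()
write_sheet(buffer, blank_rows(queries, 3, "2026-01-05", versions))
print(buffer.getvalue().splitlines()[0])  # the header row
```

To write a real file, swap the in-memory buffer for open("geo_tracker.csv", "w", newline=""). The citation-rate column is left out here because it is calculated from the rows rather than typed in.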

The tool does not measure GEO. The method does. A disciplined spreadsheet beats a sloppy dashboard.

Run the test

  1. Open each engine in a clean, logged-out or fresh session with personalization minimized.
  2. Paste each prompt, run it the agreed number of times, and record every field honestly, including the runs where you do not appear.
  3. Capture the linked sources verbatim, including the third-party domains, not just whether you showed up.
  4. Hold conditions constant across the whole batch: same day, same model versions, same search setting.
  5. Roll the per-query rates up into per-engine share and your overall share of voice.
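The final roll-up step might look like the sketch below once the rows are scored. It assumes each row carries a yes/no "named" field and a semicolon-separated "competitors_named" field; those field names and the sample rows are hypothetical, and this version pools all runs per engine rather than averaging per query.

```python
from collections import defaultdict


def roll_up(rows):
    """Per engine: your citation rate across runs and your share of all brand mentions."""
    runs = defaultdict(int)
    ours = defaultdict(int)
    everyone = defaultdict(int)
    for row in rows:
        engine = row["engine"]
        rivals = [name for name in row["competitors_named"].split(";") if name]
        named = row["named"] == "yes"
        runs[engine] += 1
        ours[engine] += named
        everyone[engine] += named + len(rivals)
    return {
        engine: {
            "citation_rate": ours[engine] / runs[engine],
            "share_of_voice": ours[engine] / everyone[engine] if everyone[engine] else 0.0,
        }
        for engine in runs
    }


# Hypothetical scored rows; only the fields the roll-up needs are shown.
rows = [
    {"engine": "chatgpt", "named": "yes", "competitors_named": "RivalOne;RivalTwo"},
    {"engine": "chatgpt", "named": "yes", "competitors_named": "RivalOne"},
    {"engine": "chatgpt", "named": "no", "competitors_named": "RivalOne"},
    {"engine": "chatgpt", "named": "no", "competitors_named": ""},
    {"engine": "perplexity", "named": "yes", "competitors_named": "RivalOne"},
    {"engine": "perplexity", "named": "yes", "competitors_named": ""},
]
summary = roll_up(rows)
for engine, numbers in summary.items():
    print(engine, numbers)
```

Citation rate and share of voice answer different questions, so the sketch reports both: how often you appear at all, and how much of the brand talk is yours.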

When to graduate to a tool

Dedicated GEO platforms exist that automate the runs, handle multiple engines, and chart share over time for you. They save real labor once your query set grows and you are tracking many competitors across several engines on a tight cadence. The honest test: if you are spending more time copy-pasting prompts than thinking about what the data means, buy the tool. Until then, the sheet teaches you the mechanics, and you will use any tool far better for having done it by hand first.

WATCH OUT

Respect each engine's terms of service when you test. Manual, reasonable-volume checking for your own research is one thing. Hammering an API in ways that violate terms is another. Keep it sane and within the rules.

Spreadsheet first

Locked queries, multi-run rates, four citation columns, version stamps. That sheet is a complete, honest GEO measurement system.

I trust a humble spreadsheet I understand over a slick dashboard I cannot interrogate. If you cannot explain how a number was produced, do not bet a strategy on it.
Shmul

CHAPTER 08

Turning Findings Into Action

Data that does not change a decision is decoration. The whole reason to measure citation share is to know exactly where to point your effort next. Once you have a few periods of clean data, the patterns tell you what to fix in plain terms. Here is how I read the sheet and turn it into a work list.

Start with your biggest gaps, weighted by query value. A comparison prompt where buyers decide is worth more than a vague problem-stage one. Find the high-value queries where your citation rate is low and your competitors are high. That intersection is your priority list. You are not trying to win every prompt. You are trying to win the prompts that move money.
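One simple way to turn that intersection into a ranked list is to multiply each query's value by the gap between the rival's citation rate and yours. The scoring rule, the 1 to 3 value weights, and every number below are my own illustrative assumptions.

```python
def priority_list(queries, top_n=3):
    """Rank queries by value-weighted gap, keeping only those where a rival leads."""
    scored = []
    for q in queries:
        gap = q["rival_rate"] - q["our_rate"]
        if gap > 0:
            scored.append((q["value"] * gap, q["prompt"]))
    return [prompt for _score, prompt in sorted(scored, reverse=True)[:top_n]]


# Hypothetical rates from a few measurement periods; value is a 1-3 judgment call.
queries = [
    {"prompt": "best payroll software for a five-person company",
     "value": 3, "our_rate": 0.2, "rival_rate": 1.0},
    {"prompt": "alternatives to [big incumbent]",
     "value": 3, "our_rate": 0.6, "rival_rate": 0.8},
    {"prompt": "how do I run payroll for my first employee",
     "value": 1, "our_rate": 0.0, "rival_rate": 0.8},
    {"prompt": "is payroll software worth it for a small team",
     "value": 2, "our_rate": 0.8, "rival_rate": 0.4},
]
for prompt in priority_list(queries):
    print(prompt)
```

The last query drops out because you already lead on it. A bigger gap on a low-value prompt can still outrank a small gap on a high-value one, so sanity-check the weights against what you know about where buyers actually decide.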

Do not chase every gap. Chase the high-value queries where a rival beats you and you can realistically catch up.

Map the pattern to the fix

  • Not mentioned at all on a query: you lack a page that clearly addresses that question. Build it, structured for extraction, as I cover in content writing.
  • Mentioned but never linked: the entity is known but no page earns trust. Tighten structure, add clear claims and schema, per schema markup.
  • Cited but inaccurate: the model has stale or muddled facts. Publish an unambiguous, current source page and reinforce it.
  • A third-party domain keeps winning: go earn accurate representation on that source.
  • Strong on one engine, weak on another: study what the winning engine rewards and adapt, since Perplexity and ChatGPT do not weight the same things, as I explain in ranking in Perplexity.
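The first four mappings can be written as a tiny rule function so every query in the sheet gets the same reading. This is a sketch: the zero thresholds are my simplification, and in practice you may want "almost never" to count as "never".

```python
def diagnose(named_rate, linked_rate, inaccurate_rate, top_third_party=None):
    """Translate one query's rates into the fixes listed above."""
    fixes = []
    if named_rate == 0:
        fixes.append("not mentioned: build a page that clearly addresses the question")
    elif linked_rate == 0:
        fixes.append("mentioned but never linked: tighten structure, add clear claims and schema")
    if inaccurate_rate > 0:
        fixes.append("cited but inaccurate: publish an unambiguous, current source page")
    if top_third_party:
        fixes.append(
            f"third-party domain keeps winning: earn accurate representation on {top_third_party}"
        )
    return fixes


# Hypothetical rates for one query on one engine.
for fix in diagnose(named_rate=0.6, linked_rate=0.0, inaccurate_rate=0.2,
                    top_third_party="reviewsite.example"):
    print(fix)
```

The cross-engine comparison is left out because it needs rates from more than one engine side by side, not a single row.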

Close the loop

Every fix is a hypothesis. You changed a page to win a query, so the next measurement period is the experiment that tests it. Annotate what you shipped, re-run the set under the same conditions, and read the move. This loop, measure, fix, re-measure, is the entire discipline. Do it for two or three cycles and you will have something most of your competitors never build: an evidence-based picture of what actually moves your share.

Example

Your data shows you are absent from "best [category] for small teams" across all three engines while one rival is cited every time. You build a genuinely useful, well-structured comparison page aimed at that exact question, add schema, and earn a mention on the third-party roundup the models kept citing. Next period you measure again and watch whether your rate on that query moved. If it did, you found a repeatable play. If it did not, you learned something and you try the next lever.
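Reading the move can be as plain as comparing the two rates and checking that the conditions still match. A minimal sketch with invented numbers and placeholder version labels; it makes no claim about statistical significance, and with five runs a side a small change may just be the wobble.

```python
def read_move(before, after):
    """Compare one query's citation rate across two measurement periods."""
    rate_before = before["hits"] / before["runs"]
    rate_after = after["hits"] / after["runs"]
    return {
        "before": rate_before,
        "after": rate_after,
        "change": rate_after - rate_before,
        # A version change means the move may not be your doing.
        "comparable": before["model_version"] == after["model_version"],
    }


# Hypothetical readings for "best [category] for small teams" on one engine.
before = {"hits": 0, "runs": 5, "model_version": "model-version-A"}
after = {"hits": 3, "runs": 5, "model_version": "model-version-A"}

move = read_move(before, after)
print(move)
```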

Measure, fix, re-measure

Treat each content change as a hypothesis and the next measurement cycle as its test. That loop is GEO.

PRO TIP

Pair this measurement loop with the on-page fundamentals in on-page SEO and the structural work in winning AI Overviews. Citation share tells you where to aim. Those guides tell you what to build once you are aiming right.

Frequently asked

What is citation share in GEO?
It is the percentage of times an AI engine names your brand or links your page across a set of buyer questions, measured against the competitors who appear alongside you. It replaces keyword rank as the core GEO metric because there is no ranked results page inside a ChatGPT or Perplexity answer to measure position on.
Why do I get different AI answers when I run the same prompt twice?
The models are non-deterministic by design, so the same prompt can return different wording and different cited brands each run. That is why you should run each query several times and record your citation rate as a fraction, not test once and treat that single answer as a fact.
Do I need a paid GEO tool to measure citations?
No. A disciplined spreadsheet with a locked query set, multi-run citation rates, and four scoring columns is a complete measurement system. Paid tools save labor once you track many competitors across several engines on a tight cadence, but doing it by hand first teaches you the mechanics so you use any tool better.
Should I count a mention if the model does not link my site?
Yes, but score it separately. A brand mention with no link still plants your name with the buyer and is a real win. A linked citation can drive a click and signals deeper trust. Track them as distinct tiers because each one points to a different fix.
How many queries should be in my GEO test set?
Enough to be stable and few enough that you will actually re-run them every period, which for a focused product usually means a core set in the low dozens weighted toward comparison and problem-stage prompts. A thousand-prompt set looks impressive, never gets re-run, and is therefore useless for tracking a trend.
How often should I re-measure citation share?
On a fixed cadence so each period is comparable, and always note which model versions produced the readings. When a vendor ships a new model your numbers can move more than your content did, so version-stamping every batch is what keeps your trend line honest.
