How the Four Floors Survive Agentic RAG

On 20 May 2026, Mike King published “Beyond RAG” at iPullRank.

The piece documents what every major AI search platform — ChatGPT Search, Perplexity Pro, Google AI Mode, Gemini Deep Research, Microsoft Copilot Researcher, Claude Computer Use — now runs in production: a five-stage agentic pipeline, not a single-shot retrieval. King names the stages: planner, router, retrieval tools, critic, synthesiser. He maps each stage to a Google patent filing dating back to 2018. He introduces stage-failure rate as the operational diagnostic: not “do we rank?”, not “are we cited?”, but “where in the five-stage pipeline does our content drop out?”

It is the strongest practitioner-facing framing of agentic RAG published to date, and it changes the diagnostic conversation a serious practitioner can have with a client. The work that mattered last month was visibility. The work that matters this month is observability.

This essay is the complement to King’s piece, not a competing version of it. Where King describes the HOW — the moment-by-moment process the agentic pipeline runs — the Four-Floor Model describes the WHAT — the four layers of architecture that the pipeline is evaluating at each moment. Without the Four Floors, King’s pipeline has nothing to act on. Without King’s pipeline, the Four Floors have no operational gate. Each framework needs the other to be fully useful.

The Four-Floor Model was published earlier in 2026. It names four layers — Entity Foundation and Discovery, Content Extractability, Trust and Selection, Agentic Execution — each of which is a dependency for the layer above it. This essay maps each floor to the stage in King’s pipeline where that floor gets evaluated, and what stage failure looks like from inside the floor.

Why the two frameworks talk to each other

King’s vocabulary describes a system. The Four-Floor Model describes a substrate. Both are correct. Neither is sufficient on its own.

A system without a substrate is a procedure with nothing to evaluate. King’s piece is unusually careful about this. The planner stage decomposes the user query into sub-queries. The router stage decides which retrieval tool answers each sub-query. The retrieval tools fetch candidate passages. The critic stage applies pairwise scrutiny and reflection logic to the candidates. The synthesiser stage produces the final answer, with citations where the critic accepted them. Each of these stages is a procedure. Each procedure operates on whatever the substrate provides — and whatever the substrate provides is what the floors describe.

A substrate without a system is a static description with no diagnostic. The Four-Floor Model on its own answers “what does an AI-recommendable business look like?” But it does not answer “where in the actual pipeline does my business get filtered out?” That question is operational. It is what a client paying for an AI visibility audit wants to know. King’s stage-failure-rate diagnostic gives the operational answer; the Four Floors give the architectural one. Together they give the practitioner a map of where the business is and a procedure for moving it.

The unification is more than rhetorical. It changes what gets measured. The pre-King AI visibility audit measured citation share — how often the business appeared in AI-generated answers across a set of queries. That measurement is useful but coarse. It does not tell you why the business is not appearing, and it does not tell you what to fix first. The post-King AI visibility audit measures stage-failure rate by floor — for each query, at which floor does the business fall out, and what does that floor need? That is a different conversation. It is also a more expensive one to run, but it produces a roadmap rather than a verdict.
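To make the contrast between the two measurements concrete, here is a minimal Python sketch that computes both from the same set of audit runs. The `Observation` record and its field names are assumptions made for illustration; they do not come from King's piece or from any published tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """One query run on one platform (hypothetical record shape)."""
    query: str
    cited: bool
    failed_floor: Optional[int]  # 1 to 4, or None when nothing failed


def citation_share(observations):
    """Coarse metric: fraction of runs in which the business was cited."""
    if not observations:
        return 0.0
    return sum(o.cited for o in observations) / len(observations)


def failure_rate_by_floor(observations):
    """Diagnostic metric: fraction of runs that dropped out at each floor."""
    total = len(observations)
    if total == 0:
        return {}
    counts = Counter(o.failed_floor for o in observations if o.failed_floor)
    return {floor: counts.get(floor, 0) / total for floor in (1, 2, 3, 4)}


# Invented sample runs, for illustration only.
runs = [
    Observation("head term", cited=False, failed_floor=1),
    Observation("specific intent", cited=False, failed_floor=1),
    Observation("brand-adjacent", cited=False, failed_floor=2),
    Observation("head term, second platform", cited=True, failed_floor=4),
]
print(citation_share(runs))         # 0.25
print(failure_rate_by_floor(runs))  # {1: 0.5, 2: 0.25, 3: 0.0, 4: 0.25}
```

The single score says one run in four produced a citation. The per-floor breakdown says half the runs never got past Floor 1, which is the part that tells you what to fix first.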

There is a strategic implication beneath the technical one. Practitioners who have been doing Four Floors work — Floor 1 entity foundation, Floor 2 extractability, Floor 3 corroboration, Floor 4 agentic readiness — were already doing agentic RAG work whether they named it that or not. King’s framing makes the diagnostic vocabulary portable. It also makes the work defensible to clients in language those clients will recognise.

Floor 1 — Entity Foundation and Discovery, gated at the planner stage

Floor 1 is the foundation: the business identity has to be discoverable by the system before any retrieval happens. NAP consistency across Google Business Profile, Apple Business Connect, Companies House. Wikidata entity presence. Bing indexability. The signals that let an AI system recognise that this entity exists, what category it operates in, and where in the geographic graph it sits.

In King’s pipeline, this floor is evaluated at the planner stage. The planner decomposes the user query into sub-queries. For “best managed file transfer vendor for a UK financial services firm”, the planner is generating sub-queries like “managed file transfer UK financial services”, “MFT enterprise compliance UK”, “secure file transfer financial services regulated”, and so on. Each sub-query is then handed to the router for retrieval. Google patent US11663201B2, Query Variant Generation, names this mechanic explicitly. The planner is doing variant generation against an entity graph it has access to.

If your business does not exist in that entity graph in a recognisable form — if your NAP is inconsistent, if your Companies House registration is misaligned with your Google Business Profile, if you have no Wikidata entry, if your LinkedIn company page name does not match your trading name — the planner does not produce sub-queries that contain your entity. You fail before retrieval starts. King’s framing is precise: stage failure at the planner means the rest of the pipeline never has the chance to see you.
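A NAP consistency check of this kind can be sketched in a few lines of Python. The source names, the sample listings, and the normalisation rules below are all assumptions for illustration; a real check would need proper address parsing and the actual exported listing data.

```python
import re
from collections import defaultdict


def normalise(field, value):
    """Reduce a NAP field to a comparable form (deliberately crude)."""
    text = value.strip().lower()
    if field == "phone":
        digits = re.sub(r"\D", "", text)
        # Treat UK +44 and leading-0 formats as the same number.
        if digits.startswith("44"):
            digits = "0" + digits[2:]
        return digits
    text = re.sub(r"[^\w\s]", "", text)               # drop punctuation
    text = re.sub(r"\b(ltd|limited)\b", "ltd", text)  # unify company suffix
    return re.sub(r"\s+", " ", text).strip()


def nap_conflicts(listings):
    """Return the fields whose normalised values differ between sources."""
    seen = defaultdict(lambda: defaultdict(list))
    for source, record in listings.items():
        for field, value in record.items():
            seen[field][normalise(field, value)].append(source)
    return {f: dict(v) for f, v in seen.items() if len(v) > 1}


# Hypothetical listings for a fictional business.
listings = {
    "google_business_profile": {
        "name": "Example Transfers Ltd",
        "address": "1 Dock Road, Southampton",
        "phone": "+44 23 8000 0000",
    },
    "companies_house": {
        "name": "Example Transfers Limited",
        "address": "1 Dock Road, Southampton",
        "phone": "023 8000 0000",
    },
    "linkedin": {
        "name": "Example Transfer Co",
        "address": "1 Dock Rd, Southampton",
        "phone": "023 8000 0000",
    },
}

for field, variants in nap_conflicts(listings).items():
    print(field, variants)
```

In this sample the phone numbers reconcile once formatting is stripped, while the LinkedIn name and the abbreviated address are flagged. The address flag is the crude normaliser being strict about "Rd" versus "Road", which is the kind of mismatch worth a human look anyway.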

This is the floor where King’s bridge entity concept becomes operationally relevant. A bridge entity is an entity that exists in the knowledge graph and is associated with the topic the user asked about. The planner uses bridge entities to expand the query into the variants that will be sent to retrieval. If your business is not a bridge entity for any of the topics it serves, you are absent from the variant set. The fix is not content. The fix is identity-graph repair: make sure the entity exists in canonical form across the data sources the planner consults, with consistent claims and corroborated relationships.

The diagnostic test for Floor 1 failure is simple in shape and slow in execution. Ask the AI system for “businesses like [a named competitor] in [your city]” or “alternatives to [a named competitor] in [your category]”. If your business appears in the alternatives list when you are not the named entity, you are a bridge entity for that category. If it does not appear, you are not. The work to become one is the work the entity-foundation discipline has always specified: NAP consistency, Wikidata creation, sameAs structured data linking your identity across platforms, Companies House registration aligned with operating brand, named-source mentions in trade publications and analyst reports.
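For the sameAs part of that work, a minimal sketch of the structured data looks like this. The schema.org `Organization` type and its `sameAs`, `legalName` and `url` properties are real; the business, the URLs and the identifiers are placeholders, and the helper function name is hypothetical.

```python
import json


def organization_jsonld(name, legal_name, url, same_as):
    """Build a schema.org Organization object with sameAs identity links."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "legalName": legal_name,
        "url": url,
        "sameAs": sorted(set(same_as)),  # de-duplicate, stable order
    }


doc = organization_jsonld(
    name="Example Transfers",
    legal_name="Example Transfers Ltd",
    url="https://www.example.com/",
    same_as=[
        # Placeholder identifiers, not real records.
        "https://www.wikidata.org/wiki/Q0",
        "https://www.linkedin.com/company/example-transfers",
        "https://find-and-update.company-information.service.gov.uk/company/00000000",
    ],
)

# Embed the output in a <script type="application/ld+json"> element.
print(json.dumps(doc, indent=2))
```

The point of the block is that every profile listed in `sameAs` should carry the same name, address and category claims as the page that publishes it.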

The reason Floor 1 sits at the bottom of the model is that no amount of upper-floor work — no extractability discipline, no corroboration cultivation, no agentic readiness — can compensate for the planner failing to surface the entity in the first place. The variant generator cannot produce sub-queries about an entity that does not exist for it.

Floor 2 — Content Extractability, gated at the pairwise re-rank

Floor 2 is the page-level work. CITATE-compliant passages. Structured data. Machine-readable answers. The discipline that says: when an AI system retrieves your page, your passages need to survive the head-to-head against competitor passages on the same query.

In King’s pipeline, this floor is evaluated at the retrieval-tool plus pairwise re-rank stage. The router sends sub-queries to the retrieval tools — typically a mix of search index, vector store, and direct fetch from named sources. Each retrieval tool returns candidate passages. The critic stage then runs pairwise scrutiny: head-to-head comparison of candidate passages, with the winner advancing to the synthesis stage. Google patent US20250124067A1, Pairwise Ranking Prompting, names this mechanic. King’s vocabulary for it is “your passages have to survive pairwise scrutiny” — which is the operational form of the CITATE C1-C6 criteria translated into the language of the system that is doing the scrutiny.

The implication is structural. A passage that is technically extractable but structurally weak loses to a passage that is structurally strong. Structural strength here means: a single attributable claim in the opening sentence, a named source in the same paragraph, a definition that is self-contained, a stat with provenance. The CITATE framework names this discipline at the page level. King’s pipeline shows where the discipline is mechanically enforced.

Stage failure at the pairwise stage looks like this. Your page ranks in traditional Google results. The retrieval tools fetch it. The critic compares it to a competitor passage on the same query. The competitor passage wins pairwise scrutiny. Your passage is dropped. The synthesiser produces an answer citing the competitor. You appear in nothing. The visible symptom is “we rank but we are never cited” — which is the exact pattern in the Floor 2 symptoms array on the homepage widget, and which we have now traced to the operational stage where the failure occurs.
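The shape of a pairwise comparison can be sketched as a round-robin with a pluggable comparator. The comparator below is a crude stand-in heuristic based on the structural markers named earlier, not how any production critic judges passages, and the sample passages are invented.

```python
from itertools import combinations


def marker_count(passage):
    """Count crude structural markers (an assumed proxy, not a real scorer)."""
    first_sentence = passage.split(".")[0]
    markers = 0
    markers += len(first_sentence.split()) <= 50    # compact opening claim
    markers += "according to" in passage.lower()    # named-source cue
    markers += any(ch.isdigit() for ch in passage)  # a stat is present
    return markers


def prefer(passage_a, passage_b):
    """Stand-in judge: positive if A wins, negative if B wins, 0 for a tie."""
    return marker_count(passage_a) - marker_count(passage_b)


def pairwise_rank(passages, judge=prefer):
    """Compare every pair head-to-head; each win earns one point."""
    wins = {name: 0 for name in passages}
    for a, b in combinations(passages, 2):
        outcome = judge(passages[a], passages[b])
        if outcome > 0:
            wins[a] += 1
        elif outcome < 0:
            wins[b] += 1
    return sorted(wins, key=lambda name: wins[name], reverse=True)


# Invented passages for illustration.
passages = {
    "ours": "We are a leading provider of innovative solutions for many industries.",
    "competitor": "Managed file transfer moves files under audit controls. "
                  "According to Example Analyst, adoption rose 12 percent in 2025.",
    "other": "File transfer matters. It grew 12 percent.",
}
print(pairwise_rank(passages))  # ['competitor', 'other', 'ours']
```

Swapping `prefer` for a different judge changes who wins but not the structure: a passage only advances by beating the specific passages it is compared against.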

The work to fix Floor 2 failure is the work CITATE specifies: opening 50-word extractable answer, attributable claims with named sources, defined terms with provenance, stats with citation. There is no shortcut. But the diagnostic value of King’s framing is that you can now identify which queries are failing at which floor. A query where you are absent from the planner is a Floor 1 problem. A query where you survive into retrieval but lose the pairwise comparison is a Floor 2 problem. A query where you survive pairwise but fail the critic’s corroboration check is a Floor 3 problem. Different failures, different fixes.
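That routing rule is small enough to write down. The sketch below uses the stage-to-floor mapping from this essay; the `Stage` enum and the survival-flag dictionary are hypothetical names chosen for illustration.

```python
from enum import Enum


class Stage(Enum):
    PLANNER = "planner"
    PAIRWISE = "pairwise re-rank"
    CRITIC = "critic"
    HANDOFF = "synthesiser-to-action handoff"


# Floor evaluated at each stage, as mapped in this essay.
FLOOR_FOR_STAGE = {
    Stage.PLANNER: 1,
    Stage.PAIRWISE: 2,
    Stage.CRITIC: 3,
    Stage.HANDOFF: 4,
}


def first_dropout(survived):
    """Return the first stage the content did not survive, or None."""
    for stage in Stage:  # Enum iteration follows definition order
        if not survived.get(stage, False):
            return stage
    return None


def floor_to_fix(survived):
    """Map the first dropout stage to the floor that needs work."""
    stage = first_dropout(survived)
    return None if stage is None else FLOOR_FOR_STAGE[stage]


# Survived the planner, lost the pairwise comparison: a Floor 2 problem.
print(floor_to_fix({Stage.PLANNER: True, Stage.PAIRWISE: False}))  # 2
```

A stage with no recorded result is treated as a failure, so an empty record routes to Floor 1, which matches the dependency order of the model.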

The Schema Architecture for the AI Era keystone (May 2026) sits at this floor. The Footprint vs Fingerprint pre-publication discipline sits at this floor. The CITATE framework operationalises the floor at the page level. King’s pipeline is the operational gate the floor passes through. The frameworks converge.

Floor 3 — Trust and Selection, gated at the critic and reflection stage

Floor 3 is corroboration. Editorial mentions in trade publications. Named-source analyst coverage. Review signals. Cross-source verification. The signals that tell the AI system: this entity is not just claiming to be a serious player in its category, it is treated as one by independent sources.

In King’s pipeline, this floor is evaluated at the critic stage. The critic does not just rank passages against each other. The critic explicitly checks corroboration, contradiction-handling, source diversity, and recency. Google patent US20240289407A1, Search with Stateful Chat, names the memory mechanic the critic uses across multiple turns. The critic is the reflection module — the gate that decides not only “is this passage good?” but “is this source trustworthy enough to be cited by name in the final synthesis?”

The reviews-bridge sentence shipped in the recent Four-Floor widget update prefigured this: AI systems increasingly use corroboration signals similar to how humans evaluate trust — review consistency, editorial mentions, structured business data, and independent references across the web. King’s reflection module is exactly that, mechanised. The critic looks at the page that survived pairwise, asks whether the same claim appears in independent sources, asks whether the source making the claim has a track record of being right on this topic, asks whether there is contradictory evidence elsewhere that needs to be reconciled.

Stage failure at the critic looks like this. Your passage survived pairwise — it is technically the best available answer to the sub-query the router sent. But the critic cannot find independent corroboration. Or it finds contradiction in a more recent source. Or it finds that the entity making the claim has no editorial track record on the topic. The synthesiser receives the critic’s verdict: do not cite by name; paraphrase the claim and attribute to a more authoritative source if one is available. The visible symptom is “competitors get cited for the same factual content we publish first” — which is the displacement pattern several practitioner reports have documented through 2025 and 2026.
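One way to picture the corroboration check is as a count of independent hosts that repeat the claim. This is a toy rule under stated assumptions: the threshold of two sources, the function names and the sample URLs are all invented, and a real critic weighs far more than host diversity.

```python
from urllib.parse import urlparse


def host(url):
    """Lower-cased hostname without a leading 'www.'."""
    name = (urlparse(url).hostname or "").lower()
    return name[4:] if name.startswith("www.") else name


def independent_sources(claim_urls, own_site):
    """Distinct hosts repeating the claim, excluding the entity's own site."""
    own = host(own_site)
    return {
        h for h in map(host, claim_urls)
        if h and h != own and not h.endswith("." + own)  # skip subdomains too
    }


def critic_verdict(claim_urls, own_site, minimum=2):
    """Toy rule: cite by name only with enough independent corroboration."""
    found = independent_sources(claim_urls, own_site)
    if len(found) >= minimum:
        return "cite by name"
    return "paraphrase, attribute elsewhere"


# Hypothetical URLs where the same claim appears.
urls = [
    "https://www.example.com/blog/claim",
    "https://blog.example.com/claim-again",
    "https://trade-journal.example.org/article",
    "https://analyst.example.net/report",
]
print(critic_verdict(urls, "https://example.com"))      # cite by name
print(critic_verdict(urls[:2], "https://example.com"))  # paraphrase, attribute elsewhere
```

Repeating the claim on your own blog and subdomain adds nothing to the count, which is the mechanical reason self-published volume does not substitute for earned mentions.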

The Editorial Selection framework operationalises this floor at the per-event level. The Retrieval Gravity framework operationalises it at the cumulative-memory level. Both describe the work that produces critic-survival: editorial mentions earned (not purchased), named-source contribution to journalists writing on adjacent topics, analyst engagement where the analyst has no commercial relationship with the entity, research and data publication that other people quote because the data is useful, conference speaking on substantive material that gets cited downstream.

The slow part of the work is here. Floor 1 is fixable in weeks. Floor 2 is fixable in months. Floor 3 is fixable across years, because the signal the critic is looking for is durability and consistency across independent sources. The compound advantage goes to the entity that started the routine earlier. King’s framing is useful here because it gives the practitioner the diagnostic vocabulary to explain to a client why the work takes the time it takes — the critic is doing a check that cannot be hurried, and the only way through is to produce the signals the critic is looking for at the cadence the critic expects.

Floor 4 — Agentic Execution, gated at the synthesiser-to-action handoff

Floor 4 is the layer most practitioners are not yet operationally engaged with: agentic readiness. MCP, WebMCP, callable tools, structured-output endpoints. The infrastructure that lets an AI agent not just cite the business but transact with it.

King’s piece does not fully cover Floor 4 because his focus is research and retrieval. But his MCP and tool-calling sections directly map to where Floor 4 fits in the pipeline: the synthesiser-to-action handoff. After the synthesiser produces the answer, in the most advanced agentic configurations, it hands off to a tool call when one is available. Google patent US11769017B1, Generative Summaries for Search Results, covers the synthesis mechanic. The handoff to action is where Floor 4 readiness gets tested.

The configuration is this. A user asks an AI agent to “book the wheelchair-accessible taxi from Southampton port for the 14 May cruise turnaround”. The agent identifies the relevant entity through Floor 1. The agent retrieves the relevant page through Floor 2. The agent validates trust through Floor 3. The synthesiser produces the booking confirmation language. And then — if Floor 4 readiness exists — the agent calls the entity’s booking endpoint directly. If Floor 4 readiness does not exist, the agent stops at the synthesis stage and tells the user to visit the website to complete the booking. The conversion difference is structural.

Floor 4 is currently a pre-production reality moving quickly toward production. The Linux Foundation announced the Agentic AI Foundation in December 2025. MCP SDK downloads passed 97 million per month around the same time. Over 10,000 MCP servers have been published. King’s piece references the MCP and tool-calling layer as the next-stage extension of the agentic pipeline rather than a speculative future. The window for being early to Floor 4 is the 18-month horizon ahead of the moment when the answer to “did the agent transact” stops being “no, the user did” and starts being “yes, the agent did, with the entity that had the callable interface ready”.

The work at Floor 4 is genuinely new in shape. Authority signals at the entity level (the Floor 1 inheritance). Tool descriptions that agents can parse and select between. Schema for the transactional endpoint. Authentication patterns that work with agent-initiated flows. Error handling that returns useful state to the agent rather than to a human. The MCP versus WebMCP guide and the WebMCP technical guide on this site cover the implementation in detail.
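As a rough illustration of two of those items, the tool description and agent-facing error handling, here is a plain-Python sketch with no SDK. The definition follows the shape the Model Context Protocol uses for tools (a name, a description, and a JSON Schema under `inputSchema`), and the result follows its `content` plus `isError` convention. The tool name, fields and booking logic are hypothetical.

```python
import json
from datetime import date

# Tool definition an agent can read and select between (hypothetical tool).
BOOKING_TOOL = {
    "name": "book_accessible_taxi",
    "description": "Book a wheelchair-accessible taxi from a named pickup "
                   "point on a given date.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "pickup": {"type": "string"},
            "date": {"type": "string", "format": "date"},
            "passengers": {"type": "integer", "minimum": 1},
        },
        "required": ["pickup", "date", "passengers"],
    },
}


def tool_result(payload, is_error=False):
    """Wrap machine-readable state so the agent can retry or repair."""
    return {
        "content": [{"type": "text", "text": json.dumps(payload)}],
        "isError": is_error,
    }


def handle_booking(arguments):
    """Validate the arguments and return an MCP-style tool result."""
    required = BOOKING_TOOL["inputSchema"]["required"]
    missing = [key for key in required if key not in arguments]
    if missing:
        return tool_result({"status": "invalid", "missing": missing}, is_error=True)
    try:
        travel_date = date.fromisoformat(arguments["date"])
    except (TypeError, ValueError):
        return tool_result(
            {"status": "invalid", "field": "date", "expected": "YYYY-MM-DD"},
            is_error=True,
        )
    # A real handler would call the booking system here.
    return tool_result({
        "status": "pending_confirmation",
        "pickup": arguments["pickup"],
        "date": travel_date.isoformat(),
    })


print(handle_booking({"pickup": "Southampton port"}))
```

The error payload names the missing fields rather than returning a human-readable apology, so an agent can supply them and call again without a person in the loop.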

Stage failure at the synthesiser-to-action handoff is not yet visible to most operators because most operators have not built the agentic surface that would generate the failure event. The window for being early is the work itself. Floor 4 is where the practitioner conversations of 2027 will start, and the practitioners who started in 2026 will be six to eighteen months ahead of where the conversation begins.

Stage-failure rate applied to the four floors

King’s stage-failure rate is the operational metric. The diagnostic question moves from “are we cited?” to “where in the pipeline do we drop out?”. Mapped onto the floors, the diagnostic becomes a routing instruction.

If you fail at the planner stage, the work is Floor 1. The entity does not exist in the variant set the planner generates. The fix is identity-graph repair: NAP consistency, Wikidata, sameAs, Companies House alignment, Apple Business Connect, LinkedIn company entity, and the named-source mentions that make the entity visible to the planner. Expected timeline: 4 to 12 weeks for the easier signals, 6 to 18 months for the harder ones (Wikidata entry, analyst recognition, editorial track record on the topic).

If you fail at the pairwise re-rank, the work is Floor 2. Your passages survive into retrieval but lose head-to-head against competitor passages. The fix is page-level extractability discipline: CITATE 6/6 enforcement, opening 50-word extractable answer, attributable claims with named sources, defined terms with provenance, stats with citation. Expected timeline: 4 to 12 weeks per page, longer if the existing content estate is large and not currently CITATE-compliant.

If you fail at the critic, the work is Floor 3. Your passages survive pairwise but the critic does not find independent corroboration, or finds contradiction, or judges the source not trustworthy enough on the topic. The fix is the slow off-page work: editorial mentions earned, named-source contribution to journalists, analyst engagement, research publication, conference speaking on substantive material. Expected timeline: 6 to 24 months for measurable shift; the compound effect goes to entities that started the routine earlier.

If you fail at the synthesiser-to-action handoff, the work is Floor 4. The agent has all three lower floors but cannot act on them because there is no callable interface. The fix is agentic-readiness infrastructure: MCP server, WebMCP for browser-based agents, tool descriptions, transactional schema, agent-friendly authentication, error handling that returns useful state. Expected timeline: 4 to 16 weeks for a credible first version; the operational lift compounds with how well-positioned the lower floors are.

The four-floor stage-failure routing produces something useful that pre-King AI visibility audits did not: a roadmap. Not a verdict. The roadmap is sequenced by dependency — Floor 1 before Floor 2, Floor 2 before Floor 3, Floor 3 before Floor 4. The roadmap is also sequenced by leverage: the largest gains for most businesses sit at Floor 1 and Floor 2 because those failures are mechanical and fixable in months rather than years.
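The dependency sequencing can be sketched as a lookup plus a sort. The routing table restates the fixes named in this section; the function name and the input format (failing-query counts keyed by floor) are assumptions.

```python
# Floor number -> (gating stage, class of fix), as described in this section.
ROUTING = {
    1: ("planner", "identity-graph repair"),
    2: ("pairwise re-rank", "page-level extractability discipline"),
    3: ("critic", "off-page corroboration work"),
    4: ("synthesiser-to-action handoff", "agentic-readiness infrastructure"),
}


def roadmap(failures_by_floor):
    """List the floors that have failures, lowest floor first."""
    steps = []
    for floor in sorted(ROUTING):
        count = failures_by_floor.get(floor, 0)
        if count:
            stage, fix = ROUTING[floor]
            noun = "query" if count == 1 else "queries"
            steps.append(f"Floor {floor} ({stage}): {fix}, {count} failing {noun}")
    return steps


# Invented counts; the input order does not affect the output order.
for step in roadmap({3: 1, 1: 3, 4: 1}):
    print(step)
```

Whatever order the failures are recorded in, the output always starts at the lowest failing floor, because upper-floor work cannot compensate for a lower-floor gap.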

A reproducible audit in 30 minutes

King has published an open-source distillation tool at github.com/iPullRank-dev/agentic-rag-audit. The tool stands up a calibrated harness for running queries against an AI search platform and producing stage-failure diagnostics. It is engineering-heavy. For a senior engineer or technical SEO, it is the right starting point.

For a practitioner who wants a lighter version that produces useful signal in about 30 minutes, the following audit pattern works. Three queries, three platforms, the four-floor diagnostic spine.

Choose three queries that represent the commercial intent the business serves. One head term (the broad category). One specific intent (a modifier or use-case query a real buyer would use). One brand-adjacent term (a category descriptor a buyer who almost-knows the brand would search). Run each query against ChatGPT (logged in, deep research mode), Perplexity (Pro), and Google AI Mode. Capture the answer and the citations.

For each query and each platform, ask four questions in order. Does the business appear in the answer at all? If not, the failure is at Floor 1 — the planner did not surface the entity. If yes, is the business cited by name with a link to its own pages? If not, the failure is at Floor 2 or Floor 3 — the passages either lost pairwise or failed the critic. If yes, do the cited pages have the structural markers that would make them survivors of the pairwise re-rank — extractable opening, named-source attribution, defined terms, structured data? If not, the citation is fragile and likely to drop in the next platform refresh. If yes, does the answer contain a tool-call or transactional handoff to the business? If not, the Floor 4 readiness is absent — the business is being recommended but not transacted with.
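The four questions form a simple decision chain, which can be sketched as follows. The function and parameter names are hypothetical, and the sample answers are invented; the answers themselves come from reading each captured response by hand.

```python
def diagnose(appears, cited_with_link, has_structural_markers, has_handoff):
    """Walk the four audit questions in order and return the first finding."""
    if not appears:
        return "Floor 1: the planner did not surface the entity"
    if not cited_with_link:
        return "Floor 2 or Floor 3: lost pairwise or failed the critic"
    if not has_structural_markers:
        return "Fragile citation: strengthen Floor 2 markers"
    if not has_handoff:
        return "Floor 4: recommended but not transacted with"
    return "No failure observed on this run"


# One row per (query, platform) pair, recorded by hand during the audit.
observations = {
    ("head term", "ChatGPT"): (False, False, False, False),
    ("specific intent", "Perplexity"): (True, False, False, False),
    ("brand-adjacent", "Google AI Mode"): (True, True, True, False),
}

for (query, platform), answers in observations.items():
    print(f"{query} on {platform}: {diagnose(*answers)}")
```

With three queries and three platforms the full grid has nine rows, each of which yields one finding.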

This is the four-floor stage-failure diagnostic in 30 minutes. It is not as deep as King’s harness. It is enough to start a serious client conversation.

The third frame: surface divergence (Petra Labs, February 2026)

King’s piece covers the pipeline. The Four-Floor Model covers the substrate. There is a third frame that completes the picture: surface divergence.

Petra Labs published a 900-trial empirical study in February 2026 documenting that the same AI model produces materially different brand recommendations, different sources, and different content types depending on access surface. Logged-in chat versus logged-out chat versus API access produced visibility gaps of up to 32 percentage points on the same query. One brand was absent entirely from API trials while appearing in 15 to 18 percent of chat trials. The underlying retrieval system is the same; the surface routing produces different filtering behaviour.

The implication for the Four Floors is that the work cannot be evaluated on a single surface. A business with strong Floor 1 entity foundation may be visible on logged-in ChatGPT and absent on the OpenAI API. A business with strong Floor 2 extractability may be cited in Perplexity Pro and ignored in Perplexity’s standard interface. The same underlying retrieval system runs different filtering when the access surface changes — for safety, for personalisation, for cost optimisation, for product differentiation.
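Measuring divergence comes down to comparing appearance rates per surface. The sketch below shows the arithmetic; the surface names and trial counts are invented for illustration and are not figures from the Petra Labs study.

```python
def appearance_rate(trials):
    """Share of trials (booleans) in which the brand appeared, in percent."""
    return 100.0 * sum(trials) / len(trials) if trials else 0.0


def surface_gap(trials_by_surface):
    """Return per-surface rates and the largest gap in percentage points."""
    rates = {s: appearance_rate(t) for s, t in trials_by_surface.items()}
    gap = max(rates.values()) - min(rates.values()) if rates else 0.0
    return rates, gap


# Invented data: 20 trials of the same query on each surface.
trials = {
    "logged_in_chat": [True] * 8 + [False] * 12,
    "logged_out_chat": [True] * 5 + [False] * 15,
    "api": [True] * 1 + [False] * 19,
}

rates, gap = surface_gap(trials)
print(rates)  # {'logged_in_chat': 40.0, 'logged_out_chat': 25.0, 'api': 5.0}
print(gap)    # 35.0
```

The gap is expressed in percentage points, not as a relative percentage, which is the same convention the study's headline figure uses.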

The audit pattern in the previous section hints at this by sampling three platforms. The fuller answer is portfolio coverage: measure the floors across logged-in, logged-out, and API surfaces of the major platforms, and track the divergence over time. Pre-King AI visibility was a single score. Post-King AI visibility is a stage-failure-rate diagnostic by floor. Post-Petra AI visibility is the stage-failure-rate diagnostic by floor and by surface. Each addition is more expensive to run. Each addition produces better client conversation.

What this changes for the practitioner

The agentic-RAG era does not replace the Four Floors. It operationalises them. The frameworks that named the work — entity foundation, content extractability, trust and selection, agentic execution — describe the same architecture King’s pipeline is mechanically evaluating. Practitioners who have been doing Four Floors work were already doing agentic RAG work whether they named it that or not. King’s framing makes the diagnostic vocabulary measurable and defensible. The work itself does not change.

What does change is how the work gets sequenced and reported. The pre-King practitioner conversation went: “your AI visibility is X out of Y, here are some recommendations.” The post-King conversation goes: “your business is failing at the planner stage on three of five tested queries, at the pairwise re-rank on one, and surviving the critic but absent at the synthesiser-to-action handoff on the remaining one. Here is what each floor needs and in what sequence.” The first conversation is a verdict. The second is a roadmap.

The reporting cadence shifts with it. A monthly AI visibility report that produces a single score is not a useful artefact when the underlying system runs stage-by-stage. A monthly report that produces stage-failure-rate by floor, by query, by surface, is a useful artefact because it tells the operator where to spend the next month. King’s diagnostic vocabulary is what makes the monthly report meaningful instead of decorative.

For senior practitioners, the second-order effect is more interesting. The frameworks that win the next two years will be the ones that operationalise a recognised problem at a recognised level of detail. King has operationalised the pipeline. The Four-Floor Model has operationalised the substrate. Petra Labs has operationalised the surface divergence. Each piece of work has a different audience, but they fit together cleanly. The practitioner who can describe a client’s situation in all three vocabularies has the strongest seat at the table. The conversation that produces the next budget is the one that maps each technical observation to its operational consequence.

This is the moment the AI visibility category stops being adjacent to SEO and becomes its own discipline.

The discipline has names now: King’s, the Four-Floor Model’s, Petra’s. The frameworks compose. The compounded vocabulary is the one practitioners will be quoted in for the next eighteen months. King moved first. The Four-Floor Model is the substrate his pipeline runs on. The work is the same work it always was. The language for explaining it is now sharper.

Credits and further reading

Mike King’s Beyond RAG is the piece this essay builds on. The full piece is at ipullrank.com/agentic-rag and the open-source distillation tool is at github.com/iPullRank-dev/agentic-rag-audit. King’s vocabulary (agentic RAG, stage-failure rate, pairwise scrutiny, reflection module, bridge entity) is the operational language this essay uses throughout. The Google patents King references, four of which this essay discusses directly, are: US11663201B2 (Query Variant Generation), US20240362093A1 (Query Response Using Custom Corpus), US20240289407A1 (Search with Stateful Chat), US20250124067A1 (Pairwise Ranking Prompting), US11769017B1 (Generative Summaries for Search Results).

The Petra Labs surface-divergence study (February 2026) is the third frame this synthesis depends on. The full paper documents 900 trials across nine platform-surface combinations and is the strongest empirical work on cross-surface AI visibility published to date.

The Four-Floor Model master post sits alongside this essay as the canonical statement of the framework. The MCP and WebMCP technical implementation is covered at MCP versus WebMCP: Building the Floors and the WebMCP guide. The CITATE framework — the page-level discipline that Floor 2 references — carries its full taxonomy and scoring criteria. The AI Discovery Stack covers the discovery-layer companion that sits below Floor 1 in the entity-surfacing chain.

Sean Mullins

Founder of SEO Strategy Ltd with 20+ years in SEO, web development and digital marketing. Specialising in healthcare IT, legal services and SaaS — from technical audits to AI-assisted development.