AI Architecture 10 min read

Architecting AI Agents with Web IQ: The External Data Gap

Architecting AI Agents with Web IQ: The External Data Gap
A hands-on architect's guide to grounding enterprise AI agents with Microsoft Web IQ — passage-level retrieval, 164 ms p95 latency, ~70% token savings, and the integration paths (REST, MCP, Copilot Studio, Semantic Kernel) you can wire up today.

When building enterprise-grade AI applications, the models we rely on—whether massive foundational models or agile small language models—share a fundamental flaw: their knowledge is finite. Because training happens at a static point in time, models suffer from strict knowledge cut-off dates. Furthermore, out of the box, they have zero native visibility into your organization’s private data or the real-time state of the outside world.

As we move from simple chat assistants to complex, multi-step agentic workflows, managing these data boundaries is critical. If an AI hallucinates a fact in step one of an orchestration loop, that error cascades—each downstream step compounds the mistake until trust in the entire output collapses. This challenge is amplifying as the industry shifts toward reasoning-focused models. By stripping out bulky intrinsic knowledge to reduce parameter counts and improve reasoning speed and cost, these models mandate a robust strategy for external knowledge injection. A leaner model is a hungrier model: it reasons beautifully, but only over the facts you feed it.

The Internal Context: Work, Fabric, and Foundry IQ

Retrieval-Augmented Generation (RAG) is the standard mechanism for appending missing knowledge to an LLM’s prompt. For internal organizational data, the Microsoft ecosystem provides dedicated grounding surfaces—part of the broader “IQ” family unveiled at Build 2026:

  • Work IQ: Microsoft 365 collaboration context—people, email, documents, Teams, workflows, and the artifacts they produce.
  • Fabric IQ: Enterprise structured data, business entities, and systems of record.
  • Foundry IQ: Authoritative documents, policies, and curated knowledge bases—your custom repositories.

Think of these as the three layers an enterprise agent draws on for internal truth: data context, knowledge context, and user/workflow context. But an agent often needs to look beyond the corporate firewall. Whether it is monitoring competitor pricing, analyzing live market conditions and regulations, or tracking brand sentiment, your agent hits the External Data Gap—real-world information that no internal tool can supply.

Diagram showing the secure Internal IQ hub versus the vast open External Data Gap

The Five Pillars of AI-Grade Web Retrieval

You cannot simply point an AI at a standard consumer search engine. When architecting an automated evaluation framework or wiring tools through MCP (Model Context Protocol) servers, web retrieval has to satisfy machine-to-machine requirements that a human-facing search box was never designed for:

  1. Freshness: The data must reflect the absolute current state of the world. A streaming index that absorbs new pages in milliseconds beats a nightly batch crawl every time.
  2. Ultra-Low Latency: Humans tolerate a few seconds of page-load time; agents do not. They operate in iterative loops, often firing several lookups per turn—slow retrieval multiplies across every hop and breaks the experience.
  3. Index Breadth: The engine must parse deep, globally distributed, relevant information. Bing’s substrate behind Web IQ indexes 50+ billion documents and processes roughly 15 million new or updated pages daily.
  4. Responsible Grounding: The engine must honor publisher paywalls, robots.txt directives, and publisher preferences. Ignoring those requirements exposes your organization to severe IP-infringement risk.
  5. Integration: The service must plug directly into your orchestration logic via REST APIs, SDKs, or MCP servers—not a browser tab a human has to babysit.

Enter Web IQ: The Engine Behind Copilot

Web IQ, announced at Build 2026, is Microsoft’s ground-up rebuild of web search for AI systems rather than humans. Powered by the global Bing index and already in production grounding both Microsoft Copilot and OpenAI’s ChatGPT web responses, it is engineered around a single premise: models don’t need documents, they need information—and a full web page is a noisy, expensive proxy for the one paragraph that actually answers the query.

Two numbers define its profile:

  • 164 ms p95 latency for the full retrieval pipeline—fast enough to sit inside a tight agent loop without the user feeling it.
  • A leading score on GDSAT (Grounding Satisfaction), Microsoft’s internal quality metric that grades completeness, freshness, and authority across thousands of production queries.

The architectural departure matters: unlike the retired Bing Search API or the simpler “Grounding with Bing” wrapper, Web IQ does not return a ranked list of blue links. It returns evidence objects—pre-chunked, citation-ready passages with provenance metadata—that you drop straight into a context window with no parsing pipeline in between.

Solving Tokenomics with “Passages”

The most consequential architectural lever in Web IQ is Tokenomics. LLM providers charge per token. Passing entire web pages does double damage: it wastes money on input tokens and degrades output quality by burying the answer in irrelevant noise.

Web IQ’s Passages mode is the fix. Instead of returning full HTML or 10,000 characters of raw text, configuring your call for Passages instructs the engine to extract only the “golden paragraph”—the specific segment that directly answers the query—along with its source, publication date, and an authority score. Microsoft reports up to ~70% fewer tokens than piping the equivalent raw HTML through an LLM. For an agent making several searches per turn, that saving compounds on every single hop.

Illustration of a large, cluttered web page document being filtered into a single glowing Passages block

💡

Token Optimization: Passages sends drastically fewer, far more relevant tokens to the model. You get the highest grounding satisfaction and slash inference cost—because you stopped paying the model to read boilerplate, nav bars, and cookie banners.

Web IQ Modalities and Output Formats

Web IQ lets you request data in multiple output formats—HTML, Plain Text, Markdown, and Passages—across several specialized endpoints:

  • Web: Broad internet queries, optimized for token efficiency.
  • News: High-velocity, current-event indexing.
  • Video: Rich metadata—summaries, view counts, publisher info, source URLs—without processing the video file itself.
  • Images: Captions, dimensions, and host-page URLs.

Pick the endpoint and the format together: a competitor-pricing monitor wants web + passages; a “what shipped today” news ticker wants news + markdown so headlines render cleanly in your UI.

The “Browse” Feature: Security Meets Speed

Often an agent is handed a specific URL by a user and told to “analyze this.” Opening unknown URLs with a headless browser or computer-use agent is slow and introduces a serious attack surface—malicious payloads and prompt-injection lurking in page content.

With Web IQ’s Browse endpoint you pass the URL to the API; Web IQ crawls the destination in its own isolated infrastructure and returns sanitized content. Your application never opens a raw browser connection to a hostile origin.

Illustration of an AI agent safely extracting clean document data from within a protective glass bubble, shielded from internet chaos

🛡️

Safe Browsing: Browse keeps the risky connection isolated from your core logic—shielding you from both slow page loads and injection attempts buried in the fetched HTML. Still treat the returned text as untrusted input (see below).

From Theory to Practice: How You Actually Call It

Concepts are nice; let’s wire something up. Here is the part most write-ups skip.

Step 1 — Get access (and reset your expectations)

Web IQ is not generally available yet. The realistic timeline:

MilestoneStatus
Limited enterprise accessOpen now (Build 2026, June) — waitlist at aka.ms/webiq-waitlist
Public previewTargeted Q3 2026 (full JSON schema publishes here)
General availabilityTargeted end of 2026
Expanded language support2027 (English-only at launch)

Priority onboarding goes to organizations with an existing Microsoft account-team relationship building production AI workloads. If that’s you, join the waitlist and request priority—your real-world scenario can also shape the roadmap.

Step 2 — Choose your integration path

Once you’re in, Web IQ surfaces through four paths. Match the path to your stack:

PathBest forCode?
Direct REST APICustom orchestrators, any languageYes
Foundry IQ MCPMCP-native agents (JSON-RPC 2.0 — Claude Desktop, any MCP host)Minimal
Copilot Studio nodeLow-code makers; a “Web IQ” knowledge source on the canvasNo
Semantic Kernel WebGroundingPluginPython/.NET/Java agents; handles auth, retries, cachingYes

If you already run MCP-based agents, the Foundry IQ MCP path is the shortest distance to value—Web IQ shows up as just another tool your existing host can call.

Step 3 — Know the response shape

The public schema lands with the Q3 2026 preview, but Microsoft has already documented the core response fields. Design against these now so you’re ready on day one:

Code
// Illustrative Web IQ response — field names are documented; full schema ships with the Q3 2026 preview.
{
  "main_answer": "A synthesized, citation-ready summary of the retrieved evidence.",
  "evidences": [
    {
      "passage": "…the golden paragraph that directly answers the query…",
      "url": "https://example.com/eu-ai-act-2026",
      "title": "EU AI Act: High-Risk Obligations",
      "published": "2026-06-10",
      "authority_score": 0.91
    }
  ],
  "citations": [
    /* source list with publication dates + authority scores */
  ],
  "ranked_results": [
    /* underlying search results with raw ranking scores  */
  ]
}

The golden rule: build your agent to consume evidences[].passage directly, and surface citations to the user. You should almost never need ranked_results—that field exists for debugging and re-ranking, not for stuffing into the prompt.

Step 4 — A bridge you can run today

You don’t have to wait idly for the dedicated API. The GA web-grounding tool in Microsoft Foundry Agent Service rides the same Bing substrate and is callable right now with the azure-ai-projects SDK. It returns inline citations rather than pre-chunked passages, but it’s the most direct way to start building (and benchmarking) your grounding layer before Web IQ opens up:

Code
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    PromptAgentDefinition,
    WebSearchTool,
    WebSearchApproximateLocation,
)

project = AIProjectClient(
    endpoint="https://<resource>.ai.azure.com/api/projects/<project>",
    credential=DefaultAzureCredential(),
)
openai = project.get_openai_client()

# Attach the web-grounding tool. search_context_size (low | medium | high)
# is your token-vs-coverage dial — start at "low" and raise only if recall suffers.
agent = project.agents.create_version(
    agent_name="market-watcher",
    definition=PromptAgentDefinition(
        model="gpt-5-mini",
        instructions="Answer only from fresh web evidence, and always cite your sources.",
        tools=[WebSearchTool(
            user_location=WebSearchApproximateLocation(country="AE", city="Dubai"),
        )],
    ),
)

# tool_choice="required" forces grounding so the model can't quietly
# fall back to stale parametric memory and hallucinate.
response = openai.responses.create(
    tool_choice="required",
    input="What did our top competitor announce this week?",
    extra_body={"agent_reference": {"name": agent.name, "type": "agent_reference"}},
)

print(response.output_text)
for item in response.output:
    if item.type == "message":
        for ann in item.content[-1].annotations:
            if getattr(ann, "type", None) == "url_citation":
                print("Source:", ann.url)

Three production levers worth internalizing from that snippet:

  • tool_choice="required" — the single most effective hallucination guard. It forces the model to ground before answering. Without it, the model decides on its own whether to search, and a confident model often skips it.
  • search_context_size — your direct Tokenomics knob (low / medium / high). It’s the GA-tool equivalent of choosing Passages: less context window spent on retrieval, lower cost.
  • user_location — geo-relevant results. Pricing, regulations, and “near me” queries change by market—pin it to your scenario’s region.

Step 5 — Treat every result as untrusted input

Grounding widens your attack surface. Anything the web returns—Passages, Browse output, citations—can carry prompt-injection payloads. Before you act on retrieved content:

  • Validate and sanitize results before passing them to downstream systems or tools.
  • Never echo secrets or sensitive PII into prompts that get forwarded to an external grounding service—your data crosses the Azure compliance/geo boundary when it does.
  • Add exponential backoff for 429 rate-limit errors; agent loops burst requests fast.
  • Don’t assume a VPN protects you. Web-grounding tools act as public endpoints and don’t respect private endpoints—factor that into network-secured designs.

Design for Swappability

The search-API market is moving fast, and Web IQ isn’t your only option while you wait. Treat your grounding layer as a swappable module behind a thin retrieval interface, and you can switch providers without rebuilding your agent. If you need something shipping in production today, Brave, Tavily, and Exa are all proven, reasonably priced, and available now—each with a different strength profile (broad coverage, agent-first output, and semantic depth, respectively). Abstract the interface, and migrating to Web IQ at preview becomes a one-file change rather than a rewrite.

The Bottom Line on TCO

When evaluating Total Cost of Ownership for AI infrastructure, it’s tempting to fixate on the cost per API call. Look one layer deeper. A Web IQ call may cost slightly more upfront than a raw search, but it makes the total pipeline meaningfully cheaper.

By filtering noise at the search layer and using Passages, you trim LLM input tokens by up to ~70% versus feeding in raw HTML—on every hop of a multi-step loop. You pay a little more for the search to save a lot on generation, and you get faster, higher-quality, responsibly grounded answers as a bonus. The expensive token is the one your model never needed to read.

🚀

Availability Note: Web IQ is in limited enterprise access now, with public preview targeted for Q3 2026 and GA by end of 2026 (English-only at launch). Join the waitlist at aka.ms/webiq-waitlist and engage your Microsoft account team to outline your scenario. Building today? Prototype on the GA Foundry web-grounding tool so you’re ready to switch the moment Web IQ opens.

Discussion

Loading...