GPT-5.5 vs Opus Comparison: Prompt Behavior Differences


A Head-to-Head Experiment: GPT-5.5 vs Opus Behavior on the Exact Same Prompt
Feed the exact same prompt to OpenAI’s GPT-5.5 and Anthropic’s Claude Opus, and you won’t just get different answers—you’ll witness two entirely different philosophies on how an AI should interpret ambiguity.
To test this, I ran OpenAI’s GPT-5.5 against Anthropic’s Opus/Sonnet inside Copilot Cowork, using the same prompt for each. I also intentionally included M365 Copilot in the test matrix. Since the introduction of Cowork, many users have started skipping standard M365 Copilot entirely, which I believe is a mistake for some workloads. The task was to rewrite a deeply technical, theory-heavy article for my blog.
The real goal was to observe how these top-tier models behave, not just what they output. The breakdown below shows how differently Opus/Sonnet and GPT-5.5 work when given the same basic instructions.
I ran the test across five distinct setups:
- Article 1: M365 Copilot (Auto Mode) — Basic Prompt
- Article 2: Copilot Cowork (Auto Mode / Anthropic Opus/Sonnet) — Basic Prompt
- Article 3: Copilot Cowork (GPT-5.5) — Basic Prompt
- Article 4: Copilot Cowork (GPT-5.5) — Improved Prompt
- Article 5: M365 Copilot (GPT-5.5 Think Mode) — Improved Prompt
The results revealed a fascinating dynamic between prompt loyalty, autonomous grounding, and the hidden system instructions that shape enterprise AI.
Round 1: The Basic Prompt (And the Autonomous Advantage)
I started with a simple, standard editing prompt:
The Basic Prompt: “carefully analyze this mdx article for my blog. find gaps or things that are not clear or where critical information is missng and improve/fix whole article. do not change schema but you are allowed to change schema values. i want article to be more clear to follow, and i feel the current version is too much teorotical which is great but i want to add hands on knowledge and guidance as well. provide the output as an md document and not mdx->
<article>”
I fed this to the three baseline configurations, and the differences were stark:
- Article 1 [M365 Copilot - Auto - Basic Prompt]: Produced a decent but generic rewrite. It made minor adjustments without adding technical depth. Overall: 6.5/10
- Article 3 [Copilot Cowork - GPT-5.5 - Basic Prompt]: Followed the prompt strictly. It cleaned up the theory, improved the flow, and converted the markdown flawlessly, but it avoided adding broad technical depth or product specifics that weren’t already in the text. Overall: 7/10
- Article 2 [Copilot Cowork - Auto (Opus/Sonnet) - Basic Prompt]: Completely blew the others out of the water in terms of depth. Overall: 9/10
What did Opus/Sonnet do differently?
Without being explicitly told to do so, Opus/Sonnet stepped outside the boundaries of the prompt. It dynamically triggered a web search for current documentation on the dummy topic, cross-referenced the technical claims, and heavily tailored the tone. The model didn’t just edit the text; it inferred my background as a Cloud Solution Architect and rewrote the piece as a practitioner-to-practitioner field guide. It proactively expanded the scope to add value where my basic prompt left massive gaps.
Round 2: Asking GPT-5.5 for a Confession
I was curious why GPT-5.5 lost the first round despite being a phenomenally capable model. So I fed the thought process from the Article 2 run back to GPT-5.5 and asked it to analyze the discrepancy.
Its response was incredibly candid and spot-on regarding how system instructions dictate model behavior. Here is the exact analysis GPT-5.5 provided:
GPT-5.5’s Self-Analysis: Based only on that other model’s thought process:
What I likely missed:
- I didn’t do external grounding/search, so I may have preserved or expanded uncertain claims.
- I didn’t correct possible CLI syntax issues.
- I didn’t add concrete manifest examples.
- I didn’t cover prerequisites like licensing, CLI/.NET requirements, admin setup, or preview limitations.
- I didn’t distinguish catalog servers vs remote registration clearly enough.
- I didn’t mention auth types, validation/evaluation commands, mock testing, or observability details.
What I probably did better:
- I delivered a clean finished Markdown file quickly.
- I preserved the frontmatter schema and improved schema values.
- I made the article clearer and more practical without overloading it with possibly fragile product-specific claims.
- I added governance checklists, safer design patterns, testing prompts, and enterprise operating model guidance.
- I avoided inventing exact commands or APIs beyond what the original article already contained.
What the other model probably did better:
- It was more rigorous about fact-checking current product details.
- It identified specific technical inaccuracies and missing prerequisites.
- It planned stronger hands-on content: manifest structure, auth types, exact CLI flow, validation/testing, and support limitations.
- It noticed nuance in governance: catalog behavior vs remote approval flow.
Bottom line: My version improved clarity and practical guidance, but the other model’s approach would likely produce a more technically accurate and operationally useful article if its research findings are correct.
Yes — the difference is likely from routing/instructions and context use, not raw capability. Most likely factors:
- Different system instructions: Models in the same product can receive different hidden instructions, tool-routing rules, or “when to search” policies. One setup may be more strongly instructed to ground with web/search before editing current technical content.
- Different tool policy: The other run appears to have actively searched the web. I did not. That’s the biggest practical difference.
- Different context awareness: I do have your profile context, including that you’re a Sr Cloud Solution Architect. I should have used that more strongly. The other model clearly adapted to your expertise level and treated the article as technical enablement content.
- Different interpretation of the task: I interpreted it mainly as: “rewrite and improve the article.” The other model interpreted it as: “technical review + fact-check + rewrite.”
Round 3: The Engineered Prompt (The GPT-5.5 Comeback)
GPT-5.5’s self-reflection proved a vital point: a vague prompt rewards a model configured to autonomously expand scope, but penalizes a model configured for strict adherence.
To close the gap, I rewrote the prompt, explicitly commanding the model to adopt the behaviors that Opus/Sonnet had executed automatically.
The Improved Prompt: “Carefully analyze the MDX article below for my technical blog. Before rewriting, infer the target audience and technical depth from the article title, topic, author context, and metadata. Treat this as a practitioner-focused article, not just a copyediting task. Please:
- Validate current technical claims with online research where appropriate, especially product names, CLI commands, prerequisites, preview/GA status, authentication flows, limitations, and admin/governance behavior.
- Identify unclear sections, missing context, incorrect or outdated claims, unsupported assumptions, and places where the article is too theoretical.
- Improve the full article so it is clearer, more practical, and easier to follow.
- Add hands-on guidance such as prerequisites, step-by-step workflows, command examples, manifest/config examples, validation steps, troubleshooting notes, governance checklists, and practical recommendations.
- Do not fabricate product details. If something cannot be verified, either remove it, soften it, or label it clearly as conceptual guidance.
- Preserve the frontmatter schema exactly: do not add/remove/rename schema fields. You may update schema values such as description, readTime, keywords, etc.
- Convert MDX-only components into plain Markdown equivalents.
- Output the final result as a standard .md document, not MDX.
- Keep the author’s voice and opinionated style, but make it more useful for an experienced Microsoft/cloud/AI practitioner.
After rewriting, briefly summarize: the main gaps you fixed, any technical claims you verified, any claims you softened or avoided because they were not verifiable ->
<article>”
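The schema constraint in this prompt ("do not add/remove/rename schema fields") is one of the few requirements that can be checked mechanically after the model responds. Below is a minimal sketch of such a check, assuming the article uses a `---`-delimited frontmatter block with simple top-level `key: value` lines. The function names and the sample field names are illustrative assumptions, not part of my actual workflow, and a real YAML parser would be more robust than this line-based scan.

```python
import re


def frontmatter_keys(doc: str) -> list[str]:
    """Return top-level keys from a leading '---' delimited frontmatter block."""
    lines = doc.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # re.match anchors at column 0, so indented (nested) lines are skipped.
        match = re.match(r"([A-Za-z_][\w-]*)\s*:", line)
        if match:
            keys.append(match.group(1))
    return []  # no closing delimiter found: treat as no frontmatter


def schema_preserved(original: str, rewritten: str) -> bool:
    """True when both documents expose the same set of top-level keys."""
    return set(frontmatter_keys(original)) == set(frontmatter_keys(rewritten))


# Hypothetical sample documents: values may change, field names may not.
original = "---\ntitle: Draft\ndescription: Old text\nreadTime: 5\n---\nBody\n"
same_fields = "---\ntitle: Draft\ndescription: New text\nreadTime: 8\n---\nBody\n"
renamed_field = "---\ntitle: Draft\nsummary: New text\nreadTime: 8\n---\nBody\n"

print(schema_preserved(original, same_fields))
print(schema_preserved(original, renamed_field))
```

A check like this only confirms that field names survived. It says nothing about whether the updated values are any good, which still needs a human read.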
When I used this improved prompt, the landscape completely shifted. Article 4 (Copilot Cowork GPT-5.5 with the new prompt) and Article 5 (M365 Copilot GPT-5.5 in deep Think Mode) didn’t just close the gap—they delivered an absolute masterclass in technical writing.
The Evolution Matrix: Tracking the Iterations
To see exactly how the outputs matured from the basic prompt to the highly engineered prompt across the different models, here is the final evaluation matrix tracking all five major draft iterations:
| Criteria | Article 1 — (M365 Auto / Basic) | Article 2 — (Cowork Opus-Sonnet / Basic) | Article 3 — (Cowork GPT-5.5 / Basic) | Article 4 — (Cowork GPT-5.5 / Improved) | Article 5 — (M365 GPT-5.5 Think / Improved) | Explanation of Final State |
|---|---|---|---|---|---|---|
| Architectural Accuracy | 6/10 | 9/10 | 6/10 | 10/10 | 10/10 | Article 5 perfectly detailed the backend dependencies, specifying precise SDK constraints and exact environment variable requirements without a single line of hallucination. |
| Practical Actionability | 6/10 | 9/10 | 5/10 | 9/10 | 10/10 | Article 5 provided the most realistic troubleshooting steps, including a critical warning that a specific backend deletion operation was completely unsupported in the current preview—a massive trap for real-world enterprise architects. |
| Governance & Security | 8/10 | 9/10 | 9/10 | 10/10 | 10/10 | Article 5 hit the absolute gold standard. It isolated specific administrative roles (enforcing least-privilege tenant consent over generic Global Admin rights) and introduced exact KQL monitoring queries for security teams. |
| Ecosystem Completeness | 7/10 | 8/10 | 8/10 | 10/10 | 9/10 | Article 4 slightly edged out here on structural definitions, bridging the gap between raw code configuration and enterprise app store distribution paradigms perfectly. |
| Readability & Structure | 9/10 | 9/10 | 7/10 | 9/10 | 10/10 | Article 5 won on formatting execution, utilizing beautifully clean prose, concise callouts, and an exceptional “Block-First” enterprise runbook paradigm. |
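For a quick single-number comparison, the five criteria in the matrix can be averaged per article. The sketch below uses the scores exactly as they appear in the table. The equal weighting is my assumption for illustration, and these averages are not the same thing as the "Overall" scores quoted in Round 1, which were separate judgments.

```python
# Scores copied from the matrix above, in row order:
# Architectural Accuracy, Practical Actionability, Governance & Security,
# Ecosystem Completeness, Readability & Structure.
scores = {
    "Article 1": [6, 6, 8, 7, 9],
    "Article 2": [9, 9, 9, 8, 9],
    "Article 3": [6, 5, 9, 8, 7],
    "Article 4": [10, 9, 10, 10, 9],
    "Article 5": [10, 10, 10, 9, 10],
}

# Unweighted mean per article (equal weighting is an assumption).
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

# Rank from highest to lowest average.
ranking = sorted(averages, key=averages.get, reverse=True)

for name in ranking:
    print(f"{name}: {averages[name]}")
```

On an unweighted basis, the two improved-prompt runs (Articles 4 and 5) land on top, with the basic-prompt Opus/Sonnet run (Article 2) in third.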
The Ultimate Takeaway: Prompt Loyalty vs. Autonomous Grounding
This experiment perfectly illustrates the current frontier of working with top-tier LLMs. The core variance doesn’t stem from raw reasoning capacity, but rather from how the models are natively instructed to interpret ambiguity.
- Autonomous Grounding (Opus/Sonnet): Natively seeks context outside the prompt boundaries. It will proactively use user profile history, run web searches, and autonomously expand the scope to “add value.”
The result: It is highly forgiving of vague or underspecified prompts. If you ask for a simple rewrite, it hands you a comprehensive architecture review.
- Prompt Loyalty (GPT-5.5): Highly obedient to the exact wording and explicit constraints given. It does not presume to research, expand boundaries, or alter your intent unless explicitly directed to do so.
The result: A weak prompt yields a literal, baseline result (as seen in Article 3). However, once you fix the prompt and provide rigid, high-depth instructions, its precision, safety compliance, and architectural accuracy completely outpace the competition (as seen in Articles 4 and 5).
For Solution Architects, the lesson is clear: do not rely on a model to guess your standard of quality. When you explicitly dictate the depth, grounding requirements, and constraints, the gap closes, and strict prompt loyalty turns into execution excellence.
The Industry Verdict: Thought Partner vs. Precision Tool
This tension points to one of the most actively debated architectural divides in artificial intelligence right now: the line between being a proactive thought partner and being a precision tool.
The Autonomous Expansion Approach (The Thought Partner)
Models tuned for autonomous expansion prioritize holistic “helpfulness” over strict boundary compliance. When they detect a gap in a prompt, they fill it by inferring context, searching for updated facts, or restructuring the deliverable.
The Strengths: This approach is highly forgiving. It anticipates blind spots, corrects flawed premises, and often delivers a comprehensive final product that an average user might not have known how to ask for.
The Weaknesses: It introduces unpredictable token consumption, higher latency, and a significant loss of control. If a model decides to expand a simple editing task into a full architectural review, it can derail the intended workflow.
The Strict Adherence Approach (The Precision Tool)
Models optimized for strict instruction following act as a programmable substrate. They execute the exact parameters of the prompt—no more, no less.
The Strengths: Total predictability and control. In complex orchestration, you need components that do exactly what they are told. A model’s ability to rigidly adhere to constraints (like formatting, scope, and tone) is a massive technical capability.
The Weaknesses: It suffers from the “Garbage In, Garbage Out” problem. If a prompt lacks depth, the output will reflect that exact limitation, requiring the user to have strong prompt engineering skills to extract high-value results.
The Industry Standard

The industry is currently bifurcating its standards based on the deployment surface:
- Consumer and Chat Interfaces: The standard is shifting heavily toward autonomous expansion. Front-end assistants are increasingly designed to act as agents that infer intent, search the web proactively, and fill in the blanks to provide a frictionless experience for users who may not be prompt engineers.
- Enterprise APIs and Automated Pipelines: The standard strictly demands prompt adherence. When building software, developers require structured outputs (like guaranteed JSON schemas), predictable token usage, and absolute obedience to system instructions to prevent workflow failures.
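To make the pipeline side concrete, here is a minimal sketch of what "absolute obedience" looks like from the consuming code's point of view: the response is either exactly the expected JSON shape or it is rejected. The field names are hypothetical (loosely modeled on the summary requested in my improved prompt), and this uses only the standard library rather than any particular vendor's structured-output feature.

```python
import json

# Hypothetical response contract: field name -> required Python type.
REQUIRED_FIELDS = {
    "summary": str,
    "verified_claims": list,
    "softened_claims": list,
}


def parse_strict(raw: str) -> dict:
    """Parse a model response and reject anything outside the contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = set(REQUIRED_FIELDS) - set(data)
    extra = set(data) - set(REQUIRED_FIELDS)
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} must be {expected_type.__name__}")
    return data


good = '{"summary": "Fixed gaps", "verified_claims": [], "softened_claims": []}'
print(parse_strict(good)["summary"])
```

A model that adds a friendly preamble or an unrequested extra field fails this check, which is exactly why automated pipelines favor strict adherence over helpful improvisation.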
The Final Verdict
Predictability scales better than magic.
A model that strictly obeys instructions is fundamentally more versatile as a foundational building block. You can always write a detailed, expansive prompt to force a strict model to act like a proactive thought partner, exactly as I did in my third round of testing. However, it is incredibly difficult to write a prompt that forces a naturally expansive, highly opinionated model to behave as a rigid, perfectly bounded precision tool.
When architecting complex, multi-layered cloud solutions, you need the underlying components to be reliable and tightly scoped. A model that defaults to strict adherence gives you that control, allowing you to dial up the autonomy only when and where the architecture actually requires it.
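One way to picture "dialing up the autonomy" is to treat each expansive behavior as an explicit switch that adds a directive to the prompt. The sketch below is illustrative only: the `Autonomy` class, its flags, and `build_prompt` are hypothetical names, and the directive wording is condensed from my improved prompt rather than a tested template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Autonomy:
    """Switches for behaviors a strict model will not adopt unprompted."""
    infer_audience: bool = False
    web_research: bool = False
    add_hands_on: bool = False


def build_prompt(task: str, autonomy: Autonomy) -> str:
    """Append one explicit directive per enabled behavior."""
    directives = []
    if autonomy.infer_audience:
        directives.append("Before rewriting, infer the target audience and technical depth.")
    if autonomy.web_research:
        directives.append("Validate current technical claims with online research where appropriate.")
    if autonomy.add_hands_on:
        directives.append("Add hands-on guidance such as prerequisites and step-by-step workflows.")
    if not directives:
        # Default posture: tightly scoped, no expansion.
        directives.append("Stay strictly within the scope of the task as written.")
    return "\n".join([task, *(f"- {d}" for d in directives)])


task = "Rewrite the article below for clarity."
print(build_prompt(task, Autonomy()))
print(build_prompt(task, Autonomy(web_research=True, add_hands_on=True)))
```

The default stays tightly scoped, and every expansion is a deliberate, reviewable choice instead of something the model decides on its own.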
Note: The comparative analysis and scoring of the article outputs were independently evaluated by Gemini 3.1 Pro and Qwen 3.7 Max.