Multi-Agent AI System on Google Cloud — Architectures — startupengineering.io

THE SOURCE

Google Cloud multi-agent reference architecture diagram showing a Cloud Run chat frontend, an ADK coordinator agent, multiple subagents communicating via A2A, an MCP-based tool layer, Vertex AI for inference, and Model Armor + VPC Service Controls + IAM for security.

Source write-up

Purpose

Google Cloud's reference architecture (last reviewed September 16, 2025) for a coordinator-plus-subagents system: a Cloud Run chat frontend, an ADK-built coordinator routing to specialised subagents in sequential and iterative-refinement patterns, A2A for inter-agent calls, MCP for tool access, Vertex AI for inference, and Model Armor + VPC-SC for safety. Independent evaluation for startup fit.

Components

Cloud Run
Vertex AI
Vertex AI Agent Engine
Google Kubernetes Engine
Agent Development Kit (ADK)
Agent2Agent (A2A) Protocol
Model Context Protocol (MCP)
Model Armor
Cloud Logging
VPC Service Controls
IAM

Vendor's stated assumptions

Coordinator-plus-subagents shape with two named collaboration patterns (sequential and iterative-refinement).
Protocols over products: A2A for agent-to-agent, MCP for agent-to-tool. Both are portable across substrates.
Three deployment substrates (Cloud Run, GKE, Vertex AI Agent Engine) so teams can match what already runs in production. Model Armor sits inline as a safety check.

What this artefact evaluates

Google Cloud published the Multi-agent AI system in Google Cloud reference architecture (last reviewed September 16, 2025). Unlike the single-agent shape most published references default to, this one codifies a coordinator agent that delegates to specialised subagents, with two named orchestration patterns — sequential and iterative refinement — and three worked use cases (financial advisory, research assistance, supply chain). This artefact evaluates the architectural choices and the patterns, not the specific use cases or the maturity of the underlying ADK/A2A SDKs. Pricing tiers and service quotas are out of scope.

The reading is from the perspective of a 5–25-engineer team picking a multi-agent foundation today. That perspective matters: the same diagram has different implications for a 10-engineer startup than for a 200-engineer enterprise platform team, and most of the language in the published reference is calibrated for the latter.

Reference at a glance

The reference is structured around five horizontal layers and two named orchestration patterns. The horizontal layers govern where code runs and how it talks; the patterns govern how requests flow through the agents. Reading the diagram in those two passes — once left-to-right for the layers, once top-to-bottom for the patterns — makes the rest of the artefact considerably easier to follow.

Layer	Primary responsibility	Google's pick(s)	Replaceable with
Frontend	Chat UI and request streaming	Cloud Run	Any HTTPS frontend
Agents	Coordinator + specialised subagents	ADK on Cloud Run / GKE / Agent Engine	LangGraph, CrewAI, AutoGen on the same hosts
Tools	Database, API, and external-system access	MCP clients/servers	Direct SDK calls (with portability cost)
Inference	Foundation-model serving	Vertex AI (managed); GKE/Cloud Run for custom	OpenAI, Anthropic, or self-hosted equivalents
Safety	Prompt-injection / harmful-content screening	Model Armor + VPC-SC + IAM + Cloud Logging	Lakera, Prompt Armor, custom guardrails

The two patterns sit on top of the same physical layers. Sequential chains subagents whose outputs feed forward; iterative refinement loops a generator and an evaluator until a quality bar (or an iteration cap) is met. Both terminate at a response generator that validates and grounds the final output before it leaves the agent layer.

What Google Cloud actually proposes

The reference architecture is built around five layers:

Frontend. A Cloud Run service hosts a chat interface that accepts user prompts and returns streamed responses. Cloud Run is the right default here: it scales to zero, terminates TLS, and handles streaming responses without bespoke configuration. For teams that already operate a web front end (Next.js on Vercel, static SPA on Cloud CDN), this layer is interchangeable — only the agent-layer endpoint contract matters.
Agent layer. A coordinator agent — built with the Agent Development Kit (ADK) — decides which subagent(s) to invoke. The coordinator and subagents talk over the Agent2Agent (A2A) protocol, so individual agents can be replaced or scaled independently. The same agent code can deploy onto Cloud Run, GKE, or Vertex AI Agent Engine, which is the most consequential flexibility the diagram offers (see Deployment substrate trade-offs below). The coordinator's prompt is, in practice, the hardest engineering surface in the entire system: routing decisions, tool-call arbitration, and human-in-the-loop fallbacks all live there.
Tooling layer. Subagents reach databases, APIs, and external systems through Model Context Protocol (MCP) clients/servers, so tool wiring is uniform across agents. MCP is doing two jobs at once here: it standardises the shape of tool calls (so the same subagent can target three different databases without prompt rewrites) and it standardises the transport (so a tool can move from in-process to a separate service without changing the calling agent). Both are worth the indirection.
Inference layer. Vertex AI hosts the foundation models. Custom or self-hosted models can run on Cloud Run or GKE behind the same agent interface. The diagram is intentionally agnostic about which Gemini SKU is used — the choice between Pro, Flash, and the smaller distilled variants is a per-subagent question, not a whole-system question, and the reference correctly leaves it open.
Safety layer. Model Armor sanitises model inputs and outputs (prompt-injection screening, harmful-content checks). VPC Service Controls reduces data-exfiltration blast radius. IAM enforces least-privilege at the agent boundary. Cloud Logging captures the trace. Putting Model Armor inline — not as a post-hoc batch check — is the load-bearing safety choice; the rest of the layer is conventional GCP perimeter hygiene.
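
To make the tooling layer's "standardised shape" concrete, the sketch below builds two MCP `tools/call` requests as JSON-RPC 2.0 messages. The envelope (`jsonrpc`, `method`, `params.name`, `params.arguments`) follows the MCP specification; the tool names and argument fields are hypothetical, and the helper `make_tool_call` is an illustration, not part of any SDK.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any unique value per request works


def make_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tools: names and argument fields are illustrative only.
orders_call = make_tool_call("query_orders", {"customer_id": "c-123"})
inventory_call = make_tool_call("query_inventory", {"sku": "sku-9"})

# Both requests share one envelope; only the tool name and arguments differ.
print(json.dumps(orders_call, indent=2))
```

Because the envelope is identical for every tool, a subagent can switch targets by changing `name` and `arguments` alone, which is the portability argument the layer rests on.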

Two orchestration patterns are spelled out:

Sequential pattern. Coordinator → Subagent A → Subagent A.1 → Response generator. Each step's output feeds the next; the response generator validates before returning. The pattern is well-suited to workflows where each step is cheap individually but expensive in aggregate: a finance pipeline that fetches market data, runs a technical screen, then drafts a recommendation, for instance.
Iterative refinement pattern. Subagent B produces a draft, a quality evaluator scores it, a prompt enhancer rewrites the request, and the loop runs until the evaluator is satisfied or a maximum-iteration cap fires. The pattern is well-suited to open-ended quality tasks (a research brief, a long-form draft, a multi-step plan) where the first attempt is rarely the best one and the cost of a second pass is much smaller than the cost of shipping a poor one.

The two patterns up close

The named patterns are the heart of the reference, so it is worth walking through both with a worked example before evaluating the diagram as a whole.

Sequential, with a finance example

The published financial-advisory use case threads four subagents behind the coordinator: a market-data fetcher, a technical-analysis agent, a fundamentals agent, and a recommendation drafter. A user prompt — "Should I rebalance into semiconductors?" — flows through the chain in order. Each subagent's output is the next one's input. The response generator validates the final draft against the original question (a relevance check) and against the fetched data (a grounding check) before the coordinator returns it.

What this pattern buys you, beyond the obvious modularity, is prompt isolation. Each subagent sees only the slice of context it needs. The fundamentals agent never sees the technical signals, which keeps its prompt smaller, its model choice cheaper, and its evaluation narrower. The price is one extra hop per subagent (the A2A call plus that subagent's own model call), which at p95 is the dominant latency contribution: typically 200–800 ms per hop on a small Gemini Flash deployment, and meaningfully more on Pro. Most of that time is model latency rather than A2A transport.

The pattern starts to creak when the chain exceeds four or five subagents, or when later subagents need information that earlier subagents discarded. Both failure modes are common enough that any production sequential pipeline should be instrumented with full trajectory traces from day one.
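
The sequential chain, its prompt isolation, and the trajectory trace can be sketched in a few lines. This is a framework-agnostic illustration, not ADK or A2A code: `Step`, `Trajectory`, and `run_sequential` are hypothetical names, and each `run` callable stands in for a remote subagent call. Each step declares which context keys it may read, so a later step asking for something an earlier step never produced fails loudly with a `KeyError`.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    reads: tuple[str, ...]        # context keys this subagent is allowed to see
    run: Callable[[dict], dict]   # stand-in for an A2A call to the subagent


@dataclass
class Trajectory:
    events: list[dict] = field(default_factory=list)


def run_sequential(steps: list[Step], context: dict, trace: Trajectory) -> dict:
    context = dict(context)  # work on a copy, not the caller's dict
    for step in steps:
        # Prompt isolation: pass only the declared slice of context.
        visible = {key: context[key] for key in step.reads}
        started = time.perf_counter()
        output = step.run(visible)
        trace.events.append({
            "step": step.name,
            "saw": sorted(visible),
            "produced": sorted(output),
            "seconds": time.perf_counter() - started,
        })
        context.update(output)
    return context


# Stub subagents with canned outputs; real ones would call a model.
steps = [
    Step("market_data", ("question",), lambda c: {"prices": [101, 104, 99]}),
    Step("technical", ("prices",),
         lambda c: {"signals": "uptrend" if c["prices"][1] > c["prices"][0] else "flat"}),
    Step("fundamentals", ("question",), lambda c: {"fundamentals": "stable margins"}),
    Step("recommendation", ("question", "signals", "fundamentals"),
         lambda c: {"draft": f"{c['signals']}; {c['fundamentals']}"}),
]

trace = Trajectory()
result = run_sequential(
    steps, {"question": "Should I rebalance into semiconductors?"}, trace
)
print(result["draft"])
for event in trace.events:
    print(event["step"], "saw", event["saw"])
```

The `saw` field in each trace event is what makes the isolation auditable: the fundamentals step records that it saw only the question.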

Iterative refinement, with a research example

The research-assistant use case is the canonical iterative-refinement shape. A planner subagent decomposes the research question into sub-questions; a gatherer pulls evidence; an evaluator scores the draft for completeness, accuracy, and citation density; a prompt enhancer rewrites the request when the score is below threshold; the loop terminates when the evaluator passes the draft or the iteration cap fires.

The pattern's strength is also its weakness: each loop iteration is a full agentic round-trip, which means cost and latency multiply. A six-iteration refinement on Gemini Pro can cost 5–10× the equivalent single-shot generation. The iteration cap is therefore not just a correctness backstop — it is the production budget control. Set too low, the evaluator never has room to improve a draft; set too high, a single user request can dominate the day's token spend.

Two operational practices make the pattern manageable in production: log every (draft, evaluator-score, enhanced-prompt) tuple as a single trace, and treat the evaluator's threshold as a tunable parameter that you adjust over time as the underlying model gets better.
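
A minimal sketch of the loop, with both practices applied: every iteration is recorded as a (draft, score, enhanced prompt) record, and the threshold and cap are plain parameters. The function names and the stub generator, evaluator, and enhancer are assumptions for illustration; in a real system each would be a subagent call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Iteration:
    draft: str
    score: int
    enhanced_prompt: Optional[str]  # None when the loop stopped on this iteration


def refine(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str], int],
    enhance: Callable[[str, str, int], str],
    threshold: int,
    max_iterations: int,
) -> tuple[str, list[Iteration]]:
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    trace: list[Iteration] = []
    for i in range(max_iterations):
        draft = generate(prompt)
        score = evaluate(draft)
        stop = score >= threshold or i == max_iterations - 1
        enhanced = None if stop else enhance(prompt, draft, score)
        trace.append(Iteration(draft, score, enhanced))
        if enhanced is None:
            break
        prompt = enhanced
    # Return the best-scoring draft, which may not be the last one.
    best = max(trace, key=lambda it: it.score)
    return best.draft, trace


# Deterministic stubs so the control flow is easy to follow.
def generate(prompt: str) -> str:
    return f"draft for: {prompt}"


def evaluate(draft: str) -> int:
    return 4 + 2 * draft.count("+detail")  # toy score out of 10


def enhance(prompt: str, draft: str, score: int) -> str:
    return prompt + " +detail"


draft, trace = refine("summarise A2A", generate, evaluate, enhance,
                      threshold=8, max_iterations=6)
capped_draft, capped_trace = refine("summarise A2A", generate, evaluate, enhance,
                                    threshold=8, max_iterations=2)
print(len(trace), trace[-1].score)                 # iterations used, final score
print(len(capped_trace), capped_trace[-1].score)   # cap fired before the threshold
```

With the cap at 2 the loop stops below the threshold and returns the best draft it has, which is the budget-control behaviour described above.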

Findings

1. The architectural commitment is to two protocols, not to a product. A2A and MCP carry the structural weight of the reference. Both are open specifications: the agent topology can survive a switch of LLM, of agent runtime, even of cloud, because the wire format between agents and between agents-and-tools is not GCP-specific. This is the most defensive choice in the diagram and the one most worth copying into other clouds.

2. The coordinator-plus-subagents shape is opinionated, and the opinion is right for most production agentic workloads. Many production agent systems begin as a single ReAct loop and break under load when one model is asked to plan, execute, and self-correct in the same context window. Splitting the work across a coordinator and specialised subagents — each with a smaller, focused prompt — is a pattern that reliably outperforms the single-agent shape on multi-step tasks. Google making it the default shape in a published reference is, in itself, a useful signal.

Condition: this shape adds a hop and an extra prompt round-trip versus a single agent. For low-latency, single-tool tasks (e.g., a classifier with one knowledge-base lookup), it is overkill.

3. Naming the iterative-refinement loop as a first-class pattern is the under-appreciated detail. Most agent references implicitly assume one-shot generation. The diagram makes the generate → evaluate → refine loop an architectural primitive, with the evaluator agent as a peer of the executor. This matches what production teams actually do (LLM-as-judge gates) but rarely diagram. The trade-off Google does not name: each iteration multiplies cost and latency, so the max-iteration cap is a budget control, not just a correctness control.

4. Model Armor sits inline; observability sits beside. Putting Model Armor in the request path — not as a post-hoc check — is the right shape for safety. The companion choice (Cloud Logging beside the agent layer rather than as the primary trace export) is more conservative. Production agent teams typically want a dedicated trace tool (Langfuse, Phoenix, Braintrust) for prompt/completion inspection; Cloud Logging will work but is not the best surface for trajectory debugging.

5. The deployment substrate is genuinely flexible. Cloud Run for fast iteration, GKE for control, Vertex AI Agent Engine for managed. The same ADK code targets all three. This is unusual — most published references couple the agent layer to one runtime — and it materially lowers the cost of a wrong initial pick.

Implication: a 10-engineer team can start on Cloud Run, learn the load profile, and migrate to GKE or Agent Engine without rewriting the agent.

6. The reference is silent on evaluation. Cost optimisation, security, reliability, and performance get full sections. A trajectory-evaluation harness — what counts as a correct sequential or iterative-refinement run — does not. For an architecture this sophisticated, that omission is non-trivial: teams will need to wire their own eval pipeline (LangSmith, Braintrust, or Vertex AI's own prompt optimizer used as an eval surface) on top of the published shape.

7. Cost is named, but not as a first-class signal. The cost optimisation section is good — context caching, batch prediction, prompt-engineering for concise responses, DSQ vs Provisioned Throughput. None of it is wired into the runtime observability story. For a startup, per-conversation cost should sit next to latency in the production dashboards; the reference leaves that as an exercise.

8. ADK and A2A are young. The conceptual layers (coordinator, A2A, MCP, Vertex AI) are stable. The implementations — the ADK SDK and the A2A protocol surface — are evolving. Adopting today is fine for production; adopting and writing extensive custom tooling on top of the SDKs has higher rework risk over the next 12 months.

Design considerations through a startup lens

Google's published reference includes its own Design considerations section covering security, reliability, operations, cost, and performance. The intended reader of that section is an enterprise architect with a long checklist. A startup engineering team has a different shape of attention: fewer reviewers, less time per decision, and a stronger preference for defaults that hold up at both 10 and 1,000 QPS. The notes below recast each consideration with that audience in mind.

Security

The reference's security posture is solid by default and the architectural shape — Model Armor inline, IAM at the agent boundary, VPC-SC around the perimeter — is exactly the shape a regulated startup wants to inherit. Two practical caveats:

Inter-agent authentication is the easy thing to forget. A2A supports OAuth2/OpenID Connect, but the default ADK setup will let you ship without it. Wiring service-to-service auth between coordinator and subagents on day one is much cheaper than retrofitting it after a security review.
MCP servers are a credential surface in disguise. Each MCP server holds the credentials its tool needs. Rotate them. Scope them. A leaked MCP credential is a leaked tool, which in this architecture is also a leaked subagent capability.

Reliability

The reliability story rests on Cloud Run's regional/zonal failover and on Vertex AI's Dynamic Shared Quota (DSQ) and global endpoints. That gets a startup most of the way to a defensible SLA. The agent layer's own failure modes — partial subagent failures, A2A timeouts, infinite refinement loops — need product-specific guardrails the reference does not prescribe. Concretely:

Set per-subagent timeouts shorter than the coordinator's user-facing budget, so a slow subagent cannot blow the whole request's deadline.
Cap iterative-refinement loops at a value tied to cost, not just to quality. Six iterations on Pro is a defensible default; ten is not.
For sequential pipelines longer than three subagents, design the failure path: which subagent's failure is recoverable from a cached prior step? The reference is silent on this, but it is the difference between "user retries" and "user sees an error".
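
The first guardrail can be sketched with `asyncio.wait_for`. This is an illustration under stated assumptions: `run_chain` and `call_with_timeout` are hypothetical helpers, the subagents are stub coroutines, and the budgets are shortened to fractions of a second so the example runs quickly. Each call gets the smaller of its own timeout and whatever remains of the user-facing budget, minus a reserve for the coordinator's final response.

```python
import asyncio


async def call_with_timeout(call, timeout_s, fallback=None):
    """Await one subagent call, returning `fallback` if it overruns."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def run_chain(subagents, user_budget_s, reserve_s):
    loop = asyncio.get_running_loop()
    # Keep `reserve_s` back so the coordinator can still answer the user.
    deadline = loop.time() + user_budget_s - reserve_s
    results = {}
    for name, call, timeout_s in subagents:
        remaining = deadline - loop.time()
        if remaining <= 0:
            results[name] = None  # budget exhausted: skip instead of overrunning
            continue
        results[name] = await call_with_timeout(call, min(timeout_s, remaining))
    return results


# Stub subagents; real ones would be remote calls.
async def fast():
    await asyncio.sleep(0.01)
    return "ok"


async def slow():
    await asyncio.sleep(1.0)
    return "too late"


results = asyncio.run(run_chain(
    [("fast", fast, 0.2), ("slow", slow, 0.05)],
    user_budget_s=0.5,
    reserve_s=0.1,
))
print(results)  # {'fast': 'ok', 'slow': None}
```

A `None` result is the hook for the failure-path design in the third point: the coordinator can decide per subagent whether to fall back to a cached step or surface an error.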

Operations and observability

This is the part of the reference where startup needs diverge most sharply from the published guidance. Cloud Logging plus Cloud Trace is enough to satisfy compliance, but a working agent team will reach for prompt-level inspection within the first month. The practitioner pattern that holds up at small scale is:

Cloud Logging for the structural trace (request IDs, latencies, IAM decisions, Model Armor verdicts).
A dedicated agent observability tool (Langfuse, Phoenix, Braintrust, LangSmith) for the prompt-level surface — the per-iteration drafts, the evaluator scores, the prompt-enhancer rewrites. Run it side-by-side; don't try to make Cloud Logging carry both jobs.
A small offline eval harness that replays a frozen prompt set through the agent on every model upgrade. The reference doesn't specify one because there is no GCP-native primitive for it; the options are LangSmith datasets, Braintrust experiments, or a hand-rolled harness over Vertex AI's prompt optimizer.
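
A hand-rolled harness of the kind the third item describes can start very small. The sketch below is an assumption-laden illustration: `Case`, `run_eval`, and the two stub agents are hypothetical, and the substring check is a deliberately crude stand-in for whatever grading a team actually uses (exact match, rubric, LLM-as-judge).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    prompt: str
    must_contain: tuple[str, ...]  # crude pass criterion: required substrings


def run_eval(agent: Callable[[str], str], cases: list[Case]) -> dict:
    failures = []
    for case in cases:
        answer = agent(case.prompt).lower()
        missing = [s for s in case.must_contain if s.lower() not in answer]
        if missing:
            failures.append({"prompt": case.prompt, "missing": missing})
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


# The frozen prompt set: checked into the repo and changed deliberately.
FROZEN_SET = [
    Case("What is the refund window?", ("30 days",)),
    Case("Which regions do you ship to?", ("EU", "US")),
]


# Stub agents standing in for the system before and after a model upgrade.
def agent_v1(prompt: str) -> str:
    return {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which regions do you ship to?": "We ship to the EU and the US.",
    }[prompt]


def agent_v2(prompt: str) -> str:
    return {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which regions do you ship to?": "We ship to the EU only.",
    }[prompt]


report_v1 = run_eval(agent_v1, FROZEN_SET)
report_v2 = run_eval(agent_v2, FROZEN_SET)
print(report_v1["passed"], report_v2["passed"], report_v2["failures"])
```

Comparing the two reports on every upgrade is the whole point: the regression in the second case shows up as a named failure rather than as a user complaint.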

Cost

The published cost section is unusually thorough — context caching, prompt optimizer, batch prediction, DSQ vs Provisioned Throughput — and it is worth reading in full. What it does not do is connect cost to the agentic shape. Two startup-specific observations:

Per-conversation cost is the only cost number that matters in production. Aggregate token spend is a lagging indicator; cost per resolved conversation is the leading one. Wire this into the same dashboard as latency.
The iterative-refinement pattern is a token amplifier. A modest 5× refinement on Pro can dominate a feature's unit economics. Do the math at the iteration cap, not at the average.
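
Both observations reduce to a few lines of arithmetic. Every number below is a made-up placeholder (the per-iteration cost, the cap, and the ten sample conversations), chosen only to show how far the at-cap figure can sit from the average.

```python
# Hypothetical inputs: replace with measured values from your own traces.
COST_PER_ITERATION_CENTS = 4   # one generate + evaluate + enhance round
ITERATION_CAP = 6

# (iterations used, resolved?) for ten sample conversations
conversations = [
    (1, True), (1, True), (2, True), (1, True), (6, False),
    (1, True), (3, True), (6, False), (1, True), (2, True),
]

total_cents = sum(n for n, _ in conversations) * COST_PER_ITERATION_CENTS
resolved = sum(1 for _, ok in conversations if ok)

average_cents = total_cents / len(conversations)   # what the average hides
at_cap_cents = ITERATION_CAP * COST_PER_ITERATION_CENTS  # worst case per request
per_resolved_cents = total_cents / resolved        # the leading indicator

print(f"average per conversation: {average_cents:.1f} cents")
print(f"worst case at the cap:    {at_cap_cents} cents")
print(f"per resolved conversation: {per_resolved_cents:.1f} cents")
```

In this toy data the at-cap cost is 2.5 times the average, and the two conversations that hit the cap without resolving push cost per resolved conversation above the plain average.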

Performance

Performance optimisation in the reference is a mix of model-level guidance (model selection, prompt engineering, context caching) and substrate-level guidance (Cloud Run resources). The startup-specific addition is latency budgeting per layer:

Layer	Realistic p95 budget	Notes
Frontend	50–150 ms	Mostly TLS + streaming setup.
Coordinator	300–900 ms	One model call to decide route, plus A2A handoff.
Per subagent	400–1500 ms (Flash); 1–4 s (Pro)	Dominated by model latency, not A2A.
Evaluator	400–1200 ms	Treat as a subagent for budgeting.
Tooling	50–500 ms	MCP transport adds ~10–30 ms over a direct SDK call.
Total p95	2–10 s for single-pass; 6–30 s for iterative refinement	The iteration count is the lever.

These are working numbers from production deployments, not Google's published SLOs. They exist to set expectations for product reviews — a "real-time" agentic UX in this architecture is closer to "fast chat" than to "instant".
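
The per-layer budgets can be composed into a request-level estimate. The sketch below copies the ranges from the table and simply adds them, which is a simplification (p95 values do not strictly add) and assumes one tool call per subagent per pass; the function names are hypothetical.

```python
# (low_ms, high_ms) p95 budgets copied from the table above.
BUDGET_MS = {
    "frontend": (50, 150),
    "coordinator": (300, 900),
    "subagent_flash": (400, 1500),
    "subagent_pro": (1000, 4000),
    "evaluator": (400, 1200),
    "tooling": (50, 500),
}


def _total(parts):
    """Sum (layer, count) pairs into a (low_ms, high_ms) range."""
    low = sum(BUDGET_MS[name][0] * n for name, n in parts)
    high = sum(BUDGET_MS[name][1] * n for name, n in parts)
    return low, high


def single_pass(n_subagents, tier="flash"):
    # Assumption: each subagent makes exactly one tool call.
    return _total([
        ("frontend", 1), ("coordinator", 1),
        (f"subagent_{tier}", n_subagents), ("tooling", n_subagents),
    ])


def iterative(n_iterations, tier="flash"):
    # Assumption: each iteration is one subagent call, one tool call,
    # and one evaluator call.
    return _total([
        ("frontend", 1), ("coordinator", 1),
        (f"subagent_{tier}", n_iterations), ("evaluator", n_iterations),
        ("tooling", n_iterations),
    ])


print(single_pass(4))        # (2150, 9050) ms: close to the 2-10 s row
print(iterative(6))          # (5450, 20250) ms on Flash
print(iterative(6, "pro"))   # (9050, 35250) ms on Pro
```

Under these assumptions a four-subagent Flash chain lands near the table's single-pass row, and the iteration count moves the total far more than any single layer does.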

Deployment substrate trade-offs

The reference's most underrated feature is that the same agent code targets three substrates. The decision between them is rarely revisited once made, so it is worth being deliberate up front.

Substrate	Best for	Watch out for
Cloud Run	First 12 months. Iteration speed.	Cold-start tax on infrequently-used subagents.
GKE	Mature ops org. Custom networking.	Operational surface area. Patch cadence.
Vertex AI Agent Engine	Compliance-heavy orgs. Outsourcing the runtime.	Newest of the three; less community precedent for failure modes.

The pragmatic path for a 5–25-engineer team is Cloud Run first, with the discipline that nothing in the agent code should depend on Cloud Run-specific affordances. ADK plus A2A makes that discipline mostly free. The migration to GKE or Agent Engine should be triggered by a concrete signal — sustained QPS that justifies pre-warmed instances, or a compliance requirement that the managed runtime satisfies.

How this compares to AWS and Azure shapes

The cross-cloud comparison is useful even for a GCP-committed team, because the protocol portability claim is only meaningful if the neighbouring shapes are recognisable. The short version:

AWS Bedrock Agents + AgentCore. Bedrock Agents covers the coordinator-plus-subagents shape with a different vocabulary (action groups, agent collaboration). AgentCore — the subject of the fullstack-solution-template-for-agentcore sample — adds an application substrate around the agents. AWS does not standardise on A2A as the inter-agent wire, but MCP is supported. Migration cost: medium, dominated by re-targeting the tool layer.
Azure AI Foundry / Semantic Kernel. The shape is similar to ADK + A2A, with Semantic Kernel orchestrating. MCP support is improving; A2A support is partial. Migration cost: medium-to-high, more in tooling than in topology.
Self-hosted (LangGraph / CrewAI / AutoGen) on K8s. Closest to the GCP shape conceptually. Migration cost: low for the topology, higher for the safety layer (no Model Armor equivalent out-of-the-box; teams typically wire Lakera or a custom guard).

The honest cross-cloud signal: the patterns in this reference (coordinator + subagents, sequential, iterative refinement) are the durable artefact. The implementations (ADK, Model Armor, Vertex AI Agent Engine) are GCP-specific and should be evaluated as such.

Conditions of applicability

Context	Fit	Note
Multi-step agentic workflow, GCP-native, ≥5 eng	High	The reference is built for exactly this case.
Iterative-refinement use case (research, drafting)	High	The published pattern matches the problem shape.
Single-agent ReAct with one tool	Low	Over-engineered — use a single ADK or LangGraph agent instead.
Pre-PMF, <5 eng, single agent in beta	Low–Medium	Coordinator + A2A + MCP is too much surface to maintain pre-PMF.
Non-GCP team (AWS, Azure, self-hosted)	Medium	A2A and MCP transfer; Vertex AI / Cloud Run / Model Armor do not.
Regulated domain (healthcare, finance, legal)	High	IAM, VPC-SC, Model Armor, Cloud Logging satisfy audit primitives.
Latency-critical (<500ms p95) interactive UX	Medium	Coordinator hop + A2A round-trip adds latency the diagram does not budget.
High-volume batch agentic workload (>10K req/hr)	Medium-High	Substrate flexibility helps; cost discipline on iterative refinement is non-negotiable.
DAG-orchestrated agent topology (parallel fan-out)	Low	The reference topology is a tree under one coordinator, with no parallel fan-out or merge. Adapting it adds work the diagram doesn't cover.

What the architecture does not address

A trajectory- or outcome-level evaluation harness for either pattern.
Per-conversation cost or token-spend as a first-class observability signal.
Migration path off ADK if the framework's API changes materially.
DAG-orchestrated agent topologies above the coordinator-plus-subagents shape (e.g., parallel fan-out with reduce).
Comparison with adjacent multi-agent frameworks (LangGraph multi-agent, CrewAI, AutoGen).
Decay and re-grading cadence as underlying models change.
Multi-tenant isolation between concurrent users in the same agent process — the diagram assumes single-user request handling.
A canary or shadow-deployment strategy for subagent prompt changes, which in practice is the most frequent change a production agent team ships.

These are scope boundaries of the published reference, not failures of it. They are the work the reader still has to do after adopting it.

Practitioner adoption checklist

A working order of operations for a team adopting this architecture in the first 90 days. Not a rigid sequence, but the checklist that keeps the early decisions reversible.

Lock the coordinator's contract first. Define the input/output shape of the coordinator and treat it as the public API of the agent system. Subagents come and go; the coordinator's contract should not.
Stand up Cloud Run with one coordinator and one subagent. No A2A wiring yet — let the coordinator call the subagent directly. Confirm streaming end-to-end through the frontend.
Introduce A2A on the second subagent. This is when the cost of the protocol becomes real and you find out whether your team is willing to pay it. (For most teams, the answer becomes yes once the third subagent ships.)
Wire MCP for the first tool. Pick a tool that is boring — a database read or a single API GET — so you debug the protocol, not the tool semantics.
Turn on Model Armor. Inline, on both inputs and outputs. Treat its rejections as a first-class observable, not as noise.
Add an external observability tool. Langfuse or Phoenix or Braintrust — any of them, but not none.
Build a 50-prompt offline eval set. Run it on every model change and every prompt change. This is the single highest-leverage piece of operational hygiene the published reference does not prescribe.
Set the iteration cap and per-subagent timeout. Write them down. Review them quarterly.
Decide the substrate-migration trigger. Pick a QPS or a compliance threshold now that will move you from Cloud Run to GKE or Agent Engine. If you don't, you will migrate reactively.
Document the human-in-the-loop path. When does a human intervene? Who is paged? What's the SLA? The diagram shows the box; the team has to fill in the playbook.
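
The first two checklist items (lock the coordinator's contract, then call one subagent directly) might look like the sketch below. All field names, the default deadline, and the stub subagent are assumptions for illustration; the point is only that the request and response types are the stable surface while everything behind `coordinator` can change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoordinatorRequest:
    conversation_id: str
    user_message: str
    deadline_ms: int = 10_000  # hypothetical default user-facing budget

    def __post_init__(self):
        if not self.user_message.strip():
            raise ValueError("user_message must not be empty")
        if self.deadline_ms <= 0:
            raise ValueError("deadline_ms must be positive")


@dataclass(frozen=True)
class CoordinatorResponse:
    conversation_id: str
    answer: str
    needs_human: bool = False              # hook for the human-in-the-loop path
    subagents_used: tuple[str, ...] = ()


def research_subagent(message: str) -> str:
    """Stub subagent; a real one would call a model."""
    return f"Findings for: {message}"


def coordinator(request: CoordinatorRequest) -> CoordinatorResponse:
    # Direct function call for now; swapping in A2A later should not
    # change the request or response types.
    answer = research_subagent(request.user_message)
    return CoordinatorResponse(
        conversation_id=request.conversation_id,
        answer=answer,
        subagents_used=("research",),
    )


response = coordinator(CoordinatorRequest("conv-1", "Compare A2A and MCP"))
print(response)
```

Freezing the dataclasses is a small design choice that keeps the contract from being mutated in flight by whichever subagent happens to hold it.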

Author's take (Selva, April 2026)

If I were a 10-engineer GCP-native team building a multi-step agent today, I would adopt this architecture nearly as drawn. The two protocol commitments (A2A, MCP) and the inline safety layer (Model Armor + VPC-SC + IAM) are the parts I would not negotiate. I would add per-conversation cost into Cloud Logging dashboards and wire a Langfuse or Phoenix trace beside Cloud Logging for agent-level inspection. I would resist the urge to deploy on Vertex AI Agent Engine on day one — Cloud Run is cheaper to iterate on while the agent shape is still being learned. I would treat ADK as a useful starting point, not as a long-term framework commitment, and keep my business logic in code I own rather than in ADK abstractions.

I would also be honest with myself about which pattern the product actually needs. Sequential-by-default, with iterative-refinement reserved for the use cases where the cost of a bad first draft is real, is a defensible posture for the first year. Most teams that reach for iterative refinement everywhere learn — at a cost — that two-pass generation is a wide knife.

This is one practitioner's reading. It is not a universal recommendation.

Open questions for re-evaluation

How does the iterative-refinement pattern's cost envelope hold up at 10× volume — does the max-iteration cap fire often enough to matter to the unit economics?
What is the realistic A2A latency budget per hop at p95?
How does ADK compare on developer ergonomics with LangGraph and CrewAI on the same coordinator + 3-subagent workload?
When ADK's API changes materially, what is the realistic migration cost off it?
Does Model Armor's inline placement hold up under a sustained prompt-injection adversarial campaign, or does it become a latency / false-positive problem at scale?
When the next generation of foundation models lands, does the coordinator-plus-subagents shape still pay its overhead, or does the per-subagent prompt isolation become unnecessary?

Re-evaluation cadence: 6 months, or sooner on a major Google Cloud revision.

MY EVALUATION

Verdict

Moderate fit. The protocol-centric design (A2A + MCP) and three deployment substrates make the conceptual layers portable. The coordinator-plus-subagents opinion is fixed (planner-executor or DAG-orchestrated systems need redesign), and ADK + A2A are still young as an implementation surface.

Rubric scores

Conceptual fit (multi-agent): 3/5

Operational complexity: 2/5

Cost transparency: 3/5

Lock-in / portability: 3/5

Conditions for adoption

Adopt fully when: GCP-aligned, multi-agent shape genuinely required, comfortable on the bleeding edge of ADK + A2A, and product domain is a real fit for the coordinator-plus-subagents pattern.
Adopt selectively when: keep Vertex AI for inference and MCP for tool access; build the coordinator with whichever orchestrator already runs in your stack (LangGraph, custom, etc.).
Substitute when: not on GCP, or your shape is planner-executor or DAG-orchestrated rather than coordinator-plus-subagents.
Skip when: single-agent product — the multi-agent framework overhead doesn't pay off yet.

What to keep

Names the two agentic patterns explicitly — sequential and iterative-refinement — with worked examples for each.
Standardises on protocols rather than products: A2A for agent-to-agent, MCP for agent-to-tool. Both are portable.
Three deployment substrates (Cloud Run, GKE, Vertex AI Agent Engine) so teams can pick what already runs in production.
Model Armor as an inline input/response check is the right architectural shape for prompt-injection defence.
Operational playbook is unusually thorough — DSQ vs Provisioned Throughput, context caching, prompt optimizer, batch prediction.

Where it costs more than expected

No published evaluation harness for trajectories or tool-call correctness — that work is left to the reader.
Cost is covered in the optimisation section, but per-conversation cost is not treated as a first-class observability signal.
Coordinator-plus-subagents shape is opinionated; planner-executor and DAG-orchestrated systems need redesign.
ADK + A2A are young — the conceptual layers are stable, but the implementation surface is not yet a long-term commitment.

Conflict of interest: none.