When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

Authors: Corrado Rainone, Davide Belli, Bence Major, Arash Behboodi

Affiliations: Qualcomm AI Research

Primary category: cs.MA · all: cs.AI, cs.MA

Matched keywords: large language model, llm, agent, agentic, multi-agent, inference

TL;DR

This position/workshop paper systematically examines the design space of hybrid multi-agent systems (MAS) that mix cloud-hosted frontier LLMs with on-device SLMs, finding that no single hybrid architecture dominates across tasks and that more cloud compute does not reliably improve performance.

Motivation

Production agentic systems today face a fundamental tension: frontier LLMs (deployed in the cloud, accessed via per-token APIs) deliver broad capability but incur high and unpredictable costs under long-horizon, iterative workloads. On-device SLMs avoid API costs and address privacy concerns but suffer from a persistent capability gap with frontier models, particularly on long-context tasks where consumer DRAM imposes a hard upper bound on KV-cache and context length. Hybrid approaches that route requests across models of different sizes exist (Ding et al., 2024; Stripelis et al., 2024; Ong et al., 2024), but these prior works treat hybridization primarily as a routing or escalation problem — choosing which single model handles a given call — rather than exploring how cloud and edge models can play structurally distinct roles within a shared multi-agent pipeline. The MAS literature offers rich architectural diversity (centralized planners, mesh topologies, ReAct loops), but existing systems are almost uniformly co-designed with a specific task or benchmark, leaving open how architectural choices transfer. The authors argue that without principled design guidance, practitioners introduce hybrid components through ad hoc, domain-specific decisions.

Key Ideas

Systematic hybrid MAS study: Two representative MAS architectures are adapted to cloud–edge hybrid settings and evaluated on a unified Pareto frontier of task accuracy, monetary cost, and edge energy consumption.
No single dominant design: Plan-based and advisory paradigms excel in different task domains; the optimal hybrid configuration is task-dependent, not universal.
More cloud ≠ better performance: Greater frontier-level compute does not consistently improve task outcomes — a counterintuitive finding the authors investigate mechanistically.
Mechanistic drivers of hybridization: Supervision frequency, restart policies, and summarization are identified as key factors explaining when cloud assistance helps or hurts long-horizon reasoning.
Context efficiency via resets: Hybrid MAS manage long contexts more effectively through context resets and summarization, limiting KV-cache growth and improving suitability for memory-constrained edge deployment.

Method

The paper adapts two MAS architectures to the hybrid cloud–edge setting:

Plan-based (Supervisor–Executor with Replan): A cloud Supervisor generates an initial plan p0 from query q. An edge Executor runs a ReAct loop for up to T turns, producing reasoning r_t, action a_t, and observation o_t. Every Tv turns the Supervisor verifies progress and, if needed, replans — producing a new plan p_t^new that the Executor resumes from.

Advisory paradigm: No initial plan is generated. Instead, the Supervisor provides advice/hints h_t to the Executor at verification intervals. The Executor runs the same ReAct loop but is guided by advisory context rather than a structured plan. Supervision frequency Tv is a shared control variable across both architectures.

The pseudocode for both architectures (shown above) formalizes the two interaction patterns: the plan-based system intervenes via replan when verification fails, while the advisory system injects high-level hints without overwriting the execution context. Tasks evaluated include Deep Search and UI assistance.

Experiments

The paper evaluates on two task domains: Deep Search (long-horizon query decomposition, retrieval, browsing, and synthesis) and UI assistance (GUI agent tasks combining perception, grounding, and multi-step action recovery). The evaluation framework measures trade-offs across three axes: task accuracy, monetary cost (API spend), and edge energy consumption.

Baselines include different allocations of cloud compute — varying how frequently the cloud Supervisor intervenes (Tv) and which roles are assigned to cloud vs. edge models. The full experimental results, specific dataset names, and quantitative tables are not recoverable from the provided (truncated) text.

Results

Full text truncated — specific numbers, table references, and per-task deltas are not available in the provided content. From the abstract and contributions list:

Plan-based vs. advisory: The two paradigms excel in different task domains; neither universally dominates (contribution 2, abstract).
Cloud compute saturation: Increasing the fraction of frontier-model compute does not consistently improve performance — the authors describe this as an “unexpected observation” they investigate in depth (contribution 2, introduction).
Context management: Hybrid systems with context resets and summarization handle long-horizon tasks more effectively than those without, with measurable KV-cache footprint reductions at the edge (contribution 4) — specific magnitudes not quantified in available text.

Concrete baseline→method deltas with Table/Figure anchors cannot be reported as the results section of the full text was not recovered.

Conclusion

The central practitioner takeaway is that hybrid MAS design cannot be solved by a simple rule like “always use the cloud for hard steps” — the optimal assignment of cloud vs. edge roles depends heavily on the task structure, and more frontier compute can actively hurt performance in some configurations. The results are established on Deep Search and UI assistance tasks from a single lab (Qualcomm AI Research), published as a workshop paper (AIWILD @ ICML 2026), so generalization to other agentic domains is unverified. The paper does not quantify energy savings or cost reductions at production scale, and the ablation over supervision frequency (Tv) covers design choices within the two studied architectures rather than the full space of possible hybrid topologies. The title implies broad lessons for hybrid MAS; the experiments cover two architectures and two task families.

Novelty Check

From the paper’s own Related Work: The closest prior work the authors acknowledge is MAI-UI (Zhou et al., 2025), a “native hybrid-AI design in which a cloud model is [used alongside an edge model].” General routing/escalation work (Ding et al., 2024; Stripelis et al., 2024; Ong et al., 2024) treats hybridization as model selection rather than role assignment within a MAS. The authors’ claimed delta is moving from per-call routing to architectural role assignment and systematic multi-objective evaluation.

Independent assessment: The contribution is primarily empirical and organizational rather than algorithmic — the two architectures (plan-based supervisor and advisory supervisor) are natural extensions of existing ReAct/orchestrator patterns to the cloud–edge split. The novelty lies in the systematic Pareto-frontier framing and the mechanistic analysis of why hybridization helps or hurts. This is a workshop paper; the scope and depth of results appear commensurate with that venue. Not obviously “old wine in a new bottle,” but also not a major algorithmic advance.

Open Questions

Does the finding that “more cloud compute ≠ better performance” persist across model families and sizes beyond those tested at Qualcomm?
How sensitive are the Plan-based vs. Advisory rankings to the specific Tv (verification interval) chosen — is there a principled way to set Tv per task?
The paper identifies summarization as a key factor for context efficiency, but does not ablate summarization quality — how much does the cloud LLM’s summarization fidelity matter?
Results cover English-language Deep Search and UI tasks; do the design lessons transfer to multilingual or multimodal agentic settings?
Energy measurements are framed as edge consumption, but cloud inference energy is omitted from the Pareto analysis — does including it change the optimal operating point?

Original abstract

arXiv:2605.30102v1 Announce Type: cross Abstract: The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.