AI agents — systems capable of reasoning, planning, and acting — are becoming a common paradigm for real-world AI applications. From coding assistants to personal health coaches, the industry is shifting from single-shot question answering to sustained, multi-step interactions. While researchers have long relied on established metrics to optimize the accuracy of traditional machine learning models, agents introduce a new layer of complexity: unlike isolated predictions, a single error early in an agent's workflow can cascade through every subsequent step. This shift compels us to look beyond standard accuracy and ask: How do we actually design these systems for optimal performance?
Practitioners often rely on heuristics, such as the assumption that “more agents are better”, believing that adding specialized agents will consistently improve results. For example, “More Agents Is All You Need” reported that LLM performance scales with agent count, while collaborative scaling research found that multi-agent collaboration “…often surpasses each individual through collective reasoning.”
In our new paper, “Towards a Science of Scaling Agent Systems”, we challenge this assumption. Through a large-scale controlled evaluation of 180 agent configurations, we derive the first quantitative scaling principles for agent systems, revealing that the “more agents” approach often hits a ceiling and can even degrade performance when it is not aligned with the specific properties of the task.

