
A while ago, @carter was explaining that with OpenAI Pro, you can basically let it control your entire desktop. (#1044344)
Good news: the Shanghai AI Lab has trained Qwen2.5-VL-based models (ScaleCUA) to do the same.

We first present a Cross-Platform Interactive Data Pipeline composed of two synergistic loops. The Agent-Environment Interaction Loop enables automated agents to interact with diverse GUI environments, while the Agent-Human Hybrid Data Acquisition Loop integrates expert-annotated trajectories to ensure coverage and quality. The pipeline spans six major platforms, including Windows, macOS, Linux, Android, iOS, and Web, which facilitates the collection of rich screen-state observations, metadata and raw trajectories. In this pipeline, we design a unified action space, allowing for more consistent and efficient interaction with diverse real-world environments.
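The excerpt doesn't spell out what the unified action space actually contains, but I picture it as a small set of primitives that every platform can map onto. Here's a minimal sketch of that idea; the class names, action types, and coordinate convention are my own guesses, not the paper's schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    # Hypothetical primitives shared across Windows/macOS/Linux/Android/iOS/Web
    CLICK = "click"
    DOUBLE_CLICK = "double_click"
    SWIPE = "swipe"        # scroll/drag on desktop, swipe on mobile
    TYPE = "type"
    KEY = "key"            # hotkeys, e.g. ctrl+c
    OPEN_APP = "open_app"
    WAIT = "wait"
    TERMINATE = "terminate"


@dataclass
class Action:
    """One step in a trajectory, using normalized screen coordinates
    so the same schema works at any resolution on any platform."""
    type: ActionType
    x: Optional[float] = None       # 0.0-1.0, relative to screen width
    y: Optional[float] = None       # 0.0-1.0, relative to screen height
    text: Optional[str] = None      # payload for TYPE / KEY / OPEN_APP
    end_x: Optional[float] = None   # target point for SWIPE
    end_y: Optional[float] = None


# Example: click a "Save" button, then type a filename
trajectory = [
    Action(ActionType.CLICK, x=0.82, y=0.91),
    Action(ActionType.TYPE, text="report_q3.pdf"),
]
```

The appeal of something like this is that one executor per platform translates the primitives into OS- or browser-specific calls, so the same collected trajectories and the same model outputs work everywhere.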
Building upon this corpus, we train a series of base agent models, termed ScaleCUA, with Qwen2.5-VL (Bai et al., 2025). Our ScaleCUA models support three distinct inference paradigms, offering enhanced flexibility and compatibility with various agent frameworks:
  1. A Grounding Mode, which focuses on precisely locating UI elements based on textual descriptions, allowing for efficient integration with more powerful reasoning planners in modular agent setups.
  2. A Direct Action Mode, which enables highly efficient task completion by directly generating low-level executable actions without consuming additional tokens for intermediate reasoning.
  3. A Reasoned Action Mode, which enhances task planning accuracy by first generating a thought process based on current observations and historical context before generating the following action.
The unified action space designed in our data construction enables our agents to interface with heterogeneous environments through a standardized control schema.
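To make the three paradigms concrete, here's roughly how I picture them differing at inference time. The prompt templates and output syntax below are purely illustrative assumptions on my part, not the released models' actual interface:

```python
# Illustrative prompts/outputs for the three paradigms described above.
# The templates and action syntax are assumptions, not ScaleCUA's real API.

GROUNDING_PROMPT = (
    "Locate the UI element described as: '{description}'.\n"
    "Return only its coordinates."
)
# expected output: click(x=0.41, y=0.87)

DIRECT_ACTION_PROMPT = (
    "Task: {task}\nHistory: {history}\n"
    "Output the next action only."
)
# expected output: type(text="report_q3.pdf")

REASONED_ACTION_PROMPT = (
    "Task: {task}\nHistory: {history}\n"
    "First write your reasoning, then the next action."
)
# expected output:
#   Thought: the save dialog is open, so I should enter the filename.
#   Action: type(text="report_q3.pdf")


def build_prompt(mode: str, **fields) -> str:
    """Pick the template that matches the requested inference paradigm."""
    templates = {
        "grounding": GROUNDING_PROMPT,
        "direct": DIRECT_ACTION_PROMPT,
        "reasoned": REASONED_ACTION_PROMPT,
    }
    return templates[mode].format(**fields)


print(build_prompt("direct",
                   task="Save the document as report_q3.pdf",
                   history="click(x=0.82, y=0.91)"))
```

Grounding mode slots in under an external planner, direct mode trades reasoning tokens for speed, and reasoned mode spends tokens on a thought before each action.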

Interesting that this can now run locally!