Cradle is an open-source multimodal AI agent framework from the BAAI-Agents team for General Computer Control (GCC). It lets large multimodal models operate a variety of software and games the way a human does, taking screenshots as input and producing keyboard and mouse actions as output.
- Generality: supports any native software (e.g. games, Office, image/video editing tools)
- Multimodal I/O: screenshots as input, keyboard and mouse operations as output
- Autonomy: Built-in "Cognitive Reflection + Skills Update" module for continuous self-optimization.
- Modular design: high controllability and scalability, easy to adapt to new environments
Pain Points
Since the arrival of the GPT series of large models, LLMs have seen explosive growth. However, they rely on text input and output over an API, so they cannot control a local interface, and automating local tasks remains difficult:
- Limited ability to operate Office and visualization software
- Complex tasks get split into pieces that are hard to close into a loop
- No visual capability, so UI elements cannot be located reliably from language alone
- No long-term memory of history, and weak execution of multi-step logic
Cradle is designed to address these pain points:
- Controls mouse and keyboard to simulate human operation
- Strengthens "self-reflection" and "skill optimization" strategies
- Supports long-range tasks, complex gaming environments, and specialized software operations
Core Functionality
Cradle has six core modules:
- Information Gathering
  - Processes UI screenshots and on-screen text with visual models
  - Audio feedback can also be fed in to complete the perceptual input
- Self-Reflection
  - Reviews the results of past actions to determine whether the goal was achieved
  - Summarizes the reasons for failure and provides guidance for the next run
- Task Inference
  - Infers the current goal from the environment and historical memory
  - Dynamically plans the next best strategy
- Skill Curation
  - Generates or updates skill functions for each task
  - Accumulates experience as strategies customized per environment
- Action Planning
  - The LLM outputs high-level actions (e.g., "click on X", "move mouse to Y")
  - A human-written bridging layer translates them into keystrokes and mouse actions
- Memory
  - Short-term and long-term memory, including historical records
  - Supports reuse of memories and skills across tasks
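The bridging layer in the Action Planning module can be pictured with a small sketch. This is an illustration only, not Cradle's actual code: the `InputEvent` type, the `ELEMENT_POSITIONS` lookup, the `translate` function, and the three action patterns are all hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    """One low-level input event (hypothetical format, not Cradle's own)."""
    device: str   # "mouse" or "keyboard"
    kind: str     # "move", "click", or "press"
    args: tuple


# Hypothetical lookup from UI element names to screen coordinates.
# In a real agent these would come from the perception step.
ELEMENT_POSITIONS = {"OK": (640, 480), "File": (20, 10)}


def translate(action: str) -> list[InputEvent]:
    """Translate one high-level action string into low-level events."""
    action = action.strip()

    m = re.fullmatch(r"click on (\w+)", action)
    if m:
        x, y = ELEMENT_POSITIONS[m.group(1)]
        # A click is expanded into "move there, then press the left button".
        return [InputEvent("mouse", "move", (x, y)),
                InputEvent("mouse", "click", ("left",))]

    m = re.fullmatch(r"move mouse to (\d+),\s*(\d+)", action)
    if m:
        return [InputEvent("mouse", "move", (int(m.group(1)), int(m.group(2))))]

    m = re.fullmatch(r"press (\w+)", action)
    if m:
        return [InputEvent("keyboard", "press", (m.group(1),))]

    raise ValueError(f"unsupported action: {action!r}")
```

Keeping this layer hand-written means the model only has to choose among a small, well-defined vocabulary of actions, while the details of driving the mouse and keyboard stay in ordinary code.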
These modules form a closed loop: screenshot input → perception → reflection → planning → execution → memory feedback.
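A toy version of this loop might look like the sketch below. It is written under heavy assumptions: the "software" is a counter (`ToyEnv`), the "screenshot" is a plain dict rather than pixels, and the reflection and planning steps are simple rules standing in for model calls. None of the names come from Cradle itself.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Stores the history of steps (hypothetical structure)."""
    history: list = field(default_factory=list)


class ToyEnv:
    """Stand-in for real software: a counter that should reach `target`."""

    def __init__(self, target: int):
        self.value, self.target = 0, target

    def screenshot(self) -> dict:
        # A real agent would capture pixels; here the "screenshot" is a dict.
        return {"value": self.value, "target": self.target}

    def execute(self, action: str) -> None:
        if action == "press up":
            self.value += 1
        elif action == "press down":
            self.value -= 1


def reflect(memory: Memory, observation: dict) -> str:
    """Judge whether the previous action moved the state closer to the target."""
    if not memory.history:
        return "no previous action"
    prev = memory.history[-1]["observation"]
    before = abs(prev["target"] - prev["value"])
    after = abs(observation["target"] - observation["value"])
    return "progress" if after < before else "no progress"


def plan(observation: dict):
    """Pick the next high-level action, or None when the goal is reached."""
    if observation["value"] < observation["target"]:
        return "press up"
    if observation["value"] > observation["target"]:
        return "press down"
    return None


def run(env: ToyEnv, max_steps: int = 20) -> Memory:
    memory = Memory()
    for _ in range(max_steps):
        observation = env.screenshot()             # screenshot input / perception
        reflection = reflect(memory, observation)  # reflection on the last step
        action = plan(observation)                 # planning
        if action is None:
            break
        env.execute(action)                        # execution
        memory.history.append(                     # memory feedback
            {"observation": observation, "reflection": reflection, "action": action}
        )
    return memory
```

Calling `run(ToyEnv(target=3))` steps the counter up until it matches the target, recording each observation, reflection, and action in memory along the way.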
Experiments show that Cradle can complete tasks in:
- AAA games: Red Dead Redemption 2, completing main quests with a high success rate;
- City-building games: Cities: Skylines, building a city of a thousand residents;
- Farming games: Stardew Valley, automatically sowing and harvesting crops;
- Business games: Dealer's Life 2, reaching a highest weekly profit of 87%;
- Office software: signing in to Chrome, replying to email in Outlook, using Feishu;
- Editing tools: image and video processing in Meitu and CapCut.
Technical Architecture

List of Technical Advantages
Technical Advantage | Description |
---|---|
No API dependency | Does not rely on internal application interfaces, so it adapts to a wide range of software. |
Highly modular configuration | Easily extended to new games or software environments. |
Progressive capability enhancement | LLM + self-reflection + memory support self-improvement. |
Universal operating interface | Screenshot input + keyboard and mouse output, truly universal. |
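To make the "modular" and "progressive enhancement" rows concrete, here is a minimal sketch of a skill registry in which a skill can be replaced by an improved version. It is an assumption-laden illustration, not Cradle's implementation: `SkillRegistry`, its methods, and the `save_file` skill are hypothetical.

```python
from typing import Callable


class SkillRegistry:
    """Maps skill names to callables; re-registering a name updates the skill."""

    def __init__(self) -> None:
        self._skills: dict[str, Callable[..., list[str]]] = {}
        self._versions: dict[str, int] = {}

    def register(self, name: str, fn: Callable[..., list[str]]) -> None:
        # Overwrites any earlier skill with the same name and bumps its version.
        self._skills[name] = fn
        self._versions[name] = self._versions.get(name, 0) + 1

    def run(self, name: str, *args) -> list[str]:
        # Returns the high-level actions the skill expands to.
        return self._skills[name](*args)

    def version(self, name: str) -> int:
        return self._versions.get(name, 0)


registry = SkillRegistry()
# First attempt at a skill: a single key combination.
registry.register("save_file", lambda: ["press ctrl+s"])
# Updated skill, as if a reflection step had noticed a confirmation dialog.
registry.register("save_file", lambda: ["press ctrl+s", "press enter"])
```

Because skills are looked up by name, adapting to a new environment in this sketch means registering a different set of functions; the rest of the loop does not change.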
An illustration of the interface

Application Scenarios
- R&D testing: the AI agent can autonomously simulate user actions, replacing API-based UI testing.
- Office automation: a large number of repetitive tasks (emails, forms, reports) can be fully automated.
- Game AI development: act as an in-game agent to test missions or train NPCs.
- Process automation: provides a UI automation pipeline with less reliance on traditional RPA.
- Education and training: Cradle demonstrates how to carry out operations and helps students understand complex software.
Who's stronger?
Framework | Interaction Mode | Relies on an API? | Key Characteristics | Core Advantages |
---|---|---|---|---|
Cradle | Screenshots + keyboard and mouse | ❌ No API | Complete closed loop, self-directed learning | Versatility, modularity, wide adaptation |
LangChain Agent | Text API input/output | ✅ With API | Text commands / HTTP requests | Strong at information retrieval and text processing |
AutoHotkey / RPA etc. | Keyboard and mouse macros | ❌ No API | Single-step macro operations, no memory or planning | Easy to use but low intelligence, weak self-improvement |
Playwright/Selenium | DOM manipulation API | ✅ DOM API | Web automation | Specialized for the web, limited on the desktop |
Strengths: Cradle is a multimodal, cognitively capable "universal software executor" that goes beyond traditional macro tools and web automation tools.
Article Summary
- Cradle is the first general-purpose AI agent for software control. It supports a wide range of local software and AAA game operations.
- Its core is six modules that provide self-reflection, self-learning, and self-adaptation.
- The technical architecture is modular and maintainable.
- Compared to traditional tools, Cradle offers visual perception, closed-loop intelligence, and the ability to improve itself.
- Suitable for R&D automation, office, game development and teaching scenarios.