
Found a very good AI project on GitHub: Cradle. It can control the mouse and keyboard to simulate human operation, and it runs remarkably smoothly. Worth bookmarking.

Cradle is an open-source multimodal AI agent framework from the BAAI-Agents team, built for General Computer Control (GCC). It lets large multimodal models use a variety of software and games the way a human does, taking screenshots as input and producing keyboard and mouse actions as output.

  • General-purpose goal: support for any local software (e.g. games, Office, image/video editing tools)
  • Multimodal I/O: screenshots as input, keyboard and mouse operations as output
  • Autonomy: built-in "self-reflection + skill update" modules for continuous self-optimization
  • Modular design: highly controllable and extensible, easy to adapt to new environments

Pain points

Since the arrival of the GPT series of large models, LLMs have grown explosively. However, they rely on text input and output through an API, so they cannot control a local interface, and automating local tasks remains difficult:

  • Limited ability to operate Office and visualization software
  • Complex tasks are hard to decompose and then carry through to completion
  • No visual capability: UI elements cannot be located from language alone
  • No long-term memory of history, and weak execution of multi-step logic

Cradle is designed to address these pain points:

  • Controls the mouse and keyboard to simulate human operation
  • Strengthens its behavior through "self-reflection" and "skill optimization" strategies
  • Supports long-horizon tasks, complex game environments, and specialized software operations

Core functionality

Cradle has 6 core modules:

  1. Information Gathering
    • Processes UI screenshots and on-screen text using visual models
    • Audio feedback can also be fed in to round out the perceptual input
  2. Self-Reflection
    • Reviews the results of previous actions to determine whether the goal was achieved
    • Summarizes the reasons for failure and provides guidance for the next attempt
  3. Task Inference
    • Infers the current goal from the environment plus historical memory
    • Dynamically plans the next best strategy
  4. Skill Curation
    • Generates or updates skill functions for each task
    • Tailors strategies to each environment based on experience
  5. Action Planning
    • The LLM outputs high-level actions (e.g., "click on X", "move mouse to Y")
    • A human-written bridging layer translates these into keystrokes and mouse actions
  6. Memory
    • Short-term and long-term memory, including historical records
    • Supports reuse of memories and skills across tasks
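
To make the bridging layer from the Action Planning module more concrete, here is a minimal Python sketch. It is not Cradle's actual API: the names `InputEvent`, `click_at`, `press_key`, `SKILLS`, and `translate`, as well as the action dictionary format, are all assumptions made for illustration. The sketch only builds a list of low-level events and does not send real input to the operating system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class InputEvent:
    """One low-level input event that a driver could replay (hypothetical format)."""
    device: str    # "mouse" or "keyboard"
    kind: str      # "move", "down", or "up"
    detail: tuple  # coordinates, or a key/button name


def click_at(x: int, y: int) -> list[InputEvent]:
    """Hand-written skill: move the pointer, then press and release the left button."""
    return [
        InputEvent("mouse", "move", (x, y)),
        InputEvent("mouse", "down", ("left",)),
        InputEvent("mouse", "up", ("left",)),
    ]


def press_key(key: str) -> list[InputEvent]:
    """Hand-written skill: press and release a single key."""
    return [
        InputEvent("keyboard", "down", (key,)),
        InputEvent("keyboard", "up", (key,)),
    ]


# Registry mapping the skill names the model may emit to their implementations.
SKILLS: dict[str, Callable[..., list[InputEvent]]] = {
    "click_at": click_at,
    "press_key": press_key,
}


def translate(action: dict) -> list[InputEvent]:
    """Expand one high-level action into low-level events.

    Assumed action shape: {"skill": "<name>", "args": {...}}.
    """
    name = action["skill"]
    if name not in SKILLS:
        raise ValueError(f"unknown skill: {name}")
    return SKILLS[name](**action.get("args", {}))


if __name__ == "__main__":
    for event in translate({"skill": "click_at", "args": {"x": 120, "y": 48}}):
        print(event)
```

Keeping the registry separate from the translation function means a new environment only needs new skill functions, which fits the modular design described earlier.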

These modules form a closed loop: screenshot input → perception → self-reflection → planning → execution → memory feedback.
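
The closed loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not Cradle's implementation: `ToyEnv` stands in for a real application (its "screenshot" is just a counter and a target), and `Memory`, `gather`, `reflect`, `plan`, and `run` are hypothetical names. In the real framework the perception, reflection, and planning steps would be calls to a multimodal model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Memory:
    """Short-term memory keeps only recent steps; long-term memory keeps lessons."""
    short_term: list = field(default_factory=list)
    long_term: list = field(default_factory=list)
    window: int = 3

    def record(self, step: dict) -> None:
        self.short_term.append(step)
        self.short_term = self.short_term[-self.window:]


class ToyEnv:
    """Stand-in for a real application: the 'screenshot' is a counter and a target."""

    def __init__(self, target: int):
        self.value = 0
        self.target = target

    def screenshot(self) -> dict:
        return {"value": self.value, "target": self.target}

    def execute(self, action: str) -> None:
        self.value += {"increment": 1, "decrement": -1}[action]


def gather(env: ToyEnv) -> dict:
    """Information gathering: capture the current observation."""
    return env.screenshot()


def reflect(memory: Memory, observation: dict) -> Optional[str]:
    """Self-reflection: return a lesson if the last action did not bring us closer."""
    if not memory.short_term:
        return None
    prev = memory.short_term[-1]
    before = abs(prev["observation"]["target"] - prev["observation"]["value"])
    after = abs(observation["target"] - observation["value"])
    return None if after < before else f"{prev['action']} did not help"


def plan(observation: dict) -> str:
    """Planning: choose the next high-level action from the observation."""
    return "increment" if observation["value"] < observation["target"] else "decrement"


def run(env: ToyEnv, max_steps: int = 10) -> Memory:
    memory = Memory()
    for _ in range(max_steps):
        observation = gather(env)                 # screenshot input and perception
        lesson = reflect(memory, observation)     # self-reflection
        if lesson:
            memory.long_term.append(lesson)
        if observation["value"] == observation["target"]:
            break                                 # goal reached
        action = plan(observation)                # planning
        env.execute(action)                       # execution
        memory.record({"observation": observation, "action": action})  # memory feedback
    return memory


if __name__ == "__main__":
    env = ToyEnv(target=4)
    memory = run(env)
    print(env.value, len(memory.short_term), memory.long_term)
```

The point of the sketch is the ordering of the stages: each iteration observes first, evaluates the previous action against the new observation, and only then plans and acts, so the outcome of every action feeds into the next decision.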

Experiments show that Cradle can handle:

  • AAA game: Red Dead Redemption 2, completing main quests with a high success rate;
  • City-building game: Cities: Skylines, building a city of a thousand residents;
  • Farming game: Stardew Valley, automatic planting and harvesting;
  • Business game: Dealer's Life 2, reaching a highest weekly profit of 87%;
  • Office software: signing in to Chrome, replying to email in Outlook, using Feishu;
  • Editing tools: image and video processing in Meitu and CapCut.

Technical architecture

List of Technical Advantages

Advantage | Description
No API dependence | Does not rely on an application's internal interfaces, so it adapts to a wide range of software
Highly modular configuration | Easily extended to new games or software environments
Progressive capability enhancement | LLM + self-reflection + memory to support self-improvement
Universal operating interface | Screenshots in, keyboard and mouse actions out, so it is not tied to any one application

[Figure: an illustration of the interface]

Application scenarios

  • R&D testing: the AI agent can autonomously simulate user actions, as a substitute for UI/API testing
  • Office automation: large numbers of repetitive tasks (emails, forms, reports) can be automated
  • Game AI development: act as an in-game agent to test missions or train NPCs
  • Process automation: provides a UI automation pipeline with less reliance on traditional RPA
  • Education and training: Cradle demonstrates how to do things and helps students understand complex software

Who's stronger?

Framework / project | Interaction mode | Relies on an API? | Key characteristics | Core advantages
Cradle | Screenshots + keyboard and mouse | ❌ No API | Complete closed loop, self-directed learning | Versatility, modularity, wide adaptation
LangChain Agent | Text API input/output | ✅ API | Text commands / HTTP requests | Strong at information retrieval and text processing
AutoHotkey / RPA etc. | Keyboard and mouse macros | ❌ No API | Single-step macro operations, no memory or planning | Easy to use but low intelligence, weak self-improvement
Playwright / Selenium | DOM manipulation API | ✅ DOM API | Web automation | Specializes in the web, with limited reach on the desktop

Strengths: Cradle is a multimodal, cognitively capable "universal software operator" that goes beyond traditional desktop or web automation tools.

Article Summary

  • Cradle is a general-purpose AI agent for controlling software, described as the first of its kind. It supports a wide range of local software and AAA games.
  • Its core is 6 modules that provide self-reflection, self-learning, and self-adaptation.
  • The technical architecture is modular and maintainable.
  • Compared to traditional tools, Cradle offers vision-based operation, end-to-end closed-loop intelligence, and the ability to improve itself.
  • Suitable for R&D automation, office work, game development, and teaching scenarios.

Project Address

https://github.com/baai-agents/cradle