mcp-gui-automation-toolkit
An MCP (Model Context Protocol) server for GUI automation: it controls the mouse and keyboard and takes screenshots, and it supports Windows, macOS, and Linux.
Author: hetaoBackend
This MCP (Model Context Protocol) server provides automated GUI testing and control capabilities, built on PyAutoGUI.
Core Functionalities
- Control mouse movements and clicks.
- Simulate keyboard input (text and key combinations).
- Take screenshots.
- Find images on the screen.
- Get screen information.
- Cross-platform support (Windows, macOS, Linux).
Provided Interfaces (Tools)
Cursor Manipulation
- Move the cursor to given screen coordinates.
- Click a mouse button at the current or a specified position.
- Drag and drop.
- Get the current cursor position.
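As a rough illustration of how the cursor tools might map onto PyAutoGUI, the sketch below wraps `moveTo`, `click`, `dragTo`, and `position`. The helper names (`is_on_screen`, `move_and_click`, `drag`) are hypothetical and are not taken from this server's source; the `gui` argument is assumed to be the `pyautogui` module or any object exposing the same methods.

```python
def is_on_screen(x, y, width, height):
    """Return True if (x, y) is a valid pixel on a width x height screen."""
    return 0 <= x < width and 0 <= y < height


def move_and_click(gui, x, y, button="left"):
    """Move the pointer to (x, y) and click there; reject off-screen targets."""
    width, height = gui.size()
    if not is_on_screen(x, y, width, height):
        raise ValueError(f"({x}, {y}) is outside the {width}x{height} screen")
    gui.moveTo(x, y)
    gui.click(x=x, y=y, button=button)


def drag(gui, start, end, duration=0.5):
    """Drag with the left button from start to end; return the final position."""
    gui.moveTo(*start)
    gui.dragTo(end[0], end[1], duration=duration, button="left")
    return tuple(gui.position())
```

In a real desktop session, `import pyautogui` followed by `move_and_click(pyautogui, 100, 200)` would drive the actual pointer. Passing the backend as an argument is a design choice made here so the helpers can be exercised with a stub on machines without a display.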
Input Simulation
- Type text.
- Press a single key or a key combination.
- Trigger hotkey sequences.
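The keyboard tools above could be sketched as a small dispatcher over PyAutoGUI's `write`, `press`, and `hotkey` functions. The names `parse_hotkey` and `send_keys`, and the `"ctrl+shift+s"` string format, are assumptions made for this example rather than part of the server's documented interface; `gui` again stands for the `pyautogui` module or a compatible object.

```python
def parse_hotkey(combo):
    """Split a string such as 'Ctrl + Shift + S' into lowercase key names."""
    keys = [part.strip().lower() for part in combo.split("+")]
    if not all(keys):
        raise ValueError(f"malformed hotkey: {combo!r}")
    return keys


def send_keys(gui, text=None, key=None, hotkey=None, interval=0.0):
    """Dispatch exactly one of: typed text, a single key press, or a hotkey."""
    if sum(arg is not None for arg in (text, key, hotkey)) != 1:
        raise ValueError("provide exactly one of text, key, hotkey")
    if text is not None:
        gui.write(text, interval=interval)  # types the string character by character
    elif key is not None:
        gui.press(key)  # presses and releases one key, e.g. "enter"
    else:
        gui.hotkey(*parse_hotkey(hotkey))  # holds the keys in order, then releases
```

For example, `send_keys(pyautogui, hotkey="ctrl+c")` would issue a copy shortcut on Windows or Linux. This simple format cannot express the `+` key itself as part of a combination.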
Display Environment Operations
- Capture full-screen or region-specific images.
- Get the screen size.
- Find the on-screen location of a given template image.
- Read the specific RGB values of individual screen pixels.
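A minimal sketch of the screen tools, assuming they delegate to PyAutoGUI's `screenshot`, `size`, and `locateOnScreen`. The function names `clip_region`, `capture`, and `find_image` are hypothetical. The lookup handles both behaviors PyAutoGUI has shown across versions when a template is absent: returning `None` or raising an exception.

```python
def clip_region(region, screen_size):
    """Clip (left, top, width, height) to the screen; None if nothing remains."""
    left, top, width, height = region
    screen_w, screen_h = screen_size
    right = min(left + width, screen_w)
    bottom = min(top + height, screen_h)
    left, top = max(left, 0), max(top, 0)
    if right <= left or bottom <= top:
        return None
    return left, top, right - left, bottom - top


def capture(gui, region=None):
    """Take a full-screen screenshot, or one of the clipped region."""
    if region is None:
        return gui.screenshot()
    clipped = clip_region(region, gui.size())
    if clipped is None:
        raise ValueError("region lies entirely outside the screen")
    return gui.screenshot(region=clipped)


def find_image(gui, template_path):
    """Return the center (x, y) of the first template match, or None."""
    try:
        box = gui.locateOnScreen(template_path)
    except Exception:  # broad on purpose: newer versions raise when not found
        return None
    if box is None:  # older versions return None instead of raising
        return None
    return box.left + box.width // 2, box.top + box.height // 2
```

Reading a single pixel needs no wrapper in this sketch: `pyautogui.pixel(x, y)` returns the RGB value at that position.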
Deployment Instructions
Prerequisites
- Runtime environment: Python version 3.12 or newer.
- Core dependency: PyAutoGUI package.
- Other dependencies are installed automatically by the package manager.
Installation Procedure
Install the package:

    pip install mcp-pyautogui-server
Configuration for Claude Desktop Integration
macOS path:

    ~/Library/Application\ Support/Claude/claude_desktop_config.json

Windows path:

    %APPDATA%/Claude/claude_desktop_config.json
Configuration for a development (unpublished) server:

    {
      "mcpServers": {
        "mcp-pyautogui-server": {
          "command": "uv",
          "args": [
            "--directory",
            "/path/to/mcp-pyautogui-server",
            "run",
            "mcp-pyautogui-server"
          ]
        }
      }
    }
Configuration for a published server:

    {
      "mcpServers": {
        "mcp-pyautogui-server": {
          "command": "uvx",
          "args": [
            "mcp-pyautogui-server"
          ]
        }
      }
    }
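If the Claude Desktop configuration file already lists other servers, the entry has to be merged in rather than pasted over the whole file. The following is a sketch of one way to do that with the standard library; `add_server` is a hypothetical helper, not a tool shipped with this package, and it assumes the file, if present, contains valid JSON.

```python
import json
from pathlib import Path


def add_server(config_path, name, command, args):
    """Add or replace one entry under "mcpServers", keeping other entries intact."""
    path = Path(config_path)
    # Start from the existing configuration when there is one.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config.setdefault("mcpServers", {})[name] = {
        "command": command,
        "args": list(args),
    }
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For the published server this could be called as `add_server(path, "mcp-pyautogui-server", "uvx", ["mcp-pyautogui-server"])`, where `path` is the platform-specific location listed above.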
Development Cycle
Building and Releasing Artifacts
- Sync dependencies and update the lock file:

      uv sync

- Build the package distributions:

      uv build

- Publish to PyPI:

      uv publish
Note: PyPI credentials must be supplied via environment variables or command-line flags:
* Token: --token or the UV_PUBLISH_TOKEN environment variable.
* Username/password: --username/UV_PUBLISH_USERNAME and --password/UV_PUBLISH_PASSWORD.
Troubleshooting and Inspection
For the best debugging experience, use the MCP Inspector.
Launch the Inspector with npx:

    npx @modelcontextprotocol/inspector uv --directory /path/to/mcp-pyautogui-server run mcp-pyautogui-server

On launch, the Inspector displays a URL that you can open in a browser to begin debugging.
Licensing
This project is licensed under the MIT License; see the LICENSE file for details.
WIKIPEDIA: A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers, but they are executed via a command-line interface or using network communication. They are particularly useful for testing web pages as they are able to render and understand HTML the same way a browser would, including styling elements such as page layout, color, font selection and execution of JavaScript and Ajax which are usually not available when using other testing methods. Since version 59 of Google Chrome and version 56 of Firefox, there is native support for remote control of the browser. This made earlier efforts obsolete, notably PhantomJS.
== Use cases ==
The main use cases for headless browsers are:
- Test automation in modern web applications (web testing).
- Taking screenshots of web pages.
- Running automated tests for JavaScript libraries.
- Automating interaction of web pages.
=== Other uses ===
Headless browsers are also useful for web scraping. Google stated in 2009 that using a headless browser could help their search engine index content from websites that use Ajax. Headless browsers have also been misused in various ways:
- Perform DDoS attacks on web sites.
- Increase advertisement impressions.
- Automate web sites in unintended ways, e.g. for credential stuffing.

However, a study of browser traffic in 2018 found no preference by malicious actors for headless browsers. There is no indication that headless browsers are used more frequently than non-headless browsers for malicious purposes, like DDoS attacks, SQL injections or cross-site scripting attacks.
== Usage ==
As several major browsers natively support headless mode through APIs, some software exists to perform browser automation through a unified interface. These include:
- Selenium WebDriver: a W3C compliant implementation of WebDriver.
- Playwright: a Node.js library to automate Chromium, Firefox and WebKit.
- Puppeteer: a Node.js library to automate Chrome or Firefox.
=== Test automation ===
Some test automation software and frameworks include headless browsers as part of their testing apparatus.
- Capybara uses headless browsing, either via WebKit or Headless Chrome, to mimic user behavior in its testing protocols.
- Jasmine uses Selenium by default, but can use WebKit or Headless Chrome to run browser tests.
- Cypress, a frontend testing framework.
- QF-Test, a software tool for automated testing of programs via the graphical user interface, where a headless browser can also be used for testing.
=== Alternatives ===
Another approach is to use software that provides browser APIs. For example, Deno provides browser APIs as part of its design. For Node.js, jsdom is the most complete provider. While most are able to support common browser features (HTML parsing, cookies, XHR, some JavaScript, etc.), they do not render the DOM and have limited support for DOM events. They usually perform faster than
