scraper-mcp-smithery
Enhance web scraping capabilities with tools to efficiently extract and manipulate data. Automate data collection through asynchronous web search operations and integrate smoothly with existing applications using a FastMCP server foundation.
Author: rockerritesh
Web Scraper MCP for Smithery
A robust MCP (Model Context Protocol) server for web scraping, deployed on Smithery, the orchestration layer for AI agents. It converts web pages into clean, structured Markdown and manages ChromeDriver automatically.
🌟 Available on Smithery
This MCP server is part of Smithery's marketplace with 7953+ skills and extensions built by the community. Deploy instantly to integrate web scraping capabilities into your AI agents.
✨ Features
- 🚀 High Performance: The scraper is called as a direct function import rather than through a subprocess; dependencies are managed with the uv package manager
- 🔄 Zero Configuration: Automatic ChromeDriver management with version compatibility
- 🌐 Smart URL Processing: Auto-adds HTTPS protocol and validates URLs
- 📝 Markdown Conversion: Converts web content to clean, structured markdown
- ⚡ Async Operations: Non-blocking web scraping with proper async/await
- 🛡️ Production Ready: Comprehensive error handling and graceful fallbacks
- 🐳 Smithery Optimized: Containerized deployment with security best practices
📋 Prerequisites
- Smithery Account - Sign up at smithery.ai
- Python 3.12+ (for local development)
- UV package manager
- Google Chrome (automatically managed in deployment)
🚀 Smithery Deployment
Deploy to Smithery Platform
- Visit Smithery Web Scraper MCP
- Click "Deploy Server" to add to your agent
- Configure with your preferred settings
- Start scraping websites instantly!
Local Development
# Clone the repository
git clone https://github.com/rockerritesh/scraper-mcp-smithery.git
cd scraper-mcp-smithery
# Install dependencies with uv
uv sync
# Run the MCP development server
uv run mcp dev server.py
Direct Python Usage (Development)
from scraper_doc import scrape_website
# Scrape a website
content = scrape_website("https://example.com")
print(content) # Returns markdown formatted content
URL Format Requirements
- ✅ Supported: https://example.com, http://example.com
- ✅ Auto-fixed: example.com → https://example.com
- ❌ Invalid: Malformed URLs return descriptive error messages
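The URL rules above can be illustrated with a small sketch. This is not the code from `scraper_doc.py`; the function name `normalize_url` and the exact checks are assumptions, chosen only to show one way the "auto-fix, then validate" behaviour might look using the standard library.

```python
from urllib.parse import urlparse


def normalize_url(raw: str) -> str:
    """Return a usable URL, adding https:// when no scheme is given.

    Hypothetical helper: raises ValueError with a descriptive message
    when the input cannot be turned into an http(s) URL.
    """
    candidate = raw.strip()
    if not candidate:
        raise ValueError("URL is empty")
    # Bare hosts such as "example.com" default to HTTPS.
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")
    host = parsed.hostname
    if not host or " " in host:
        raise ValueError(f"No valid host in URL: {raw!r}")
    return candidate
```

With this sketch, `normalize_url("example.com")` yields an HTTPS URL, while an input such as `ftp://example.com` raises `ValueError` instead of being passed to the browser.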
🏗️ Smithery Architecture
Integration Flow
Smithery Agent → MCP Protocol → search_web_tool → Chrome/Selenium → Markdown Output
Platform Benefits
- 🎯 Zero Setup: Deploy instantly without infrastructure management
- 📊 Monitoring: Built-in health checks and performance metrics
- 🔗 Agent Integration: Seamless connection to Smithery's AI orchestration
- 📈 Scalability: Automatic scaling based on usage patterns
Key Improvements
- ❌ Old: Subprocess calls with performance overhead
- ✅ New: Direct function imports with async execution
- 🎯 Result: ~3x faster performance on Smithery platform
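To make the "direct function import with async execution" idea concrete, here is a minimal sketch. It assumes the scraper is a blocking function (as Selenium calls are) and uses `asyncio.to_thread` to keep the event loop responsive. The README does not say how `server.py` does this, so `scrape_blocking` and `scrape_async` are hypothetical stand-ins, not the project's actual functions.

```python
import asyncio
import time


def scrape_blocking(url: str) -> str:
    """Stand-in for a blocking Selenium scrape (hypothetical)."""
    time.sleep(0.1)  # simulate waiting for the page to load
    return f"# Scraped {url}"


async def scrape_async(url: str) -> str:
    # Call the function directly (no subprocess) and run it in a worker
    # thread so other coroutines can make progress in the meantime.
    return await asyncio.to_thread(scrape_blocking, url)


if __name__ == "__main__":
    print(asyncio.run(scrape_async("https://example.com")))
```

Compared with spawning a new Python process per request, a direct call avoids interpreter start-up and output parsing, which is the kind of overhead the "Old" approach refers to.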
🛠️ Development & Testing
Local Testing
# Test the scraper directly
uv run python scraper_doc.py https://example.com
# Test with output directory
uv run python scraper_doc.py https://example.com ./output
# Run MCP development server
uv run mcp dev server.py
Debug Mode
MCP_DEBUG=1 uv run mcp dev server.py
Dependencies (Managed by UV)
- mcp[cli] - Model Context Protocol framework
- selenium - Web browser automation
- webdriver-manager - Automatic ChromeDriver management
- requests - HTTP client for image downloads
- python-dotenv - Environment variable management
🐛 Troubleshooting
Common Smithery Issues
- Deployment Timeout: Usually resolves automatically; check Smithery status
- Tool Not Found: Ensure proper MCP tool registration in server.py
- Memory Limits: Large pages may require optimization (handled automatically)
ChromeDriver Issues
webdriver-manager normally resolves these automatically. For local development, the following may help:
# Clear webdriver cache if needed
rm -rf ~/.wdm/
# Verify Chrome installation
google-chrome --version
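The same check can be done from Python, which is handy when the binary is installed under a different name. This is a sketch under assumptions: the candidate names below are common on Linux but are not taken from this project, and `find_chrome` and `chrome_version` are hypothetical helpers.

```python
import shutil
import subprocess
from typing import Optional

# Executable names to try; an assumed list, adjust for your system.
CHROME_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
)


def find_chrome() -> Optional[str]:
    """Return the path of the first Chrome-like binary on PATH, or None."""
    for name in CHROME_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def chrome_version(path: str) -> str:
    """Run `<binary> --version` and return its trimmed output."""
    done = subprocess.run(
        [path, "--version"], capture_output=True, text=True, timeout=10
    )
    return done.stdout.strip()
```

If `find_chrome()` returns `None`, Chrome is not on `PATH` and webdriver-manager will have nothing to match a driver against; otherwise `chrome_version(path)` reports the installed version.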
📊 Performance on Smithery
- 🚀 Scraping Speed: 2-5 seconds per page
- 💾 Memory Usage: ~50-100MB per operation
- ⚡ Concurrent Support: Multiple async operations
- 🔄 Auto-scaling: Handled by Smithery platform
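Given the per-operation memory figure above, callers running several scrapes at once may want to cap concurrency. The sketch below shows one common pattern, an `asyncio.Semaphore` around each task. `fake_scrape` is a hypothetical placeholder for the real scrape call, and the limit of 3 is an arbitrary assumption, not a documented setting.

```python
import asyncio
from typing import List


async def fake_scrape(url: str) -> str:
    """Hypothetical placeholder for an async scrape call."""
    await asyncio.sleep(0.05)  # simulate network and rendering time
    return f"# {url}"


async def scrape_many(urls: List[str], limit: int = 3) -> List[str]:
    # Each browser instance uses noticeable memory, so allow at most
    # `limit` scrapes to run at the same time.
    gate = asyncio.Semaphore(limit)

    async def guarded(url: str) -> str:
        async with gate:
            return await fake_scrape(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))
```

For example, `asyncio.run(scrape_many(urls, limit=2))` processes the list two pages at a time while still returning results in input order.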
🔐 Security Features
- 🛡️ Sandboxed Execution: Chrome runs with security flags
- 👤 Non-root User: Enhanced container security
- 🔒 URL Validation: Prevents malicious URL processing
- 📊 Audit Logging: Smithery platform monitoring
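The README does not spell out what its URL validation rejects. As one plausible reading, the sketch below blocks non-HTTP schemes and literal loopback, private, or link-local addresses, which are typical targets when a scraper is abused to reach internal services. The function name `is_safe_target` and the rules themselves are assumptions, and the check does not resolve DNS names.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_target(url: str) -> bool:
    """Hypothetical check: allow only http(s) URLs to non-internal hosts."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # e.g. file:// or ftp://
    host = parsed.hostname
    if not host or host == "localhost":
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name rather than an IP literal; it is not resolved here.
        return True
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```

A stricter version would also resolve the hostname and re-check the resulting addresses, since a public DNS name can point at an internal IP.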
🌐 Smithery Integration Examples
In Chat Agents
Agent: "Can you scrape the latest news from example.com?"
Web Scraper MCP: *Scrapes and returns structured content*
Agent: "Here's the latest news in markdown format..."
In Automation Workflows
Trigger → Smithery Agent → Web Scraper MCP → Content Analysis → Action
📚 Resources
- Smithery Platform - Deploy and manage MCP servers
- Smithery Documentation - Platform guides and API reference
- MCP Specification - Protocol documentation
- Community Discord - Get help and share ideas
📜 License
MIT License - see LICENSE file for details.
🤝 Contributing to Smithery Ecosystem
- Fork this repository
- Create a feature branch
- Test on Smithery platform
- Submit a pull request
- Share in Smithery community
🚀 Deployed on Smithery | Built with FastMCP, Selenium, and UV | Part of 7953+ community extensions
