parsemypdf

Extract and analyze complex PDF documents using various tools to maintain document structure and efficiently extract tables, images, and mixed content. Specialized processors are available tailored to the complexity and content type of the PDFs.

Author

taxihabbel

MIT License

Quick Info

GitHub Stars: 1
NPM Weekly Downloads: 0
Tools: 1
Last Updated: 2026-02-19

Tags

parsemypdf, pdfs, pdf, parsemypdf extract, taxihabbel parsemypdf, pdf documents

📑 Complex PDF Parsing

A comprehensive set of example code for extracting content from PDFs.

Also see: PDF Parsing Guide

📌 Core Features

📤 Content Extraction

  • Multiple extraction methods with different tools/libraries:
      • Cloud-based: Claude 3.5 Sonnet, GPT-4 Vision, Unstructured.io
      • Local: Llama 3.2 11B, Docling, PDFium
      • Specialized: Camelot (tables), PDFMiner (text), PDFPlumber (mixed), pypdf, etc.
  • Maintains document structure and formatting
  • Handles complex PDFs with mixed content, including extracting image data
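As a rough illustration of the mixed-content case, the sketch below uses pdfplumber to collect text, tables, and an image count for each page. The helper names (`table_to_markdown`, `extract_pages`) and the shape of the returned dictionaries are assumptions made for this example, not code taken from this repository.

```python
def table_to_markdown(rows):
    """Render a table (list of rows; cells may be None) as a Markdown table."""
    if not rows:
        return ""
    # pdfplumber returns None for empty cells; normalise to strings.
    cleaned = [
        ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        for row in rows
    ]
    header, body = cleaned[0], cleaned[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def extract_pages(pdf_path):
    """Return one dict per page with its text, tables, and image count."""
    import pdfplumber  # imported lazily so the helper above works without it

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            results.append(
                {
                    "page": number,
                    "text": page.extract_text() or "",
                    "tables": [table_to_markdown(t) for t in page.extract_tables()],
                    "image_count": len(page.images),
                }
            )
    return results
```

Calling `extract_pages("input/sample-4.pdf")` would then give a page-by-page view of the mixed-content sample, assuming that file is present.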

📦 Implementation Options

1. ☁️ Cloud-Based Methods

  • Claude & Llama: Excellent for complex PDFs with mixed content
  • GPT-4 Vision: Excellent for visual content analysis
  • Unstructured.io: Advanced content partitioning and classification

2. 🖥️ Local Methods

  • Llama 3.2 11B Vision: Image-based PDF processing
  • Docling: Excellent for complex PDFs with mixed content
  • PDFium: High-fidelity processing using Chrome's PDF engine
  • Camelot: Specialized table extraction
  • PDFMiner/PDFPlumber: Basic text and layout extraction
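Image-based processing with a vision model first needs each page as an image. The following is a minimal sketch of that step using pypdfium2 (with Pillow installed for `to_pil()`); the function names, the output file naming, and the default of 144 DPI are illustrative assumptions, and the step that passes the images to a model is left out.

```python
from pathlib import Path

POINTS_PER_INCH = 72  # PDF user-space units per inch


def dpi_to_scale(dpi):
    """Convert a target DPI into the scale factor pypdfium2 expects (1.0 = 72 DPI)."""
    return dpi / POINTS_PER_INCH


def render_pages(pdf_path, out_dir, dpi=144):
    """Render every page of a PDF to a PNG file and return the written paths."""
    import pypdfium2 as pdfium  # imported lazily; needs pypdfium2 and Pillow

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    pdf = pdfium.PdfDocument(str(pdf_path))
    written = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            image = page.render(scale=dpi_to_scale(dpi)).to_pil()
            target = out_dir / f"page-{index + 1:03d}.png"
            image.save(target)
            written.append(target)
    finally:
        pdf.close()
    return written
```

A higher DPI tends to help with small table text at the cost of larger images, so the value is worth tuning per document.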

🔗 Dependencies

📚 Core Libraries

langchain_ollama
langchain_huggingface
langchain_community
FAISS
python-dotenv

⚙️ Implementation-Specific

anthropic       # Claude
openai          # GPT-4 Vision
camelot-py      # Table extraction
docling         # Text processing
pdf2image       # PDF conversion
pypdfium2       # PDFium processing
boto3           # AWS Textract

🛠️ Setup

  1. Environment Variables
ANTHROPIC_API_KEY=your_key_here    # For Claude
OPENAI_API_KEY=your_key_here       # For OpenAI
UNSTRUCTURED_API_KEY=your_key_here # For Unstructured.io
  2. Install Dependencies
pip install -r requirements.txt
  3. Install Ollama & Models (for local processing)
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull required models
ollama pull llama3.1
ollama pull x/llama3.2-vision:11b
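
Before calling a cloud backend it can help to confirm that the matching key is actually set. The sketch below assumes the variables above are kept in a `.env` file and loaded with python-dotenv; the backend names in `REQUIRED_KEY` and both function names are hypothetical labels chosen for this example.

```python
import os

# Hypothetical backend names mapped to the environment variables listed above.
REQUIRED_KEY = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "unstructured": "UNSTRUCTURED_API_KEY",
}


def load_env_file(path=".env"):
    """Load variables from a .env file (assumed location) into os.environ."""
    from dotenv import load_dotenv  # provided by python-dotenv

    return load_dotenv(dotenv_path=path)


def missing_key(backend, env=None):
    """Return the name of the variable a backend still needs, or None if it is set."""
    env = os.environ if env is None else env
    name = REQUIRED_KEY[backend]
    return None if env.get(name) else name
```

For example, after `load_env_file()`, a script could stop early with a clear message when `missing_key("claude")` returns a variable name instead of `None`.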

📈 Usage

  1. Place PDF files in the input/ directory

📄 Example complex PDFs in the input/ folder

  • sample-1.pdf: Standard tables
  • sample-2.pdf: Image-based simple tables
  • sample-3.pdf: Image-based complex tables
  • sample-4.pdf: Mixed content (text, tables, images)
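
A script that processes everything in the input folder might start by collecting the PDF files, as in this small sketch. The function name `list_pdfs` is an assumption for illustration; only the `input/` directory name comes from the instructions above.

```python
from pathlib import Path


def list_pdfs(input_dir="input"):
    """Return the PDF files in input_dir, sorted by path (empty list if the folder is missing)."""
    folder = Path(input_dir)
    if not folder.is_dir():
        return []
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each returned path can then be handed to whichever processor fits the file, for example a table extractor for sample-1.pdf.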

📝 Notes

  • Local LLM operations need sufficient system resources
  • API keys are required for cloud-based implementations
  • Consider PDF complexity when choosing an implementation
  • Ghostscript is required for Camelot
  • Different processors suit different use cases:
      • Cloud: Complex documents, mixed content
      • Local: Simple text, basic tables
      • Specialized: Specific content types (tables, forms)
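
The guidance above can be read as a simple decision rule. The sketch below is one possible interpretation, not logic from this repository: the content-type labels, the thresholds, and the function name `suggest_category` are all assumptions chosen for illustration.

```python
def suggest_category(content_types):
    """Suggest 'cloud', 'local', or 'specialized' from the kinds of content in a PDF.

    content_types is an iterable of labels such as "text", "tables",
    "images", or "forms" (hypothetical labels for this sketch).
    """
    kinds = set(content_types)
    if not kinds:
        raise ValueError("at least one content type is required")

    # A single specific content type (tables or forms) suits a specialized tool.
    if len(kinds) == 1 and kinds <= {"tables", "forms"}:
        return "specialized"

    # Images or many content types together are treated as complex, mixed content.
    if "images" in kinds or len(kinds) > 2:
        return "cloud"

    # Simple text, possibly with basic tables, can stay local.
    return "local"
```

With this rule, `suggest_category(["text", "tables", "images"])` points to the cloud methods, while `suggest_category(["tables"])` points to a specialized tool such as Camelot.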
