PDF Processor with Mineru API

Overview

The PDF processor now calls Mineru's REST API over HTTP instead of importing the magic_pdf library. PDF parsing runs as a separate service, so the parser can be deployed and updated independently of the backend.

Changes Made

1. Removed Dependencies

  • Removed all magic_pdf imports and dependencies
  • Removed direct PyPDF2 usage (the package remains in requirements in case other code needs it)

2. New Implementation

  • REST API Integration: Uses HTTP requests to call Mineru's API
  • Configurable Settings: Mineru API URL and timeout are configurable
  • Error Handling: Comprehensive error handling for network issues, timeouts, and API errors
  • Flexible Response Parsing: Handles multiple possible response formats from Mineru API

3. Configuration

Add the following settings to your environment or .env file:

# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
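
As an illustration of how these variables might be read, here is a minimal sketch using only the standard library. The class name `MineruSettings` and the helper `_env_bool` are hypothetical and not taken from the project; the defaults mirror the values shown above, and `MINERU_LANG_LIST` is assumed to be a JSON array.

```python
import json
import os
from dataclasses import dataclass
from typing import List


def _env_bool(name: str, default: bool) -> bool:
    """Interpret common truthy strings; fall back to the default when unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class MineruSettings:  # hypothetical name, for illustration only
    api_url: str
    timeout: int  # seconds
    lang_list: List[str]
    backend: str
    parse_method: str
    formula_enable: bool
    table_enable: bool

    @classmethod
    def from_env(cls) -> "MineruSettings":
        return cls(
            api_url=os.environ.get("MINERU_API_URL", "http://mineru-api:8000"),
            timeout=int(os.environ.get("MINERU_TIMEOUT", "300")),
            # Assumption: the list is written as a JSON array, e.g. ["ch"]
            lang_list=json.loads(os.environ.get("MINERU_LANG_LIST", '["ch"]')),
            backend=os.environ.get("MINERU_BACKEND", "pipeline"),
            parse_method=os.environ.get("MINERU_PARSE_METHOD", "auto"),
            formula_enable=_env_bool("MINERU_FORMULA_ENABLE", True),
            table_enable=_env_bool("MINERU_TABLE_ENABLE", True),
        )
```

Calling `MineruSettings.from_env()` once at startup keeps type conversion in one place, so the rest of the code never parses raw strings.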

4. API Endpoint

The processor expects Mineru to provide a REST API endpoint at /file_parse that accepts PDF files via multipart form data and returns JSON with markdown content.

Expected Request Format:

POST /file_parse
Content-Type: multipart/form-data

files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
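
The request above could be assembled roughly as follows. This is a sketch under assumptions: this README does not say which HTTP client the processor uses, so the third-party `requests` library is assumed here, and the function names `build_form_fields` and `post_pdf` are hypothetical. Passing `lang_list` as a Python list makes `requests` send it as a repeated form field; whether the service expects that encoding is also an assumption.

```python
import os


def build_form_fields(lang_list=("ch",), backend="pipeline", parse_method="auto",
                      formula_enable=True, table_enable=True,
                      start_page_id=0, end_page_id=99999):
    """Return the non-file form fields listed in the request format."""
    def flag(value):
        # Send booleans as lowercase strings, matching the documented values
        return "true" if value else "false"

    return {
        "output_dir": "./output",
        "lang_list": list(lang_list),
        "backend": backend,
        "parse_method": parse_method,
        "formula_enable": flag(formula_enable),
        "table_enable": flag(table_enable),
        "return_md": "true",
        "return_middle_json": "false",
        "return_model_output": "false",
        "return_content_list": "false",
        "return_images": "false",
        "start_page_id": str(start_page_id),
        "end_page_id": str(end_page_id),
    }


def post_pdf(pdf_path, api_url="http://mineru-api:8000", timeout=300):
    """Upload one PDF to /file_parse and return the decoded JSON body."""
    import requests  # assumed third-party dependency

    with open(pdf_path, "rb") as fh:
        files = {"files": (os.path.basename(pdf_path), fh, "application/pdf")}
        response = requests.post(
            f"{api_url}/file_parse",
            files=files,
            data=build_form_fields(),
            timeout=timeout,
        )
    response.raise_for_status()
    return response.json()
```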

Expected Response Format:

The processor can handle multiple response formats:

{
  "markdown": "# Document Title\n\nContent here..."
}

OR

{
  "md": "# Document Title\n\nContent here..."
}

OR

{
  "content": "# Document Title\n\nContent here..."
}

OR

{
  "result": {
    "markdown": "# Document Title\n\nContent here..."
  }
}
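
One way to handle all four shapes is a small fallback chain. The sketch below is illustrative rather than the processor's actual code: the function name `extract_markdown` and the order in which the keys are tried are assumptions.

```python
from typing import Any, Optional


def extract_markdown(payload: Any) -> Optional[str]:
    """Return the markdown text from a decoded JSON response, or None."""
    if not isinstance(payload, dict):
        return None

    # Top-level keys, tried in the order they are listed above
    for key in ("markdown", "md", "content"):
        value = payload.get(key)
        if isinstance(value, str) and value.strip():
            return value

    # Nested form: {"result": {"markdown": "..."}}
    nested = payload.get("result")
    if isinstance(nested, dict):
        value = nested.get("markdown")
        if isinstance(value, str) and value.strip():
            return value

    return None
```

Returning `None` instead of an empty string lets the caller tell "no recognizable content" apart from a valid response and log the raw payload for inspection.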

Usage

Basic Usage

from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor

# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")

# Read and convert PDF to markdown
content = processor.read_content()

# Process content (apply masking)
processed_content = processor.process_content(content)

# Save processed content
processor.save_content(processed_content)

Through Document Service

from app.core.services.document_service import DocumentService

service = DocumentService()
success = service.process_document("input.pdf", "output.md")

Testing

Run the test script to verify the implementation:

cd backend
python test_pdf_processor.py

Make sure you have:

  1. A sample PDF file in the sample_doc/ directory
  2. Mineru API service running and accessible
  3. Proper network connectivity between services

Error Handling

The processor handles various error scenarios:

  • Network Timeouts: Configurable via MINERU_TIMEOUT (default: 300 seconds, i.e. 5 minutes)
  • API Errors: HTTP status code errors are logged and handled
  • Response Parsing: Multiple fallback strategies for extracting markdown content
  • File Operations: Proper error handling for file reading/writing
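
The network-related cases above can be sketched as a wrapper that maps low-level failures to a single error type. This example uses the standard library's `urllib` as a stand-in, because this README does not name the HTTP client the processor uses; `MineruRequestError` and `send` are hypothetical names.

```python
import logging
import socket
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)


class MineruRequestError(Exception):
    """Hypothetical error type that wraps any failed API call."""


def send(request, timeout=300, opener=urllib.request.urlopen):
    """Send a prepared request and translate failures into one error type."""
    try:
        with opener(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # Caught first because HTTPError is a subclass of URLError
        logger.error("Mineru API returned HTTP %s", exc.code)
        raise MineruRequestError(f"API error: HTTP {exc.code}") from exc
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            logger.error("Mineru API timed out after %s s", timeout)
            raise MineruRequestError(f"timeout after {timeout}s") from exc
        logger.error("Cannot reach Mineru API: %s", exc.reason)
        raise MineruRequestError(f"network error: {exc.reason}") from exc
    except (socket.timeout, TimeoutError) as exc:
        # Raised directly when the timeout hits while reading the body
        logger.error("Mineru API timed out after %s s", timeout)
        raise MineruRequestError(f"timeout after {timeout}s") from exc
```

The `opener` parameter exists only so the function can be exercised without a running service, for example by passing a callable that raises the error under test.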

Logging

The processor provides detailed logging for debugging:

  • API call attempts and responses
  • Content extraction results
  • Error conditions and stack traces
  • Processing statistics

Deployment

Docker Compose

Ensure your Mineru service is running and accessible. The default configuration expects it at http://mineru-api:8000.

Environment Variables

Set the following environment variables in your deployment:

MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300

Troubleshooting

Common Issues

  1. Connection Refused: Check if Mineru service is running and accessible
  2. Timeout Errors: Increase MINERU_TIMEOUT for large PDF files
  3. Empty Content: Check Mineru API response format and logs
  4. Network Issues: Verify network connectivity between services

Debug Mode

Enable debug logging to see detailed API interactions:

import logging
logging.basicConfig()  # ensures a handler exists; no effect if logging is already configured
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)

Migration from magic_pdf

If you were previously using magic_pdf:

  1. No Code Changes Required: The interface remains the same
  2. Configuration Update: Add Mineru API settings
  3. Service Dependencies: Ensure Mineru service is running
  4. Testing: Run the test script to verify functionality

Performance Considerations

  • Timeout: Large PDFs may need a MINERU_TIMEOUT above the 300-second default
  • Memory: The processor loads the entire PDF into memory for API calls
  • Network: API calls add network latency to processing time
  • Caching: Consider implementing caching for frequently processed documents
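
As one possible approach to the caching suggestion, the sketch below stores converted markdown on disk under the SHA-256 hash of the PDF bytes, so an unchanged file is never sent to the API twice. Nothing like this exists in the project as described here; `cached_markdown`, the cache directory layout, and the `convert` callable are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_markdown(pdf_path: str, cache_dir: str,
                    convert: Callable[[bytes], str]) -> str:
    """Return markdown for a PDF, calling convert() only on a cache miss."""
    data = Path(pdf_path).read_bytes()

    # Key on file content, not file name, so renamed copies still hit the cache
    key = hashlib.sha256(data).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.md"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    markdown = convert(data)  # e.g. a function that calls the Mineru API
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

A content-hash key never goes stale when a file is edited, but the cache would need to be cleared if the conversion settings (language list, backend, parse method) change.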