PDF Processor with Mineru API

Overview

The PDF processor has been rewritten to call Mineru's REST API instead of importing the magic_pdf library. PDF parsing now runs in a separate service, which gives a cleaner separation of concerns and allows the parser to be deployed independently of the backend.

Changes Made

1. Removed Dependencies

Removed all magic_pdf imports and dependencies
Removed direct use of PyPDF2 (the package remains in the requirements in case other code needs it)

2. New Implementation

REST API Integration: Uses HTTP requests to call Mineru's API
Configurable Settings: Mineru API URL and timeout are configurable
Error Handling: Comprehensive error handling for network issues, timeouts, and API errors
Flexible Response Parsing: Handles multiple possible response formats from Mineru API

3. Configuration

Add the following settings to your environment or .env file:

# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
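
As an illustration of how these variables might be read, here is a minimal sketch. The `MineruSettings` class and `load_mineru_settings` function are hypothetical names, not the project's actual settings code; the defaults simply mirror the example values above, and `MINERU_LANG_LIST` is assumed to hold a JSON array.

```python
import json
import os
from dataclasses import dataclass, field


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as "true", "1", or "yes"."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class MineruSettings:
    # Hypothetical container; defaults mirror the example values above.
    api_url: str = "http://mineru-api:8000"
    timeout: int = 300  # seconds
    lang_list: list = field(default_factory=lambda: ["ch"])
    backend: str = "pipeline"
    parse_method: str = "auto"
    formula_enable: bool = True
    table_enable: bool = True


def load_mineru_settings(env=None) -> MineruSettings:
    """Build settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    defaults = MineruSettings()
    return MineruSettings(
        # Strip a trailing slash so "/file_parse" can be appended safely.
        api_url=env.get("MINERU_API_URL", defaults.api_url).rstrip("/"),
        timeout=int(env.get("MINERU_TIMEOUT", defaults.timeout)),
        # Assumption: the language list is stored as a JSON array string.
        lang_list=json.loads(env.get("MINERU_LANG_LIST", '["ch"]')),
        backend=env.get("MINERU_BACKEND", defaults.backend),
        parse_method=env.get("MINERU_PARSE_METHOD", defaults.parse_method),
        formula_enable=_as_bool(env.get("MINERU_FORMULA_ENABLE", "true")),
        table_enable=_as_bool(env.get("MINERU_TABLE_ENABLE", "true")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.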

4. API Endpoint

The processor expects Mineru to provide a REST API endpoint at /file_parse that accepts PDF files via multipart form data and returns JSON with markdown content.

Expected Request Format:

POST /file_parse
Content-Type: multipart/form-data

files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
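
The request above could be assembled roughly as follows. This is a sketch rather than the processor's actual code: `build_form_fields` and `parse_pdf` are hypothetical helpers, the third-party `requests` package is assumed to be installed, and encoding `lang_list` as a JSON string is an assumption (the server may instead expect one repeated form field per language).

```python
import json
import os


def build_form_fields(lang_list=("ch",), backend="pipeline", parse_method="auto",
                      formula_enable=True, table_enable=True,
                      start_page_id=0, end_page_id=99999):
    """Return the non-file form fields as strings, mirroring the listing above."""
    def flag(value):
        return "true" if value else "false"

    return {
        "output_dir": "./output",
        # Assumption: the list is sent as a JSON string.
        "lang_list": json.dumps(list(lang_list)),
        "backend": backend,
        "parse_method": parse_method,
        "formula_enable": flag(formula_enable),
        "table_enable": flag(table_enable),
        # Only markdown is requested; the other outputs are switched off.
        "return_md": "true",
        "return_middle_json": "false",
        "return_model_output": "false",
        "return_content_list": "false",
        "return_images": "false",
        "start_page_id": str(start_page_id),
        "end_page_id": str(end_page_id),
    }


def parse_pdf(pdf_path, api_url="http://mineru-api:8000", timeout=300):
    """POST the PDF to /file_parse and return the decoded JSON body."""
    import requests  # third-party; imported lazily so the helper above has no dependencies

    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            f"{api_url}/file_parse",
            files={"files": (os.path.basename(pdf_path), pdf, "application/pdf")},
            data=build_form_fields(),
            timeout=timeout,
        )
    response.raise_for_status()  # raise on 4xx/5xx status codes
    return response.json()
```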

Expected Response Format:

The processor accepts any of the following response shapes and reads the markdown from whichever key is present:

{
  "markdown": "# Document Title\n\nContent here..."
}

{
  "md": "# Document Title\n\nContent here..."
}

{
  "content": "# Document Title\n\nContent here..."
}

{
  "result": {
    "markdown": "# Document Title\n\nContent here..."
  }
}
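
One way to handle these shapes is a small fallback lookup. The sketch below is illustrative only: `extract_markdown` is a hypothetical name, and the order in which keys are tried is an assumption, since the source does not state the processor's actual precedence.

```python
def extract_markdown(payload):
    """Return the first non-empty markdown string found in payload, else None."""
    if not isinstance(payload, dict):
        return None
    # Try the flat shapes first: "markdown", then "md", then "content".
    for key in ("markdown", "md", "content"):
        value = payload.get(key)
        if isinstance(value, str) and value.strip():
            return value
    # Fall back to the nested {"result": {...}} shape.
    nested = payload.get("result")
    if isinstance(nested, dict):
        return extract_markdown(nested)
    return None
```

Returning `None` rather than an empty string lets the caller distinguish "no markdown found" from a successful conversion and log the raw response for inspection.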

Usage

Basic Usage

from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor

# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")

# Read and convert PDF to markdown
content = processor.read_content()

# Process content (apply masking)
processed_content = processor.process_content(content)

# Save processed content
processor.save_content(processed_content)

Through Document Service

from app.core.services.document_service import DocumentService

service = DocumentService()
success = service.process_document("input.pdf", "output.md")

Testing

Run the test script to verify the implementation:

cd backend
python test_pdf_processor.py

Make sure you have:

A sample PDF file in the sample_doc/ directory
Mineru API service running and accessible
Proper network connectivity between services

Error Handling

The processor handles various error scenarios:

Network Timeouts: Configurable through MINERU_TIMEOUT (default: 300 seconds, i.e. 5 minutes)
API Errors: HTTP status code errors are logged and handled
Response Parsing: Multiple fallback strategies for extracting markdown content
File Operations: Proper error handling for file reading/writing
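
The scenarios above can be sketched as a single wrapper around the API call. This is an assumption about structure, not the processor's real code: `safe_parse`, `send`, and `extract` are hypothetical names, where `send` performs the HTTP request and returns decoded JSON, and `extract` pulls the markdown out of it. Exceptions raised by the `requests` library derive from `OSError`, so the second handler also covers them.

```python
import logging

logger = logging.getLogger("pdf_processor_sketch")  # hypothetical logger name


def safe_parse(send, extract):
    """Call send() and extract(); return markdown, or None after logging a failure."""
    try:
        payload = send()
    except TimeoutError:
        # Must come before OSError, because TimeoutError is a subclass of it.
        logger.error("Mineru request timed out")
        return None
    except OSError as exc:
        # Connection refused, DNS failures, and other I/O errors.
        logger.error("Mineru request failed: %s", exc)
        return None
    except ValueError as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        logger.error("Mineru response was not valid JSON: %s", exc)
        return None

    markdown = extract(payload)
    if not markdown:
        logger.warning("Mineru response contained no markdown content")
        return None
    return markdown
```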

Logging

The processor provides detailed logging for debugging:

API call attempts and responses
Content extraction results
Error conditions and stack traces
Processing statistics

Deployment

Docker Compose

Ensure your Mineru service is running and accessible. The default configuration expects it at http://mineru-api:8000.

Environment Variables

Set the following environment variables in your deployment:

MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300

Troubleshooting

Common Issues

Connection Refused: Check if Mineru service is running and accessible
Timeout Errors: Increase MINERU_TIMEOUT for large PDF files
Empty Content: Check Mineru API response format and logs
Network Issues: Verify network connectivity between services
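
To tell a connection problem apart from an API problem, a plain TCP check against the configured URL can help. The helpers below are a hypothetical diagnostic sketch using only the standard library; they confirm only that something is listening on the host and port, not that the /file_parse endpoint works.

```python
import socket
from urllib.parse import urlsplit


def host_and_port(api_url):
    """Split an API URL into (hostname, port), defaulting the port by scheme."""
    parts = urlsplit(api_url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def is_reachable(api_url, timeout=3.0):
    """Return True if a TCP connection to the API host and port succeeds."""
    host, port = host_and_port(api_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `is_reachable("http://mineru-api:8000")` run from inside the backend container checks the default service address used in this document.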

Debug Mode

Enable debug logging to see detailed API interactions:

import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)

Migration from magic_pdf

If you were previously using magic_pdf:

No Code Changes Required: The interface remains the same
Configuration Update: Add Mineru API settings
Service Dependencies: Ensure Mineru service is running
Testing: Run the test script to verify functionality

Performance Considerations

Timeout: Large PDFs may require longer timeouts
Memory: The processor loads the entire PDF into memory for API calls
Network: API calls add network latency to processing time
Caching: Consider implementing caching for frequently processed documents
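
The caching suggestion could be prototyped as below. This is a sketch under assumptions, not part of the processor: `file_digest` and `cached_convert` are hypothetical helpers, `convert` stands in for whatever function calls the Mineru API, and the cache directory name is arbitrary. Results are keyed by the SHA-256 of the PDF bytes, so a renamed copy of the same file still hits the cache.

```python
import hashlib
from pathlib import Path


def file_digest(pdf_path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(pdf_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def cached_convert(pdf_path, convert, cache_dir=".mineru_cache"):
    """Return cached markdown for pdf_path, calling convert(pdf_path) on a miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{file_digest(pdf_path)}.md"
    if entry.exists():
        return entry.read_text(encoding="utf-8")
    markdown = convert(pdf_path)
    entry.write_text(markdown, encoding="utf-8")
    return markdown
```

Note that this key ignores the parse settings; if options such as the backend or language list change between runs, they would need to be folded into the cache key as well.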
