PDF Processor with Mineru API
Overview
The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.
Changes Made
1. Removed Dependencies
- Removed all magic_pdf imports and dependencies
- Removed direct PyPDF2 usage (though kept in requirements for potential other uses)
2. New Implementation
- REST API Integration: Uses HTTP requests to call Mineru's API
- Configurable Settings: Mineru API URL and timeout are configurable
- Error Handling: Comprehensive error handling for network issues, timeouts, and API errors
- Flexible Response Parsing: Handles multiple possible response formats from Mineru API
3. Configuration
Add the following settings to your environment or .env file:
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
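As an illustration of how these variables might be read, here is a minimal sketch using only the standard library. The `MineruSettings` class and `load_mineru_settings` function are hypothetical names, not part of the processor, and using the example values above as fallbacks is an assumption.

```python
import json
import os
from dataclasses import dataclass, field


def _env_bool(name: str, default: bool) -> bool:
    """Treat "1", "true", "yes" and "on" (any case) as True."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class MineruSettings:
    # Hypothetical container; fields mirror the variables listed above.
    api_url: str = "http://mineru-api:8000"
    timeout: int = 300  # seconds
    lang_list: list = field(default_factory=lambda: ["ch"])
    backend: str = "pipeline"
    parse_method: str = "auto"
    formula_enable: bool = True
    table_enable: bool = True


def load_mineru_settings() -> MineruSettings:
    """Read settings from the environment, falling back to the example values."""
    d = MineruSettings()
    lang_raw = os.environ.get("MINERU_LANG_LIST")
    return MineruSettings(
        api_url=os.environ.get("MINERU_API_URL", d.api_url).rstrip("/"),
        timeout=int(os.environ.get("MINERU_TIMEOUT", d.timeout)),
        # MINERU_LANG_LIST is written as a JSON list, e.g. ["ch"].
        lang_list=json.loads(lang_raw) if lang_raw else d.lang_list,
        backend=os.environ.get("MINERU_BACKEND", d.backend),
        parse_method=os.environ.get("MINERU_PARSE_METHOD", d.parse_method),
        formula_enable=_env_bool("MINERU_FORMULA_ENABLE", d.formula_enable),
        table_enable=_env_bool("MINERU_TABLE_ENABLE", d.table_enable),
    )
```

Parsing the language list with `json.loads` keeps the `.env` value identical to what is shown above; a comma-separated format would work equally well if the deployment prefers it.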
4. API Endpoint
The processor expects Mineru to provide a REST API endpoint at /file_parse that accepts PDF files via multipart form data and returns JSON with markdown content.
Expected Request Format:
POST /file_parse
Content-Type: multipart/form-data
files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
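The request above could be assembled roughly as follows. This is a sketch, not the processor's actual code: `build_form_fields` and `parse_pdf` are hypothetical helpers, the third-party `requests` library is assumed to be the HTTP client, and the way list and boolean fields are encoded (a JSON string and lowercase "true"/"false") is an assumption that may need adjusting to what the Mineru service accepts.

```python
import json
import os


def build_form_fields(lang_list=("ch",), backend="pipeline", parse_method="auto",
                      formula_enable=True, table_enable=True,
                      start_page_id=0, end_page_id=99999):
    """Return the non-file form fields from the request format as strings."""
    def flag(value):
        return "true" if value else "false"

    return {
        "output_dir": "./output",
        "lang_list": json.dumps(list(lang_list)),  # assumed wire encoding
        "backend": backend,
        "parse_method": parse_method,
        "formula_enable": flag(formula_enable),
        "table_enable": flag(table_enable),
        "return_md": "true",
        "return_middle_json": "false",
        "return_model_output": "false",
        "return_content_list": "false",
        "return_images": "false",
        "start_page_id": str(start_page_id),
        "end_page_id": str(end_page_id),
    }


def parse_pdf(api_url, pdf_path, timeout=300):
    """POST the PDF to /file_parse and return the decoded JSON response."""
    import requests  # third-party; imported here so the helper above has no dependency

    with open(pdf_path, "rb") as fh:
        files = {"files": (os.path.basename(pdf_path), fh, "application/pdf")}
        response = requests.post(
            f"{api_url.rstrip('/')}/file_parse",
            data=build_form_fields(),
            files=files,
            timeout=timeout,
        )
    response.raise_for_status()  # surface HTTP 4xx/5xx as exceptions
    return response.json()
```

Passing both `data` and `files` makes `requests` encode the body as multipart/form-data, so the Content-Type header does not need to be set by hand.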
Expected Response Format:
The processor can handle multiple response formats:
{
  "markdown": "# Document Title\n\nContent here..."
}
OR
{
  "md": "# Document Title\n\nContent here..."
}
OR
{
  "content": "# Document Title\n\nContent here..."
}
OR
{
  "result": {
    "markdown": "# Document Title\n\nContent here..."
  }
}
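One way to handle these four shapes is a small lookup with fallbacks. The function below is a sketch under that assumption; `extract_markdown` is a hypothetical name and the key order is a guess at priority, not the processor's confirmed behavior.

```python
from typing import Any, Optional

# Keys tried in order at each level of the response.
_MARKDOWN_KEYS = ("markdown", "md", "content")


def extract_markdown(payload: Any) -> Optional[str]:
    """Return the markdown string from a Mineru-style response, or None."""
    if not isinstance(payload, dict):
        return None
    for key in _MARKDOWN_KEYS:
        value = payload.get(key)
        if isinstance(value, str) and value.strip():
            return value
    # Fall back to the nested {"result": {...}} form.
    nested = payload.get("result")
    if isinstance(nested, dict):
        return extract_markdown(nested)
    return None
```

Because the nested case recurses, this sketch also accepts `{"result": {"md": ...}}`, which is slightly more permissive than the formats listed. Returning `None` rather than an empty string lets the caller distinguish "no content found" from a successful parse.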
Usage
Basic Usage
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor
# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")
# Read and convert PDF to markdown
content = processor.read_content()
# Process content (apply masking)
processed_content = processor.process_content(content)
# Save processed content
processor.save_content(processed_content)
Through Document Service
from app.core.services.document_service import DocumentService
service = DocumentService()
success = service.process_document("input.pdf", "output.md")
Testing
Run the test script to verify the implementation:
cd backend
python test_pdf_processor.py
Make sure you have:
- A sample PDF file in the sample_doc/ directory
- Mineru API service running and accessible
- Proper network connectivity between services
Error Handling
The processor handles various error scenarios:
- Network Timeouts: Configurable timeout via MINERU_TIMEOUT (default: 300 seconds, i.e. 5 minutes)
- API Errors: HTTP status code errors are logged and handled
- Response Parsing: Multiple fallback strategies for extracting markdown content
- File Operations: Proper error handling for file reading/writing
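The categories above can be illustrated with a small classifier. The source does not say which HTTP client the processor uses, so this sketch relies only on the standard library's exception types; `classify_error` and `run_safely` are hypothetical helpers, and a client such as `requests` would need its own exception classes mapped in the same way.

```python
import logging
import socket
import urllib.error

logger = logging.getLogger(__name__)


def classify_error(exc: BaseException) -> str:
    """Map an exception to one of the error categories listed above."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return "api_error"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    if isinstance(exc, urllib.error.URLError):
        # urllib can wrap a timeout as URLError(reason=<timeout>).
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return "timeout"
        return "network"
    if isinstance(exc, ConnectionError):
        return "network"
    if isinstance(exc, (ValueError, KeyError)):
        # json.JSONDecodeError is a ValueError.
        return "response_parsing"
    if isinstance(exc, OSError):
        return "file"
    return "unexpected"


def run_safely(step, *args):
    """Run one processing step; return (result, None) or (None, category)."""
    try:
        return step(*args), None
    except Exception as exc:
        category = classify_error(exc)
        # exc_info=True attaches the stack trace to the log record.
        logger.error("PDF processing failed (%s): %s", category, exc, exc_info=True)
        return None, category
```

Checking the most specific classes first matters here because several of these exceptions share `OSError` as a base class.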
Logging
The processor provides detailed logging for debugging:
- API call attempts and responses
- Content extraction results
- Error conditions and stack traces
- Processing statistics
Deployment
Docker Compose
Ensure your Mineru service is running and accessible. The default configuration expects it at http://mineru-api:8000.
Environment Variables
Set the following environment variables in your deployment:
MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300
Troubleshooting
Common Issues
- Connection Refused: Check if Mineru service is running and accessible
- Timeout Errors: Increase MINERU_TIMEOUT for large PDF files
- Empty Content: Check Mineru API response format and logs
- Network Issues: Verify network connectivity between services
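For the connection and network items, a quick TCP check from inside the calling container can separate "service down" from "wrong URL". This is a generic standard-library sketch; `host_and_port` and `is_reachable` are hypothetical helpers, and a successful connection only shows that the port is open, not that the API itself is healthy.

```python
import socket
from urllib.parse import urlsplit


def host_and_port(api_url: str):
    """Split a URL such as MINERU_API_URL into (host, port)."""
    parts = urlsplit(api_url)
    # Fall back to the scheme's default port when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def is_reachable(api_url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the service can be opened."""
    host, port = host_and_port(api_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `is_reachable("http://mineru-api:8000")` returning `False` points to a service or network problem rather than a response-format problem.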
Debug Mode
Enable debug logging to see detailed API interactions:
import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
Migration from magic_pdf
If you were previously using magic_pdf:
- No Code Changes Required: The interface remains the same
- Configuration Update: Add Mineru API settings
- Service Dependencies: Ensure Mineru service is running
- Testing: Run the test script to verify functionality
Performance Considerations
- Timeout: Large PDFs may require longer timeouts
- Memory: The processor loads the entire PDF into memory for API calls
- Network: API calls add network latency to processing time
- Caching: Consider implementing caching for frequently processed documents
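The caching suggestion could be sketched as a content-addressed file cache. This is one possible approach, not something the processor currently implements: `cached_convert` is a hypothetical helper, and `convert` stands in for whatever function sends the PDF bytes to Mineru and returns markdown.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_convert(pdf_path: str, cache_dir: str,
                   convert: Callable[[bytes], str]) -> str:
    """Return markdown for a PDF, calling convert() only on a cache miss."""
    data = Path(pdf_path).read_bytes()
    # Key on file content so renamed or copied PDFs still hit the cache.
    digest = hashlib.sha256(data).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.md"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    markdown = convert(data)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

The cache key here ignores the Mineru settings; if options such as the backend or parse method can change between runs, they would need to be mixed into the hash as well.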