# PDF Processor with Mineru API

## Overview

The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.

## Changes Made

### 1. Removed Dependencies

- Removed all `magic_pdf` imports and dependencies
- Removed `PyPDF2` direct usage (though kept in requirements for potential other uses)

### 2. New Implementation

- **REST API Integration**: Uses HTTP requests to call Mineru's API
- **Configurable Settings**: Mineru API URL and timeout are configurable
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
- **Flexible Response Parsing**: Handles multiple possible response formats from Mineru API

### 3. Configuration

Add the following settings to your environment or `.env` file:

```bash
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
```

### 4. API Endpoint

The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.

#### Expected Request Format:

```
POST /file_parse
Content-Type: multipart/form-data

files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
```

#### Expected Response Format:

The processor can handle multiple response formats:

```json
{
  "markdown": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "md": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "content": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "result": {
    "markdown": "# Document Title\n\nContent here..."
  }
}
```

## Usage

### Basic Usage

```python
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor

# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")

# Read and convert PDF to markdown
content = processor.read_content()

# Process content (apply masking)
processed_content = processor.process_content(content)

# Save processed content
processor.save_content(processed_content)
```

### Through Document Service

```python
from app.core.services.document_service import DocumentService

service = DocumentService()
success = service.process_document("input.pdf", "output.md")
```

## Testing

Run the test script to verify the implementation:

```bash
cd backend
python test_pdf_processor.py
```

Make sure you have:

1. A sample PDF file in the `sample_doc/` directory
2. Mineru API service running and accessible
3. Proper network connectivity between services

## Error Handling

The processor handles various error scenarios:

- **Network Timeouts**: Configurable timeout (default: 5 minutes)
- **API Errors**: HTTP status code errors are logged and handled
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
- **File Operations**: Proper error handling for file reading/writing

## Logging

The processor provides detailed logging for debugging:

- API call attempts and responses
- Content extraction results
- Error conditions and stack traces
- Processing statistics

## Deployment

### Docker Compose

Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.

### Environment Variables

Set the following environment variables in your deployment:

```bash
MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300
```

## Troubleshooting

### Common Issues

1. **Connection Refused**: Check if Mineru service is running and accessible
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
3. **Empty Content**: Check Mineru API response format and logs
4. **Network Issues**: Verify network connectivity between services

### Debug Mode

Enable debug logging to see detailed API interactions:

```python
import logging

logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
```

## Migration from magic_pdf

If you were previously using magic_pdf:

1. **No Code Changes Required**: The interface remains the same
2. **Configuration Update**: Add Mineru API settings
3. **Service Dependencies**: Ensure Mineru service is running
4. **Testing**: Run the test script to verify functionality

## Performance Considerations

- **Timeout**: Large PDFs may require longer timeouts
- **Memory**: The processor loads the entire PDF into memory for API calls
- **Network**: API calls add network latency to processing time
- **Caching**: Consider implementing caching for frequently processed documents
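
The caching suggestion above could be approached roughly as follows. This is a minimal sketch, not part of the current processor: `cached_convert`, its `cache_dir` argument, and the `convert` callable are hypothetical names introduced for illustration. It keys the cache on a SHA-256 hash of the PDF bytes, so an unchanged file skips the API call.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_convert(pdf_path: str, cache_dir: str, convert: Callable[[str], str]) -> str:
    """Return markdown for pdf_path, reusing a stored result when the file bytes match.

    `convert` is a hypothetical stand-in for whatever performs the real
    conversion (for example, a call that ends up hitting the Mineru API).
    """
    # Hash the file contents so renamed or copied files still hit the cache.
    data = Path(pdf_path).read_bytes()
    key = hashlib.sha256(data).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.md"

    # Cache hit: skip the conversion entirely.
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    # Cache miss: convert, then store the result for next time.
    markdown = convert(pdf_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

Assuming `read_content()` returns the markdown string as in the Basic Usage example, the `convert` argument could be something like `lambda p: PdfDocumentProcessor(p, "output.md").read_content()`. Note that this sketch ignores the Mineru settings (language list, backend, parse method); if those change between runs, they would need to be folded into the cache key as well.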