legal-doc-masker/backend/docs/PDF_PROCESSOR_README.md

202 lines
4.8 KiB
Markdown

# PDF Processor with Mineru API
## Overview
The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.
## Changes Made
### 1. Removed Dependencies
- Removed all `magic_pdf` imports and dependencies
- Removed `PyPDF2` direct usage (though kept in requirements for potential other uses)
### 2. New Implementation
- **REST API Integration**: Uses HTTP requests to call Mineru's API
- **Configurable Settings**: Mineru API URL and timeout are configurable
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
- **Flexible Response Parsing**: Handles multiple possible response formats from Mineru API
### 3. Configuration
Add the following settings to your environment or `.env` file:
```bash
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
```
### 4. API Endpoint
The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.
#### Expected Request Format:
```
POST /file_parse
Content-Type: multipart/form-data
files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
```
#### Expected Response Format:
The processor can handle multiple response formats:
```json
{
"markdown": "# Document Title\n\nContent here..."
}
```
OR
```json
{
"md": "# Document Title\n\nContent here..."
}
```
OR
```json
{
"content": "# Document Title\n\nContent here..."
}
```
OR
```json
{
"result": {
"markdown": "# Document Title\n\nContent here..."
}
}
```
## Usage
### Basic Usage
```python
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor
# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")
# Read and convert PDF to markdown
content = processor.read_content()
# Process content (apply masking)
processed_content = processor.process_content(content)
# Save processed content
processor.save_content(processed_content)
```
### Through Document Service
```python
from app.core.services.document_service import DocumentService
service = DocumentService()
success = service.process_document("input.pdf", "output.md")
```
## Testing
Run the test script to verify the implementation:
```bash
cd backend
python test_pdf_processor.py
```
Make sure you have:
1. A sample PDF file in the `sample_doc/` directory
2. Mineru API service running and accessible
3. Proper network connectivity between services
## Error Handling
The processor handles various error scenarios:
- **Network Timeouts**: Configurable timeout (default: 5 minutes)
- **API Errors**: HTTP status code errors are logged and handled
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
- **File Operations**: Proper error handling for file reading/writing
## Logging
The processor provides detailed logging for debugging:
- API call attempts and responses
- Content extraction results
- Error conditions and stack traces
- Processing statistics
## Deployment
### Docker Compose
Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.
### Environment Variables
Set the following environment variables in your deployment:
```bash
MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300
```
## Troubleshooting
### Common Issues
1. **Connection Refused**: Check if Mineru service is running and accessible
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
3. **Empty Content**: Check Mineru API response format and logs
4. **Network Issues**: Verify network connectivity between services
### Debug Mode
Enable debug logging to see detailed API interactions:
```python
import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
```
## Migration from magic_pdf
If you were previously using magic_pdf:
1. **No Code Changes Required**: The interface remains the same
2. **Configuration Update**: Add Mineru API settings
3. **Service Dependencies**: Ensure Mineru service is running
4. **Testing**: Run the test script to verify functionality
## Performance Considerations
- **Timeout**: Large PDFs may require longer timeouts
- **Memory**: The processor loads the entire PDF into memory for API calls
- **Network**: API calls add network latency to processing time
- **Caching**: Consider implementing caching for frequently processed documents