# PDF Processor with Mineru API

## Overview

The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.

## Changes Made

### 1. Removed Dependencies
- Removed all `magic_pdf` imports and dependencies
- Removed direct `PyPDF2` usage (it remains in the requirements for potential other uses)

### 2. New Implementation
- **REST API Integration**: Uses HTTP requests to call Mineru's API
- **Configurable Settings**: Mineru API URL and timeout are configurable
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
- **Flexible Response Parsing**: Handles multiple possible response formats from the Mineru API

### 3. Configuration

Add the following settings to your environment or `.env` file:

```bash
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
```

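As a rough illustration of how these variables might be read at startup, the sketch below uses only the standard library. `MineruSettings` and `load_settings` are hypothetical names, not the project's actual settings module, and the defaults simply mirror the example values above. It also assumes `MINERU_LANG_LIST` holds a JSON array.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List, Mapping, Optional


@dataclass
class MineruSettings:
    """Hypothetical container for the variables listed above."""
    api_url: str = "http://mineru-api:8000"
    timeout: int = 300
    lang_list: List[str] = field(default_factory=lambda: ["ch"])
    backend: str = "pipeline"
    parse_method: str = "auto"
    formula_enable: bool = True
    table_enable: bool = True


def _as_bool(value: str) -> bool:
    # Common truthy spellings map to True; anything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> MineruSettings:
    """Build settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    defaults = MineruSettings()
    # Assumption: the language list is stored as a JSON array, e.g. ["ch"].
    raw_langs = env.get("MINERU_LANG_LIST")
    return MineruSettings(
        api_url=env.get("MINERU_API_URL", defaults.api_url).rstrip("/"),
        timeout=int(env.get("MINERU_TIMEOUT", defaults.timeout)),
        lang_list=json.loads(raw_langs) if raw_langs else defaults.lang_list,
        backend=env.get("MINERU_BACKEND", defaults.backend),
        parse_method=env.get("MINERU_PARSE_METHOD", defaults.parse_method),
        formula_enable=_as_bool(env.get("MINERU_FORMULA_ENABLE", "true")),
        table_enable=_as_bool(env.get("MINERU_TABLE_ENABLE", "true")),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to exercise in tests.
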
### 4. API Endpoint

The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.

#### Expected Request Format:
```
POST /file_parse
Content-Type: multipart/form-data

files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
```

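A request of this shape could be sent with the third-party `requests` library roughly as follows. This is a sketch, not the processor's actual code: `build_form_fields` and `parse_pdf` are hypothetical helpers, and how the server expects booleans and the language list to be encoded (here, the strings `"true"`/`"false"` and a repeated `lang_list` field) is an assumption.

```python
from pathlib import Path


def build_form_fields(lang_list=("ch",), start_page_id=0, end_page_id=99999):
    """Return the non-file form fields listed above as a plain dict."""
    return {
        "output_dir": "./output",
        # requests encodes a list value as one repeated form field per item.
        "lang_list": list(lang_list),
        "backend": "pipeline",
        "parse_method": "auto",
        # Assumption: booleans are accepted as lowercase strings.
        "formula_enable": "true",
        "table_enable": "true",
        "return_md": "true",
        "return_middle_json": "false",
        "return_model_output": "false",
        "return_content_list": "false",
        "return_images": "false",
        "start_page_id": start_page_id,
        "end_page_id": end_page_id,
    }


def parse_pdf(pdf_path, api_url="http://mineru-api:8000", timeout=300):
    """POST one PDF to /file_parse and return the decoded JSON body."""
    # Imported here so build_form_fields stays usable without requests installed.
    import requests

    path = Path(pdf_path)
    with path.open("rb") as handle:
        response = requests.post(
            f"{api_url}/file_parse",
            files={"files": (path.name, handle, "application/pdf")},
            data=build_form_fields(),
            timeout=timeout,
        )
    # Raise requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.json()
```

Calling `parse_pdf("input.pdf")` requires a reachable Mineru service; `build_form_fields()` can be inspected on its own to check the payload.
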
#### Expected Response Format:
The processor can handle multiple response formats:

```json
{
  "markdown": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "md": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "content": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "result": {
    "markdown": "# Document Title\n\nContent here..."
  }
}
```

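The fallback over these shapes can be sketched as a small function that tries each known key in turn. `extract_markdown` and `CANDIDATE_KEYS` are illustrative names, not necessarily what the processor uses, and the sketch is slightly more permissive than the list above because it also accepts `md` or `content` inside `result`.

```python
from typing import Any, Optional

# Top-level keys tried in order, matching the formats shown above.
CANDIDATE_KEYS = ("markdown", "md", "content")


def extract_markdown(payload: Any) -> Optional[str]:
    """Return the first non-empty markdown string found, or None."""
    if not isinstance(payload, dict):
        return None
    for key in CANDIDATE_KEYS:
        value = payload.get(key)
        if isinstance(value, str) and value.strip():
            return value
    # Fall back to the nested {"result": {...}} shape.
    nested = payload.get("result")
    if isinstance(nested, dict):
        return extract_markdown(nested)
    return None
```

Returning `None` rather than raising lets the caller decide whether empty content is an error worth logging.
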
## Usage

### Basic Usage

```python
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor

# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")

# Read and convert PDF to markdown
content = processor.read_content()

# Process content (apply masking)
processed_content = processor.process_content(content)

# Save processed content
processor.save_content(processed_content)
```

### Through Document Service

```python
from app.core.services.document_service import DocumentService

service = DocumentService()
success = service.process_document("input.pdf", "output.md")
```

## Testing

Run the test script to verify the implementation:

```bash
cd backend
python test_pdf_processor.py
```

Make sure you have:
1. A sample PDF file in the `sample_doc/` directory
2. Mineru API service running and accessible
3. Proper network connectivity between services

## Error Handling

The processor handles various error scenarios:

- **Network Timeouts**: Configurable timeout (default: 5 minutes)
- **API Errors**: HTTP status code errors are logged and handled
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
- **File Operations**: Proper error handling for file reading/writing

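The scenarios above could be mapped onto exception handling along these lines. This is a minimal sketch using built-in exception types only; `convert_safely` is a hypothetical wrapper, and a client built on `requests` would more likely catch that library's own exception classes.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def convert_safely(call_api: Callable[[], dict]) -> Optional[dict]:
    """Run call_api and turn the failure modes above into None plus a log entry."""
    try:
        return call_api()
    except TimeoutError:
        # The call did not finish within the configured timeout.
        logger.error("Mineru API call timed out")
    except ValueError as exc:
        # The response body could not be decoded or had an unexpected shape.
        logger.error("Could not parse Mineru API response: %s", exc)
    except OSError as exc:
        # Connection failures and file read/write errors are OSError subclasses.
        logger.error("Mineru API call failed: %s", exc)
    return None
```

`TimeoutError` is listed before `OSError` because it is a subclass of it; reversing the order would hide the more specific message.
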
## Logging

The processor provides detailed logging for debugging:

- API call attempts and responses
- Content extraction results
- Error conditions and stack traces
- Processing statistics

## Deployment

### Docker Compose

Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.

### Environment Variables

Set the following environment variables in your deployment:

```bash
MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300
```

## Troubleshooting

### Common Issues

1. **Connection Refused**: Check that the Mineru service is running and accessible
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
3. **Empty Content**: Check the Mineru API response format and logs
4. **Network Issues**: Verify network connectivity between services

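For the connection and network issues above, a quick TCP-level check from inside the calling container can narrow things down. The sketch below uses only the standard library; `can_connect` is a hypothetical helper, and it only confirms that the port accepts connections, not that `/file_parse` works.

```python
import socket
from urllib.parse import urlsplit


def can_connect(api_url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(api_url)
    if not parts.hostname:
        return False
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

For example, `can_connect("http://mineru-api:8000")` returning `False` points at the service or the network rather than at the processor.
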
### Debug Mode

Enable debug logging to see detailed API interactions:

```python
import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
```

## Migration from magic_pdf

If you were previously using magic_pdf:

1. **No Code Changes Required**: The interface remains the same
2. **Configuration Update**: Add Mineru API settings
3. **Service Dependencies**: Ensure Mineru service is running
4. **Testing**: Run the test script to verify functionality

## Performance Considerations

- **Timeout**: Large PDFs may require longer timeouts
- **Memory**: The processor loads the entire PDF into memory for API calls
- **Network**: API calls add network latency to processing time
- **Caching**: Consider implementing caching for frequently processed documents
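
One possible shape for the caching idea is a content-addressed cache on disk, keyed by the SHA-256 of the PDF so that renamed copies of the same file still hit. This is only a sketch: `file_digest`, `cached_convert`, and the `convert` callable are hypothetical and not part of the current processor.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(pdf_path: Path) -> str:
    """Return the SHA-256 hex digest of the file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_convert(pdf_path, cache_dir, convert: Callable[[Path], str]) -> str:
    """Return cached markdown for this PDF's content, calling convert on a miss."""
    pdf_path, cache_dir = Path(pdf_path), Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_digest(pdf_path)}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    # Cache miss: run the (slow) conversion and store the result.
    markdown = convert(pdf_path)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

The cache key ignores the Mineru settings, so a real implementation would probably need to include them in the key or clear the cache when they change.
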