
# PDF Processor with Mineru API

## Overview

The PDF processor now calls Mineru's REST API over HTTP instead of importing the `magic_pdf` library. This decouples PDF parsing from the backend process and lets the Mineru service be deployed separately.

## Changes Made

### 1. Removed Dependencies
- Removed all `magic_pdf` imports and dependencies
- Removed direct use of `PyPDF2` (the package remains in the requirements in case other code needs it)

### 2. New Implementation
- **REST API Integration**: Uses HTTP requests to call Mineru's API
- **Configurable Settings**: The Mineru API URL, timeout, and parsing options are read from configuration
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
- **Flexible Response Parsing**: Handles multiple possible response formats from Mineru API

### 3. Configuration

Add the following settings to your environment or `.env` file:

```bash
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
```
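
One way to read these settings in Python is sketched below. The `MineruSettings` dataclass and the `load_mineru_settings` helper are hypothetical names, not the project's actual configuration code. The sketch assumes that `MINERU_LANG_LIST` holds a JSON array and that boolean flags are written as `true` or `false`.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List, Mapping


def _parse_bool(value: str) -> bool:
    """Treat 'true', '1', 'yes', and 'on' (any case) as True."""
    return value.strip().lower() in {"true", "1", "yes", "on"}


@dataclass
class MineruSettings:
    # Hypothetical container; defaults mirror the example values above.
    api_url: str = "http://mineru-api:8000"
    timeout: int = 300  # seconds
    lang_list: List[str] = field(default_factory=lambda: ["ch"])
    backend: str = "pipeline"
    parse_method: str = "auto"
    formula_enable: bool = True
    table_enable: bool = True


def load_mineru_settings(env: Mapping[str, str] = os.environ) -> MineruSettings:
    """Build settings from environment variables, falling back to defaults."""
    defaults = MineruSettings()
    return MineruSettings(
        api_url=env.get("MINERU_API_URL", defaults.api_url).rstrip("/"),
        timeout=int(env.get("MINERU_TIMEOUT", defaults.timeout)),
        # Assumption: the language list is stored as a JSON array string.
        lang_list=json.loads(env.get("MINERU_LANG_LIST", '["ch"]')),
        backend=env.get("MINERU_BACKEND", defaults.backend),
        parse_method=env.get("MINERU_PARSE_METHOD", defaults.parse_method),
        formula_enable=_parse_bool(env.get("MINERU_FORMULA_ENABLE", "true")),
        table_enable=_parse_bool(env.get("MINERU_TABLE_ENABLE", "true")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.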

### 4. API Endpoint

The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.

#### Expected Request Format:
```text
POST /file_parse
Content-Type: multipart/form-data

files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
```
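
A request of this shape could be assembled roughly as follows. This is a sketch, not the processor's actual code: it assumes the third-party `requests` library, and the helper names `build_form_fields` and `post_pdf` are hypothetical. How Mineru expects list and boolean fields to be encoded is also an assumption here (repeated `lang_list` fields and lowercase `true`/`false` strings).

```python
import os
from typing import Dict, List, Union

FormValue = Union[str, List[str]]


def build_form_fields(lang_list: List[str]) -> Dict[str, FormValue]:
    """Return the non-file form fields listed above, encoded as strings."""
    return {
        "output_dir": "./output",
        # requests sends a list value as one repeated form field per item.
        "lang_list": list(lang_list),
        "backend": "pipeline",
        "parse_method": "auto",
        "formula_enable": "true",
        "table_enable": "true",
        "return_md": "true",
        "return_middle_json": "false",
        "return_model_output": "false",
        "return_content_list": "false",
        "return_images": "false",
        "start_page_id": "0",
        "end_page_id": "99999",
    }


def post_pdf(api_url: str, pdf_path: str, timeout: int = 300) -> dict:
    """POST a PDF to /file_parse and return the decoded JSON body."""
    import requests  # third-party; imported lazily so the helper above stays dependency-free

    with open(pdf_path, "rb") as pdf_file:
        files = {"files": (os.path.basename(pdf_path), pdf_file, "application/pdf")}
        response = requests.post(
            f"{api_url.rstrip('/')}/file_parse",
            data=build_form_fields(["ch"]),
            files=files,
            timeout=timeout,
        )
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.json()
```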

#### Expected Response Format:
The processor accepts the markdown under any of the following keys and response shapes:

```json
{
  "markdown": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "md": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "content": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "result": {
    "markdown": "# Document Title\n\nContent here..."
  }
}
```
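
The fallback lookup across these shapes can be sketched as below. The function name `extract_markdown` and the order in which keys are tried are assumptions for illustration; the processor's own parsing logic may differ. Note that this sketch also accepts `md` or `content` nested under `result`, which is slightly more permissive than the four shapes listed.

```python
from typing import Any, Optional

# Keys checked in order; taken from the response formats listed above.
MARKDOWN_KEYS = ("markdown", "md", "content")


def extract_markdown(payload: Any) -> Optional[str]:
    """Return the first non-empty markdown string found, or None."""
    if not isinstance(payload, dict):
        return None
    for key in MARKDOWN_KEYS:
        value = payload.get(key)
        if isinstance(value, str) and value.strip():
            return value
    # Fall back to the nested {"result": {...}} form.
    nested = payload.get("result")
    if isinstance(nested, dict):
        return extract_markdown(nested)
    return None
```

Returning `None` rather than an empty string lets the caller distinguish "no markdown found" from a successful conversion and log the raw response for inspection.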

## Usage

### Basic Usage

```python
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor

# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")

# Read and convert PDF to markdown
content = processor.read_content()

# Process content (apply masking)
processed_content = processor.process_content(content)

# Save processed content
processor.save_content(processed_content)
```

### Through Document Service

```python
from app.core.services.document_service import DocumentService

service = DocumentService()
success = service.process_document("input.pdf", "output.md")
```

## Testing

Run the test script to verify the implementation:

```bash
cd backend
python test_pdf_processor.py
```

Make sure you have:
1. A sample PDF file in the `sample_doc/` directory
2. Mineru API service running and accessible
3. Proper network connectivity between services

## Error Handling

The processor handles various error scenarios:

- **Network Timeouts**: The request timeout is set by `MINERU_TIMEOUT` in seconds (default: 300, i.e. 5 minutes)
- **API Errors**: HTTP status code errors are logged and handled
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
- **File Operations**: Proper error handling for file reading/writing
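
The network-related cases above might be mapped to a single exception type along these lines. This is an illustrative sketch only: `MineruError` and `call_mineru` are hypothetical names, and the standard-library `urllib` is used so the example runs without extra dependencies. With `requests`, the analogous exceptions are `requests.exceptions.Timeout`, `requests.exceptions.ConnectionError`, and `requests.exceptions.HTTPError`.

```python
import json
import socket
import urllib.error
import urllib.request
from typing import Any, Callable


class MineruError(Exception):
    """Hypothetical wrapper for any failure while calling the Mineru API."""


def call_mineru(
    request: Any,
    timeout: float,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> dict:
    """Send a prepared request and return the decoded JSON body."""
    try:
        with opener(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # HTTPError subclasses URLError, so it must be caught first.
        raise MineruError(f"Mineru API returned HTTP {exc.code}") from exc
    except urllib.error.URLError as exc:
        raise MineruError(f"Could not reach Mineru API: {exc.reason}") from exc
    except (socket.timeout, TimeoutError) as exc:
        raise MineruError(f"Mineru API timed out after {timeout}s") from exc
    except json.JSONDecodeError as exc:
        raise MineruError("Mineru API returned a non-JSON body") from exc
```

The `opener` parameter exists only so the function can be tested with a stand-in for `urlopen`; production code would rely on the default.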

## Logging

The processor provides detailed logging for debugging:

- API call attempts and responses
- Content extraction results
- Error conditions and stack traces
- Processing statistics

## Deployment

### Docker Compose

Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.

### Environment Variables

Set the following environment variables in your deployment:

```bash
MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300
```

## Troubleshooting

### Common Issues

1. **Connection Refused**: Check if Mineru service is running and accessible
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
3. **Empty Content**: Check Mineru API response format and logs
4. **Network Issues**: Verify network connectivity between services

### Debug Mode

Enable debug logging to see detailed API interactions:

```python
import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
```

## Migration from magic_pdf

If you were previously using magic_pdf:

1. **No Code Changes Required**: The `PdfDocumentProcessor` interface is unchanged, so existing callers keep working
2. **Configuration Update**: Add Mineru API settings
3. **Service Dependencies**: Ensure Mineru service is running
4. **Testing**: Run the test script to verify functionality

## Performance Considerations

- **Timeout**: Large PDFs may require longer timeouts
- **Memory**: The processor loads the entire PDF into memory for each API call, so memory use grows with file size
- **Network**: API calls add network latency to processing time
- **Caching**: Consider implementing caching for frequently processed documents
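
As one possible approach to the caching suggestion, converted markdown can be stored on disk under a hash of the PDF's bytes. The `cached_markdown` helper and its `convert` callback are hypothetical; the sketch assumes the conversion result depends only on the file content, so the cache would need to be cleared whenever the Mineru parsing options change.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_markdown(pdf_path: str, cache_dir: str, convert: Callable[[str], str]) -> str:
    """Return markdown for pdf_path, calling convert only on a cache miss."""
    # Key the cache on file content so renamed copies still hit the cache.
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = convert(pdf_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

Because the cached text is the unmasked conversion output, the cache directory should be protected as carefully as the source documents.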