# MagicDoc Service Setup Guide
This guide explains how to set up and use the MagicDoc API service as an alternative to the Mineru API for document processing.
## Overview
The MagicDoc service provides a FastAPI-based REST API that converts various document formats (DOC, DOCX, PPT, PPTX, PDF) to markdown using the Magic-Doc library. It's designed to be compatible with your existing document processors.
## Quick Start
### 1. Build and Run the Service
```bash
cd magicdoc
./start.sh
```
Or manually:
```bash
cd magicdoc
docker-compose up --build -d
```
### 2. Verify the Service
```bash
# Check health
curl http://localhost:8002/health
# View API documentation
open http://localhost:8002/docs
```
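The same health check can be scripted, for example to gate a deployment step. This is a minimal sketch using only the standard library; the helper name `is_healthy` is hypothetical, and it assumes only that `/health` answers with HTTP 200 when the service is up (the response body is not inspected, since this guide does not document it).

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str = "http://localhost:8002", timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "not reachable")
```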
### 3. Test with Sample Files
```bash
cd magicdoc
python test_api.py
```
## API Compatibility
The MagicDoc API is designed to be compatible with your existing Mineru API interface:
### Endpoint: `POST /file_parse`
**Request Format:**
- File upload via multipart form data
- Same parameters as Mineru API (most are optional)
**Response Format:**
```json
{
  "markdown": "converted content",
  "md": "converted content",
  "content": "converted content",
  "text": "converted content",
  "time_cost": 1.23,
  "filename": "document.docx",
  "status": "success"
}
```
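Because the converted text is repeated under four keys, a caller can read whichever key its existing Mineru-oriented code expects. The sketch below shows one way to pull the text out defensively. The function name `extract_markdown` is hypothetical, and treating any `status` other than `"success"` as a failure is an assumption, since error responses are not described in this guide.

```python
from typing import Any, Mapping

# Keys that carry the converted text in the response format shown above.
CONTENT_KEYS = ("markdown", "md", "content", "text")


def extract_markdown(payload: Mapping[str, Any]) -> str:
    """Return the converted markdown from a decoded /file_parse response."""
    # Assumption: anything other than "success" signals a failed conversion.
    if payload.get("status") != "success":
        raise ValueError(f"conversion failed: status={payload.get('status')!r}")
    for key in CONTENT_KEYS:
        value = payload.get(key)
        if isinstance(value, str) and value:
            return value
    raise ValueError("response contains no markdown content")
```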
## Integration with Existing Processors
To use MagicDoc instead of Mineru in your existing processors:
### 1. Update Configuration
Add to your settings:
```python
MAGICDOC_API_URL = "http://magicdoc-api:8000"  # inside the Docker network; use http://localhost:8002 from the host
MAGICDOC_TIMEOUT = 300
```
### 2. Modify Processors
Replace Mineru API calls with MagicDoc API calls. See `integration_example.py` for detailed examples.
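As a rough illustration of what such a call can look like, here is a standard-library sketch that uploads one file to `/file_parse`. It is not taken from `integration_example.py`. The helper names and the form field name `"file"` are assumptions; check the field name against the interactive docs at `/docs` before relying on it.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

MAGICDOC_API_URL = "http://localhost:8002"
MAGICDOC_TIMEOUT = 300


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def parse_file(path: str, field: str = "file") -> dict:
    """POST a document to /file_parse and return the decoded JSON response."""
    # Assumption: the upload field is named "file"; adjust if the API differs.
    p = Path(path)
    body, content_type = build_multipart(field, p.name, p.read_bytes())
    req = urllib.request.Request(
        f"{MAGICDOC_API_URL}/file_parse",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=MAGICDOC_TIMEOUT) as resp:
        return json.load(resp)
```

A processor would then call `parse_file("contract.docx")` and read the `markdown` key of the returned dictionary.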
### 3. Update Docker Compose
Add the MagicDoc service to your main docker-compose.yml:
```yaml
services:
  magicdoc-api:
    build:
      context: ./magicdoc
      dockerfile: Dockerfile
    ports:
      - "8002:8000"
    volumes:
      - ./magicdoc/storage:/app/storage
    environment:
      - PYTHONUNBUFFERED=1
    restart: unless-stopped
```
## Service Architecture
```
magicdoc/
├── app/
│   ├── __init__.py
│   └── main.py              # FastAPI application
├── Dockerfile               # Container definition
├── docker-compose.yml       # Service orchestration
├── requirements.txt         # Python dependencies
├── README.md                # Service documentation
├── SETUP.md                 # This setup guide
├── test_api.py              # API testing script
├── integration_example.py   # Integration examples
└── start.sh                 # Startup script
```
## Dependencies
- **Python 3.10**: Base runtime
- **LibreOffice**: Document processing (installed in container)
- **Magic-Doc**: Document conversion library
- **FastAPI**: Web framework
- **Uvicorn**: ASGI server
## Troubleshooting
### Service Won't Start
1. Check Docker is running
2. Verify port 8002 is available
3. Check logs: `docker-compose logs`
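Step 2 can be checked from Python before starting the container. This is a small sketch with a hypothetical helper name; it only tests whether something is already accepting TCP connections on the port, which is the usual reason the port mapping fails.

```python
import socket


def port_is_free(port: int = 8002, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when a listener accepts the connection.
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    print("port 8002 is free" if port_is_free() else "port 8002 is in use")
```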
### File Conversion Fails
1. Verify LibreOffice is working in container
2. Check file format is supported
3. Review API logs for errors
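For step 2, a client-side check against the formats listed in the Overview can reject unsupported files before they are uploaded. The names below are hypothetical, and the check looks only at the file extension, not at the file contents.

```python
from pathlib import Path

# Formats listed in the Overview section of this guide.
SUPPORTED_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".pdf"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is one MagicDoc is documented to convert."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```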
### Integration Issues
1. Verify API endpoint URL
2. Check network connectivity between services
3. Ensure response format compatibility
## Performance Considerations
- MagicDoc is generally faster than Mineru for simple documents
- The LibreOffice dependency increases the container image size
- Consider caching for repeated conversions
- Monitor memory usage for large files
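One way to approach caching for repeated conversions is to key results by a hash of the file contents, so that identical uploads skip the API call. The class below is an illustrative in-memory sketch, not part of the service. `ConversionCache` and its `convert` callback are hypothetical names, and a real deployment would likely want size limits or persistent storage.

```python
import hashlib
from typing import Callable, Dict


class ConversionCache:
    """Memoize conversion results by the SHA-256 digest of the input bytes."""

    def __init__(self, convert: Callable[[bytes], str]) -> None:
        self._convert = convert          # e.g. a function that calls /file_parse
        self._store: Dict[str, str] = {}

    def get(self, data: bytes) -> str:
        """Return cached markdown for these bytes, converting on first sight."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self._store:
            self._store[key] = self._convert(data)
        return self._store[key]
```

Hashing the contents rather than the filename means a renamed copy of the same document still hits the cache, while an edited file with the same name does not.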
## Security Notes
- Service runs on internal network
- File uploads are temporary
- No persistent storage of uploaded files
- Consider adding authentication for production use