feat:新增magicdoc
This commit is contained in:
parent
a16b69475e
commit
0820d7bba2
|
|
@ -0,0 +1,42 @@
|
||||||
|
FROM python:3.10-slim
|
||||||
|
|
||||||
|
WORKDIR /app
|
||||||
|
|
||||||
|
# Install system dependencies including LibreOffice
|
||||||
|
RUN apt-get update && apt-get install -y \
|
||||||
|
build-essential \
|
||||||
|
libreoffice \
|
||||||
|
libreoffice-writer \
|
||||||
|
libreoffice-calc \
|
||||||
|
libreoffice-impress \
|
||||||
|
wget \
|
||||||
|
curl \
|
||||||
|
&& rm -rf /var/lib/apt/lists/*
|
||||||
|
|
||||||
|
|
||||||
|
# Copy requirements and install Python packages
|
||||||
|
RUN pip install --upgrade pip
|
||||||
|
RUN pip install uv
|
||||||
|
|
||||||
|
# Configure uv and install mineru
|
||||||
|
ENV UV_SYSTEM_PYTHON=1
|
||||||
|
RUN uv pip install --system -U "fairy-doc[cpu]"
|
||||||
|
|
||||||
|
COPY requirements.txt .
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
|
# Copy the application code
|
||||||
|
COPY app/ ./app/
|
||||||
|
|
||||||
|
# Create storage directories
|
||||||
|
RUN mkdir -p storage/uploads storage/processed
|
||||||
|
|
||||||
|
# Expose the port the app runs on
|
||||||
|
EXPOSE 8000
|
||||||
|
|
||||||
|
# Health check
|
||||||
|
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
|
||||||
|
CMD curl -f http://localhost:8000/health || exit 1
|
||||||
|
|
||||||
|
# Command to run the application
|
||||||
|
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
|
||||||
|
|
@ -0,0 +1,94 @@
|
||||||
|
# MagicDoc API Service
|
||||||
|
|
||||||
|
A FastAPI service that provides document to markdown conversion using the Magic-Doc library. This service is designed to be compatible with the existing Mineru API interface.
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- Converts DOC, DOCX, PPT, PPTX, and PDF files to markdown
|
||||||
|
- RESTful API interface compatible with Mineru API
|
||||||
|
- Docker containerization with LibreOffice dependencies
|
||||||
|
- Health check endpoint
|
||||||
|
- File upload support
|
||||||
|
|
||||||
|
## API Endpoints
|
||||||
|
|
||||||
|
### Health Check
|
||||||
|
```
|
||||||
|
GET /health
|
||||||
|
```
|
||||||
|
Returns service health status.
|
||||||
|
|
||||||
|
### File Parse
|
||||||
|
```
|
||||||
|
POST /file_parse
|
||||||
|
```
|
||||||
|
Converts uploaded document to markdown.
|
||||||
|
|
||||||
|
**Parameters:**
|
||||||
|
- `files`: File upload (required)
|
||||||
|
- `output_dir`: Output directory (default: "./output")
|
||||||
|
- `lang_list`: Language list (default: "ch")
|
||||||
|
- `backend`: Backend type (default: "pipeline")
|
||||||
|
- `parse_method`: Parse method (default: "auto")
|
||||||
|
- `formula_enable`: Enable formula processing (default: true)
|
||||||
|
- `table_enable`: Enable table processing (default: true)
|
||||||
|
- `return_md`: Return markdown (default: true)
|
||||||
|
- `return_middle_json`: Return middle JSON (default: false)
|
||||||
|
- `return_model_output`: Return model output (default: false)
|
||||||
|
- `return_content_list`: Return content list (default: false)
|
||||||
|
- `return_images`: Return images (default: false)
|
||||||
|
- `start_page_id`: Start page ID (default: 0)
|
||||||
|
- `end_page_id`: End page ID (default: 99999)
|
||||||
|
|
||||||
|
**Response:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"markdown": "converted markdown content",
|
||||||
|
"md": "converted markdown content",
|
||||||
|
"content": "converted markdown content",
|
||||||
|
"text": "converted markdown content",
|
||||||
|
"time_cost": 1.23,
|
||||||
|
"filename": "document.docx",
|
||||||
|
"status": "success"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## Running with Docker
|
||||||
|
|
||||||
|
### Build and run with docker-compose
|
||||||
|
```bash
|
||||||
|
cd magicdoc
|
||||||
|
docker-compose up --build
|
||||||
|
```
|
||||||
|
|
||||||
|
The service will be available at `http://localhost:8002`
|
||||||
|
|
||||||
|
### Build and run with Docker
|
||||||
|
```bash
|
||||||
|
cd magicdoc
|
||||||
|
docker build -t magicdoc-api .
|
||||||
|
docker run -p 8002:8000 magicdoc-api
|
||||||
|
```
|
||||||
|
|
||||||
|
## Integration with Document Processors
|
||||||
|
|
||||||
|
This service is designed to be compatible with the existing document processors. To use it instead of Mineru API, update the configuration in your document processors:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# In docx_processor.py or pdf_processor.py
|
||||||
|
self.magicdoc_base_url = getattr(settings, 'MAGICDOC_API_URL', 'http://magicdoc-api:8000')
|
||||||
|
```
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- Python 3.10
|
||||||
|
- LibreOffice (installed in Docker container)
|
||||||
|
- Magic-Doc library
|
||||||
|
- FastAPI
|
||||||
|
- Uvicorn
|
||||||
|
|
||||||
|
## Storage
|
||||||
|
|
||||||
|
The service creates the following directories:
|
||||||
|
- `storage/uploads/`: For uploaded files
|
||||||
|
- `storage/processed/`: For processed files
|
||||||
|
|
@ -0,0 +1,152 @@
|
||||||
|
# MagicDoc Service Setup Guide
|
||||||
|
|
||||||
|
This guide explains how to set up and use the MagicDoc API service as an alternative to the Mineru API for document processing.
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The MagicDoc service provides a FastAPI-based REST API that converts various document formats (DOC, DOCX, PPT, PPTX, PDF) to markdown using the Magic-Doc library. It's designed to be compatible with your existing document processors.
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
### 1. Build and Run the Service
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd magicdoc
|
||||||
|
./start.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Or manually:
|
||||||
|
```bash
|
||||||
|
cd magicdoc
|
||||||
|
docker-compose up --build -d
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Verify the Service
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check health
|
||||||
|
curl http://localhost:8002/health
|
||||||
|
|
||||||
|
# View API documentation
|
||||||
|
open http://localhost:8002/docs
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Test with Sample Files
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd magicdoc
|
||||||
|
python test_api.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## API Compatibility
|
||||||
|
|
||||||
|
The MagicDoc API is designed to be compatible with your existing Mineru API interface:
|
||||||
|
|
||||||
|
### Endpoint: `POST /file_parse`
|
||||||
|
|
||||||
|
**Request Format:**
|
||||||
|
- File upload via multipart form data
|
||||||
|
- Same parameters as Mineru API (most are optional)
|
||||||
|
|
||||||
|
**Response Format:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"markdown": "converted content",
|
||||||
|
"md": "converted content",
|
||||||
|
"content": "converted content",
|
||||||
|
"text": "converted content",
|
||||||
|
"time_cost": 1.23,
|
||||||
|
"filename": "document.docx",
|
||||||
|
"status": "success"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## Integration with Existing Processors
|
||||||
|
|
||||||
|
To use MagicDoc instead of Mineru in your existing processors:
|
||||||
|
|
||||||
|
### 1. Update Configuration
|
||||||
|
|
||||||
|
Add to your settings:
|
||||||
|
```python
|
||||||
|
MAGICDOC_API_URL = "http://magicdoc-api:8000" # or http://localhost:8002
|
||||||
|
MAGICDOC_TIMEOUT = 300
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Modify Processors
|
||||||
|
|
||||||
|
Replace Mineru API calls with MagicDoc API calls. See `integration_example.py` for detailed examples.
|
||||||
|
|
||||||
|
### 3. Update Docker Compose
|
||||||
|
|
||||||
|
Add the MagicDoc service to your main docker-compose.yml:
|
||||||
|
```yaml
|
||||||
|
services:
|
||||||
|
magicdoc-api:
|
||||||
|
build:
|
||||||
|
context: ./magicdoc
|
||||||
|
dockerfile: Dockerfile
|
||||||
|
ports:
|
||||||
|
- "8002:8000"
|
||||||
|
volumes:
|
||||||
|
- ./magicdoc/storage:/app/storage
|
||||||
|
environment:
|
||||||
|
- PYTHONUNBUFFERED=1
|
||||||
|
restart: unless-stopped
|
||||||
|
```
|
||||||
|
|
||||||
|
## Service Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
magicdoc/
|
||||||
|
├── app/
|
||||||
|
│ ├── __init__.py
|
||||||
|
│ └── main.py # FastAPI application
|
||||||
|
├── Dockerfile # Container definition
|
||||||
|
├── docker-compose.yml # Service orchestration
|
||||||
|
├── requirements.txt # Python dependencies
|
||||||
|
├── README.md # Service documentation
|
||||||
|
├── SETUP.md # This setup guide
|
||||||
|
├── test_api.py # API testing script
|
||||||
|
├── integration_example.py # Integration examples
|
||||||
|
└── start.sh # Startup script
|
||||||
|
```
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- **Python 3.10**: Base runtime
|
||||||
|
- **LibreOffice**: Document processing (installed in container)
|
||||||
|
- **Magic-Doc**: Document conversion library
|
||||||
|
- **FastAPI**: Web framework
|
||||||
|
- **Uvicorn**: ASGI server
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Service Won't Start
|
||||||
|
1. Check Docker is running
|
||||||
|
2. Verify port 8002 is available
|
||||||
|
3. Check logs: `docker-compose logs`
|
||||||
|
|
||||||
|
### File Conversion Fails
|
||||||
|
1. Verify LibreOffice is working in container
|
||||||
|
2. Check file format is supported
|
||||||
|
3. Review API logs for errors
|
||||||
|
|
||||||
|
### Integration Issues
|
||||||
|
1. Verify API endpoint URL
|
||||||
|
2. Check network connectivity between services
|
||||||
|
3. Ensure response format compatibility
|
||||||
|
|
||||||
|
## Performance Considerations
|
||||||
|
|
||||||
|
- MagicDoc is generally faster than Mineru for simple documents
|
||||||
|
- LibreOffice dependency adds container size
|
||||||
|
- Consider caching for repeated conversions
|
||||||
|
- Monitor memory usage for large files
|
||||||
|
|
||||||
|
## Security Notes
|
||||||
|
|
||||||
|
- Service runs on internal network
|
||||||
|
- File uploads are temporary
|
||||||
|
- No persistent storage of uploaded files
|
||||||
|
- Consider adding authentication for production use
|
||||||
|
|
@ -0,0 +1 @@
|
||||||
|
# MagicDoc FastAPI Application
|
||||||
|
|
@ -0,0 +1,96 @@
|
||||||
|
import os
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Any, Optional
|
||||||
|
from fastapi import FastAPI, File, UploadFile, Form, HTTPException
|
||||||
|
from fastapi.responses import JSONResponse
|
||||||
|
from magic_doc.docconv import DocConverter, S3Config
|
||||||
|
import tempfile
|
||||||
|
import shutil
|
||||||
|
|
||||||
|
# Configure logging
|
||||||
|
logging.basicConfig(level=logging.INFO)
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
app = FastAPI(title="MagicDoc API", version="1.0.0")
|
||||||
|
|
||||||
|
# Global converter instance
|
||||||
|
converter = DocConverter(s3_config=None)
|
||||||
|
|
||||||
|
@app.get("/health")
|
||||||
|
async def health_check():
|
||||||
|
"""Health check endpoint"""
|
||||||
|
return {"status": "healthy", "service": "magicdoc-api"}
|
||||||
|
|
||||||
|
@app.post("/file_parse")
|
||||||
|
async def parse_file(
|
||||||
|
files: UploadFile = File(...),
|
||||||
|
output_dir: str = Form("./output"),
|
||||||
|
lang_list: str = Form("ch"),
|
||||||
|
backend: str = Form("pipeline"),
|
||||||
|
parse_method: str = Form("auto"),
|
||||||
|
formula_enable: bool = Form(True),
|
||||||
|
table_enable: bool = Form(True),
|
||||||
|
return_md: bool = Form(True),
|
||||||
|
return_middle_json: bool = Form(False),
|
||||||
|
return_model_output: bool = Form(False),
|
||||||
|
return_content_list: bool = Form(False),
|
||||||
|
return_images: bool = Form(False),
|
||||||
|
start_page_id: int = Form(0),
|
||||||
|
end_page_id: int = Form(99999)
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Parse document file and convert to markdown
|
||||||
|
Compatible with Mineru API interface
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
logger.info(f"Processing file: {files.filename}")
|
||||||
|
|
||||||
|
# Create temporary file to save uploaded content
|
||||||
|
with tempfile.NamedTemporaryFile(delete=False, suffix=os.path.splitext(files.filename)[1]) as temp_file:
|
||||||
|
shutil.copyfileobj(files.file, temp_file)
|
||||||
|
temp_file_path = temp_file.name
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Convert file to markdown using magic-doc
|
||||||
|
markdown_content, time_cost = converter.convert(temp_file_path, conv_timeout=300)
|
||||||
|
|
||||||
|
logger.info(f"Successfully converted {files.filename} to markdown in {time_cost:.2f}s")
|
||||||
|
|
||||||
|
# Return response compatible with Mineru API
|
||||||
|
response = {
|
||||||
|
"markdown": markdown_content,
|
||||||
|
"md": markdown_content, # Alternative field name
|
||||||
|
"content": markdown_content, # Alternative field name
|
||||||
|
"text": markdown_content, # Alternative field name
|
||||||
|
"time_cost": time_cost,
|
||||||
|
"filename": files.filename,
|
||||||
|
"status": "success"
|
||||||
|
}
|
||||||
|
|
||||||
|
return JSONResponse(content=response)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Clean up temporary file
|
||||||
|
if os.path.exists(temp_file_path):
|
||||||
|
os.unlink(temp_file_path)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error processing file {files.filename}: {str(e)}")
|
||||||
|
raise HTTPException(status_code=500, detail=f"Error processing file: {str(e)}")
|
||||||
|
|
||||||
|
@app.get("/")
|
||||||
|
async def root():
|
||||||
|
"""Root endpoint with service information"""
|
||||||
|
return {
|
||||||
|
"service": "MagicDoc API",
|
||||||
|
"version": "1.0.0",
|
||||||
|
"description": "Document to Markdown conversion service using Magic-Doc",
|
||||||
|
"endpoints": {
|
||||||
|
"health": "/health",
|
||||||
|
"file_parse": "/file_parse"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import uvicorn
|
||||||
|
uvicorn.run(app, host="0.0.0.0", port=8000)
|
||||||
|
|
@ -0,0 +1,26 @@
|
||||||
|
version: '3.8'
|
||||||
|
|
||||||
|
services:
|
||||||
|
magicdoc-api:
|
||||||
|
build:
|
||||||
|
context: .
|
||||||
|
dockerfile: Dockerfile
|
||||||
|
platform: linux/amd64
|
||||||
|
ports:
|
||||||
|
- "8002:8000"
|
||||||
|
volumes:
|
||||||
|
- ./storage/uploads:/app/storage/uploads
|
||||||
|
- ./storage/processed:/app/storage/processed
|
||||||
|
environment:
|
||||||
|
- PYTHONUNBUFFERED=1
|
||||||
|
restart: unless-stopped
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 10s
|
||||||
|
retries: 3
|
||||||
|
start_period: 60s
|
||||||
|
|
||||||
|
volumes:
|
||||||
|
uploads:
|
||||||
|
processed:
|
||||||
|
|
@ -0,0 +1,144 @@
|
||||||
|
"""
|
||||||
|
Example of how to integrate MagicDoc API with existing document processors
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Example modification for docx_processor.py
|
||||||
|
# Replace the Mineru API configuration with MagicDoc API configuration
|
||||||
|
|
||||||
|
class DocxDocumentProcessor(DocumentProcessor):
|
||||||
|
def __init__(self, input_path: str, output_path: str):
|
||||||
|
super().__init__()
|
||||||
|
self.input_path = input_path
|
||||||
|
self.output_path = output_path
|
||||||
|
self.output_dir = os.path.dirname(output_path)
|
||||||
|
self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]
|
||||||
|
|
||||||
|
# Setup work directory for temporary files
|
||||||
|
self.work_dir = os.path.join(
|
||||||
|
os.path.dirname(output_path),
|
||||||
|
".work",
|
||||||
|
os.path.splitext(os.path.basename(input_path))[0]
|
||||||
|
)
|
||||||
|
os.makedirs(self.work_dir, exist_ok=True)
|
||||||
|
|
||||||
|
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
|
||||||
|
|
||||||
|
# MagicDoc API configuration (instead of Mineru)
|
||||||
|
self.magicdoc_base_url = getattr(settings, 'MAGICDOC_API_URL', 'http://magicdoc-api:8000')
|
||||||
|
self.magicdoc_timeout = getattr(settings, 'MAGICDOC_TIMEOUT', 300) # 5 minutes timeout
|
||||||
|
|
||||||
|
def _call_magicdoc_api(self, file_path: str) -> Optional[Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
Call MagicDoc API to convert DOCX to markdown
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Path to the DOCX file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
API response as dictionary or None if failed
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
url = f"{self.magicdoc_base_url}/file_parse"
|
||||||
|
|
||||||
|
with open(file_path, 'rb') as file:
|
||||||
|
files = {'files': (os.path.basename(file_path), file, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document')}
|
||||||
|
|
||||||
|
# Prepare form data - simplified compared to Mineru
|
||||||
|
data = {
|
||||||
|
'output_dir': './output',
|
||||||
|
'lang_list': 'ch',
|
||||||
|
'backend': 'pipeline',
|
||||||
|
'parse_method': 'auto',
|
||||||
|
'formula_enable': True,
|
||||||
|
'table_enable': True,
|
||||||
|
'return_md': True,
|
||||||
|
'return_middle_json': False,
|
||||||
|
'return_model_output': False,
|
||||||
|
'return_content_list': False,
|
||||||
|
'return_images': False,
|
||||||
|
'start_page_id': 0,
|
||||||
|
'end_page_id': 99999
|
||||||
|
}
|
||||||
|
|
||||||
|
logger.info(f"Calling MagicDoc API for DOCX processing at {url}")
|
||||||
|
response = requests.post(
|
||||||
|
url,
|
||||||
|
files=files,
|
||||||
|
data=data,
|
||||||
|
timeout=self.magicdoc_timeout
|
||||||
|
)
|
||||||
|
|
||||||
|
if response.status_code == 200:
|
||||||
|
result = response.json()
|
||||||
|
logger.info("Successfully received response from MagicDoc API for DOCX")
|
||||||
|
return result
|
||||||
|
else:
|
||||||
|
error_msg = f"MagicDoc API returned status code {response.status_code}: {response.text}"
|
||||||
|
logger.error(error_msg)
|
||||||
|
raise Exception(error_msg)
|
||||||
|
|
||||||
|
except requests.exceptions.Timeout:
|
||||||
|
error_msg = f"MagicDoc API request timed out after {self.magicdoc_timeout} seconds"
|
||||||
|
logger.error(error_msg)
|
||||||
|
raise Exception(error_msg)
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
error_msg = f"Error calling MagicDoc API for DOCX: {str(e)}"
|
||||||
|
logger.error(error_msg)
|
||||||
|
raise Exception(error_msg)
|
||||||
|
except Exception as e:
|
||||||
|
error_msg = f"Unexpected error calling MagicDoc API for DOCX: {str(e)}"
|
||||||
|
logger.error(error_msg)
|
||||||
|
raise Exception(error_msg)
|
||||||
|
|
||||||
|
def read_content(self) -> str:
|
||||||
|
logger.info("Starting DOCX content processing with MagicDoc API")
|
||||||
|
|
||||||
|
# Call MagicDoc API to convert DOCX to markdown
|
||||||
|
magicdoc_response = self._call_magicdoc_api(self.input_path)
|
||||||
|
|
||||||
|
# Extract markdown content from the response
|
||||||
|
markdown_content = self._extract_markdown_from_response(magicdoc_response)
|
||||||
|
|
||||||
|
if not markdown_content:
|
||||||
|
raise Exception("No markdown content found in MagicDoc API response for DOCX")
|
||||||
|
|
||||||
|
logger.info(f"Successfully extracted {len(markdown_content)} characters of markdown content from DOCX")
|
||||||
|
|
||||||
|
# Save the raw markdown content to work directory for reference
|
||||||
|
md_output_path = os.path.join(self.work_dir, f"{self.name_without_suff}.md")
|
||||||
|
with open(md_output_path, 'w', encoding='utf-8') as file:
|
||||||
|
file.write(markdown_content)
|
||||||
|
|
||||||
|
logger.info(f"Saved raw markdown content from DOCX to {md_output_path}")
|
||||||
|
|
||||||
|
return markdown_content
|
||||||
|
|
||||||
|
# Configuration changes needed in settings.py:
|
||||||
|
"""
|
||||||
|
# Add these settings to your configuration
|
||||||
|
MAGICDOC_API_URL = "http://magicdoc-api:8000" # or http://localhost:8002 for local development
|
||||||
|
MAGICDOC_TIMEOUT = 300 # 5 minutes timeout
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Docker Compose integration:
|
||||||
|
"""
|
||||||
|
# Add to your main docker-compose.yml
|
||||||
|
services:
|
||||||
|
magicdoc-api:
|
||||||
|
build:
|
||||||
|
context: ./magicdoc
|
||||||
|
dockerfile: Dockerfile
|
||||||
|
ports:
|
||||||
|
- "8002:8000"
|
||||||
|
volumes:
|
||||||
|
- ./magicdoc/storage:/app/storage
|
||||||
|
environment:
|
||||||
|
- PYTHONUNBUFFERED=1
|
||||||
|
restart: unless-stopped
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 10s
|
||||||
|
retries: 3
|
||||||
|
start_period: 60s
|
||||||
|
"""
|
||||||
|
|
@ -0,0 +1,5 @@
|
||||||
|
fastapi==0.104.1
|
||||||
|
uvicorn[standard]==0.24.0
|
||||||
|
python-multipart==0.0.6
|
||||||
|
# fairy-doc[cpu]==0.1.0
|
||||||
|
pydantic==2.5.0
|
||||||
|
|
@ -0,0 +1,34 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# MagicDoc API Service Startup Script
|
||||||
|
|
||||||
|
echo "Starting MagicDoc API Service..."
|
||||||
|
|
||||||
|
# Check if Docker is running
|
||||||
|
if ! docker info > /dev/null 2>&1; then
|
||||||
|
echo "Error: Docker is not running. Please start Docker first."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Build and start the service
|
||||||
|
echo "Building and starting MagicDoc API service..."
|
||||||
|
docker-compose up --build -d
|
||||||
|
|
||||||
|
# Wait for service to be ready
|
||||||
|
echo "Waiting for service to be ready..."
|
||||||
|
sleep 10
|
||||||
|
|
||||||
|
# Check health
|
||||||
|
echo "Checking service health..."
|
||||||
|
if curl -f http://localhost:8002/health > /dev/null 2>&1; then
|
||||||
|
echo "✅ MagicDoc API service is running successfully!"
|
||||||
|
echo "🌐 Service URL: http://localhost:8002"
|
||||||
|
echo "📖 API Documentation: http://localhost:8002/docs"
|
||||||
|
echo "🔍 Health Check: http://localhost:8002/health"
|
||||||
|
else
|
||||||
|
echo "❌ Service health check failed. Check logs with: docker-compose logs"
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "To stop the service, run: docker-compose down"
|
||||||
|
echo "To view logs, run: docker-compose logs -f"
|
||||||
|
|
@ -0,0 +1,92 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Test script for MagicDoc API
|
||||||
|
"""
|
||||||
|
|
||||||
|
import requests
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
|
||||||
|
def test_health_check(base_url="http://localhost:8002"):
|
||||||
|
"""Test health check endpoint"""
|
||||||
|
try:
|
||||||
|
response = requests.get(f"{base_url}/health")
|
||||||
|
print(f"Health check status: {response.status_code}")
|
||||||
|
print(f"Response: {response.json()}")
|
||||||
|
return response.status_code == 200
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Health check failed: {e}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
def test_file_parse(base_url="http://localhost:8002", file_path=None):
|
||||||
|
"""Test file parse endpoint"""
|
||||||
|
if not file_path or not os.path.exists(file_path):
|
||||||
|
print(f"File not found: {file_path}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(file_path, 'rb') as f:
|
||||||
|
files = {'files': (os.path.basename(file_path), f, 'application/octet-stream')}
|
||||||
|
data = {
|
||||||
|
'output_dir': './output',
|
||||||
|
'lang_list': 'ch',
|
||||||
|
'backend': 'pipeline',
|
||||||
|
'parse_method': 'auto',
|
||||||
|
'formula_enable': True,
|
||||||
|
'table_enable': True,
|
||||||
|
'return_md': True,
|
||||||
|
'return_middle_json': False,
|
||||||
|
'return_model_output': False,
|
||||||
|
'return_content_list': False,
|
||||||
|
'return_images': False,
|
||||||
|
'start_page_id': 0,
|
||||||
|
'end_page_id': 99999
|
||||||
|
}
|
||||||
|
|
||||||
|
response = requests.post(f"{base_url}/file_parse", files=files, data=data)
|
||||||
|
print(f"File parse status: {response.status_code}")
|
||||||
|
|
||||||
|
if response.status_code == 200:
|
||||||
|
result = response.json()
|
||||||
|
print(f"Success! Converted {len(result.get('markdown', ''))} characters")
|
||||||
|
print(f"Time cost: {result.get('time_cost', 'N/A')}s")
|
||||||
|
return True
|
||||||
|
else:
|
||||||
|
print(f"Error: {response.text}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"File parse failed: {e}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
def main():
|
||||||
|
"""Main test function"""
|
||||||
|
print("Testing MagicDoc API...")
|
||||||
|
|
||||||
|
# Test health check
|
||||||
|
print("\n1. Testing health check...")
|
||||||
|
if not test_health_check():
|
||||||
|
print("Health check failed. Make sure the service is running.")
|
||||||
|
return
|
||||||
|
|
||||||
|
# Test file parse (if sample file exists)
|
||||||
|
print("\n2. Testing file parse...")
|
||||||
|
sample_files = [
|
||||||
|
"../sample_doc/20220707_na_decision-2.docx",
|
||||||
|
"../sample_doc/20220707_na_decision-2.pdf",
|
||||||
|
"../sample_doc/short_doc.md"
|
||||||
|
]
|
||||||
|
|
||||||
|
for sample_file in sample_files:
|
||||||
|
if os.path.exists(sample_file):
|
||||||
|
print(f"Testing with {sample_file}...")
|
||||||
|
if test_file_parse(file_path=sample_file):
|
||||||
|
print("File parse test passed!")
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
print(f"Sample file not found: {sample_file}")
|
||||||
|
|
||||||
|
print("\nTest completed!")
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
Loading…
Reference in New Issue