MagicDoc API Service

A FastAPI service that converts documents to Markdown using the Magic-Doc library. Its request and response formats are designed to be compatible with the existing Mineru API.

Features

  • Converts DOC, DOCX, PPT, PPTX, and PDF files to Markdown
  • REST API compatible with the Mineru API
  • Docker containerization with LibreOffice dependencies
  • Health check endpoint
  • File upload support

API Endpoints

Health Check

GET /health

Returns service health status.
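
A minimal sketch of a health probe using only the Python standard library. The response body of /health is not specified here, so the sketch only checks the status code; treating HTTP 200 as "healthy" is an assumption, and the helper names `health_url` and `is_healthy` are illustrative.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health-check URL, tolerating a trailing slash on the base URL."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed success code)."""
    try:
        with opener(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError subclass OSError, so this covers refused
        # connections, timeouts, DNS failures, and 4xx/5xx responses.
        return False
```

With the docker-compose setup described in this README, the call would be `is_healthy("http://localhost:8002")`. The `opener` parameter exists only so the function can be exercised without a running service.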

File Parse

POST /file_parse

Converts an uploaded document to Markdown.

Parameters:

  • files: File upload (required)
  • output_dir: Output directory (default: "./output")
  • lang_list: Language list (default: "ch")
  • backend: Backend type (default: "pipeline")
  • parse_method: Parse method (default: "auto")
  • formula_enable: Enable formula processing (default: true)
  • table_enable: Enable table processing (default: true)
  • return_md: Return markdown (default: true)
  • return_middle_json: Return middle JSON (default: false)
  • return_model_output: Return model output (default: false)
  • return_content_list: Return content list (default: false)
  • return_images: Return images (default: false)
  • start_page_id: Start page ID (default: 0)
  • end_page_id: End page ID (default: 99999)
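
The parameter list above can be captured as a table of defaults from which a client builds its form fields. The sketch below is one way to do that; the helper name `build_form_fields` is hypothetical, and encoding booleans as the strings "true" and "false" is an assumption about how the service parses form values.

```python
# Documented defaults for POST /file_parse (the `files` upload is sent separately).
DEFAULTS = {
    "output_dir": "./output",
    "lang_list": "ch",
    "backend": "pipeline",
    "parse_method": "auto",
    "formula_enable": True,
    "table_enable": True,
    "return_md": True,
    "return_middle_json": False,
    "return_model_output": False,
    "return_content_list": False,
    "return_images": False,
    "start_page_id": 0,
    "end_page_id": 99999,
}


def _encode(value) -> str:
    """Encode a value as a form string; booleans become 'true'/'false' (assumed)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_form_fields(**overrides) -> dict[str, str]:
    """Merge overrides into the documented defaults and stringify every value."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    merged = {**DEFAULTS, **overrides}
    return {name: _encode(value) for name, value in merged.items()}
```

For example, `build_form_fields(start_page_id=2, return_images=True)` yields the full set of fields with only those two values changed. The resulting dictionary can be sent as form data alongside the `files` upload by any HTTP client that supports multipart requests.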

Response:

{
  "markdown": "converted markdown content",
  "md": "converted markdown content",
  "content": "converted markdown content",
  "text": "converted markdown content",
  "time_cost": 1.23,
  "filename": "document.docx",
  "status": "success"
}
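
The sample response repeats the converted text under four keys (markdown, md, content, text). A client can read whichever one is present; the sketch below tries them in that order. The helper name `extract_markdown` is hypothetical, and treating any status other than "success" as a failure is an assumption, since error responses are not documented here.

```python
import json

# Keys that carry the converted text in the sample response, in lookup order.
MARKDOWN_KEYS = ("markdown", "md", "content", "text")


def extract_markdown(payload: dict) -> str:
    """Return the converted Markdown from a parsed /file_parse response."""
    status = payload.get("status")
    if status != "success":
        # Assumption: anything other than "success" means the conversion failed.
        raise ValueError(f"conversion failed: status={status!r}")
    for key in MARKDOWN_KEYS:
        value = payload.get(key)
        if isinstance(value, str) and value:
            return value
    raise KeyError("no markdown field found in response")


if __name__ == "__main__":
    body = '{"md": "# Title", "time_cost": 1.23, "filename": "document.docx", "status": "success"}'
    print(extract_markdown(json.loads(body)))
```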

Running with Docker

Build and run with docker-compose

cd magicdoc
docker-compose up --build

The service will be available at http://localhost:8002

Build and run with Docker

cd magicdoc
docker build -t magicdoc-api .
docker run -p 8002:8000 magicdoc-api

Integration with Document Processors

This service is designed to be compatible with the existing document processors. To use it in place of the Mineru API, set the service URL in each processor:

# In docx_processor.py or pdf_processor.py
self.magicdoc_base_url = getattr(settings, 'MAGICDOC_API_URL', 'http://magicdoc-api:8000')
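
The `getattr` call above falls back to a default when `MAGICDOC_API_URL` is not defined in the settings. The sketch below isolates that lookup so it can be tried without the project; `SimpleNamespace` stands in for the real settings object, and the helper names are hypothetical. The default `http://magicdoc-api:8000` appears to be a container-network address (container port 8000, as in the `docker run` mapping), whereas the host-side address is `http://localhost:8002`.

```python
from types import SimpleNamespace

# Default taken from the snippet above.
DEFAULT_MAGICDOC_URL = "http://magicdoc-api:8000"


def resolve_magicdoc_url(settings) -> str:
    """Return MAGICDOC_API_URL from settings, or the default if it is unset."""
    return getattr(settings, "MAGICDOC_API_URL", DEFAULT_MAGICDOC_URL).rstrip("/")


def file_parse_url(settings) -> str:
    """Build the full URL of the conversion endpoint."""
    return resolve_magicdoc_url(settings) + "/file_parse"


if __name__ == "__main__":
    # No override: the default is used.
    print(file_parse_url(SimpleNamespace()))
    # Override, e.g. when calling the service from the host machine.
    print(file_parse_url(SimpleNamespace(MAGICDOC_API_URL="http://localhost:8002/")))
```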

Dependencies

  • Python 3.10
  • LibreOffice (installed in Docker container)
  • Magic-Doc library
  • FastAPI
  • Uvicorn

Storage

The service creates the following directories:

  • storage/uploads/: For uploaded files
  • storage/processed/: For processed files
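
How the service creates these directories is not shown here. As an illustration only, the layout can be reproduced with `pathlib`; the function name `ensure_storage_dirs` and the `root` argument are hypothetical.

```python
from pathlib import Path


def ensure_storage_dirs(root: Path) -> tuple[Path, Path]:
    """Create storage/uploads and storage/processed under root if missing."""
    uploads = root / "storage" / "uploads"
    processed = root / "storage" / "processed"
    for directory in (uploads, processed):
        # parents=True creates storage/ as needed; exist_ok=True makes reruns safe.
        directory.mkdir(parents=True, exist_ok=True)
    return uploads, processed
```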