
# Mineru API Documentation

This document describes the FastAPI interface for the Mineru document parsing service.

## Overview

The Mineru API provides endpoints for parsing documents (PDFs, images) using advanced OCR and layout analysis. It supports both pipeline and VLM backends for different use cases.

## Base URL

```
http://localhost:8000/api/v1/mineru
```

## Endpoints

### 1. Health Check

**GET** `/health`

Check if the Mineru service is running.

**Response:**
```json
{
  "status": "healthy",
  "service": "mineru"
}
```
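
A client can poll this endpoint before submitting work. The following is a minimal sketch using only the Python standard library; the helper names `is_healthy` and `check_health` are hypothetical and not part of the service.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1/mineru"  # base URL from this document


def is_healthy(payload) -> bool:
    """Return True when the payload matches the documented healthy response."""
    return (
        isinstance(payload, dict)
        and payload.get("status") == "healthy"
        and payload.get("service") == "mineru"
    )


def check_health(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Call GET /health and report whether the service looks healthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return is_healthy(json.load(resp))
    except (OSError, ValueError):
        # Connection failures, HTTP errors, and invalid JSON all count as unhealthy.
        return False
```

Calling `check_health()` returns `False` rather than raising when the service is unreachable, which keeps startup scripts simple.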

### 2. Parse Document

**POST** `/parse`

Upload a document and parse it with the selected backend. The response lists the path of each generated output file.

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `file` | File | Required | The document file to parse (PDF, PNG, JPEG, JPG) |
| `lang` | string | "ch" | Language option ('ch', 'en', 'korean', 'japan', etc.) |
| `backend` | string | "pipeline" | Backend for parsing ('pipeline', 'vlm-transformers', 'vlm-sglang-engine', 'vlm-sglang-client') |
| `method` | string | "auto" | Method for parsing ('auto', 'txt', 'ocr') |
| `server_url` | string | null | Server URL for the 'vlm-sglang-client' backend (required when that backend is selected) |
| `start_page_id` | int | 0 | Start page ID for parsing |
| `end_page_id` | int | null | End page ID for parsing |
| `formula_enable` | boolean | true | Enable formula parsing |
| `table_enable` | boolean | true | Enable table parsing |
| `draw_layout_bbox` | boolean | true | Whether to draw layout bounding boxes |
| `draw_span_bbox` | boolean | true | Whether to draw span bounding boxes |
| `dump_md` | boolean | true | Whether to dump markdown files |
| `dump_middle_json` | boolean | true | Whether to dump middle JSON files |
| `dump_model_output` | boolean | true | Whether to dump model output files |
| `dump_orig_pdf` | boolean | true | Whether to dump original PDF files |
| `dump_content_list` | boolean | true | Whether to dump content list files |
| `make_md_mode` | string | "MM_MD" | The mode for making markdown content |

**Response:**
```json
{
  "status": "success",
  "file_name": "document_name",
  "outputs": {
    "markdown": "/path/to/document_name.md",
    "middle_json": "/path/to/document_name_middle.json",
    "model_output": "/path/to/document_name_model.json",
    "content_list": "/path/to/document_name_content_list.json",
    "original_pdf": "/path/to/document_name_origin.pdf",
    "layout_pdf": "/path/to/document_name_layout.pdf",
    "span_pdf": "/path/to/document_name_span.pdf"
  },
  "output_directory": "/path/to/output/directory"
}
```
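
All of the draw and dump flags default to true, so a caller who only needs the Markdown output may want to switch the others off. The sketch below builds such a parameter set. The helper name `markdown_only_params` is hypothetical, and sending booleans as lowercase strings is an assumption about how the service parses them.

```python
# Flags from the parameter table that control optional output files.
OPTIONAL_OUTPUT_FLAGS = (
    "draw_layout_bbox",
    "draw_span_bbox",
    "dump_middle_json",
    "dump_model_output",
    "dump_orig_pdf",
    "dump_content_list",
)


def markdown_only_params(lang: str = "ch", backend: str = "pipeline") -> dict:
    """Build /parse parameters that keep dump_md on and turn other outputs off."""
    params = {"lang": lang, "backend": backend, "method": "auto", "dump_md": "true"}
    # Disable every optional output; the values are strings (an assumption).
    params.update({flag: "false" for flag in OPTIONAL_OUTPUT_FLAGS})
    return params
```

The resulting dictionary can be passed alongside the uploaded file in a `/parse` request.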

### 3. Download Processed File

**GET** `/download/{file_path}`

Download a processed file from the Mineru output directory.

**Parameters:**
- `file_path`: Path to the file, relative to the Mineru output directory

**Response:** File download
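
Because `file_path` is embedded in the URL, file names with spaces or other special characters need percent-encoding. The sketch below shows one way to build the URL on the client side; `download_url` is a hypothetical helper, and its path checks are a client-side convenience, not a description of the server's validation.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

BASE_URL = "http://localhost:8000/api/v1/mineru"  # base URL from this document


def download_url(relative_path: str, base_url: str = BASE_URL) -> str:
    """Build a /download URL for a path relative to the Mineru output directory."""
    path = PurePosixPath(relative_path)
    if path.is_absolute() or ".." in path.parts:
        # Client-side guard only; the server is expected to enforce its own checks.
        raise ValueError(f"expected a relative path without '..': {relative_path!r}")
    # quote() percent-encodes spaces and other unsafe characters but keeps '/'.
    return f"{base_url}/download/{quote(str(path))}"
```

For example, `download_url("job1/doc name.md")` encodes the space as `%20` and leaves the directory separator untouched.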

## Usage Examples

### Python Example

```python
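import requests

BASE_URL = 'http://localhost:8000/api/v1/mineru'

# Parse a document
with open('document.pdf', 'rb') as pdf_file:
    files = {'file': ('document.pdf', pdf_file, 'application/pdf')}
    params = {
        'lang': 'ch',
        'backend': 'pipeline',
        'method': 'auto',
        'formula_enable': True,
        'table_enable': True
    }

    response = requests.post(f'{BASE_URL}/parse', files=files, params=params)

if response.status_code == 200:
    result = response.json()
    print(f"Parsed successfully: {result['file_name']}")

    # Download the markdown file.
    # NOTE: the download endpoint expects a path relative to the Mineru
    # output directory; convert md_path first if the API returns an
    # absolute path.
    md_path = result['outputs']['markdown']
    download_response = requests.get(f'{BASE_URL}/download/{md_path}')
    download_response.raise_for_status()  # stop here if the download failed

    with open('output.md', 'wb') as out_file:
        out_file.write(download_response.content)
else:
    # Error responses include a detail message explaining the issue
    print(f"Parse failed ({response.status_code}): {response.text}")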
```

### cURL Example

```bash
# Parse a document
curl -X POST "http://localhost:8000/api/v1/mineru/parse" \
  -F "file=@document.pdf" \
  -F "lang=ch" \
  -F "backend=pipeline" \
  -F "method=auto"

# Download a processed file
curl -X GET "http://localhost:8000/api/v1/mineru/download/path/to/file.md" \
  -o downloaded_file.md
```

## Backend Options

### Pipeline Backend
- **Use case**: General purpose, more robust
- **Advantages**: Better for complex layouts, supports multiple languages
- **Parameter**: `backend=pipeline`

### VLM Backends
- **vlm-transformers**: General purpose VLM
- **vlm-sglang-engine**: Faster engine-based approach
- **vlm-sglang-client**: Fastest client-based approach (requires `server_url`)
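
Since only `vlm-sglang-client` needs `server_url`, a client can catch a missing value before sending the request. The sketch below is one way to do that; `backend_params` is a hypothetical helper, and the server address in the usage note is a placeholder.

```python
from typing import Optional

# Backend names as listed in the parameter table.
VALID_BACKENDS = {
    "pipeline",
    "vlm-transformers",
    "vlm-sglang-engine",
    "vlm-sglang-client",
}


def backend_params(backend: str = "pipeline", server_url: Optional[str] = None) -> dict:
    """Return the backend-related /parse parameters after basic validation."""
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    params = {"backend": backend}
    if backend == "vlm-sglang-client":
        if not server_url:
            raise ValueError("backend 'vlm-sglang-client' requires server_url")
        # server_url is only sent for the client backend.
        params["server_url"] = server_url
    return params
```

For example, `backend_params("vlm-sglang-client", "http://127.0.0.1:30000")` returns both keys, while `backend_params()` returns only `{"backend": "pipeline"}`.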

## Language Support

Supported languages for the pipeline backend:
- `ch`: Chinese (Simplified)
- `en`: English
- `korean`: Korean
- `japan`: Japanese
- `chinese_cht`: Chinese (Traditional)
- `ta`: Tamil
- `te`: Telugu
- `ka`: Kannada

## Output Files

The API generates various output files depending on the parameters:

1. **Markdown** (`.md`): Structured text content
2. **Middle JSON** (`.json`): Intermediate parsing results
3. **Model Output** (`.json` or `.txt`): Raw model predictions
4. **Content List** (`.json`): Structured content list
5. **Original PDF**: Copy of the input file
6. **Layout PDF**: PDF with layout bounding boxes
7. **Span PDF**: PDF with span bounding boxes

## Error Handling

The API returns the following HTTP status codes:

- `200`: Success
- `400`: Bad request (invalid parameters, unsupported file type)
- `404`: File not found
- `500`: Internal server error

Error responses include a detail message explaining the issue.
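
A client can surface that message instead of the raw response body. The sketch below assumes the error body is JSON with a `detail` key, which is FastAPI's default shape for `HTTPException`; the helper name `error_detail` is hypothetical.

```python
import json


def error_detail(status_code: int, body: str) -> str:
    """Return a readable message for a non-200 response."""
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None
    if isinstance(payload, dict) and "detail" in payload:
        # Assumed shape: {"detail": "..."}
        return f"HTTP {status_code}: {payload['detail']}"
    # Fall back to the raw body when it is not the assumed JSON shape.
    return f"HTTP {status_code}: {body}"
```

With `requests`, this could be called as `error_detail(response.status_code, response.text)` whenever the status code is not `200`.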

## Testing

Use the provided test script to verify the API:

```bash
python test_mineru_api.py
```

## Notes

- The API creates unique output directories for each request to avoid conflicts
- Temporary files are automatically cleaned up after processing
- File downloads are restricted to the processed folder for security
- Large files may take time to process depending on the backend and document complexity