# Mineru API Documentation
This document describes the FastAPI interface for the Mineru document parsing service.
## Overview
The Mineru API provides endpoints for parsing documents (PDFs and images) with OCR and layout analysis. It supports a `pipeline` backend and several VLM backends, selected per request.
## Base URL
```
http://localhost:8000/api/v1/mineru
```
## Endpoints
### 1. Health Check
**GET** `/health`

Check if the Mineru service is running.

**Response:**

```json
{
  "status": "healthy",
  "service": "mineru"
}
```
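
A minimal sketch of checking the service from Python before submitting work. It uses only the standard library; the helper names `is_healthy_payload` and `service_is_up` and the 5-second timeout are assumptions for illustration, not part of the API.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1/mineru"


def is_healthy_payload(payload) -> bool:
    """Return True if the body matches the documented healthy response."""
    return (
        isinstance(payload, dict)
        and payload.get("status") == "healthy"
        and payload.get("service") == "mineru"
    )


def service_is_up(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """GET /health and report whether the service answered as healthy.

    Hypothetical helper: connection errors, timeouts, HTTP errors, and
    malformed JSON are all treated as "not healthy".
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return is_healthy_payload(json.load(resp))
    except (OSError, ValueError):
        # URLError/HTTPError and timeouts derive from OSError;
        # JSON decoding errors derive from ValueError.
        return False
```

Calling `service_is_up()` before a batch of `/parse` requests gives an early, cheap failure if the server is not reachable.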
### 2. Parse Document
**POST** `/parse`

Upload a document and parse it with the selected backend.

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `file` | File | Required | The document file to parse (PDF, PNG, JPEG, JPG) |
| `lang` | string | "ch" | Language option ('ch', 'en', 'korean', 'japan', etc.) |
| `backend` | string | "pipeline" | Backend for parsing ('pipeline', 'vlm-transformers', 'vlm-sglang-engine', 'vlm-sglang-client') |
| `method` | string | "auto" | Method for parsing ('auto', 'txt', 'ocr') |
| `server_url` | string | null | Server URL; required when `backend` is `vlm-sglang-client` |
| `start_page_id` | int | 0 | Start page ID for parsing |
| `end_page_id` | int | null | End page ID for parsing |
| `formula_enable` | boolean | true | Enable formula parsing |
| `table_enable` | boolean | true | Enable table parsing |
| `draw_layout_bbox` | boolean | true | Whether to draw layout bounding boxes |
| `draw_span_bbox` | boolean | true | Whether to draw span bounding boxes |
| `dump_md` | boolean | true | Whether to dump markdown files |
| `dump_middle_json` | boolean | true | Whether to dump middle JSON files |
| `dump_model_output` | boolean | true | Whether to dump model output files |
| `dump_orig_pdf` | boolean | true | Whether to dump original PDF files |
| `dump_content_list` | boolean | true | Whether to dump content list files |
| `make_md_mode` | string | "MM_MD" | The mode for making markdown content |

**Response:**

```json
{
  "status": "success",
  "file_name": "document_name",
  "outputs": {
    "markdown": "/path/to/document_name.md",
    "middle_json": "/path/to/document_name_middle.json",
    "model_output": "/path/to/document_name_model.json",
    "content_list": "/path/to/document_name_content_list.json",
    "original_pdf": "/path/to/document_name_origin.pdf",
    "layout_pdf": "/path/to/document_name_layout.pdf",
    "span_pdf": "/path/to/document_name_span.pdf"
  },
  "output_directory": "/path/to/output/directory"
}
```
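
As an illustration of combining the parameters above, the following sketch builds a request that parses only a page range and skips the bounding-box PDFs and raw model output. The helper name `build_parse_params` and its `debug_outputs` switch are hypothetical, and the sketch assumes the service accepts these options as query parameters, as the Python example in this document does. `end_page_id` is omitted when unset so that the server default applies.

```python
from typing import Optional


def build_parse_params(
    start_page_id: int = 0,
    end_page_id: Optional[int] = None,
    lang: str = "ch",
    debug_outputs: bool = False,
) -> dict:
    """Build /parse options for a page range (hypothetical helper)."""
    if start_page_id < 0:
        raise ValueError("start_page_id must be >= 0")
    if end_page_id is not None and end_page_id < start_page_id:
        raise ValueError("end_page_id must not be smaller than start_page_id")

    # Serialize booleans explicitly so the wire format does not depend
    # on how the HTTP client stringifies Python bools.
    flag = str(debug_outputs).lower()
    params = {
        "lang": lang,
        "backend": "pipeline",
        "method": "auto",
        "start_page_id": start_page_id,
        "draw_layout_bbox": flag,
        "draw_span_bbox": flag,
        "dump_model_output": flag,
    }
    if end_page_id is not None:
        params["end_page_id"] = end_page_id
    return params
```

The resulting dictionary can be passed as `params=` to `requests.post(...)` together with the `files` upload.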
### 3. Download Processed File
**GET** `/download/{file_path}`

Download a processed file from the Mineru output directory.

**Parameters:**

- `file_path`: Path to the file relative to the mineru output directory

**Response:** File download
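
Because `file_path` is embedded in the URL, it helps to percent-encode it and to reject anything that is not a plain relative path before sending the request. A minimal sketch, assuming forward-slash separators; the helper name `build_download_url` is hypothetical, and the client-side check is a convenience only, not a substitute for the server's own restriction.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8000/api/v1/mineru"


def build_download_url(relative_path: str, base_url: str = BASE_URL) -> str:
    """Return the /download URL for a path relative to the output directory."""
    if not relative_path or relative_path.startswith("/"):
        raise ValueError("file_path must be a non-empty relative path")
    if ".." in relative_path.split("/"):
        raise ValueError("file_path must not contain '..' segments")
    # quote() keeps '/' by default and escapes spaces and other unsafe characters.
    return f"{base_url}/download/{quote(relative_path)}"
```

For example, `build_download_url("job1/doc name.md")` yields a URL ending in `/download/job1/doc%20name.md`.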
## Usage Examples
### Python Example
```python
import requests

BASE_URL = 'http://localhost:8000/api/v1/mineru'

# Parse a document
with open('document.pdf', 'rb') as f:
    files = {'file': ('document.pdf', f, 'application/pdf')}
    params = {
        'lang': 'ch',
        'backend': 'pipeline',
        'method': 'auto',
        'formula_enable': True,
        'table_enable': True
    }
    # Send the request while the file handle is still open
    response = requests.post(f'{BASE_URL}/parse', files=files, params=params)

if response.status_code == 200:
    result = response.json()
    print(f"Parsed successfully: {result['file_name']}")

    # Download the markdown file.
    # Note: /download expects a path relative to the Mineru output directory.
    md_path = result['outputs']['markdown']
    download_response = requests.get(f'{BASE_URL}/download/{md_path}')
    download_response.raise_for_status()
    with open('output.md', 'wb') as f:
        f.write(download_response.content)
else:
    print(f"Parse failed ({response.status_code}): {response.text}")
```
### cURL Example
```bash
# Parse a document
curl -X POST "http://localhost:8000/api/v1/mineru/parse" \
  -F "file=@document.pdf" \
  -F "lang=ch" \
  -F "backend=pipeline" \
  -F "method=auto"

# Download a processed file
curl -X GET "http://localhost:8000/api/v1/mineru/download/path/to/file.md" \
  -o downloaded_file.md
```
## Backend Options
### Pipeline Backend
- **Use case**: General purpose, more robust
- **Advantages**: Better for complex layouts, supports multiple languages
- **Parameter**: `backend=pipeline`
### VLM Backends
- **vlm-transformers**: General purpose VLM
- **vlm-sglang-engine**: Faster engine-based approach
- **vlm-sglang-client**: Fastest client-based approach (requires `server_url`)
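
Since only `vlm-sglang-client` needs `server_url`, a client can validate the combination before uploading a large file. A minimal sketch; `backend_params` is a hypothetical helper, and the server address in the usage note is a placeholder.

```python
from typing import Optional

# Backend names as listed in the /parse parameter table.
BACKENDS = ("pipeline", "vlm-transformers", "vlm-sglang-engine", "vlm-sglang-client")


def backend_params(backend: str = "pipeline", server_url: Optional[str] = None) -> dict:
    """Return the backend-related /parse options, validating server_url."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {BACKENDS}")
    if backend == "vlm-sglang-client":
        if not server_url:
            raise ValueError("vlm-sglang-client requires server_url")
        return {"backend": backend, "server_url": server_url}
    # server_url is only forwarded for the client backend.
    return {"backend": backend}
```

For instance, `backend_params("vlm-sglang-client", "http://127.0.0.1:30000")` returns both keys, while `backend_params("vlm-sglang-client")` raises `ValueError` before any request is made.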
## Language Support
Supported languages for the pipeline backend include:
- `ch`: Chinese (Simplified)
- `en`: English
- `korean`: Korean
- `japan`: Japanese
- `chinese_cht`: Chinese (Traditional)
- `ta`: Tamil
- `te`: Telugu
- `ka`: Kannada
## Output Files
The API generates various output files depending on the parameters:
1. **Markdown** (`.md`): Structured text content
2. **Middle JSON** (`.json`): Intermediate parsing results
3. **Model Output** (`.json` or `.txt`): Raw model predictions
4. **Content List** (`.json`): Structured content list
5. **Original PDF**: Copy of the input file
6. **Layout PDF**: PDF with layout bounding boxes
7. **Span PDF**: PDF with span bounding boxes
## Error Handling
The API returns appropriate HTTP status codes:
- `200`: Success
- `400`: Bad request (invalid parameters, unsupported file type)
- `404`: File not found
- `500`: Internal server error

Error responses include a detail message explaining the issue.
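
A minimal sketch of turning an error response into a readable message. It assumes the message is carried in a JSON field named `detail` (FastAPI's default for `HTTPException`), which this document does not spell out; the helper name `describe_error` is hypothetical.

```python
import json


def describe_error(status_code: int, body: str) -> str:
    """Format an error response, preferring the JSON "detail" field if present."""
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None  # body was not JSON (e.g. a plain-text 500 page)

    detail = payload.get("detail") if isinstance(payload, dict) else None
    if detail is None:
        # Fall back to the raw body so nothing is silently dropped.
        detail = body.strip() or "no detail provided"
    elif not isinstance(detail, str):
        # Structured details (lists or objects) are rendered as JSON.
        detail = json.dumps(detail)
    return f"HTTP {status_code}: {detail}"
```

With `requests`, this could be called as `describe_error(response.status_code, response.text)` whenever `response.ok` is false.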
## Testing
Use the provided test script to verify the API:
```bash
python test_mineru_api.py
```
## Notes
- The API creates unique output directories for each request to avoid conflicts
- Temporary files are automatically cleaned up after processing
- File downloads are restricted to the processed folder for security
- Large files may take time to process depending on the backend and document complexity