Compare commits

4 Commits
daf316bb92 ... 88b790dd6b

| Author | SHA1 | Date |
|---|---|---|
| | 88b790dd6b | |
| | d3e1927bc5 | |
| | e8cb7b1a04 | |
| | 1ba4f3cc02 | |
@ -12,7 +12,7 @@ RUN apt-get update && apt-get install -y \

 # Copy requirements first to leverage Docker cache
 COPY requirements.txt .
-RUN pip install huggingface_hub
+# RUN pip install huggingface_hub
-# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
+# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py

@ -20,7 +20,7 @@ RUN pip install huggingface_hub

 RUN pip install --no-cache-dir -r requirements.txt
-RUN pip install -U magic-pdf[full]
+# RUN pip install -U magic-pdf[full]

 # Copy the rest of the application
@ -0,0 +1,202 @@

# PDF Processor with Mineru API

## Overview

The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.

## Changes Made

### 1. Removed Dependencies
- Removed all `magic_pdf` imports and dependencies
- Removed direct `PyPDF2` usage (though kept in requirements for potential other uses)

### 2. New Implementation
- **REST API Integration**: Uses HTTP requests to call Mineru's API
- **Configurable Settings**: The Mineru API URL and timeout are configurable
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
- **Flexible Response Parsing**: Handles multiple possible response formats from the Mineru API

### 3. Configuration

Add the following settings to your environment or `.env` file:

```bash
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
```

### 4. API Endpoint

The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.

#### Expected Request Format:
```
POST /file_parse
Content-Type: multipart/form-data

files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
```
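
For reference, a request of this shape can be reproduced outside the processor. Below is a minimal sketch using Python's `requests` library; the file name `sample.pdf` and the service URL are assumptions for illustration, and only a subset of the form fields above is passed:

```python
import requests

# Assumptions for illustration: the service runs at the default URL
# and a local file named sample.pdf exists.
MINERU_API_URL = "http://mineru-api:8000"

with open("sample.pdf", "rb") as f:
    response = requests.post(
        f"{MINERU_API_URL}/file_parse",
        files={"files": ("sample.pdf", f, "application/pdf")},
        data={"backend": "pipeline", "parse_method": "auto", "return_md": True},
        timeout=300,  # large PDFs can take minutes to convert
    )

response.raise_for_status()
print(response.json())
```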

#### Expected Response Format:
The processor can handle multiple response formats:

```json
{
  "markdown": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "md": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "content": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "result": {
    "markdown": "# Document Title\n\nContent here..."
  }
}
```
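
A simplified sketch of the fallback order used when parsing these shapes (the actual `_extract_markdown_from_response` additionally checks a `text` key, a `data` wrapper, and list-shaped responses):

```python
from typing import Any, Dict

def extract_markdown(response: Dict[str, Any]) -> str:
    """Try the known top-level keys in order, then recurse into a nested 'result' dict."""
    for key in ("markdown", "md", "content"):
        if key in response:
            return response[key]
    result = response.get("result")
    if isinstance(result, dict):
        return extract_markdown(result)
    return ""  # the caller treats empty content as an error
```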

## Usage

### Basic Usage

```python
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor

# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")

# Read and convert PDF to markdown
content = processor.read_content()

# Process content (apply masking)
processed_content = processor.process_content(content)

# Save processed content
processor.save_content(processed_content)
```

### Through Document Service

```python
from app.core.services.document_service import DocumentService

service = DocumentService()
success = service.process_document("input.pdf", "output.md")
```

## Testing

Run the test script to verify the implementation:

```bash
cd backend
python test_pdf_processor.py
```

Make sure you have:
1. A sample PDF file in the `sample_doc/` directory
2. The Mineru API service running and accessible
3. Proper network connectivity between services

## Error Handling

The processor handles various error scenarios (a minimal handling sketch follows the list):

- **Network Timeouts**: Configurable timeout (default: 5 minutes)
- **API Errors**: HTTP status code errors are logged and handled
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
- **File Operations**: Proper error handling for file reading/writing
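
In calling code, these failures surface as exceptions raised by `read_content`. A minimal sketch, assuming the standard `logging` setup; the log message wording is illustrative:

```python
import logging

from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor

logger = logging.getLogger(__name__)

processor = PdfDocumentProcessor("input.pdf", "output.md")
try:
    content = processor.read_content()
except Exception as exc:  # raised when the API call fails or returns no content
    logger.error("PDF conversion failed: %s", exc)
    raise
```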

## Logging

The processor provides detailed logging for debugging:

- API call attempts and responses
- Content extraction results
- Error conditions and stack traces
- Processing statistics

## Deployment

### Docker Compose

Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.

### Environment Variables

Set the following environment variables in your deployment:

```bash
MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300
```

## Troubleshooting

### Common Issues

1. **Connection Refused**: Check that the Mineru service is running and accessible
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
3. **Empty Content**: Check the Mineru API response format and logs
4. **Network Issues**: Verify network connectivity between services

### Debug Mode

Enable debug logging to see detailed API interactions:

```python
import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
```

## Migration from magic_pdf

If you were previously using magic_pdf:

1. **No Code Changes Required**: The interface remains the same
2. **Configuration Update**: Add the Mineru API settings
3. **Service Dependencies**: Ensure the Mineru service is running
4. **Testing**: Run the test script to verify functionality

## Performance Considerations

- **Timeout**: Large PDFs may require longer timeouts
- **Memory**: The processor loads the entire PDF into memory for API calls
- **Network**: API calls add network latency to processing time
- **Caching**: Consider implementing caching for frequently processed documents (see the sketch below)
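
As an illustration of the caching idea, here is a minimal sketch that keys converted markdown by a hash of the PDF bytes; the cache directory and helper names are hypothetical, not part of the current code:

```python
import hashlib
import os

CACHE_DIR = ".mineru_cache"  # hypothetical location, not used by the processor

def cached_markdown(pdf_path: str, convert) -> str:
    """Return cached markdown for this PDF, calling `convert` only on a cache miss."""
    with open(pdf_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    cache_path = os.path.join(CACHE_DIR, f"{digest}.md")
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return f.read()
    markdown = convert(pdf_path)  # e.g. a function wrapping processor.read_content
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(markdown)
    return markdown
```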
@ -31,6 +31,17 @@ class Settings(BaseSettings):

     OLLAMA_API_KEY: str = ""
     OLLAMA_MODEL: str = "llama2"

+    # Mineru API settings
+    # MINERU_API_URL: str = "http://mineru-api:8001"
+    MINERU_API_URL: str = "http://host.docker.internal:8001"
+
+    MINERU_TIMEOUT: int = 300  # 5 minutes timeout
+    MINERU_LANG_LIST: list = ["ch"]  # Language list for parsing
+    MINERU_BACKEND: str = "pipeline"  # Backend to use
+    MINERU_PARSE_METHOD: str = "auto"  # Parse method
+    MINERU_FORMULA_ENABLE: bool = True  # Enable formula parsing
+    MINERU_TABLE_ENABLE: bool = True  # Enable table parsing

     # Logging settings
     LOG_LEVEL: str = "INFO"
     LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
@ -4,7 +4,7 @@ from .document_processor import DocumentProcessor

 from .processors import (
     TxtDocumentProcessor,
     # DocxDocumentProcessor,
-    # PdfDocumentProcessor,
+    PdfDocumentProcessor,
     MarkdownDocumentProcessor
 )

@ -17,7 +17,7 @@ class DocumentProcessorFactory:

     '.txt': TxtDocumentProcessor,
     # '.docx': DocxDocumentProcessor,
     # '.doc': DocxDocumentProcessor,
-    # '.pdf': PdfDocumentProcessor,
+    '.pdf': PdfDocumentProcessor,
     '.md': MarkdownDocumentProcessor,
     '.markdown': MarkdownDocumentProcessor
 }
@ -96,56 +96,136 @@ class NerProcessor:

         logger.info(f"Merged {len(unique_entities)} unique entities")
         return unique_entities

-    def _generate_masked_mapping(self, unique_entities: list[Dict[str, str]]) -> Dict[str, str]:
+    def _generate_masked_mapping(self, unique_entities: list[Dict[str, str]], linkage: Dict[str, Any]) -> Dict[str, str]:
+        """
+        Use the linkage information to map every entity in a group to the same masked name, applying these rules:
+        1. Person names / short forms: keep the surname, replace the given name with 某, and number entities sharing a surname;
+        2. Company names: companies in the same group map to an uppercase-letter company (A公司, B公司, ...);
+        3. English person names: the first letter of each word plus ***;
+        4. English company names: replace with the industry name in uppercase English (default COMPANY when no industry information is available);
+        5. Project names: map to lowercase English letters (a项目, b项目, ...);
+        6. Case numbers: replace only the digits with ***, keeping the surrounding structure and the character 号; embedded spaces are supported;
+        7. ID card numbers: six X characters;
+        8. Social credit codes: eight X characters;
+        9. Addresses: keep district-level and higher administrative divisions, dropping the detailed location;
+        10. Other types follow the original logic.
+        """
+        import re
         entity_mapping = {}
         used_masked_names = set()

+        group_mask_map = {}
+        surname_counter = {}
+        company_letter = ord('A')
+        project_letter = ord('a')
+        # District/county-level units take priority, then city, province, etc.
+        admin_keywords = [
+            '市辖区', '自治县', '自治旗', '林区', '区', '县', '旗', '州', '盟', '地区', '自治州',
+            '市', '省', '自治区', '特别行政区'
+        ]
+        admin_pattern = r"^(.*?(?:" + '|'.join(admin_keywords) + r"))"
+        for group in linkage.get('entity_groups', []):
+            group_type = group.get('group_type', '')
+            entities = group.get('entities', [])
+            if '公司' in group_type or 'Company' in group_type:
+                masked = chr(company_letter) + '公司'
+                company_letter += 1
+                for entity in entities:
+                    group_mask_map[entity['text']] = masked
+            elif '英文人名' in group_type:
+                # Checked before '人名', which is a substring of '英文人名'
+                for entity in entities:
+                    name = entity['text']
+                    if not name:
+                        continue
+                    masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
+                    group_mask_map[name] = masked
+            elif '人名' in group_type:
+                surname_local_counter = {}
+                for entity in entities:
+                    name = entity['text']
+                    if not name:
+                        continue
+                    surname = name[0]
+                    surname_local_counter.setdefault(surname, 0)
+                    surname_local_counter[surname] += 1
+                    if surname_local_counter[surname] == 1:
+                        masked = f"{surname}某"
+                    else:
+                        masked = f"{surname}某{surname_local_counter[surname]}"
+                    group_mask_map[name] = masked
         for entity in unique_entities:
-            original_text = entity['text'].strip()
+            text = entity['text']
             entity_type = entity.get('type', '')

-            if '人名' in entity_type or '英文人名' in entity_type:
-                base_name = '某'
-                masked_name = base_name
-                counter = 1
-
-                while masked_name in used_masked_names:
-                    if counter <= 10:
-                        suffixes = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']
-                        masked_name = base_name + suffixes[counter - 1]
-                    else:
-                        masked_name = f"{base_name}{counter}"
-                    counter += 1
-            elif '公司' in entity_type or 'Company' in entity_type:
-                base_name = '某公司'
-                masked_name = base_name
-                counter = 1
-
-                while masked_name in used_masked_names:
-                    if counter <= 10:
-                        suffixes = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']
-                        masked_name = base_name + suffixes[counter - 1]
-                    else:
-                        masked_name = f"{base_name}{counter}"
-                    counter += 1
+            if text in group_mask_map:
+                entity_mapping[text] = group_mask_map[text]
+                used_masked_names.add(group_mask_map[text])
+            elif '英文公司名' in entity_type or 'English Company' in entity_type:
+                industry = entity.get('industry', 'COMPANY')
+                masked = industry.upper()
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)
+            elif '项目名' in entity_type:
+                masked = chr(project_letter) + '项目'
+                project_letter += 1
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)
+            elif '案号' in entity_type:
+                masked = re.sub(r'(\d[\d\s]*)(号)', r'***\2', text)
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)
+            elif '身份证号' in entity_type:
+                masked = 'X' * 6
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)
+            elif '社会信用代码' in entity_type:
+                masked = 'X' * 8
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)
+            elif '地址' in entity_type:
+                # Keep district-level and higher administrative divisions, drop the detailed location
+                match = re.match(admin_pattern, text)
+                if match:
+                    masked = match.group(1)
+                else:
+                    masked = text  # fallback
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)
+            elif '英文人名' in entity_type:
+                # Checked before '人名', which is a substring of '英文人名'
+                name = text
+                masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)
+            elif '人名' in entity_type:
+                name = text
+                if not name:
+                    masked = '某'
+                else:
+                    surname = name[0]
+                    surname_counter.setdefault(surname, 0)
+                    surname_counter[surname] += 1
+                    if surname_counter[surname] == 1:
+                        masked = f"{surname}某"
+                    else:
+                        masked = f"{surname}某{surname_counter[surname]}"
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)
+            elif '公司' in entity_type or 'Company' in entity_type:
+                masked = chr(company_letter) + '公司'
+                company_letter += 1
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)
             else:
                 base_name = '某'
-                masked_name = base_name
+                masked = base_name
                 counter = 1

-                while masked_name in used_masked_names:
+                while masked in used_masked_names:
                     if counter <= 10:
                         suffixes = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']
-                        masked_name = base_name + suffixes[counter - 1]
+                        masked = base_name + suffixes[counter - 1]
                     else:
-                        masked_name = f"{base_name}{counter}"
+                        masked = f"{base_name}{counter}"
                     counter += 1

-                entity_mapping[original_text] = masked_name
-                used_masked_names.add(masked_name)
+                entity_mapping[text] = masked
+                used_masked_names.add(masked)

         logger.info(f"Generated masked mapping for {len(entity_mapping)} entities")
         return entity_mapping

     def _validate_linkage_format(self, linkage: Dict[str, Any]) -> bool:
@ -192,34 +272,10 @@ class NerProcessor:

         return {"entity_groups": []}

     def _apply_entity_linkage_to_mapping(self, entity_mapping: Dict[str, str], entity_linkage: Dict[str, Any]) -> Dict[str, str]:
-        updated_mapping = entity_mapping.copy()
-
-        for group in entity_linkage.get('entity_groups', []):
-            group_entities = group.get('entities', [])
-            if not group_entities:
-                continue
-
-            primary_entity = None
-            for entity in group_entities:
-                if entity.get('is_primary', False):
-                    primary_entity = entity
-                    break
-
-            if not primary_entity and group_entities:
-                primary_entity = group_entities[0]
-
-            if primary_entity:
-                primary_text = primary_entity['text']
-                primary_masked = updated_mapping.get(primary_text)
-
-                if primary_masked:
-                    for entity in group_entities:
-                        entity_text = entity['text']
-                        if entity_text in updated_mapping:
-                            updated_mapping[entity_text] = primary_masked
-                            logger.info(f"Linked entity '{entity_text}' to '{primary_text}' with masked name '{primary_masked}'")
-
-        return updated_mapping
+        """
+        The linkage has already been handled in _generate_masked_mapping, so entity_mapping is returned unchanged here.
+        """
+        return entity_mapping

     def process(self, chunks: list[str]) -> Dict[str, str]:
         chunk_mappings = []

@ -237,7 +293,10 @@ class NerProcessor:

         entity_linkage = self._create_entity_linkage(unique_entities)
         logger.info(f"Entity linkage: {entity_linkage}")

-        combined_mapping = self._generate_masked_mapping(unique_entities)
+        # for quick test
+        # unique_entities = [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
+        # entity_linkage = {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
+        combined_mapping = self._generate_masked_mapping(unique_entities, entity_linkage)
         logger.info(f"Combined mapping: {combined_mapping}")

         final_mapping = self._apply_entity_linkage_to_mapping(combined_mapping, entity_linkage)
@ -1,7 +1,7 @@

 from .txt_processor import TxtDocumentProcessor
 # from .docx_processor import DocxDocumentProcessor
-# from .pdf_processor import PdfDocumentProcessor
+from .pdf_processor import PdfDocumentProcessor
 from .md_processor import MarkdownDocumentProcessor

 # __all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
-__all__ = ['TxtDocumentProcessor', 'MarkdownDocumentProcessor']
+__all__ = ['TxtDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
@ -0,0 +1,204 @@

import os
import requests
import logging
from typing import Dict, Any, Optional
from ...document_handlers.document_processor import DocumentProcessor
from ...services.ollama_client import OllamaClient
from ...config import settings

logger = logging.getLogger(__name__)


class PdfDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        super().__init__()  # Call parent class's __init__
        self.input_path = input_path
        self.output_path = output_path
        self.output_dir = os.path.dirname(output_path)
        self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]

        # Setup work directory for temporary files
        self.work_dir = os.path.join(
            os.path.dirname(output_path),
            ".work",
            os.path.splitext(os.path.basename(input_path))[0]
        )
        os.makedirs(self.work_dir, exist_ok=True)

        self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)

        # Mineru API configuration
        self.mineru_base_url = getattr(settings, 'MINERU_API_URL', 'http://mineru-api:8000')
        self.mineru_timeout = getattr(settings, 'MINERU_TIMEOUT', 300)  # 5 minutes timeout
        self.mineru_lang_list = getattr(settings, 'MINERU_LANG_LIST', ['ch'])
        self.mineru_backend = getattr(settings, 'MINERU_BACKEND', 'pipeline')
        self.mineru_parse_method = getattr(settings, 'MINERU_PARSE_METHOD', 'auto')
        self.mineru_formula_enable = getattr(settings, 'MINERU_FORMULA_ENABLE', True)
        self.mineru_table_enable = getattr(settings, 'MINERU_TABLE_ENABLE', True)

    def _call_mineru_api(self, file_path: str) -> Optional[Dict[str, Any]]:
        """
        Call Mineru API to convert PDF to markdown

        Args:
            file_path: Path to the PDF file

        Returns:
            API response as dictionary or None if failed
        """
        try:
            url = f"{self.mineru_base_url}/file_parse"

            with open(file_path, 'rb') as file:
                files = {'files': (os.path.basename(file_path), file, 'application/pdf')}

                # Prepare form data according to Mineru API specification
                data = {
                    'output_dir': './output',
                    'lang_list': self.mineru_lang_list,
                    'backend': self.mineru_backend,
                    'parse_method': self.mineru_parse_method,
                    'formula_enable': self.mineru_formula_enable,
                    'table_enable': self.mineru_table_enable,
                    'return_md': True,
                    'return_middle_json': False,
                    'return_model_output': False,
                    'return_content_list': False,
                    'return_images': False,
                    'start_page_id': 0,
                    'end_page_id': 99999
                }

                logger.info(f"Calling Mineru API at {url}")
                response = requests.post(
                    url,
                    files=files,
                    data=data,
                    timeout=self.mineru_timeout
                )

                if response.status_code == 200:
                    result = response.json()
                    logger.info("Successfully received response from Mineru API")
                    return result
                else:
                    logger.error(f"Mineru API returned status code {response.status_code}: {response.text}")
                    return None

        except requests.exceptions.Timeout:
            logger.error(f"Mineru API request timed out after {self.mineru_timeout} seconds")
            return None
        except requests.exceptions.RequestException as e:
            logger.error(f"Error calling Mineru API: {str(e)}")
            return None
        except Exception as e:
            logger.error(f"Unexpected error calling Mineru API: {str(e)}")
            return None

    def _extract_markdown_from_response(self, response: Dict[str, Any]) -> str:
        """
        Extract markdown content from Mineru API response

        Args:
            response: Mineru API response dictionary

        Returns:
            Extracted markdown content as string
        """
        try:
            logger.debug(f"Mineru API response structure: {response}")

            # Try different possible response formats based on Mineru API
            if 'markdown' in response:
                return response['markdown']
            elif 'md' in response:
                return response['md']
            elif 'content' in response:
                return response['content']
            elif 'text' in response:
                return response['text']
            elif 'result' in response and isinstance(response['result'], dict):
                result = response['result']
                if 'markdown' in result:
                    return result['markdown']
                elif 'md' in result:
                    return result['md']
                elif 'content' in result:
                    return result['content']
                elif 'text' in result:
                    return result['text']
            elif 'data' in response and isinstance(response['data'], dict):
                data = response['data']
                if 'markdown' in data:
                    return data['markdown']
                elif 'md' in data:
                    return data['md']
                elif 'content' in data:
                    return data['content']
                elif 'text' in data:
                    return data['text']
            elif isinstance(response, list) and len(response) > 0:
                # If response is a list, try to extract from first item
                first_item = response[0]
                if isinstance(first_item, dict):
                    return self._extract_markdown_from_response(first_item)
                elif isinstance(first_item, str):
                    return first_item
            else:
                # If no standard format found, try to extract from the response structure
                logger.warning("Could not find standard markdown field in Mineru response")

                # Return the response as string if it's simple, or empty string
                if isinstance(response, str):
                    return response
                elif isinstance(response, dict):
                    # Try to find any text-like content
                    for key, value in response.items():
                        if isinstance(value, str) and len(value) > 100:  # Likely content
                            return value
                        elif isinstance(value, dict):
                            # Recursively search in nested dictionaries
                            nested_content = self._extract_markdown_from_response(value)
                            if nested_content:
                                return nested_content

            return ""

        except Exception as e:
            logger.error(f"Error extracting markdown from Mineru response: {str(e)}")
            return ""

    def read_content(self) -> str:
        logger.info("Starting PDF content processing with Mineru API")

        # Call Mineru API to convert PDF to markdown
        mineru_response = self._call_mineru_api(self.input_path)

        if not mineru_response:
            raise Exception("Failed to get response from Mineru API")

        # Extract markdown content from the response
        markdown_content = self._extract_markdown_from_response(mineru_response)

        if not markdown_content:
            raise Exception("No markdown content found in Mineru API response")

        logger.info(f"Successfully extracted {len(markdown_content)} characters of markdown content")

        # Save the raw markdown content to work directory for reference
        md_output_path = os.path.join(self.work_dir, f"{self.name_without_suff}.md")
        with open(md_output_path, 'w', encoding='utf-8') as file:
            file.write(markdown_content)

        logger.info(f"Saved raw markdown content to {md_output_path}")

        return markdown_content

    def save_content(self, content: str) -> None:
        # Ensure output path has .md extension
        output_dir = os.path.dirname(self.output_path)
        base_name = os.path.splitext(os.path.basename(self.output_path))[0]
        md_output_path = os.path.join(output_dir, f"{base_name}.md")

        logger.info(f"Saving masked content to: {md_output_path}")
        with open(md_output_path, 'w', encoding='utf-8') as file:
            file.write(content)
@ -1,105 +0,0 @@

import os
import PyPDF2
from ...document_handlers.document_processor import DocumentProcessor
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod
from ...prompts.masking_prompts import get_masking_prompt, get_masking_mapping_prompt
import logging
from ...services.ollama_client import OllamaClient
from ...config import settings

logger = logging.getLogger(__name__)


class PdfDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        super().__init__()  # Call parent class's __init__
        self.input_path = input_path
        self.output_path = output_path
        self.output_dir = os.path.dirname(output_path)
        self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]

        # Setup output directories
        self.local_image_dir = os.path.join(self.output_dir, "images")
        self.image_dir = os.path.basename(self.local_image_dir)
        os.makedirs(self.local_image_dir, exist_ok=True)

        # Setup work directory under output directory
        self.work_dir = os.path.join(
            os.path.dirname(output_path),
            ".work",
            os.path.splitext(os.path.basename(input_path))[0]
        )
        os.makedirs(self.work_dir, exist_ok=True)

        self.work_local_image_dir = os.path.join(self.work_dir, "images")
        self.work_image_dir = os.path.basename(self.work_local_image_dir)
        os.makedirs(self.work_local_image_dir, exist_ok=True)
        self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)

    def read_content(self) -> str:
        logger.info("Starting PDF content processing")

        # Read the PDF file
        with open(self.input_path, 'rb') as file:
            content = file.read()

        # Initialize writers
        image_writer = FileBasedDataWriter(self.work_local_image_dir)
        md_writer = FileBasedDataWriter(self.work_dir)

        # Create Dataset Instance
        ds = PymuDocDataset(content)

        logger.info("Classifying PDF type: %s", ds.classify())
        # Process based on PDF type
        if ds.classify() == SupportedPdfParseMethod.OCR:
            infer_result = ds.apply(doc_analyze, ocr=True)
            pipe_result = infer_result.pipe_ocr_mode(image_writer)
        else:
            infer_result = ds.apply(doc_analyze, ocr=False)
            pipe_result = infer_result.pipe_txt_mode(image_writer)

        logger.info("Generating all outputs")
        # Generate all outputs
        infer_result.draw_model(os.path.join(self.work_dir, f"{self.name_without_suff}_model.pdf"))
        model_inference_result = infer_result.get_infer_res()

        pipe_result.draw_layout(os.path.join(self.work_dir, f"{self.name_without_suff}_layout.pdf"))
        pipe_result.draw_span(os.path.join(self.work_dir, f"{self.name_without_suff}_spans.pdf"))

        md_content = pipe_result.get_markdown(self.work_image_dir)
        pipe_result.dump_md(md_writer, f"{self.name_without_suff}.md", self.work_image_dir)

        content_list = pipe_result.get_content_list(self.work_image_dir)
        pipe_result.dump_content_list(md_writer, f"{self.name_without_suff}_content_list.json", self.work_image_dir)

        middle_json = pipe_result.get_middle_json()
        pipe_result.dump_middle_json(md_writer, f'{self.name_without_suff}_middle.json')

        return md_content

    # def process_content(self, content: str) -> str:
    #     logger.info("Starting content masking process")
    #     sentences = content.split("。")
    #     final_md = ""
    #     for sentence in sentences:
    #         if not sentence.strip():  # Skip empty sentences
    #             continue
    #         formatted_prompt = get_masking_mapping_prompt(sentence)
    #         logger.info("Calling ollama to generate response, prompt: %s", formatted_prompt)
    #         response = self.ollama_client.generate(formatted_prompt)
    #         logger.info(f"Response generated: {response}")
    #         final_md += response + "。"
    #     return final_md

    def save_content(self, content: str) -> None:
        # Ensure output path has .md extension
        output_dir = os.path.dirname(self.output_path)
        base_name = os.path.splitext(os.path.basename(self.output_path))[0]
        md_output_path = os.path.join(output_dir, f"{base_name}.md")

        logger.info(f"Saving masked content to: {md_output_path}")
        with open(md_output_path, 'w', encoding='utf-8') as file:
            file.write(content)
@ -90,9 +90,11 @@ class LLMResponseValidator:

         """
         try:
             validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
+            logger.debug(f"Entity extraction validation passed for response: {response}")
             return True
         except ValidationError as e:
-            logger.warning(f"Entity extraction validation error: {e}")
+            logger.warning(f"Entity extraction validation failed: {e}")
+            logger.warning(f"Response that failed validation: {response}")
             return False

     @classmethod

@ -108,9 +110,16 @@ class LLMResponseValidator:

         """
         try:
             validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
-            return cls._validate_linkage_content(response)
+            content_valid = cls._validate_linkage_content(response)
+            if content_valid:
+                logger.debug(f"Entity linkage validation passed for response: {response}")
+                return True
+            else:
+                logger.warning(f"Entity linkage content validation failed for response: {response}")
+                return False
         except ValidationError as e:
-            logger.warning(f"Entity linkage validation error: {e}")
+            logger.warning(f"Entity linkage validation failed: {e}")
+            logger.warning(f"Response that failed validation: {response}")
             return False

     @classmethod

@ -126,9 +135,11 @@ class LLMResponseValidator:

         """
         try:
             validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
+            logger.debug(f"Regex entity validation passed for response: {response}")
             return True
         except ValidationError as e:
-            logger.warning(f"Regex entity validation error: {e}")
+            logger.warning(f"Regex entity validation failed: {e}")
+            logger.warning(f"Response that failed validation: {response}")
             return False

     @classmethod
@ -0,0 +1,127 @@

[2025-07-14 14:20:19,015: INFO/ForkPoolWorker-4] Raw response from LLM: {
celery_worker-1 | "entities": []
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:20:19,016: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
celery_worker-1 | [2025-07-14 14:20:19,020: INFO/ForkPoolWorker-4] Calling ollama to generate case numbers mapping for chunk (attempt 1/3):
celery_worker-1 | 你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
celery_worker-1 |
celery_worker-1 | 实体类别包括:
celery_worker-1 | - 案号
celery_worker-1 |
celery_worker-1 | 待处理文本:
celery_worker-1 |
celery_worker-1 |
celery_worker-1 | 二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
celery_worker-1 |
celery_worker-1 | 29. 本判决为终审判决。
celery_worker-1 |
celery_worker-1 | 审 判 长 史晓霞审 判 员 邓青菁审 判 员 李 淼二〇二二年七月七日法 官 助 理 黎 铧书 记 员 郑海兴
celery_worker-1 |
celery_worker-1 | 输出格式:
celery_worker-1 | {
celery_worker-1 | "entities": [
celery_worker-1 | {"text": "原始文本内容", "type": "案号"},
celery_worker-1 | ...
celery_worker-1 | ]
celery_worker-1 | }
celery_worker-1 |
celery_worker-1 | 请严格按照JSON格式输出结果。
celery_worker-1 |
api-1 | INFO: 192.168.65.1:60045 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:22084 - "GET /api/v1/files/files HTTP/1.1" 200 OK
celery_worker-1 | [2025-07-14 14:20:31,279: INFO/ForkPoolWorker-4] Raw response from LLM: {
celery_worker-1 | "entities": []
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:20:31,281: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
celery_worker-1 | [2025-07-14 14:20:31,287: INFO/ForkPoolWorker-4] Chunk mapping: [{'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Final chunk mappings: [{'entities': [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}]}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}]}, {'entities': [{'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}]}, {'entities': [{'text': '服务合同', 'type': '项目名'}]}, {'entities': [{'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}]}, {'entities': [{'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}]}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}]}, {'entities': [{'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}]}, {'entities': [{'text': '《计算机设备采购合同》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': [{'text': '《服务合同书》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '中研智创公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Merged 22 unique entities
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Unique entities: [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
celery_worker-1 | [2025-07-14 14:20:31,289: INFO/ForkPoolWorker-4] Calling ollama to generate entity linkage (attempt 1/3)
api-1 | INFO: 192.168.65.1:52168 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61426 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:30702 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:48159 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:16860 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:21262 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:45564 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:32142 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:27769 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:21196 - "GET /api/v1/files/files HTTP/1.1" 200 OK
celery_worker-1 | [2025-07-14 14:21:21,436: INFO/ForkPoolWorker-4] Raw entity linkage response from LLM: {
celery_worker-1 | "entity_groups": [
celery_worker-1 | {
celery_worker-1 | "group_id": "group_1",
celery_worker-1 | "group_type": "公司名称",
celery_worker-1 | "entities": [
celery_worker-1 | {
celery_worker-1 | "text": "北京丰复久信营销科技有限公司",
celery_worker-1 | "type": "公司名称",
celery_worker-1 | "is_primary": true
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "丰复久信公司",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "丰复久信",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | }
celery_worker-1 | ]
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "group_id": "group_2",
celery_worker-1 | "group_type": "公司名称",
celery_worker-1 | "entities": [
celery_worker-1 | {
celery_worker-1 | "text": "中研智创区块链技术有限公司",
celery_worker-1 | "type": "公司名称",
celery_worker-1 | "is_primary": true
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "中研智创公司",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "中研智创",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | }
celery_worker-1 | ]
celery_worker-1 | }
celery_worker-1 | ]
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:21:21,437: INFO/ForkPoolWorker-4] Parsed entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Successfully created entity linkage with 2 groups
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Generated masked mapping for 22 entities
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Combined mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司甲', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '(2020)京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司丁', '丰复久信': '某公司戊', '中研智创': '某公司己', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '北京丰复久信营销科技有限公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创区块链技术有限公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Final mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '(2020)京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司乙', '丰复久信': '某公司', '中研智创': '某公司乙', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Successfully masked content
celery_worker-1 | [2025-07-14 14:21:21,449: INFO/ForkPoolWorker-4] Successfully saved masked content to /app/storage/processed/47522ea9-c259-4304-bfe4-1d3ed6902ede.md
celery_worker-1 | [2025-07-14 14:21:21,470: INFO/ForkPoolWorker-4] Task app.services.file_service.process_file[5cfbca4c-0f6f-4c71-a66b-b22ee2d28139] succeeded in 311.847165101s: None
api-1 | INFO: 192.168.65.1:33432 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:40073 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:29550 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61350 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61755 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:63726 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:43446 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:45624 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:25256 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:43464 - "GET /api/v1/files/files HTTP/1.1" 200 OK
@ -28,5 +28,5 @@ requests==2.28.1

 python-docx>=0.8.11
 PyPDF2>=3.0.0
 pandas>=2.0.0
-magic-pdf[full]
+# magic-pdf[full]
 jsonschema>=4.20.0
@ -0,0 +1,62 @@

import pytest
from app.core.document_handlers.ner_processor import NerProcessor


def test_generate_masked_mapping():
    processor = NerProcessor()
    unique_entities = [
        {'text': '李雷', 'type': '人名'},
        {'text': '李明', 'type': '人名'},
        {'text': '王强', 'type': '人名'},
        {'text': 'Acme Manufacturing Inc.', 'type': '英文公司名', 'industry': 'manufacturing'},
        {'text': 'Google LLC', 'type': '英文公司名'},
        {'text': 'A公司', 'type': '公司名称'},
        {'text': 'B公司', 'type': '公司名称'},
        {'text': 'John Smith', 'type': '英文人名'},
        {'text': 'Elizabeth Windsor', 'type': '英文人名'},
        {'text': '华梦龙光伏项目', 'type': '项目名'},
        {'text': '12345号', 'type': '案号'},
        {'text': '310101198802080000', 'type': '身份证号'},
        {'text': '9133021276453538XT', 'type': '社会信用代码'},
    ]
    linkage = {
        'entity_groups': [
            {
                'group_id': 'g1',
                'group_type': '公司名称',
                'entities': [
                    {'text': 'A公司', 'type': '公司名称', 'is_primary': True},
                    {'text': 'B公司', 'type': '公司名称', 'is_primary': False},
                ]
            },
            {
                'group_id': 'g2',
                'group_type': '人名',
                'entities': [
                    {'text': '李雷', 'type': '人名', 'is_primary': True},
                    {'text': '李明', 'type': '人名', 'is_primary': False},
                ]
            }
        ]
    }
    mapping = processor._generate_masked_mapping(unique_entities, linkage)
    # Person names
    assert mapping['李雷'].startswith('李某')
    assert mapping['李明'].startswith('李某')
    assert mapping['王强'].startswith('王某')
    # English company names
    assert mapping['Acme Manufacturing Inc.'] == 'MANUFACTURING'
    assert mapping['Google LLC'] == 'COMPANY'
    # Company names in the same group share one masked name
    assert mapping['A公司'] == mapping['B公司']
    assert mapping['A公司'].endswith('公司')
    # English person names
    assert mapping['John Smith'] == 'J*** S***'
    assert mapping['Elizabeth Windsor'] == 'E*** W***'
    # Project names
    assert mapping['华梦龙光伏项目'].endswith('项目')
    # Case numbers: only the digits are masked, the trailing 号 is kept
    assert mapping['12345号'] == '***号'
    # ID card numbers
    assert mapping['310101198802080000'] == 'XXXXXX'
    # Social credit codes
    assert mapping['9133021276453538XT'] == 'XXXXXXXX'
@ -7,7 +7,7 @@ services:

     dockerfile: Dockerfile
     platform: linux/arm64
     ports:
-      - "8000:8000"
+      - "8001:8000"
     volumes:
       - ./storage/uploads:/app/storage/uploads
       - ./storage/processed:/app/storage/processed