Merge pull request 'feature-ner-keyword-detect' (#1) from feature-ner-keyword-detect into main

Reviewed-on: #1
This commit is contained in:
tigeren 2025-07-20 13:43:59 +00:00
commit 56c718d658
78 changed files with 21315 additions and 591 deletions

View File

@ -1,19 +0,0 @@
# Storage paths
OBJECT_STORAGE_PATH=/path/to/mounted/object/storage
TARGET_DIRECTORY_PATH=/path/to/target/directory
# Ollama API Configuration
OLLAMA_API_URL=https://api.ollama.com
OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=llama2
# Application Settings
MONITOR_INTERVAL=5
# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log
# Optional: Additional security settings
# MAX_FILE_SIZE=10485760 # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf

5
.gitignore vendored
View File

@ -70,4 +70,7 @@ app.log
__pycache__
data/doc_dest
data/doc_src
data/doc_intermediate
data/doc_intermediate
node_modules
backend/storage/

206
DOCKER_COMPOSE_README.md Normal file
View File

@ -0,0 +1,206 @@
# Unified Docker Compose Setup
This project now includes a unified Docker Compose configuration that allows all services (mineru, backend, frontend) to run together and communicate using service names.
## Architecture
The unified setup includes the following services:
- **mineru-api**: Document processing service (port 8001)
- **backend-api**: Main API service (port 8000)
- **celery-worker**: Background task processor
- **redis**: Message broker for Celery
- **frontend**: React frontend application (port 3000)
## Network Configuration
All services are connected through a custom bridge network called `app-network`, allowing them to communicate using service names:
- Backend → Mineru: `http://mineru-api:8000`
- Frontend → Backend: `http://localhost:8000/api/v1` (external access)
- Backend → Redis: `redis://redis:6379/0`
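For orientation, here is a minimal sketch of how this might look in `docker-compose.yml` (service definitions heavily abbreviated; the actual file also carries build, port, and volume settings):
```yaml
networks:
  app-network:
    driver: bridge

services:
  mineru-api:
    networks:
      - app-network
  backend-api:
    environment:
      - MINERU_API_URL=http://mineru-api:8000 # resolved through the service name
    networks:
      - app-network
  redis:
    networks:
      - app-network
```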
## Usage
### Starting all services
```bash
# From the root directory
docker-compose up -d
```
### Starting specific services
```bash
# Start only backend and mineru
docker-compose up -d backend-api mineru-api redis
# Start only frontend and backend
docker-compose up -d frontend backend-api redis
```
### Stopping services
```bash
# Stop all services
docker-compose down
# Stop and remove volumes
docker-compose down -v
```
### Viewing logs
```bash
# View all logs
docker-compose logs -f
# View specific service logs
docker-compose logs -f backend-api
docker-compose logs -f mineru-api
docker-compose logs -f frontend
```
## Building Services
### Building all services
```bash
# Build all services
docker-compose build
# Build and start all services
docker-compose up -d --build
```
### Building individual services
```bash
# Build only backend
docker-compose build backend-api
# Build only frontend
docker-compose build frontend
# Build only mineru
docker-compose build mineru-api
# Build multiple specific services
docker-compose build backend-api frontend
```
### Building and restarting specific services
```bash
# Build and restart only backend
docker-compose build backend-api
docker-compose up -d backend-api
# Or combine in one command
docker-compose up -d --build backend-api
# Build and restart backend and celery worker
docker-compose up -d --build backend-api celery-worker
```
### Force rebuild (no cache)
```bash
# Force rebuild all services
docker-compose build --no-cache
# Force rebuild specific service
docker-compose build --no-cache backend-api
```
## Environment Variables
The unified setup uses environment variables from the individual service `.env` files:
- `./backend/.env` - Backend configuration
- `./frontend/.env` - Frontend configuration
- `./mineru/.env` - Mineru configuration (if exists)
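One common way to wire these in (the actual `docker-compose.yml` may differ) is with per-service `env_file` entries:
```yaml
services:
  backend-api:
    env_file:
      - ./backend/.env
  frontend:
    env_file:
      - ./frontend/.env
```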
### Key Configuration Changes
1. **Backend Configuration** (`backend/app/core/config.py`):
```python
MINERU_API_URL: str = "http://mineru-api:8000"
```
2. **Frontend Configuration**:
```env
REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
```
## Service Dependencies
- `backend-api` depends on `redis` and `mineru-api`
- `celery-worker` depends on `redis` and `backend-api`
- `frontend` depends on `backend-api`
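In compose terms, the dependency graph above corresponds roughly to this sketch:
```yaml
services:
  backend-api:
    depends_on:
      - redis
      - mineru-api
  celery-worker:
    depends_on:
      - redis
      - backend-api
  frontend:
    depends_on:
      - backend-api
```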
## Port Mapping
- **Frontend**: `http://localhost:3000`
- **Backend API**: `http://localhost:8000`
- **Mineru API**: `http://localhost:8001`
- **Redis**: `localhost:6379`
## Health Checks
The mineru-api service includes a health check that verifies the service is running properly.
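A typical health check looks roughly like the sketch below; the exact command and intervals in the real compose file may differ, and the `/health` path is an assumption here:
```yaml
services:
  mineru-api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```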
## Development vs Production
For development, you can still use the individual docker-compose files in each service directory. The unified setup is ideal for:
- Production deployments
- End-to-end testing
- Simplified development environment
## Troubleshooting
### Service Communication Issues
If services can't communicate:
1. Check if all services are running: `docker-compose ps`
2. Verify network connectivity: `docker network ls`
3. Check service logs: `docker-compose logs [service-name]`
### Port Conflicts
If you get port conflicts, you can modify the port mappings in the `docker-compose.yml` file:
```yaml
ports:
- "8002:8000" # Change external port
```
### Volume Issues
Make sure the storage directories exist:
```bash
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
```
## Migration from Individual Compose Files
If you were previously using individual docker-compose files:
1. Stop all individual services:
```bash
cd backend && docker-compose down
cd ../frontend && docker-compose down
cd ../mineru && docker-compose down
```
2. Start the unified setup:
```bash
cd .. && docker-compose up -d
```
The unified setup maintains the same functionality while providing better service discovery and networking.

399
DOCKER_MIGRATION_GUIDE.md Normal file
View File

@ -0,0 +1,399 @@
# Docker Image Migration Guide
This guide explains how to export your built Docker images, transfer them to another environment, and run them without rebuilding.
## Overview
The migration process involves:
1. **Export**: Save built images to tar files
2. **Transfer**: Copy tar files to target environment
3. **Import**: Load images on target environment
4. **Run**: Start services with imported images
## Prerequisites
### Source Environment (where images are built)
- Docker installed and running
- All services built and working
- Sufficient disk space for image export
### Target Environment (where images will run)
- Docker installed and running
- Sufficient disk space for image import
- Network access to source environment (or USB drive)
## Step 1: Export Docker Images
### 1.1 List Current Images
First, check what images you have:
```bash
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
```
You should see images like:
- `legal-doc-masker-backend-api`
- `legal-doc-masker-frontend`
- `legal-doc-masker-mineru-api`
- `redis:alpine`
### 1.2 Export Individual Images
Create a directory for exports:
```bash
mkdir -p docker-images-export
cd docker-images-export
```
Export each image:
```bash
# Export backend image
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
# Export frontend image
docker save legal-doc-masker-frontend:latest -o frontend.tar
# Export mineru image
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
# Export redis image (if not using official)
docker save redis:alpine -o redis.tar
```
### 1.3 Export All Images at Once (Alternative)
If you want to export all images in one command:
```bash
# Export all project images
docker save \
legal-doc-masker-backend-api:latest \
legal-doc-masker-frontend:latest \
legal-doc-masker-mineru-api:latest \
redis:alpine \
-o legal-doc-masker-all.tar
```
### 1.4 Verify Export Files
Check the exported files:
```bash
ls -lh *.tar
```
You should see files like:
- `backend-api.tar` (~200-500MB)
- `frontend.tar` (~100-300MB)
- `mineru-api.tar` (~1-3GB)
- `redis.tar` (~30-50MB)
## Step 2: Transfer Images
### 2.1 Transfer via Network (SCP/RSYNC)
```bash
# Transfer to remote server
scp *.tar user@remote-server:/path/to/destination/
# Or using rsync (more efficient for large files)
rsync -avz --progress *.tar user@remote-server:/path/to/destination/
```
### 2.2 Transfer via USB Drive
```bash
# Copy to USB drive
cp *.tar /Volumes/USB_DRIVE/docker-images/
# Or create a compressed archive
tar -czf legal-doc-masker-images.tar.gz *.tar
cp legal-doc-masker-images.tar.gz /Volumes/USB_DRIVE/
```
### 2.3 Transfer via Cloud Storage
```bash
# Upload to cloud storage (example with AWS S3)
aws s3 cp *.tar s3://your-bucket/docker-images/
# Or using Google Cloud Storage
gsutil cp *.tar gs://your-bucket/docker-images/
```
## Step 3: Import Images on Target Environment
### 3.1 Prepare Target Environment
```bash
# Create directory for images
mkdir -p docker-images-import
cd docker-images-import
# Copy images from transfer method
# (SCP, USB, or download from cloud storage)
```
### 3.2 Import Individual Images
```bash
# Import backend image
docker load -i backend-api.tar
# Import frontend image
docker load -i frontend.tar
# Import mineru image
docker load -i mineru-api.tar
# Import redis image
docker load -i redis.tar
```
### 3.3 Import All Images at Once (if exported together)
```bash
docker load -i legal-doc-masker-all.tar
```
### 3.4 Verify Imported Images
```bash
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
```
## Step 4: Prepare Target Environment
### 4.1 Copy Project Files
Transfer the following files to the target environment:
```bash
# Essential files to copy
docker-compose.yml
DOCKER_COMPOSE_README.md
setup-unified-docker.sh
# Environment files (if they exist)
backend/.env
frontend/.env
mineru/.env
# Storage directories (if you want to preserve data)
backend/storage/
mineru/storage/
backend/legal_doc_masker.db
```
### 4.2 Create Directory Structure
```bash
# Create necessary directories
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
```
## Step 5: Run Services
### 5.1 Start All Services
```bash
# Start all services using imported images
docker-compose up -d
```
### 5.2 Verify Services
```bash
# Check service status
docker-compose ps
# Check service logs
docker-compose logs -f
```
### 5.3 Test Endpoints
```bash
# Test frontend
curl -I http://localhost:3000
# Test backend API
curl -I http://localhost:8000/api/v1
# Test mineru API
curl -I http://localhost:8001/health
```
## Automation Scripts
### Export Script
Create `export-images.sh`:
```bash
#!/bin/bash
set -e
echo "🚀 Exporting Docker Images"
# Create export directory
mkdir -p docker-images-export
cd docker-images-export
# Export images
echo "📦 Exporting backend-api image..."
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
echo "📦 Exporting frontend image..."
docker save legal-doc-masker-frontend:latest -o frontend.tar
echo "📦 Exporting mineru-api image..."
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
echo "📦 Exporting redis image..."
docker save redis:alpine -o redis.tar
# Show file sizes
echo "📊 Export complete. File sizes:"
ls -lh *.tar
echo "✅ Images exported successfully!"
```
### Import Script
Create `import-images.sh`:
```bash
#!/bin/bash
set -e
echo "🚀 Importing Docker Images"
# Check if tar files exist
if [ ! -f "backend-api.tar" ]; then
echo "❌ backend-api.tar not found"
exit 1
fi
# Import images
echo "📦 Importing backend-api image..."
docker load -i backend-api.tar
echo "📦 Importing frontend image..."
docker load -i frontend.tar
echo "📦 Importing mineru-api image..."
docker load -i mineru-api.tar
echo "📦 Importing redis image..."
docker load -i redis.tar
# Verify imports
echo "📊 Imported images:"
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker
echo "✅ Images imported successfully!"
```
## Troubleshooting
### Common Issues
1. **Image not found during import**
```bash
# Check if image exists
docker images | grep image-name
# Re-export if needed
docker save image-name:tag -o image-name.tar
```
2. **Port conflicts on target environment**
```bash
# Check what's using the ports
lsof -i :8000
lsof -i :8001
lsof -i :3000
# Modify docker-compose.yml if needed
ports:
- "8002:8000" # Change external port
```
3. **Permission issues**
```bash
# Fix file permissions
chmod +x setup-unified-docker.sh
chmod +x export-images.sh
chmod +x import-images.sh
```
4. **Storage directory issues**
```bash
# Create directories with proper permissions
sudo mkdir -p backend/storage
sudo mkdir -p mineru/storage/uploads
sudo mkdir -p mineru/storage/processed
sudo chown -R $USER:$USER backend/storage mineru/storage
```
### Performance Optimization
1. **Compress images for transfer**
```bash
# Compress before transfer
gzip *.tar
# Decompress on target
gunzip *.tar.gz
```
2. **Use parallel transfer**
```bash
# Transfer multiple files in parallel
parallel scp {} user@server:/path/ ::: *.tar
```
3. **Use Docker registry (alternative)**
```bash
# Push to registry
docker tag legal-doc-masker-backend-api:latest your-registry/backend-api:latest
docker push your-registry/backend-api:latest
# Pull on target
docker pull your-registry/backend-api:latest
```
## Complete Migration Checklist
- [ ] Export all Docker images
- [ ] Transfer image files to target environment
- [ ] Transfer project configuration files
- [ ] Import images on target environment
- [ ] Create necessary directories
- [ ] Start services
- [ ] Verify all services are running
- [ ] Test all endpoints
- [ ] Update any environment-specific configurations
## Security Considerations
1. **Secure transfer**: Use encrypted transfer methods (SCP, SFTP)
2. **Image verification**: Verify image integrity after transfer
3. **Environment isolation**: Ensure target environment is properly secured
4. **Access control**: Limit access to Docker daemon on target environment
## Cost Optimization
1. **Image size**: Remove unnecessary layers before export
2. **Compression**: Use compression for large images
3. **Selective transfer**: Only transfer images you need
4. **Cleanup**: Remove old images after successful migration

View File

@ -1,48 +0,0 @@
# Build stage
FROM python:3.12-slim AS builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements first to leverage Docker cache
COPY requirements.txt .
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt
# Final stage
FROM python:3.12-slim
WORKDIR /app
# Create non-root user
RUN useradd -m -r appuser && \
chown appuser:appuser /app
# Copy wheels from builder
COPY --from=builder /app/wheels /wheels
COPY --from=builder /app/requirements.txt .
# Install dependencies
RUN pip install --no-cache /wheels/*
# Copy application code
COPY src/ ./src/
# Create directories for mounted volumes
RUN mkdir -p /data/input /data/output && \
chown -R appuser:appuser /data
# Switch to non-root user
USER appuser
# Environment variables
ENV PYTHONPATH=/app \
OBJECT_STORAGE_PATH=/data/input \
TARGET_DIRECTORY_PATH=/data/output
# Run the application
CMD ["python", "src/main.py"]

View File

@ -0,0 +1,178 @@
# Docker Migration Quick Reference
## 🚀 Quick Migration Process
### Source Environment (Export)
```bash
# 1. Build images first (if not already built)
docker-compose build
# 2. Export all images
./export-images.sh
# 3. Transfer files to target environment
# Option A: SCP
scp -r docker-images-export-*/ user@target-server:/path/to/destination/
# Option B: USB Drive
cp -r docker-images-export-*/ /Volumes/USB_DRIVE/
# Option C: Compressed archive
scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/
```
### Target Environment (Import)
```bash
# 1. Copy project files
scp docker-compose.yml user@target-server:/path/to/destination/
scp DOCKER_COMPOSE_README.md user@target-server:/path/to/destination/
# 2. Import images
./import-images.sh
# 3. Start services
docker-compose up -d
# 4. Verify
docker-compose ps
```
## 📋 Essential Files to Transfer
### Required Files
- `docker-compose.yml` - Unified compose configuration
- `DOCKER_COMPOSE_README.md` - Documentation
- `backend/.env` - Backend environment variables
- `frontend/.env` - Frontend environment variables
- `mineru/.env` - Mineru environment variables (if exists)
### Optional Files (for data preservation)
- `backend/storage/` - Backend storage directory
- `mineru/storage/` - Mineru storage directory
- `backend/legal_doc_masker.db` - Database file
## 🔧 Common Commands
### Export Commands
```bash
# Manual export
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
docker save legal-doc-masker-frontend:latest -o frontend.tar
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
docker save redis:alpine -o redis.tar
# Compress for transfer
tar -czf legal-doc-masker-images.tar.gz *.tar
```
### Import Commands
```bash
# Manual import
docker load -i backend-api.tar
docker load -i frontend.tar
docker load -i mineru-api.tar
docker load -i redis.tar
# Extract compressed archive
tar -xzf legal-doc-masker-images.tar.gz
```
### Service Management
```bash
# Start all services
docker-compose up -d
# Stop all services
docker-compose down
# View logs
docker-compose logs -f [service-name]
# Check status
docker-compose ps
```
### Building Individual Services
```bash
# Build specific service only
docker-compose build backend-api
docker-compose build frontend
docker-compose build mineru-api
# Build and restart specific service
docker-compose up -d --build backend-api
# Force rebuild (no cache)
docker-compose build --no-cache backend-api
# Using the build script
./build-service.sh backend-api --restart
./build-service.sh frontend --no-cache
./build-service.sh backend-api celery-worker
```
## 🌐 Service URLs
After successful migration:
- **Frontend**: http://localhost:3000
- **Backend API**: http://localhost:8000
- **Mineru API**: http://localhost:8001
## ⚠️ Troubleshooting
### Port Conflicts
```bash
# Check what's using ports
lsof -i :8000
lsof -i :8001
lsof -i :3000
# Modify docker-compose.yml if needed
ports:
- "8002:8000" # Change external port
```
### Permission Issues
```bash
# Fix script permissions
chmod +x export-images.sh
chmod +x import-images.sh
chmod +x setup-unified-docker.sh
# Fix directory permissions
sudo chown -R $USER:$USER backend/storage mineru/storage
```
### Disk Space Issues
```bash
# Check available space
df -h
# Clean up Docker
docker system prune -a
```
## 📊 Expected File Sizes
- `backend-api.tar`: ~200-500MB
- `frontend.tar`: ~100-300MB
- `mineru-api.tar`: ~1-3GB
- `redis.tar`: ~30-50MB
- `legal-doc-masker-images.tar.gz`: ~1-2GB (compressed)
## 🔒 Security Notes
1. Use encrypted transfer (SCP, SFTP) for sensitive environments
2. Verify image integrity after transfer
3. Update environment variables for target environment
4. Ensure proper network security on target environment
## 📞 Support
If you encounter issues:
1. Check the full `DOCKER_MIGRATION_GUIDE.md`
2. Verify all required files are present
3. Check Docker logs: `docker-compose logs -f`
4. Ensure sufficient disk space and permissions

20
backend/.env Normal file
View File

@ -0,0 +1,20 @@
# Storage paths
OBJECT_STORAGE_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_src
TARGET_DIRECTORY_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_dest
INTERMEDIATE_DIR_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_intermediate
# Ollama API Configuration
OLLAMA_API_URL=http://192.168.2.245:11434
# OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=qwen3:8b
# Application Settings
MONITOR_INTERVAL=5
# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log
# Optional: Additional security settings
# MAX_FILE_SIZE=10485760 # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf

36
backend/Dockerfile Normal file
View File

@ -0,0 +1,36 @@
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
libreoffice \
wget \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements first to leverage Docker cache
COPY requirements.txt .
# RUN pip install huggingface_hub
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py
# RUN python download_models_hf.py
RUN pip install --no-cache-dir -r requirements.txt
# RUN pip install -U magic-pdf[full]
# Copy the rest of the application
COPY . .
# Create storage directories
RUN mkdir -p storage/uploads storage/processed
# Expose the port the app runs on
EXPOSE 8000
# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

View File

@ -0,0 +1,202 @@
# PDF Processor with Mineru API
## Overview
The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.
## Changes Made
### 1. Removed Dependencies
- Removed all `magic_pdf` imports and dependencies
- Removed `PyPDF2` direct usage (though kept in requirements for potential other uses)
### 2. New Implementation
- **REST API Integration**: Uses HTTP requests to call Mineru's API
- **Configurable Settings**: Mineru API URL and timeout are configurable
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
- **Flexible Response Parsing**: Handles multiple possible response formats from Mineru API
### 3. Configuration
Add the following settings to your environment or `.env` file:
```bash
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
```
### 4. API Endpoint
The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.
#### Expected Request Format:
```
POST /file_parse
Content-Type: multipart/form-data
files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
```
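For ad-hoc testing, the same request can be issued directly with `requests`. This is only a sketch built from the fields listed above; the URL and file path are placeholders:
```python
import requests

MINERU_API_URL = "http://mineru-api:8000"  # placeholder; use your deployment's URL

with open("sample_doc/example.pdf", "rb") as f:
    response = requests.post(
        f"{MINERU_API_URL}/file_parse",
        files={"files": ("example.pdf", f, "application/pdf")},
        data={
            "lang_list": ["ch"],
            "backend": "pipeline",
            "parse_method": "auto",
            "formula_enable": True,
            "table_enable": True,
            "return_md": True,
        },
        timeout=300,
    )

response.raise_for_status()
print(response.json())  # should contain the markdown content in one of the formats below
```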
#### Expected Response Format:
The processor can handle multiple response formats:
```json
{
"markdown": "# Document Title\n\nContent here..."
}
```
OR
```json
{
"md": "# Document Title\n\nContent here..."
}
```
OR
```json
{
"content": "# Document Title\n\nContent here..."
}
```
OR
```json
{
"result": {
"markdown": "# Document Title\n\nContent here..."
}
}
```
## Usage
### Basic Usage
```python
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor
# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")
# Read and convert PDF to markdown
content = processor.read_content()
# Process content (apply masking)
processed_content = processor.process_content(content)
# Save processed content
processor.save_content(processed_content)
```
### Through Document Service
```python
from app.core.services.document_service import DocumentService
service = DocumentService()
success = service.process_document("input.pdf", "output.md")
```
## Testing
Run the test script to verify the implementation:
```bash
cd backend
python test_pdf_processor.py
```
Make sure you have:
1. A sample PDF file in the `sample_doc/` directory
2. Mineru API service running and accessible
3. Proper network connectivity between services
## Error Handling
The processor handles various error scenarios:
- **Network Timeouts**: Configurable timeout (default: 5 minutes)
- **API Errors**: HTTP status code errors are logged and handled
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
- **File Operations**: Proper error handling for file reading/writing
## Logging
The processor provides detailed logging for debugging:
- API call attempts and responses
- Content extraction results
- Error conditions and stack traces
- Processing statistics
## Deployment
### Docker Compose
Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.
### Environment Variables
Set the following environment variables in your deployment:
```bash
MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300
```
## Troubleshooting
### Common Issues
1. **Connection Refused**: Check if Mineru service is running and accessible
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
3. **Empty Content**: Check Mineru API response format and logs
4. **Network Issues**: Verify network connectivity between services
### Debug Mode
Enable debug logging to see detailed API interactions:
```python
import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
```
## Migration from magic_pdf
If you were previously using magic_pdf:
1. **No Code Changes Required**: The interface remains the same
2. **Configuration Update**: Add Mineru API settings
3. **Service Dependencies**: Ensure Mineru service is running
4. **Testing**: Run the test script to verify functionality
## Performance Considerations
- **Timeout**: Large PDFs may require longer timeouts
- **Memory**: The processor loads the entire PDF into memory for API calls
- **Network**: API calls add network latency to processing time
- **Caching**: Consider implementing caching for frequently processed documents

103
backend/README.md Normal file
View File

@ -0,0 +1,103 @@
# Legal Document Masker API
This is the backend API for the Legal Document Masking system. It provides endpoints for file upload, processing status tracking, and file download.
## Prerequisites
- Python 3.8+
- Redis (for Celery)
## File Storage
Files are stored in the following structure:
```
backend/
├── storage/
│ ├── uploads/ # Original uploaded files
│ └── processed/ # Masked/processed files
```
## Setup
### Option 1: Local Development
1. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables:
Create a `.env` file in the backend directory with the following variables:
```env
SECRET_KEY=your-secret-key-here
```
The database (SQLite) will be automatically created when you first run the application.
4. Start Redis (required for Celery):
```bash
redis-server
```
5. Start Celery worker:
```bash
celery -A app.services.file_service worker --loglevel=info
```
6. Start the FastAPI server:
```bash
uvicorn app.main:app --reload
```
### Option 2: Docker Deployment
1. Build and start the services:
```bash
docker-compose up --build
```
This will start:
- FastAPI server on port 8000
- Celery worker for background processing
- Redis for task queue
## API Documentation
Once the server is running, you can access:
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`
## API Endpoints
- `POST /api/v1/files/upload` - Upload a new file
- `GET /api/v1/files` - List all files
- `GET /api/v1/files/{file_id}` - Get file details
- `GET /api/v1/files/{file_id}/download` - Download processed file
- `WS /api/v1/files/ws/status/{file_id}` - WebSocket for real-time status updates
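As a quick illustration, a minimal client for the upload, status, and download endpoints above could look like this sketch (the base URL, file names, and the exact serialization of the status field are assumptions):
```python
import time
import requests

BASE_URL = "http://localhost:8000/api/v1"  # adjust for your deployment

# Upload a document
with open("contract.docx", "rb") as f:
    upload = requests.post(f"{BASE_URL}/files/upload", files={"file": f})
upload.raise_for_status()
file_id = upload.json()["id"]

# Poll the file details until processing finishes
# (status values follow the FileStatus enum; exact strings may differ)
while True:
    details = requests.get(f"{BASE_URL}/files/{file_id}").json()
    if details["status"] in ("SUCCESS", "FAILED"):
        break
    time.sleep(2)

# Download the processed markdown on success
if details["status"] == "SUCCESS":
    masked = requests.get(f"{BASE_URL}/files/{file_id}/download")
    with open("contract_masked.md", "wb") as out:
        out.write(masked.content)
```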
## Development
### Running Tests
```bash
pytest
```
### Code Style
The project uses Black for code formatting:
```bash
black .
```
### Docker Commands
- Start services: `docker-compose up`
- Start in background: `docker-compose up -d`
- Stop services: `docker-compose down`
- View logs: `docker-compose logs -f`
- Rebuild: `docker-compose up --build`

View File

@ -0,0 +1,166 @@
from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, WebSocket, Response
from fastapi.responses import FileResponse
from sqlalchemy.orm import Session
from typing import List
import os
from ...core.config import settings
from ...core.database import get_db
from ...models.file import File as FileModel, FileStatus
from ...services.file_service import process_file, delete_file
from ...schemas.file import FileResponse as FileResponseSchema, FileList
import asyncio
from fastapi import WebSocketDisconnect
import uuid
router = APIRouter()
@router.post("/upload", response_model=FileResponseSchema)
async def upload_file(
file: UploadFile = File(...),
db: Session = Depends(get_db)
):
if not file.filename:
raise HTTPException(status_code=400, detail="No file provided")
if not any(file.filename.lower().endswith(ext) for ext in settings.ALLOWED_EXTENSIONS):
raise HTTPException(
status_code=400,
detail=f"File type not allowed. Allowed types: {', '.join(settings.ALLOWED_EXTENSIONS)}"
)
# Generate unique file ID
file_id = str(uuid.uuid4())
file_extension = os.path.splitext(file.filename)[1]
unique_filename = f"{file_id}{file_extension}"
# Save file with unique name
file_path = settings.UPLOAD_FOLDER / unique_filename
with open(file_path, "wb") as buffer:
content = await file.read()
buffer.write(content)
# Create database entry
db_file = FileModel(
id=file_id,
filename=file.filename,
original_path=str(file_path),
status=FileStatus.NOT_STARTED
)
db.add(db_file)
db.commit()
db.refresh(db_file)
# Start processing
process_file.delay(str(db_file.id))
return db_file
@router.get("/files", response_model=List[FileResponseSchema])
def list_files(
skip: int = 0,
limit: int = 100,
db: Session = Depends(get_db)
):
files = db.query(FileModel).offset(skip).limit(limit).all()
return files
@router.get("/files/{file_id}", response_model=FileResponseSchema)
def get_file(
file_id: str,
db: Session = Depends(get_db)
):
file = db.query(FileModel).filter(FileModel.id == file_id).first()
if not file:
raise HTTPException(status_code=404, detail="File not found")
return file
@router.get("/files/{file_id}/download")
async def download_file(
file_id: str,
db: Session = Depends(get_db)
):
print(f"=== DOWNLOAD REQUEST ===")
print(f"File ID: {file_id}")
file = db.query(FileModel).filter(FileModel.id == file_id).first()
if not file:
print(f"❌ File not found for ID: {file_id}")
raise HTTPException(status_code=404, detail="File not found")
print(f"✅ File found: {file.filename}")
print(f"File status: {file.status}")
print(f"Original path: {file.original_path}")
print(f"Processed path: {file.processed_path}")
if file.status != FileStatus.SUCCESS:
print(f"❌ File not ready for download. Status: {file.status}")
raise HTTPException(status_code=400, detail="File is not ready for download")
if not os.path.exists(file.processed_path):
print(f"❌ Processed file not found at: {file.processed_path}")
raise HTTPException(status_code=404, detail="Processed file not found")
print(f"✅ Processed file exists at: {file.processed_path}")
# Get the original filename without extension and add .md extension
original_filename = file.filename
filename_without_ext = os.path.splitext(original_filename)[0]
download_filename = f"{filename_without_ext}.md"
print(f"Original filename: {original_filename}")
print(f"Filename without extension: {filename_without_ext}")
print(f"Download filename: {download_filename}")
response = FileResponse(
path=file.processed_path,
filename=download_filename,
media_type="text/markdown"
)
print(f"Response headers: {dict(response.headers)}")
print(f"=== END DOWNLOAD REQUEST ===")
return response
@router.websocket("/ws/status/{file_id}")
async def websocket_endpoint(websocket: WebSocket, file_id: str, db: Session = Depends(get_db)):
await websocket.accept()
try:
while True:
file = db.query(FileModel).filter(FileModel.id == file_id).first()
if not file:
await websocket.send_json({"error": "File not found"})
break
await websocket.send_json({
"status": file.status,
"error": file.error_message
})
if file.status in [FileStatus.SUCCESS, FileStatus.FAILED]:
break
await asyncio.sleep(1)
except WebSocketDisconnect:
pass
@router.delete("/files/{file_id}")
async def delete_file_endpoint(
file_id: str,
db: Session = Depends(get_db)
):
"""
Delete a file and its associated records.
This will remove:
1. The database record
2. The original uploaded file
3. The processed markdown file (if it exists)
"""
try:
delete_file(file_id)
return {"message": "File deleted successfully"}
except HTTPException as e:
raise e
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))

View File

@ -0,0 +1,65 @@
from pydantic_settings import BaseSettings
from typing import Optional
import os
from pathlib import Path
class Settings(BaseSettings):
# API Settings
API_V1_STR: str = "/api/v1"
PROJECT_NAME: str = "Legal Document Masker API"
# Security
SECRET_KEY: str = "your-secret-key-here" # Change in production
ACCESS_TOKEN_EXPIRE_MINUTES: int = 60 * 24 * 8 # 8 days
# Database
BASE_DIR: Path = Path(__file__).parent.parent.parent
DATABASE_URL: str = f"sqlite:///{BASE_DIR}/storage/legal_doc_masker.db"
# File Storage
UPLOAD_FOLDER: Path = BASE_DIR / "storage" / "uploads"
PROCESSED_FOLDER: Path = BASE_DIR / "storage" / "processed"
MAX_FILE_SIZE: int = 50 * 1024 * 1024 # 50MB
ALLOWED_EXTENSIONS: set = {"pdf", "docx", "doc", "md"}
# Celery
CELERY_BROKER_URL: str = "redis://redis:6379/0"
CELERY_RESULT_BACKEND: str = "redis://redis:6379/0"
# Ollama API settings
OLLAMA_API_URL: str = "https://api.ollama.com"
OLLAMA_API_KEY: str = ""
OLLAMA_MODEL: str = "llama2"
# Mineru API settings
MINERU_API_URL: str = "http://mineru-api:8000"
# MINERU_API_URL: str = "http://host.docker.internal:8001"
MINERU_TIMEOUT: int = 300 # 5 minutes timeout
MINERU_LANG_LIST: list = ["ch"] # Language list for parsing
MINERU_BACKEND: str = "pipeline" # Backend to use
MINERU_PARSE_METHOD: str = "auto" # Parse method
MINERU_FORMULA_ENABLE: bool = True # Enable formula parsing
MINERU_TABLE_ENABLE: bool = True # Enable table parsing
# Logging settings
LOG_LEVEL: str = "INFO"
LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
LOG_FILE: str = "app.log"
class Config:
case_sensitive = True
env_file = ".env"
env_file_encoding = "utf-8"
extra = "allow"
def __init__(self, **kwargs):
super().__init__(**kwargs)
# Create storage directories if they don't exist
self.UPLOAD_FOLDER.mkdir(parents=True, exist_ok=True)
self.PROCESSED_FOLDER.mkdir(parents=True, exist_ok=True)
# Create storage directory for database
(self.BASE_DIR / "storage").mkdir(parents=True, exist_ok=True)
settings = Settings()

View File

@ -1,5 +1,6 @@
import logging.config
from config.settings import settings
# from config.settings import settings
from .settings import settings
LOGGING_CONFIG = {
"version": 1,

View File

View File

@ -0,0 +1,21 @@
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from .config import settings
# Create SQLite engine with check_same_thread=False for FastAPI
engine = create_engine(
settings.DATABASE_URL,
connect_args={"check_same_thread": False}
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()
# Dependency
def get_db():
db = SessionLocal()
try:
yield db
finally:
db.close()

View File

@ -1,9 +1,9 @@
import os
from typing import Optional
from document_handlers.document_processor import DocumentProcessor
from document_handlers.processors import (
from .document_processor import DocumentProcessor
from .processors import (
TxtDocumentProcessor,
DocxDocumentProcessor,
# DocxDocumentProcessor,
PdfDocumentProcessor,
MarkdownDocumentProcessor
)
@ -15,8 +15,8 @@ class DocumentProcessorFactory:
processors = {
'.txt': TxtDocumentProcessor,
'.docx': DocxDocumentProcessor,
'.doc': DocxDocumentProcessor,
# '.docx': DocxDocumentProcessor,
# '.doc': DocxDocumentProcessor,
'.pdf': PdfDocumentProcessor,
'.md': MarkdownDocumentProcessor,
'.markdown': MarkdownDocumentProcessor

View File

@ -0,0 +1,71 @@
from abc import ABC, abstractmethod
from typing import Any, Dict
import logging
from .ner_processor import NerProcessor
logger = logging.getLogger(__name__)
class DocumentProcessor(ABC):
def __init__(self):
self.max_chunk_size = 1000 # Maximum number of characters per chunk
self.ner_processor = NerProcessor()
@abstractmethod
def read_content(self) -> str:
"""Read document content"""
pass
def _split_into_chunks(self, sentences: list[str]) -> list[str]:
"""Split sentences into chunks that don't exceed max_chunk_size"""
chunks = []
current_chunk = ""
for sentence in sentences:
if not sentence.strip():
continue
if len(current_chunk) + len(sentence) > self.max_chunk_size and current_chunk:
chunks.append(current_chunk)
current_chunk = sentence
else:
if current_chunk:
current_chunk += "" + sentence
else:
current_chunk = sentence
if current_chunk:
chunks.append(current_chunk)
logger.info(f"Split content into {len(chunks)} chunks")
return chunks
def _apply_mapping(self, text: str, mapping: Dict[str, str]) -> str:
"""Apply the mapping to replace sensitive information"""
masked_text = text
for original, masked in mapping.items():
if isinstance(masked, dict):
masked = next(iter(masked.values()), "")
elif not isinstance(masked, str):
masked = str(masked) if masked is not None else ""
masked_text = masked_text.replace(original, masked)
return masked_text
def process_content(self, content: str) -> str:
"""Process document content by masking sensitive information"""
sentences = content.split("。")  # split on the Chinese full stop
chunks = self._split_into_chunks(sentences)
logger.info(f"Split content into {len(chunks)} chunks")
final_mapping = self.ner_processor.process(chunks)
masked_content = self._apply_mapping(content, final_mapping)
logger.info("Successfully masked content")
return masked_content
@abstractmethod
def save_content(self, content: str) -> None:
"""Save processed content"""
pass

View File

@ -0,0 +1,305 @@
from typing import Any, Dict
from ..prompts.masking_prompts import get_ner_name_prompt, get_ner_company_prompt, get_ner_address_prompt, get_ner_project_prompt, get_ner_case_number_prompt, get_entity_linkage_prompt
import logging
import json
from ..services.ollama_client import OllamaClient
from ...core.config import settings
from ..utils.json_extractor import LLMJsonExtractor
from ..utils.llm_validator import LLMResponseValidator
import re
from .regs.entity_regex import extract_id_number_entities, extract_social_credit_code_entities
logger = logging.getLogger(__name__)
class NerProcessor:
def __init__(self):
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
self.max_retries = 3
def _validate_mapping_format(self, mapping: Dict[str, Any]) -> bool:
return LLMResponseValidator.validate_entity_extraction(mapping)
def _process_entity_type(self, chunk: str, prompt_func, entity_type: str) -> Dict[str, str]:
for attempt in range(self.max_retries):
try:
formatted_prompt = prompt_func(chunk)
logger.info(f"Calling ollama to generate {entity_type} mapping for chunk (attempt {attempt + 1}/{self.max_retries}): {formatted_prompt}")
response = self.ollama_client.generate(formatted_prompt)
logger.info(f"Raw response from LLM: {response}")
mapping = LLMJsonExtractor.parse_raw_json_str(response)
logger.info(f"Parsed mapping: {mapping}")
if mapping and self._validate_mapping_format(mapping):
return mapping
else:
logger.warning(f"Invalid mapping format received on attempt {attempt + 1}, retrying...")
except Exception as e:
logger.error(f"Error generating {entity_type} mapping on attempt {attempt + 1}: {e}")
if attempt < self.max_retries - 1:
logger.info("Retrying...")
else:
logger.error(f"Max retries reached for {entity_type}, returning empty mapping")
return {}
def build_mapping(self, chunk: str) -> list[Dict[str, str]]:
mapping_pipeline = []
entity_configs = [
(get_ner_name_prompt, "people names"),
(get_ner_company_prompt, "company names"),
(get_ner_address_prompt, "addresses"),
(get_ner_project_prompt, "project names"),
(get_ner_case_number_prompt, "case numbers")
]
for prompt_func, entity_type in entity_configs:
mapping = self._process_entity_type(chunk, prompt_func, entity_type)
if mapping:
mapping_pipeline.append(mapping)
regex_entity_extractors = [
extract_id_number_entities,
extract_social_credit_code_entities
]
for extractor in regex_entity_extractors:
mapping = extractor(chunk)
if mapping and LLMResponseValidator.validate_regex_entity(mapping):
mapping_pipeline.append(mapping)
elif mapping:
logger.warning(f"Invalid regex entity mapping format: {mapping}")
return mapping_pipeline
def _merge_entity_mappings(self, chunk_mappings: list[Dict[str, Any]]) -> list[Dict[str, str]]:
all_entities = []
for mapping in chunk_mappings:
if isinstance(mapping, dict) and 'entities' in mapping:
entities = mapping['entities']
if isinstance(entities, list):
all_entities.extend(entities)
unique_entities = []
seen_texts = set()
for entity in all_entities:
if isinstance(entity, dict) and 'text' in entity:
text = entity['text'].strip()
if text and text not in seen_texts:
seen_texts.add(text)
unique_entities.append(entity)
elif text and text in seen_texts:
# For now, just log entities whose text duplicates one already seen (possible conflicts)
logger.info(f"Duplicate entity found: {entity}")
continue
logger.info(f"Merged {len(unique_entities)} unique entities")
return unique_entities
def _generate_masked_mapping(self, unique_entities: list[Dict[str, str]], linkage: Dict[str, Any]) -> Dict[str, str]:
"""
Use the linkage info to group entities and map every entity in a group to the same masked name, applying the following rules:
1. Person names / abbreviations: keep the surname, replace the given name with 某, and number duplicates of the same surname.
2. Company names: companies in the same group map to one uppercase letter plus 公司 (A公司, B公司, ...).
3. English person names: the first letter of each word plus ***.
4. English company names: replaced with the industry name in uppercase English; COMPANY if no industry info is available.
5. Project names: a lowercase letter plus 项目 (a项目, b项目, ...).
6. Case numbers: only the digits are replaced with ***; the surrounding structure (including internal spaces) is kept.
7. ID numbers: six X characters.
8. Social credit codes: eight X characters.
9. Addresses: keep district-level and higher administrative divisions, drop the detailed location.
10. Other types: fall back to the original logic.
"""
import re
entity_mapping = {}
used_masked_names = set()
group_mask_map = {}
surname_counter = {}
company_letter = ord('A')
project_letter = ord('a')
# Prefer district/county-level divisions first, then city, province, etc.
admin_keywords = [
'市辖区', '自治县', '自治旗', '林区', '', '', '', '', '', '地区', '自治州',
'', '', '自治区', '特别行政区'
]
admin_pattern = r"^(.*?(?:" + '|'.join(admin_keywords) + r"))"
for group in linkage.get('entity_groups', []):
group_type = group.get('group_type', '')
entities = group.get('entities', [])
if '公司' in group_type or 'Company' in group_type:
masked = chr(company_letter) + '公司'
company_letter += 1
for entity in entities:
group_mask_map[entity['text']] = masked
elif '人名' in group_type:
surname_local_counter = {}
for entity in entities:
name = entity['text']
if not name:
continue
surname = name[0]
surname_local_counter.setdefault(surname, 0)
surname_local_counter[surname] += 1
if surname_local_counter[surname] == 1:
masked = f"{surname}"
else:
masked = f"{surname}{surname_local_counter[surname]}"
group_mask_map[name] = masked
elif '英文人名' in group_type:
for entity in entities:
name = entity['text']
if not name:
continue
masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
group_mask_map[name] = masked
for entity in unique_entities:
text = entity['text']
entity_type = entity.get('type', '')
if text in group_mask_map:
entity_mapping[text] = group_mask_map[text]
used_masked_names.add(group_mask_map[text])
elif '英文公司名' in entity_type or 'English Company' in entity_type:
industry = entity.get('industry', 'COMPANY')
masked = industry.upper()
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '项目名' in entity_type:
masked = chr(project_letter) + '项目'
project_letter += 1
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '案号' in entity_type:
masked = re.sub(r'(\d[\d\s]*)(号)', r'***\2', text)
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '身份证号' in entity_type:
masked = 'X' * 6
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '社会信用代码' in entity_type:
masked = 'X' * 8
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '地址' in entity_type:
# Keep district-level and higher administrative divisions; drop the detailed location
match = re.match(admin_pattern, text)
if match:
masked = match.group(1)
else:
masked = text # fallback
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '人名' in entity_type and '英文' not in entity_type:  # English person names are handled in their own branch below
name = text
if not name:
masked = ''
else:
surname = name[0]
surname_counter.setdefault(surname, 0)
surname_counter[surname] += 1
if surname_counter[surname] == 1:
masked = f"{surname}"
else:
masked = f"{surname}{surname_counter[surname]}"
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '公司' in entity_type or 'Company' in entity_type:
masked = chr(company_letter) + '公司'
company_letter += 1
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '英文人名' in entity_type:
name = text
masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
entity_mapping[text] = masked
used_masked_names.add(masked)
else:
base_name = ''
masked = base_name
counter = 1
while masked in used_masked_names:
if counter <= 10:
suffixes = ['', '', '', '', '', '', '', '', '', '']
masked = base_name + suffixes[counter - 1]
else:
masked = f"{base_name}{counter}"
counter += 1
entity_mapping[text] = masked
used_masked_names.add(masked)
return entity_mapping
def _validate_linkage_format(self, linkage: Dict[str, Any]) -> bool:
return LLMResponseValidator.validate_entity_linkage(linkage)
def _create_entity_linkage(self, unique_entities: list[Dict[str, str]]) -> Dict[str, Any]:
linkable_entities = []
for entity in unique_entities:
entity_type = entity.get('type', '')
if any(keyword in entity_type for keyword in ['公司', 'Company', '人名', '英文人名']):
linkable_entities.append(entity)
if not linkable_entities:
logger.info("No linkable entities found")
return {"entity_groups": []}
entities_text = "\n".join([
f"- {entity['text']} (类型: {entity['type']})"
for entity in linkable_entities
])
for attempt in range(self.max_retries):
try:
formatted_prompt = get_entity_linkage_prompt(entities_text)
logger.info(f"Calling ollama to generate entity linkage (attempt {attempt + 1}/{self.max_retries})")
response = self.ollama_client.generate(formatted_prompt)
logger.info(f"Raw entity linkage response from LLM: {response}")
linkage = LLMJsonExtractor.parse_raw_json_str(response)
logger.info(f"Parsed entity linkage: {linkage}")
if linkage and self._validate_linkage_format(linkage):
logger.info(f"Successfully created entity linkage with {len(linkage.get('entity_groups', []))} groups")
return linkage
else:
logger.warning(f"Invalid entity linkage format received on attempt {attempt + 1}, retrying...")
except Exception as e:
logger.error(f"Error generating entity linkage on attempt {attempt + 1}: {e}")
if attempt < self.max_retries - 1:
logger.info("Retrying...")
else:
logger.error("Max retries reached for entity linkage, returning empty linkage")
return {"entity_groups": []}
def _apply_entity_linkage_to_mapping(self, entity_mapping: Dict[str, str], entity_linkage: Dict[str, Any]) -> Dict[str, str]:
"""
Linkage has already been applied in _generate_masked_mapping, so entity_mapping is returned unchanged here.
"""
return entity_mapping
def process(self, chunks: list[str]) -> Dict[str, str]:
chunk_mappings = []
for i, chunk in enumerate(chunks):
logger.info(f"Processing chunk {i+1}/{len(chunks)}")
chunk_mapping = self.build_mapping(chunk)
logger.info(f"Chunk mapping: {chunk_mapping}")
chunk_mappings.extend(chunk_mapping)
logger.info(f"Final chunk mappings: {chunk_mappings}")
unique_entities = self._merge_entity_mappings(chunk_mappings)
logger.info(f"Unique entities: {unique_entities}")
entity_linkage = self._create_entity_linkage(unique_entities)
logger.info(f"Entity linkage: {entity_linkage}")
# for quick test
# unique_entities = [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '2020京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
# entity_linkage = {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
combined_mapping = self._generate_masked_mapping(unique_entities, entity_linkage)
logger.info(f"Combined mapping: {combined_mapping}")
final_mapping = self._apply_entity_linkage_to_mapping(combined_mapping, entity_linkage)
logger.info(f"Final mapping: {final_mapping}")
return final_mapping

View File

@ -0,0 +1,7 @@
from .txt_processor import TxtDocumentProcessor
# from .docx_processor import DocxDocumentProcessor
from .pdf_processor import PdfDocumentProcessor
from .md_processor import MarkdownDocumentProcessor
# __all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
__all__ = ['TxtDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']

View File

@ -1,13 +1,13 @@
import os
import docx
from document_handlers.document_processor import DocumentProcessor
from ...document_handlers.document_processor import DocumentProcessor
from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.data.read_api import read_local_office
import logging
from services.ollama_client import OllamaClient
from config.settings import settings
from prompts.masking_prompts import get_masking_mapping_prompt
from ...services.ollama_client import OllamaClient
from ...config import settings
from ...prompts.masking_prompts import get_masking_mapping_prompt
logger = logging.getLogger(__name__)

View File

@ -1,8 +1,8 @@
import os
from document_handlers.document_processor import DocumentProcessor
from services.ollama_client import OllamaClient
from ...document_handlers.document_processor import DocumentProcessor
from ...services.ollama_client import OllamaClient
import logging
from config.settings import settings
from ...config import settings
logger = logging.getLogger(__name__)

View File

@ -0,0 +1,204 @@
import os
import requests
import logging
from typing import Dict, Any, Optional
from ...document_handlers.document_processor import DocumentProcessor
from ...services.ollama_client import OllamaClient
from ...config import settings
logger = logging.getLogger(__name__)
class PdfDocumentProcessor(DocumentProcessor):
def __init__(self, input_path: str, output_path: str):
super().__init__() # Call parent class's __init__
self.input_path = input_path
self.output_path = output_path
self.output_dir = os.path.dirname(output_path)
self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]
# Setup work directory for temporary files
self.work_dir = os.path.join(
os.path.dirname(output_path),
".work",
os.path.splitext(os.path.basename(input_path))[0]
)
os.makedirs(self.work_dir, exist_ok=True)
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
# Mineru API configuration
self.mineru_base_url = getattr(settings, 'MINERU_API_URL', 'http://mineru-api:8000')
self.mineru_timeout = getattr(settings, 'MINERU_TIMEOUT', 300) # 5 minutes timeout
self.mineru_lang_list = getattr(settings, 'MINERU_LANG_LIST', ['ch'])
self.mineru_backend = getattr(settings, 'MINERU_BACKEND', 'pipeline')
self.mineru_parse_method = getattr(settings, 'MINERU_PARSE_METHOD', 'auto')
self.mineru_formula_enable = getattr(settings, 'MINERU_FORMULA_ENABLE', True)
self.mineru_table_enable = getattr(settings, 'MINERU_TABLE_ENABLE', True)
def _call_mineru_api(self, file_path: str) -> Optional[Dict[str, Any]]:
"""
Call Mineru API to convert PDF to markdown
Args:
file_path: Path to the PDF file
Returns:
API response as dictionary or None if failed
"""
try:
url = f"{self.mineru_base_url}/file_parse"
with open(file_path, 'rb') as file:
files = {'files': (os.path.basename(file_path), file, 'application/pdf')}
# Prepare form data according to Mineru API specification
data = {
'output_dir': './output',
'lang_list': self.mineru_lang_list,
'backend': self.mineru_backend,
'parse_method': self.mineru_parse_method,
'formula_enable': self.mineru_formula_enable,
'table_enable': self.mineru_table_enable,
'return_md': True,
'return_middle_json': False,
'return_model_output': False,
'return_content_list': False,
'return_images': False,
'start_page_id': 0,
'end_page_id': 99999
}
logger.info(f"Calling Mineru API at {url}")
response = requests.post(
url,
files=files,
data=data,
timeout=self.mineru_timeout
)
if response.status_code == 200:
result = response.json()
logger.info("Successfully received response from Mineru API")
return result
else:
logger.error(f"Mineru API returned status code {response.status_code}: {response.text}")
return None
except requests.exceptions.Timeout:
logger.error(f"Mineru API request timed out after {self.mineru_timeout} seconds")
return None
except requests.exceptions.RequestException as e:
logger.error(f"Error calling Mineru API: {str(e)}")
return None
except Exception as e:
logger.error(f"Unexpected error calling Mineru API: {str(e)}")
return None
def _extract_markdown_from_response(self, response: Dict[str, Any]) -> str:
"""
Extract markdown content from Mineru API response
Args:
response: Mineru API response dictionary
Returns:
Extracted markdown content as string
"""
try:
logger.debug(f"Mineru API response structure: {response}")
# Try different possible response formats based on Mineru API
if 'markdown' in response:
return response['markdown']
elif 'md' in response:
return response['md']
elif 'content' in response:
return response['content']
elif 'text' in response:
return response['text']
elif 'result' in response and isinstance(response['result'], dict):
result = response['result']
if 'markdown' in result:
return result['markdown']
elif 'md' in result:
return result['md']
elif 'content' in result:
return result['content']
elif 'text' in result:
return result['text']
elif 'data' in response and isinstance(response['data'], dict):
data = response['data']
if 'markdown' in data:
return data['markdown']
elif 'md' in data:
return data['md']
elif 'content' in data:
return data['content']
elif 'text' in data:
return data['text']
elif isinstance(response, list) and len(response) > 0:
# If response is a list, try to extract from first item
first_item = response[0]
if isinstance(first_item, dict):
return self._extract_markdown_from_response(first_item)
elif isinstance(first_item, str):
return first_item
else:
# If no standard format found, try to extract from the response structure
logger.warning("Could not find standard markdown field in Mineru response")
# Return the response as string if it's simple, or empty string
if isinstance(response, str):
return response
elif isinstance(response, dict):
# Try to find any text-like content
for key, value in response.items():
if isinstance(value, str) and len(value) > 100: # Likely content
return value
elif isinstance(value, dict):
# Recursively search in nested dictionaries
nested_content = self._extract_markdown_from_response(value)
if nested_content:
return nested_content
return ""
except Exception as e:
logger.error(f"Error extracting markdown from Mineru response: {str(e)}")
return ""
def read_content(self) -> str:
logger.info("Starting PDF content processing with Mineru API")
# Call Mineru API to convert PDF to markdown
mineru_response = self._call_mineru_api(self.input_path)
if not mineru_response:
raise Exception("Failed to get response from Mineru API")
# Extract markdown content from the response
markdown_content = self._extract_markdown_from_response(mineru_response)
if not markdown_content:
raise Exception("No markdown content found in Mineru API response")
logger.info(f"Successfully extracted {len(markdown_content)} characters of markdown content")
# Save the raw markdown content to work directory for reference
md_output_path = os.path.join(self.work_dir, f"{self.name_without_suff}.md")
with open(md_output_path, 'w', encoding='utf-8') as file:
file.write(markdown_content)
logger.info(f"Saved raw markdown content to {md_output_path}")
return markdown_content
def save_content(self, content: str) -> None:
# Ensure output path has .md extension
output_dir = os.path.dirname(self.output_path)
base_name = os.path.splitext(os.path.basename(self.output_path))[0]
md_output_path = os.path.join(output_dir, f"{base_name}.md")
logger.info(f"Saving masked content to: {md_output_path}")
with open(md_output_path, 'w', encoding='utf-8') as file:
file.write(content)

View File

@ -1,8 +1,8 @@
-from document_handlers.document_processor import DocumentProcessor
-from services.ollama_client import OllamaClient
+from ...document_handlers.document_processor import DocumentProcessor
+from ...services.ollama_client import OllamaClient
import logging
-from prompts.masking_prompts import get_masking_prompt
-from config.settings import settings
+# from ...prompts.masking_prompts import get_masking_prompt
+from ...config import settings
logger = logging.getLogger(__name__)
class TxtDocumentProcessor(DocumentProcessor):

View File

@ -0,0 +1,18 @@
import re
def extract_id_number_entities(chunk: str) -> dict:
"""Extract Chinese ID numbers and return in entity mapping format."""
id_pattern = r'\b\d{17}[\dXx]\b'
entities = []
for match in re.findall(id_pattern, chunk):
entities.append({"text": match, "type": "身份证号"})
return {"entities": entities} if entities else {}
def extract_social_credit_code_entities(chunk: str) -> dict:
"""Extract social credit codes and return in entity mapping format."""
credit_pattern = r'\b[0-9A-Z]{18}\b'
entities = []
for match in re.findall(credit_pattern, chunk):
entities.append({"text": match, "type": "统一社会信用代码"})
return {"entities": entities} if entities else {}

View File

@ -0,0 +1,225 @@
import textwrap
def get_ner_name_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original names/companies to their masked versions.
Args:
text (str): The input text to be analyzed for masking
Returns:
str: The formatted prompt that will generate a mapping dictionary
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类。请严格按照JSON格式输出结果。
实体类别包括:
- 人名 (不包括律师、法官、书记员、检察官等公职人员)
- 英文人名
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "人名"}},
{{"text": "原始文本内容", "type": "英文人名"}},
...
]
}}
请严格按照JSON格式输出结果。
""")
return prompt.format(text=text)
def get_ner_company_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original companies to their masked versions.
Args:
text (str): The input text to be analyzed for masking
Returns:
str: The formatted prompt that will generate a mapping dictionary
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类。请严格按照JSON格式输出结果。
实体类别包括:
- 公司名称
- 英文公司名称
- Company with English name
- 公司名称简称
- 公司英文名称简称
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "公司名称"}},
{{"text": "原始文本内容", "type": "英文公司名称"}},
{{"text": "原始文本内容", "type": "公司名称简称"}},
{{"text": "原始文本内容", "type": "公司英文名称简称"}},
...
]
}}
请严格按照JSON格式输出结果。
""")
return prompt.format(text=text)
def get_ner_address_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original addresses to their masked versions.
Args:
text (str): The input text to be analyzed for masking
Returns:
str: The formatted prompt that will generate a mapping dictionary
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类。请严格按照JSON格式输出结果。
实体类别包括:
- 地址
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "地址"}},
...
]
}}
请严格按照JSON格式输出结果。
""")
return prompt.format(text=text)
def get_ner_project_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original project names to their masked versions.
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类。请严格按照JSON格式输出结果。
实体类别包括:
- 项目名(此处项目特指商业、工程、合同等项目)
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "项目名"}},
...
]
}}
请严格按照JSON格式输出结果。
""")
return prompt.format(text=text)
def get_ner_case_number_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original case numbers to their masked versions.
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类。请严格按照JSON格式输出结果。
实体类别包括:
- 案号
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "案号"}},
...
]
}}
请严格按照JSON格式输出结果。
""")
return prompt.format(text=text)
def get_entity_linkage_prompt(entities_text: str) -> str:
"""
Returns a prompt that identifies related entities and groups them together.
Args:
entities_text (str): The list of entities to be analyzed for linkage
Returns:
str: The formatted prompt that will generate entity linkage information
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体关联分析助手。请分析以下实体列表,识别出相互关联的实体,如全称与简称、中文名与英文名等,并将它们分组。
关联规则
1. 公司名称关联
- 全称与简称"阿里巴巴集团控股有限公司" "阿里巴巴"
- 中文名与英文名"腾讯科技有限公司" "Tencent Technology Ltd."
- 母公司与子公司"腾讯" "腾讯音乐"
2. 每个组中应指定一个主要实体is_primary: true通常是
- 对于公司选择最正式的全称
- 对于人名选择最常用的称呼
待分析实体列表:
{entities_text}
输出格式:
{{
"entity_groups": [
{{
"group_id": "group_1",
"group_type": "公司名称",
"entities": [
{{
"text": "阿里巴巴集团控股有限公司",
"type": "公司名称",
"is_primary": true
}},
{{
"text": "阿里巴巴",
"type": "公司名称简称",
"is_primary": false
}}
]
}}
]
}}
注意事项
1. 只对确实有关联的实体进行分组
2. 每个实体只能属于一个组
3. 每个组必须有且仅有一个主要实体is_primary: true
4. 如果实体之间没有明显关联不要强制分组
5. group_type 应该是 "公司名称"
请严格按照JSON格式输出结果。
""")
return prompt.format(entities_text=entities_text)
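A minimal sketch of how one of these prompt builders is typically wired up; `call_llm` is a hypothetical stand-in for the Ollama client call and is not part of this commit:

```python
import json

def extract_person_entities(text: str, call_llm) -> dict:
    """Build the person-name NER prompt, query the model, and parse the JSON reply."""
    prompt = get_ner_name_prompt(text)
    raw = call_llm(prompt)  # hypothetical: returns the model's raw string response
    try:
        # The prompt contract above asks for {"entities": [{"text": ..., "type": ...}]}
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"entities": []}  # fall back to an empty result on malformed output
```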

View File

@ -1,12 +1,12 @@
import logging
-from document_handlers.document_factory import DocumentProcessorFactory
-from services.ollama_client import OllamaClient
+from ..document_handlers.document_factory import DocumentProcessorFactory
+from ..services.ollama_client import OllamaClient
logger = logging.getLogger(__name__)
class DocumentService:
-def __init__(self, ollama_client: OllamaClient):
-    self.ollama_client = ollama_client
+def __init__(self):
+    pass
def process_document(self, input_path: str, output_path: str) -> bool:
try:

View File

@ -0,0 +1,240 @@
import logging
from typing import Any, Dict, Optional
from jsonschema import validate, ValidationError
logger = logging.getLogger(__name__)
class LLMResponseValidator:
"""Validator for LLM JSON responses with different schemas for different entity types"""
# Schema for basic entity extraction responses
ENTITY_EXTRACTION_SCHEMA = {
"type": "object",
"properties": {
"entities": {
"type": "array",
"items": {
"type": "object",
"properties": {
"text": {"type": "string"},
"type": {"type": "string"}
},
"required": ["text", "type"]
}
}
},
"required": ["entities"]
}
# Schema for entity linkage responses
ENTITY_LINKAGE_SCHEMA = {
"type": "object",
"properties": {
"entity_groups": {
"type": "array",
"items": {
"type": "object",
"properties": {
"group_id": {"type": "string"},
"group_type": {"type": "string"},
"entities": {
"type": "array",
"items": {
"type": "object",
"properties": {
"text": {"type": "string"},
"type": {"type": "string"},
"is_primary": {"type": "boolean"}
},
"required": ["text", "type", "is_primary"]
}
}
},
"required": ["group_id", "group_type", "entities"]
}
}
},
"required": ["entity_groups"]
}
# Schema for regex-based entity extraction (from entity_regex.py)
REGEX_ENTITY_SCHEMA = {
"type": "object",
"properties": {
"entities": {
"type": "array",
"items": {
"type": "object",
"properties": {
"text": {"type": "string"},
"type": {"type": "string"}
},
"required": ["text", "type"]
}
}
},
"required": ["entities"]
}
@classmethod
def validate_entity_extraction(cls, response: Dict[str, Any]) -> bool:
"""
Validate entity extraction response from LLM.
Args:
response: The parsed JSON response from LLM
Returns:
bool: True if valid, False otherwise
"""
try:
validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
logger.debug(f"Entity extraction validation passed for response: {response}")
return True
except ValidationError as e:
logger.warning(f"Entity extraction validation failed: {e}")
logger.warning(f"Response that failed validation: {response}")
return False
@classmethod
def validate_entity_linkage(cls, response: Dict[str, Any]) -> bool:
"""
Validate entity linkage response from LLM.
Args:
response: The parsed JSON response from LLM
Returns:
bool: True if valid, False otherwise
"""
try:
validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
content_valid = cls._validate_linkage_content(response)
if content_valid:
logger.debug(f"Entity linkage validation passed for response: {response}")
return True
else:
logger.warning(f"Entity linkage content validation failed for response: {response}")
return False
except ValidationError as e:
logger.warning(f"Entity linkage validation failed: {e}")
logger.warning(f"Response that failed validation: {response}")
return False
@classmethod
def validate_regex_entity(cls, response: Dict[str, Any]) -> bool:
"""
Validate regex-based entity extraction response.
Args:
response: The parsed JSON response from regex extractors
Returns:
bool: True if valid, False otherwise
"""
try:
validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
logger.debug(f"Regex entity validation passed for response: {response}")
return True
except ValidationError as e:
logger.warning(f"Regex entity validation failed: {e}")
logger.warning(f"Response that failed validation: {response}")
return False
@classmethod
def _validate_linkage_content(cls, response: Dict[str, Any]) -> bool:
"""
Additional content validation for entity linkage responses.
Args:
response: The parsed JSON response from LLM
Returns:
bool: True if content is valid, False otherwise
"""
entity_groups = response.get('entity_groups', [])
for group in entity_groups:
# Validate group type
group_type = group.get('group_type', '')
if group_type not in ['公司名称', '人名']:
logger.warning(f"Invalid group_type: {group_type}")
return False
# Validate entities in group
entities = group.get('entities', [])
if not entities:
logger.warning("Empty entity group found")
return False
# Check that exactly one entity is marked as primary
primary_count = sum(1 for entity in entities if entity.get('is_primary', False))
if primary_count != 1:
logger.warning(f"Group must have exactly one primary entity, found {primary_count}")
return False
# Validate entity types within group
for entity in entities:
entity_type = entity.get('type', '')
if group_type == '公司名称' and not any(keyword in entity_type for keyword in ['公司', 'Company']):
logger.warning(f"Company group contains non-company entity: {entity_type}")
return False
elif group_type == '人名' and not any(keyword in entity_type for keyword in ['人名', '英文人名']):
logger.warning(f"Person group contains non-person entity: {entity_type}")
return False
return True
@classmethod
def validate_response_by_type(cls, response: Dict[str, Any], response_type: str) -> bool:
"""
Generic validator that routes to appropriate validation method based on response type.
Args:
response: The parsed JSON response from LLM
response_type: Type of response ('entity_extraction', 'entity_linkage', 'regex_entity')
Returns:
bool: True if valid, False otherwise
"""
validators = {
'entity_extraction': cls.validate_entity_extraction,
'entity_linkage': cls.validate_entity_linkage,
'regex_entity': cls.validate_regex_entity
}
validator = validators.get(response_type)
if not validator:
logger.error(f"Unknown response type: {response_type}")
return False
return validator(response)
@classmethod
def get_validation_errors(cls, response: Dict[str, Any], response_type: str) -> Optional[str]:
"""
Get detailed validation errors for debugging.
Args:
response: The parsed JSON response from LLM
response_type: Type of response
Returns:
Optional[str]: Error message or None if valid
"""
try:
if response_type == 'entity_extraction':
validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
elif response_type == 'entity_linkage':
validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
if not cls._validate_linkage_content(response):
return "Content validation failed for entity linkage"
elif response_type == 'regex_entity':
validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
else:
return f"Unknown response type: {response_type}"
return None
except ValidationError as e:
return f"Schema validation error: {e}"

33
backend/app/main.py Normal file
View File

@ -0,0 +1,33 @@
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from .core.config import settings
from .api.endpoints import files
from .core.database import engine, Base
# Create database tables
Base.metadata.create_all(bind=engine)
app = FastAPI(
title=settings.PROJECT_NAME,
openapi_url=f"{settings.API_V1_STR}/openapi.json"
)
# Set up CORS
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # In production, replace with specific origins
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Include routers
app.include_router(
files.router,
prefix=f"{settings.API_V1_STR}/files",
tags=["files"]
)
@app.get("/")
async def root():
return {"message": "Welcome to Legal Document Masker API"}

View File

@ -0,0 +1,22 @@
from sqlalchemy import Column, String, DateTime, Text
from datetime import datetime
import uuid
from ..core.database import Base
class FileStatus(str):
NOT_STARTED = "not_started"
PROCESSING = "processing"
SUCCESS = "success"
FAILED = "failed"
class File(Base):
__tablename__ = "files"
id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
filename = Column(String(255), nullable=False)
original_path = Column(String(255), nullable=False)
processed_path = Column(String(255))
status = Column(String(20), nullable=False, default=FileStatus.NOT_STARTED)
error_message = Column(Text)
created_at = Column(DateTime, nullable=False, default=datetime.utcnow)
updated_at = Column(DateTime, nullable=False, default=datetime.utcnow, onupdate=datetime.utcnow)

View File

@ -0,0 +1,21 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional
from uuid import UUID
class FileBase(BaseModel):
filename: str
status: str
error_message: Optional[str] = None
class FileResponse(FileBase):
id: UUID
created_at: datetime
updated_at: datetime
class Config:
from_attributes = True
class FileList(BaseModel):
files: list[FileResponse]
total: int

View File

@ -0,0 +1,87 @@
from celery import Celery
from ..core.config import settings
from ..models.file import File, FileStatus
from sqlalchemy.orm import Session
from ..core.database import SessionLocal
import sys
import os
from ..core.services.document_service import DocumentService
from pathlib import Path
from fastapi import HTTPException
celery = Celery(
'file_service',
broker=settings.CELERY_BROKER_URL,
backend=settings.CELERY_RESULT_BACKEND
)
def delete_file(file_id: str):
"""
Delete a file and its associated records.
This will:
1. Delete the database record
2. Delete the original uploaded file
3. Delete the processed markdown file (if it exists)
"""
db = SessionLocal()
try:
# Get the file record
file = db.query(File).filter(File.id == file_id).first()
if not file:
raise HTTPException(status_code=404, detail="File not found")
# Delete the original file if it exists
if file.original_path and os.path.exists(file.original_path):
os.remove(file.original_path)
# Delete the processed file if it exists
if file.processed_path and os.path.exists(file.processed_path):
os.remove(file.processed_path)
# Delete the database record
db.delete(file)
db.commit()
except Exception as e:
db.rollback()
raise HTTPException(status_code=500, detail=f"Error deleting file: {str(e)}")
finally:
db.close()
@celery.task
def process_file(file_id: str):
db = SessionLocal()
try:
file = db.query(File).filter(File.id == file_id).first()
if not file:
return
# Update status to processing
file.status = FileStatus.PROCESSING
db.commit()
try:
# Process the file using your existing masking system
process_service = DocumentService()
# Determine output path using file_id with .md extension
output_filename = f"{file_id}.md"
output_path = str(settings.PROCESSED_FOLDER / output_filename)
# Process document with both input and output paths
process_service.process_document(file.original_path, output_path)
# Update file record with processed path
file.processed_path = output_path
file.status = FileStatus.SUCCESS
db.commit()
except Exception as e:
file.status = FileStatus.FAILED
file.error_message = str(e)
db.commit()
raise
finally:
db.close()
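A minimal sketch (hypothetical endpoint, not part of this file) of how `process_file` would typically be dispatched after an upload, so the HTTP request returns immediately while the worker performs the masking:

```python
# Hypothetical FastAPI route illustrating the dispatch pattern.
from fastapi import APIRouter

router = APIRouter()

@router.post("/{file_id}/process")
async def trigger_processing(file_id: str):
    # .delay() serializes the task onto the Redis broker; a Celery worker picks it up.
    process_file.delay(file_id)
    return {"id": file_id, "status": "processing"}
```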

View File

@ -0,0 +1,37 @@
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
volumes:
- ./storage:/app/storage
- ./legal_doc_masker.db:/app/legal_doc_masker.db
env_file:
- .env
environment:
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
depends_on:
- redis
celery_worker:
build: .
command: celery -A app.services.file_service worker --loglevel=info
volumes:
- ./storage:/app/storage
- ./legal_doc_masker.db:/app/legal_doc_masker.db
env_file:
- .env
environment:
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
depends_on:
- redis
- api
redis:
image: redis:alpine
ports:
- "6379:6379"

127
backend/log Normal file
View File

@ -0,0 +1,127 @@
[2025-07-14 14:20:19,015: INFO/ForkPoolWorker-4] Raw response from LLM: {
celery_worker-1 | "entities": []
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:20:19,016: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
celery_worker-1 | [2025-07-14 14:20:19,020: INFO/ForkPoolWorker-4] Calling ollama to generate case numbers mapping for chunk (attempt 1/3):
celery_worker-1 | 你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类。请严格按照JSON格式输出结果。
celery_worker-1 |
celery_worker-1 | 实体类别包括:
celery_worker-1 | - 案号
celery_worker-1 |
celery_worker-1 | 待处理文本:
celery_worker-1 |
celery_worker-1 |
celery_worker-1 | 二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
celery_worker-1 |
celery_worker-1 | 29. 本判决为终审判决。
celery_worker-1 |
celery_worker-1 | 审 判 长 史晓霞审 判 员 邓青菁审 判 员 李 淼二〇二二年七月七日法 官 助 理 黎 铧书 记 员 郑海兴
celery_worker-1 |
celery_worker-1 | 输出格式:
celery_worker-1 | {
celery_worker-1 | "entities": [
celery_worker-1 | {"text": "原始文本内容", "type": "案号"},
celery_worker-1 | ...
celery_worker-1 | ]
celery_worker-1 | }
celery_worker-1 |
celery_worker-1 | 请严格按照JSON格式输出结果。
celery_worker-1 |
api-1 | INFO: 192.168.65.1:60045 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:22084 - "GET /api/v1/files/files HTTP/1.1" 200 OK
celery_worker-1 | [2025-07-14 14:20:31,279: INFO/ForkPoolWorker-4] Raw response from LLM: {
celery_worker-1 | "entities": []
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:20:31,281: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
celery_worker-1 | [2025-07-14 14:20:31,287: INFO/ForkPoolWorker-4] Chunk mapping: [{'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Final chunk mappings: [{'entities': [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}]}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}]}, {'entities': [{'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}]}, {'entities': [{'text': '服务合同', 'type': '项目名'}]}, {'entities': [{'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '2020京0105 民初69754 号', 'type': '案号'}]}, {'entities': [{'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}]}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}]}, {'entities': [{'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}]}, {'entities': [{'text': '《计算机设备采购合同》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': [{'text': '《服务合同书》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '中研智创公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Merged 22 unique entities
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Unique entities: [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '2020京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
celery_worker-1 | [2025-07-14 14:20:31,289: INFO/ForkPoolWorker-4] Calling ollama to generate entity linkage (attempt 1/3)
api-1 | INFO: 192.168.65.1:52168 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61426 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:30702 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:48159 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:16860 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:21262 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:45564 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:32142 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:27769 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:21196 - "GET /api/v1/files/files HTTP/1.1" 200 OK
celery_worker-1 | [2025-07-14 14:21:21,436: INFO/ForkPoolWorker-4] Raw entity linkage response from LLM: {
celery_worker-1 | "entity_groups": [
celery_worker-1 | {
celery_worker-1 | "group_id": "group_1",
celery_worker-1 | "group_type": "公司名称",
celery_worker-1 | "entities": [
celery_worker-1 | {
celery_worker-1 | "text": "北京丰复久信营销科技有限公司",
celery_worker-1 | "type": "公司名称",
celery_worker-1 | "is_primary": true
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "丰复久信公司",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "丰复久信",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | }
celery_worker-1 | ]
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "group_id": "group_2",
celery_worker-1 | "group_type": "公司名称",
celery_worker-1 | "entities": [
celery_worker-1 | {
celery_worker-1 | "text": "中研智创区块链技术有限公司",
celery_worker-1 | "type": "公司名称",
celery_worker-1 | "is_primary": true
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "中研智创公司",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "中研智创",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | }
celery_worker-1 | ]
celery_worker-1 | }
celery_worker-1 | ]
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:21:21,437: INFO/ForkPoolWorker-4] Parsed entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Successfully created entity linkage with 2 groups
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Generated masked mapping for 22 entities
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Combined mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司甲', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '2020京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司丁', '丰复久信': '某公司戊', '中研智创': '某公司己', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '北京丰复久信营销科技有限公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创区块链技术有限公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Final mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '2020京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司乙', '丰复久信': '某公司', '中研智创': '某公司乙', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Successfully masked content
celery_worker-1 | [2025-07-14 14:21:21,449: INFO/ForkPoolWorker-4] Successfully saved masked content to /app/storage/processed/47522ea9-c259-4304-bfe4-1d3ed6902ede.md
celery_worker-1 | [2025-07-14 14:21:21,470: INFO/ForkPoolWorker-4] Task app.services.file_service.process_file[5cfbca4c-0f6f-4c71-a66b-b22ee2d28139] succeeded in 311.847165101s: None
api-1 | INFO: 192.168.65.1:33432 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:40073 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:29550 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61350 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61755 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:63726 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:43446 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:45624 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:25256 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:43464 - "GET /api/v1/files/files HTTP/1.1" 200 OK

6
backend/package-lock.json generated Normal file
View File

@ -0,0 +1,6 @@
{
"name": "backend",
"lockfileVersion": 3,
"requires": true,
"packages": {}
}

32
backend/requirements.txt Normal file
View File

@ -0,0 +1,32 @@
# FastAPI and server
fastapi>=0.104.0
uvicorn>=0.24.0
python-multipart>=0.0.6
websockets>=12.0
# Database
sqlalchemy>=2.0.0
alembic>=1.12.0
# Background tasks
celery>=5.3.0
redis>=5.0.0
# Security
python-jose[cryptography]>=3.3.0
passlib[bcrypt]>=1.7.4
python-dotenv>=1.0.0
# Testing
pytest>=7.4.0
httpx>=0.25.0
# Existing project dependencies
pydantic-settings>=2.0.0
watchdog==2.1.6
requests==2.28.1
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0
# magic-pdf[full]
jsonschema>=4.20.0

View File

@ -0,0 +1,62 @@
import pytest
from app.core.document_handlers.ner_processor import NerProcessor
def test_generate_masked_mapping():
processor = NerProcessor()
unique_entities = [
{'text': '李雷', 'type': '人名'},
{'text': '李明', 'type': '人名'},
{'text': '王强', 'type': '人名'},
{'text': 'Acme Manufacturing Inc.', 'type': '英文公司名', 'industry': 'manufacturing'},
{'text': 'Google LLC', 'type': '英文公司名'},
{'text': 'A公司', 'type': '公司名称'},
{'text': 'B公司', 'type': '公司名称'},
{'text': 'John Smith', 'type': '英文人名'},
{'text': 'Elizabeth Windsor', 'type': '英文人名'},
{'text': '华梦龙光伏项目', 'type': '项目名'},
{'text': '案号12345', 'type': '案号'},
{'text': '310101198802080000', 'type': '身份证号'},
{'text': '9133021276453538XT', 'type': '社会信用代码'},
]
linkage = {
'entity_groups': [
{
'group_id': 'g1',
'group_type': '公司名称',
'entities': [
{'text': 'A公司', 'type': '公司名称', 'is_primary': True},
{'text': 'B公司', 'type': '公司名称', 'is_primary': False},
]
},
{
'group_id': 'g2',
'group_type': '人名',
'entities': [
{'text': '李雷', 'type': '人名', 'is_primary': True},
{'text': '李明', 'type': '人名', 'is_primary': False},
]
}
]
}
mapping = processor._generate_masked_mapping(unique_entities, linkage)
# 人名
assert mapping['李雷'].startswith('李某')
assert mapping['李明'].startswith('李某')
assert mapping['王强'].startswith('王某')
# 英文公司名
assert mapping['Acme Manufacturing Inc.'] == 'MANUFACTURING'
assert mapping['Google LLC'] == 'COMPANY'
# 公司名同组
assert mapping['A公司'] == mapping['B公司']
assert mapping['A公司'].endswith('公司')
# 英文人名
assert mapping['John Smith'] == 'J*** S***'
assert mapping['Elizabeth Windsor'] == 'E*** W***'
# 项目名
assert mapping['华梦龙光伏项目'].endswith('项目')
# 案号
assert mapping['案号12345'] == '***'
# 身份证号
assert mapping['310101198802080000'] == 'XXXXXX'
# 社会信用代码
assert mapping['9133021276453538XT'] == 'XXXXXXXX'

View File

@ -1,2 +0,0 @@
rm ./doc_src/*.md
cp ./doc/*.md ./doc_src/

105
docker-compose.yml Normal file
View File

@ -0,0 +1,105 @@
version: '3.8'
services:
# Mineru API Service
mineru-api:
build:
context: ./mineru
dockerfile: Dockerfile
platform: linux/arm64
ports:
- "8001:8000"
volumes:
- ./mineru/storage/uploads:/app/storage/uploads
- ./mineru/storage/processed:/app/storage/processed
environment:
- PYTHONUNBUFFERED=1
- MINERU_MODEL_SOURCE=local
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
networks:
- app-network
# Backend API Service
backend-api:
build:
context: ./backend
dockerfile: Dockerfile
ports:
- "8000:8000"
volumes:
- ./backend/storage:/app/storage
- ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
env_file:
- ./backend/.env
environment:
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
- MINERU_API_URL=http://mineru-api:8000
depends_on:
- redis
- mineru-api
networks:
- app-network
# Celery Worker
celery-worker:
build:
context: ./backend
dockerfile: Dockerfile
command: celery -A app.services.file_service worker --loglevel=info
volumes:
- ./backend/storage:/app/storage
- ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
env_file:
- ./backend/.env
environment:
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
- MINERU_API_URL=http://mineru-api:8000
depends_on:
- redis
- backend-api
networks:
- app-network
# Redis Service
redis:
image: redis:alpine
ports:
- "6379:6379"
networks:
- app-network
# Frontend Service
frontend:
build:
context: ./frontend
dockerfile: Dockerfile
args:
- REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
ports:
- "3000:80"
env_file:
- ./frontend/.env
environment:
- NODE_ENV=production
- REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
restart: unless-stopped
depends_on:
- backend-api
networks:
- app-network
networks:
app-network:
driver: bridge
volumes:
uploads:
processed:

168
export-images.sh Normal file
View File

@ -0,0 +1,168 @@
#!/bin/bash
# Docker Image Export Script
# Exports all project Docker images for migration to another environment
set -e
echo "🚀 Legal Document Masker - Docker Image Export"
echo "=============================================="
# Function to check if Docker is running
check_docker() {
if ! docker info > /dev/null 2>&1; then
echo "❌ Docker is not running. Please start Docker and try again."
exit 1
fi
echo "✅ Docker is running"
}
# Function to check if images exist
check_images() {
echo "🔍 Checking for required images..."
local missing_images=()
if ! docker images | grep -q "legal-doc-masker-backend-api"; then
missing_images+=("legal-doc-masker-backend-api")
fi
if ! docker images | grep -q "legal-doc-masker-frontend"; then
missing_images+=("legal-doc-masker-frontend")
fi
if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
missing_images+=("legal-doc-masker-mineru-api")
fi
if ! docker images | grep -q "redis:alpine"; then
missing_images+=("redis:alpine")
fi
if [ ${#missing_images[@]} -ne 0 ]; then
echo "❌ Missing images: ${missing_images[*]}"
echo "Please build the images first using: docker-compose build"
exit 1
fi
echo "✅ All required images found"
}
# Function to create export directory
create_export_dir() {
local export_dir="docker-images-export-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$export_dir"
# Log to stderr so the command substitution in main() captures only the directory name
echo "📁 Created export directory: $export_dir" >&2
echo "$export_dir"
}
# Function to export images
export_images() {
local export_dir="$1"
echo "📦 Exporting Docker images..."
# Export backend image
echo " 📦 Exporting backend-api image..."
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
# Export frontend image
echo " 📦 Exporting frontend image..."
docker save legal-doc-masker-frontend:latest -o frontend.tar
# Export mineru image
echo " 📦 Exporting mineru-api image..."
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
# Export redis image
echo " 📦 Exporting redis image..."
docker save redis:alpine -o redis.tar
echo "✅ All images exported successfully!"
}
# Function to show export summary
show_summary() {
echo ""
echo "📊 Export Summary:"
echo "=================="
ls -lh *.tar
echo ""
echo "📋 Files to transfer:"
echo "===================="
for file in *.tar; do
echo " - $file"
done
echo ""
echo "💾 Total size: $(du -sh . | cut -f1)"
}
# Function to create compressed archive
create_archive() {
echo ""
echo "🗜️ Creating compressed archive..."
local archive_name="legal-doc-masker-images-$(date +%Y%m%d-%H%M%S).tar.gz"
tar -czf "$archive_name" *.tar
echo "✅ Created archive: $archive_name"
echo "📊 Archive size: $(du -sh "$archive_name" | cut -f1)"
echo ""
echo "📋 Transfer options:"
echo "==================="
echo "1. Transfer individual .tar files"
echo "2. Transfer compressed archive: $archive_name"
}
# Function to show transfer instructions
show_transfer_instructions() {
echo ""
echo "📤 Transfer Instructions:"
echo "========================"
echo ""
echo "Option 1: Transfer individual files"
echo "-----------------------------------"
echo "scp *.tar user@target-server:/path/to/destination/"
echo ""
echo "Option 2: Transfer compressed archive"
echo "-------------------------------------"
echo "scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/"
echo ""
echo "Option 3: USB Drive"
echo "-------------------"
echo "cp *.tar /Volumes/USB_DRIVE/docker-images/"
echo "cp legal-doc-masker-images-*.tar.gz /Volumes/USB_DRIVE/"
echo ""
echo "Option 4: Cloud Storage"
echo "----------------------"
echo "aws s3 cp *.tar s3://your-bucket/docker-images/"
echo "aws s3 cp legal-doc-masker-images-*.tar.gz s3://your-bucket/docker-images/"
}
# Main execution
main() {
check_docker
check_images
local export_dir
export_dir=$(create_export_dir)
# Change into the export directory here (not inside the function, which runs in a subshell)
cd "$export_dir"
export_images "$export_dir"
show_summary
create_archive
show_transfer_instructions
echo ""
echo "🎉 Export completed successfully!"
echo "📁 Export location: $(pwd)"
echo ""
echo "Next steps:"
echo "1. Transfer the files to your target environment"
echo "2. Use import-images.sh on the target environment"
echo "3. Copy docker-compose.yml and other config files"
}
# Run main function
main "$@"

11
frontend/.dockerignore Normal file
View File

@ -0,0 +1,11 @@
node_modules
npm-debug.log
build
.git
.gitignore
README.md
.env
.env.local
.env.development.local
.env.test.local
.env.production.local

2
frontend/.env Normal file
View File

@ -0,0 +1,2 @@
# REACT_APP_API_BASE_URL=http://192.168.2.203:8000/api/v1
REACT_APP_API_BASE_URL=http://localhost:8000/api/v1

33
frontend/Dockerfile Normal file
View File

@ -0,0 +1,33 @@
# Build stage
FROM node:18-alpine as build
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies
RUN npm ci
# Copy source code
COPY . .
# Build the app with environment variables
ARG REACT_APP_API_BASE_URL
ENV REACT_APP_API_BASE_URL=$REACT_APP_API_BASE_URL
RUN npm run build
# Production stage
FROM nginx:alpine
# Copy built assets from build stage
COPY --from=build /app/build /usr/share/nginx/html
# Copy nginx configuration
COPY nginx.conf /etc/nginx/conf.d/default.conf
# Expose port 80
EXPOSE 80
# Start nginx
CMD ["nginx", "-g", "daemon off;"]

55
frontend/README.md Normal file
View File

@ -0,0 +1,55 @@
# Legal Document Masker Frontend
This is the frontend application for the Legal Document Masker service. It provides a user interface for uploading legal documents, monitoring their processing status, and downloading the masked versions.
## Features
- Drag and drop file upload
- Real-time status updates
- File list with processing status
- Multi-file selection and download
- Modern Material-UI interface
## Prerequisites
- Node.js (v14 or higher)
- npm (v6 or higher)
## Installation
1. Install dependencies:
```bash
npm install
```
2. Start the development server:
```bash
npm start
```
The application will be available at http://localhost:3000
## Development
The frontend is built with:
- React 18
- TypeScript
- Material-UI
- React Query for data fetching
- React Dropzone for file uploads
## Building for Production
To create a production build:
```bash
npm run build
```
The build artifacts will be stored in the `build/` directory.
## Environment Variables
The following environment variables can be configured:
- `REACT_APP_API_BASE_URL`: The base URL of the backend API (default: http://localhost:8000/api/v1)

View File

@ -0,0 +1,24 @@
version: '3.8'
services:
frontend:
build:
context: .
dockerfile: Dockerfile
args:
- REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
ports:
- "3000:80"
env_file:
- .env
environment:
- NODE_ENV=production
- REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
restart: unless-stopped
networks:
- app-network
networks:
app-network:
driver: bridge

25
frontend/nginx.conf Normal file
View File

@ -0,0 +1,25 @@
server {
listen 80;
server_name localhost;
location / {
root /usr/share/nginx/html;
index index.html;
try_files $uri $uri/ /index.html;
}
# Cache static assets
location /static/ {
root /usr/share/nginx/html;
expires 1y;
add_header Cache-Control "public, no-transform";
}
# Enable gzip compression
gzip on;
gzip_vary on;
gzip_min_length 10240;
gzip_proxied expired no-cache no-store private auth;
gzip_types text/plain text/css text/xml text/javascript application/x-javascript application/xml application/javascript;
gzip_disable "MSIE [1-6]\.";
}

16946
frontend/package-lock.json generated Normal file

File diff suppressed because it is too large Load Diff

50
frontend/package.json Normal file
View File

@ -0,0 +1,50 @@
{
"name": "legal-doc-masker-frontend",
"version": "0.1.0",
"private": true,
"dependencies": {
"@emotion/react": "^11.11.3",
"@emotion/styled": "^11.11.0",
"@mui/icons-material": "^5.15.10",
"@mui/material": "^5.15.10",
"@testing-library/jest-dom": "^5.17.0",
"@testing-library/react": "^13.4.0",
"@testing-library/user-event": "^13.5.0",
"@types/jest": "^27.5.2",
"@types/node": "^16.18.80",
"@types/react": "^18.2.55",
"@types/react-dom": "^18.2.19",
"axios": "^1.6.7",
"react": "^18.2.0",
"react-dom": "^18.2.0",
"react-dropzone": "^14.2.3",
"react-query": "^3.39.3",
"react-scripts": "5.0.1",
"typescript": "^4.9.5",
"web-vitals": "^2.1.4"
},
"scripts": {
"start": "react-scripts start",
"build": "react-scripts build",
"test": "react-scripts test",
"eject": "react-scripts eject"
},
"eslintConfig": {
"extends": [
"react-app",
"react-app/jest"
]
},
"browserslist": {
"production": [
">0.2%",
"not dead",
"not op_mini all"
],
"development": [
"last 1 chrome version",
"last 1 firefox version",
"last 1 safari version"
]
}
}

View File

@ -0,0 +1,20 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<link rel="icon" href="%PUBLIC_URL%/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="theme-color" content="#000000" />
<meta
name="description"
content="Legal Document Masker - Upload and process legal documents"
/>
<link rel="apple-touch-icon" href="%PUBLIC_URL%/logo192.png" />
<link rel="manifest" href="%PUBLIC_URL%/manifest.json" />
<title>Legal Document Masker</title>
</head>
<body>
<noscript>You need to enable JavaScript to run this app.</noscript>
<div id="root"></div>
</body>
</html>

View File

@ -0,0 +1,15 @@
{
"short_name": "Legal Doc Masker",
"name": "Legal Document Masker",
"icons": [
{
"src": "favicon.ico",
"sizes": "64x64 32x32 24x24 16x16",
"type": "image/x-icon"
}
],
"start_url": ".",
"display": "standalone",
"theme_color": "#000000",
"background_color": "#ffffff"
}

58
frontend/src/App.tsx Normal file
View File

@ -0,0 +1,58 @@
import React, { useEffect, useState } from 'react';
import { Container, Typography, Box } from '@mui/material';
import { useQuery, useQueryClient } from 'react-query';
import FileUpload from './components/FileUpload';
import FileList from './components/FileList';
import { File } from './types/file';
import { api } from './services/api';
function App() {
const queryClient = useQueryClient();
const [files, setFiles] = useState<File[]>([]);
const { data, isLoading, error } = useQuery<File[]>('files', api.listFiles, {
refetchInterval: 5000, // Poll every 5 seconds
});
useEffect(() => {
if (data) {
setFiles(data);
}
}, [data]);
const handleUploadComplete = () => {
queryClient.invalidateQueries('files');
};
if (isLoading) {
return (
<Container>
<Typography>Loading...</Typography>
</Container>
);
}
if (error) {
return (
<Container>
<Typography color="error">Error loading files</Typography>
</Container>
);
}
return (
<Container maxWidth="lg">
<Box sx={{ my: 4 }}>
<Typography variant="h4" component="h1" gutterBottom>
Legal Document Masker
</Typography>
<Box sx={{ mb: 4 }}>
<FileUpload onUploadComplete={handleUploadComplete} />
</Box>
<FileList files={files} onFileStatusChange={handleUploadComplete} />
</Box>
</Container>
);
}
export default App;

View File

@ -0,0 +1,230 @@
import React, { useState } from 'react';
import {
Table,
TableBody,
TableCell,
TableContainer,
TableHead,
TableRow,
Paper,
IconButton,
Checkbox,
Button,
Chip,
Dialog,
DialogTitle,
DialogContent,
DialogActions,
Typography,
} from '@mui/material';
import { Download as DownloadIcon, Delete as DeleteIcon } from '@mui/icons-material';
import { File, FileStatus } from '../types/file';
import { api } from '../services/api';
interface FileListProps {
files: File[];
onFileStatusChange: () => void;
}
const FileList: React.FC<FileListProps> = ({ files, onFileStatusChange }) => {
const [selectedFiles, setSelectedFiles] = useState<string[]>([]);
const [deleteDialogOpen, setDeleteDialogOpen] = useState(false);
const [fileToDelete, setFileToDelete] = useState<string | null>(null);
const handleSelectFile = (fileId: string) => {
setSelectedFiles((prev) =>
prev.includes(fileId)
? prev.filter((id) => id !== fileId)
: [...prev, fileId]
);
};
const handleSelectAll = () => {
setSelectedFiles((prev) =>
prev.length === files.length ? [] : files.map((file) => file.id)
);
};
const handleDownload = async (fileId: string) => {
try {
console.log('=== FRONTEND DOWNLOAD START ===');
console.log('File ID:', fileId);
const file = files.find((f) => f.id === fileId);
console.log('File object:', file);
const blob = await api.downloadFile(fileId);
console.log('Blob received:', blob);
console.log('Blob type:', blob.type);
console.log('Blob size:', blob.size);
const url = window.URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
// Match backend behavior: change extension to .md
const originalFilename = file?.filename || 'downloaded-file';
const filenameWithoutExt = originalFilename.replace(/\.[^/.]+$/, ''); // Remove extension
const downloadFilename = `${filenameWithoutExt}.md`;
console.log('Original filename:', originalFilename);
console.log('Filename without extension:', filenameWithoutExt);
console.log('Download filename:', downloadFilename);
a.download = downloadFilename;
document.body.appendChild(a);
a.click();
window.URL.revokeObjectURL(url);
document.body.removeChild(a);
console.log('=== FRONTEND DOWNLOAD END ===');
} catch (error) {
console.error('Error downloading file:', error);
}
};
const handleDownloadSelected = async () => {
for (const fileId of selectedFiles) {
await handleDownload(fileId);
}
};
const handleDeleteClick = (fileId: string) => {
setFileToDelete(fileId);
setDeleteDialogOpen(true);
};
const handleDeleteConfirm = async () => {
if (fileToDelete) {
try {
await api.deleteFile(fileToDelete);
onFileStatusChange();
} catch (error) {
console.error('Error deleting file:', error);
}
}
setDeleteDialogOpen(false);
setFileToDelete(null);
};
const handleDeleteCancel = () => {
setDeleteDialogOpen(false);
setFileToDelete(null);
};
const getStatusColor = (status: FileStatus) => {
switch (status) {
case FileStatus.SUCCESS:
return 'success';
case FileStatus.FAILED:
return 'error';
case FileStatus.PROCESSING:
return 'warning';
default:
return 'default';
}
};
return (
<div>
<div style={{ marginBottom: '1rem' }}>
<Button
variant="contained"
color="primary"
onClick={handleDownloadSelected}
disabled={selectedFiles.length === 0}
sx={{ mr: 1 }}
>
Download Selected
</Button>
</div>
<TableContainer component={Paper}>
<Table>
<TableHead>
<TableRow>
<TableCell padding="checkbox">
<Checkbox
checked={files.length > 0 && selectedFiles.length === files.length}
indeterminate={selectedFiles.length > 0 && selectedFiles.length < files.length}
onChange={handleSelectAll}
/>
</TableCell>
<TableCell>Filename</TableCell>
<TableCell>Status</TableCell>
<TableCell>Created At</TableCell>
<TableCell>Finished At</TableCell>
<TableCell>Actions</TableCell>
</TableRow>
</TableHead>
<TableBody>
{files.map((file) => (
<TableRow key={file.id}>
<TableCell padding="checkbox">
<Checkbox
checked={selectedFiles.includes(file.id)}
onChange={() => handleSelectFile(file.id)}
/>
</TableCell>
<TableCell>{file.filename}</TableCell>
<TableCell>
<Chip
label={file.status}
color={getStatusColor(file.status) as any}
size="small"
/>
</TableCell>
<TableCell>
{new Date(file.created_at).toLocaleString()}
</TableCell>
<TableCell>
{(file.status === FileStatus.SUCCESS || file.status === FileStatus.FAILED)
? new Date(file.updated_at).toLocaleString()
: '—'}
</TableCell>
<TableCell>
<IconButton
onClick={() => handleDeleteClick(file.id)}
size="small"
color="error"
sx={{ mr: 1 }}
>
<DeleteIcon />
</IconButton>
{file.status === FileStatus.SUCCESS && (
<IconButton
onClick={() => handleDownload(file.id)}
size="small"
color="primary"
>
<DownloadIcon />
</IconButton>
)}
</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</TableContainer>
<Dialog
open={deleteDialogOpen}
onClose={handleDeleteCancel}
>
<DialogTitle>Confirm Delete</DialogTitle>
<DialogContent>
<Typography>
Are you sure you want to delete this file? This action cannot be undone.
</Typography>
</DialogContent>
<DialogActions>
<Button onClick={handleDeleteCancel}>Cancel</Button>
<Button onClick={handleDeleteConfirm} color="error" variant="contained">
Delete
</Button>
</DialogActions>
</Dialog>
</div>
);
};
export default FileList;

View File

@ -0,0 +1,66 @@
import React, { useCallback } from 'react';
import { useDropzone } from 'react-dropzone';
import { Box, Typography, CircularProgress } from '@mui/material';
import { api } from '../services/api';
interface FileUploadProps {
onUploadComplete: () => void;
}
const FileUpload: React.FC<FileUploadProps> = ({ onUploadComplete }) => {
const [isUploading, setIsUploading] = React.useState(false);
const onDrop = useCallback(async (acceptedFiles: File[]) => {
setIsUploading(true);
try {
for (const file of acceptedFiles) {
await api.uploadFile(file);
}
onUploadComplete();
} catch (error) {
console.error('Error uploading files:', error);
} finally {
setIsUploading(false);
}
}, [onUploadComplete]);
const { getRootProps, getInputProps, isDragActive } = useDropzone({
onDrop,
accept: {
'application/pdf': ['.pdf'],
'application/msword': ['.doc'],
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
'text/markdown': ['.md'],
},
});
return (
<Box
{...getRootProps()}
sx={{
border: '2px dashed #ccc',
borderRadius: 2,
p: 3,
textAlign: 'center',
cursor: 'pointer',
bgcolor: isDragActive ? 'action.hover' : 'background.paper',
'&:hover': {
bgcolor: 'action.hover',
},
}}
>
<input {...getInputProps()} />
{isUploading ? (
<CircularProgress />
) : (
<Typography>
{isDragActive
? 'Drop the files here...'
: 'Drag and drop files here, or click to select files'}
</Typography>
)}
</Box>
);
};
export default FileUpload;

8
frontend/src/env.d.ts vendored Normal file
View File

@ -0,0 +1,8 @@
/// <reference types="react-scripts" />
declare namespace NodeJS {
interface ProcessEnv {
readonly REACT_APP_API_BASE_URL: string;
// Add other environment variables here
}
}

29
frontend/src/index.tsx Normal file
View File

@ -0,0 +1,29 @@
import React from 'react';
import ReactDOM from 'react-dom/client';
import { QueryClient, QueryClientProvider } from 'react-query';
import { ThemeProvider, createTheme } from '@mui/material';
import CssBaseline from '@mui/material/CssBaseline';
import App from './App';
const queryClient = new QueryClient();
const theme = createTheme({
palette: {
mode: 'light',
},
});
const root = ReactDOM.createRoot(
document.getElementById('root') as HTMLElement
);
root.render(
<React.StrictMode>
<QueryClientProvider client={queryClient}>
<ThemeProvider theme={theme}>
<CssBaseline />
<App />
</ThemeProvider>
</QueryClientProvider>
</React.StrictMode>
);

View File

@ -0,0 +1,44 @@
import axios from 'axios';
import { File, FileUploadResponse } from '../types/file';
const API_BASE_URL = process.env.REACT_APP_API_BASE_URL || 'http://localhost:8000/api/v1';
// Create axios instance with default config
const axiosInstance = axios.create({
baseURL: API_BASE_URL,
timeout: 30000, // 30 seconds timeout
});
export const api = {
uploadFile: async (file: globalThis.File): Promise<FileUploadResponse> => {
const formData = new FormData();
formData.append('file', file);
const response = await axiosInstance.post('/files/upload', formData, {
headers: {
'Content-Type': 'multipart/form-data',
},
});
return response.data;
},
listFiles: async (): Promise<File[]> => {
const response = await axiosInstance.get('/files/files');
return response.data;
},
getFile: async (fileId: string): Promise<File> => {
const response = await axiosInstance.get(`/files/files/${fileId}`);
return response.data;
},
downloadFile: async (fileId: string): Promise<Blob> => {
const response = await axiosInstance.get(`/files/files/${fileId}/download`, {
responseType: 'blob',
});
return response.data;
},
deleteFile: async (fileId: string): Promise<void> => {
await axiosInstance.delete(`/files/files/${fileId}`);
},
};

View File

@ -0,0 +1,23 @@
export enum FileStatus {
NOT_STARTED = "not_started",
PROCESSING = "processing",
SUCCESS = "success",
FAILED = "failed"
}
export interface File {
id: string;
filename: string;
status: FileStatus;
error_message?: string;
created_at: string;
updated_at: string;
}
export interface FileUploadResponse {
id: string;
filename: string;
status: FileStatus;
created_at: string;
updated_at: string;
}

26
frontend/tsconfig.json Normal file
View File

@ -0,0 +1,26 @@
{
"compilerOptions": {
"target": "es5",
"lib": [
"dom",
"dom.iterable",
"esnext"
],
"allowJs": true,
"skipLibCheck": true,
"esModuleInterop": true,
"allowSyntheticDefaultImports": true,
"strict": true,
"forceConsistentCasingInFileNames": true,
"noFallthroughCasesInSwitch": true,
"module": "esnext",
"moduleResolution": "node",
"resolveJsonModule": true,
"isolatedModules": true,
"noEmit": true,
"jsx": "react-jsx"
},
"include": [
"src"
]
}

232
import-images.sh Normal file
View File

@ -0,0 +1,232 @@
#!/bin/bash
# Docker Image Import Script
# Imports Docker images on target environment for migration
set -e
echo "🚀 Legal Document Masker - Docker Image Import"
echo "=============================================="
# Function to check if Docker is running
check_docker() {
if ! docker info > /dev/null 2>&1; then
echo "❌ Docker is not running. Please start Docker and try again."
exit 1
fi
echo "✅ Docker is running"
}
# Function to check for tar files
check_tar_files() {
echo "🔍 Checking for Docker image files..."
local missing_files=()
if [ ! -f "backend-api.tar" ]; then
missing_files+=("backend-api.tar")
fi
if [ ! -f "frontend.tar" ]; then
missing_files+=("frontend.tar")
fi
if [ ! -f "mineru-api.tar" ]; then
missing_files+=("mineru-api.tar")
fi
if [ ! -f "redis.tar" ]; then
missing_files+=("redis.tar")
fi
if [ ${#missing_files[@]} -ne 0 ]; then
echo "❌ Missing files: ${missing_files[*]}"
echo ""
echo "Please ensure all .tar files are in the current directory."
echo "If you have a compressed archive, extract it first:"
echo " tar -xzf legal-doc-masker-images-*.tar.gz"
exit 1
fi
echo "✅ All required files found"
}
# Function to check available disk space
check_disk_space() {
echo "💾 Checking available disk space..."
local required_space=0
for file in *.tar; do
local file_size=$(stat -f%z "$file" 2>/dev/null || stat -c%s "$file" 2>/dev/null || echo 0)
required_space=$((required_space + file_size))
done
local available_space=$(df . | awk 'NR==2 {print $4}')
available_space=$((available_space * 1024)) # Convert to bytes
if [ $required_space -gt $available_space ]; then
echo "❌ Insufficient disk space"
echo "Required: $(numfmt --to=iec $required_space)"
echo "Available: $(numfmt --to=iec $available_space)"
exit 1
fi
echo "✅ Sufficient disk space available"
}
# Function to import images
import_images() {
echo "📦 Importing Docker images..."
# Import backend image
echo " 📦 Importing backend-api image..."
docker load -i backend-api.tar
# Import frontend image
echo " 📦 Importing frontend image..."
docker load -i frontend.tar
# Import mineru image
echo " 📦 Importing mineru-api image..."
docker load -i mineru-api.tar
# Import redis image
echo " 📦 Importing redis image..."
docker load -i redis.tar
echo "✅ All images imported successfully!"
}
# Function to verify imported images
verify_images() {
echo "🔍 Verifying imported images..."
local missing_images=()
if ! docker images | grep -q "legal-doc-masker-backend-api"; then
missing_images+=("legal-doc-masker-backend-api")
fi
if ! docker images | grep -q "legal-doc-masker-frontend"; then
missing_images+=("legal-doc-masker-frontend")
fi
if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
missing_images+=("legal-doc-masker-mineru-api")
fi
if ! docker images | grep -q "redis:alpine"; then
missing_images+=("redis:alpine")
fi
if [ ${#missing_images[@]} -ne 0 ]; then
echo "❌ Missing imported images: ${missing_images[*]}"
exit 1
fi
echo "✅ All images verified successfully!"
}
# Function to show imported images
show_imported_images() {
echo ""
echo "📊 Imported Images:"
echo "==================="
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep redis
}
# Function to create necessary directories
create_directories() {
echo ""
echo "📁 Creating necessary directories..."
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
echo "✅ Directories created"
}
# Function to check for required files
check_required_files() {
echo ""
echo "🔍 Checking for required configuration files..."
local missing_files=()
if [ ! -f "docker-compose.yml" ]; then
missing_files+=("docker-compose.yml")
fi
if [ ! -f "DOCKER_COMPOSE_README.md" ]; then
missing_files+=("DOCKER_COMPOSE_README.md")
fi
if [ ${#missing_files[@]} -ne 0 ]; then
echo "⚠️ Missing files: ${missing_files[*]}"
echo "Please copy these files from the source environment:"
echo " - docker-compose.yml"
echo " - DOCKER_COMPOSE_README.md"
echo " - backend/.env (if exists)"
echo " - frontend/.env (if exists)"
echo " - mineru/.env (if exists)"
else
echo "✅ All required configuration files found"
fi
}
# Function to show next steps
show_next_steps() {
echo ""
echo "🎉 Import completed successfully!"
echo ""
echo "📋 Next Steps:"
echo "=============="
echo ""
echo "1. Copy configuration files (if not already present):"
echo " - docker-compose.yml"
echo " - backend/.env"
echo " - frontend/.env"
echo " - mineru/.env"
echo ""
echo "2. Start the services:"
echo " docker-compose up -d"
echo ""
echo "3. Verify services are running:"
echo " docker-compose ps"
echo ""
echo "4. Test the endpoints:"
echo " - Frontend: http://localhost:3000"
echo " - Backend API: http://localhost:8000"
echo " - Mineru API: http://localhost:8001"
echo ""
echo "5. View logs if needed:"
echo " docker-compose logs -f [service-name]"
}
# Function to handle compressed archive
handle_compressed_archive() {
if ls legal-doc-masker-images-*.tar.gz 1> /dev/null 2>&1; then
echo "🗜️ Found compressed archive, extracting..."
tar -xzf legal-doc-masker-images-*.tar.gz
echo "✅ Archive extracted"
fi
}
# Main execution
main() {
check_docker
handle_compressed_archive
check_tar_files
check_disk_space
import_images
verify_images
show_imported_images
create_directories
check_required_files
show_next_steps
}
# Run main function
main "$@"

46
mineru/Dockerfile Normal file
View File

@ -0,0 +1,46 @@
FROM python:3.12-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
libreoffice \
wget \
&& rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN pip install uv
# Configure uv and install mineru
ENV UV_SYSTEM_PYTHON=1
RUN uv pip install --system -U "mineru[core]"
# Copy requirements first to leverage Docker cache
# COPY requirements.txt .
# RUN pip install huggingface_hub
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py
# RUN python download_models_hf.py
RUN mineru-models-download -s modelscope -m pipeline
# RUN pip install --no-cache-dir -r requirements.txt
# RUN pip install -U magic-pdf[full]
# Copy the rest of the application
# COPY . .
# Create storage directories
# RUN mkdir -p storage/uploads storage/processed
# Expose the port the app runs on
EXPOSE 8000
# Command to run the application
CMD ["mineru-api", "--host", "0.0.0.0", "--port", "8000"]

27
mineru/docker-compose.yml Normal file
View File

@ -0,0 +1,27 @@
version: '3.8'
services:
mineru-api:
build:
context: .
dockerfile: Dockerfile
platform: linux/arm64
ports:
- "8001:8000"
volumes:
- ./storage/uploads:/app/storage/uploads
- ./storage/processed:/app/storage/processed
environment:
- PYTHONUNBUFFERED=1
- MINERU_MODEL_SOURCE=local
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
volumes:
uploads:
processed:

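The compose file above exposes the mineru container's port 8000 on the host as 8001 and health-checks it at `/health`. As a minimal sketch, another container on the same compose network (for example the backend) could probe that endpoint by service name; the `MINERU_API_URL` variable name and the use of `requests` are illustrative assumptions, not part of this compose file:

```python
# Sketch only: probe the mineru /health endpoint over the compose network.
# Inside the network the service is reachable as http://mineru-api:8000;
# from the host it would be http://localhost:8001.
import os
import requests

MINERU_API_URL = os.getenv("MINERU_API_URL", "http://mineru-api:8000")

def mineru_is_healthy(timeout: float = 5.0) -> bool:
    """Return True if the mineru /health endpoint answers with HTTP 200."""
    try:
        response = requests.get(f"{MINERU_API_URL}/health", timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("mineru reachable:", mineru_is_healthy())
```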
View File

@ -1,11 +0,0 @@
# Base dependencies
pydantic-settings>=2.0.0
python-dotenv==1.0.0
watchdog==2.1.6
requests==2.28.1
# Document processing
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0
magic-pdf[full]

43
sample_doc/short_doc.md Normal file
View File

@ -0,0 +1,43 @@
# 北京市第三中级人民法院民事判决书
(2022)京 03 民终 3852 号
上诉人原审原告北京丰复久信营销科技有限公司住所地北京市海淀区北小马厂6 号1 号楼华天大厦1306 室。
法定代表人:郭东军,执行董事、经理。委托诉讼代理人:周大海,北京市康达律师事务所律师。委托诉讼代理人:王乃哲,北京市康达律师事务所律师。
被上诉人原审被告中研智创区块链技术有限公司住所地天津市津南区双港镇工业园区优谷产业园5 号楼-1505。
法定代表人:王欢子,总经理。
委托诉讼代理人:魏鑫,北京市昊衡律师事务所律师。
1.上诉人北京丰复久信营销科技有限公司以下简称丰复久信公司因与被上诉人中研智创区块链技术有限公司以下简称中研智创公司服务合同纠纷一案不服北京市朝阳区人民法院2020京0105 民初69754 号民事判决,向本院提起上诉。本院立案后,依法组成合议庭开庭进行了审理。上诉人丰复久信公司之委托诉讼代理人周大海、王乃哲,被上诉人中研智创公司之委托诉讼代理人魏鑫到庭参加诉讼。本案现已审理终结。
2.丰复久信公司上诉请求1.撤销一审判决发回重审或依法改判支持丰复久信公司一审全部诉讼请求2.或在维持原判的同时判令中研智创公司向丰复久信公司返还 1000 万元款项,并赔偿丰复久信公司因此支付的律师费 220 万元3.判令中研智创公司承担本案一审、二审全部诉讼费用。事实与理由一、根据2019 年的政策导向丰复久信公司的投资行为并无任何法律或政策瑕疵。丰复久信公司仅投资挖矿没有购买比特币故在当时国家、政府层面有相关政策支持甚至鼓励的前提下一审法院仅凭“挖矿”行为就得出丰复久信公司扰乱金融秩序的结论是错误的。二、一审法院没有全面、深入审查相关事实且遗漏了最核心的数据调查工作。三、本案一审判决适用法律错误。涉案合同成立及履行期间并无合同无效的情形当属有效。一审法院以挖矿活动耗能巨大、不利于我国产业结构调整为依据之一作出合同无效的判决实属牵强。最高人民法院发布的全国法院系统2020 年度优秀案例分析评选活动获奖名单中,由上海市第一中级人民法院刘江法官编写的“李圣艳、布兰登·斯密特诉闫向东、李敏等财产损害赔偿纠纷案— —比特币的法律属性及其司法救济”一案入选,该案同样发生在丰复久信公司与中研智创公司合同履行过程中,一审法院认定同时期同类型的涉案合同无效,与上述最高人民法院的优秀案例相悖。四、一审法院径行认定合同无效,未向丰复久信公司进行释明构成程序违法。
3.中研智创公司辩称,同意一审判决,不同意丰复久信公司的上诉请求。首先,一审法院曾在庭审中询问丰复久信公司关于机器返还的问题,一审法院进行了释明。其次,如二审法院对其该项上诉请求进行判决,会剥夺中研智创公司针对该部分请求再行上诉的权利。
4.丰复久信公司向一审法院起诉请求1.中研智创公司交付278.1654976 个比特币,或者按照 2021 年 1 月 25 日比特币的价格交付9550812.36 美元2.中研智创公司赔偿丰复久信公司服务期到期后占用微型存储空间服务器的损失(自2020 年7 月1日起至实际返还服务器时止按照bitinfocharts 网站公布的相关日产比特币数据计算应赔偿比特币数量或按照2021 年1 月25 日比特币的价格交付美元)。
5.一审法院查明事实2019 年5 月6 日,丰复久信公司作为甲方(买方)与乙方(卖方)中研智创公司签订《计算机设备采购合
同》约定货物名称为计算机设备型号规格及数量为T2T-30T 规格型号的微型存储空间服务器1542 台单价5040/ 台合同金额为 7 771 680 元;交货期 2019 年 8 月 31 日前;交货方式为乙方自行送货到甲方所在地,并提供安装服务,运输工具及运费由乙方负责;交货地点北京;签订购货合同,设备安装完毕后一次性支付项目总货款;乙方提供货物的质量保证期为自交货验收结束之日起不少于十二个月(具体按清单要求);乙方交货前应对产品作出全面检查和对验收文件进行整理,并列出清单,作为甲方收货验收和使用的技术条件依据,检验的结果应随货物交甲方,甲方对乙方提供的货物在使用前进行调试时,乙方协助甲方一起调试,直到符合技术要求,甲方才做最终验收,验收时乙方必须在现场,验收完毕后作出验收结果报告,并经双方签字生效。
6.同日丰复久信公司作为甲方客户方与乙方中研智创公司服务方签订《服务合同书》约定乙方同意就采购合同中的微型存储空间服务器向甲方提供特定服务服务的内容包括质保、维修、服务器设备代为运行管理、代为缴纳服务器相关用度花费如电费等详细内容见附件一如果乙方在工作中因自身过错而发生任何错误或遗漏应无条件更正不另外收费并对因此而对甲方造成的损失承担赔偿责任赔偿额以本合同约定的服务费为限若因甲方原因造成工作延误将由甲方承担相应的损失服务费总金额为2 228 320 元甲乙双方一致同意项目服务费以人民币形式于本合同签订后3 日内一次性支付甲方可以提前10 个工作日以书面形式要求变更或增加所提供的服务该等变更最终应由双方商定认可其中包括与该等变更有关的任何费用调整等。合同后附附件一以表格形式列明1.1542 台T2T-30T 微型存储空间服务器的质保、维修时限12 个月完成标准为完成甲方指定的运行量2.服务器的日常运行管理时限12 个月3.代扣代缴电费4.其他(空白)。
24. 2021 年9 月3 日国家发展和改革委员会等部门《关于整治虚拟货币“挖矿”活动的通知》显示,虚拟货币挖矿活动能源消耗和碳排放量大,对国民经济贡献度低,对产业发展、科技进步等带动作用有限,加之虚拟货币生产、交易环节衍生的风险越发突出,其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响。故以电力资源、碳排放量为代价的“挖矿”行为,与经济社会高质量发展和碳达峰、碳中和目标相悖,与公共利益相悖。
26. 综上,相关部门整治虚拟货币“挖矿”活动、认定虚拟货币相关业务活动属于非法金融活动,有利于保障我国发展利益和金融安全。从“挖矿”行为的高能耗以及比特币交易活动对国家金融秩序和社会秩序的影响来看,一审法院认定涉案合同无效是正确的。双方作为社会主义市场经济主体,既应遵守市场经济规则,亦应承担起相应的社会责任,推动经济社会高质量发展、可持续发展。
27. 关于合同无效后的返还问题,一审法院未予处理,双方可另行解决。
28. 综上所述,丰复久信公司的上诉请求不能成立,应予驳回;一审判决并无不当,应予维持。依照《中华人民共和国民事诉讼法》第一百七十七条第一款第一项规定,判决如下:
驳回上诉,维持原判。
二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
29. 本判决为终审判决。
审 判 长 史晓霞审 判 员 邓青菁审 判 员 李 淼二〇二二年七月七日法 官 助 理 黎 铧书 记 员 郑海兴

110
setup-unified-docker.sh Normal file
View File

@ -0,0 +1,110 @@
#!/bin/bash
# Unified Docker Compose Setup Script
# This script helps set up the unified Docker Compose environment
set -e
echo "🚀 Setting up Unified Docker Compose Environment"
# Function to check if Docker is running
check_docker() {
if ! docker info > /dev/null 2>&1; then
echo "❌ Docker is not running. Please start Docker and try again."
exit 1
fi
echo "✅ Docker is running"
}
# Function to stop existing individual services
stop_individual_services() {
echo "🛑 Stopping individual Docker Compose services..."
if [ -f "backend/docker-compose.yml" ]; then
echo "Stopping backend services..."
cd backend && docker-compose down 2>/dev/null || true && cd ..
fi
if [ -f "frontend/docker-compose.yml" ]; then
echo "Stopping frontend services..."
cd frontend && docker-compose down 2>/dev/null || true && cd ..
fi
if [ -f "mineru/docker-compose.yml" ]; then
echo "Stopping mineru services..."
cd mineru && docker-compose down 2>/dev/null || true && cd ..
fi
echo "✅ Individual services stopped"
}
# Function to create necessary directories
create_directories() {
echo "📁 Creating necessary directories..."
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
echo "✅ Directories created"
}
# Function to check if unified docker-compose.yml exists
check_unified_compose() {
if [ ! -f "docker-compose.yml" ]; then
echo "❌ Unified docker-compose.yml not found in current directory"
echo "Please run this script from the project root directory"
exit 1
fi
echo "✅ Unified docker-compose.yml found"
}
# Function to build and start services
start_unified_services() {
echo "🔨 Building and starting unified services..."
# Build all services
docker-compose build
# Start services
docker-compose up -d
echo "✅ Unified services started"
}
# Function to check service status
check_service_status() {
echo "📊 Checking service status..."
docker-compose ps
echo ""
echo "🌐 Service URLs:"
echo "Frontend: http://localhost:3000"
echo "Backend API: http://localhost:8000"
echo "Mineru API: http://localhost:8001"
echo ""
echo "📝 To view logs: docker-compose logs -f [service-name]"
echo "📝 To stop services: docker-compose down"
}
# Main execution
main() {
echo "=========================================="
echo "Unified Docker Compose Setup"
echo "=========================================="
check_docker
check_unified_compose
stop_individual_services
create_directories
start_unified_services
check_service_status
echo ""
echo "🎉 Setup complete! Your unified Docker environment is ready."
echo "Check the DOCKER_COMPOSE_README.md for more information."
}
# Run main function
main "$@"

View File

@ -1,31 +0,0 @@
# settings.py
from pydantic_settings import BaseSettings
from typing import Optional
class Settings(BaseSettings):
# Storage paths
OBJECT_STORAGE_PATH: str = ""
TARGET_DIRECTORY_PATH: str = ""
# Ollama API settings
OLLAMA_API_URL: str = "https://api.ollama.com"
OLLAMA_API_KEY: str = ""
OLLAMA_MODEL: str = "llama2"
# File monitoring settings
MONITOR_INTERVAL: int = 5
# Logging settings
LOG_LEVEL: str = "INFO"
LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
LOG_FILE: str = "app.log"
class Config:
env_file = ".env"
env_file_encoding = "utf-8"
extra = "allow"
# Create settings instance
settings = Settings()

View File

@ -1,190 +0,0 @@
from abc import ABC, abstractmethod
from typing import Any, Dict
from prompts.masking_prompts import get_masking_mapping_prompt
import logging
import json
from services.ollama_client import OllamaClient
from config.settings import settings
from utils.json_extractor import LLMJsonExtractor
logger = logging.getLogger(__name__)
class DocumentProcessor(ABC):
def __init__(self):
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
self.max_chunk_size = 1000 # Maximum number of characters per chunk
self.max_retries = 3 # Maximum number of retries for mapping generation
@abstractmethod
def read_content(self) -> str:
"""Read document content"""
pass
def _split_into_chunks(self, sentences: list[str]) -> list[str]:
"""Split sentences into chunks that don't exceed max_chunk_size"""
chunks = []
current_chunk = ""
for sentence in sentences:
if not sentence.strip():
continue
# If adding this sentence would exceed the limit, save current chunk and start new one
if len(current_chunk) + len(sentence) > self.max_chunk_size and current_chunk:
chunks.append(current_chunk)
current_chunk = sentence
else:
if current_chunk:
current_chunk += "" + sentence
else:
current_chunk = sentence
# Add the last chunk if it's not empty
if current_chunk:
chunks.append(current_chunk)
return chunks
def _validate_mapping_format(self, mapping: Dict[str, Any]) -> bool:
"""
Validate that the mapping follows the required format:
{
"原文1": "脱敏后1",
"原文2": "脱敏后2",
...
}
"""
if not isinstance(mapping, dict):
logger.warning("Mapping is not a dictionary")
return False
# Check if any key or value is not a string
for key, value in mapping.items():
if not isinstance(key, str) or not isinstance(value, str):
logger.warning(f"Invalid mapping format - key or value is not a string: {key}: {value}")
return False
# Check if the mapping has any nested structures
if any(isinstance(v, (dict, list)) for v in mapping.values()):
logger.warning("Invalid mapping format - contains nested structures")
return False
return True
def _build_mapping(self, chunk: str) -> Dict[str, str]:
"""Build mapping for a single chunk of text with retry logic"""
for attempt in range(self.max_retries):
try:
formatted_prompt = get_masking_mapping_prompt(chunk)
logger.info(f"Calling ollama to generate mapping for chunk (attempt {attempt + 1}/{self.max_retries}): {formatted_prompt}")
response = self.ollama_client.generate(formatted_prompt)
logger.info(f"Raw response from LLM: {response}")
# Parse the JSON response into a dictionary
mapping = LLMJsonExtractor.parse_raw_json_str(response)
logger.info(f"Parsed mapping: {mapping}")
if mapping and self._validate_mapping_format(mapping):
return mapping
else:
logger.warning(f"Invalid mapping format received on attempt {attempt + 1}, retrying...")
except Exception as e:
logger.error(f"Error generating mapping on attempt {attempt + 1}: {e}")
if attempt < self.max_retries - 1:
logger.info("Retrying...")
else:
logger.error("Max retries reached, returning empty mapping")
return {}
def _apply_mapping(self, text: str, mapping: Dict[str, str]) -> str:
"""Apply the mapping to replace sensitive information"""
masked_text = text
for original, masked in mapping.items():
# Ensure masked value is a string
if isinstance(masked, dict):
# If it's a dict, use the first value or a default
masked = next(iter(masked.values()), "")
elif not isinstance(masked, str):
# If it's not a string, convert to string or use default
masked = str(masked) if masked is not None else ""
masked_text = masked_text.replace(original, masked)
return masked_text
def _get_next_suffix(self, value: str) -> str:
"""Get the next available suffix for a value that already has a suffix"""
# Define the sequence of suffixes
suffixes = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']
# Check if the value already has a suffix
for suffix in suffixes:
if value.endswith(suffix):
# Find the next suffix in the sequence
current_index = suffixes.index(suffix)
if current_index + 1 < len(suffixes):
return value[:-1] + suffixes[current_index + 1]
else:
# If we've used all suffixes, start over with the first one
return value[:-1] + suffixes[0]
# If no suffix found, return the value with the first suffix
return value + '甲'
def _merge_mappings(self, existing: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
"""
Merge two mappings following the rules:
1. If key exists in existing, keep existing value
2. If value exists in existing:
- If value ends with a suffix (甲乙丙丁...), add next suffix
- If no suffix, add '甲'
"""
result = existing.copy()
# Get all existing values
existing_values = set(result.values())
for key, value in new.items():
if key in result:
# Rule 1: Keep existing value if key exists
continue
if value in existing_values:
# Rule 2: Handle duplicate values
new_value = self._get_next_suffix(value)
result[key] = new_value
existing_values.add(new_value)
else:
# No conflict, add as is
result[key] = value
existing_values.add(value)
return result
def process_content(self, content: str) -> str:
"""Process document content by masking sensitive information"""
# Split content into sentences
sentences = content.split("。")
# Split sentences into manageable chunks
chunks = self._split_into_chunks(sentences)
logger.info(f"Split content into {len(chunks)} chunks")
# Build mapping for each chunk
combined_mapping = {}
for i, chunk in enumerate(chunks):
logger.info(f"Processing chunk {i+1}/{len(chunks)}")
chunk_mapping = self._build_mapping(chunk)
if chunk_mapping: # Only update if we got a valid mapping
combined_mapping = self._merge_mappings(combined_mapping, chunk_mapping)
else:
logger.warning(f"Failed to generate mapping for chunk {i+1}")
# Apply the combined mapping to the entire content
masked_content = self._apply_mapping(content, combined_mapping)
logger.info("Successfully masked content")
return masked_content
@abstractmethod
def save_content(self, content: str) -> None:
"""Save processed content"""
pass

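The subtlest behaviour in this class is how per-chunk mappings are merged: an existing key wins, and a duplicate masked value gets the next suffix in the 甲乙丙丁 sequence. Below is a minimal, self-contained sketch (not part of the repository) that restates `_get_next_suffix` and `_merge_mappings` as free functions under that assumption, so the rules can be checked in isolation:

```python
# Standalone restatement of the merge rules described above, assuming the
# suffix sequence is the ten heavenly stems named in the docstring.
SUFFIXES = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']

def next_suffix(value: str) -> str:
    # Advance an existing suffix, or append the first one if none is present.
    for i, suffix in enumerate(SUFFIXES):
        if value.endswith(suffix):
            return value[:-1] + SUFFIXES[(i + 1) % len(SUFFIXES)]
    return value + SUFFIXES[0]

def merge_mappings(existing: dict[str, str], new: dict[str, str]) -> dict[str, str]:
    result = existing.copy()
    used_values = set(result.values())
    for key, value in new.items():
        if key in result:            # rule 1: the first mapping for a key wins
            continue
        if value in used_values:     # rule 2: disambiguate duplicate masked values
            value = next_suffix(value)
        result[key] = value
        used_values.add(value)
    return result

# Example: two chunks both mask a different person as "张某";
# the second occurrence is disambiguated to "张某甲".
chunk1 = {"张三": "张某"}
chunk2 = {"张四": "张某", "张三": "张某A"}
print(merge_mappings(chunk1, chunk2))
# {'张三': '张某', '张四': '张某甲'}
```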
View File

@ -1,6 +0,0 @@
from document_handlers.processors.txt_processor import TxtDocumentProcessor
from document_handlers.processors.docx_processor import DocxDocumentProcessor
from document_handlers.processors.pdf_processor import PdfDocumentProcessor
from document_handlers.processors.md_processor import MarkdownDocumentProcessor
__all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']

View File

@ -1,105 +0,0 @@
import os
import PyPDF2
from document_handlers.document_processor import DocumentProcessor
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod
from prompts.masking_prompts import get_masking_prompt, get_masking_mapping_prompt
import logging
from services.ollama_client import OllamaClient
from config.settings import settings
logger = logging.getLogger(__name__)
class PdfDocumentProcessor(DocumentProcessor):
def __init__(self, input_path: str, output_path: str):
super().__init__() # Call parent class's __init__
self.input_path = input_path
self.output_path = output_path
self.output_dir = os.path.dirname(output_path)
self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]
# Setup output directories
self.local_image_dir = os.path.join(self.output_dir, "images")
self.image_dir = os.path.basename(self.local_image_dir)
os.makedirs(self.local_image_dir, exist_ok=True)
# Setup work directory under output directory
self.work_dir = os.path.join(
os.path.dirname(output_path),
".work",
os.path.splitext(os.path.basename(input_path))[0]
)
os.makedirs(self.work_dir, exist_ok=True)
self.work_local_image_dir = os.path.join(self.work_dir, "images")
self.work_image_dir = os.path.basename(self.work_local_image_dir)
os.makedirs(self.work_local_image_dir, exist_ok=True)
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
def read_content(self) -> str:
logger.info("Starting PDF content processing")
# Read the PDF file
with open(self.input_path, 'rb') as file:
content = file.read()
# Initialize writers
image_writer = FileBasedDataWriter(self.work_local_image_dir)
md_writer = FileBasedDataWriter(self.work_dir)
# Create Dataset Instance
ds = PymuDocDataset(content)
logger.info("Classifying PDF type: %s", ds.classify())
# Process based on PDF type
if ds.classify() == SupportedPdfParseMethod.OCR:
infer_result = ds.apply(doc_analyze, ocr=True)
pipe_result = infer_result.pipe_ocr_mode(image_writer)
else:
infer_result = ds.apply(doc_analyze, ocr=False)
pipe_result = infer_result.pipe_txt_mode(image_writer)
logger.info("Generating all outputs")
# Generate all outputs
infer_result.draw_model(os.path.join(self.work_dir, f"{self.name_without_suff}_model.pdf"))
model_inference_result = infer_result.get_infer_res()
pipe_result.draw_layout(os.path.join(self.work_dir, f"{self.name_without_suff}_layout.pdf"))
pipe_result.draw_span(os.path.join(self.work_dir, f"{self.name_without_suff}_spans.pdf"))
md_content = pipe_result.get_markdown(self.work_image_dir)
pipe_result.dump_md(md_writer, f"{self.name_without_suff}.md", self.work_image_dir)
content_list = pipe_result.get_content_list(self.work_image_dir)
pipe_result.dump_content_list(md_writer, f"{self.name_without_suff}_content_list.json", self.work_image_dir)
middle_json = pipe_result.get_middle_json()
pipe_result.dump_middle_json(md_writer, f'{self.name_without_suff}_middle.json')
return md_content
# def process_content(self, content: str) -> str:
# logger.info("Starting content masking process")
# sentences = content.split("。")
# final_md = ""
# for sentence in sentences:
# if not sentence.strip(): # Skip empty sentences
# continue
# formatted_prompt = get_masking_mapping_prompt(sentence)
# logger.info("Calling ollama to generate response, prompt: %s", formatted_prompt)
# response = self.ollama_client.generate(formatted_prompt)
# logger.info(f"Response generated: {response}")
# final_md += response + "。"
# return final_md
def save_content(self, content: str) -> None:
# Ensure output path has .md extension
output_dir = os.path.dirname(self.output_path)
base_name = os.path.splitext(os.path.basename(self.output_path))[0]
md_output_path = os.path.join(output_dir, f"{base_name}.md")
logger.info(f"Saving masked content to: {md_output_path}")
with open(md_output_path, 'w', encoding='utf-8') as file:
file.write(content)

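For orientation, a minimal sketch of the lifecycle this processor shares with the txt/docx/md processors, assuming it is driven directly rather than through the service layer (the paths below are hypothetical):

```python
# Illustrative only; in the repository DocumentService/FileMonitor choose
# the processor and drive these three steps.
from document_handlers.processors.pdf_processor import PdfDocumentProcessor

processor = PdfDocumentProcessor(
    input_path="data/doc_src/sample.pdf",        # hypothetical input path
    output_path="data/doc_dest/sample.pdf",      # hypothetical output path
)
raw_markdown = processor.read_content()           # PDF -> markdown via magic-pdf
masked = processor.process_content(raw_markdown)  # chunk, build mapping, mask
processor.save_content(masked)                    # written as sample.md
```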
View File

@ -1,22 +0,0 @@
from config.logging_config import setup_logging
def main():
# Setup logging first
setup_logging()
from services.file_monitor import FileMonitor
from config.settings import settings
import logging
logger = logging.getLogger(__name__)
logger.info("Starting the application")
logger.info(f"Monitoring directory: {settings.OBJECT_STORAGE_PATH}")
logger.info(f"Target directory: {settings.TARGET_DIRECTORY_PATH}")
# Initialize the file monitor
file_monitor = FileMonitor(settings.OBJECT_STORAGE_PATH, settings.TARGET_DIRECTORY_PATH)
# Start monitoring the directory for new files
file_monitor.start_monitoring()
if __name__ == "__main__":
main()

View File

@ -1,81 +0,0 @@
import textwrap
def get_masking_prompt(text: str) -> str:
"""
Returns the prompt for masking sensitive information in legal documents.
Args:
text (str): The input text to be masked
Returns:
str: The formatted prompt with the input text
"""
prompt = textwrap.dedent("""
您是一位专业的法律文档脱敏专家请按照以下规则对文本进行脱敏处理
规则
1. 人名
- 两字名改为"姓+某"张三 张某
- 三字名改为"姓+某某"张三丰 张某某
2. 公司名
- 保留地理位置信息北京上海等
- 保留公司类型有限公司股份公司等
- ""替换核心名称
3. 保持原文其他部分不变
4. 确保脱敏后的文本保持原有的语言流畅性和可读性
输入文本
{text}
请直接输出脱敏后的文本无需解释或其他备注
""")
return prompt.format(text=text)
def get_masking_mapping_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original names/companies to their masked versions.
Args:
text (str): The input text to be analyzed for masking
Returns:
str: The formatted prompt that will generate a mapping dictionary
"""
prompt = textwrap.dedent("""
您是一位专业的法律文档脱敏专家请分析文本并生成一个脱敏映射表遵循以下规则
规则
1. 人名映射规则
- 对于同一姓氏的不同人名使用字母区分
* 第一个出现的用"姓+某"张三 张某
* 第二个出现的用"姓+某A"张四 张某A
* 第三个出现的用"姓+某B"张五 张某B
依此类推
- 三字名同样遵循此规则张三丰 张某某张四海 张某某A
2. 公司名映射规则
- 保留地理位置信息北京上海等
- 保留公司类型有限公司股份公司等
- ""替换核心名称,但保留首尾字(北京智慧科技有限公司 北京智某科技有限公司)
- 对于多个相似公司名使用字母区分
北京智慧科技有限公司 北京某科技有限公司
北京智能科技有限公司 北京某科技有限公司A
3. 公权机关不做脱敏处理公安局法院检察院中国人民银行银监会及其他未列明的公权机关
请分析以下文本并生成一个JSON格式的映射表包含所有需要脱敏的名称及其对应的脱敏后的形式
{text}
请直接输出JSON格式的映射表格式如下
{{
"原文1": "脱敏后1",
"原文2": "脱敏后2",
...
}}
如无需要输出的映射请输出空json如下:
{{}}
""")
return prompt.format(text=text)

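To make the expected model output concrete, here is an illustrative mapping in the shape the second prompt asks for, applied with plain string replacement the way `DocumentProcessor._apply_mapping` does (the names are invented examples, not data from the repository):

```python
# Sketch only: the kind of JSON mapping the prompt is expected to return,
# and how it is applied by straightforward substring replacement.
example_mapping = {
    "张三": "张某",
    "北京智慧科技有限公司": "北京智某科技有限公司",
}

text = "张三是北京智慧科技有限公司的法定代表人。"
for original, masked in example_mapping.items():
    text = text.replace(original, masked)
print(text)  # 张某是北京智某科技有限公司的法定代表人。
```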
View File

@ -1,54 +0,0 @@
import logging
import os
from services.document_service import DocumentService
from services.ollama_client import OllamaClient
from config.settings import settings
logger = logging.getLogger(__name__)
class FileMonitor:
def __init__(self, input_directory: str, output_directory: str):
self.input_directory = input_directory
self.output_directory = output_directory
# Create OllamaClient instance using settings
ollama_client = OllamaClient(
model_name=settings.OLLAMA_MODEL,
base_url=settings.OLLAMA_API_URL
)
# Inject OllamaClient into DocumentService
self.document_service = DocumentService(ollama_client=ollama_client)
def process_new_file(self, file_path: str) -> None:
try:
# Get the filename without directory path
filename = os.path.basename(file_path)
# Create output path
output_path = os.path.join(self.output_directory, filename)
logger.info(f"Processing file: {filename}")
# Process the document using document service
self.document_service.process_document(file_path, output_path)
logger.info(f"File processed successfully: {filename}")
except Exception as e:
logger.error(f"Error processing file {file_path}: {str(e)}")
def start_monitoring(self):
import time
# Ensure output directory exists
os.makedirs(self.output_directory, exist_ok=True)
already_seen = set(os.listdir(self.input_directory))
while True:
time.sleep(1) # Check every second
current_files = set(os.listdir(self.input_directory))
new_files = current_files - already_seen
for new_file in new_files:
file_path = os.path.join(self.input_directory, new_file)
logger.info(f"New file found: {new_file}")
self.process_new_file(file_path)
already_seen = current_files