Initial commit

tigermren 2025-07-20 21:51:39 +08:00
parent 0904ab5073
commit 7233176ab9
88 changed files with 21937 additions and 321 deletions


@ -1,19 +0,0 @@
# Storage paths
OBJECT_STORAGE_PATH=/path/to/mounted/object/storage
TARGET_DIRECTORY_PATH=/path/to/target/directory
# Ollama API Configuration
OLLAMA_API_URL=https://api.ollama.com
OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=llama2
# Application Settings
MONITOR_INTERVAL=5
# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log
# Optional: Additional security settings
# MAX_FILE_SIZE=10485760 # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf

.gitignore vendored Normal file

@ -0,0 +1,76 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual Environment
venv/
env/
ENV/
.env
# IDE
.idea/
.vscode/
*.swp
*.swo
.DS_Store
# Logs
*.log
logs/
# Testing
.coverage
.pytest_cache/
htmlcov/
.tox/
# Project specific
target_folder/
output/
temp/
# Jupyter Notebook
.ipynb_checkpoints
# mypy
.mypy_cache/
# Distribution / packaging
.Python
*.pyc
# Local development settings
.env.local
.env.development.local
.env.test.local
.env.production.local
src_folder
target_folder
app.log
__pycache__
data/doc_dest
data/doc_src
data/doc_intermediate
node_modules
backend/storage/

DOCKER_COMPOSE_README.md Normal file

@ -0,0 +1,206 @@
# Unified Docker Compose Setup
This project now includes a unified Docker Compose configuration that allows all services (mineru, backend, frontend) to run together and communicate using service names.
## Architecture
The unified setup includes the following services:
- **mineru-api**: Document processing service (port 8001)
- **backend-api**: Main API service (port 8000)
- **celery-worker**: Background task processor
- **redis**: Message broker for Celery
- **frontend**: React frontend application (port 3000)
## Network Configuration
All services are connected through a custom bridge network called `app-network`, allowing them to communicate using service names:
- Backend → Mineru: `http://mineru-api:8000`
- Frontend → Backend: `http://localhost:8000/api/v1` (external access)
- Backend → Redis: `redis://redis:6379/0`
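A quick way to confirm that service names resolve on `app-network` is to call the mineru health endpoint from inside the backend container. This is a minimal sketch: it assumes `requests` is installed in the backend image and that mineru exposes a `/health` path (used by the health check below).
```python
# Run inside the backend-api container, e.g.:
#   docker-compose exec backend-api python
import requests

# "mineru-api" resolves through the shared app-network bridge.
resp = requests.get("http://mineru-api:8000/health", timeout=5)
print(resp.status_code, resp.text[:200])
```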
## Usage
### Starting all services
```bash
# From the root directory
docker-compose up -d
```
### Starting specific services
```bash
# Start only backend and mineru
docker-compose up -d backend-api mineru-api redis
# Start only frontend and backend
docker-compose up -d frontend backend-api redis
```
### Stopping services
```bash
# Stop all services
docker-compose down
# Stop and remove volumes
docker-compose down -v
```
### Viewing logs
```bash
# View all logs
docker-compose logs -f
# View specific service logs
docker-compose logs -f backend-api
docker-compose logs -f mineru-api
docker-compose logs -f frontend
```
## Building Services
### Building all services
```bash
# Build all services
docker-compose build
# Build and start all services
docker-compose up -d --build
```
### Building individual services
```bash
# Build only backend
docker-compose build backend-api
# Build only frontend
docker-compose build frontend
# Build only mineru
docker-compose build mineru-api
# Build multiple specific services
docker-compose build backend-api frontend
```
### Building and restarting specific services
```bash
# Build and restart only backend
docker-compose build backend-api
docker-compose up -d backend-api
# Or combine in one command
docker-compose up -d --build backend-api
# Build and restart backend and celery worker
docker-compose up -d --build backend-api celery-worker
```
### Force rebuild (no cache)
```bash
# Force rebuild all services
docker-compose build --no-cache
# Force rebuild specific service
docker-compose build --no-cache backend-api
```
## Environment Variables
The unified setup uses environment variables from the individual service `.env` files:
- `./backend/.env` - Backend configuration
- `./frontend/.env` - Frontend configuration
- `./mineru/.env` - Mineru configuration (if exists)
### Key Configuration Changes
1. **Backend Configuration** (`backend/app/core/config.py`):
```python
MINERU_API_URL: str = "http://mineru-api:8000"
```
2. **Frontend Configuration**:
```javascript
REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
```
## Service Dependencies
- `backend-api` depends on `redis` and `mineru-api`
- `celery-worker` depends on `redis` and `backend-api`
- `frontend` depends on `backend-api`
## Port Mapping
- **Frontend**: `http://localhost:3000`
- **Backend API**: `http://localhost:8000`
- **Mineru API**: `http://localhost:8001`
- **Redis**: `localhost:6379`
## Health Checks
The mineru-api service includes a health check that verifies the service is running properly.
## Development vs Production
For development, you can still use the individual docker-compose files in each service directory. The unified setup is ideal for:
- Production deployments
- End-to-end testing
- Simplified development environment
## Troubleshooting
### Service Communication Issues
If services can't communicate:
1. Check if all services are running: `docker-compose ps`
2. Verify network connectivity: `docker network ls`
3. Check service logs: `docker-compose logs [service-name]`
### Port Conflicts
If you get port conflicts, you can modify the port mappings in the `docker-compose.yml` file:
```yaml
ports:
- "8002:8000" # Change external port
```
### Volume Issues
Make sure the storage directories exist:
```bash
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
```
## Migration from Individual Compose Files
If you were previously using individual docker-compose files:
1. Stop all individual services:
```bash
cd backend && docker-compose down
cd ../frontend && docker-compose down
cd ../mineru && docker-compose down
```
2. Start the unified setup:
```bash
cd .. && docker-compose up -d
```
The unified setup maintains the same functionality while providing better service discovery and networking.

DOCKER_MIGRATION_GUIDE.md Normal file

@ -0,0 +1,399 @@
# Docker Image Migration Guide
This guide explains how to export your built Docker images, transfer them to another environment, and run them without rebuilding.
## Overview
The migration process involves:
1. **Export**: Save built images to tar files
2. **Transfer**: Copy tar files to target environment
3. **Import**: Load images on target environment
4. **Run**: Start services with imported images
## Prerequisites
### Source Environment (where images are built)
- Docker installed and running
- All services built and working
- Sufficient disk space for image export
### Target Environment (where images will run)
- Docker installed and running
- Sufficient disk space for image import
- Network access to source environment (or USB drive)
## Step 1: Export Docker Images
### 1.1 List Current Images
First, check what images you have:
```bash
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
```
You should see images like:
- `legal-doc-masker-backend-api`
- `legal-doc-masker-frontend`
- `legal-doc-masker-mineru-api`
- `redis:alpine`
### 1.2 Export Individual Images
Create a directory for exports:
```bash
mkdir -p docker-images-export
cd docker-images-export
```
Export each image:
```bash
# Export backend image
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
# Export frontend image
docker save legal-doc-masker-frontend:latest -o frontend.tar
# Export mineru image
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
# Export redis image (if not using official)
docker save redis:alpine -o redis.tar
```
### 1.3 Export All Images at Once (Alternative)
If you want to export all images in one command:
```bash
# Export all project images
docker save \
legal-doc-masker-backend-api:latest \
legal-doc-masker-frontend:latest \
legal-doc-masker-mineru-api:latest \
redis:alpine \
-o legal-doc-masker-all.tar
```
### 1.4 Verify Export Files
Check the exported files:
```bash
ls -lh *.tar
```
You should see files like:
- `backend-api.tar` (~200-500MB)
- `frontend.tar` (~100-300MB)
- `mineru-api.tar` (~1-3GB)
- `redis.tar` (~30-50MB)
## Step 2: Transfer Images
### 2.1 Transfer via Network (SCP/RSYNC)
```bash
# Transfer to remote server
scp *.tar user@remote-server:/path/to/destination/
# Or using rsync (more efficient for large files)
rsync -avz --progress *.tar user@remote-server:/path/to/destination/
```
### 2.2 Transfer via USB Drive
```bash
# Copy to USB drive
cp *.tar /Volumes/USB_DRIVE/docker-images/
# Or create a compressed archive
tar -czf legal-doc-masker-images.tar.gz *.tar
cp legal-doc-masker-images.tar.gz /Volumes/USB_DRIVE/
```
### 2.3 Transfer via Cloud Storage
```bash
# Upload to cloud storage (example with AWS S3)
aws s3 cp *.tar s3://your-bucket/docker-images/
# Or using Google Cloud Storage
gsutil cp *.tar gs://your-bucket/docker-images/
```
## Step 3: Import Images on Target Environment
### 3.1 Prepare Target Environment
```bash
# Create directory for images
mkdir -p docker-images-import
cd docker-images-import
# Copy images from transfer method
# (SCP, USB, or download from cloud storage)
```
### 3.2 Import Individual Images
```bash
# Import backend image
docker load -i backend-api.tar
# Import frontend image
docker load -i frontend.tar
# Import mineru image
docker load -i mineru-api.tar
# Import redis image
docker load -i redis.tar
```
### 3.3 Import All Images at Once (if exported together)
```bash
docker load -i legal-doc-masker-all.tar
```
### 3.4 Verify Imported Images
```bash
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
```
## Step 4: Prepare Target Environment
### 4.1 Copy Project Files
Transfer the following files to the target environment:
```bash
# Essential files to copy
docker-compose.yml
DOCKER_COMPOSE_README.md
setup-unified-docker.sh
# Environment files (if they exist)
backend/.env
frontend/.env
mineru/.env
# Storage directories (if you want to preserve data)
backend/storage/
mineru/storage/
backend/legal_doc_masker.db
```
### 4.2 Create Directory Structure
```bash
# Create necessary directories
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
```
## Step 5: Run Services
### 5.1 Start All Services
```bash
# Start all services using imported images
docker-compose up -d
```
### 5.2 Verify Services
```bash
# Check service status
docker-compose ps
# Check service logs
docker-compose logs -f
```
### 5.3 Test Endpoints
```bash
# Test frontend
curl -I http://localhost:3000
# Test backend API
curl -I http://localhost:8000/api/v1
# Test mineru API
curl -I http://localhost:8001/health
```
## Automation Scripts
### Export Script
Create `export-images.sh`:
```bash
#!/bin/bash
set -e
echo "🚀 Exporting Docker Images"
# Create export directory
mkdir -p docker-images-export
cd docker-images-export
# Export images
echo "📦 Exporting backend-api image..."
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
echo "📦 Exporting frontend image..."
docker save legal-doc-masker-frontend:latest -o frontend.tar
echo "📦 Exporting mineru-api image..."
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
echo "📦 Exporting redis image..."
docker save redis:alpine -o redis.tar
# Show file sizes
echo "📊 Export complete. File sizes:"
ls -lh *.tar
echo "✅ Images exported successfully!"
```
### Import Script
Create `import-images.sh`:
```bash
#!/bin/bash
set -e
echo "🚀 Importing Docker Images"
# Check if tar files exist
if [ ! -f "backend-api.tar" ]; then
echo "❌ backend-api.tar not found"
exit 1
fi
# Import images
echo "📦 Importing backend-api image..."
docker load -i backend-api.tar
echo "📦 Importing frontend image..."
docker load -i frontend.tar
echo "📦 Importing mineru-api image..."
docker load -i mineru-api.tar
echo "📦 Importing redis image..."
docker load -i redis.tar
# Verify imports
echo "📊 Imported images:"
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker
echo "✅ Images imported successfully!"
```
## Troubleshooting
### Common Issues
1. **Image not found during import**
```bash
# Check if image exists
docker images | grep image-name
# Re-export if needed
docker save image-name:tag -o image-name.tar
```
2. **Port conflicts on target environment**
```bash
# Check what's using the ports
lsof -i :8000
lsof -i :8001
lsof -i :3000
# Modify docker-compose.yml if needed
ports:
- "8002:8000" # Change external port
```
3. **Permission issues**
```bash
# Fix file permissions
chmod +x setup-unified-docker.sh
chmod +x export-images.sh
chmod +x import-images.sh
```
4. **Storage directory issues**
```bash
# Create directories with proper permissions
sudo mkdir -p backend/storage
sudo mkdir -p mineru/storage/uploads
sudo mkdir -p mineru/storage/processed
sudo chown -R $USER:$USER backend/storage mineru/storage
```
### Performance Optimization
1. **Compress images for transfer**
```bash
# Compress before transfer
gzip *.tar
# Decompress on target
gunzip *.tar.gz
```
2. **Use parallel transfer**
```bash
# Transfer multiple files in parallel
parallel scp {} user@server:/path/ ::: *.tar
```
3. **Use Docker registry (alternative)**
```bash
# Push to registry
docker tag legal-doc-masker-backend-api:latest your-registry/backend-api:latest
docker push your-registry/backend-api:latest
# Pull on target
docker pull your-registry/backend-api:latest
```
## Complete Migration Checklist
- [ ] Export all Docker images
- [ ] Transfer image files to target environment
- [ ] Transfer project configuration files
- [ ] Import images on target environment
- [ ] Create necessary directories
- [ ] Start services
- [ ] Verify all services are running
- [ ] Test all endpoints
- [ ] Update any environment-specific configurations
## Security Considerations
1. **Secure transfer**: Use encrypted transfer methods (SCP, SFTP)
2. **Image verification**: Verify image integrity after transfer (see the checksum sketch after this list)
3. **Environment isolation**: Ensure target environment is properly secured
4. **Access control**: Limit access to Docker daemon on target environment
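For point 2, a checksum comparison before and after transfer is a simple integrity check. The standard `sha256sum` tool works; the Python sketch below does the same thing and can be run unchanged in the export/import directory on both environments.
```python
import hashlib
import pathlib

# Hash the exported tar files in chunks (they can be several GB).
# Run this on both environments and diff the printed output.
for tar in sorted(pathlib.Path(".").glob("*.tar")):
    digest = hashlib.sha256()
    with tar.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    print(f"{digest.hexdigest()}  {tar.name}")
```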
## Cost Optimization
1. **Image size**: Remove unnecessary layers before export
2. **Compression**: Use compression for large images
3. **Selective transfer**: Only transfer images you need
4. **Cleanup**: Remove old images after successful migration


@ -1,48 +0,0 @@
# Build stage
FROM python:3.12-slim AS builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements first to leverage Docker cache
COPY requirements.txt .
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt
# Final stage
FROM python:3.12-slim
WORKDIR /app
# Create non-root user
RUN useradd -m -r appuser && \
chown appuser:appuser /app
# Copy wheels from builder
COPY --from=builder /app/wheels /wheels
COPY --from=builder /app/requirements.txt .
# Install dependencies
RUN pip install --no-cache /wheels/*
# Copy application code
COPY src/ ./src/
# Create directories for mounted volumes
RUN mkdir -p /data/input /data/output && \
chown -R appuser:appuser /data
# Switch to non-root user
USER appuser
# Environment variables
ENV PYTHONPATH=/app \
OBJECT_STORAGE_PATH=/data/input \
TARGET_DIRECTORY_PATH=/data/output
# Run the application
CMD ["python", "src/main.py"]


@ -0,0 +1,178 @@
# Docker Migration Quick Reference
## 🚀 Quick Migration Process
### Source Environment (Export)
```bash
# 1. Build images first (if not already built)
docker-compose build
# 2. Export all images
./export-images.sh
# 3. Transfer files to target environment
# Option A: SCP
scp -r docker-images-export-*/ user@target-server:/path/to/destination/
# Option B: USB Drive
cp -r docker-images-export-*/ /Volumes/USB_DRIVE/
# Option C: Compressed archive
scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/
```
### Target Environment (Import)
```bash
# 1. Copy project files
scp docker-compose.yml user@target-server:/path/to/destination/
scp DOCKER_COMPOSE_README.md user@target-server:/path/to/destination/
# 2. Import images
./import-images.sh
# 3. Start services
docker-compose up -d
# 4. Verify
docker-compose ps
```
## 📋 Essential Files to Transfer
### Required Files
- `docker-compose.yml` - Unified compose configuration
- `DOCKER_COMPOSE_README.md` - Documentation
- `backend/.env` - Backend environment variables
- `frontend/.env` - Frontend environment variables
- `mineru/.env` - Mineru environment variables (if exists)
### Optional Files (for data preservation)
- `backend/storage/` - Backend storage directory
- `mineru/storage/` - Mineru storage directory
- `backend/legal_doc_masker.db` - Database file
## 🔧 Common Commands
### Export Commands
```bash
# Manual export
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
docker save legal-doc-masker-frontend:latest -o frontend.tar
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
docker save redis:alpine -o redis.tar
# Compress for transfer
tar -czf legal-doc-masker-images.tar.gz *.tar
```
### Import Commands
```bash
# Manual import
docker load -i backend-api.tar
docker load -i frontend.tar
docker load -i mineru-api.tar
docker load -i redis.tar
# Extract compressed archive
tar -xzf legal-doc-masker-images.tar.gz
```
### Service Management
```bash
# Start all services
docker-compose up -d
# Stop all services
docker-compose down
# View logs
docker-compose logs -f [service-name]
# Check status
docker-compose ps
```
### Building Individual Services
```bash
# Build specific service only
docker-compose build backend-api
docker-compose build frontend
docker-compose build mineru-api
# Build and restart specific service
docker-compose up -d --build backend-api
# Force rebuild (no cache)
docker-compose build --no-cache backend-api
# Using the build script
./build-service.sh backend-api --restart
./build-service.sh frontend --no-cache
./build-service.sh backend-api celery-worker
```
## 🌐 Service URLs
After successful migration:
- **Frontend**: http://localhost:3000
- **Backend API**: http://localhost:8000
- **Mineru API**: http://localhost:8001
## ⚠️ Troubleshooting
### Port Conflicts
```bash
# Check what's using ports
lsof -i :8000
lsof -i :8001
lsof -i :3000
# Modify docker-compose.yml if needed
ports:
- "8002:8000" # Change external port
```
### Permission Issues
```bash
# Fix script permissions
chmod +x export-images.sh
chmod +x import-images.sh
chmod +x setup-unified-docker.sh
# Fix directory permissions
sudo chown -R $USER:$USER backend/storage mineru/storage
```
### Disk Space Issues
```bash
# Check available space
df -h
# Clean up Docker
docker system prune -a
```
## 📊 Expected File Sizes
- `backend-api.tar`: ~200-500MB
- `frontend.tar`: ~100-300MB
- `mineru-api.tar`: ~1-3GB
- `redis.tar`: ~30-50MB
- `legal-doc-masker-images.tar.gz`: ~1-2GB (compressed)
## 🔒 Security Notes
1. Use encrypted transfer (SCP, SFTP) for sensitive environments
2. Verify image integrity after transfer
3. Update environment variables for target environment
4. Ensure proper network security on target environment
## 📞 Support
If you encounter issues:
1. Check the full `DOCKER_MIGRATION_GUIDE.md`
2. Verify all required files are present
3. Check Docker logs: `docker-compose logs -f`
4. Ensure sufficient disk space and permissions


@ -35,14 +35,20 @@ doc-processing-app
cd doc-processing-app
```
2. Install the required dependencies:
2. Install LibreOffice (required for document processing):
```
brew install libreoffice
```
3. Install the required dependencies:
```
pip install -r requirements.txt
pip install -U magic-pdf[full]
```
3. Configure the application by editing the `src/config/settings.py` file to set the paths for the object storage and target directory.
4. Configure the application by editing the `src/config/settings.py` file to set the paths for the object storage and target directory.
4. Create a `.env` file based on the `.env.example` file to set up necessary environment variables.
5. Create a `.env` file based on the `.env.example` file to set up necessary environment variables.
## Usage

View File

@ -1 +0,0 @@
2025-04-20 20:14:00 - services.file_monitor - INFO - monitor: new file found: README.md

backend/.env Normal file

@ -0,0 +1,20 @@
# Storage paths
OBJECT_STORAGE_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_src
TARGET_DIRECTORY_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_dest
INTERMEDIATE_DIR_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_intermediate
# Ollama API Configuration
OLLAMA_API_URL=http://192.168.2.245:11434
# OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=qwen3:8b
# Application Settings
MONITOR_INTERVAL=5
# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log
# Optional: Additional security settings
# MAX_FILE_SIZE=10485760 # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf

backend/Dockerfile Normal file

@ -0,0 +1,36 @@
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
libreoffice \
wget \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements first to leverage Docker cache
COPY requirements.txt .
# RUN pip install huggingface_hub
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py
# RUN python download_models_hf.py
RUN pip install --no-cache-dir -r requirements.txt
# RUN pip install -U magic-pdf[full]
# Copy the rest of the application
COPY . .
# Create storage directories
RUN mkdir -p storage/uploads storage/processed
# Expose the port the app runs on
EXPOSE 8000
# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]


@ -0,0 +1,202 @@
# PDF Processor with Mineru API
## Overview
The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.
## Changes Made
### 1. Removed Dependencies
- Removed all `magic_pdf` imports and dependencies
- Removed `PyPDF2` direct usage (though kept in requirements for potential other uses)
### 2. New Implementation
- **REST API Integration**: Uses HTTP requests to call Mineru's API
- **Configurable Settings**: Mineru API URL and timeout are configurable
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
- **Flexible Response Parsing**: Handles multiple possible response formats from Mineru API
### 3. Configuration
Add the following settings to your environment or `.env` file:
```bash
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
```
### 4. API Endpoint
The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.
#### Expected Request Format:
```
POST /file_parse
Content-Type: multipart/form-data
files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
```
#### Expected Response Format:
The processor can handle multiple response formats:
```json
{
"markdown": "# Document Title\n\nContent here..."
}
```
OR
```json
{
"md": "# Document Title\n\nContent here..."
}
```
OR
```json
{
"content": "# Document Title\n\nContent here..."
}
```
OR
```json
{
"result": {
"markdown": "# Document Title\n\nContent here..."
}
}
```
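For reference, a minimal client call that matches the request and response formats above might look like the sketch below. It is only an illustration: the host/port, the exact multipart encoding of list and boolean form fields, and the set of optional parameters depend on your deployment and Mineru version.
```python
import os
import requests

MINERU_URL = "http://localhost:8001/file_parse"  # assumes the port mapping from the unified compose setup

def pdf_to_markdown(pdf_path: str, timeout: int = 300) -> str:
    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            MINERU_URL,
            files={"files": (os.path.basename(pdf_path), pdf, "application/pdf")},
            data={"backend": "pipeline", "parse_method": "auto", "return_md": True},
            timeout=timeout,
        )
    response.raise_for_status()
    payload = response.json()
    # Fall back through the response shapes listed above.
    for key in ("markdown", "md", "content"):
        if isinstance(payload, dict) and key in payload:
            return payload[key]
    if isinstance(payload, dict) and isinstance(payload.get("result"), dict):
        return payload["result"].get("markdown", "")
    return ""
```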
## Usage
### Basic Usage
```python
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor
# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")
# Read and convert PDF to markdown
content = processor.read_content()
# Process content (apply masking)
processed_content = processor.process_content(content)
# Save processed content
processor.save_content(processed_content)
```
### Through Document Service
```python
from app.core.services.document_service import DocumentService
service = DocumentService()
success = service.process_document("input.pdf", "output.md")
```
## Testing
Run the test script to verify the implementation:
```bash
cd backend
python test_pdf_processor.py
```
Make sure you have:
1. A sample PDF file in the `sample_doc/` directory
2. Mineru API service running and accessible
3. Proper network connectivity between services
## Error Handling
The processor handles various error scenarios:
- **Network Timeouts**: Configurable timeout (default: 5 minutes)
- **API Errors**: HTTP status code errors are logged and handled
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
- **File Operations**: Proper error handling for file reading/writing
## Logging
The processor provides detailed logging for debugging:
- API call attempts and responses
- Content extraction results
- Error conditions and stack traces
- Processing statistics
## Deployment
### Docker Compose
Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.
### Environment Variables
Set the following environment variables in your deployment:
```bash
MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300
```
## Troubleshooting
### Common Issues
1. **Connection Refused**: Check if Mineru service is running and accessible
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
3. **Empty Content**: Check Mineru API response format and logs
4. **Network Issues**: Verify network connectivity between services
### Debug Mode
Enable debug logging to see detailed API interactions:
```python
import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
```
## Migration from magic_pdf
If you were previously using magic_pdf:
1. **No Code Changes Required**: The interface remains the same
2. **Configuration Update**: Add Mineru API settings
3. **Service Dependencies**: Ensure Mineru service is running
4. **Testing**: Run the test script to verify functionality
## Performance Considerations
- **Timeout**: Large PDFs may require longer timeouts
- **Memory**: The processor loads the entire PDF into memory for API calls
- **Network**: API calls add network latency to processing time
- **Caching**: Consider implementing caching for frequently processed documents
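For the caching point above, one simple approach is to key cached markdown on a hash of the PDF bytes. This is a sketch only, not part of the current implementation; `convert` stands in for whatever function performs the Mineru call (e.g. a `pdf_to_markdown` helper).
```python
import hashlib
import pathlib

CACHE_DIR = pathlib.Path(".mineru_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_pdf_to_markdown(pdf_path: str, convert) -> str:
    """convert: a callable such as pdf_to_markdown(path) that calls the Mineru API."""
    digest = hashlib.sha256(pathlib.Path(pdf_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = convert(pdf_path)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```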

backend/README.md Normal file

@ -0,0 +1,103 @@
# Legal Document Masker API
This is the backend API for the Legal Document Masking system. It provides endpoints for file upload, processing status tracking, and file download.
## Prerequisites
- Python 3.8+
- Redis (for Celery)
## File Storage
Files are stored in the following structure:
```
backend/
├── storage/
│ ├── uploads/ # Original uploaded files
│ └── processed/ # Masked/processed files
```
## Setup
### Option 1: Local Development
1. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables:
Create a `.env` file in the backend directory with the following variables:
```env
SECRET_KEY=your-secret-key-here
```
The database (SQLite) will be automatically created when you first run the application.
4. Start Redis (required for Celery):
```bash
redis-server
```
5. Start Celery worker:
```bash
celery -A app.services.file_service worker --loglevel=info
```
6. Start the FastAPI server:
```bash
uvicorn app.main:app --reload
```
### Option 2: Docker Deployment
1. Build and start the services:
```bash
docker-compose up --build
```
This will start:
- FastAPI server on port 8000
- Celery worker for background processing
- Redis for task queue
## API Documentation
Once the server is running, you can access:
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`
## API Endpoints
- `POST /api/v1/files/upload` - Upload a new file
- `GET /api/v1/files` - List all files
- `GET /api/v1/files/{file_id}` - Get file details
- `GET /api/v1/files/{file_id}/download` - Download processed file
- `WS /api/v1/files/ws/status/{file_id}` - WebSocket for real-time status updates
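The endpoints above can be exercised with a short client, for example the sketch below. The response field names, status strings, and the use of polling instead of the WebSocket are assumptions, so adjust them to the actual schema.
```python
import time
import requests

BASE_URL = "http://localhost:8000/api/v1"

def mask_document(path: str, out_path: str) -> None:
    # Upload the document
    with open(path, "rb") as f:
        upload = requests.post(f"{BASE_URL}/files/upload", files={"file": f})
    upload.raise_for_status()
    file_id = upload.json()["id"]  # assumed field name

    # Poll the detail endpoint until processing finishes
    while True:
        status = requests.get(f"{BASE_URL}/files/{file_id}").json()["status"]
        if status in ("success", "failed"):  # assumed status values
            break
        time.sleep(2)

    # Download the processed markdown
    download = requests.get(f"{BASE_URL}/files/{file_id}/download")
    download.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(download.content)
```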
## Development
### Running Tests
```bash
pytest
```
### Code Style
The project uses Black for code formatting:
```bash
black .
```
### Docker Commands
- Start services: `docker-compose up`
- Start in background: `docker-compose up -d`
- Stop services: `docker-compose down`
- View logs: `docker-compose logs -f`
- Rebuild: `docker-compose up --build`


@ -0,0 +1,166 @@
from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, WebSocket, Response
from fastapi.responses import FileResponse
from sqlalchemy.orm import Session
from typing import List
import os
from ...core.config import settings
from ...core.database import get_db
from ...models.file import File as FileModel, FileStatus
from ...services.file_service import process_file, delete_file
from ...schemas.file import FileResponse as FileResponseSchema, FileList
import asyncio
from fastapi import WebSocketDisconnect
import uuid
router = APIRouter()
@router.post("/upload", response_model=FileResponseSchema)
async def upload_file(
file: UploadFile = File(...),
db: Session = Depends(get_db)
):
if not file.filename:
raise HTTPException(status_code=400, detail="No file provided")
if not any(file.filename.lower().endswith(ext) for ext in settings.ALLOWED_EXTENSIONS):
raise HTTPException(
status_code=400,
detail=f"File type not allowed. Allowed types: {', '.join(settings.ALLOWED_EXTENSIONS)}"
)
# Generate unique file ID
file_id = str(uuid.uuid4())
file_extension = os.path.splitext(file.filename)[1]
unique_filename = f"{file_id}{file_extension}"
# Save file with unique name
file_path = settings.UPLOAD_FOLDER / unique_filename
with open(file_path, "wb") as buffer:
content = await file.read()
buffer.write(content)
# Create database entry
db_file = FileModel(
id=file_id,
filename=file.filename,
original_path=str(file_path),
status=FileStatus.NOT_STARTED
)
db.add(db_file)
db.commit()
db.refresh(db_file)
# Start processing
process_file.delay(str(db_file.id))
return db_file
@router.get("/files", response_model=List[FileResponseSchema])
def list_files(
skip: int = 0,
limit: int = 100,
db: Session = Depends(get_db)
):
files = db.query(FileModel).offset(skip).limit(limit).all()
return files
@router.get("/files/{file_id}", response_model=FileResponseSchema)
def get_file(
file_id: str,
db: Session = Depends(get_db)
):
file = db.query(FileModel).filter(FileModel.id == file_id).first()
if not file:
raise HTTPException(status_code=404, detail="File not found")
return file
@router.get("/files/{file_id}/download")
async def download_file(
file_id: str,
db: Session = Depends(get_db)
):
print(f"=== DOWNLOAD REQUEST ===")
print(f"File ID: {file_id}")
file = db.query(FileModel).filter(FileModel.id == file_id).first()
if not file:
print(f"❌ File not found for ID: {file_id}")
raise HTTPException(status_code=404, detail="File not found")
print(f"✅ File found: {file.filename}")
print(f"File status: {file.status}")
print(f"Original path: {file.original_path}")
print(f"Processed path: {file.processed_path}")
if file.status != FileStatus.SUCCESS:
print(f"❌ File not ready for download. Status: {file.status}")
raise HTTPException(status_code=400, detail="File is not ready for download")
if not os.path.exists(file.processed_path):
print(f"❌ Processed file not found at: {file.processed_path}")
raise HTTPException(status_code=404, detail="Processed file not found")
print(f"✅ Processed file exists at: {file.processed_path}")
# Get the original filename without extension and add .md extension
original_filename = file.filename
filename_without_ext = os.path.splitext(original_filename)[0]
download_filename = f"{filename_without_ext}.md"
print(f"Original filename: {original_filename}")
print(f"Filename without extension: {filename_without_ext}")
print(f"Download filename: {download_filename}")
response = FileResponse(
path=file.processed_path,
filename=download_filename,
media_type="text/markdown"
)
print(f"Response headers: {dict(response.headers)}")
print(f"=== END DOWNLOAD REQUEST ===")
return response
@router.websocket("/ws/status/{file_id}")
async def websocket_endpoint(websocket: WebSocket, file_id: str, db: Session = Depends(get_db)):
await websocket.accept()
try:
while True:
file = db.query(FileModel).filter(FileModel.id == file_id).first()
if not file:
await websocket.send_json({"error": "File not found"})
break
await websocket.send_json({
"status": file.status,
"error": file.error_message
})
if file.status in [FileStatus.SUCCESS, FileStatus.FAILED]:
break
await asyncio.sleep(1)
except WebSocketDisconnect:
pass
@router.delete("/files/{file_id}")
async def delete_file_endpoint(
file_id: str,
db: Session = Depends(get_db)
):
"""
Delete a file and its associated records.
This will remove:
1. The database record
2. The original uploaded file
3. The processed markdown file (if it exists)
"""
try:
delete_file(file_id)
return {"message": "File deleted successfully"}
except HTTPException as e:
raise e
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))


@ -0,0 +1,65 @@
from pydantic_settings import BaseSettings
from typing import Optional
import os
from pathlib import Path
class Settings(BaseSettings):
# API Settings
API_V1_STR: str = "/api/v1"
PROJECT_NAME: str = "Legal Document Masker API"
# Security
SECRET_KEY: str = "your-secret-key-here" # Change in production
ACCESS_TOKEN_EXPIRE_MINUTES: int = 60 * 24 * 8 # 8 days
# Database
BASE_DIR: Path = Path(__file__).parent.parent.parent
DATABASE_URL: str = f"sqlite:///{BASE_DIR}/storage/legal_doc_masker.db"
# File Storage
UPLOAD_FOLDER: Path = BASE_DIR / "storage" / "uploads"
PROCESSED_FOLDER: Path = BASE_DIR / "storage" / "processed"
MAX_FILE_SIZE: int = 50 * 1024 * 1024 # 50MB
ALLOWED_EXTENSIONS: set = {"pdf", "docx", "doc", "md"}
# Celery
CELERY_BROKER_URL: str = "redis://redis:6379/0"
CELERY_RESULT_BACKEND: str = "redis://redis:6379/0"
# Ollama API settings
OLLAMA_API_URL: str = "https://api.ollama.com"
OLLAMA_API_KEY: str = ""
OLLAMA_MODEL: str = "llama2"
# Mineru API settings
MINERU_API_URL: str = "http://mineru-api:8000"
# MINERU_API_URL: str = "http://host.docker.internal:8001"
MINERU_TIMEOUT: int = 300 # 5 minutes timeout
MINERU_LANG_LIST: list = ["ch"] # Language list for parsing
MINERU_BACKEND: str = "pipeline" # Backend to use
MINERU_PARSE_METHOD: str = "auto" # Parse method
MINERU_FORMULA_ENABLE: bool = True # Enable formula parsing
MINERU_TABLE_ENABLE: bool = True # Enable table parsing
# Logging settings
LOG_LEVEL: str = "INFO"
LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
LOG_FILE: str = "app.log"
class Config:
case_sensitive = True
env_file = ".env"
env_file_encoding = "utf-8"
extra = "allow"
def __init__(self, **kwargs):
super().__init__(**kwargs)
# Create storage directories if they don't exist
self.UPLOAD_FOLDER.mkdir(parents=True, exist_ok=True)
self.PROCESSED_FOLDER.mkdir(parents=True, exist_ok=True)
# Create storage directory for database
(self.BASE_DIR / "storage").mkdir(parents=True, exist_ok=True)
settings = Settings()


@ -1,5 +1,6 @@
import logging.config
from config.settings import settings
# from config.settings import settings
from .settings import settings
LOGGING_CONFIG = {
"version": 1,


@ -0,0 +1,21 @@
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from .config import settings
# Create SQLite engine with check_same_thread=False for FastAPI
engine = create_engine(
settings.DATABASE_URL,
connect_args={"check_same_thread": False}
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()
# Dependency
def get_db():
db = SessionLocal()
try:
yield db
finally:
db.close()


@ -1,10 +1,11 @@
import os
from typing import Optional
from models.document_processor import DocumentProcessor
from models.processors import (
from .document_processor import DocumentProcessor
from .processors import (
TxtDocumentProcessor,
DocxDocumentProcessor,
PdfDocumentProcessor
# DocxDocumentProcessor,
PdfDocumentProcessor,
MarkdownDocumentProcessor
)
class DocumentProcessorFactory:
@ -14,9 +15,11 @@ class DocumentProcessorFactory:
processors = {
'.txt': TxtDocumentProcessor,
'.docx': DocxDocumentProcessor,
'.doc': DocxDocumentProcessor,
'.pdf': PdfDocumentProcessor
# '.docx': DocxDocumentProcessor,
# '.doc': DocxDocumentProcessor,
'.pdf': PdfDocumentProcessor,
'.md': MarkdownDocumentProcessor,
'.markdown': MarkdownDocumentProcessor
}
processor_class = processors.get(file_extension)


@ -0,0 +1,71 @@
from abc import ABC, abstractmethod
from typing import Any, Dict
import logging
from .ner_processor import NerProcessor
logger = logging.getLogger(__name__)
class DocumentProcessor(ABC):
def __init__(self):
self.max_chunk_size = 1000 # Maximum number of characters per chunk
self.ner_processor = NerProcessor()
@abstractmethod
def read_content(self) -> str:
"""Read document content"""
pass
def _split_into_chunks(self, sentences: list[str]) -> list[str]:
"""Split sentences into chunks that don't exceed max_chunk_size"""
chunks = []
current_chunk = ""
for sentence in sentences:
if not sentence.strip():
continue
if len(current_chunk) + len(sentence) > self.max_chunk_size and current_chunk:
chunks.append(current_chunk)
current_chunk = sentence
else:
if current_chunk:
current_chunk += "" + sentence
else:
current_chunk = sentence
if current_chunk:
chunks.append(current_chunk)
logger.info(f"Split content into {len(chunks)} chunks")
return chunks
def _apply_mapping(self, text: str, mapping: Dict[str, str]) -> str:
"""Apply the mapping to replace sensitive information"""
masked_text = text
for original, masked in mapping.items():
if isinstance(masked, dict):
masked = next(iter(masked.values()), "")
elif not isinstance(masked, str):
masked = str(masked) if masked is not None else ""
masked_text = masked_text.replace(original, masked)
return masked_text
def process_content(self, content: str) -> str:
"""Process document content by masking sensitive information"""
sentences = content.split("。")
chunks = self._split_into_chunks(sentences)
logger.info(f"Split content into {len(chunks)} chunks")
final_mapping = self.ner_processor.process(chunks)
masked_content = self._apply_mapping(content, final_mapping)
logger.info("Successfully masked content")
return masked_content
@abstractmethod
def save_content(self, content: str) -> None:
"""Save processed content"""
pass


@ -0,0 +1,305 @@
from typing import Any, Dict
from ..prompts.masking_prompts import get_ner_name_prompt, get_ner_company_prompt, get_ner_address_prompt, get_ner_project_prompt, get_ner_case_number_prompt, get_entity_linkage_prompt
import logging
import json
from ..services.ollama_client import OllamaClient
from ...core.config import settings
from ..utils.json_extractor import LLMJsonExtractor
from ..utils.llm_validator import LLMResponseValidator
import re
from .regs.entity_regex import extract_id_number_entities, extract_social_credit_code_entities
logger = logging.getLogger(__name__)
class NerProcessor:
def __init__(self):
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
self.max_retries = 3
def _validate_mapping_format(self, mapping: Dict[str, Any]) -> bool:
return LLMResponseValidator.validate_entity_extraction(mapping)
def _process_entity_type(self, chunk: str, prompt_func, entity_type: str) -> Dict[str, str]:
for attempt in range(self.max_retries):
try:
formatted_prompt = prompt_func(chunk)
logger.info(f"Calling ollama to generate {entity_type} mapping for chunk (attempt {attempt + 1}/{self.max_retries}): {formatted_prompt}")
response = self.ollama_client.generate(formatted_prompt)
logger.info(f"Raw response from LLM: {response}")
mapping = LLMJsonExtractor.parse_raw_json_str(response)
logger.info(f"Parsed mapping: {mapping}")
if mapping and self._validate_mapping_format(mapping):
return mapping
else:
logger.warning(f"Invalid mapping format received on attempt {attempt + 1}, retrying...")
except Exception as e:
logger.error(f"Error generating {entity_type} mapping on attempt {attempt + 1}: {e}")
if attempt < self.max_retries - 1:
logger.info("Retrying...")
else:
logger.error(f"Max retries reached for {entity_type}, returning empty mapping")
return {}
def build_mapping(self, chunk: str) -> list[Dict[str, str]]:
mapping_pipeline = []
entity_configs = [
(get_ner_name_prompt, "people names"),
(get_ner_company_prompt, "company names"),
(get_ner_address_prompt, "addresses"),
(get_ner_project_prompt, "project names"),
(get_ner_case_number_prompt, "case numbers")
]
for prompt_func, entity_type in entity_configs:
mapping = self._process_entity_type(chunk, prompt_func, entity_type)
if mapping:
mapping_pipeline.append(mapping)
regex_entity_extractors = [
extract_id_number_entities,
extract_social_credit_code_entities
]
for extractor in regex_entity_extractors:
mapping = extractor(chunk)
if mapping and LLMResponseValidator.validate_regex_entity(mapping):
mapping_pipeline.append(mapping)
elif mapping:
logger.warning(f"Invalid regex entity mapping format: {mapping}")
return mapping_pipeline
def _merge_entity_mappings(self, chunk_mappings: list[Dict[str, Any]]) -> list[Dict[str, str]]:
all_entities = []
for mapping in chunk_mappings:
if isinstance(mapping, dict) and 'entities' in mapping:
entities = mapping['entities']
if isinstance(entities, list):
all_entities.extend(entities)
unique_entities = []
seen_texts = set()
for entity in all_entities:
if isinstance(entity, dict) and 'text' in entity:
text = entity['text'].strip()
if text and text not in seen_texts:
seen_texts.add(text)
unique_entities.append(entity)
elif text and text in seen_texts:
# For now, just log entities whose text duplicates one already seen (possible conflicts)
logger.info(f"Duplicate entity found: {entity}")
continue
logger.info(f"Merged {len(unique_entities)} unique entities")
return unique_entities
def _generate_masked_mapping(self, unique_entities: list[Dict[str, str]], linkage: Dict[str, Any]) -> Dict[str, str]:
"""
结合 linkage 信息按实体分组映射同一脱敏名并实现如下规则
1. 人名/简称保留姓名变为某同姓编号
2. 公司名同组公司名映射为大写字母公司A公司B公司...
3. 英文人名每个单词首字母+***
4. 英文公司名替换为所属行业名称英文大写如无行业信息默认 COMPANY
5. 项目名项目名称变为小写英文字母 a项目b项目...
6. 案号只替换案号中的数字部分为***保留前后结构和支持中间有空格
7. 身份证号6位X
8. 社会信用代码8位X
9. 地址保留区级及以上行政区划去除详细位置
10. 其他类型按原有逻辑
"""
import re
entity_mapping = {}
used_masked_names = set()
group_mask_map = {}
surname_counter = {}
company_letter = ord('A')
project_letter = ord('a')
# Match district/county-level units first, then city, province, etc.
# NOTE: several single-character suffixes were lost in this copy; common ones
# (县/区/市/旗/盟 and 省/市) are assumed here.
admin_keywords = [
'市辖区', '自治县', '自治旗', '林区', '县', '区', '市', '旗', '盟', '地区', '自治州',
'省', '市', '自治区', '特别行政区'
]
admin_pattern = r"^(.*?(?:" + '|'.join(admin_keywords) + r"))"
for group in linkage.get('entity_groups', []):
group_type = group.get('group_type', '')
entities = group.get('entities', [])
if '公司' in group_type or 'Company' in group_type:
masked = chr(company_letter) + '公司'
company_letter += 1
for entity in entities:
group_mask_map[entity['text']] = masked
elif '人名' in group_type:
surname_local_counter = {}
for entity in entities:
name = entity['text']
if not name:
continue
surname = name[0]
surname_local_counter.setdefault(surname, 0)
surname_local_counter[surname] += 1
if surname_local_counter[surname] == 1:
masked = f"{surname}"
else:
masked = f"{surname}{surname_local_counter[surname]}"
group_mask_map[name] = masked
elif '英文人名' in group_type:
for entity in entities:
name = entity['text']
if not name:
continue
masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
group_mask_map[name] = masked
for entity in unique_entities:
text = entity['text']
entity_type = entity.get('type', '')
if text in group_mask_map:
entity_mapping[text] = group_mask_map[text]
used_masked_names.add(group_mask_map[text])
elif '英文公司名' in entity_type or 'English Company' in entity_type:
industry = entity.get('industry', 'COMPANY')
masked = industry.upper()
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '项目名' in entity_type:
masked = chr(project_letter) + '项目'
project_letter += 1
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '案号' in entity_type:
masked = re.sub(r'(\d[\d\s]*)(号)', r'***\2', text)
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '身份证号' in entity_type:
masked = 'X' * 6
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '社会信用代码' in entity_type:
masked = 'X' * 8
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '地址' in entity_type:
# Keep district-level and higher administrative divisions; drop the detailed location
match = re.match(admin_pattern, text)
if match:
masked = match.group(1)
else:
masked = text # fallback
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '人名' in entity_type:
name = text
if not name:
masked = ''
else:
surname = name[0]
surname_counter.setdefault(surname, 0)
surname_counter[surname] += 1
if surname_counter[surname] == 1:
masked = f"{surname}"
else:
masked = f"{surname}{surname_counter[surname]}"
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '公司' in entity_type or 'Company' in entity_type:
masked = chr(company_letter) + '公司'
company_letter += 1
entity_mapping[text] = masked
used_masked_names.add(masked)
elif '英文人名' in entity_type:
name = text
masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
entity_mapping[text] = masked
used_masked_names.add(masked)
else:
base_name = '某'  # NOTE: the original literal was lost in this copy; '某' is assumed
masked = base_name
counter = 1
while masked in used_masked_names:
if counter <= 10:
suffixes = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']  # assumed; the original literals were lost
masked = base_name + suffixes[counter - 1]
else:
masked = f"{base_name}{counter}"
counter += 1
entity_mapping[text] = masked
used_masked_names.add(masked)
return entity_mapping
def _validate_linkage_format(self, linkage: Dict[str, Any]) -> bool:
return LLMResponseValidator.validate_entity_linkage(linkage)
def _create_entity_linkage(self, unique_entities: list[Dict[str, str]]) -> Dict[str, Any]:
linkable_entities = []
for entity in unique_entities:
entity_type = entity.get('type', '')
if any(keyword in entity_type for keyword in ['公司', 'Company', '人名', '英文人名']):
linkable_entities.append(entity)
if not linkable_entities:
logger.info("No linkable entities found")
return {"entity_groups": []}
entities_text = "\n".join([
f"- {entity['text']} (类型: {entity['type']})"
for entity in linkable_entities
])
for attempt in range(self.max_retries):
try:
formatted_prompt = get_entity_linkage_prompt(entities_text)
logger.info(f"Calling ollama to generate entity linkage (attempt {attempt + 1}/{self.max_retries})")
response = self.ollama_client.generate(formatted_prompt)
logger.info(f"Raw entity linkage response from LLM: {response}")
linkage = LLMJsonExtractor.parse_raw_json_str(response)
logger.info(f"Parsed entity linkage: {linkage}")
if linkage and self._validate_linkage_format(linkage):
logger.info(f"Successfully created entity linkage with {len(linkage.get('entity_groups', []))} groups")
return linkage
else:
logger.warning(f"Invalid entity linkage format received on attempt {attempt + 1}, retrying...")
except Exception as e:
logger.error(f"Error generating entity linkage on attempt {attempt + 1}: {e}")
if attempt < self.max_retries - 1:
logger.info("Retrying...")
else:
logger.error("Max retries reached for entity linkage, returning empty linkage")
return {"entity_groups": []}
def _apply_entity_linkage_to_mapping(self, entity_mapping: Dict[str, str], entity_linkage: Dict[str, Any]) -> Dict[str, str]:
"""
Linkage is already handled in _generate_masked_mapping; return entity_mapping unchanged.
"""
return entity_mapping
def process(self, chunks: list[str]) -> Dict[str, str]:
chunk_mappings = []
for i, chunk in enumerate(chunks):
logger.info(f"Processing chunk {i+1}/{len(chunks)}")
chunk_mapping = self.build_mapping(chunk)
logger.info(f"Chunk mapping: {chunk_mapping}")
chunk_mappings.extend(chunk_mapping)
logger.info(f"Final chunk mappings: {chunk_mappings}")
unique_entities = self._merge_entity_mappings(chunk_mappings)
logger.info(f"Unique entities: {unique_entities}")
entity_linkage = self._create_entity_linkage(unique_entities)
logger.info(f"Entity linkage: {entity_linkage}")
# for quick test
# unique_entities = [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '2020京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
# entity_linkage = {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
combined_mapping = self._generate_masked_mapping(unique_entities, entity_linkage)
logger.info(f"Combined mapping: {combined_mapping}")
final_mapping = self._apply_entity_linkage_to_mapping(combined_mapping, entity_linkage)
logger.info(f"Final mapping: {final_mapping}")
return final_mapping


@ -0,0 +1,7 @@
from .txt_processor import TxtDocumentProcessor
# from .docx_processor import DocxDocumentProcessor
from .pdf_processor import PdfDocumentProcessor
from .md_processor import MarkdownDocumentProcessor
# __all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
__all__ = ['TxtDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']


@ -0,0 +1,77 @@
import os
import docx
from ...document_handlers.document_processor import DocumentProcessor
from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.data.read_api import read_local_office
import logging
from ...services.ollama_client import OllamaClient
from ...config import settings
from ...prompts.masking_prompts import get_masking_mapping_prompt
logger = logging.getLogger(__name__)
class DocxDocumentProcessor(DocumentProcessor):
def __init__(self, input_path: str, output_path: str):
super().__init__() # Call parent class's __init__
self.input_path = input_path
self.output_path = output_path
self.output_dir = os.path.dirname(output_path)
self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]
# Setup output directories
self.local_image_dir = os.path.join(self.output_dir, "images")
self.image_dir = os.path.basename(self.local_image_dir)
os.makedirs(self.local_image_dir, exist_ok=True)
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
def read_content(self) -> str:
try:
# Initialize writers
image_writer = FileBasedDataWriter(self.local_image_dir)
md_writer = FileBasedDataWriter(self.output_dir)
# Create Dataset Instance and process
ds = read_local_office(self.input_path)[0]
pipe_result = ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer)
# Generate markdown
md_content = pipe_result.get_markdown(self.image_dir)
pipe_result.dump_md(md_writer, f"{self.name_without_suff}.md", self.image_dir)
return md_content
except Exception as e:
logger.error(f"Error converting DOCX to MD: {e}")
raise
# def process_content(self, content: str) -> str:
# logger.info("Processing DOCX content")
# # Split content into sentences and apply masking
# sentences = content.split("。")
# final_md = ""
# for sentence in sentences:
# if sentence.strip(): # Only process non-empty sentences
# formatted_prompt = get_masking_mapping_prompt(sentence)
# logger.info("Calling ollama to generate response, prompt: %s", formatted_prompt)
# response = self.ollama_client.generate(formatted_prompt)
# logger.info(f"Response generated: {response}")
# final_md += response + "。"
# return final_md
def save_content(self, content: str) -> None:
# Ensure output path has .md extension
output_dir = os.path.dirname(self.output_path)
base_name = os.path.splitext(os.path.basename(self.output_path))[0]
md_output_path = os.path.join(output_dir, f"{base_name}.md")
logger.info(f"Saving masked content to: {md_output_path}")
try:
with open(md_output_path, 'w', encoding='utf-8') as file:
file.write(content)
logger.info(f"Successfully saved content to {md_output_path}")
except Exception as e:
logger.error(f"Error saving content: {e}")
raise


@ -0,0 +1,39 @@
import os
from ...document_handlers.document_processor import DocumentProcessor
from ...services.ollama_client import OllamaClient
import logging
from ...config import settings
logger = logging.getLogger(__name__)
class MarkdownDocumentProcessor(DocumentProcessor):
def __init__(self, input_path: str, output_path: str):
super().__init__() # Call parent class's __init__
self.input_path = input_path
self.output_path = output_path
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
def read_content(self) -> str:
"""Read markdown content from file"""
try:
with open(self.input_path, 'r', encoding='utf-8') as file:
content = file.read()
logger.info(f"Successfully read markdown content from {self.input_path}")
return content
except Exception as e:
logger.error(f"Error reading markdown file {self.input_path}: {e}")
raise
def save_content(self, content: str) -> None:
"""Save processed markdown content"""
try:
# Ensure output directory exists
output_dir = os.path.dirname(self.output_path)
os.makedirs(output_dir, exist_ok=True)
with open(self.output_path, 'w', encoding='utf-8') as file:
file.write(content)
logger.info(f"Successfully saved masked content to {self.output_path}")
except Exception as e:
logger.error(f"Error saving content to {self.output_path}: {e}")
raise


@ -0,0 +1,204 @@
import os
import requests
import logging
from typing import Dict, Any, Optional
from ...document_handlers.document_processor import DocumentProcessor
from ...services.ollama_client import OllamaClient
from ...config import settings
logger = logging.getLogger(__name__)
class PdfDocumentProcessor(DocumentProcessor):
def __init__(self, input_path: str, output_path: str):
super().__init__() # Call parent class's __init__
self.input_path = input_path
self.output_path = output_path
self.output_dir = os.path.dirname(output_path)
self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]
# Setup work directory for temporary files
self.work_dir = os.path.join(
os.path.dirname(output_path),
".work",
os.path.splitext(os.path.basename(input_path))[0]
)
os.makedirs(self.work_dir, exist_ok=True)
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
# Mineru API configuration
self.mineru_base_url = getattr(settings, 'MINERU_API_URL', 'http://mineru-api:8000')
self.mineru_timeout = getattr(settings, 'MINERU_TIMEOUT', 300) # 5 minutes timeout
self.mineru_lang_list = getattr(settings, 'MINERU_LANG_LIST', ['ch'])
self.mineru_backend = getattr(settings, 'MINERU_BACKEND', 'pipeline')
self.mineru_parse_method = getattr(settings, 'MINERU_PARSE_METHOD', 'auto')
self.mineru_formula_enable = getattr(settings, 'MINERU_FORMULA_ENABLE', True)
self.mineru_table_enable = getattr(settings, 'MINERU_TABLE_ENABLE', True)
def _call_mineru_api(self, file_path: str) -> Optional[Dict[str, Any]]:
"""
Call Mineru API to convert PDF to markdown
Args:
file_path: Path to the PDF file
Returns:
API response as dictionary or None if failed
"""
try:
url = f"{self.mineru_base_url}/file_parse"
with open(file_path, 'rb') as file:
files = {'files': (os.path.basename(file_path), file, 'application/pdf')}
# Prepare form data according to Mineru API specification
data = {
'output_dir': './output',
'lang_list': self.mineru_lang_list,
'backend': self.mineru_backend,
'parse_method': self.mineru_parse_method,
'formula_enable': self.mineru_formula_enable,
'table_enable': self.mineru_table_enable,
'return_md': True,
'return_middle_json': False,
'return_model_output': False,
'return_content_list': False,
'return_images': False,
'start_page_id': 0,
'end_page_id': 99999
}
logger.info(f"Calling Mineru API at {url}")
response = requests.post(
url,
files=files,
data=data,
timeout=self.mineru_timeout
)
if response.status_code == 200:
result = response.json()
logger.info("Successfully received response from Mineru API")
return result
else:
logger.error(f"Mineru API returned status code {response.status_code}: {response.text}")
return None
except requests.exceptions.Timeout:
logger.error(f"Mineru API request timed out after {self.mineru_timeout} seconds")
return None
except requests.exceptions.RequestException as e:
logger.error(f"Error calling Mineru API: {str(e)}")
return None
except Exception as e:
logger.error(f"Unexpected error calling Mineru API: {str(e)}")
return None
def _extract_markdown_from_response(self, response: Dict[str, Any]) -> str:
"""
Extract markdown content from Mineru API response
Args:
response: Mineru API response dictionary
Returns:
Extracted markdown content as string
"""
try:
logger.debug(f"Mineru API response structure: {response}")
# Try different possible response formats based on Mineru API
if 'markdown' in response:
return response['markdown']
elif 'md' in response:
return response['md']
elif 'content' in response:
return response['content']
elif 'text' in response:
return response['text']
elif 'result' in response and isinstance(response['result'], dict):
result = response['result']
if 'markdown' in result:
return result['markdown']
elif 'md' in result:
return result['md']
elif 'content' in result:
return result['content']
elif 'text' in result:
return result['text']
elif 'data' in response and isinstance(response['data'], dict):
data = response['data']
if 'markdown' in data:
return data['markdown']
elif 'md' in data:
return data['md']
elif 'content' in data:
return data['content']
elif 'text' in data:
return data['text']
elif isinstance(response, list) and len(response) > 0:
# If response is a list, try to extract from first item
first_item = response[0]
if isinstance(first_item, dict):
return self._extract_markdown_from_response(first_item)
elif isinstance(first_item, str):
return first_item
else:
# If no standard format found, try to extract from the response structure
logger.warning("Could not find standard markdown field in Mineru response")
# Return the response as string if it's simple, or empty string
if isinstance(response, str):
return response
elif isinstance(response, dict):
# Try to find any text-like content
for key, value in response.items():
if isinstance(value, str) and len(value) > 100: # Likely content
return value
elif isinstance(value, dict):
# Recursively search in nested dictionaries
nested_content = self._extract_markdown_from_response(value)
if nested_content:
return nested_content
return ""
except Exception as e:
logger.error(f"Error extracting markdown from Mineru response: {str(e)}")
return ""
def read_content(self) -> str:
logger.info("Starting PDF content processing with Mineru API")
# Call Mineru API to convert PDF to markdown
mineru_response = self._call_mineru_api(self.input_path)
if not mineru_response:
raise Exception("Failed to get response from Mineru API")
# Extract markdown content from the response
markdown_content = self._extract_markdown_from_response(mineru_response)
if not markdown_content:
raise Exception("No markdown content found in Mineru API response")
logger.info(f"Successfully extracted {len(markdown_content)} characters of markdown content")
# Save the raw markdown content to work directory for reference
md_output_path = os.path.join(self.work_dir, f"{self.name_without_suff}.md")
with open(md_output_path, 'w', encoding='utf-8') as file:
file.write(markdown_content)
logger.info(f"Saved raw markdown content to {md_output_path}")
return markdown_content
def save_content(self, content: str) -> None:
# Ensure output path has .md extension
output_dir = os.path.dirname(self.output_path)
base_name = os.path.splitext(os.path.basename(self.output_path))[0]
md_output_path = os.path.join(output_dir, f"{base_name}.md")
logger.info(f"Saving masked content to: {md_output_path}")
with open(md_output_path, 'w', encoding='utf-8') as file:
file.write(content)
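
For orientation, a minimal sketch of how this processor might be driven end to end. The import path and file locations below are assumptions for illustration only; the masking step (`process_content`) is inherited from the `DocumentProcessor` base class, as used by `DocumentService`:

```python
# Sketch only: module path and file paths are hypothetical examples.
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor  # assumed module path

processor = PdfDocumentProcessor(
    input_path="/app/storage/uploads/sample.pdf",    # hypothetical uploaded PDF
    output_path="/app/storage/processed/sample.md",  # masked output (extension forced to .md)
)
markdown = processor.read_content()           # PDF -> markdown via the Mineru /file_parse endpoint
masked = processor.process_content(markdown)  # masking step from the DocumentProcessor base class
processor.save_content(masked)
```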

View File

@ -0,0 +1,28 @@
from ...document_handlers.document_processor import DocumentProcessor
from ...services.ollama_client import OllamaClient
import logging
# from ...prompts.masking_prompts import get_masking_prompt
from ...config import settings
logger = logging.getLogger(__name__)
class TxtDocumentProcessor(DocumentProcessor):
def __init__(self, input_path: str, output_path: str):
super().__init__()
self.input_path = input_path
self.output_path = output_path
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
def read_content(self) -> str:
with open(self.input_path, 'r', encoding='utf-8') as file:
return file.read()
# def process_content(self, content: str) -> str:
# formatted_prompt = get_masking_prompt(content)
# response = self.ollama_client.generate(formatted_prompt)
# logger.debug(f"Processed content: {response}")
# return response
def save_content(self, content: str) -> None:
with open(self.output_path, 'w', encoding='utf-8') as file:
file.write(content)

View File

@ -0,0 +1,18 @@
import re
def extract_id_number_entities(chunk: str) -> dict:
"""Extract Chinese ID numbers and return in entity mapping format."""
id_pattern = r'\b\d{17}[\dXx]\b'
entities = []
for match in re.findall(id_pattern, chunk):
entities.append({"text": match, "type": "身份证号"})
return {"entities": entities} if entities else {}
def extract_social_credit_code_entities(chunk: str) -> dict:
"""Extract social credit codes and return in entity mapping format."""
credit_pattern = r'\b[0-9A-Z]{18}\b'
entities = []
for match in re.findall(credit_pattern, chunk):
entities.append({"text": match, "type": "统一社会信用代码"})
return {"entities": entities} if entities else {}
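
A quick illustration of the returned shape (the sample strings below are fabricated):

```python
# The ID number here is a made-up 18-character sample, not a real identity number.
chunk = "申请人身份证号: 11010519900101123X。"
print(extract_id_number_entities(chunk))
# -> {'entities': [{'text': '11010519900101123X', 'type': '身份证号'}]}
print(extract_social_credit_code_entities("无社会信用代码"))
# -> {}
```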

View File

@ -0,0 +1,225 @@
import textwrap
def get_ner_name_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original names/companies to their masked versions.
Args:
text (str): The input text to be analyzed for masking
Returns:
str: The formatted prompt that will generate a mapping dictionary
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类请严格按照JSON格式输出结果
实体类别包括:
- 人名 (不包括律师法官书记员检察官等公职人员)
- 英文人名
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "人名"}},
{{"text": "原始文本内容", "type": "英文人名"}},
...
]
}}
请严格按照JSON格式输出结果
""")
return prompt.format(text=text)
def get_ner_company_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original companies to their masked versions.
Args:
text (str): The input text to be analyzed for masking
Returns:
str: The formatted prompt that will generate a mapping dictionary
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类请严格按照JSON格式输出结果
实体类别包括:
- 公司名称
- 英文公司名称
- Company with English name
- 公司名称简称
- 公司英文名称简称
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "公司名称"}},
{{"text": "原始文本内容", "type": "英文公司名称"}},
{{"text": "原始文本内容", "type": "公司名称简称"}},
{{"text": "原始文本内容", "type": "公司英文名称简称"}},
...
]
}}
请严格按照JSON格式输出结果
""")
return prompt.format(text=text)
def get_ner_address_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original addresses to their masked versions.
Args:
text (str): The input text to be analyzed for masking
Returns:
str: The formatted prompt that will generate a mapping dictionary
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类请严格按照JSON格式输出结果
实体类别包括:
- 地址
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "地址"}},
...
]
}}
请严格按照JSON格式输出结果
""")
return prompt.format(text=text)
def get_ner_project_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original project names to their masked versions.
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类请严格按照JSON格式输出结果
实体类别包括:
- 项目名(此处项目特指商业工程合同等项目)
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "项目名"}},
...
]
}}
请严格按照JSON格式输出结果
""")
return prompt.format(text=text)
def get_ner_case_number_prompt(text: str) -> str:
"""
Returns a prompt that generates a mapping of original case numbers to their masked versions.
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体识别助手请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类请严格按照JSON格式输出结果
实体类别包括:
- 案号
待处理文本:
{text}
输出格式:
{{
"entities": [
{{"text": "原始文本内容", "type": "案号"}},
...
]
}}
请严格按照JSON格式输出结果
""")
return prompt.format(text=text)
def get_entity_linkage_prompt(entities_text: str) -> str:
"""
Returns a prompt that identifies related entities and groups them together.
Args:
entities_text (str): The list of entities to be analyzed for linkage
Returns:
str: The formatted prompt that will generate entity linkage information
"""
prompt = textwrap.dedent("""
你是一个专业的法律文本实体关联分析助手请分析以下实体列表识别出相互关联的实体如全称与简称中文名与英文名等并将它们分组
关联规则
1. 公司名称关联
- 全称与简称"阿里巴巴集团控股有限公司" "阿里巴巴"
- 中文名与英文名"腾讯科技有限公司" "Tencent Technology Ltd."
- 母公司与子公司"腾讯" "腾讯音乐"
2. 每个组中应指定一个主要实体is_primary: true通常是
- 对于公司选择最正式的全称
- 对于人名选择最常用的称呼
待分析实体列表:
{entities_text}
输出格式:
{{
"entity_groups": [
{{
"group_id": "group_1",
"group_type": "公司名称",
"entities": [
{{
"text": "阿里巴巴集团控股有限公司",
"type": "公司名称",
"is_primary": true
}},
{{
"text": "阿里巴巴",
"type": "公司名称简称",
"is_primary": false
}}
]
}}
]
}}
注意事项
1. 只对确实有关联的实体进行分组
2. 每个实体只能属于一个组
3. 每个组必须有且仅有一个主要实体is_primary: true
4. 如果实体之间没有明显关联不要强制分组
5. group_type 应该是 "公司名称"
请严格按照JSON格式输出结果
""")
return prompt.format(entities_text=entities_text)
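
For context, a small sketch of how one of these builders is used (the sample text is made up; the actual orchestration lives in the NER processing code):

```python
# Illustrative only.
text = "原告郭某诉被告某公司合同纠纷一案。"
prompt = get_ner_name_prompt(text)
# `prompt` now contains the Chinese instructions with `text` substituted under 待处理文本,
# ready to be passed to OllamaClient.generate(); the reply is expected to be JSON such as
# {"entities": [{"text": "郭某", "type": "人名"}]}.
```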

View File

@ -1,12 +1,12 @@
import logging
from models.document_factory import DocumentProcessorFactory
from services.ollama_client import OllamaClient
from ..document_handlers.document_factory import DocumentProcessorFactory
from ..services.ollama_client import OllamaClient
logger = logging.getLogger(__name__)
class DocumentService:
def __init__(self, ollama_client: OllamaClient):
self.ollama_client = ollama_client
def __init__(self):
pass
def process_document(self, input_path: str, output_path: str) -> bool:
try:
@ -19,10 +19,10 @@ class DocumentService:
content = processor.read_content()
# Process with Ollama
processed_content = self.ollama_client.process_document(content)
masked_content = processor.process_content(content)
# Save processed content
processor.save_content(processed_content)
processor.save_content(masked_content)
return True
except Exception as e:

View File

@ -0,0 +1,91 @@
import requests
import logging
from typing import Dict, Any
logger = logging.getLogger(__name__)
class OllamaClient:
def __init__(self, model_name: str, base_url: str = "http://localhost:11434"):
"""Initialize Ollama client.
Args:
model_name (str): Name of the Ollama model to use
            base_url (str): Base URL of the Ollama server (e.g. http://localhost:11434)
"""
self.model_name = model_name
self.base_url = base_url
self.headers = {"Content-Type": "application/json"}
def generate(self, prompt: str, strip_think: bool = True) -> str:
"""Process a document using the Ollama API.
Args:
document_text (str): The text content to process
Returns:
str: Processed text response from the model
Raises:
RequestException: If the API call fails
"""
try:
url = f"{self.base_url}/api/generate"
payload = {
"model": self.model_name,
"prompt": prompt,
"stream": False
}
logger.debug(f"Sending request to Ollama API: {url}")
response = requests.post(url, json=payload, headers=self.headers)
response.raise_for_status()
result = response.json()
logger.debug(f"Received response from Ollama API: {result}")
if strip_think:
# Remove the "thinking" part from the response
# the response is expected to be <think>...</think>response_text
# Check if the response contains <think> tag
if "<think>" in result.get("response", ""):
# Split the response and take the part after </think>
response_parts = result["response"].split("</think>")
if len(response_parts) > 1:
# Return the part after </think>
return response_parts[1].strip()
else:
# If no closing tag, return the full response
return result.get("response", "").strip()
else:
# If no <think> tag, return the full response
return result.get("response", "").strip()
else:
# If strip_think is False, return the full response
return result.get("response", "")
except requests.exceptions.RequestException as e:
logger.error(f"Error calling Ollama API: {str(e)}")
raise
def get_model_info(self) -> Dict[str, Any]:
"""Get information about the current model.
Returns:
Dict[str, Any]: Model information
Raises:
RequestException: If the API call fails
"""
try:
url = f"{self.base_url}/api/show"
payload = {"name": self.model_name}
response = requests.post(url, json=payload, headers=self.headers)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
logger.error(f"Error getting model info: {str(e)}")
raise
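
A short usage sketch; the model name and server URL below are placeholders (in the application they come from settings):

```python
# Placeholder model name and URL.
client = OllamaClient(model_name="llama3", base_url="http://localhost:11434")
answer = client.generate("用一句话说明什么是文本脱敏。")  # strip_think=True drops any <think>...</think> prefix
info = client.get_model_info()                            # raw /api/show payload for the configured model
print(answer)
print(info.get("details", {}))
```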

View File

@ -0,0 +1,141 @@
import json
import re
from typing import Any, Optional, Dict, TypeVar, Type
T = TypeVar('T')
class LLMJsonExtractor:
"""Utility class for extracting and parsing JSON from LLM outputs"""
@staticmethod
def extract_json(text: str) -> Optional[str]:
"""
Extracts JSON string from text using regex pattern matching.
Handles both single and multiple JSON objects in text.
Args:
text (str): Raw text containing JSON
Returns:
Optional[str]: Extracted JSON string or None if no valid JSON found
"""
# Pattern to match JSON objects with balanced braces
pattern = r'{[^{}]*(?:{[^{}]*}[^{}]*)*}'
matches = re.findall(pattern, text)
if not matches:
return None
# Return the first valid JSON match
for match in matches:
try:
# Verify it's valid JSON
json.loads(match)
return match
except json.JSONDecodeError:
continue
return None
@staticmethod
def parse_json(text: str) -> Optional[Dict[str, Any]]:
"""
Extracts and parses JSON from text into a Python dictionary.
Args:
text (str): Raw text containing JSON
Returns:
Optional[Dict[str, Any]]: Parsed JSON as dictionary or None if parsing fails
"""
try:
json_str = LLMJsonExtractor.extract_json(text)
if json_str:
return json.loads(json_str)
return None
except json.JSONDecodeError:
return None
@staticmethod
def parse_to_dataclass(text: str, dataclass_type: Type[T]) -> Optional[T]:
"""
Extracts JSON and converts it to a specified dataclass type.
Args:
text (str): Raw text containing JSON
dataclass_type (Type[T]): Target dataclass type
Returns:
Optional[T]: Instance of specified dataclass or None if conversion fails
"""
try:
data = LLMJsonExtractor.parse_json(text)
if data:
return dataclass_type(**data)
return None
except (json.JSONDecodeError, TypeError):
return None
@staticmethod
def parse_raw_json_str(text: str) -> Optional[Dict[str, Any]]:
"""
        Extracts the largest valid JSON object from text (via extract_json_max) and parses it into a Python dictionary.
Args:
text (str): Raw text containing JSON
Returns:
Optional[Dict[str, Any]]: Parsed JSON as dictionary or None if parsing fails
"""
try:
json_str = LLMJsonExtractor.extract_json_max(text)
if json_str:
return json.loads(json_str)
return None
except json.JSONDecodeError:
return None
@staticmethod
def extract_json_max(text: str) -> Optional[str]:
"""
Extracts the maximum valid JSON object from text using stack-based brace matching.
Args:
text (str): Raw text containing JSON
Returns:
Optional[str]: Maximum valid JSON object as string or None if no valid JSON found
"""
max_json = None
max_length = 0
# Iterate through each character as a potential start of JSON
for start in range(len(text)):
if text[start] != '{':
continue
stack = []
for end in range(start, len(text)):
if text[end] == '{':
stack.append(end)
elif text[end] == '}':
if not stack: # Unmatched closing brace
break
opening_pos = stack.pop()
# If stack is empty, we have a complete JSON object
if not stack:
json_candidate = text[opening_pos:end + 1]
try:
# Verify it's valid JSON
json.loads(json_candidate)
if len(json_candidate) > max_length:
max_length = len(json_candidate)
max_json = json_candidate
except json.JSONDecodeError:
continue
return max_json
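
The practical difference between the two extraction strategies shows up with nested JSON, which entity-linkage responses produce. A made-up example:

```python
# Made-up LLM output with two levels of nesting (an entity_groups style response).
noisy = '结果:{"entity_groups": [{"group_id": "g1", "entities": [{"text": "某公司", "is_primary": true}]}]}'
print(LLMJsonExtractor.extract_json(noisy))
# -> '{"group_id": "g1", "entities": [{"text": "某公司", "is_primary": true}]}'
#    (the one-level regex cannot capture the outermost object)
print(LLMJsonExtractor.parse_raw_json_str(noisy))
# -> {'entity_groups': [{'group_id': 'g1', 'entities': [{'text': '某公司', 'is_primary': True}]}]}
#    (stack-based brace matching recovers the full object)
```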

View File

@ -0,0 +1,240 @@
import logging
from typing import Any, Dict, Optional
from jsonschema import validate, ValidationError
logger = logging.getLogger(__name__)
class LLMResponseValidator:
"""Validator for LLM JSON responses with different schemas for different entity types"""
# Schema for basic entity extraction responses
ENTITY_EXTRACTION_SCHEMA = {
"type": "object",
"properties": {
"entities": {
"type": "array",
"items": {
"type": "object",
"properties": {
"text": {"type": "string"},
"type": {"type": "string"}
},
"required": ["text", "type"]
}
}
},
"required": ["entities"]
}
# Schema for entity linkage responses
ENTITY_LINKAGE_SCHEMA = {
"type": "object",
"properties": {
"entity_groups": {
"type": "array",
"items": {
"type": "object",
"properties": {
"group_id": {"type": "string"},
"group_type": {"type": "string"},
"entities": {
"type": "array",
"items": {
"type": "object",
"properties": {
"text": {"type": "string"},
"type": {"type": "string"},
"is_primary": {"type": "boolean"}
},
"required": ["text", "type", "is_primary"]
}
}
},
"required": ["group_id", "group_type", "entities"]
}
}
},
"required": ["entity_groups"]
}
# Schema for regex-based entity extraction (from entity_regex.py)
REGEX_ENTITY_SCHEMA = {
"type": "object",
"properties": {
"entities": {
"type": "array",
"items": {
"type": "object",
"properties": {
"text": {"type": "string"},
"type": {"type": "string"}
},
"required": ["text", "type"]
}
}
},
"required": ["entities"]
}
@classmethod
def validate_entity_extraction(cls, response: Dict[str, Any]) -> bool:
"""
Validate entity extraction response from LLM.
Args:
response: The parsed JSON response from LLM
Returns:
bool: True if valid, False otherwise
"""
try:
validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
logger.debug(f"Entity extraction validation passed for response: {response}")
return True
except ValidationError as e:
logger.warning(f"Entity extraction validation failed: {e}")
logger.warning(f"Response that failed validation: {response}")
return False
@classmethod
def validate_entity_linkage(cls, response: Dict[str, Any]) -> bool:
"""
Validate entity linkage response from LLM.
Args:
response: The parsed JSON response from LLM
Returns:
bool: True if valid, False otherwise
"""
try:
validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
content_valid = cls._validate_linkage_content(response)
if content_valid:
logger.debug(f"Entity linkage validation passed for response: {response}")
return True
else:
logger.warning(f"Entity linkage content validation failed for response: {response}")
return False
except ValidationError as e:
logger.warning(f"Entity linkage validation failed: {e}")
logger.warning(f"Response that failed validation: {response}")
return False
@classmethod
def validate_regex_entity(cls, response: Dict[str, Any]) -> bool:
"""
Validate regex-based entity extraction response.
Args:
response: The parsed JSON response from regex extractors
Returns:
bool: True if valid, False otherwise
"""
try:
validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
logger.debug(f"Regex entity validation passed for response: {response}")
return True
except ValidationError as e:
logger.warning(f"Regex entity validation failed: {e}")
logger.warning(f"Response that failed validation: {response}")
return False
@classmethod
def _validate_linkage_content(cls, response: Dict[str, Any]) -> bool:
"""
Additional content validation for entity linkage responses.
Args:
response: The parsed JSON response from LLM
Returns:
bool: True if content is valid, False otherwise
"""
entity_groups = response.get('entity_groups', [])
for group in entity_groups:
# Validate group type
group_type = group.get('group_type', '')
if group_type not in ['公司名称', '人名']:
logger.warning(f"Invalid group_type: {group_type}")
return False
# Validate entities in group
entities = group.get('entities', [])
if not entities:
logger.warning("Empty entity group found")
return False
# Check that exactly one entity is marked as primary
primary_count = sum(1 for entity in entities if entity.get('is_primary', False))
if primary_count != 1:
logger.warning(f"Group must have exactly one primary entity, found {primary_count}")
return False
# Validate entity types within group
for entity in entities:
entity_type = entity.get('type', '')
if group_type == '公司名称' and not any(keyword in entity_type for keyword in ['公司', 'Company']):
logger.warning(f"Company group contains non-company entity: {entity_type}")
return False
elif group_type == '人名' and not any(keyword in entity_type for keyword in ['人名', '英文人名']):
logger.warning(f"Person group contains non-person entity: {entity_type}")
return False
return True
@classmethod
def validate_response_by_type(cls, response: Dict[str, Any], response_type: str) -> bool:
"""
Generic validator that routes to appropriate validation method based on response type.
Args:
response: The parsed JSON response from LLM
response_type: Type of response ('entity_extraction', 'entity_linkage', 'regex_entity')
Returns:
bool: True if valid, False otherwise
"""
validators = {
'entity_extraction': cls.validate_entity_extraction,
'entity_linkage': cls.validate_entity_linkage,
'regex_entity': cls.validate_regex_entity
}
validator = validators.get(response_type)
if not validator:
logger.error(f"Unknown response type: {response_type}")
return False
return validator(response)
@classmethod
def get_validation_errors(cls, response: Dict[str, Any], response_type: str) -> Optional[str]:
"""
Get detailed validation errors for debugging.
Args:
response: The parsed JSON response from LLM
response_type: Type of response
Returns:
Optional[str]: Error message or None if valid
"""
try:
if response_type == 'entity_extraction':
validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
elif response_type == 'entity_linkage':
validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
if not cls._validate_linkage_content(response):
return "Content validation failed for entity linkage"
elif response_type == 'regex_entity':
validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
else:
return f"Unknown response type: {response_type}"
return None
except ValidationError as e:
return f"Schema validation error: {e}"

33
backend/app/main.py Normal file
View File

@ -0,0 +1,33 @@
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from .core.config import settings
from .api.endpoints import files
from .core.database import engine, Base
# Create database tables
Base.metadata.create_all(bind=engine)
app = FastAPI(
title=settings.PROJECT_NAME,
openapi_url=f"{settings.API_V1_STR}/openapi.json"
)
# Set up CORS
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # In production, replace with specific origins
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Include routers
app.include_router(
files.router,
prefix=f"{settings.API_V1_STR}/files",
tags=["files"]
)
@app.get("/")
async def root():
return {"message": "Welcome to Legal Document Masker API"}

View File

@ -0,0 +1,22 @@
from sqlalchemy import Column, String, DateTime, Text
from datetime import datetime
import uuid
from ..core.database import Base
class FileStatus(str):
NOT_STARTED = "not_started"
PROCESSING = "processing"
SUCCESS = "success"
FAILED = "failed"
class File(Base):
__tablename__ = "files"
id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
filename = Column(String(255), nullable=False)
original_path = Column(String(255), nullable=False)
processed_path = Column(String(255))
status = Column(String(20), nullable=False, default=FileStatus.NOT_STARTED)
error_message = Column(Text)
created_at = Column(DateTime, nullable=False, default=datetime.utcnow)
updated_at = Column(DateTime, nullable=False, default=datetime.utcnow, onupdate=datetime.utcnow)

View File

@ -0,0 +1,21 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional
from uuid import UUID
class FileBase(BaseModel):
filename: str
status: str
error_message: Optional[str] = None
class FileResponse(FileBase):
id: UUID
created_at: datetime
updated_at: datetime
class Config:
from_attributes = True
class FileList(BaseModel):
files: list[FileResponse]
total: int

View File

@ -0,0 +1,87 @@
from celery import Celery
from ..core.config import settings
from ..models.file import File, FileStatus
from sqlalchemy.orm import Session
from ..core.database import SessionLocal
import sys
import os
from ..core.services.document_service import DocumentService
from pathlib import Path
from fastapi import HTTPException
celery = Celery(
'file_service',
broker=settings.CELERY_BROKER_URL,
backend=settings.CELERY_RESULT_BACKEND
)
def delete_file(file_id: str):
"""
Delete a file and its associated records.
This will:
1. Delete the database record
2. Delete the original uploaded file
3. Delete the processed markdown file (if it exists)
"""
db = SessionLocal()
try:
# Get the file record
file = db.query(File).filter(File.id == file_id).first()
if not file:
raise HTTPException(status_code=404, detail="File not found")
# Delete the original file if it exists
if file.original_path and os.path.exists(file.original_path):
os.remove(file.original_path)
# Delete the processed file if it exists
if file.processed_path and os.path.exists(file.processed_path):
os.remove(file.processed_path)
# Delete the database record
db.delete(file)
db.commit()
except Exception as e:
db.rollback()
raise HTTPException(status_code=500, detail=f"Error deleting file: {str(e)}")
finally:
db.close()
@celery.task
def process_file(file_id: str):
db = SessionLocal()
try:
file = db.query(File).filter(File.id == file_id).first()
if not file:
return
# Update status to processing
file.status = FileStatus.PROCESSING
db.commit()
try:
# Process the file using your existing masking system
process_service = DocumentService()
# Determine output path using file_id with .md extension
output_filename = f"{file_id}.md"
output_path = str(settings.PROCESSED_FOLDER / output_filename)
# Process document with both input and output paths
process_service.process_document(file.original_path, output_path)
# Update file record with processed path
file.processed_path = output_path
file.status = FileStatus.SUCCESS
db.commit()
except Exception as e:
file.status = FileStatus.FAILED
file.error_message = str(e)
db.commit()
raise
finally:
db.close()
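
For context, a hedged sketch of how this task is typically enqueued from the API layer; the endpoint code is not shown here, so treat the wiring as an assumption:

```python
# Assumed wiring; the files router lives under app/api/endpoints/files.py.
from app.services.file_service import process_file

def enqueue_processing(file_id: str) -> None:
    # Celery serializes the call; the worker defined above picks it up via the Redis broker.
    process_file.delay(file_id)
```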

View File

@ -0,0 +1,37 @@
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
volumes:
- ./storage:/app/storage
- ./legal_doc_masker.db:/app/legal_doc_masker.db
env_file:
- .env
environment:
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
depends_on:
- redis
celery_worker:
build: .
command: celery -A app.services.file_service worker --loglevel=info
volumes:
- ./storage:/app/storage
- ./legal_doc_masker.db:/app/legal_doc_masker.db
env_file:
- .env
environment:
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
depends_on:
- redis
- api
redis:
image: redis:alpine
ports:
- "6379:6379"

127
backend/log Normal file
View File

@ -0,0 +1,127 @@
[2025-07-14 14:20:19,015: INFO/ForkPoolWorker-4] Raw response from LLM: {
celery_worker-1 | "entities": []
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:20:19,016: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
celery_worker-1 | [2025-07-14 14:20:19,020: INFO/ForkPoolWorker-4] Calling ollama to generate case numbers mapping for chunk (attempt 1/3):
celery_worker-1 | 你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息并按照指定的类别进行分类。请严格按照JSON格式输出结果。
celery_worker-1 |
celery_worker-1 | 实体类别包括:
celery_worker-1 | - 案号
celery_worker-1 |
celery_worker-1 | 待处理文本:
celery_worker-1 |
celery_worker-1 |
celery_worker-1 | 二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
celery_worker-1 |
celery_worker-1 | 29. 本判决为终审判决。
celery_worker-1 |
celery_worker-1 | 审 判 长 史晓霞审 判 员 邓青菁审 判 员 李 淼二〇二二年七月七日法 官 助 理 黎 铧书 记 员 郑海兴
celery_worker-1 |
celery_worker-1 | 输出格式:
celery_worker-1 | {
celery_worker-1 | "entities": [
celery_worker-1 | {"text": "原始文本内容", "type": "案号"},
celery_worker-1 | ...
celery_worker-1 | ]
celery_worker-1 | }
celery_worker-1 |
celery_worker-1 | 请严格按照JSON格式输出结果。
celery_worker-1 |
api-1 | INFO: 192.168.65.1:60045 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:22084 - "GET /api/v1/files/files HTTP/1.1" 200 OK
celery_worker-1 | [2025-07-14 14:20:31,279: INFO/ForkPoolWorker-4] Raw response from LLM: {
celery_worker-1 | "entities": []
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:20:31,281: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
celery_worker-1 | [2025-07-14 14:20:31,287: INFO/ForkPoolWorker-4] Chunk mapping: [{'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Final chunk mappings: [{'entities': [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}]}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}]}, {'entities': [{'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}]}, {'entities': [{'text': '服务合同', 'type': '项目名'}]}, {'entities': [{'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '2020京0105 民初69754 号', 'type': '案号'}]}, {'entities': [{'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}]}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}]}, {'entities': [{'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}]}, {'entities': [{'text': '《计算机设备采购合同》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': [{'text': '《服务合同书》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '中研智创公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Merged 22 unique entities
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Unique entities: [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '2020京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
celery_worker-1 | [2025-07-14 14:20:31,289: INFO/ForkPoolWorker-4] Calling ollama to generate entity linkage (attempt 1/3)
api-1 | INFO: 192.168.65.1:52168 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61426 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:30702 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:48159 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:16860 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:21262 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:45564 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:32142 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:27769 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:21196 - "GET /api/v1/files/files HTTP/1.1" 200 OK
celery_worker-1 | [2025-07-14 14:21:21,436: INFO/ForkPoolWorker-4] Raw entity linkage response from LLM: {
celery_worker-1 | "entity_groups": [
celery_worker-1 | {
celery_worker-1 | "group_id": "group_1",
celery_worker-1 | "group_type": "公司名称",
celery_worker-1 | "entities": [
celery_worker-1 | {
celery_worker-1 | "text": "北京丰复久信营销科技有限公司",
celery_worker-1 | "type": "公司名称",
celery_worker-1 | "is_primary": true
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "丰复久信公司",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "丰复久信",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | }
celery_worker-1 | ]
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "group_id": "group_2",
celery_worker-1 | "group_type": "公司名称",
celery_worker-1 | "entities": [
celery_worker-1 | {
celery_worker-1 | "text": "中研智创区块链技术有限公司",
celery_worker-1 | "type": "公司名称",
celery_worker-1 | "is_primary": true
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "中研智创公司",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | },
celery_worker-1 | {
celery_worker-1 | "text": "中研智创",
celery_worker-1 | "type": "公司名称简称",
celery_worker-1 | "is_primary": false
celery_worker-1 | }
celery_worker-1 | ]
celery_worker-1 | }
celery_worker-1 | ]
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:21:21,437: INFO/ForkPoolWorker-4] Parsed entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Successfully created entity linkage with 2 groups
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Generated masked mapping for 22 entities
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Combined mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司甲', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '2020京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司丁', '丰复久信': '某公司戊', '中研智创': '某公司己', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '北京丰复久信营销科技有限公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创区块链技术有限公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Final mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '2020京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司乙', '丰复久信': '某公司', '中研智创': '某公司乙', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Successfully masked content
celery_worker-1 | [2025-07-14 14:21:21,449: INFO/ForkPoolWorker-4] Successfully saved masked content to /app/storage/processed/47522ea9-c259-4304-bfe4-1d3ed6902ede.md
celery_worker-1 | [2025-07-14 14:21:21,470: INFO/ForkPoolWorker-4] Task app.services.file_service.process_file[5cfbca4c-0f6f-4c71-a66b-b22ee2d28139] succeeded in 311.847165101s: None
api-1 | INFO: 192.168.65.1:33432 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:40073 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:29550 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61350 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61755 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:63726 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:43446 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:45624 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:25256 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:43464 - "GET /api/v1/files/files HTTP/1.1" 200 OK

6
backend/package-lock.json generated Normal file
View File

@ -0,0 +1,6 @@
{
"name": "backend",
"lockfileVersion": 3,
"requires": true,
"packages": {}
}

32
backend/requirements.txt Normal file
View File

@ -0,0 +1,32 @@
# FastAPI and server
fastapi>=0.104.0
uvicorn>=0.24.0
python-multipart>=0.0.6
websockets>=12.0
# Database
sqlalchemy>=2.0.0
alembic>=1.12.0
# Background tasks
celery>=5.3.0
redis>=5.0.0
# Security
python-jose[cryptography]>=3.3.0
passlib[bcrypt]>=1.7.4
python-dotenv>=1.0.0
# Testing
pytest>=7.4.0
httpx>=0.25.0
# Existing project dependencies
pydantic-settings>=2.0.0
watchdog==2.1.6
requests==2.28.1
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0
# magic-pdf[full]
jsonschema>=4.20.0

1
backend/tests/test.txt Normal file
View File

@ -0,0 +1 @@
关于张三天和北京易见天树有限公司的劳动纠纷

View File

@ -0,0 +1,62 @@
import pytest
from app.core.document_handlers.ner_processor import NerProcessor
def test_generate_masked_mapping():
processor = NerProcessor()
unique_entities = [
{'text': '李雷', 'type': '人名'},
{'text': '李明', 'type': '人名'},
{'text': '王强', 'type': '人名'},
{'text': 'Acme Manufacturing Inc.', 'type': '英文公司名', 'industry': 'manufacturing'},
{'text': 'Google LLC', 'type': '英文公司名'},
{'text': 'A公司', 'type': '公司名称'},
{'text': 'B公司', 'type': '公司名称'},
{'text': 'John Smith', 'type': '英文人名'},
{'text': 'Elizabeth Windsor', 'type': '英文人名'},
{'text': '华梦龙光伏项目', 'type': '项目名'},
{'text': '案号12345', 'type': '案号'},
{'text': '310101198802080000', 'type': '身份证号'},
{'text': '9133021276453538XT', 'type': '社会信用代码'},
]
linkage = {
'entity_groups': [
{
'group_id': 'g1',
'group_type': '公司名称',
'entities': [
{'text': 'A公司', 'type': '公司名称', 'is_primary': True},
{'text': 'B公司', 'type': '公司名称', 'is_primary': False},
]
},
{
'group_id': 'g2',
'group_type': '人名',
'entities': [
{'text': '李雷', 'type': '人名', 'is_primary': True},
{'text': '李明', 'type': '人名', 'is_primary': False},
]
}
]
}
mapping = processor._generate_masked_mapping(unique_entities, linkage)
    # Person names
assert mapping['李雷'].startswith('李某')
assert mapping['李明'].startswith('李某')
assert mapping['王强'].startswith('王某')
    # English company names
assert mapping['Acme Manufacturing Inc.'] == 'MANUFACTURING'
assert mapping['Google LLC'] == 'COMPANY'
    # Company names in the same group
assert mapping['A公司'] == mapping['B公司']
assert mapping['A公司'].endswith('公司')
    # English person names
assert mapping['John Smith'] == 'J*** S***'
assert mapping['Elizabeth Windsor'] == 'E*** W***'
    # Project names
assert mapping['华梦龙光伏项目'].endswith('项目')
    # Case numbers
assert mapping['案号12345'] == '***'
    # ID numbers
assert mapping['310101198802080000'] == 'XXXXXX'
    # Social credit codes
assert mapping['9133021276453538XT'] == 'XXXXXXXX'

105
docker-compose.yml Normal file
View File

@ -0,0 +1,105 @@
version: '3.8'
services:
# Mineru API Service
mineru-api:
build:
context: ./mineru
dockerfile: Dockerfile
platform: linux/arm64
ports:
- "8001:8000"
volumes:
- ./mineru/storage/uploads:/app/storage/uploads
- ./mineru/storage/processed:/app/storage/processed
environment:
- PYTHONUNBUFFERED=1
- MINERU_MODEL_SOURCE=local
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
networks:
- app-network
# Backend API Service
backend-api:
build:
context: ./backend
dockerfile: Dockerfile
ports:
- "8000:8000"
volumes:
- ./backend/storage:/app/storage
- ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
env_file:
- ./backend/.env
environment:
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
- MINERU_API_URL=http://mineru-api:8000
depends_on:
- redis
- mineru-api
networks:
- app-network
# Celery Worker
celery-worker:
build:
context: ./backend
dockerfile: Dockerfile
command: celery -A app.services.file_service worker --loglevel=info
volumes:
- ./backend/storage:/app/storage
- ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
env_file:
- ./backend/.env
environment:
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
- MINERU_API_URL=http://mineru-api:8000
depends_on:
- redis
- backend-api
networks:
- app-network
# Redis Service
redis:
image: redis:alpine
ports:
- "6379:6379"
networks:
- app-network
# Frontend Service
frontend:
build:
context: ./frontend
dockerfile: Dockerfile
args:
- REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
ports:
- "3000:80"
env_file:
- ./frontend/.env
environment:
- NODE_ENV=production
- REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
restart: unless-stopped
depends_on:
- backend-api
networks:
- app-network
networks:
app-network:
driver: bridge
volumes:
uploads:
processed:

67
download_models.py Normal file
View File

@ -0,0 +1,67 @@
import json
import shutil
import os
import requests
from modelscope import snapshot_download
def download_json(url):
    # Download the JSON file
response = requests.get(url)
    response.raise_for_status()  # Check that the request succeeded
return response.json()
def download_and_modify_json(url, local_filename, modifications):
if os.path.exists(local_filename):
data = json.load(open(local_filename))
config_version = data.get('config_version', '0.0.0')
if config_version < '1.2.0':
data = download_json(url)
else:
data = download_json(url)
    # Apply the modifications
for key, value in modifications.items():
data[key] = value
    # Save the modified content
with open(local_filename, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=4)
if __name__ == '__main__':
mineru_patterns = [
# "models/Layout/LayoutLMv3/*",
"models/Layout/YOLO/*",
"models/MFD/YOLO/*",
"models/MFR/unimernet_hf_small_2503/*",
"models/OCR/paddleocr_torch/*",
# "models/TabRec/TableMaster/*",
# "models/TabRec/StructEqTable/*",
]
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
layoutreader_model_dir = snapshot_download('ppaanngggg/layoutreader')
model_dir = model_dir + '/models'
print(f'model_dir is: {model_dir}')
print(f'layoutreader_model_dir is: {layoutreader_model_dir}')
# paddleocr_model_dir = model_dir + '/OCR/paddleocr'
# user_paddleocr_dir = os.path.expanduser('~/.paddleocr')
# if os.path.exists(user_paddleocr_dir):
# shutil.rmtree(user_paddleocr_dir)
# shutil.copytree(paddleocr_model_dir, user_paddleocr_dir)
json_url = 'https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/magic-pdf.template.json'
config_file_name = 'magic-pdf.json'
home_dir = os.path.expanduser('~')
config_file = os.path.join(home_dir, config_file_name)
json_mods = {
'models-dir': model_dir,
'layoutreader-model-dir': layoutreader_model_dir,
}
download_and_modify_json(json_url, config_file, json_mods)
print(f'The configuration file has been configured successfully, the path is: {config_file}')

168
export-images.sh Normal file
View File

@ -0,0 +1,168 @@
#!/bin/bash
# Docker Image Export Script
# Exports all project Docker images for migration to another environment
set -e
echo "🚀 Legal Document Masker - Docker Image Export"
echo "=============================================="
# Function to check if Docker is running
check_docker() {
if ! docker info > /dev/null 2>&1; then
echo "❌ Docker is not running. Please start Docker and try again."
exit 1
fi
echo "✅ Docker is running"
}
# Function to check if images exist
check_images() {
echo "🔍 Checking for required images..."
local missing_images=()
if ! docker images | grep -q "legal-doc-masker-backend-api"; then
missing_images+=("legal-doc-masker-backend-api")
fi
if ! docker images | grep -q "legal-doc-masker-frontend"; then
missing_images+=("legal-doc-masker-frontend")
fi
if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
missing_images+=("legal-doc-masker-mineru-api")
fi
if ! docker images | grep -q "redis:alpine"; then
missing_images+=("redis:alpine")
fi
if [ ${#missing_images[@]} -ne 0 ]; then
echo "❌ Missing images: ${missing_images[*]}"
echo "Please build the images first using: docker-compose build"
exit 1
fi
echo "✅ All required images found"
}
# Function to create export directory
create_export_dir() {
local export_dir="docker-images-export-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$export_dir"
cd "$export_dir"
echo "📁 Created export directory: $export_dir"
echo "$export_dir"
}
# Function to export images
export_images() {
local export_dir="$1"
echo "📦 Exporting Docker images..."
# Export backend image
echo " 📦 Exporting backend-api image..."
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
# Export frontend image
echo " 📦 Exporting frontend image..."
docker save legal-doc-masker-frontend:latest -o frontend.tar
# Export mineru image
echo " 📦 Exporting mineru-api image..."
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
# Export redis image
echo " 📦 Exporting redis image..."
docker save redis:alpine -o redis.tar
echo "✅ All images exported successfully!"
}
# Function to show export summary
show_summary() {
echo ""
echo "📊 Export Summary:"
echo "=================="
ls -lh *.tar
echo ""
echo "📋 Files to transfer:"
echo "===================="
for file in *.tar; do
echo " - $file"
done
echo ""
echo "💾 Total size: $(du -sh . | cut -f1)"
}
# Function to create compressed archive
create_archive() {
echo ""
echo "🗜️ Creating compressed archive..."
local archive_name="legal-doc-masker-images-$(date +%Y%m%d-%H%M%S).tar.gz"
tar -czf "$archive_name" *.tar
echo "✅ Created archive: $archive_name"
echo "📊 Archive size: $(du -sh "$archive_name" | cut -f1)"
echo ""
echo "📋 Transfer options:"
echo "==================="
echo "1. Transfer individual .tar files"
echo "2. Transfer compressed archive: $archive_name"
}
# Function to show transfer instructions
show_transfer_instructions() {
echo ""
echo "📤 Transfer Instructions:"
echo "========================"
echo ""
echo "Option 1: Transfer individual files"
echo "-----------------------------------"
echo "scp *.tar user@target-server:/path/to/destination/"
echo ""
echo "Option 2: Transfer compressed archive"
echo "-------------------------------------"
echo "scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/"
echo ""
echo "Option 3: USB Drive"
echo "-------------------"
echo "cp *.tar /Volumes/USB_DRIVE/docker-images/"
echo "cp legal-doc-masker-images-*.tar.gz /Volumes/USB_DRIVE/"
echo ""
echo "Option 4: Cloud Storage"
echo "----------------------"
echo "aws s3 cp *.tar s3://your-bucket/docker-images/"
echo "aws s3 cp legal-doc-masker-images-*.tar.gz s3://your-bucket/docker-images/"
}
# Main execution
main() {
check_docker
check_images
    local export_dir
    export_dir=$(create_export_dir)
    cd "$export_dir"
    export_images "$export_dir"
show_summary
create_archive
show_transfer_instructions
echo ""
echo "🎉 Export completed successfully!"
echo "📁 Export location: $(pwd)"
echo ""
echo "Next steps:"
echo "1. Transfer the files to your target environment"
echo "2. Use import-images.sh on the target environment"
echo "3. Copy docker-compose.yml and other config files"
}
# Run main function
main "$@"

11
frontend/.dockerignore Normal file
View File

@ -0,0 +1,11 @@
node_modules
npm-debug.log
build
.git
.gitignore
README.md
.env
.env.local
.env.development.local
.env.test.local
.env.production.local

2
frontend/.env Normal file
View File

@ -0,0 +1,2 @@
# REACT_APP_API_BASE_URL=http://192.168.2.203:8000/api/v1
REACT_APP_API_BASE_URL=http://localhost:8000/api/v1

33
frontend/Dockerfile Normal file
View File

@ -0,0 +1,33 @@
# Build stage
FROM node:18-alpine as build
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies
RUN npm ci
# Copy source code
COPY . .
# Build the app with environment variables
ARG REACT_APP_API_BASE_URL
ENV REACT_APP_API_BASE_URL=$REACT_APP_API_BASE_URL
RUN npm run build
# Production stage
FROM nginx:alpine
# Copy built assets from build stage
COPY --from=build /app/build /usr/share/nginx/html
# Copy nginx configuration
COPY nginx.conf /etc/nginx/conf.d/default.conf
# Expose port 80
EXPOSE 80
# Start nginx
CMD ["nginx", "-g", "daemon off;"]

55
frontend/README.md Normal file
View File

@ -0,0 +1,55 @@
# Legal Document Masker Frontend
This is the frontend application for the Legal Document Masker service. It provides a user interface for uploading legal documents, monitoring their processing status, and downloading the masked versions.
## Features
- Drag and drop file upload
- Real-time status updates
- File list with processing status
- Multi-file selection and download
- Modern Material-UI interface
## Prerequisites
- Node.js (v14 or higher)
- npm (v6 or higher)
## Installation
1. Install dependencies:
```bash
npm install
```
2. Start the development server:
```bash
npm start
```
The application will be available at http://localhost:3000
## Development
The frontend is built with:
- React 18
- TypeScript
- Material-UI
- React Query for data fetching
- React Dropzone for file uploads
## Building for Production
To create a production build:
```bash
npm run build
```
The build artifacts will be stored in the `build/` directory.
## Environment Variables
The following environment variables can be configured:
- `REACT_APP_API_BASE_URL`: The base URL of the backend API (default: http://localhost:8000/api/v1)

View File

@ -0,0 +1,24 @@
version: '3.8'
services:
frontend:
build:
context: .
dockerfile: Dockerfile
args:
- REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
ports:
- "3000:80"
env_file:
- .env
environment:
- NODE_ENV=production
- REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
restart: unless-stopped
networks:
- app-network
networks:
app-network:
driver: bridge

25
frontend/nginx.conf Normal file
View File

@ -0,0 +1,25 @@
server {
listen 80;
server_name localhost;
location / {
root /usr/share/nginx/html;
index index.html;
try_files $uri $uri/ /index.html;
}
# Cache static assets
location /static/ {
root /usr/share/nginx/html;
expires 1y;
add_header Cache-Control "public, no-transform";
}
# Enable gzip compression
gzip on;
gzip_vary on;
gzip_min_length 10240;
gzip_proxied expired no-cache no-store private auth;
gzip_types text/plain text/css text/xml text/javascript application/x-javascript application/xml application/javascript;
gzip_disable "MSIE [1-6]\.";
}

16946
frontend/package-lock.json generated Normal file

File diff suppressed because it is too large

50
frontend/package.json Normal file
View File

@ -0,0 +1,50 @@
{
"name": "legal-doc-masker-frontend",
"version": "0.1.0",
"private": true,
"dependencies": {
"@emotion/react": "^11.11.3",
"@emotion/styled": "^11.11.0",
"@mui/icons-material": "^5.15.10",
"@mui/material": "^5.15.10",
"@testing-library/jest-dom": "^5.17.0",
"@testing-library/react": "^13.4.0",
"@testing-library/user-event": "^13.5.0",
"@types/jest": "^27.5.2",
"@types/node": "^16.18.80",
"@types/react": "^18.2.55",
"@types/react-dom": "^18.2.19",
"axios": "^1.6.7",
"react": "^18.2.0",
"react-dom": "^18.2.0",
"react-dropzone": "^14.2.3",
"react-query": "^3.39.3",
"react-scripts": "5.0.1",
"typescript": "^4.9.5",
"web-vitals": "^2.1.4"
},
"scripts": {
"start": "react-scripts start",
"build": "react-scripts build",
"test": "react-scripts test",
"eject": "react-scripts eject"
},
"eslintConfig": {
"extends": [
"react-app",
"react-app/jest"
]
},
"browserslist": {
"production": [
">0.2%",
"not dead",
"not op_mini all"
],
"development": [
"last 1 chrome version",
"last 1 firefox version",
"last 1 safari version"
]
}
}

View File

@ -0,0 +1,20 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<link rel="icon" href="%PUBLIC_URL%/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="theme-color" content="#000000" />
<meta
name="description"
content="Legal Document Masker - Upload and process legal documents"
/>
<link rel="apple-touch-icon" href="%PUBLIC_URL%/logo192.png" />
<link rel="manifest" href="%PUBLIC_URL%/manifest.json" />
<title>Legal Document Masker</title>
</head>
<body>
<noscript>You need to enable JavaScript to run this app.</noscript>
<div id="root"></div>
</body>
</html>

View File

@ -0,0 +1,15 @@
{
"short_name": "Legal Doc Masker",
"name": "Legal Document Masker",
"icons": [
{
"src": "favicon.ico",
"sizes": "64x64 32x32 24x24 16x16",
"type": "image/x-icon"
}
],
"start_url": ".",
"display": "standalone",
"theme_color": "#000000",
"background_color": "#ffffff"
}

58
frontend/src/App.tsx Normal file
View File

@ -0,0 +1,58 @@
import React, { useEffect, useState } from 'react';
import { Container, Typography, Box } from '@mui/material';
import { useQuery, useQueryClient } from 'react-query';
import FileUpload from './components/FileUpload';
import FileList from './components/FileList';
import { File } from './types/file';
import { api } from './services/api';
function App() {
const queryClient = useQueryClient();
const [files, setFiles] = useState<File[]>([]);
const { data, isLoading, error } = useQuery<File[]>('files', api.listFiles, {
refetchInterval: 5000, // Poll every 5 seconds
});
useEffect(() => {
if (data) {
setFiles(data);
}
}, [data]);
const handleUploadComplete = () => {
queryClient.invalidateQueries('files');
};
if (isLoading) {
return (
<Container>
<Typography>Loading...</Typography>
</Container>
);
}
if (error) {
return (
<Container>
<Typography color="error">Error loading files</Typography>
</Container>
);
}
return (
<Container maxWidth="lg">
<Box sx={{ my: 4 }}>
<Typography variant="h4" component="h1" gutterBottom>
Legal Document Masker
</Typography>
<Box sx={{ mb: 4 }}>
<FileUpload onUploadComplete={handleUploadComplete} />
</Box>
<FileList files={files} onFileStatusChange={handleUploadComplete} />
</Box>
</Container>
);
}
export default App;

View File

@ -0,0 +1,230 @@
import React, { useState } from 'react';
import {
Table,
TableBody,
TableCell,
TableContainer,
TableHead,
TableRow,
Paper,
IconButton,
Checkbox,
Button,
Chip,
Dialog,
DialogTitle,
DialogContent,
DialogActions,
Typography,
} from '@mui/material';
import { Download as DownloadIcon, Delete as DeleteIcon } from '@mui/icons-material';
import { File, FileStatus } from '../types/file';
import { api } from '../services/api';
interface FileListProps {
files: File[];
onFileStatusChange: () => void;
}
const FileList: React.FC<FileListProps> = ({ files, onFileStatusChange }) => {
const [selectedFiles, setSelectedFiles] = useState<string[]>([]);
const [deleteDialogOpen, setDeleteDialogOpen] = useState(false);
const [fileToDelete, setFileToDelete] = useState<string | null>(null);
const handleSelectFile = (fileId: string) => {
setSelectedFiles((prev) =>
prev.includes(fileId)
? prev.filter((id) => id !== fileId)
: [...prev, fileId]
);
};
const handleSelectAll = () => {
setSelectedFiles((prev) =>
prev.length === files.length ? [] : files.map((file) => file.id)
);
};
const handleDownload = async (fileId: string) => {
try {
console.log('=== FRONTEND DOWNLOAD START ===');
console.log('File ID:', fileId);
const file = files.find((f) => f.id === fileId);
console.log('File object:', file);
const blob = await api.downloadFile(fileId);
console.log('Blob received:', blob);
console.log('Blob type:', blob.type);
console.log('Blob size:', blob.size);
const url = window.URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
// Match backend behavior: change extension to .md
const originalFilename = file?.filename || 'downloaded-file';
const filenameWithoutExt = originalFilename.replace(/\.[^/.]+$/, ''); // Remove extension
const downloadFilename = `${filenameWithoutExt}.md`;
console.log('Original filename:', originalFilename);
console.log('Filename without extension:', filenameWithoutExt);
console.log('Download filename:', downloadFilename);
a.download = downloadFilename;
document.body.appendChild(a);
a.click();
window.URL.revokeObjectURL(url);
document.body.removeChild(a);
console.log('=== FRONTEND DOWNLOAD END ===');
} catch (error) {
console.error('Error downloading file:', error);
}
};
const handleDownloadSelected = async () => {
for (const fileId of selectedFiles) {
await handleDownload(fileId);
}
};
const handleDeleteClick = (fileId: string) => {
setFileToDelete(fileId);
setDeleteDialogOpen(true);
};
const handleDeleteConfirm = async () => {
if (fileToDelete) {
try {
await api.deleteFile(fileToDelete);
onFileStatusChange();
} catch (error) {
console.error('Error deleting file:', error);
}
}
setDeleteDialogOpen(false);
setFileToDelete(null);
};
const handleDeleteCancel = () => {
setDeleteDialogOpen(false);
setFileToDelete(null);
};
const getStatusColor = (status: FileStatus) => {
switch (status) {
case FileStatus.SUCCESS:
return 'success';
case FileStatus.FAILED:
return 'error';
case FileStatus.PROCESSING:
return 'warning';
default:
return 'default';
}
};
return (
<div>
<div style={{ marginBottom: '1rem' }}>
<Button
variant="contained"
color="primary"
onClick={handleDownloadSelected}
disabled={selectedFiles.length === 0}
sx={{ mr: 1 }}
>
Download Selected
</Button>
</div>
<TableContainer component={Paper}>
<Table>
<TableHead>
<TableRow>
<TableCell padding="checkbox">
<Checkbox
checked={selectedFiles.length === files.length}
indeterminate={selectedFiles.length > 0 && selectedFiles.length < files.length}
onChange={handleSelectAll}
/>
</TableCell>
<TableCell>Filename</TableCell>
<TableCell>Status</TableCell>
<TableCell>Created At</TableCell>
<TableCell>Finished At</TableCell>
<TableCell>Actions</TableCell>
</TableRow>
</TableHead>
<TableBody>
{files.map((file) => (
<TableRow key={file.id}>
<TableCell padding="checkbox">
<Checkbox
checked={selectedFiles.includes(file.id)}
onChange={() => handleSelectFile(file.id)}
/>
</TableCell>
<TableCell>{file.filename}</TableCell>
<TableCell>
<Chip
label={file.status}
color={getStatusColor(file.status) as any}
size="small"
/>
</TableCell>
<TableCell>
{new Date(file.created_at).toLocaleString()}
</TableCell>
<TableCell>
{(file.status === FileStatus.SUCCESS || file.status === FileStatus.FAILED)
? new Date(file.updated_at).toLocaleString()
: '—'}
</TableCell>
<TableCell>
<IconButton
onClick={() => handleDeleteClick(file.id)}
size="small"
color="error"
sx={{ mr: 1 }}
>
<DeleteIcon />
</IconButton>
{file.status === FileStatus.SUCCESS && (
<IconButton
onClick={() => handleDownload(file.id)}
size="small"
color="primary"
>
<DownloadIcon />
</IconButton>
)}
</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</TableContainer>
<Dialog
open={deleteDialogOpen}
onClose={handleDeleteCancel}
>
<DialogTitle>Confirm Delete</DialogTitle>
<DialogContent>
<Typography>
Are you sure you want to delete this file? This action cannot be undone.
</Typography>
</DialogContent>
<DialogActions>
<Button onClick={handleDeleteCancel}>Cancel</Button>
<Button onClick={handleDeleteConfirm} color="error" variant="contained">
Delete
</Button>
</DialogActions>
</Dialog>
</div>
);
};
export default FileList;

@@ -0,0 +1,66 @@
import React, { useCallback } from 'react';
import { useDropzone } from 'react-dropzone';
import { Box, Typography, CircularProgress } from '@mui/material';
import { api } from '../services/api';
interface FileUploadProps {
onUploadComplete: () => void;
}
const FileUpload: React.FC<FileUploadProps> = ({ onUploadComplete }) => {
const [isUploading, setIsUploading] = React.useState(false);
const onDrop = useCallback(async (acceptedFiles: File[]) => {
setIsUploading(true);
try {
for (const file of acceptedFiles) {
await api.uploadFile(file);
}
onUploadComplete();
} catch (error) {
console.error('Error uploading files:', error);
} finally {
setIsUploading(false);
}
}, [onUploadComplete]);
const { getRootProps, getInputProps, isDragActive } = useDropzone({
onDrop,
accept: {
'application/pdf': ['.pdf'],
'application/msword': ['.doc'],
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
'text/markdown': ['.md'],
},
});
return (
<Box
{...getRootProps()}
sx={{
border: '2px dashed #ccc',
borderRadius: 2,
p: 3,
textAlign: 'center',
cursor: 'pointer',
bgcolor: isDragActive ? 'action.hover' : 'background.paper',
'&:hover': {
bgcolor: 'action.hover',
},
}}
>
<input {...getInputProps()} />
{isUploading ? (
<CircularProgress />
) : (
<Typography>
{isDragActive
? 'Drop the files here...'
: 'Drag and drop files here, or click to select files'}
</Typography>
)}
</Box>
);
};
export default FileUpload;

frontend/src/env.d.ts vendored Normal file
@@ -0,0 +1,8 @@
/// <reference types="react-scripts" />
declare namespace NodeJS {
interface ProcessEnv {
readonly REACT_APP_API_BASE_URL: string;
// Add other environment variables here
}
}

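The declaration above types the only environment variable the frontend reads. As a reminder of how it would be supplied, here is a hypothetical `frontend/.env` entry (not part of this commit); Create React App only injects variables prefixed with `REACT_APP_` at build time, and `api.ts` further below falls back to `http://localhost:8000/api/v1` when the variable is unset.

```bash
# Hypothetical frontend/.env entry (assumption, not included in this commit).
# CRA exposes only REACT_APP_-prefixed variables; api.ts defaults to this value anyway.
REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
```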
frontend/src/index.tsx Normal file
@@ -0,0 +1,29 @@
import React from 'react';
import ReactDOM from 'react-dom/client';
import { QueryClient, QueryClientProvider } from 'react-query';
import { ThemeProvider, createTheme } from '@mui/material';
import CssBaseline from '@mui/material/CssBaseline';
import App from './App';
const queryClient = new QueryClient();
const theme = createTheme({
palette: {
mode: 'light',
},
});
const root = ReactDOM.createRoot(
document.getElementById('root') as HTMLElement
);
root.render(
<React.StrictMode>
<QueryClientProvider client={queryClient}>
<ThemeProvider theme={theme}>
<CssBaseline />
<App />
</ThemeProvider>
</QueryClientProvider>
</React.StrictMode>
);

@@ -0,0 +1,44 @@
import axios from 'axios';
import { File, FileUploadResponse } from '../types/file';
const API_BASE_URL = process.env.REACT_APP_API_BASE_URL || 'http://localhost:8000/api/v1';
// Create axios instance with default config
const axiosInstance = axios.create({
baseURL: API_BASE_URL,
timeout: 30000, // 30 seconds timeout
});
export const api = {
uploadFile: async (file: globalThis.File): Promise<FileUploadResponse> => {
const formData = new FormData();
formData.append('file', file);
const response = await axiosInstance.post('/files/upload', formData, {
headers: {
'Content-Type': 'multipart/form-data',
},
});
return response.data;
},
listFiles: async (): Promise<File[]> => {
const response = await axiosInstance.get('/files/files');
return response.data;
},
getFile: async (fileId: string): Promise<File> => {
const response = await axiosInstance.get(`/files/files/${fileId}`);
return response.data;
},
downloadFile: async (fileId: string): Promise<Blob> => {
const response = await axiosInstance.get(`/files/files/${fileId}/download`, {
responseType: 'blob',
});
return response.data;
},
deleteFile: async (fileId: string): Promise<void> => {
await axiosInstance.delete(`/files/files/${fileId}`);
},
};

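For orientation, the sketch below shows rough curl equivalents of the calls the client above makes. The paths and the multipart field name are inferred from the axios calls; the actual request and response bodies are whatever the backend in this commit defines, and `FILE_ID` and the upload path are placeholders.

```bash
# Sketch only: curl equivalents of the api.ts calls, assuming the default base URL.
BASE=http://localhost:8000/api/v1

# Upload a document (multipart field name "file", matching the FormData above)
curl -F "file=@/path/to/document.pdf" "$BASE/files/upload"

# List files, then fetch one file's metadata (FILE_ID is a placeholder id)
curl "$BASE/files/files"
curl "$BASE/files/files/FILE_ID"

# Download the processed result and delete the record
curl -o result.md "$BASE/files/files/FILE_ID/download"
curl -X DELETE "$BASE/files/files/FILE_ID"
```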
@@ -0,0 +1,23 @@
export enum FileStatus {
NOT_STARTED = "not_started",
PROCESSING = "processing",
SUCCESS = "success",
FAILED = "failed"
}
export interface File {
id: string;
filename: string;
status: FileStatus;
error_message?: string;
created_at: string;
updated_at: string;
}
export interface FileUploadResponse {
id: string;
filename: string;
status: FileStatus;
created_at: string;
updated_at: string;
}

frontend/tsconfig.json Normal file
@@ -0,0 +1,26 @@
{
"compilerOptions": {
"target": "es5",
"lib": [
"dom",
"dom.iterable",
"esnext"
],
"allowJs": true,
"skipLibCheck": true,
"esModuleInterop": true,
"allowSyntheticDefaultImports": true,
"strict": true,
"forceConsistentCasingInFileNames": true,
"noFallthroughCasesInSwitch": true,
"module": "esnext",
"moduleResolution": "node",
"resolveJsonModule": true,
"isolatedModules": true,
"noEmit": true,
"jsx": "react-jsx"
},
"include": [
"src"
]
}

import-images.sh Normal file
@@ -0,0 +1,232 @@
#!/bin/bash
# Docker Image Import Script
# Imports Docker images on the target environment for migration
set -e
echo "🚀 Legal Document Masker - Docker Image Import"
echo "=============================================="
# Function to check if Docker is running
check_docker() {
if ! docker info > /dev/null 2>&1; then
echo "❌ Docker is not running. Please start Docker and try again."
exit 1
fi
echo "✅ Docker is running"
}
# Function to check for tar files
check_tar_files() {
echo "🔍 Checking for Docker image files..."
local missing_files=()
if [ ! -f "backend-api.tar" ]; then
missing_files+=("backend-api.tar")
fi
if [ ! -f "frontend.tar" ]; then
missing_files+=("frontend.tar")
fi
if [ ! -f "mineru-api.tar" ]; then
missing_files+=("mineru-api.tar")
fi
if [ ! -f "redis.tar" ]; then
missing_files+=("redis.tar")
fi
if [ ${#missing_files[@]} -ne 0 ]; then
echo "❌ Missing files: ${missing_files[*]}"
echo ""
echo "Please ensure all .tar files are in the current directory."
echo "If you have a compressed archive, extract it first:"
echo " tar -xzf legal-doc-masker-images-*.tar.gz"
exit 1
fi
echo "✅ All required files found"
}
# Function to check available disk space
check_disk_space() {
echo "💾 Checking available disk space..."
local required_space=0
for file in *.tar; do
local file_size=$(stat -f%z "$file" 2>/dev/null || stat -c%s "$file" 2>/dev/null || echo 0)
required_space=$((required_space + file_size))
done
local available_space=$(df -k . | awk 'NR==2 {print $4}') # -k forces 1K blocks on both Linux and macOS
available_space=$((available_space * 1024)) # Convert to bytes
if [ $required_space -gt $available_space ]; then
echo "❌ Insufficient disk space"
echo "Required: $(numfmt --to=iec $required_space)"
echo "Available: $(numfmt --to=iec $available_space)"
exit 1
fi
echo "✅ Sufficient disk space available"
}
# Function to import images
import_images() {
echo "📦 Importing Docker images..."
# Import backend image
echo " 📦 Importing backend-api image..."
docker load -i backend-api.tar
# Import frontend image
echo " 📦 Importing frontend image..."
docker load -i frontend.tar
# Import mineru image
echo " 📦 Importing mineru-api image..."
docker load -i mineru-api.tar
# Import redis image
echo " 📦 Importing redis image..."
docker load -i redis.tar
echo "✅ All images imported successfully!"
}
# Function to verify imported images
verify_images() {
echo "🔍 Verifying imported images..."
local missing_images=()
if ! docker images | grep -q "legal-doc-masker-backend-api"; then
missing_images+=("legal-doc-masker-backend-api")
fi
if ! docker images | grep -q "legal-doc-masker-frontend"; then
missing_images+=("legal-doc-masker-frontend")
fi
if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
missing_images+=("legal-doc-masker-mineru-api")
fi
if ! docker images | grep -q "redis:alpine"; then
missing_images+=("redis:alpine")
fi
if [ ${#missing_images[@]} -ne 0 ]; then
echo "❌ Missing imported images: ${missing_images[*]}"
exit 1
fi
echo "✅ All images verified successfully!"
}
# Function to show imported images
show_imported_images() {
echo ""
echo "📊 Imported Images:"
echo "==================="
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep redis
}
# Function to create necessary directories
create_directories() {
echo ""
echo "📁 Creating necessary directories..."
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
echo "✅ Directories created"
}
# Function to check for required files
check_required_files() {
echo ""
echo "🔍 Checking for required configuration files..."
local missing_files=()
if [ ! -f "docker-compose.yml" ]; then
missing_files+=("docker-compose.yml")
fi
if [ ! -f "DOCKER_COMPOSE_README.md" ]; then
missing_files+=("DOCKER_COMPOSE_README.md")
fi
if [ ${#missing_files[@]} -ne 0 ]; then
echo "⚠️ Missing files: ${missing_files[*]}"
echo "Please copy these files from the source environment:"
echo " - docker-compose.yml"
echo " - DOCKER_COMPOSE_README.md"
echo " - backend/.env (if exists)"
echo " - frontend/.env (if exists)"
echo " - mineru/.env (if exists)"
else
echo "✅ All required configuration files found"
fi
}
# Function to show next steps
show_next_steps() {
echo ""
echo "🎉 Import completed successfully!"
echo ""
echo "📋 Next Steps:"
echo "=============="
echo ""
echo "1. Copy configuration files (if not already present):"
echo " - docker-compose.yml"
echo " - backend/.env"
echo " - frontend/.env"
echo " - mineru/.env"
echo ""
echo "2. Start the services:"
echo " docker-compose up -d"
echo ""
echo "3. Verify services are running:"
echo " docker-compose ps"
echo ""
echo "4. Test the endpoints:"
echo " - Frontend: http://localhost:3000"
echo " - Backend API: http://localhost:8000"
echo " - Mineru API: http://localhost:8001"
echo ""
echo "5. View logs if needed:"
echo " docker-compose logs -f [service-name]"
}
# Function to handle compressed archive
handle_compressed_archive() {
if ls legal-doc-masker-images-*.tar.gz 1> /dev/null 2>&1; then
echo "🗜️ Found compressed archive, extracting..."
tar -xzf legal-doc-masker-images-*.tar.gz
echo "✅ Archive extracted"
fi
}
# Main execution
main() {
check_docker
handle_compressed_archive
check_tar_files
check_disk_space
import_images
verify_images
show_imported_images
create_directories
check_required_files
show_next_steps
}
# Run main function
main "$@"

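The import script assumes the four image archives already exist. A plausible export step on the source machine, not included in this commit, is sketched below; the image names are assumptions and must match whatever `docker images` reports there, since the verification step greps for the `legal-doc-masker-*` and `redis:alpine` names.

```bash
# Hypothetical export step on the source machine (assumption, not part of this commit).
# Adjust image names/tags to what `docker images` shows after `docker-compose build`.
docker save -o backend-api.tar legal-doc-masker-backend-api:latest
docker save -o frontend.tar legal-doc-masker-frontend:latest
docker save -o mineru-api.tar legal-doc-masker-mineru-api:latest
docker save -o redis.tar redis:alpine
tar -czf legal-doc-masker-images-$(date +%Y%m%d).tar.gz *.tar
```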
mineru/Dockerfile Normal file
@@ -0,0 +1,46 @@
FROM python:3.12-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
libreoffice \
wget \
&& rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN pip install uv
# Configure uv and install mineru
ENV UV_SYSTEM_PYTHON=1
RUN uv pip install --system -U "mineru[core]"
# Copy requirements first to leverage Docker cache
# COPY requirements.txt .
# RUN pip install huggingface_hub
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py
# RUN python download_models_hf.py
RUN mineru-models-download -s modelscope -m pipeline
# RUN pip install --no-cache-dir -r requirements.txt
# RUN pip install -U magic-pdf[full]
# Copy the rest of the application
# COPY . .
# Create storage directories
# RUN mkdir -p storage/uploads storage/processed
# Expose the port the app runs on
EXPOSE 8000
# Command to run the application
CMD ["mineru-api", "--host", "0.0.0.0", "--port", "8000"]

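To try this image on its own, outside the unified compose setup, a minimal build-and-run sketch follows; the tag name is arbitrary, and the port mapping and `MINERU_MODEL_SOURCE` value simply mirror mineru/docker-compose.yml below.

```bash
# Minimal standalone sketch; run from the repository root.
docker build -t mineru-api ./mineru
docker run --rm -p 8001:8000 -e MINERU_MODEL_SOURCE=local mineru-api
# The service should then answer on http://localhost:8001 (container listens on 8000).
```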
mineru/docker-compose.yml Normal file
@@ -0,0 +1,27 @@
version: '3.8'
services:
mineru-api:
build:
context: .
dockerfile: Dockerfile
platform: linux/arm64
ports:
- "8001:8000"
volumes:
- ./storage/uploads:/app/storage/uploads
- ./storage/processed:/app/storage/processed
environment:
- PYTHONUNBUFFERED=1
- MINERU_MODEL_SOURCE=local
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
volumes:
uploads:
processed:

@@ -1,10 +0,0 @@
# Base dependencies
pydantic-settings>=2.0.0
python-dotenv==1.0.0
watchdog==2.1.6
requests==2.26.0
# Document processing
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0

Binary file not shown.

@@ -0,0 +1,101 @@
# 北京市第三中级人民法院民事判决书
(2022)京 03 民终 3852 号
上诉人原审原告北京丰复久信营销科技有限公司住所地北京市海淀区北小马厂6 号1 号楼华天大厦1306 室。
法定代表人:郭东军,执行董事、经理。委托诉讼代理人:周大海,北京市康达律师事务所律师。委托诉讼代理人:王乃哲,北京市康达律师事务所律师。
被上诉人原审被告中研智创区块链技术有限公司住所地天津市津南区双港镇工业园区优谷产业园5 号楼-1505。
法定代表人:王欢子,总经理。
委托诉讼代理人:魏鑫,北京市昊衡律师事务所律师。
1.上诉人北京丰复久信营销科技有限公司以下简称丰复久信公司因与被上诉人中研智创区块链技术有限公司以下简称中研智创公司服务合同纠纷一案不服北京市朝阳区人民法院2020京0105 民初69754 号民事判决,向本院提起上诉。本院立案后,依法组成合议庭开庭进行了审理。上诉人丰复久信公司之委托诉讼代理人周大海、王乃哲,被上诉人中研智创公司之委托诉讼代理人魏鑫到庭参加诉讼。本案现已审理终结。
2.丰复久信公司上诉请求1.撤销一审判决发回重审或依法改判支持丰复久信公司一审全部诉讼请求2.或在维持原判的同时判令中研智创公司向丰复久信公司返还 1000 万元款项,并赔偿丰复久信公司因此支付的律师费 220 万元3.判令中研智创公司承担本案一审、二审全部诉讼费用。事实与理由一、根据2019 年的政策导向丰复久信公司的投资行为并无任何法律或政策瑕疵。丰复久信公司仅投资挖矿没有购买比特币故在当时国家、政府层面有相关政策支持甚至鼓励的前提下一审法院仅凭“挖矿”行为就得出丰复久信公司扰乱金融秩序的结论是错误的。二、一审法院没有全面、深入审查相关事实且遗漏了最核心的数据调查工作。三、本案一审判决适用法律错误。涉案合同成立及履行期间并无合同无效的情形当属有效。一审法院以挖矿活动耗能巨大、不利于我国产业结构调整为依据之一作出合同无效的判决实属牵强。最高人民法院发布的全国法院系统2020 年度优秀案例分析评选活动获奖名单中,由上海市第一中级人民法院刘江法官编写的“李圣艳、布兰登·斯密特诉闫向东、李敏等财产损害赔偿纠纷案— —比特币的法律属性及其司法救济”一案入选,该案同样发生在丰复久信公司与中研智创公司合同履行过程中,一审法院认定同时期同类型的涉案合同无效,与上述最高人民法院的优秀案例相悖。四、一审法院径行认定合同无效,未向丰复久信公司进行释明构成程序违法。
3.中研智创公司辩称,同意一审判决,不同意丰复久信公司的上诉请求。首先,一审法院曾在庭审中询问丰复久信公司关于机器返还的问题,一审法院进行了释明。其次,如二审法院对其该项上诉请求进行判决,会剥夺中研智创公司针对该部分请求再行上诉的权利。
4.丰复久信公司向一审法院起诉请求1.中研智创公司交付278.1654976 个比特币,或者按照 2021 年 1 月 25 日比特币的价格交付9550812.36 美元2.中研智创公司赔偿丰复久信公司服务期到期后占用微型存储空间服务器的损失(自2020 年7 月1日起至实际返还服务器时止按照bitinfocharts 网站公布的相关日产比特币数据计算应赔偿比特币数量或按照2021 年1 月25 日比特币的价格交付美元)。
5.一审法院查明事实2019 年5 月6 日,丰复久信公司作为甲方(买方)与乙方(卖方)中研智创公司签订《计算机设备采购合
同》约定货物名称为计算机设备型号规格及数量为T2T-30T 规格型号的微型存储空间服务器1542 台单价5040/ 台合同金额为 7 771 680 元;交货期 2019 年 8 月 31 日前;交货方式为乙方自行送货到甲方所在地,并提供安装服务,运输工具及运费由乙方负责;交货地点北京;签订购货合同,设备安装完毕后一次性支付项目总货款;乙方提供货物的质量保证期为自交货验收结束之日起不少于十二个月(具体按清单要求);乙方交货前应对产品作出全面检查和对验收文件进行整理,并列出清单,作为甲方收货验收和使用的技术条件依据,检验的结果应随货物交甲方,甲方对乙方提供的货物在使用前进行调试时,乙方协助甲方一起调试,直到符合技术要求,甲方才做最终验收,验收时乙方必须在现场,验收完毕后作出验收结果报告,并经双方签字生效。
6.同日丰复久信公司作为甲方客户方与乙方中研智创公司服务方签订《服务合同书》约定乙方同意就采购合同中的微型存储空间服务器向甲方提供特定服务服务的内容包括质保、维修、服务器设备代为运行管理、代为缴纳服务器相关用度花费如电费等详细内容见附件一如果乙方在工作中因自身过错而发生任何错误或遗漏应无条件更正不另外收费并对因此而对甲方造成的损失承担赔偿责任赔偿额以本合同约定的服务费为限若因甲方原因造成工作延误将由甲方承担相应的损失服务费总金额为2 228 320 元甲乙双方一致同意项目服务费以人民币形式于本合同签订后3 日内一次性支付甲方可以提前10 个工作日以书面形式要求变更或增加所提供的服务该等变更最终应由双方商定认可其中包括与该等变更有关的任何费用调整等。合同后附附件一以表格形式列明1.1542 台T2T-30T 微型存储空间服务器的质保、维修时限12 个月完成标准为完成甲方指定的运行量2.服务器的日常运行管理时限12 个月3.代扣代缴电费4.其他(空白)。
7.2019 年5 月双方签订《增值服务协议》约定甲方将自有的T2T-30 规格的微型存储空间服务器1542 台委托乙方管理由甲方向乙方支付一定管理费用由乙方向甲方提供相关数据增值服务对于增值服务产生的收益扣除运行成本后甲乙双方按照一定比例进行分配备注增值服务收益与微型存储空间服务的单位TH/s 相关分配收益方式不限于人民币支付甲方最多可将托管的云数据服务器的单位TH/s的 $50 \%$ 进行拆分委托乙方代为出售用户购买后的单位TH/s所产生的收益归购买用户所有结算价格按照当天实际的市场价格进行结算扣除市场销售成本后实时转入甲方提供的收益地址相关费用及支付数据增值服务的电费成本由甲方自行承担按日计算具体价格根据实际上架的数据中心的价格进行计算由后续的《数据增值服务电费计价协议作为补充》云数据服务器上架后2 天内甲方应当向乙方预付498 196 元用于预付部分云数据服务器的电费后续每日的电费支出按当天24 时云数据服务器的增值部分的价值扣除扣除完成后的增值服务收益部分当日划入甲方提供的收益地址单台云数据服务器的放置建设成本为300 元,由甲方承担;数据增值服务产生的收益,按照 $7 \%$ 的比例分配给乙方作为云数据服务器托管过程中乙方的管理和运营收益数据增值服务产生的收益当天进行结算转入甲方提供的接收地址乙方保证将按照厂家提供的环境标准包括但不限于电压、用电环境、温度、湿度、网络带宽、机房密度使用、维护本合同项下云数据服务器在正常使用过程中因不可归责于乙方的原因导致服务器损坏的乙方不承担责任乙方应协助甲方维修或更换服务器设备相关费用由甲方承担甲方云数据服务器根据机型按实测功耗计算电费各服务器机型到现场进行测量功耗后乙方告知甲方经双方认可后固定每月耗电量未经甲方同意乙方不得将托管的云数据服务器挪作他用且不得将同类设备进行调换如云数据服务器出现宕机或TH/s 为零的情况下,乙方必须 30分钟内对云数据服务器设备进行充气或其他处理以保障甲方的利益若检查发现系硬件原因无法解决乙方负责将故障设备进行打包、返场维修产生的费用由甲方承担如因突发情况如供电公司线路检修、机组维护、网络运营商意外断网等导致数据服务器中断运行乙方负责协调处理故障在乙方可控范围内甲方云数据服务器中断运行时间原则上每月不超过48 小时,如停电超出约定时间,乙方将在合同约定的管理时间基础上延长服务时间,并承担服务器的机器放置费用;乙方因自身原因导致托管的甲方服务器损害或者灭失的,应当向甲方承担赔偿责任;合同期限为 2019 年 6 月 30 日至 2020 年 6 月 30日。
8.上述合同签订后,中研智创公司购买并委托第三方矿场实际运营“矿机”。
9. 2019 年7 月15 日甲方中研智创公司与乙方成都毛球华数科技合伙企业有限合伙签订两份《矿机托管服务合同包运维约定甲方将其所拥有的“矿机”置于乙方算力服务中心乙方对甲方矿机提供运维管理服务“矿机”名称为芯动T2T数量分别为1350 台、502 台全新算力26T、30T功耗2200W托管期限以甲方矿机到达乙方算力服务中心并开始运行之日起算分别暂定自2019 年7 月15 日至2019 年10月 25 日止、自 2019 年 6 月 28 日至 2019 年 10 月 25 日止,乙方算力服务中心地址分别为四川省凉山州木里县水洛乡、沙湾乡;托管服务费计量方式均为,按照乙方上架运行机型实测功耗进行核算耗电量 $+ 3 \%$ 电损电费单价按0.239 元/度计算甲方应在本合同签订之日起两个工作日内向乙方支付半个月托管服务费作为本合同履约保证金分别为人民币251 242 元、
94000 元,履约保证金可用于抵扣协议最后一个结算周期的托管服务费,托管服务费支付周期为每半月支付。
10. 合同实际履行过程中2019 年5 月20 日丰复久信公司向中研智创公司支付1000 万元用途备注为货款。中研智创公司曾于2019 年向丰复久信公司交付了18.3463 个比特币。此后未再进行比特币交付,双方故此产生争议,并产生大量微信沟通记录。
微信聊天记录中关于核实设备及比特币产量情况。2019年11 月8 日丰复久信公司称“我们是应该自己有个矿池账号了吧这样是不是我们也可远程监控管理”“现在可以登不中研智创公司称“可以的”“之前走的是另外的体系我们看看怎样把矿池的账号直接对接给你这边吧”“现在不行因为所有机器都是统一管理放在同一个大的账号里面需要切割出来”“两天吧周一可以给你搞好”。11 月12 日,中研智创公司微信称“郭总,请你升级一下注册一下普惠矿场 App挖矿收益以后都在这里查看和提取原来的APP 不再更新了”丰复久信公司回复称“我不清楚你们要干什么现在不能你们让我干嘛我干嘛基本信任已经不存在了。所以任何一个动作我现在都需要做尽调”。11 月25 日双方微信群聊天记录显示中研智创公司员工介绍“这是我们在四川木里那边的现场管理人员”此后双方互相交换了联系电话并沟通丰复久信公司应以何种交通方式前往四川木里。11 月27 日丰复久信公司称“我到了矿场我要看下后台需要个链接”中研智创公司给出网页链接称“这是矿场的链接”。该链接地址网站名称为“币印”现点击链接显示“该观察者链接已失效请重新创建”。12 月7 日丰复久信公司称“什么时候可以提供你的原始资料抓紧核实”。12 月19 日双方沟通矿机搬动情况中研智创公司称“在昭通这边等通知进场”2020 年3 月20 日称矿机在“乐山这边”,丰复久信公司称“这几个月挖了多少币了?郭总的软件登不上去没有信息!什么时间通个电话”。中研智创公司未在微信聊天中明确答复。
12. 关于丰复久信公司向中研智创公司催要比特币情况。微信聊天记录显示2020 年4 月9 日,丰复久信公司询问中研智创公司,“我想知道一下我的机器到底挖了几个币,就这么难吗?”,中研智创公司回复“放心吧,我已经打电话给唐宇了,他会安排老潘落实,我今天没有联系到潘,我答应你的事情算数的”。
4 月 10 日、4 月 17 日、4 月 30 日、6 月 22 日、6 月 23 日、6 月27 日丰复久信公司分别再次询问称“还是这情况呀APP 还是 0”“咱们这事您准备怎么收场啊币也不给钱也不还算力也不卖各种理由您是逼我报官了吧”等中研智创公司未回复6 月28 日回复称“稍晚一会给你打电话”。
13. 关于比特币等虚拟货币及“挖矿”活动的风险防范、整治,国家相关部门曾多次发布《通知》《公告》《风险提示》等政策文件:
1.2013 年12 月,中国人民银行等五部委发布《关于防范比特币风险的通知》指出,比特币不是由货币当局发行,不具有法偿性与强制性等货币属性,并不是真正意义的货币。从性质上看,比特币应当是一种特定的虚拟商品,不具有与货币等同的法律地位,不能且不应作为货币在市场上流通使用。各金融机构和支付机构不得以比特币为产品或服务定价,不得买卖或作为中央对手买卖比特币,不得承保与比特币相关的保险业务或将比特币纳入保险责任范围,不得直接或间接为客户提供其他与比特币相关的服务。
2.2017 年9 月,《中国人民银行、中央网信办、工业和信息化部、工商总局、银监会、证监会、保监会、关于防范代币发行融资风险的公告》,再次强调比特币不具有法偿性与强制性等货币属性,不具有与货币等同的法律地位,不能也不应作为货币在市场上流通使用,并提示,代币发行融资与交易存在多重风险,包括虚假资产风险、经营失败风险、投资炒作风险等,投资者须自行承担投资风险,希望广大投资者谨防上当受骗。
32018 年8 月《中国银行保险监督管理委员会、中央网络安全和信息化领导小组办公室、公安部、中国人民银行、国家市场监督管理总局关于防范以“虚拟货币”“区块链”名义进行非法集资的风险提示》也再次明确作出风险提示。
4.2021 年5 月18 日,中国互联网金融协会、中国银行业协会、中国支付清算协会联合发布《关于防范虚拟货币交易炒作风险的公告》,再次强调正确认识虚拟货币及相关业务活动的本质属性,有关机构不得开展与虚拟货币相关的业务,并特别指出,消费者要提高风险防范意识,“从我国现有司法实践看,虚拟货币交易合同不受法律保护,投资交易造成的后果和引发的损失由相关方自行承担”。
5.2021 年9 月3 日国家发展和改革委员会等部门发布《关于整治虚拟货币“挖矿”活动的通知》发改运行20211283号指出“虚拟货币挖矿活动指通过专用矿机计算生产虚拟货币的过程能源消耗和碳排放量大对国民经济贡献度低对产业发展、科技进步等带动作用有限加之虚拟货币生产、交易环节衍生的风险越发突出其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响。整治虚拟货币挖矿活动对促进我国产业结构优化、推动节能减排、如期实现碳达峰、碳中和目标具有重要意义。”“严禁投资建设增量项目禁止以任何名义发展虚拟货币挖矿项目加快有序退出存量项目。”
“严格执行有关法律法规和规章制度,严肃查处整治各地违规虚拟货币‘挖矿’活动”。
6.2021 年9 月15 日,中国人民银行、中央网信办、最高人民法院等部门联合发布《关于进一步防范和处置虚拟货币交易炒作风险的通知》指出,虚拟货币相关业务活动属于非法金融活动,境外虚拟货币交易所通过互联网向我国境内居民提供服务同样属于非法金融活动,并再次提示,参与虚拟货币投资交易活动存在法律风险,任何法人、非法人组织和自然人投资虚拟货币及相关衍生品,违背公序良俗的,相关民事法律行为无效,由此引发的损失由其自行承担。
14. 上述事实,有丰复久信公司提交的《计算机设备采购合同》《服务合同书》《增值服务协议》、银行转账记录、网页截图、微信聊天记录,有中研智创公司提交的《矿机托管服务合同(包运维)》、微信聊天记录等证据及当事人陈述等在案佐证。
一审法院认为,本案事实发生于民法典实施前,根据《最高人民法院关于适用<中华人民共和国民法典>时间效力的若干规定》,民法典施行前的法律事实引起的民事纠纷案件,适用当时的法律、司法解释的规定,因此本案应适用《中华人民共和国合同法》的相关规定。
15. 根据2021 年9 月3 日国家发展和改革委员会等部门《关于整治虚拟货币“挖矿”活动的通知》虚拟货币“挖矿”活动指通过专用“矿机”计算生产虚拟货币的过程。本案中丰复久信公司与中研智创公司签订《计算机设备采购合同》《服务合同书》《增值服务协议》约定丰复久信公司委托中研智创公司采购微型存储空间服务器并由中研智创公司对计算机服务器进行管理丰复久信公司向中研智创公司支付管理费用中研智创公司提供相关数据增值服务支付增值服务收益。诉讼中中研智创公司陈述其按照三份合同约定代丰复久信公司购买了“矿机”并与第三方公司即“矿场”签订委托合同将“矿机”在“矿场”运行并曾向丰复久信公司交付过18.3463 个比特币。根据上述履约过程及三份合同约定的主要内容,双方的交易模式实际上即为丰复久信公司委托中研智创公司购买并管理专用“矿机”计算生产比特币的“挖矿”行为。三份合同系有机整体,合同目的均系双方为了最终进行“挖矿”活动而签订,双方成立合同关系。该比特币“挖矿”的交易模式,属于国家相关行政机关管控范围,需要严格遵守相关法律法规和规章制度。
《中华人民共和国合同法》第七条规定,“当事人订立、履行合同,应当遵守法律、行政法规,尊重社会公德,不得扰乱社会经济秩序,损害社会公共利益”;第五十二条规定,“有下列情形之一的,合同无效:(一)一方以欺诈、胁迫的手段订立合同,损害国家利益;(二)恶意串通,损害国家、集体或者第三人利益;(三)以合法形式掩盖非法目的;(四)损害社会公共利益;(五)违反法律、行政法规的强制性规定。” 社会公共利益一般指关系到全体社会成员或者社会不特定多数人的利益,主要包括社会公共秩序以及社会善良风俗等,是明确国家和个人权利的行使边界、判断民事法律行为正当性和合法性的重要标准之一。能源安全、金融安全、经济安全等都是国家安全的重要组成部分,防范化解相关风险、深化整治相关市场乱象,均关系到我国的产业结构优化、金融秩序稳定、社会经济平稳运行和高质量发展,故社会经济秩序、金融秩序等均涉及社会公共利益。
17. 根据上述虚拟货币相关《通知》《公告》《风险提示》等文件,本案涉及的比特币为网络虚拟货币,并非国家有权机关发行的法定货币,不具有与法定货币等同的法律地位,不具有法偿性,不应且不能作为货币在市场上流通使用,相关部门多次发布《风险公告》《通知》等文件,提示消费者提高风险防范意识,投资交易虚拟货币造成的后果和引发的损失由相关方自行承担。且本案的交易模式系“挖矿”活动,随着虚拟货币交易的发展,“挖矿”行为的危害日渐凸显。
“挖矿”活动能源消耗和碳排放量大不利于我国产业结构优化、节能减排不利于我国实现碳达峰、碳中和目标。加之虚拟货币相关交易活动无真实价值支撑价格极易被操纵“挖矿”行为也进一步衍生虚假资产风险、经营失败风险、投资炒作风险等相关金融风险危害外汇管理秩序、金融秩序甚至容易引发违法犯罪活动、影响社会稳定。正因“挖矿”行为危害大、风险高其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响相关政策明确拟将虚拟货币“挖矿”活动增补列入《产业结构调整指导目录2019 年本)》“淘汰类”目录,要求采取有效措施,全面整治虚拟货币“挖矿”活动。本案中,丰复久信公司和中研智创公司在明知“挖矿”及比特币交易存在风险,且相关部门明确禁止比特币相关交易的情况下,仍然签订协议形成委托“挖矿”关系。“挖矿”活动及虚拟货币的相关交易行为存在上文论述的诸多风险和危害,干扰了正常的金融秩序、经济发展秩序,故该“挖矿”合同损害社会公共利益,应属无效。
18. 《中华人民共和国合同法》第五十八条规定,“合同无效或者被撤销后,因该合同取得的财产,应当予以返还;不能返还或者没有必要返还的,应当折价补偿。有过错的一方应当赔偿对方因此所受到的损失,双方都有过错的,应当各自承担相应的责任。”本案中,丰复久信公司第一项诉讼请求系基于合同项下权利义务要求中研智创公司支付比特币收益,因“挖矿” 合同自始无效,丰复久信公司通过履行无效合同主张获得的利益不应受到法律保护,对其相应诉讼请求,一审法院不予支持。丰复久信公司第二项诉讼请求主张占用“矿机”设备期间的比特币损失,该损失系丰复久信公司基于持续利用“矿机”从事 “挖矿”活动产生比特币的损失,不应受到法律保护,对其相应诉讼请求,一审法院亦不予支持。
19. 关于本案中“矿机”的处理,因现相关计算机设备仍由中研智创公司保管,但诉讼中,丰复久信公司明确表示其将另行主张,不在本案中要求处理“矿机”返还问题。故一审法院在本案中不再予以处理。但同时需要提醒双方当事人,均应遵守国家相关法律规定和产业政策,案涉计算机等设备不得继续用于比特币等虚拟货币“挖矿”活动,当事人应防范虚拟货币交易风险,自觉维护市场秩序和社会公共利益。
20. 综上,一审法院判决驳回北京丰复久信营销科技有限公司的全部诉讼请求。
21. 二审中,各方均未提交新的证据。本院对一审查明的事实予以确认。
22. 本院认为,比特币及相关经济活动是新型、复杂的,我国监管机构对比特币生产、交易等方面的监管措施建立在对其客观认识的基础上,并不断完善。本案双方挖矿合同从签订至履行后发生争议,纠纷延续至今,亦处于这一过程中。对合同效力的认定,应建立在当下对挖矿活动的客观认识基础上。
23. 2013 年中国人民银行等五部委发布通知禁止金融机构对比特币进行定价不得买卖或作为中央对手买卖比特币不得直接或间接为客户提供其他与比特币相关的服务。2017 年中国人民银行等七部门联合发布《关于防范代币发行融资风险的公告》进一步提出任何所谓的代币融资交易平台不得从事法定货币与代币、“虚拟货币”相互之间的兑换业务不得买卖或作为中央对手方买卖代币或“虚拟货币”不得为代币或“虚拟货币”提供定价、信息中介等服务。上述两个文件实质上禁止了比特币在我国相关平台的兑付、交易。2021 年,中国人民银行等部门《关于进一步防范和处置虚拟货币交易炒作风险的通知》显示,虚拟货币交易炒作活动扰乱经济金融秩序,滋生赌博、非法集资、诈骗、传销、洗钱等违法犯罪活动,严重危害人民群众财产安全和国家金融安全。
24. 2021 年9 月3 日国家发展和改革委员会等部门《关于整治虚拟货币“挖矿”活动的通知》显示,虚拟货币挖矿活动能源消耗和碳排放量大,对国民经济贡献度低,对产业发展、科技进步等带动作用有限,加之虚拟货币生产、交易环节衍生的风险越发突出,其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响。故以电力资源、碳排放量为代价的“挖矿”行为,与经济社会高质量发展和碳达峰、碳中和目标相悖,与公共利益相悖。
5. 丰复久信公司主张双方合同签订时并无明确的法律规范禁止比特币“挖矿”活动,故应保障当事人的信赖利益,认定涉案合同有效一节,本院认为,当事人之间基于投资目的进行“挖矿”,并通过电子方式转让、储存以及交易的行为,实际经济追求是为了通过比特币与法定货币的兑换直接获取法定货币体系下的利益。丰复久信公司作为营利法人,在庭审中表示投资比特币仅系持有,本院难以采信。在监管机构禁止了比特币在我国相关平台的兑付、交易,且数次提示比特币投资风险的情况下,双方为获取高额利润,仍从事“挖矿”行为,现丰复久信公司以保障其信赖利益主张合同有效依据不足,本院不予采纳。
26. 综上,相关部门整治虚拟货币“挖矿”活动、认定虚拟货币相关业务活动属于非法金融活动,有利于保障我国发展利益和金融安全。从“挖矿”行为的高能耗以及比特币交易活动对国家金融秩序和社会秩序的影响来看,一审法院认定涉案合同无效是正确的。双方作为社会主义市场经济主体,既应遵守市场经济规则,亦应承担起相应的社会责任,推动经济社会高质量发展、可持续发展。
27. 关于合同无效后的返还问题,一审法院未予处理,双方可另行解决。
28. 综上所述,丰复久信公司的上诉请求不能成立,应予驳回;一审判决并无不当,应予维持。依照《中华人民共和国民事诉讼法》第一百七十七条第一款第一项规定,判决如下:
驳回上诉,维持原判。
二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
29. 本判决为终审判决。
审 判 长 史晓霞
审 判 员 邓青菁
审 判 员 李 淼
二〇二二年七月七日
法 官 助 理 黎 铧
书 记 员 郑海兴

Binary file not shown.

sample_doc/short_doc.md Normal file
@@ -0,0 +1,43 @@
# 北京市第三中级人民法院民事判决书
(2022)京 03 民终 3852 号
上诉人原审原告北京丰复久信营销科技有限公司住所地北京市海淀区北小马厂6 号1 号楼华天大厦1306 室。
法定代表人:郭东军,执行董事、经理。委托诉讼代理人:周大海,北京市康达律师事务所律师。委托诉讼代理人:王乃哲,北京市康达律师事务所律师。
被上诉人原审被告中研智创区块链技术有限公司住所地天津市津南区双港镇工业园区优谷产业园5 号楼-1505。
法定代表人:王欢子,总经理。
委托诉讼代理人:魏鑫,北京市昊衡律师事务所律师。
1.上诉人北京丰复久信营销科技有限公司以下简称丰复久信公司因与被上诉人中研智创区块链技术有限公司以下简称中研智创公司服务合同纠纷一案不服北京市朝阳区人民法院2020京0105 民初69754 号民事判决,向本院提起上诉。本院立案后,依法组成合议庭开庭进行了审理。上诉人丰复久信公司之委托诉讼代理人周大海、王乃哲,被上诉人中研智创公司之委托诉讼代理人魏鑫到庭参加诉讼。本案现已审理终结。
2.丰复久信公司上诉请求1.撤销一审判决发回重审或依法改判支持丰复久信公司一审全部诉讼请求2.或在维持原判的同时判令中研智创公司向丰复久信公司返还 1000 万元款项,并赔偿丰复久信公司因此支付的律师费 220 万元3.判令中研智创公司承担本案一审、二审全部诉讼费用。事实与理由一、根据2019 年的政策导向丰复久信公司的投资行为并无任何法律或政策瑕疵。丰复久信公司仅投资挖矿没有购买比特币故在当时国家、政府层面有相关政策支持甚至鼓励的前提下一审法院仅凭“挖矿”行为就得出丰复久信公司扰乱金融秩序的结论是错误的。二、一审法院没有全面、深入审查相关事实且遗漏了最核心的数据调查工作。三、本案一审判决适用法律错误。涉案合同成立及履行期间并无合同无效的情形当属有效。一审法院以挖矿活动耗能巨大、不利于我国产业结构调整为依据之一作出合同无效的判决实属牵强。最高人民法院发布的全国法院系统2020 年度优秀案例分析评选活动获奖名单中,由上海市第一中级人民法院刘江法官编写的“李圣艳、布兰登·斯密特诉闫向东、李敏等财产损害赔偿纠纷案— —比特币的法律属性及其司法救济”一案入选,该案同样发生在丰复久信公司与中研智创公司合同履行过程中,一审法院认定同时期同类型的涉案合同无效,与上述最高人民法院的优秀案例相悖。四、一审法院径行认定合同无效,未向丰复久信公司进行释明构成程序违法。
3.中研智创公司辩称,同意一审判决,不同意丰复久信公司的上诉请求。首先,一审法院曾在庭审中询问丰复久信公司关于机器返还的问题,一审法院进行了释明。其次,如二审法院对其该项上诉请求进行判决,会剥夺中研智创公司针对该部分请求再行上诉的权利。
4.丰复久信公司向一审法院起诉请求1.中研智创公司交付278.1654976 个比特币,或者按照 2021 年 1 月 25 日比特币的价格交付9550812.36 美元2.中研智创公司赔偿丰复久信公司服务期到期后占用微型存储空间服务器的损失(自2020 年7 月1日起至实际返还服务器时止按照bitinfocharts 网站公布的相关日产比特币数据计算应赔偿比特币数量或按照2021 年1 月25 日比特币的价格交付美元)。
5.一审法院查明事实2019 年5 月6 日,丰复久信公司作为甲方(买方)与乙方(卖方)中研智创公司签订《计算机设备采购合
同》约定货物名称为计算机设备型号规格及数量为T2T-30T 规格型号的微型存储空间服务器1542 台单价5040/ 台合同金额为 7 771 680 元;交货期 2019 年 8 月 31 日前;交货方式为乙方自行送货到甲方所在地,并提供安装服务,运输工具及运费由乙方负责;交货地点北京;签订购货合同,设备安装完毕后一次性支付项目总货款;乙方提供货物的质量保证期为自交货验收结束之日起不少于十二个月(具体按清单要求);乙方交货前应对产品作出全面检查和对验收文件进行整理,并列出清单,作为甲方收货验收和使用的技术条件依据,检验的结果应随货物交甲方,甲方对乙方提供的货物在使用前进行调试时,乙方协助甲方一起调试,直到符合技术要求,甲方才做最终验收,验收时乙方必须在现场,验收完毕后作出验收结果报告,并经双方签字生效。
6.同日丰复久信公司作为甲方客户方与乙方中研智创公司服务方签订《服务合同书》约定乙方同意就采购合同中的微型存储空间服务器向甲方提供特定服务服务的内容包括质保、维修、服务器设备代为运行管理、代为缴纳服务器相关用度花费如电费等详细内容见附件一如果乙方在工作中因自身过错而发生任何错误或遗漏应无条件更正不另外收费并对因此而对甲方造成的损失承担赔偿责任赔偿额以本合同约定的服务费为限若因甲方原因造成工作延误将由甲方承担相应的损失服务费总金额为2 228 320 元甲乙双方一致同意项目服务费以人民币形式于本合同签订后3 日内一次性支付甲方可以提前10 个工作日以书面形式要求变更或增加所提供的服务该等变更最终应由双方商定认可其中包括与该等变更有关的任何费用调整等。合同后附附件一以表格形式列明1.1542 台T2T-30T 微型存储空间服务器的质保、维修时限12 个月完成标准为完成甲方指定的运行量2.服务器的日常运行管理时限12 个月3.代扣代缴电费4.其他(空白)。
24. 2021 年9 月3 日国家发展和改革委员会等部门《关于整治虚拟货币“挖矿”活动的通知》显示,虚拟货币挖矿活动能源消耗和碳排放量大,对国民经济贡献度低,对产业发展、科技进步等带动作用有限,加之虚拟货币生产、交易环节衍生的风险越发突出,其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响。故以电力资源、碳排放量为代价的“挖矿”行为,与经济社会高质量发展和碳达峰、碳中和目标相悖,与公共利益相悖。
26. 综上,相关部门整治虚拟货币“挖矿”活动、认定虚拟货币相关业务活动属于非法金融活动,有利于保障我国发展利益和金融安全。从“挖矿”行为的高能耗以及比特币交易活动对国家金融秩序和社会秩序的影响来看,一审法院认定涉案合同无效是正确的。双方作为社会主义市场经济主体,既应遵守市场经济规则,亦应承担起相应的社会责任,推动经济社会高质量发展、可持续发展。
27. 关于合同无效后的返还问题,一审法院未予处理,双方可另行解决。
28. 综上所述,丰复久信公司的上诉请求不能成立,应予驳回;一审判决并无不当,应予维持。依照《中华人民共和国民事诉讼法》第一百七十七条第一款第一项规定,判决如下:
驳回上诉,维持原判。
二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
29. 本判决为终审判决。
审 判 长 史晓霞
审 判 员 邓青菁
审 判 员 李 淼
二〇二二年七月七日
法 官 助 理 黎 铧
书 记 员 郑海兴

setup-unified-docker.sh Normal file
@@ -0,0 +1,110 @@
#!/bin/bash
# Unified Docker Compose Setup Script
# This script helps set up the unified Docker Compose environment
set -e
echo "🚀 Setting up Unified Docker Compose Environment"
# Function to check if Docker is running
check_docker() {
if ! docker info > /dev/null 2>&1; then
echo "❌ Docker is not running. Please start Docker and try again."
exit 1
fi
echo "✅ Docker is running"
}
# Function to stop existing individual services
stop_individual_services() {
echo "🛑 Stopping individual Docker Compose services..."
if [ -f "backend/docker-compose.yml" ]; then
echo "Stopping backend services..."
cd backend && docker-compose down 2>/dev/null || true && cd ..
fi
if [ -f "frontend/docker-compose.yml" ]; then
echo "Stopping frontend services..."
cd frontend && docker-compose down 2>/dev/null || true && cd ..
fi
if [ -f "mineru/docker-compose.yml" ]; then
echo "Stopping mineru services..."
cd mineru && docker-compose down 2>/dev/null || true && cd ..
fi
echo "✅ Individual services stopped"
}
# Function to create necessary directories
create_directories() {
echo "📁 Creating necessary directories..."
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
echo "✅ Directories created"
}
# Function to check if unified docker-compose.yml exists
check_unified_compose() {
if [ ! -f "docker-compose.yml" ]; then
echo "❌ Unified docker-compose.yml not found in current directory"
echo "Please run this script from the project root directory"
exit 1
fi
echo "✅ Unified docker-compose.yml found"
}
# Function to build and start services
start_unified_services() {
echo "🔨 Building and starting unified services..."
# Build all services
docker-compose build
# Start services
docker-compose up -d
echo "✅ Unified services started"
}
# Function to check service status
check_service_status() {
echo "📊 Checking service status..."
docker-compose ps
echo ""
echo "🌐 Service URLs:"
echo "Frontend: http://localhost:3000"
echo "Backend API: http://localhost:8000"
echo "Mineru API: http://localhost:8001"
echo ""
echo "📝 To view logs: docker-compose logs -f [service-name]"
echo "📝 To stop services: docker-compose down"
}
# Main execution
main() {
echo "=========================================="
echo "Unified Docker Compose Setup"
echo "=========================================="
check_docker
check_unified_compose
stop_individual_services
create_directories
start_unified_services
check_service_status
echo ""
echo "🎉 Setup complete! Your unified Docker environment is ready."
echo "Check the DOCKER_COMPOSE_README.md for more information."
}
# Run main function
main "$@"

@@ -1,31 +0,0 @@
# settings.py
from pydantic_settings import BaseSettings
from typing import Optional
class Settings(BaseSettings):
# Storage paths
OBJECT_STORAGE_PATH: str = "/Users/tigeren/Dev/digisky/legal-doc-masker/src_folder"
TARGET_DIRECTORY_PATH: str = "/Users/tigeren/Dev/digisky/legal-doc-masker/target_folder"
# Ollama API settings
OLLAMA_API_URL: str = "https://api.ollama.com"
OLLAMA_API_KEY: str = ""
OLLAMA_MODEL: str = "llama2"
# File monitoring settings
MONITOR_INTERVAL: int = 5
# Logging settings
LOG_LEVEL: str = "INFO"
LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
LOG_FILE: str = "app.log"
class Config:
env_file = ".env"
env_file_encoding = "utf-8"
extra = "allow"
# Create settings instance
settings = Settings()

@@ -1,17 +0,0 @@
from config.logging_config import setup_logging
def main():
# Setup logging first
setup_logging()
from services.file_monitor import FileMonitor
from config.settings import settings
# Initialize the file monitor
file_monitor = FileMonitor(settings.OBJECT_STORAGE_PATH, settings.TARGET_DIRECTORY_PATH)
# Start monitoring the directory for new files
file_monitor.start_monitoring()
if __name__ == "__main__":
main()

@@ -1,18 +0,0 @@
from abc import ABC, abstractmethod
from typing import Any
class DocumentProcessor(ABC):
@abstractmethod
def read_content(self) -> str:
"""Read document content"""
pass
@abstractmethod
def process_content(self, content: str) -> str:
"""Process document content"""
pass
@abstractmethod
def save_content(self, content: str) -> None:
"""Save processed content"""
pass

@@ -1,5 +0,0 @@
from models.processors.txt_processor import TxtDocumentProcessor
from models.processors.docx_processor import DocxDocumentProcessor
from models.processors.pdf_processor import PdfDocumentProcessor
__all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor']

@@ -1,20 +0,0 @@
import docx
from models.document_processor import DocumentProcessor
class DocxDocumentProcessor(DocumentProcessor):
def __init__(self, input_path: str, output_path: str):
self.input_path = input_path
self.output_path = output_path
def read_content(self) -> str:
doc = docx.Document(self.input_path)
return '\n'.join([paragraph.text for paragraph in doc.paragraphs])
def process_content(self, content: str) -> str:
# Implementation for processing docx content
return content
def save_content(self, content: str) -> None:
doc = docx.Document()
doc.add_paragraph(content)
doc.save(self.output_path)

@@ -1,20 +0,0 @@
import PyPDF2
from models.document_processor import DocumentProcessor
class PdfDocumentProcessor(DocumentProcessor):
def __init__(self, input_path: str, output_path: str):
self.input_path = input_path
self.output_path = output_path
def read_content(self) -> str:
with open(self.input_path, 'rb') as file:
pdf_reader = PyPDF2.PdfReader(file)
return ' '.join([page.extract_text() for page in pdf_reader.pages])
def process_content(self, content: str) -> str:
# Implementation for processing PDF content
return content
def save_content(self, content: str) -> None:
# Implementation for saving as PDF
pass

@@ -1,18 +0,0 @@
from models.document_processor import DocumentProcessor
class TxtDocumentProcessor(DocumentProcessor):
def __init__(self, input_path: str, output_path: str):
self.input_path = input_path
self.output_path = output_path
def read_content(self) -> str:
with open(self.input_path, 'r', encoding='utf-8') as file:
return file.read()
def process_content(self, content: str) -> str:
# Implementation for processing text content
return content
def save_content(self, content: str) -> None:
with open(self.output_path, 'w', encoding='utf-8') as file:
file.write(content)

@@ -1,24 +0,0 @@
import logging
logger = logging.getLogger(__name__)
class FileMonitor:
def __init__(self, directory, callback):
self.directory = directory
self.callback = callback
def start_monitoring(self):
import time
import os
already_seen = set(os.listdir(self.directory))
while True:
time.sleep(1) # Check every second
current_files = set(os.listdir(self.directory))
new_files = current_files - already_seen
for new_file in new_files:
logger.info(f"monitor: new file found: {new_file}")
self.callback(os.path.join(self.directory, new_file))
already_seen = current_files

@@ -1,15 +0,0 @@
class OllamaClient:
def __init__(self, model_name):
self.model_name = model_name
def process_document(self, document_text):
# Here you would implement the logic to interact with the Ollama API
# and process the document text using the specified model.
# This is a placeholder for the actual API call.
processed_text = self._mock_api_call(document_text)
return processed_text
def _mock_api_call(self, document_text):
# Mock processing: In a real implementation, this would call the Ollama API.
# For now, it just returns the input text with a note indicating it was processed.
return f"Processed with {self.model_name}: {document_text}"

@@ -1,58 +0,0 @@
# README.md
# Document Processing App
This project is designed to process legal documents by hiding sensitive information such as names and company names. It utilizes the Ollama API with selected models for text processing. The application monitors a specified directory for new files, processes them automatically, and saves the results to a target path.
## Project Structure
```
doc-processing-app
├── src
│ ├── main.py # Entry point of the application
│ ├── config
│ │ └── settings.py # Configuration settings for paths
│ ├── services
│ │ ├── file_monitor.py # Monitors directory for new files
│ │ ├── document_processor.py # Handles document processing logic
│ │ └── ollama_client.py # Interacts with the Ollama API
│ ├── utils
│ │ └── file_utils.py # Utility functions for file operations
│ └── models
│ └── document.py # Represents the structure of a document
├── tests
│ └── test_document_processor.py # Unit tests for DocumentProcessor
├── requirements.txt # Project dependencies
├── .env.example # Example environment variables
└── README.md # Project documentation
```
## Setup Instructions
1. Clone the repository:
```
git clone <repository-url>
cd doc-processing-app
```
2. Install the required dependencies:
```
pip install -r requirements.txt
```
3. Configure the application by editing the `src/config/settings.py` file to set the paths for the object storage and target directory.
4. Create a `.env` file based on the `.env.example` file to set up necessary environment variables.
## Usage
To run the application, execute the following command:
```
python src/main.py
```
The application will start monitoring the specified directory for new documents. Once a new document is added, it will be processed automatically.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.