Merge pull request 'feature-ner-keyword-detect' (#1) from feature-ner-keyword-detect into main
Reviewed-on: #1
This commit is contained in: commit 56c718d658

19  .env.example
@@ -1,19 +0,0 @@

# Storage paths
OBJECT_STORAGE_PATH=/path/to/mounted/object/storage
TARGET_DIRECTORY_PATH=/path/to/target/directory

# Ollama API Configuration
OLLAMA_API_URL=https://api.ollama.com
OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=llama2

# Application Settings
MONITOR_INTERVAL=5

# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log

# Optional: Additional security settings
# MAX_FILE_SIZE=10485760  # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf
@@ -70,4 +70,7 @@ app.log

__pycache__
data/doc_dest
data/doc_src
data/doc_intermediate

node_modules
backend/storage/
@@ -0,0 +1,206 @@

# Unified Docker Compose Setup

This project now includes a unified Docker Compose configuration that allows all services (mineru, backend, frontend) to run together and communicate using service names.

## Architecture

The unified setup includes the following services:

- **mineru-api**: Document processing service (port 8001)
- **backend-api**: Main API service (port 8000)
- **celery-worker**: Background task processor
- **redis**: Message broker for Celery
- **frontend**: React frontend application (port 3000)

## Network Configuration

All services are connected through a custom bridge network called `app-network`, allowing them to communicate using service names:

- Backend → Mineru: `http://mineru-api:8000`
- Frontend → Backend: `http://localhost:8000/api/v1` (external access)
- Backend → Redis: `redis://redis:6379/0`
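In compose syntax, such a bridge network is declared once at the top level and referenced by each service. The project's actual `docker-compose.yml` is not part of this diff, so the sketch below is illustrative:

```yaml
networks:
  app-network:
    driver: bridge

services:
  backend-api:
    networks:
      - app-network
  mineru-api:
    networks:
      - app-network
```

Within `app-network`, Docker's embedded DNS resolves each service name (e.g. `mineru-api`) to that container's address.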
## Usage

### Starting all services

```bash
# From the root directory
docker-compose up -d
```

### Starting specific services

```bash
# Start only backend and mineru
docker-compose up -d backend-api mineru-api redis

# Start only frontend and backend
docker-compose up -d frontend backend-api redis
```

### Stopping services

```bash
# Stop all services
docker-compose down

# Stop and remove volumes
docker-compose down -v
```

### Viewing logs

```bash
# View all logs
docker-compose logs -f

# View specific service logs
docker-compose logs -f backend-api
docker-compose logs -f mineru-api
docker-compose logs -f frontend
```

## Building Services

### Building all services

```bash
# Build all services
docker-compose build

# Build and start all services
docker-compose up -d --build
```

### Building individual services

```bash
# Build only backend
docker-compose build backend-api

# Build only frontend
docker-compose build frontend

# Build only mineru
docker-compose build mineru-api

# Build multiple specific services
docker-compose build backend-api frontend
```

### Building and restarting specific services

```bash
# Build and restart only backend
docker-compose build backend-api
docker-compose up -d backend-api

# Or combine in one command
docker-compose up -d --build backend-api

# Build and restart backend and celery worker
docker-compose up -d --build backend-api celery-worker
```

### Force rebuild (no cache)

```bash
# Force rebuild all services
docker-compose build --no-cache

# Force rebuild specific service
docker-compose build --no-cache backend-api
```

## Environment Variables

The unified setup uses environment variables from the individual service `.env` files:

- `./backend/.env` - Backend configuration
- `./frontend/.env` - Frontend configuration
- `./mineru/.env` - Mineru configuration (if it exists)

### Key Configuration Changes

1. **Backend Configuration** (`backend/app/core/config.py`):
   ```python
   MINERU_API_URL: str = "http://mineru-api:8000"
   ```

2. **Frontend Configuration**:
   ```env
   REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
   ```

## Service Dependencies

- `backend-api` depends on `redis` and `mineru-api`
- `celery-worker` depends on `redis` and `backend-api`
- `frontend` depends on `backend-api`
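Expressed in compose syntax, these dependencies correspond to `depends_on` entries along these lines (a sketch, not the project's actual file):

```yaml
services:
  backend-api:
    depends_on:
      - redis
      - mineru-api
  celery-worker:
    depends_on:
      - redis
      - backend-api
  frontend:
    depends_on:
      - backend-api
```

Note that plain `depends_on` only orders container startup; it does not wait for a service to be ready unless combined with a healthcheck condition.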
## Port Mapping

- **Frontend**: `http://localhost:3000`
- **Backend API**: `http://localhost:8000`
- **Mineru API**: `http://localhost:8001`
- **Redis**: `localhost:6379`
## Health Checks

The mineru-api service includes a health check that verifies the service is running properly.
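The compose file itself is not shown in this diff, but a typical healthcheck for a service with a `/health` endpoint (the endpoint the migration guide probes with `curl -I http://localhost:8001/health`) looks roughly like this; the test command and timings here are assumptions:

```yaml
services:
  mineru-api:
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
```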
## Development vs Production

For development, you can still use the individual docker-compose files in each service directory. The unified setup is ideal for:

- Production deployments
- End-to-end testing
- A simplified development environment

## Troubleshooting

### Service Communication Issues

If services can't communicate:

1. Check if all services are running: `docker-compose ps`
2. Verify network connectivity: `docker network ls`
3. Check service logs: `docker-compose logs [service-name]`

### Port Conflicts

If you get port conflicts, you can modify the port mappings in the `docker-compose.yml` file:

```yaml
ports:
  - "8002:8000"  # Change external port
```

### Volume Issues

Make sure the storage directories exist:

```bash
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
```

## Migration from Individual Compose Files

If you were previously using individual docker-compose files:

1. Stop all individual services:
   ```bash
   cd backend && docker-compose down
   cd ../frontend && docker-compose down
   cd ../mineru && docker-compose down
   ```

2. Start the unified setup:
   ```bash
   cd .. && docker-compose up -d
   ```

The unified setup maintains the same functionality while providing better service discovery and networking.
@@ -0,0 +1,399 @@

# Docker Image Migration Guide

This guide explains how to export your built Docker images, transfer them to another environment, and run them without rebuilding.

## Overview

The migration process involves:

1. **Export**: Save built images to tar files
2. **Transfer**: Copy tar files to the target environment
3. **Import**: Load images on the target environment
4. **Run**: Start services with the imported images

## Prerequisites

### Source Environment (where images are built)

- Docker installed and running
- All services built and working
- Sufficient disk space for image export

### Target Environment (where images will run)

- Docker installed and running
- Sufficient disk space for image import
- Network access to the source environment (or a USB drive)

## Step 1: Export Docker Images

### 1.1 List Current Images

First, check what images you have:

```bash
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
```

You should see images like:

- `legal-doc-masker-backend-api`
- `legal-doc-masker-frontend`
- `legal-doc-masker-mineru-api`
- `redis:alpine`

### 1.2 Export Individual Images

Create a directory for exports:

```bash
mkdir -p docker-images-export
cd docker-images-export
```

Export each image:

```bash
# Export backend image
docker save legal-doc-masker-backend-api:latest -o backend-api.tar

# Export frontend image
docker save legal-doc-masker-frontend:latest -o frontend.tar

# Export mineru image
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar

# Export redis image (if not using official)
docker save redis:alpine -o redis.tar
```

### 1.3 Export All Images at Once (Alternative)

If you want to export all images in one command:

```bash
# Export all project images
docker save \
  legal-doc-masker-backend-api:latest \
  legal-doc-masker-frontend:latest \
  legal-doc-masker-mineru-api:latest \
  redis:alpine \
  -o legal-doc-masker-all.tar
```

### 1.4 Verify Export Files

Check the exported files:

```bash
ls -lh *.tar
```

You should see files like:

- `backend-api.tar` (~200-500MB)
- `frontend.tar` (~100-300MB)
- `mineru-api.tar` (~1-3GB)
- `redis.tar` (~30-50MB)

## Step 2: Transfer Images

### 2.1 Transfer via Network (SCP/RSYNC)

```bash
# Transfer to remote server
scp *.tar user@remote-server:/path/to/destination/

# Or using rsync (more efficient for large files)
rsync -avz --progress *.tar user@remote-server:/path/to/destination/
```

### 2.2 Transfer via USB Drive

```bash
# Copy to USB drive
cp *.tar /Volumes/USB_DRIVE/docker-images/

# Or create a compressed archive
tar -czf legal-doc-masker-images.tar.gz *.tar
cp legal-doc-masker-images.tar.gz /Volumes/USB_DRIVE/
```

### 2.3 Transfer via Cloud Storage

```bash
# Upload to cloud storage (example with AWS S3)
aws s3 cp *.tar s3://your-bucket/docker-images/

# Or using Google Cloud Storage
gsutil cp *.tar gs://your-bucket/docker-images/
```

## Step 3: Import Images on Target Environment

### 3.1 Prepare Target Environment

```bash
# Create directory for images
mkdir -p docker-images-import
cd docker-images-import

# Copy images from transfer method
# (SCP, USB, or download from cloud storage)
```

### 3.2 Import Individual Images

```bash
# Import backend image
docker load -i backend-api.tar

# Import frontend image
docker load -i frontend.tar

# Import mineru image
docker load -i mineru-api.tar

# Import redis image
docker load -i redis.tar
```

### 3.3 Import All Images at Once (if exported together)

```bash
docker load -i legal-doc-masker-all.tar
```

### 3.4 Verify Imported Images

```bash
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
```

## Step 4: Prepare Target Environment

### 4.1 Copy Project Files

Transfer the following files to the target environment:

```bash
# Essential files to copy
docker-compose.yml
DOCKER_COMPOSE_README.md
setup-unified-docker.sh

# Environment files (if they exist)
backend/.env
frontend/.env
mineru/.env

# Storage directories (if you want to preserve data)
backend/storage/
mineru/storage/
backend/legal_doc_masker.db
```

### 4.2 Create Directory Structure

```bash
# Create necessary directories
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
```

## Step 5: Run Services

### 5.1 Start All Services

```bash
# Start all services using imported images
docker-compose up -d
```

### 5.2 Verify Services

```bash
# Check service status
docker-compose ps

# Check service logs
docker-compose logs -f
```

### 5.3 Test Endpoints

```bash
# Test frontend
curl -I http://localhost:3000

# Test backend API
curl -I http://localhost:8000/api/v1

# Test mineru API
curl -I http://localhost:8001/health
```

## Automation Scripts

### Export Script

Create `export-images.sh`:

```bash
#!/bin/bash

set -e

echo "🚀 Exporting Docker Images"

# Create export directory
mkdir -p docker-images-export
cd docker-images-export

# Export images
echo "📦 Exporting backend-api image..."
docker save legal-doc-masker-backend-api:latest -o backend-api.tar

echo "📦 Exporting frontend image..."
docker save legal-doc-masker-frontend:latest -o frontend.tar

echo "📦 Exporting mineru-api image..."
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar

echo "📦 Exporting redis image..."
docker save redis:alpine -o redis.tar

# Show file sizes
echo "📊 Export complete. File sizes:"
ls -lh *.tar

echo "✅ Images exported successfully!"
```

### Import Script

Create `import-images.sh`:

```bash
#!/bin/bash

set -e

echo "🚀 Importing Docker Images"

# Check if tar files exist
if [ ! -f "backend-api.tar" ]; then
    echo "❌ backend-api.tar not found"
    exit 1
fi

# Import images
echo "📦 Importing backend-api image..."
docker load -i backend-api.tar

echo "📦 Importing frontend image..."
docker load -i frontend.tar

echo "📦 Importing mineru-api image..."
docker load -i mineru-api.tar

echo "📦 Importing redis image..."
docker load -i redis.tar

# Verify imports
echo "📊 Imported images:"
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker

echo "✅ Images imported successfully!"
```

## Troubleshooting

### Common Issues

1. **Image not found during import**
   ```bash
   # Check if image exists
   docker images | grep image-name

   # Re-export if needed
   docker save image-name:tag -o image-name.tar
   ```

2. **Port conflicts on target environment**
   ```bash
   # Check what's using the ports
   lsof -i :8000
   lsof -i :8001
   lsof -i :3000

   # Modify docker-compose.yml if needed
   ports:
     - "8002:8000"  # Change external port
   ```

3. **Permission issues**
   ```bash
   # Fix file permissions
   chmod +x setup-unified-docker.sh
   chmod +x export-images.sh
   chmod +x import-images.sh
   ```

4. **Storage directory issues**
   ```bash
   # Create directories with proper permissions
   sudo mkdir -p backend/storage
   sudo mkdir -p mineru/storage/uploads
   sudo mkdir -p mineru/storage/processed
   sudo chown -R $USER:$USER backend/storage mineru/storage
   ```

### Performance Optimization

1. **Compress images for transfer**
   ```bash
   # Compress before transfer
   gzip *.tar

   # Decompress on target
   gunzip *.tar.gz
   ```

2. **Use parallel transfer**
   ```bash
   # Transfer multiple files in parallel
   parallel scp {} user@server:/path/ ::: *.tar
   ```

3. **Use a Docker registry (alternative)**
   ```bash
   # Push to registry
   docker tag legal-doc-masker-backend-api:latest your-registry/backend-api:latest
   docker push your-registry/backend-api:latest

   # Pull on target
   docker pull your-registry/backend-api:latest
   ```

## Complete Migration Checklist

- [ ] Export all Docker images
- [ ] Transfer image files to the target environment
- [ ] Transfer project configuration files
- [ ] Import images on the target environment
- [ ] Create necessary directories
- [ ] Start services
- [ ] Verify all services are running
- [ ] Test all endpoints
- [ ] Update any environment-specific configurations

## Security Considerations

1. **Secure transfer**: Use encrypted transfer methods (SCP, SFTP)
2. **Image verification**: Verify image integrity after transfer
3. **Environment isolation**: Ensure the target environment is properly secured
4. **Access control**: Limit access to the Docker daemon on the target environment
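One way to make the "verify image integrity" step concrete is to record SHA-256 checksums before transfer and recompute them on the target. A minimal sketch in Python (equivalent to `sha256sum *.tar` on the source and `sha256sum -c` on the target); file names are illustrative:

```python
import hashlib


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 so multi-gigabyte image tars need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks and feed each into the running digest.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run it over each exported `.tar` before transfer, save the hex digests, and compare them against fresh digests computed on the target before loading the images.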
## Cost Optimization

1. **Image size**: Remove unnecessary layers before export
2. **Compression**: Use compression for large images
3. **Selective transfer**: Only transfer the images you need
4. **Cleanup**: Remove old images after successful migration
48  Dockerfile

@@ -1,48 +0,0 @@
# Build stage
FROM python:3.12-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first to leverage Docker cache
COPY requirements.txt .
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt

# Final stage
FROM python:3.12-slim

WORKDIR /app

# Create non-root user
RUN useradd -m -r appuser && \
    chown appuser:appuser /app

# Copy wheels from builder
COPY --from=builder /app/wheels /wheels
COPY --from=builder /app/requirements.txt .

# Install dependencies
RUN pip install --no-cache /wheels/*

# Copy application code
COPY src/ ./src/

# Create directories for mounted volumes
RUN mkdir -p /data/input /data/output && \
    chown -R appuser:appuser /data

# Switch to non-root user
USER appuser

# Environment variables
ENV PYTHONPATH=/app \
    OBJECT_STORAGE_PATH=/data/input \
    TARGET_DIRECTORY_PATH=/data/output

# Run the application
CMD ["python", "src/main.py"]
@@ -0,0 +1,178 @@

# Docker Migration Quick Reference

## 🚀 Quick Migration Process

### Source Environment (Export)

```bash
# 1. Build images first (if not already built)
docker-compose build

# 2. Export all images
./export-images.sh

# 3. Transfer files to target environment
# Option A: SCP
scp -r docker-images-export-*/ user@target-server:/path/to/destination/

# Option B: USB Drive
cp -r docker-images-export-*/ /Volumes/USB_DRIVE/

# Option C: Compressed archive
scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/
```

### Target Environment (Import)

```bash
# 1. Copy project files
scp docker-compose.yml user@target-server:/path/to/destination/
scp DOCKER_COMPOSE_README.md user@target-server:/path/to/destination/

# 2. Import images
./import-images.sh

# 3. Start services
docker-compose up -d

# 4. Verify
docker-compose ps
```

## 📋 Essential Files to Transfer

### Required Files

- `docker-compose.yml` - Unified compose configuration
- `DOCKER_COMPOSE_README.md` - Documentation
- `backend/.env` - Backend environment variables
- `frontend/.env` - Frontend environment variables
- `mineru/.env` - Mineru environment variables (if it exists)

### Optional Files (for data preservation)

- `backend/storage/` - Backend storage directory
- `mineru/storage/` - Mineru storage directory
- `backend/legal_doc_masker.db` - Database file

## 🔧 Common Commands

### Export Commands

```bash
# Manual export
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
docker save legal-doc-masker-frontend:latest -o frontend.tar
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
docker save redis:alpine -o redis.tar

# Compress for transfer
tar -czf legal-doc-masker-images.tar.gz *.tar
```

### Import Commands

```bash
# Manual import
docker load -i backend-api.tar
docker load -i frontend.tar
docker load -i mineru-api.tar
docker load -i redis.tar

# Extract compressed archive
tar -xzf legal-doc-masker-images.tar.gz
```

### Service Management

```bash
# Start all services
docker-compose up -d

# Stop all services
docker-compose down

# View logs
docker-compose logs -f [service-name]

# Check status
docker-compose ps
```

### Building Individual Services

```bash
# Build specific service only
docker-compose build backend-api
docker-compose build frontend
docker-compose build mineru-api

# Build and restart specific service
docker-compose up -d --build backend-api

# Force rebuild (no cache)
docker-compose build --no-cache backend-api

# Using the build script
./build-service.sh backend-api --restart
./build-service.sh frontend --no-cache
./build-service.sh backend-api celery-worker
```

## 🌐 Service URLs

After successful migration:

- **Frontend**: http://localhost:3000
- **Backend API**: http://localhost:8000
- **Mineru API**: http://localhost:8001

## ⚠️ Troubleshooting

### Port Conflicts

```bash
# Check what's using ports
lsof -i :8000
lsof -i :8001
lsof -i :3000

# Modify docker-compose.yml if needed
ports:
  - "8002:8000"  # Change external port
```

### Permission Issues

```bash
# Fix script permissions
chmod +x export-images.sh
chmod +x import-images.sh
chmod +x setup-unified-docker.sh

# Fix directory permissions
sudo chown -R $USER:$USER backend/storage mineru/storage
```

### Disk Space Issues

```bash
# Check available space
df -h

# Clean up Docker
docker system prune -a
```

## 📊 Expected File Sizes

- `backend-api.tar`: ~200-500MB
- `frontend.tar`: ~100-300MB
- `mineru-api.tar`: ~1-3GB
- `redis.tar`: ~30-50MB
- `legal-doc-masker-images.tar.gz`: ~1-2GB (compressed)

## 🔒 Security Notes

1. Use encrypted transfer (SCP, SFTP) for sensitive environments
2. Verify image integrity after transfer
3. Update environment variables for the target environment
4. Ensure proper network security on the target environment

## 📞 Support

If you encounter issues:

1. Check the full `DOCKER_MIGRATION_GUIDE.md`
2. Verify all required files are present
3. Check Docker logs: `docker-compose logs -f`
4. Ensure sufficient disk space and permissions
@@ -0,0 +1,20 @@

# Storage paths
OBJECT_STORAGE_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_src
TARGET_DIRECTORY_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_dest
INTERMEDIATE_DIR_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_intermediate

# Ollama API Configuration
OLLAMA_API_URL=http://192.168.2.245:11434
# OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=qwen3:8b

# Application Settings
MONITOR_INTERVAL=5

# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log

# Optional: Additional security settings
# MAX_FILE_SIZE=10485760  # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf
@@ -0,0 +1,36 @@

FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    libreoffice \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first to leverage Docker cache
COPY requirements.txt .
# RUN pip install huggingface_hub
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py

# RUN python download_models_hf.py

RUN pip install --no-cache-dir -r requirements.txt
# RUN pip install -U magic-pdf[full]

# Copy the rest of the application
COPY . .

# Create storage directories
RUN mkdir -p storage/uploads storage/processed

# Expose the port the app runs on
EXPOSE 8000

# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -0,0 +1,202 @@

# PDF Processor with Mineru API

## Overview

The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.

## Changes Made

### 1. Removed Dependencies

- Removed all `magic_pdf` imports and dependencies
- Removed `PyPDF2` direct usage (though kept in requirements for potential other uses)

### 2. New Implementation

- **REST API Integration**: Uses HTTP requests to call Mineru's API
- **Configurable Settings**: Mineru API URL and timeout are configurable
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
- **Flexible Response Parsing**: Handles multiple possible response formats from the Mineru API

### 3. Configuration

Add the following settings to your environment or `.env` file:

```bash
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
```
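These settings can be read from the environment in `config.py`-style module constants. A minimal standard-library sketch (the real project may use a settings framework; variable names follow the table above, defaults mirror the documented values):

```python
import json
import os

# Defaults match the values documented above; each can be overridden via the environment.
MINERU_API_URL = os.environ.get("MINERU_API_URL", "http://mineru-api:8000")
MINERU_TIMEOUT = int(os.environ.get("MINERU_TIMEOUT", "300"))
# MINERU_LANG_LIST is a JSON-encoded list, e.g. '["ch"]'.
MINERU_LANG_LIST = json.loads(os.environ.get("MINERU_LANG_LIST", '["ch"]'))
MINERU_BACKEND = os.environ.get("MINERU_BACKEND", "pipeline")
MINERU_PARSE_METHOD = os.environ.get("MINERU_PARSE_METHOD", "auto")
MINERU_FORMULA_ENABLE = os.environ.get("MINERU_FORMULA_ENABLE", "true").lower() == "true"
MINERU_TABLE_ENABLE = os.environ.get("MINERU_TABLE_ENABLE", "true").lower() == "true"
```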
### 4. API Endpoint

The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.

#### Expected Request Format:

```
POST /file_parse
Content-Type: multipart/form-data

files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
```
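The non-file form fields above can be assembled into a plain dict before the multipart POST. A sketch with field names taken directly from the documented request format (the `files` part carries the PDF itself and is sent separately; the helper name is illustrative, not the processor's actual API):

```python
def build_file_parse_fields(
    lang_list: str = '["ch"]',
    backend: str = "pipeline",
    parse_method: str = "auto",
) -> dict:
    """Non-file form fields for POST /file_parse, matching the documented request format."""
    return {
        "output_dir": "./output",
        "lang_list": lang_list,
        "backend": backend,
        "parse_method": parse_method,
        "formula_enable": "true",
        "table_enable": "true",
        "return_md": "true",
        "return_middle_json": "false",
        "return_model_output": "false",
        "return_content_list": "false",
        "return_images": "false",
        "start_page_id": "0",
        "end_page_id": "99999",
    }
```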
#### Expected Response Format:

The processor can handle multiple response formats:

```json
{
  "markdown": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "md": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "content": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "result": {
    "markdown": "# Document Title\n\nContent here..."
  }
}
```
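A fallback chain over these four shapes can be written as a small helper; a sketch only, since the processor's actual parsing code is not shown in this diff:

```python
from typing import Optional


def extract_markdown(payload: dict) -> Optional[str]:
    """Try each documented response shape in turn; return the markdown or None."""
    # Flat shapes: {"markdown": ...}, {"md": ...}, {"content": ...}
    for key in ("markdown", "md", "content"):
        value = payload.get(key)
        if isinstance(value, str):
            return value
    # Nested shape: {"result": {"markdown": ...}}
    result = payload.get("result")
    if isinstance(result, dict):
        return extract_markdown(result)
    return None
```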
|
||||
## Usage
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor
|
||||
|
||||
# Create processor instance
|
||||
processor = PdfDocumentProcessor("input.pdf", "output.md")
|
||||
|
||||
# Read and convert PDF to markdown
|
||||
content = processor.read_content()
|
||||
|
||||
# Process content (apply masking)
|
||||
processed_content = processor.process_content(content)
|
||||
|
||||
# Save processed content
|
||||
processor.save_content(processed_content)
|
||||
```
|
||||
|
||||
### Through Document Service
|
||||
|
||||
```python
|
||||
from app.core.services.document_service import DocumentService
|
||||
|
||||
service = DocumentService()
|
||||
success = service.process_document("input.pdf", "output.md")
|
||||
```
|
||||
|
||||
## Testing
|
||||
|
||||
Run the test script to verify the implementation:
|
||||
|
||||
```bash
|
||||
cd backend
|
||||
python test_pdf_processor.py
|
||||
```
|
||||
|
||||
Make sure you have:
|
||||
1. A sample PDF file in the `sample_doc/` directory
|
||||
2. Mineru API service running and accessible
|
||||
3. Proper network connectivity between services
|
||||
|
||||
## Error Handling
|
||||
|
||||
The processor handles various error scenarios:
|
||||
|
||||
- **Network Timeouts**: Configurable timeout (default: 5 minutes)
|
||||
- **API Errors**: HTTP status code errors are logged and handled
|
||||
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
|
||||
- **File Operations**: Proper error handling for file reading/writing
|
||||
|
||||
## Logging
|
||||
|
||||
The processor provides detailed logging for debugging:
|
||||
|
||||
- API call attempts and responses
|
||||
- Content extraction results
|
||||
- Error conditions and stack traces
|
||||
- Processing statistics
|
||||
|
||||
## Deployment
|
||||
|
||||
### Docker Compose
|
||||
|
||||
Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.
|
||||
|
||||
### Environment Variables
|
||||
|
||||
Set the following environment variables in your deployment:
|
||||
|
||||
```bash
|
||||
MINERU_API_URL=http://your-mineru-service:8000
|
||||
MINERU_TIMEOUT=300
|
||||
```

## Troubleshooting

### Common Issues

1. **Connection Refused**: Check if the Mineru service is running and accessible
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
3. **Empty Content**: Check the Mineru API response format and logs
4. **Network Issues**: Verify network connectivity between services

### Debug Mode

Enable debug logging to see detailed API interactions:

```python
import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
```

## Migration from magic_pdf

If you were previously using magic_pdf:

1. **No Code Changes Required**: The interface remains the same
2. **Configuration Update**: Add the Mineru API settings
3. **Service Dependencies**: Ensure the Mineru service is running
4. **Testing**: Run the test script to verify functionality

## Performance Considerations

- **Timeout**: Large PDFs may require longer timeouts
- **Memory**: The processor loads the entire PDF into memory for API calls
- **Network**: API calls add network latency to processing time
- **Caching**: Consider implementing caching for frequently processed documents
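
The caching idea in the last bullet could be sketched by keying converted markdown on a hash of the PDF bytes, so re-uploads of identical content skip the API call. The helper names below are hypothetical and not part of the current codebase:

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict

# In-memory cache mapping content hashes to converted markdown.
_markdown_cache: Dict[str, str] = {}


def content_key(pdf_bytes: bytes) -> str:
    """Stable cache key: SHA-256 of the raw PDF bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def convert_with_cache(pdf_path: str, convert: Callable[[bytes], str]) -> str:
    """Convert a PDF to markdown, reusing a prior result for identical content."""
    data = Path(pdf_path).read_bytes()
    key = content_key(data)
    if key not in _markdown_cache:
        # Cache miss: run the expensive conversion (e.g. a call to the Mineru API).
        _markdown_cache[key] = convert(data)
    return _markdown_cache[key]
```

Hashing content rather than file paths means renamed or re-uploaded duplicates still hit the cache; a production version would bound the cache size or persist it to disk.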

@@ -0,0 +1,103 @@

# Legal Document Masker API

This is the backend API for the Legal Document Masking system. It provides endpoints for file upload, processing status tracking, and file download.

## Prerequisites

- Python 3.8+
- Redis (for Celery)

## File Storage

Files are stored in the following structure:

```
backend/
├── storage/
│   ├── uploads/    # Original uploaded files
│   └── processed/  # Masked/processed files
```

## Setup

### Option 1: Local Development

1. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up environment variables:
Create a `.env` file in the backend directory with the following variables:
```env
SECRET_KEY=your-secret-key-here
```

The database (SQLite) will be automatically created when you first run the application.

4. Start Redis (required for Celery):
```bash
redis-server
```

5. Start the Celery worker:
```bash
celery -A app.services.file_service worker --loglevel=info
```

6. Start the FastAPI server:
```bash
uvicorn app.main:app --reload
```

### Option 2: Docker Deployment

1. Build and start the services:
```bash
docker-compose up --build
```

This will start:
- FastAPI server on port 8000
- Celery worker for background processing
- Redis for the task queue

## API Documentation

Once the server is running, you can access:
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`

## API Endpoints

- `POST /api/v1/files/upload` - Upload a new file
- `GET /api/v1/files` - List all files
- `GET /api/v1/files/{file_id}` - Get file details
- `GET /api/v1/files/{file_id}/download` - Download processed file
- `WS /api/v1/files/ws/status/{file_id}` - WebSocket for real-time status updates
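
As a quick sanity check of the endpoints above, a minimal client could look like the sketch below. It assumes the server runs on `localhost:8000` with the paths listed above; the `requests` upload in `main()` is illustrative and requires a running backend, so it is defined but only invoked manually:

```python
BASE = "http://localhost:8000/api/v1"


def upload_url() -> str:
    """URL for POST /api/v1/files/upload."""
    return f"{BASE}/files/upload"


def download_url(file_id: str) -> str:
    """URL for GET /api/v1/files/{file_id}/download."""
    return f"{BASE}/files/{file_id}/download"


def main() -> None:
    import requests  # needs the backend running; not executed at import time
    with open("contract.pdf", "rb") as fh:
        # Field name "file" matches the upload endpoint's form parameter.
        resp = requests.post(upload_url(), files={"file": fh})
    resp.raise_for_status()
    print("uploaded as", resp.json()["id"])
```

The returned `id` can then be polled over the WebSocket endpoint until the status becomes `SUCCESS`, after which `download_url(id)` serves the masked markdown.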

## Development

### Running Tests
```bash
pytest
```

### Code Style
The project uses Black for code formatting:
```bash
black .
```

### Docker Commands

- Start services: `docker-compose up`
- Start in background: `docker-compose up -d`
- Stop services: `docker-compose down`
- View logs: `docker-compose logs -f`
- Rebuild: `docker-compose up --build`

@@ -0,0 +1,166 @@

from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, WebSocket, Response
from fastapi.responses import FileResponse
from sqlalchemy.orm import Session
from typing import List
import os
from ...core.config import settings
from ...core.database import get_db
from ...models.file import File as FileModel, FileStatus
from ...services.file_service import process_file, delete_file
from ...schemas.file import FileResponse as FileResponseSchema, FileList
import asyncio
from fastapi import WebSocketDisconnect
import uuid

router = APIRouter()


@router.post("/upload", response_model=FileResponseSchema)
async def upload_file(
    file: UploadFile = File(...),
    db: Session = Depends(get_db)
):
    if not file.filename:
        raise HTTPException(status_code=400, detail="No file provided")

    if not any(file.filename.lower().endswith(ext) for ext in settings.ALLOWED_EXTENSIONS):
        raise HTTPException(
            status_code=400,
            detail=f"File type not allowed. Allowed types: {', '.join(settings.ALLOWED_EXTENSIONS)}"
        )

    # Generate unique file ID
    file_id = str(uuid.uuid4())
    file_extension = os.path.splitext(file.filename)[1]
    unique_filename = f"{file_id}{file_extension}"

    # Save file with unique name
    file_path = settings.UPLOAD_FOLDER / unique_filename
    with open(file_path, "wb") as buffer:
        content = await file.read()
        buffer.write(content)

    # Create database entry
    db_file = FileModel(
        id=file_id,
        filename=file.filename,
        original_path=str(file_path),
        status=FileStatus.NOT_STARTED
    )
    db.add(db_file)
    db.commit()
    db.refresh(db_file)

    # Start processing
    process_file.delay(str(db_file.id))

    return db_file


@router.get("/files", response_model=List[FileResponseSchema])
def list_files(
    skip: int = 0,
    limit: int = 100,
    db: Session = Depends(get_db)
):
    files = db.query(FileModel).offset(skip).limit(limit).all()
    return files


@router.get("/files/{file_id}", response_model=FileResponseSchema)
def get_file(
    file_id: str,
    db: Session = Depends(get_db)
):
    file = db.query(FileModel).filter(FileModel.id == file_id).first()
    if not file:
        raise HTTPException(status_code=404, detail="File not found")
    return file


@router.get("/files/{file_id}/download")
async def download_file(
    file_id: str,
    db: Session = Depends(get_db)
):
    print(f"=== DOWNLOAD REQUEST ===")
    print(f"File ID: {file_id}")

    file = db.query(FileModel).filter(FileModel.id == file_id).first()
    if not file:
        print(f"❌ File not found for ID: {file_id}")
        raise HTTPException(status_code=404, detail="File not found")

    print(f"✅ File found: {file.filename}")
    print(f"File status: {file.status}")
    print(f"Original path: {file.original_path}")
    print(f"Processed path: {file.processed_path}")

    if file.status != FileStatus.SUCCESS:
        print(f"❌ File not ready for download. Status: {file.status}")
        raise HTTPException(status_code=400, detail="File is not ready for download")

    if not os.path.exists(file.processed_path):
        print(f"❌ Processed file not found at: {file.processed_path}")
        raise HTTPException(status_code=404, detail="Processed file not found")

    print(f"✅ Processed file exists at: {file.processed_path}")

    # Get the original filename without extension and add .md extension
    original_filename = file.filename
    filename_without_ext = os.path.splitext(original_filename)[0]
    download_filename = f"{filename_without_ext}.md"

    print(f"Original filename: {original_filename}")
    print(f"Filename without extension: {filename_without_ext}")
    print(f"Download filename: {download_filename}")

    response = FileResponse(
        path=file.processed_path,
        filename=download_filename,
        media_type="text/markdown"
    )

    print(f"Response headers: {dict(response.headers)}")
    print(f"=== END DOWNLOAD REQUEST ===")

    return response


@router.websocket("/ws/status/{file_id}")
async def websocket_endpoint(websocket: WebSocket, file_id: str, db: Session = Depends(get_db)):
    await websocket.accept()
    try:
        while True:
            file = db.query(FileModel).filter(FileModel.id == file_id).first()
            if not file:
                await websocket.send_json({"error": "File not found"})
                break

            await websocket.send_json({
                "status": file.status,
                "error": file.error_message
            })

            if file.status in [FileStatus.SUCCESS, FileStatus.FAILED]:
                break

            await asyncio.sleep(1)
    except WebSocketDisconnect:
        pass


@router.delete("/files/{file_id}")
async def delete_file_endpoint(
    file_id: str,
    db: Session = Depends(get_db)
):
    """
    Delete a file and its associated records.
    This will remove:
    1. The database record
    2. The original uploaded file
    3. The processed markdown file (if it exists)
    """
    try:
        delete_file(file_id)
        return {"message": "File deleted successfully"}
    except HTTPException as e:
        raise e
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@@ -0,0 +1,65 @@

from pydantic_settings import BaseSettings
from typing import Optional
import os
from pathlib import Path


class Settings(BaseSettings):
    # API Settings
    API_V1_STR: str = "/api/v1"
    PROJECT_NAME: str = "Legal Document Masker API"

    # Security
    SECRET_KEY: str = "your-secret-key-here"  # Change in production
    ACCESS_TOKEN_EXPIRE_MINUTES: int = 60 * 24 * 8  # 8 days

    # Database
    BASE_DIR: Path = Path(__file__).parent.parent.parent
    DATABASE_URL: str = f"sqlite:///{BASE_DIR}/storage/legal_doc_masker.db"

    # File Storage
    UPLOAD_FOLDER: Path = BASE_DIR / "storage" / "uploads"
    PROCESSED_FOLDER: Path = BASE_DIR / "storage" / "processed"
    MAX_FILE_SIZE: int = 50 * 1024 * 1024  # 50MB
    ALLOWED_EXTENSIONS: set = {"pdf", "docx", "doc", "md"}

    # Celery
    CELERY_BROKER_URL: str = "redis://redis:6379/0"
    CELERY_RESULT_BACKEND: str = "redis://redis:6379/0"

    # Ollama API settings
    OLLAMA_API_URL: str = "https://api.ollama.com"
    OLLAMA_API_KEY: str = ""
    OLLAMA_MODEL: str = "llama2"

    # Mineru API settings
    MINERU_API_URL: str = "http://mineru-api:8000"
    # MINERU_API_URL: str = "http://host.docker.internal:8001"

    MINERU_TIMEOUT: int = 300  # 5 minutes timeout
    MINERU_LANG_LIST: list = ["ch"]  # Language list for parsing
    MINERU_BACKEND: str = "pipeline"  # Backend to use
    MINERU_PARSE_METHOD: str = "auto"  # Parse method
    MINERU_FORMULA_ENABLE: bool = True  # Enable formula parsing
    MINERU_TABLE_ENABLE: bool = True  # Enable table parsing

    # Logging settings
    LOG_LEVEL: str = "INFO"
    LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
    LOG_FILE: str = "app.log"

    class Config:
        case_sensitive = True
        env_file = ".env"
        env_file_encoding = "utf-8"
        extra = "allow"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Create storage directories if they don't exist
        self.UPLOAD_FOLDER.mkdir(parents=True, exist_ok=True)
        self.PROCESSED_FOLDER.mkdir(parents=True, exist_ok=True)
        # Create storage directory for database
        (self.BASE_DIR / "storage").mkdir(parents=True, exist_ok=True)


settings = Settings()

@@ -1,5 +1,6 @@
 import logging.config
-from config.settings import settings
+# from config.settings import settings
+from .settings import settings
 
 LOGGING_CONFIG = {
     "version": 1,

@@ -0,0 +1,21 @@

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from .config import settings

# Create SQLite engine with check_same_thread=False for FastAPI
engine = create_engine(
    settings.DATABASE_URL,
    connect_args={"check_same_thread": False}
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

Base = declarative_base()


# Dependency
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

@@ -1,9 +1,9 @@
 import os
 from typing import Optional
-from document_handlers.document_processor import DocumentProcessor
-from document_handlers.processors import (
+from .document_processor import DocumentProcessor
+from .processors import (
     TxtDocumentProcessor,
-    DocxDocumentProcessor,
+    # DocxDocumentProcessor,
     PdfDocumentProcessor,
     MarkdownDocumentProcessor
 )

@@ -15,8 +15,8 @@ class DocumentProcessorFactory:
 
         processors = {
             '.txt': TxtDocumentProcessor,
-            '.docx': DocxDocumentProcessor,
-            '.doc': DocxDocumentProcessor,
+            # '.docx': DocxDocumentProcessor,
+            # '.doc': DocxDocumentProcessor,
             '.pdf': PdfDocumentProcessor,
             '.md': MarkdownDocumentProcessor,
             '.markdown': MarkdownDocumentProcessor

@@ -0,0 +1,71 @@

from abc import ABC, abstractmethod
from typing import Any, Dict
import logging
from .ner_processor import NerProcessor

logger = logging.getLogger(__name__)


class DocumentProcessor(ABC):

    def __init__(self):
        self.max_chunk_size = 1000  # Maximum number of characters per chunk
        self.ner_processor = NerProcessor()

    @abstractmethod
    def read_content(self) -> str:
        """Read document content"""
        pass

    def _split_into_chunks(self, sentences: list[str]) -> list[str]:
        """Split sentences into chunks that don't exceed max_chunk_size"""
        chunks = []
        current_chunk = ""

        for sentence in sentences:
            if not sentence.strip():
                continue

            if len(current_chunk) + len(sentence) > self.max_chunk_size and current_chunk:
                chunks.append(current_chunk)
                current_chunk = sentence
            else:
                if current_chunk:
                    current_chunk += "。" + sentence
                else:
                    current_chunk = sentence

        if current_chunk:
            chunks.append(current_chunk)
        logger.info(f"Split content into {len(chunks)} chunks")

        return chunks

    def _apply_mapping(self, text: str, mapping: Dict[str, str]) -> str:
        """Apply the mapping to replace sensitive information"""
        masked_text = text
        for original, masked in mapping.items():
            if isinstance(masked, dict):
                masked = next(iter(masked.values()), "某")
            elif not isinstance(masked, str):
                masked = str(masked) if masked is not None else "某"
            masked_text = masked_text.replace(original, masked)
        return masked_text

    def process_content(self, content: str) -> str:
        """Process document content by masking sensitive information"""
        sentences = content.split("。")

        chunks = self._split_into_chunks(sentences)
        logger.info(f"Split content into {len(chunks)} chunks")

        final_mapping = self.ner_processor.process(chunks)

        masked_content = self._apply_mapping(content, final_mapping)
        logger.info("Successfully masked content")

        return masked_content

    @abstractmethod
    def save_content(self, content: str) -> None:
        """Save processed content"""
        pass

@@ -0,0 +1,305 @@

from typing import Any, Dict
from ..prompts.masking_prompts import get_ner_name_prompt, get_ner_company_prompt, get_ner_address_prompt, get_ner_project_prompt, get_ner_case_number_prompt, get_entity_linkage_prompt
import logging
import json
from ..services.ollama_client import OllamaClient
from ...core.config import settings
from ..utils.json_extractor import LLMJsonExtractor
from ..utils.llm_validator import LLMResponseValidator
import re
from .regs.entity_regex import extract_id_number_entities, extract_social_credit_code_entities

logger = logging.getLogger(__name__)


class NerProcessor:
    def __init__(self):
        self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
        self.max_retries = 3

    def _validate_mapping_format(self, mapping: Dict[str, Any]) -> bool:
        return LLMResponseValidator.validate_entity_extraction(mapping)

    def _process_entity_type(self, chunk: str, prompt_func, entity_type: str) -> Dict[str, str]:
        for attempt in range(self.max_retries):
            try:
                formatted_prompt = prompt_func(chunk)
                logger.info(f"Calling ollama to generate {entity_type} mapping for chunk (attempt {attempt + 1}/{self.max_retries}): {formatted_prompt}")
                response = self.ollama_client.generate(formatted_prompt)
                logger.info(f"Raw response from LLM: {response}")

                mapping = LLMJsonExtractor.parse_raw_json_str(response)
                logger.info(f"Parsed mapping: {mapping}")

                if mapping and self._validate_mapping_format(mapping):
                    return mapping
                else:
                    logger.warning(f"Invalid mapping format received on attempt {attempt + 1}, retrying...")
            except Exception as e:
                logger.error(f"Error generating {entity_type} mapping on attempt {attempt + 1}: {e}")
                if attempt < self.max_retries - 1:
                    logger.info("Retrying...")
                else:
                    logger.error(f"Max retries reached for {entity_type}, returning empty mapping")

        return {}

    def build_mapping(self, chunk: str) -> list[Dict[str, str]]:
        mapping_pipeline = []

        entity_configs = [
            (get_ner_name_prompt, "people names"),
            (get_ner_company_prompt, "company names"),
            (get_ner_address_prompt, "addresses"),
            (get_ner_project_prompt, "project names"),
            (get_ner_case_number_prompt, "case numbers")
        ]
        for prompt_func, entity_type in entity_configs:
            mapping = self._process_entity_type(chunk, prompt_func, entity_type)
            if mapping:
                mapping_pipeline.append(mapping)

        regex_entity_extractors = [
            extract_id_number_entities,
            extract_social_credit_code_entities
        ]
        for extractor in regex_entity_extractors:
            mapping = extractor(chunk)
            if mapping and LLMResponseValidator.validate_regex_entity(mapping):
                mapping_pipeline.append(mapping)
            elif mapping:
                logger.warning(f"Invalid regex entity mapping format: {mapping}")

        return mapping_pipeline

    def _merge_entity_mappings(self, chunk_mappings: list[Dict[str, Any]]) -> list[Dict[str, str]]:
        all_entities = []
        for mapping in chunk_mappings:
            if isinstance(mapping, dict) and 'entities' in mapping:
                entities = mapping['entities']
                if isinstance(entities, list):
                    all_entities.extend(entities)

        unique_entities = []
        seen_texts = set()

        for entity in all_entities:
            if isinstance(entity, dict) and 'text' in entity:
                text = entity['text'].strip()
                if text and text not in seen_texts:
                    seen_texts.add(text)
                    unique_entities.append(entity)
                elif text and text in seen_texts:
                    # For now, just log duplicate entities that may conflict
                    logger.info(f"Duplicate entity found: {entity}")
                    continue

        logger.info(f"Merged {len(unique_entities)} unique entities")
        return unique_entities

    def _generate_masked_mapping(self, unique_entities: list[Dict[str, str]], linkage: Dict[str, Any]) -> Dict[str, str]:
        """
        Using the linkage info, map entities in the same group to a single masked name, under these rules:
        1. Person names / abbreviations: keep the surname, replace the given name with 某, numbering duplicates that share a surname;
        2. Company names: companies in the same group map to uppercase-letter companies (A公司, B公司, ...);
        3. English person names: first letter of each word plus ***;
        4. English company names: replaced by the industry name in uppercase English (default COMPANY when no industry info is available);
        5. Project names: replaced by lowercase letters (a项目, b项目, ...);
        6. Case numbers: only the digits are replaced with ***, keeping the surrounding structure and the trailing 号 (embedded spaces are supported);
        7. ID numbers: six X characters;
        8. Social credit codes: eight X characters;
        9. Addresses: keep district-level and higher administrative divisions, drop the detailed location;
        10. Other types follow the existing logic.
        """
        entity_mapping = {}
        used_masked_names = set()
        group_mask_map = {}
        surname_counter = {}
        company_letter = ord('A')
        project_letter = ord('a')
        # Prefer district/county-level units first, then city, province, etc.
        admin_keywords = [
            '市辖区', '自治县', '自治旗', '林区', '区', '县', '旗', '州', '盟', '地区', '自治州',
            '市', '省', '自治区', '特别行政区'
        ]
        admin_pattern = r"^(.*?(?:" + '|'.join(admin_keywords) + r"))"
        for group in linkage.get('entity_groups', []):
            group_type = group.get('group_type', '')
            entities = group.get('entities', [])
            if '公司' in group_type or 'Company' in group_type:
                masked = chr(company_letter) + '公司'
                company_letter += 1
                for entity in entities:
                    group_mask_map[entity['text']] = masked
            elif '人名' in group_type:
                surname_local_counter = {}
                for entity in entities:
                    name = entity['text']
                    if not name:
                        continue
                    surname = name[0]
                    surname_local_counter.setdefault(surname, 0)
                    surname_local_counter[surname] += 1
                    if surname_local_counter[surname] == 1:
                        masked = f"{surname}某"
                    else:
                        masked = f"{surname}某{surname_local_counter[surname]}"
                    group_mask_map[name] = masked
            elif '英文人名' in group_type:
                for entity in entities:
                    name = entity['text']
                    if not name:
                        continue
                    masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
                    group_mask_map[name] = masked
        for entity in unique_entities:
            text = entity['text']
            entity_type = entity.get('type', '')
            if text in group_mask_map:
                entity_mapping[text] = group_mask_map[text]
                used_masked_names.add(group_mask_map[text])
            elif '英文公司名' in entity_type or 'English Company' in entity_type:
                industry = entity.get('industry', 'COMPANY')
                masked = industry.upper()
                entity_mapping[text] = masked
                used_masked_names.add(masked)
            elif '项目名' in entity_type:
                masked = chr(project_letter) + '项目'
                project_letter += 1
                entity_mapping[text] = masked
                used_masked_names.add(masked)
            elif '案号' in entity_type:
                masked = re.sub(r'(\d[\d\s]*)(号)', r'***\2', text)
                entity_mapping[text] = masked
                used_masked_names.add(masked)
            elif '身份证号' in entity_type:
                masked = 'X' * 6
                entity_mapping[text] = masked
                used_masked_names.add(masked)
            elif '社会信用代码' in entity_type:
                masked = 'X' * 8
                entity_mapping[text] = masked
                used_masked_names.add(masked)
            elif '地址' in entity_type:
                # Keep district-level and higher divisions, drop the detailed location
                match = re.match(admin_pattern, text)
                if match:
                    masked = match.group(1)
                else:
                    masked = text  # fallback
                entity_mapping[text] = masked
                used_masked_names.add(masked)
            elif '人名' in entity_type:
                name = text
                if not name:
                    masked = '某'
                else:
                    surname = name[0]
                    surname_counter.setdefault(surname, 0)
                    surname_counter[surname] += 1
                    if surname_counter[surname] == 1:
                        masked = f"{surname}某"
                    else:
                        masked = f"{surname}某{surname_counter[surname]}"
                entity_mapping[text] = masked
                used_masked_names.add(masked)
            elif '公司' in entity_type or 'Company' in entity_type:
                masked = chr(company_letter) + '公司'
                company_letter += 1
                entity_mapping[text] = masked
                used_masked_names.add(masked)
            elif '英文人名' in entity_type:
                name = text
                masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
                entity_mapping[text] = masked
                used_masked_names.add(masked)
            else:
                base_name = '某'
                masked = base_name
                counter = 1
                while masked in used_masked_names:
                    if counter <= 10:
                        suffixes = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']
                        masked = base_name + suffixes[counter - 1]
                    else:
                        masked = f"{base_name}{counter}"
                    counter += 1
                entity_mapping[text] = masked
                used_masked_names.add(masked)
        return entity_mapping

    def _validate_linkage_format(self, linkage: Dict[str, Any]) -> bool:
        return LLMResponseValidator.validate_entity_linkage(linkage)

    def _create_entity_linkage(self, unique_entities: list[Dict[str, str]]) -> Dict[str, Any]:
        linkable_entities = []
        for entity in unique_entities:
            entity_type = entity.get('type', '')
            if any(keyword in entity_type for keyword in ['公司', 'Company', '人名', '英文人名']):
                linkable_entities.append(entity)

        if not linkable_entities:
            logger.info("No linkable entities found")
            return {"entity_groups": []}

        entities_text = "\n".join([
            f"- {entity['text']} (类型: {entity['type']})"
            for entity in linkable_entities
        ])

        for attempt in range(self.max_retries):
            try:
                formatted_prompt = get_entity_linkage_prompt(entities_text)
                logger.info(f"Calling ollama to generate entity linkage (attempt {attempt + 1}/{self.max_retries})")
                response = self.ollama_client.generate(formatted_prompt)
                logger.info(f"Raw entity linkage response from LLM: {response}")

                linkage = LLMJsonExtractor.parse_raw_json_str(response)
                logger.info(f"Parsed entity linkage: {linkage}")

                if linkage and self._validate_linkage_format(linkage):
                    logger.info(f"Successfully created entity linkage with {len(linkage.get('entity_groups', []))} groups")
                    return linkage
                else:
                    logger.warning(f"Invalid entity linkage format received on attempt {attempt + 1}, retrying...")
            except Exception as e:
                logger.error(f"Error generating entity linkage on attempt {attempt + 1}: {e}")
                if attempt < self.max_retries - 1:
                    logger.info("Retrying...")
                else:
                    logger.error("Max retries reached for entity linkage, returning empty linkage")

        return {"entity_groups": []}

    def _apply_entity_linkage_to_mapping(self, entity_mapping: Dict[str, str], entity_linkage: Dict[str, Any]) -> Dict[str, str]:
        """
        Linkage has already been applied in _generate_masked_mapping, so return entity_mapping unchanged.
        """
        return entity_mapping

    def process(self, chunks: list[str]) -> Dict[str, str]:
        chunk_mappings = []
        for i, chunk in enumerate(chunks):
            logger.info(f"Processing chunk {i+1}/{len(chunks)}")
            chunk_mapping = self.build_mapping(chunk)
            logger.info(f"Chunk mapping: {chunk_mapping}")
            chunk_mappings.extend(chunk_mapping)

        logger.info(f"Final chunk mappings: {chunk_mappings}")

        unique_entities = self._merge_entity_mappings(chunk_mappings)
        logger.info(f"Unique entities: {unique_entities}")

        entity_linkage = self._create_entity_linkage(unique_entities)
        logger.info(f"Entity linkage: {entity_linkage}")

        # for quick test
        # unique_entities = [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
        # entity_linkage = {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
        combined_mapping = self._generate_masked_mapping(unique_entities, entity_linkage)
        logger.info(f"Combined mapping: {combined_mapping}")

        final_mapping = self._apply_entity_linkage_to_mapping(combined_mapping, entity_linkage)
        logger.info(f"Final mapping: {final_mapping}")

        return final_mapping

@@ -0,0 +1,7 @@

from .txt_processor import TxtDocumentProcessor
# from .docx_processor import DocxDocumentProcessor
from .pdf_processor import PdfDocumentProcessor
from .md_processor import MarkdownDocumentProcessor

# __all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
__all__ = ['TxtDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']

@@ -1,13 +1,13 @@
 import os
 import docx
-from document_handlers.document_processor import DocumentProcessor
+from ...document_handlers.document_processor import DocumentProcessor
 from magic_pdf.data.data_reader_writer import FileBasedDataWriter
 from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
 from magic_pdf.data.read_api import read_local_office
 import logging
-from services.ollama_client import OllamaClient
-from config.settings import settings
-from prompts.masking_prompts import get_masking_mapping_prompt
+from ...services.ollama_client import OllamaClient
+from ...config import settings
+from ...prompts.masking_prompts import get_masking_mapping_prompt
 
 logger = logging.getLogger(__name__)
 

@@ -1,8 +1,8 @@
 import os
-from document_handlers.document_processor import DocumentProcessor
-from services.ollama_client import OllamaClient
+from ...document_handlers.document_processor import DocumentProcessor
+from ...services.ollama_client import OllamaClient
 import logging
-from config.settings import settings
+from ...config import settings
 
 logger = logging.getLogger(__name__)

@@ -0,0 +1,204 @@

import os
import requests
import logging
from typing import Dict, Any, Optional
from ...document_handlers.document_processor import DocumentProcessor
from ...services.ollama_client import OllamaClient
from ...config import settings

logger = logging.getLogger(__name__)


class PdfDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        super().__init__()  # Call parent class's __init__
        self.input_path = input_path
        self.output_path = output_path
        self.output_dir = os.path.dirname(output_path)
        self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]

        # Setup work directory for temporary files
        self.work_dir = os.path.join(
            os.path.dirname(output_path),
            ".work",
            os.path.splitext(os.path.basename(input_path))[0]
        )
        os.makedirs(self.work_dir, exist_ok=True)

        self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)

        # Mineru API configuration
        self.mineru_base_url = getattr(settings, 'MINERU_API_URL', 'http://mineru-api:8000')
        self.mineru_timeout = getattr(settings, 'MINERU_TIMEOUT', 300)  # 5 minutes timeout
        self.mineru_lang_list = getattr(settings, 'MINERU_LANG_LIST', ['ch'])
        self.mineru_backend = getattr(settings, 'MINERU_BACKEND', 'pipeline')
        self.mineru_parse_method = getattr(settings, 'MINERU_PARSE_METHOD', 'auto')
        self.mineru_formula_enable = getattr(settings, 'MINERU_FORMULA_ENABLE', True)
        self.mineru_table_enable = getattr(settings, 'MINERU_TABLE_ENABLE', True)

    def _call_mineru_api(self, file_path: str) -> Optional[Dict[str, Any]]:
        """
        Call Mineru API to convert PDF to markdown

        Args:
            file_path: Path to the PDF file

        Returns:
            API response as dictionary or None if failed
        """
        try:
            url = f"{self.mineru_base_url}/file_parse"

            with open(file_path, 'rb') as file:
                files = {'files': (os.path.basename(file_path), file, 'application/pdf')}

                # Prepare form data according to Mineru API specification
                data = {
|
||||
'output_dir': './output',
|
||||
'lang_list': self.mineru_lang_list,
|
||||
'backend': self.mineru_backend,
|
||||
'parse_method': self.mineru_parse_method,
|
||||
'formula_enable': self.mineru_formula_enable,
|
||||
'table_enable': self.mineru_table_enable,
|
||||
'return_md': True,
|
||||
'return_middle_json': False,
|
||||
'return_model_output': False,
|
||||
'return_content_list': False,
|
||||
'return_images': False,
|
||||
'start_page_id': 0,
|
||||
'end_page_id': 99999
|
||||
}
|
||||
|
||||
logger.info(f"Calling Mineru API at {url}")
|
||||
response = requests.post(
|
||||
url,
|
||||
files=files,
|
||||
data=data,
|
||||
timeout=self.mineru_timeout
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
result = response.json()
|
||||
logger.info("Successfully received response from Mineru API")
|
||||
return result
|
||||
else:
|
||||
logger.error(f"Mineru API returned status code {response.status_code}: {response.text}")
|
||||
return None
|
||||
|
||||
except requests.exceptions.Timeout:
|
||||
logger.error(f"Mineru API request timed out after {self.mineru_timeout} seconds")
|
||||
return None
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error calling Mineru API: {str(e)}")
|
||||
return None
|
||||
except Exception as e:
|
||||
logger.error(f"Unexpected error calling Mineru API: {str(e)}")
|
||||
return None
|
||||
|
||||
def _extract_markdown_from_response(self, response: Dict[str, Any]) -> str:
|
||||
"""
|
||||
Extract markdown content from Mineru API response
|
||||
|
||||
Args:
|
||||
response: Mineru API response dictionary
|
||||
|
||||
Returns:
|
||||
Extracted markdown content as string
|
||||
"""
|
||||
try:
|
||||
logger.debug(f"Mineru API response structure: {response}")
|
||||
|
||||
# Try different possible response formats based on Mineru API
|
||||
if 'markdown' in response:
|
||||
return response['markdown']
|
||||
elif 'md' in response:
|
||||
return response['md']
|
||||
elif 'content' in response:
|
||||
return response['content']
|
||||
elif 'text' in response:
|
||||
return response['text']
|
||||
elif 'result' in response and isinstance(response['result'], dict):
|
||||
result = response['result']
|
||||
if 'markdown' in result:
|
||||
return result['markdown']
|
||||
elif 'md' in result:
|
||||
return result['md']
|
||||
elif 'content' in result:
|
||||
return result['content']
|
||||
elif 'text' in result:
|
||||
return result['text']
|
||||
elif 'data' in response and isinstance(response['data'], dict):
|
||||
data = response['data']
|
||||
if 'markdown' in data:
|
||||
return data['markdown']
|
||||
elif 'md' in data:
|
||||
return data['md']
|
||||
elif 'content' in data:
|
||||
return data['content']
|
||||
elif 'text' in data:
|
||||
return data['text']
|
||||
elif isinstance(response, list) and len(response) > 0:
|
||||
# If response is a list, try to extract from first item
|
||||
first_item = response[0]
|
||||
if isinstance(first_item, dict):
|
||||
return self._extract_markdown_from_response(first_item)
|
||||
elif isinstance(first_item, str):
|
||||
return first_item
|
||||
else:
|
||||
# If no standard format found, try to extract from the response structure
|
||||
logger.warning("Could not find standard markdown field in Mineru response")
|
||||
|
||||
# Return the response as string if it's simple, or empty string
|
||||
if isinstance(response, str):
|
||||
return response
|
||||
elif isinstance(response, dict):
|
||||
# Try to find any text-like content
|
||||
for key, value in response.items():
|
||||
if isinstance(value, str) and len(value) > 100: # Likely content
|
||||
return value
|
||||
elif isinstance(value, dict):
|
||||
# Recursively search in nested dictionaries
|
||||
nested_content = self._extract_markdown_from_response(value)
|
||||
if nested_content:
|
||||
return nested_content
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error extracting markdown from Mineru response: {str(e)}")
|
||||
return ""
|
||||
|
||||
def read_content(self) -> str:
|
||||
logger.info("Starting PDF content processing with Mineru API")
|
||||
|
||||
# Call Mineru API to convert PDF to markdown
|
||||
mineru_response = self._call_mineru_api(self.input_path)
|
||||
|
||||
if not mineru_response:
|
||||
raise Exception("Failed to get response from Mineru API")
|
||||
|
||||
# Extract markdown content from the response
|
||||
markdown_content = self._extract_markdown_from_response(mineru_response)
|
||||
|
||||
if not markdown_content:
|
||||
raise Exception("No markdown content found in Mineru API response")
|
||||
|
||||
logger.info(f"Successfully extracted {len(markdown_content)} characters of markdown content")
|
||||
|
||||
# Save the raw markdown content to work directory for reference
|
||||
md_output_path = os.path.join(self.work_dir, f"{self.name_without_suff}.md")
|
||||
with open(md_output_path, 'w', encoding='utf-8') as file:
|
||||
file.write(markdown_content)
|
||||
|
||||
logger.info(f"Saved raw markdown content to {md_output_path}")
|
||||
|
||||
return markdown_content
|
||||
|
||||
def save_content(self, content: str) -> None:
|
||||
# Ensure output path has .md extension
|
||||
output_dir = os.path.dirname(self.output_path)
|
||||
base_name = os.path.splitext(os.path.basename(self.output_path))[0]
|
||||
md_output_path = os.path.join(output_dir, f"{base_name}.md")
|
||||
|
||||
logger.info(f"Saving masked content to: {md_output_path}")
|
||||
with open(md_output_path, 'w', encoding='utf-8') as file:
|
||||
file.write(content)
|
||||
|
|
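The key-probing fallback in `_extract_markdown_from_response` above is easier to see in isolation. This dependency-free sketch (the function name is illustrative, not part of the module) shows the order it follows: probe the well-known field names first, then recurse into nested dicts:

```python
def extract_markdown(resp):
    # A string response is already the content.
    if isinstance(resp, str):
        return resp
    if isinstance(resp, dict):
        # Probe the common field names first, as the processor does.
        for key in ("markdown", "md", "content", "text"):
            if isinstance(resp.get(key), str):
                return resp[key]
        # Otherwise recurse into nested dicts until a string turns up.
        for value in resp.values():
            if isinstance(value, dict):
                found = extract_markdown(value)
                if found:
                    return found
    return ""

extract_markdown({"result": {"md": "# 判决书"}})  # → "# 判决书"
```

This tolerant approach trades precision for robustness: whichever envelope the Mineru deployment wraps its payload in, the first plausible string field wins.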
@@ -1,8 +1,8 @@
-from document_handlers.document_processor import DocumentProcessor
-from services.ollama_client import OllamaClient
+from ...document_handlers.document_processor import DocumentProcessor
+from ...services.ollama_client import OllamaClient
 import logging
-from prompts.masking_prompts import get_masking_prompt
-from config.settings import settings
+# from ...prompts.masking_prompts import get_masking_prompt
+from ...config import settings

 logger = logging.getLogger(__name__)
 class TxtDocumentProcessor(DocumentProcessor):
@@ -0,0 +1,18 @@
import re

def extract_id_number_entities(chunk: str) -> dict:
    """Extract Chinese ID numbers and return in entity mapping format."""
    id_pattern = r'\b\d{17}[\dXx]\b'
    entities = []
    for match in re.findall(id_pattern, chunk):
        entities.append({"text": match, "type": "身份证号"})
    return {"entities": entities} if entities else {}


def extract_social_credit_code_entities(chunk: str) -> dict:
    """Extract social credit codes and return in entity mapping format."""
    credit_pattern = r'\b[0-9A-Z]{18}\b'
    entities = []
    for match in re.findall(credit_pattern, chunk):
        entities.append({"text": match, "type": "统一社会信用代码"})
    return {"entities": entities} if entities else {}
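A quick sanity check of the two patterns above (the sample numbers are fabricated for illustration). One subtlety: `\b` only fires where a word character meets a non-word character, and in Python 3's default Unicode mode CJK ideographs count as word characters, so these patterns match only when the number is delimited by punctuation or whitespace, not when it abuts Chinese prose directly:

```python
import re

# Patterns copied from the diff above; sample values are made up.
id_pattern = r'\b\d{17}[\dXx]\b'
credit_pattern = r'\b[0-9A-Z]{18}\b'

chunk = "身份证号:11010519491231002x,统一社会信用代码:91110000MA01C8Q242。"
ids = re.findall(id_pattern, chunk)        # ['11010519491231002x']
codes = re.findall(credit_pattern, chunk)  # ['91110000MA01C8Q242']
```

Note also that the two character classes overlap: an ID number ending in uppercase `X` is 18 characters drawn from `[0-9A-Z]` and would match the credit-code pattern as well, so downstream deduplication matters.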
@@ -0,0 +1,225 @@
import textwrap


def get_ner_name_prompt(text: str) -> str:
    """
    Returns a prompt that generates a mapping of original person names to their masked versions.

    Args:
        text (str): The input text to be analyzed for masking

    Returns:
        str: The formatted prompt that will generate a mapping dictionary
    """
    prompt = textwrap.dedent("""
        你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。

        实体类别包括:
        - 人名 (不包括律师、法官、书记员、检察官等公职人员)
        - 英文人名

        待处理文本:
        {text}

        输出格式:
        {{
            "entities": [
                {{"text": "原始文本内容", "type": "人名"}},
                {{"text": "原始文本内容", "type": "英文人名"}},
                ...
            ]
        }}

        请严格按照JSON格式输出结果。
    """)
    return prompt.format(text=text)


def get_ner_company_prompt(text: str) -> str:
    """
    Returns a prompt that generates a mapping of original companies to their masked versions.

    Args:
        text (str): The input text to be analyzed for masking

    Returns:
        str: The formatted prompt that will generate a mapping dictionary
    """
    prompt = textwrap.dedent("""
        你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。

        实体类别包括:
        - 公司名称
        - 英文公司名称
        - Company with English name
        - 公司名称简称
        - 公司英文名称简称

        待处理文本:
        {text}

        输出格式:
        {{
            "entities": [
                {{"text": "原始文本内容", "type": "公司名称"}},
                {{"text": "原始文本内容", "type": "英文公司名称"}},
                {{"text": "原始文本内容", "type": "公司名称简称"}},
                {{"text": "原始文本内容", "type": "公司英文名称简称"}},
                ...
            ]
        }}

        请严格按照JSON格式输出结果。
    """)
    return prompt.format(text=text)


def get_ner_address_prompt(text: str) -> str:
    """
    Returns a prompt that generates a mapping of original addresses to their masked versions.

    Args:
        text (str): The input text to be analyzed for masking

    Returns:
        str: The formatted prompt that will generate a mapping dictionary
    """
    prompt = textwrap.dedent("""
        你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。

        实体类别包括:
        - 地址

        待处理文本:
        {text}

        输出格式:
        {{
            "entities": [
                {{"text": "原始文本内容", "type": "地址"}},
                ...
            ]
        }}

        请严格按照JSON格式输出结果。
    """)
    return prompt.format(text=text)


def get_ner_project_prompt(text: str) -> str:
    """
    Returns a prompt that generates a mapping of original project names to their masked versions.
    """
    prompt = textwrap.dedent("""
        你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。

        实体类别包括:
        - 项目名(此处项目特指商业、工程、合同等项目)

        待处理文本:
        {text}

        输出格式:
        {{
            "entities": [
                {{"text": "原始文本内容", "type": "项目名"}},
                ...
            ]
        }}

        请严格按照JSON格式输出结果。
    """)
    return prompt.format(text=text)


def get_ner_case_number_prompt(text: str) -> str:
    """
    Returns a prompt that generates a mapping of original case numbers to their masked versions.
    """
    prompt = textwrap.dedent("""
        你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。

        实体类别包括:
        - 案号

        待处理文本:
        {text}

        输出格式:
        {{
            "entities": [
                {{"text": "原始文本内容", "type": "案号"}},
                ...
            ]
        }}

        请严格按照JSON格式输出结果。
    """)
    return prompt.format(text=text)


def get_entity_linkage_prompt(entities_text: str) -> str:
    """
    Returns a prompt that identifies related entities and groups them together.

    Args:
        entities_text (str): The list of entities to be analyzed for linkage

    Returns:
        str: The formatted prompt that will generate entity linkage information
    """
    prompt = textwrap.dedent("""
        你是一个专业的法律文本实体关联分析助手。请分析以下实体列表,识别出相互关联的实体(如全称与简称、中文名与英文名等),并将它们分组。

        关联规则:
        1. 公司名称关联:
           - 全称与简称(如:"阿里巴巴集团控股有限公司" 与 "阿里巴巴")
           - 中文名与英文名(如:"腾讯科技有限公司" 与 "Tencent Technology Ltd.")
           - 母公司与子公司(如:"腾讯" 与 "腾讯音乐")

        2. 每个组中应指定一个主要实体(is_primary: true),通常是:
           - 对于公司:选择最正式的全称
           - 对于人名:选择最常用的称呼

        待分析实体列表:
        {entities_text}

        输出格式:
        {{
            "entity_groups": [
                {{
                    "group_id": "group_1",
                    "group_type": "公司名称",
                    "entities": [
                        {{
                            "text": "阿里巴巴集团控股有限公司",
                            "type": "公司名称",
                            "is_primary": true
                        }},
                        {{
                            "text": "阿里巴巴",
                            "type": "公司名称简称",
                            "is_primary": false
                        }}
                    ]
                }}
            ]
        }}

        注意事项:
        1. 只对确实有关联的实体进行分组
        2. 每个实体只能属于一个组
        3. 每个组必须有且仅有一个主要实体(is_primary: true)
        4. 如果实体之间没有明显关联,不要强制分组
        5. group_type 应该是 "公司名称"

        请严格按照JSON格式输出结果。
    """)
    return prompt.format(entities_text=entities_text)
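One detail in the prompt builders above is easy to miss: the braces of the JSON output template are doubled (`{{` / `}}`) so that `str.format` leaves them intact while still substituting the `{text}` placeholder. A minimal sketch of the same pattern (the template here is abbreviated, not the production prompt):

```python
import textwrap

# Doubled braces survive str.format as literal braces; {text} is substituted.
template = textwrap.dedent("""
    待处理文本:
    {text}

    输出格式:
    {{
        "entities": [
            {{"text": "原始文本内容", "type": "案号"}}
        ]
    }}
""")

prompt = template.format(text="(2022)京03民终3852号")
```

Forgetting to double even one brace raises `KeyError` or `IndexError` at `.format` time, so this is worth a unit test whenever a prompt template changes.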
@@ -1,12 +1,12 @@
 import logging
-from document_handlers.document_factory import DocumentProcessorFactory
-from services.ollama_client import OllamaClient
+from ..document_handlers.document_factory import DocumentProcessorFactory
+from ..services.ollama_client import OllamaClient

 logger = logging.getLogger(__name__)

 class DocumentService:
-    def __init__(self, ollama_client: OllamaClient):
-        self.ollama_client = ollama_client
+    def __init__(self):
+        pass

     def process_document(self, input_path: str, output_path: str) -> bool:
         try:
@@ -0,0 +1,240 @@
import logging
from typing import Any, Dict, Optional
from jsonschema import validate, ValidationError

logger = logging.getLogger(__name__)


class LLMResponseValidator:
    """Validator for LLM JSON responses with different schemas for different entity types"""

    # Schema for basic entity extraction responses
    ENTITY_EXTRACTION_SCHEMA = {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "text": {"type": "string"},
                        "type": {"type": "string"}
                    },
                    "required": ["text", "type"]
                }
            }
        },
        "required": ["entities"]
    }

    # Schema for entity linkage responses
    ENTITY_LINKAGE_SCHEMA = {
        "type": "object",
        "properties": {
            "entity_groups": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "group_id": {"type": "string"},
                        "group_type": {"type": "string"},
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "text": {"type": "string"},
                                    "type": {"type": "string"},
                                    "is_primary": {"type": "boolean"}
                                },
                                "required": ["text", "type", "is_primary"]
                            }
                        }
                    },
                    "required": ["group_id", "group_type", "entities"]
                }
            }
        },
        "required": ["entity_groups"]
    }

    # Schema for regex-based entity extraction (from entity_regex.py)
    REGEX_ENTITY_SCHEMA = {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "text": {"type": "string"},
                        "type": {"type": "string"}
                    },
                    "required": ["text", "type"]
                }
            }
        },
        "required": ["entities"]
    }

    @classmethod
    def validate_entity_extraction(cls, response: Dict[str, Any]) -> bool:
        """
        Validate entity extraction response from LLM.

        Args:
            response: The parsed JSON response from LLM

        Returns:
            bool: True if valid, False otherwise
        """
        try:
            validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
            logger.debug(f"Entity extraction validation passed for response: {response}")
            return True
        except ValidationError as e:
            logger.warning(f"Entity extraction validation failed: {e}")
            logger.warning(f"Response that failed validation: {response}")
            return False

    @classmethod
    def validate_entity_linkage(cls, response: Dict[str, Any]) -> bool:
        """
        Validate entity linkage response from LLM.

        Args:
            response: The parsed JSON response from LLM

        Returns:
            bool: True if valid, False otherwise
        """
        try:
            validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
            content_valid = cls._validate_linkage_content(response)
            if content_valid:
                logger.debug(f"Entity linkage validation passed for response: {response}")
                return True
            else:
                logger.warning(f"Entity linkage content validation failed for response: {response}")
                return False
        except ValidationError as e:
            logger.warning(f"Entity linkage validation failed: {e}")
            logger.warning(f"Response that failed validation: {response}")
            return False

    @classmethod
    def validate_regex_entity(cls, response: Dict[str, Any]) -> bool:
        """
        Validate regex-based entity extraction response.

        Args:
            response: The parsed JSON response from regex extractors

        Returns:
            bool: True if valid, False otherwise
        """
        try:
            validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
            logger.debug(f"Regex entity validation passed for response: {response}")
            return True
        except ValidationError as e:
            logger.warning(f"Regex entity validation failed: {e}")
            logger.warning(f"Response that failed validation: {response}")
            return False

    @classmethod
    def _validate_linkage_content(cls, response: Dict[str, Any]) -> bool:
        """
        Additional content validation for entity linkage responses.

        Args:
            response: The parsed JSON response from LLM

        Returns:
            bool: True if content is valid, False otherwise
        """
        entity_groups = response.get('entity_groups', [])

        for group in entity_groups:
            # Validate group type
            group_type = group.get('group_type', '')
            if group_type not in ['公司名称', '人名']:
                logger.warning(f"Invalid group_type: {group_type}")
                return False

            # Validate entities in group
            entities = group.get('entities', [])
            if not entities:
                logger.warning("Empty entity group found")
                return False

            # Check that exactly one entity is marked as primary
            primary_count = sum(1 for entity in entities if entity.get('is_primary', False))
            if primary_count != 1:
                logger.warning(f"Group must have exactly one primary entity, found {primary_count}")
                return False

            # Validate entity types within group
            for entity in entities:
                entity_type = entity.get('type', '')
                if group_type == '公司名称' and not any(keyword in entity_type for keyword in ['公司', 'Company']):
                    logger.warning(f"Company group contains non-company entity: {entity_type}")
                    return False
                elif group_type == '人名' and not any(keyword in entity_type for keyword in ['人名', '英文人名']):
                    logger.warning(f"Person group contains non-person entity: {entity_type}")
                    return False

        return True

    @classmethod
    def validate_response_by_type(cls, response: Dict[str, Any], response_type: str) -> bool:
        """
        Generic validator that routes to appropriate validation method based on response type.

        Args:
            response: The parsed JSON response from LLM
            response_type: Type of response ('entity_extraction', 'entity_linkage', 'regex_entity')

        Returns:
            bool: True if valid, False otherwise
        """
        validators = {
            'entity_extraction': cls.validate_entity_extraction,
            'entity_linkage': cls.validate_entity_linkage,
            'regex_entity': cls.validate_regex_entity
        }

        validator = validators.get(response_type)
        if not validator:
            logger.error(f"Unknown response type: {response_type}")
            return False

        return validator(response)

    @classmethod
    def get_validation_errors(cls, response: Dict[str, Any], response_type: str) -> Optional[str]:
        """
        Get detailed validation errors for debugging.

        Args:
            response: The parsed JSON response from LLM
            response_type: Type of response

        Returns:
            Optional[str]: Error message or None if valid
        """
        try:
            if response_type == 'entity_extraction':
                validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
            elif response_type == 'entity_linkage':
                validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
                if not cls._validate_linkage_content(response):
                    return "Content validation failed for entity linkage"
            elif response_type == 'regex_entity':
                validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
            else:
                return f"Unknown response type: {response_type}"

            return None
        except ValidationError as e:
            return f"Schema validation error: {e}"
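Beyond the jsonschema shape checks, `_validate_linkage_content` enforces a semantic rule the schema cannot express: every group must carry exactly one entity flagged `is_primary: true`. A dependency-free sketch of just that rule (the function name is illustrative):

```python
def has_single_primary(group: dict) -> bool:
    # Exactly one entity per group may be flagged as the primary name;
    # that entity's masked form is reused for all of its aliases.
    entities = group.get("entities", [])
    return sum(1 for e in entities if e.get("is_primary", False)) == 1

good = {"group_type": "公司名称", "entities": [
    {"text": "阿里巴巴集团控股有限公司", "type": "公司名称", "is_primary": True},
    {"text": "阿里巴巴", "type": "公司名称简称", "is_primary": False},
]}
bad = {"group_type": "公司名称", "entities": [
    {"text": "腾讯", "type": "公司名称简称", "is_primary": False},
]}
```

Splitting validation this way keeps the schema reusable while the content checks stay in plain, testable Python.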
@@ -0,0 +1,33 @@
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from .core.config import settings
from .api.endpoints import files
from .core.database import engine, Base

# Create database tables
Base.metadata.create_all(bind=engine)

app = FastAPI(
    title=settings.PROJECT_NAME,
    openapi_url=f"{settings.API_V1_STR}/openapi.json"
)

# Set up CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production, replace with specific origins
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Include routers
app.include_router(
    files.router,
    prefix=f"{settings.API_V1_STR}/files",
    tags=["files"]
)

@app.get("/")
async def root():
    return {"message": "Welcome to Legal Document Masker API"}
@@ -0,0 +1,22 @@
from sqlalchemy import Column, String, DateTime, Text
from datetime import datetime
import uuid
from ..core.database import Base

class FileStatus(str):
    NOT_STARTED = "not_started"
    PROCESSING = "processing"
    SUCCESS = "success"
    FAILED = "failed"

class File(Base):
    __tablename__ = "files"

    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    filename = Column(String(255), nullable=False)
    original_path = Column(String(255), nullable=False)
    processed_path = Column(String(255))
    status = Column(String(20), nullable=False, default=FileStatus.NOT_STARTED)
    error_message = Column(Text)
    created_at = Column(DateTime, nullable=False, default=datetime.utcnow)
    updated_at = Column(DateTime, nullable=False, default=datetime.utcnow, onupdate=datetime.utcnow)
@@ -0,0 +1,21 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional
from uuid import UUID

class FileBase(BaseModel):
    filename: str
    status: str
    error_message: Optional[str] = None

class FileResponse(FileBase):
    id: UUID
    created_at: datetime
    updated_at: datetime

    class Config:
        from_attributes = True

class FileList(BaseModel):
    files: list[FileResponse]
    total: int
@@ -0,0 +1,87 @@
from celery import Celery
from ..core.config import settings
from ..models.file import File, FileStatus
from sqlalchemy.orm import Session
from ..core.database import SessionLocal
import sys
import os
from ..core.services.document_service import DocumentService
from pathlib import Path
from fastapi import HTTPException


celery = Celery(
    'file_service',
    broker=settings.CELERY_BROKER_URL,
    backend=settings.CELERY_RESULT_BACKEND
)

def delete_file(file_id: str):
    """
    Delete a file and its associated records.
    This will:
    1. Delete the database record
    2. Delete the original uploaded file
    3. Delete the processed markdown file (if it exists)
    """
    db = SessionLocal()
    try:
        # Get the file record
        file = db.query(File).filter(File.id == file_id).first()
        if not file:
            raise HTTPException(status_code=404, detail="File not found")

        # Delete the original file if it exists
        if file.original_path and os.path.exists(file.original_path):
            os.remove(file.original_path)

        # Delete the processed file if it exists
        if file.processed_path and os.path.exists(file.processed_path):
            os.remove(file.processed_path)

        # Delete the database record
        db.delete(file)
        db.commit()

    except HTTPException:
        # Let the 404 above propagate as-is instead of being rewrapped as a 500
        db.rollback()
        raise
    except Exception as e:
        db.rollback()
        raise HTTPException(status_code=500, detail=f"Error deleting file: {str(e)}")
    finally:
        db.close()

@celery.task
def process_file(file_id: str):
    db = SessionLocal()
    try:
        file = db.query(File).filter(File.id == file_id).first()
        if not file:
            return

        # Update status to processing
        file.status = FileStatus.PROCESSING
        db.commit()

        try:
            # Process the file using your existing masking system
            process_service = DocumentService()

            # Determine output path using file_id with .md extension
            output_filename = f"{file_id}.md"
            output_path = str(settings.PROCESSED_FOLDER / output_filename)

            # Process document with both input and output paths
            process_service.process_document(file.original_path, output_path)

            # Update file record with processed path
            file.processed_path = output_path
            file.status = FileStatus.SUCCESS
            db.commit()

        except Exception as e:
            file.status = FileStatus.FAILED
            file.error_message = str(e)
            db.commit()
            raise

    finally:
        db.close()
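The `process_file` task above drives a simple status machine on the `File` row: `not_started → processing → success | failed`. Stripped of Celery and SQLAlchemy, the transition logic looks roughly like this in-memory sketch (names are illustrative; the real task also re-raises so Celery records the failure):

```python
NOT_STARTED, PROCESSING, SUCCESS, FAILED = "not_started", "processing", "success", "failed"

def run_task(record: dict, process) -> None:
    # Mirror of process_file's status handling: mark processing up front,
    # then record success or capture the failure message.
    record["status"] = PROCESSING
    try:
        process(record)
        record["status"] = SUCCESS
    except Exception as e:
        record["status"] = FAILED
        record["error_message"] = str(e)

def failing(record):
    raise ValueError("no markdown content")

ok = {"status": NOT_STARTED}
run_task(ok, lambda r: None)        # ok["status"] == "success"

broken = {"status": NOT_STARTED}
run_task(broken, failing)           # broken["status"] == "failed"
```

Committing the `PROCESSING` state before the work starts is what lets the frontend poll `/api/v1/files` and show in-flight documents.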
@@ -0,0 +1,37 @@
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./storage:/app/storage
      - ./legal_doc_masker.db:/app/legal_doc_masker.db
    env_file:
      - .env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
    depends_on:
      - redis

  celery_worker:
    build: .
    command: celery -A app.services.file_service worker --loglevel=info
    volumes:
      - ./storage:/app/storage
      - ./legal_doc_masker.db:/app/legal_doc_masker.db
    env_file:
      - .env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
    depends_on:
      - redis
      - api

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
@@ -0,0 +1,127 @@
[2025-07-14 14:20:19,015: INFO/ForkPoolWorker-4] Raw response from LLM: {
celery_worker-1 |     "entities": []
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:20:19,016: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
celery_worker-1 | [2025-07-14 14:20:19,020: INFO/ForkPoolWorker-4] Calling ollama to generate case numbers mapping for chunk (attempt 1/3):
celery_worker-1 | 你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
celery_worker-1 |
celery_worker-1 | 实体类别包括:
celery_worker-1 | - 案号
celery_worker-1 |
celery_worker-1 | 待处理文本:
celery_worker-1 |
celery_worker-1 |
celery_worker-1 | 二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
celery_worker-1 |
celery_worker-1 | 29. 本判决为终审判决。
celery_worker-1 |
celery_worker-1 | 审 判 长 史晓霞审 判 员 邓青菁审 判 员 李 淼二〇二二年七月七日法 官 助 理 黎 铧书 记 员 郑海兴
celery_worker-1 |
celery_worker-1 | 输出格式:
celery_worker-1 | {
celery_worker-1 |     "entities": [
celery_worker-1 |         {"text": "原始文本内容", "type": "案号"},
celery_worker-1 |         ...
celery_worker-1 |     ]
celery_worker-1 | }
celery_worker-1 |
celery_worker-1 | 请严格按照JSON格式输出结果。
celery_worker-1 |
api-1 | INFO: 192.168.65.1:60045 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:22084 - "GET /api/v1/files/files HTTP/1.1" 200 OK
celery_worker-1 | [2025-07-14 14:20:31,279: INFO/ForkPoolWorker-4] Raw response from LLM: {
celery_worker-1 |     "entities": []
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:20:31,281: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
celery_worker-1 | [2025-07-14 14:20:31,287: INFO/ForkPoolWorker-4] Chunk mapping: [{'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Final chunk mappings: [{'entities': [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}]}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}]}, {'entities': [{'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}]}, {'entities': [{'text': '服务合同', 'type': '项目名'}]}, {'entities': [{'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}]}, {'entities': [{'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}]}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}]}, {'entities': [{'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}]}, {'entities': [{'text': '《计算机设备采购合同》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': [{'text': '《服务合同书》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '中研智创公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Merged 22 unique entities
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Unique entities: [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
celery_worker-1 | [2025-07-14 14:20:31,289: INFO/ForkPoolWorker-4] Calling ollama to generate entity linkage (attempt 1/3)
api-1 | INFO: 192.168.65.1:52168 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61426 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:30702 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:48159 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:16860 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:21262 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:45564 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:32142 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:27769 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:21196 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
celery_worker-1 | [2025-07-14 14:21:21,436: INFO/ForkPoolWorker-4] Raw entity linkage response from LLM: {
|
||||
celery_worker-1 | "entity_groups": [
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "group_id": "group_1",
|
||||
celery_worker-1 | "group_type": "公司名称",
|
||||
celery_worker-1 | "entities": [
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "北京丰复久信营销科技有限公司",
|
||||
celery_worker-1 | "type": "公司名称",
|
||||
celery_worker-1 | "is_primary": true
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "丰复久信公司",
|
||||
celery_worker-1 | "type": "公司名称简称",
|
||||
celery_worker-1 | "is_primary": false
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "丰复久信",
|
||||
celery_worker-1 | "type": "公司名称简称",
|
||||
celery_worker-1 | "is_primary": false
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | ]
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "group_id": "group_2",
|
||||
celery_worker-1 | "group_type": "公司名称",
|
||||
celery_worker-1 | "entities": [
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "中研智创区块链技术有限公司",
|
||||
celery_worker-1 | "type": "公司名称",
|
||||
celery_worker-1 | "is_primary": true
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "中研智创公司",
|
||||
celery_worker-1 | "type": "公司名称简称",
|
||||
celery_worker-1 | "is_primary": false
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "中研智创",
|
||||
celery_worker-1 | "type": "公司名称简称",
|
||||
celery_worker-1 | "is_primary": false
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | ]
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | ]
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | [2025-07-14 14:21:21,437: INFO/ForkPoolWorker-4] Parsed entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
|
||||
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Successfully created entity linkage with 2 groups
|
||||
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Generated masked mapping for 22 entities
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Combined mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司甲', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '(2020)京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司丁', '丰复久信': '某公司戊', '中研智创': '某公司己', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '北京丰复久信营销科技有限公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信' to '北京丰复久信营销科技有限公司' with masked name '某公司'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创区块链技术有限公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创' to '中研智创区块链技术有限公司' with masked name '某公司乙'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Final mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '(2020)京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司乙', '丰复久信': '某公司', '中研智创': '某公司乙', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Successfully masked content
|
||||
celery_worker-1 | [2025-07-14 14:21:21,449: INFO/ForkPoolWorker-4] Successfully saved masked content to /app/storage/processed/47522ea9-c259-4304-bfe4-1d3ed6902ede.md
|
||||
celery_worker-1 | [2025-07-14 14:21:21,470: INFO/ForkPoolWorker-4] Task app.services.file_service.process_file[5cfbca4c-0f6f-4c71-a66b-b22ee2d28139] succeeded in 311.847165101s: None
|
||||
api-1 | INFO: 192.168.65.1:33432 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:40073 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:29550 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:61350 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:61755 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:63726 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:43446 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:45624 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:25256 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:43464 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
|
|
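The masked names in the log above follow a visible pattern: the first entity of a kind gets the bare prefix (某), the next ten get the ten heavenly stems (某甲 through 某癸), and later ones fall back to numeric suffixes (某11, 某12, …). A minimal sketch that reproduces this suffix sequence — not the project's actual implementation, just an illustration of the pattern in the log:

```python
# Heavenly-stem suffixes, as seen in the worker log (某, 某甲 … 某癸, 某11 …).
HEAVENLY_STEMS = "甲乙丙丁戊己庚辛壬癸"


def masked_name(prefix: str, index: int) -> str:
    """Return the masked placeholder for the index-th entity (0-based)."""
    if index == 0:
        return prefix                       # first entity: bare prefix, e.g. 某
    if index <= len(HEAVENLY_STEMS):
        return prefix + HEAVENLY_STEMS[index - 1]  # 某甲 … 某癸
    return prefix + str(index)              # beyond the stems: 某11, 某12, …


print([masked_name("某", i) for i in (0, 1, 10, 11)])  # → ['某', '某甲', '某癸', '某11']
```

This matches the log's Combined mapping, where 布兰登·斯密特 (the 11th entity of its kind) becomes 某癸 and 上海市 (the 12th) becomes 某11.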
@ -0,0 +1,6 @@
{
  "name": "backend",
  "lockfileVersion": 3,
  "requires": true,
  "packages": {}
}
@ -0,0 +1,32 @@
# FastAPI and server
fastapi>=0.104.0
uvicorn>=0.24.0
python-multipart>=0.0.6
websockets>=12.0

# Database
sqlalchemy>=2.0.0
alembic>=1.12.0

# Background tasks
celery>=5.3.0
redis>=5.0.0

# Security
python-jose[cryptography]>=3.3.0
passlib[bcrypt]>=1.7.4
python-dotenv>=1.0.0

# Testing
pytest>=7.4.0
httpx>=0.25.0

# Existing project dependencies
pydantic-settings>=2.0.0
watchdog==2.1.6
requests==2.28.1
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0
# magic-pdf[full]
jsonschema>=4.20.0
@ -0,0 +1,62 @@
import pytest
from app.core.document_handlers.ner_processor import NerProcessor


def test_generate_masked_mapping():
    processor = NerProcessor()
    unique_entities = [
        {'text': '李雷', 'type': '人名'},
        {'text': '李明', 'type': '人名'},
        {'text': '王强', 'type': '人名'},
        {'text': 'Acme Manufacturing Inc.', 'type': '英文公司名', 'industry': 'manufacturing'},
        {'text': 'Google LLC', 'type': '英文公司名'},
        {'text': 'A公司', 'type': '公司名称'},
        {'text': 'B公司', 'type': '公司名称'},
        {'text': 'John Smith', 'type': '英文人名'},
        {'text': 'Elizabeth Windsor', 'type': '英文人名'},
        {'text': '华梦龙光伏项目', 'type': '项目名'},
        {'text': '案号12345', 'type': '案号'},
        {'text': '310101198802080000', 'type': '身份证号'},
        {'text': '9133021276453538XT', 'type': '社会信用代码'},
    ]
    linkage = {
        'entity_groups': [
            {
                'group_id': 'g1',
                'group_type': '公司名称',
                'entities': [
                    {'text': 'A公司', 'type': '公司名称', 'is_primary': True},
                    {'text': 'B公司', 'type': '公司名称', 'is_primary': False},
                ]
            },
            {
                'group_id': 'g2',
                'group_type': '人名',
                'entities': [
                    {'text': '李雷', 'type': '人名', 'is_primary': True},
                    {'text': '李明', 'type': '人名', 'is_primary': False},
                ]
            }
        ]
    }
    mapping = processor._generate_masked_mapping(unique_entities, linkage)
    # Person names: surname kept, rest replaced with 某
    assert mapping['李雷'].startswith('李某')
    assert mapping['李明'].startswith('李某')
    assert mapping['王强'].startswith('王某')
    # English company names
    assert mapping['Acme Manufacturing Inc.'] == 'MANUFACTURING'
    assert mapping['Google LLC'] == 'COMPANY'
    # Company names in the same linkage group share one masked name
    assert mapping['A公司'] == mapping['B公司']
    assert mapping['A公司'].endswith('公司')
    # English person names
    assert mapping['John Smith'] == 'J*** S***'
    assert mapping['Elizabeth Windsor'] == 'E*** W***'
    # Project names
    assert mapping['华梦龙光伏项目'].endswith('项目')
    # Case numbers
    assert mapping['案号12345'] == '***'
    # ID card numbers
    assert mapping['310101198802080000'] == 'XXXXXX'
    # Unified social credit codes
    assert mapping['9133021276453538XT'] == 'XXXXXXXX'
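The assertions above can be read as a specification of the masking rules. A standalone sketch of two of those rules — hypothetical helpers, not the project's `NerProcessor` implementation (which, per the `startswith` assertions, may also append a disambiguating suffix to Chinese names):

```python
def mask_chinese_name(name: str) -> str:
    # Keep the surname (first character), replace the rest with 某.
    return name[0] + "某"


def mask_english_name(name: str) -> str:
    # Keep each word's initial and star out the rest: "John Smith" -> "J*** S***".
    return " ".join(part[0] + "***" for part in name.split())


print(mask_chinese_name("李雷"))          # → 李某
print(mask_english_name("John Smith"))   # → J*** S***
```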
@ -1,2 +0,0 @@
rm ./doc_src/*.md
cp ./doc/*.md ./doc_src/
@ -0,0 +1,105 @@
version: '3.8'

services:
  # Mineru API Service
  mineru-api:
    build:
      context: ./mineru
      dockerfile: Dockerfile
    platform: linux/arm64
    ports:
      - "8001:8000"
    volumes:
      - ./mineru/storage/uploads:/app/storage/uploads
      - ./mineru/storage/processed:/app/storage/processed
    environment:
      - PYTHONUNBUFFERED=1
      - MINERU_MODEL_SOURCE=local
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    networks:
      - app-network

  # Backend API Service
  backend-api:
    build:
      context: ./backend
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    volumes:
      - ./backend/storage:/app/storage
      - ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
    env_file:
      - ./backend/.env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
      - MINERU_API_URL=http://mineru-api:8000
    depends_on:
      - redis
      - mineru-api
    networks:
      - app-network

  # Celery Worker
  celery-worker:
    build:
      context: ./backend
      dockerfile: Dockerfile
    command: celery -A app.services.file_service worker --loglevel=info
    volumes:
      - ./backend/storage:/app/storage
      - ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
    env_file:
      - ./backend/.env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
      - MINERU_API_URL=http://mineru-api:8000
    depends_on:
      - redis
      - backend-api
    networks:
      - app-network

  # Redis Service
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
    networks:
      - app-network

  # Frontend Service
  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
      args:
        - REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
    ports:
      - "3000:80"
    env_file:
      - ./frontend/.env
    environment:
      - NODE_ENV=production
      - REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
    restart: unless-stopped
    depends_on:
      - backend-api
    networks:
      - app-network

networks:
  app-network:
    driver: bridge

volumes:
  uploads:
  processed:
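Inside the `app-network` bridge, the compose file injects service-name URLs (`http://mineru-api:8000`, `redis://redis:6379/0`) via environment variables. A minimal sketch of how a Python service can pick these up, with localhost fallbacks for running outside Docker (the fallback values are assumptions, not taken from the project's config):

```python
import os

# Compose injects these; the defaults below are assumed host-side fallbacks
# (note the published host port 8001 vs. the in-network container port 8000).
MINERU_API_URL = os.environ.get("MINERU_API_URL", "http://localhost:8001")
CELERY_BROKER_URL = os.environ.get("CELERY_BROKER_URL", "redis://localhost:6379/0")

print(MINERU_API_URL, CELERY_BROKER_URL)
```

The same pattern explains why the frontend uses `http://localhost:8000/api/v1`: the browser runs outside the compose network, so it must use the host-published port rather than a service name.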
@ -0,0 +1,168 @@
#!/bin/bash

# Docker Image Export Script
# Exports all project Docker images for migration to another environment

set -e

echo "🚀 Legal Document Masker - Docker Image Export"
echo "=============================================="

# Function to check if Docker is running
check_docker() {
    if ! docker info > /dev/null 2>&1; then
        echo "❌ Docker is not running. Please start Docker and try again."
        exit 1
    fi
    echo "✅ Docker is running"
}

# Function to check if images exist
check_images() {
    echo "🔍 Checking for required images..."

    local missing_images=()

    if ! docker images | grep -q "legal-doc-masker-backend-api"; then
        missing_images+=("legal-doc-masker-backend-api")
    fi

    if ! docker images | grep -q "legal-doc-masker-frontend"; then
        missing_images+=("legal-doc-masker-frontend")
    fi

    if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
        missing_images+=("legal-doc-masker-mineru-api")
    fi

    if ! docker images | grep -q "redis:alpine"; then
        missing_images+=("redis:alpine")
    fi

    if [ ${#missing_images[@]} -ne 0 ]; then
        echo "❌ Missing images: ${missing_images[*]}"
        echo "Please build the images first using: docker-compose build"
        exit 1
    fi

    echo "✅ All required images found"
}

# Function to create export directory
# Status output goes to stderr so that command substitution in main()
# captures only the directory name.
create_export_dir() {
    local export_dir="docker-images-export-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$export_dir"
    cd "$export_dir"
    echo "📁 Created export directory: $export_dir" >&2
    echo "$export_dir"
}

# Function to export images
export_images() {
    local export_dir="$1"

    echo "📦 Exporting Docker images..."

    # Export backend image
    echo "  📦 Exporting backend-api image..."
    docker save legal-doc-masker-backend-api:latest -o backend-api.tar

    # Export frontend image
    echo "  📦 Exporting frontend image..."
    docker save legal-doc-masker-frontend:latest -o frontend.tar

    # Export mineru image
    echo "  📦 Exporting mineru-api image..."
    docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar

    # Export redis image
    echo "  📦 Exporting redis image..."
    docker save redis:alpine -o redis.tar

    echo "✅ All images exported successfully!"
}

# Function to show export summary
show_summary() {
    echo ""
    echo "📊 Export Summary:"
    echo "=================="
    ls -lh *.tar

    echo ""
    echo "📋 Files to transfer:"
    echo "===================="
    for file in *.tar; do
        echo "  - $file"
    done

    echo ""
    echo "💾 Total size: $(du -sh . | cut -f1)"
}

# Function to create compressed archive
create_archive() {
    echo ""
    echo "🗜️ Creating compressed archive..."

    local archive_name="legal-doc-masker-images-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$archive_name" *.tar

    echo "✅ Created archive: $archive_name"
    echo "📊 Archive size: $(du -sh "$archive_name" | cut -f1)"

    echo ""
    echo "📋 Transfer options:"
    echo "==================="
    echo "1. Transfer individual .tar files"
    echo "2. Transfer compressed archive: $archive_name"
}

# Function to show transfer instructions
show_transfer_instructions() {
    echo ""
    echo "📤 Transfer Instructions:"
    echo "========================"
    echo ""
    echo "Option 1: Transfer individual files"
    echo "-----------------------------------"
    echo "scp *.tar user@target-server:/path/to/destination/"
    echo ""
    echo "Option 2: Transfer compressed archive"
    echo "-------------------------------------"
    echo "scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/"
    echo ""
    echo "Option 3: USB Drive"
    echo "-------------------"
    echo "cp *.tar /Volumes/USB_DRIVE/docker-images/"
    echo "cp legal-doc-masker-images-*.tar.gz /Volumes/USB_DRIVE/"
    echo ""
    echo "Option 4: Cloud Storage"
    echo "----------------------"
    echo "aws s3 cp *.tar s3://your-bucket/docker-images/"
    echo "aws s3 cp legal-doc-masker-images-*.tar.gz s3://your-bucket/docker-images/"
}

# Main execution
main() {
    check_docker
    check_images

    local export_dir=$(create_export_dir)
    export_images "$export_dir"
    show_summary
    create_archive
    show_transfer_instructions

    echo ""
    echo "🎉 Export completed successfully!"
    echo "📁 Export location: $(pwd)"
    echo ""
    echo "Next steps:"
    echo "1. Transfer the files to your target environment"
    echo "2. Use import-images.sh on the target environment"
    echo "3. Copy docker-compose.yml and other config files"
}

# Run main function
main "$@"
@ -0,0 +1,11 @@
node_modules
npm-debug.log
build
.git
.gitignore
README.md
.env
.env.local
.env.development.local
.env.test.local
.env.production.local
@ -0,0 +1,2 @@
# REACT_APP_API_BASE_URL=http://192.168.2.203:8000/api/v1
REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
@ -0,0 +1,33 @@
# Build stage
FROM node:18-alpine as build

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies
RUN npm ci

# Copy source code
COPY . .

# Build the app with environment variables
ARG REACT_APP_API_BASE_URL
ENV REACT_APP_API_BASE_URL=$REACT_APP_API_BASE_URL
RUN npm run build

# Production stage
FROM nginx:alpine

# Copy built assets from build stage
COPY --from=build /app/build /usr/share/nginx/html

# Copy nginx configuration
COPY nginx.conf /etc/nginx/conf.d/default.conf

# Expose port 80
EXPOSE 80

# Start nginx
CMD ["nginx", "-g", "daemon off;"]
@ -0,0 +1,55 @@
# Legal Document Masker Frontend

This is the frontend application for the Legal Document Masker service. It provides a user interface for uploading legal documents, monitoring their processing status, and downloading the masked versions.

## Features

- Drag-and-drop file upload
- Real-time status updates
- File list with processing status
- Multi-file selection and download
- Modern Material-UI interface

## Prerequisites

- Node.js (v14 or higher)
- npm (v6 or higher)

## Installation

1. Install dependencies:
```bash
npm install
```

2. Start the development server:
```bash
npm start
```

The application will be available at http://localhost:3000.

## Development

The frontend is built with:
- React 18
- TypeScript
- Material-UI
- React Query for data fetching
- React Dropzone for file uploads

## Building for Production

To create a production build:

```bash
npm run build
```

The build artifacts will be stored in the `build/` directory.

## Environment Variables

The following environment variables can be configured:

- `REACT_APP_API_BASE_URL`: The URL of the backend API (default: http://localhost:8000/api/v1)
@ -0,0 +1,24 @@
version: '3.8'

services:
  frontend:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        - REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
    ports:
      - "3000:80"
    env_file:
      - .env
    environment:
      - NODE_ENV=production
      - REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
    restart: unless-stopped
    networks:
      - app-network

networks:
  app-network:
    driver: bridge
@ -0,0 +1,25 @@
server {
    listen 80;
    server_name localhost;

    location / {
        root /usr/share/nginx/html;
        index index.html;
        try_files $uri $uri/ /index.html;
    }

    # Cache static assets
    location /static/ {
        root /usr/share/nginx/html;
        expires 1y;
        add_header Cache-Control "public, no-transform";
    }

    # Enable gzip compression
    gzip on;
    gzip_vary on;
    gzip_min_length 10240;
    gzip_proxied expired no-cache no-store private auth;
    gzip_types text/plain text/css text/xml text/javascript application/x-javascript application/xml application/javascript;
    gzip_disable "MSIE [1-6]\.";
}
File diff suppressed because it is too large
@ -0,0 +1,50 @@
{
  "name": "legal-doc-masker-frontend",
  "version": "0.1.0",
  "private": true,
  "dependencies": {
    "@emotion/react": "^11.11.3",
    "@emotion/styled": "^11.11.0",
    "@mui/icons-material": "^5.15.10",
    "@mui/material": "^5.15.10",
    "@testing-library/jest-dom": "^5.17.0",
    "@testing-library/react": "^13.4.0",
    "@testing-library/user-event": "^13.5.0",
    "@types/jest": "^27.5.2",
    "@types/node": "^16.18.80",
    "@types/react": "^18.2.55",
    "@types/react-dom": "^18.2.19",
    "axios": "^1.6.7",
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-dropzone": "^14.2.3",
    "react-query": "^3.39.3",
    "react-scripts": "5.0.1",
    "typescript": "^4.9.5",
    "web-vitals": "^2.1.4"
  },
  "scripts": {
    "start": "react-scripts start",
    "build": "react-scripts build",
    "test": "react-scripts test",
    "eject": "react-scripts eject"
  },
  "eslintConfig": {
    "extends": [
      "react-app",
      "react-app/jest"
    ]
  },
  "browserslist": {
    "production": [
      ">0.2%",
      "not dead",
      "not op_mini all"
    ],
    "development": [
      "last 1 chrome version",
      "last 1 firefox version",
      "last 1 safari version"
    ]
  }
}
@ -0,0 +1,20 @@
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <link rel="icon" href="%PUBLIC_URL%/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <meta name="theme-color" content="#000000" />
    <meta
      name="description"
      content="Legal Document Masker - Upload and process legal documents"
    />
    <link rel="apple-touch-icon" href="%PUBLIC_URL%/logo192.png" />
    <link rel="manifest" href="%PUBLIC_URL%/manifest.json" />
    <title>Legal Document Masker</title>
  </head>
  <body>
    <noscript>You need to enable JavaScript to run this app.</noscript>
    <div id="root"></div>
  </body>
</html>
@ -0,0 +1,15 @@
{
  "short_name": "Legal Doc Masker",
  "name": "Legal Document Masker",
  "icons": [
    {
      "src": "favicon.ico",
      "sizes": "64x64 32x32 24x24 16x16",
      "type": "image/x-icon"
    }
  ],
  "start_url": ".",
  "display": "standalone",
  "theme_color": "#000000",
  "background_color": "#ffffff"
}
@ -0,0 +1,58 @@
import React, { useEffect, useState } from 'react';
import { Container, Typography, Box } from '@mui/material';
import { useQuery, useQueryClient } from 'react-query';
import FileUpload from './components/FileUpload';
import FileList from './components/FileList';
import { File } from './types/file';
import { api } from './services/api';

function App() {
  const queryClient = useQueryClient();
  const [files, setFiles] = useState<File[]>([]);

  const { data, isLoading, error } = useQuery<File[]>('files', api.listFiles, {
    refetchInterval: 5000, // Poll every 5 seconds
  });

  useEffect(() => {
    if (data) {
      setFiles(data);
    }
  }, [data]);

  const handleUploadComplete = () => {
    queryClient.invalidateQueries('files');
  };

  if (isLoading) {
    return (
      <Container>
        <Typography>Loading...</Typography>
      </Container>
    );
  }

  if (error) {
    return (
      <Container>
        <Typography color="error">Error loading files</Typography>
      </Container>
    );
  }

  return (
    <Container maxWidth="lg">
      <Box sx={{ my: 4 }}>
        <Typography variant="h4" component="h1" gutterBottom>
          Legal Document Masker
        </Typography>
        <Box sx={{ mb: 4 }}>
          <FileUpload onUploadComplete={handleUploadComplete} />
        </Box>
        <FileList files={files} onFileStatusChange={handleUploadComplete} />
      </Box>
    </Container>
  );
}

export default App;
@ -0,0 +1,230 @@
import React, { useState } from 'react';
import {
  Table,
  TableBody,
  TableCell,
  TableContainer,
  TableHead,
  TableRow,
  Paper,
  IconButton,
  Checkbox,
  Button,
  Chip,
  Dialog,
  DialogTitle,
  DialogContent,
  DialogActions,
  Typography,
} from '@mui/material';
import { Download as DownloadIcon, Delete as DeleteIcon } from '@mui/icons-material';
import { File, FileStatus } from '../types/file';
import { api } from '../services/api';

interface FileListProps {
  files: File[];
  onFileStatusChange: () => void;
}

const FileList: React.FC<FileListProps> = ({ files, onFileStatusChange }) => {
  const [selectedFiles, setSelectedFiles] = useState<string[]>([]);
  const [deleteDialogOpen, setDeleteDialogOpen] = useState(false);
  const [fileToDelete, setFileToDelete] = useState<string | null>(null);

  const handleSelectFile = (fileId: string) => {
    setSelectedFiles((prev) =>
      prev.includes(fileId)
        ? prev.filter((id) => id !== fileId)
        : [...prev, fileId]
    );
  };

  const handleSelectAll = () => {
    setSelectedFiles((prev) =>
      prev.length === files.length ? [] : files.map((file) => file.id)
    );
  };

  const handleDownload = async (fileId: string) => {
    try {
      console.log('=== FRONTEND DOWNLOAD START ===');
      console.log('File ID:', fileId);

      const file = files.find((f) => f.id === fileId);
      console.log('File object:', file);

      const blob = await api.downloadFile(fileId);
      console.log('Blob received:', blob);
      console.log('Blob type:', blob.type);
      console.log('Blob size:', blob.size);

      const url = window.URL.createObjectURL(blob);
      const a = document.createElement('a');
      a.href = url;

      // Match backend behavior: change extension to .md
      const originalFilename = file?.filename || 'downloaded-file';
      const filenameWithoutExt = originalFilename.replace(/\.[^/.]+$/, ''); // Remove extension
      const downloadFilename = `${filenameWithoutExt}.md`;

      console.log('Original filename:', originalFilename);
      console.log('Filename without extension:', filenameWithoutExt);
      console.log('Download filename:', downloadFilename);

      a.download = downloadFilename;
      document.body.appendChild(a);
      a.click();
      window.URL.revokeObjectURL(url);
      document.body.removeChild(a);

      console.log('=== FRONTEND DOWNLOAD END ===');
    } catch (error) {
      console.error('Error downloading file:', error);
    }
  };

  const handleDownloadSelected = async () => {
    for (const fileId of selectedFiles) {
      await handleDownload(fileId);
    }
  };

  const handleDeleteClick = (fileId: string) => {
    setFileToDelete(fileId);
    setDeleteDialogOpen(true);
  };

  const handleDeleteConfirm = async () => {
    if (fileToDelete) {
      try {
        await api.deleteFile(fileToDelete);
        onFileStatusChange();
      } catch (error) {
        console.error('Error deleting file:', error);
      }
    }
    setDeleteDialogOpen(false);
    setFileToDelete(null);
  };

  const handleDeleteCancel = () => {
    setDeleteDialogOpen(false);
    setFileToDelete(null);
  };

  const getStatusColor = (status: FileStatus) => {
    switch (status) {
      case FileStatus.SUCCESS:
        return 'success';
      case FileStatus.FAILED:
        return 'error';
      case FileStatus.PROCESSING:
        return 'warning';
      default:
        return 'default';
    }
  };

  return (
    <div>
      <div style={{ marginBottom: '1rem' }}>
        <Button
          variant="contained"
          color="primary"
          onClick={handleDownloadSelected}
          disabled={selectedFiles.length === 0}
          sx={{ mr: 1 }}
        >
          Download Selected
        </Button>
      </div>
      <TableContainer component={Paper}>
        <Table>
          <TableHead>
            <TableRow>
              <TableCell padding="checkbox">
                <Checkbox
                  checked={selectedFiles.length === files.length}
                  indeterminate={selectedFiles.length > 0 && selectedFiles.length < files.length}
                  onChange={handleSelectAll}
                />
              </TableCell>
              <TableCell>Filename</TableCell>
              <TableCell>Status</TableCell>
              <TableCell>Created At</TableCell>
              <TableCell>Finished At</TableCell>
              <TableCell>Actions</TableCell>
            </TableRow>
          </TableHead>
          <TableBody>
            {files.map((file) => (
              <TableRow key={file.id}>
                <TableCell padding="checkbox">
                  <Checkbox
                    checked={selectedFiles.includes(file.id)}
                    onChange={() => handleSelectFile(file.id)}
                  />
                </TableCell>
                <TableCell>{file.filename}</TableCell>
                <TableCell>
                  <Chip
                    label={file.status}
                    color={getStatusColor(file.status) as any}
                    size="small"
                  />
                </TableCell>
                <TableCell>
                  {new Date(file.created_at).toLocaleString()}
                </TableCell>
                <TableCell>
                  {(file.status === FileStatus.SUCCESS || file.status === FileStatus.FAILED)
                    ? new Date(file.updated_at).toLocaleString()
                    : '—'}
</TableCell>
|
||||
<TableCell>
|
||||
<IconButton
|
||||
onClick={() => handleDeleteClick(file.id)}
|
||||
size="small"
|
||||
color="error"
|
||||
sx={{ mr: 1 }}
|
||||
>
|
||||
<DeleteIcon />
|
||||
</IconButton>
|
||||
{file.status === FileStatus.SUCCESS && (
|
||||
<IconButton
|
||||
onClick={() => handleDownload(file.id)}
|
||||
size="small"
|
||||
color="primary"
|
||||
>
|
||||
<DownloadIcon />
|
||||
</IconButton>
|
||||
)}
|
||||
</TableCell>
|
||||
</TableRow>
|
||||
))}
|
||||
</TableBody>
|
||||
</Table>
|
||||
</TableContainer>
|
||||
|
||||
<Dialog
|
||||
open={deleteDialogOpen}
|
||||
onClose={handleDeleteCancel}
|
||||
>
|
||||
<DialogTitle>Confirm Delete</DialogTitle>
|
||||
<DialogContent>
|
||||
<Typography>
|
||||
Are you sure you want to delete this file? This action cannot be undone.
|
||||
</Typography>
|
||||
</DialogContent>
|
||||
<DialogActions>
|
||||
<Button onClick={handleDeleteCancel}>Cancel</Button>
|
||||
<Button onClick={handleDeleteConfirm} color="error" variant="contained">
|
||||
Delete
|
||||
</Button>
|
||||
</DialogActions>
|
||||
</Dialog>
|
||||
</div>
|
||||
);
|
||||
};
|
||||
|
||||
export default FileList;
|
||||
|
|
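The download handler above mirrors the backend by swapping the original file extension for `.md` before saving. A minimal sketch of that rename rule (written in Python here for illustration; the component uses the equivalent regex `/\.[^/.]+$/`):

```python
import re

def markdown_download_name(filename: str, fallback: str = "downloaded-file") -> str:
    """Replicate the component's rename rule: strip the last extension, append .md."""
    base = filename or fallback
    # Same pattern as the TSX code: remove a trailing ".ext" containing no slash or dot
    stem = re.sub(r"\.[^/.]+$", "", base)
    return stem + ".md"

print(markdown_download_name("judgment.pdf"))    # judgment.md
print(markdown_download_name("archive.tar.gz"))  # only the last extension is dropped
```

Note that only the final extension is removed, so multi-part extensions like `.tar.gz` keep their first segment.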
@@ -0,0 +1,66 @@
import React, { useCallback } from 'react';
import { useDropzone } from 'react-dropzone';
import { Box, Typography, CircularProgress } from '@mui/material';
import { api } from '../services/api';

interface FileUploadProps {
  onUploadComplete: () => void;
}

const FileUpload: React.FC<FileUploadProps> = ({ onUploadComplete }) => {
  const [isUploading, setIsUploading] = React.useState(false);

  const onDrop = useCallback(async (acceptedFiles: File[]) => {
    setIsUploading(true);
    try {
      for (const file of acceptedFiles) {
        await api.uploadFile(file);
      }
      onUploadComplete();
    } catch (error) {
      console.error('Error uploading files:', error);
    } finally {
      setIsUploading(false);
    }
  }, [onUploadComplete]);

  const { getRootProps, getInputProps, isDragActive } = useDropzone({
    onDrop,
    accept: {
      'application/pdf': ['.pdf'],
      'application/msword': ['.doc'],
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
      'text/markdown': ['.md'],
    },
  });

  return (
    <Box
      {...getRootProps()}
      sx={{
        border: '2px dashed #ccc',
        borderRadius: 2,
        p: 3,
        textAlign: 'center',
        cursor: 'pointer',
        bgcolor: isDragActive ? 'action.hover' : 'background.paper',
        '&:hover': {
          bgcolor: 'action.hover',
        },
      }}
    >
      <input {...getInputProps()} />
      {isUploading ? (
        <CircularProgress />
      ) : (
        <Typography>
          {isDragActive
            ? 'Drop the files here...'
            : 'Drag and drop files here, or click to select files'}
        </Typography>
      )}
    </Box>
  );
};

export default FileUpload;
@@ -0,0 +1,8 @@
/// <reference types="react-scripts" />

declare namespace NodeJS {
  interface ProcessEnv {
    readonly REACT_APP_API_BASE_URL: string;
    // Add other environment variables here
  }
}
@@ -0,0 +1,29 @@
import React from 'react';
import ReactDOM from 'react-dom/client';
import { QueryClient, QueryClientProvider } from 'react-query';
import { ThemeProvider, createTheme } from '@mui/material';
import CssBaseline from '@mui/material/CssBaseline';
import App from './App';

const queryClient = new QueryClient();

const theme = createTheme({
  palette: {
    mode: 'light',
  },
});

const root = ReactDOM.createRoot(
  document.getElementById('root') as HTMLElement
);

root.render(
  <React.StrictMode>
    <QueryClientProvider client={queryClient}>
      <ThemeProvider theme={theme}>
        <CssBaseline />
        <App />
      </ThemeProvider>
    </QueryClientProvider>
  </React.StrictMode>
);
@@ -0,0 +1,44 @@
import axios from 'axios';
import { File, FileUploadResponse } from '../types/file';

const API_BASE_URL = process.env.REACT_APP_API_BASE_URL || 'http://localhost:8000/api/v1';

// Create axios instance with default config
const axiosInstance = axios.create({
  baseURL: API_BASE_URL,
  timeout: 30000, // 30 seconds timeout
});

export const api = {
  uploadFile: async (file: globalThis.File): Promise<FileUploadResponse> => {
    const formData = new FormData();
    formData.append('file', file);
    const response = await axiosInstance.post('/files/upload', formData, {
      headers: {
        'Content-Type': 'multipart/form-data',
      },
    });
    return response.data;
  },

  listFiles: async (): Promise<File[]> => {
    const response = await axiosInstance.get('/files/files');
    return response.data;
  },

  getFile: async (fileId: string): Promise<File> => {
    const response = await axiosInstance.get(`/files/files/${fileId}`);
    return response.data;
  },

  downloadFile: async (fileId: string): Promise<Blob> => {
    const response = await axiosInstance.get(`/files/files/${fileId}/download`, {
      responseType: 'blob',
    });
    return response.data;
  },

  deleteFile: async (fileId: string): Promise<void> => {
    await axiosInstance.delete(`/files/files/${fileId}`);
  },
};
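For debugging outside the browser, the same five endpoints can be exercised from any HTTP client. A hypothetical Python mirror of the route layout used by the `api` object (paths are taken directly from the axios calls above; the base URL is the frontend's default):

```python
from typing import Optional

API_BASE_URL = "http://localhost:8000/api/v1"  # same default as the frontend

def route(name: str, file_id: Optional[str] = None) -> str:
    """Build the request path for each operation exposed by the frontend api object."""
    paths = {
        "upload": "/files/upload",                       # POST, multipart/form-data
        "list": "/files/files",                          # GET
        "get": f"/files/files/{file_id}",                # GET
        "download": f"/files/files/{file_id}/download",  # GET, binary body
        "delete": f"/files/files/{file_id}",             # DELETE
    }
    return API_BASE_URL + paths[name]

print(route("download", "abc123"))
```

The doubled `files/files` segment reflects how the router is mounted in the backend (a `/files` prefix on top of `/files/...` routes), so any external client has to reproduce it as-is.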
@@ -0,0 +1,23 @@
export enum FileStatus {
  NOT_STARTED = "not_started",
  PROCESSING = "processing",
  SUCCESS = "success",
  FAILED = "failed"
}

export interface File {
  id: string;
  filename: string;
  status: FileStatus;
  error_message?: string;
  created_at: string;
  updated_at: string;
}

export interface FileUploadResponse {
  id: string;
  filename: string;
  status: FileStatus;
  created_at: string;
  updated_at: string;
}
@@ -0,0 +1,26 @@
{
  "compilerOptions": {
    "target": "es5",
    "lib": [
      "dom",
      "dom.iterable",
      "esnext"
    ],
    "allowJs": true,
    "skipLibCheck": true,
    "esModuleInterop": true,
    "allowSyntheticDefaultImports": true,
    "strict": true,
    "forceConsistentCasingInFileNames": true,
    "noFallthroughCasesInSwitch": true,
    "module": "esnext",
    "moduleResolution": "node",
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx"
  },
  "include": [
    "src"
  ]
}
@@ -0,0 +1,232 @@
#!/bin/bash

# Docker Image Import Script
# Imports Docker images on the target environment for migration

set -e

echo "🚀 Legal Document Masker - Docker Image Import"
echo "=============================================="

# Function to check if Docker is running
check_docker() {
    if ! docker info > /dev/null 2>&1; then
        echo "❌ Docker is not running. Please start Docker and try again."
        exit 1
    fi
    echo "✅ Docker is running"
}

# Function to check for tar files
check_tar_files() {
    echo "🔍 Checking for Docker image files..."

    local missing_files=()

    if [ ! -f "backend-api.tar" ]; then
        missing_files+=("backend-api.tar")
    fi

    if [ ! -f "frontend.tar" ]; then
        missing_files+=("frontend.tar")
    fi

    if [ ! -f "mineru-api.tar" ]; then
        missing_files+=("mineru-api.tar")
    fi

    if [ ! -f "redis.tar" ]; then
        missing_files+=("redis.tar")
    fi

    if [ ${#missing_files[@]} -ne 0 ]; then
        echo "❌ Missing files: ${missing_files[*]}"
        echo ""
        echo "Please ensure all .tar files are in the current directory."
        echo "If you have a compressed archive, extract it first:"
        echo "  tar -xzf legal-doc-masker-images-*.tar.gz"
        exit 1
    fi

    echo "✅ All required files found"
}

# Function to check available disk space
check_disk_space() {
    echo "💾 Checking available disk space..."

    local required_space=0
    for file in *.tar; do
        local file_size=$(stat -f%z "$file" 2>/dev/null || stat -c%s "$file" 2>/dev/null || echo 0)
        required_space=$((required_space + file_size))
    done

    local available_space=$(df . | awk 'NR==2 {print $4}')
    available_space=$((available_space * 1024)) # Convert to bytes

    if [ $required_space -gt $available_space ]; then
        echo "❌ Insufficient disk space"
        echo "Required: $(numfmt --to=iec $required_space)"
        echo "Available: $(numfmt --to=iec $available_space)"
        exit 1
    fi

    echo "✅ Sufficient disk space available"
}

# Function to import images
import_images() {
    echo "📦 Importing Docker images..."

    # Import backend image
    echo "  📦 Importing backend-api image..."
    docker load -i backend-api.tar

    # Import frontend image
    echo "  📦 Importing frontend image..."
    docker load -i frontend.tar

    # Import mineru image
    echo "  📦 Importing mineru-api image..."
    docker load -i mineru-api.tar

    # Import redis image
    echo "  📦 Importing redis image..."
    docker load -i redis.tar

    echo "✅ All images imported successfully!"
}

# Function to verify imported images
verify_images() {
    echo "🔍 Verifying imported images..."

    local missing_images=()

    if ! docker images | grep -q "legal-doc-masker-backend-api"; then
        missing_images+=("legal-doc-masker-backend-api")
    fi

    if ! docker images | grep -q "legal-doc-masker-frontend"; then
        missing_images+=("legal-doc-masker-frontend")
    fi

    if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
        missing_images+=("legal-doc-masker-mineru-api")
    fi

    if ! docker images | grep -q "redis:alpine"; then
        missing_images+=("redis:alpine")
    fi

    if [ ${#missing_images[@]} -ne 0 ]; then
        echo "❌ Missing imported images: ${missing_images[*]}"
        exit 1
    fi

    echo "✅ All images verified successfully!"
}

# Function to show imported images
show_imported_images() {
    echo ""
    echo "📊 Imported Images:"
    echo "==================="
    docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker
    docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep redis
}

# Function to create necessary directories
create_directories() {
    echo ""
    echo "📁 Creating necessary directories..."

    mkdir -p backend/storage
    mkdir -p mineru/storage/uploads
    mkdir -p mineru/storage/processed

    echo "✅ Directories created"
}

# Function to check for required files
check_required_files() {
    echo ""
    echo "🔍 Checking for required configuration files..."

    local missing_files=()

    if [ ! -f "docker-compose.yml" ]; then
        missing_files+=("docker-compose.yml")
    fi

    if [ ! -f "DOCKER_COMPOSE_README.md" ]; then
        missing_files+=("DOCKER_COMPOSE_README.md")
    fi

    if [ ${#missing_files[@]} -ne 0 ]; then
        echo "⚠️  Missing files: ${missing_files[*]}"
        echo "Please copy these files from the source environment:"
        echo "  - docker-compose.yml"
        echo "  - DOCKER_COMPOSE_README.md"
        echo "  - backend/.env (if exists)"
        echo "  - frontend/.env (if exists)"
        echo "  - mineru/.env (if exists)"
    else
        echo "✅ All required configuration files found"
    fi
}

# Function to show next steps
show_next_steps() {
    echo ""
    echo "🎉 Import completed successfully!"
    echo ""
    echo "📋 Next Steps:"
    echo "=============="
    echo ""
    echo "1. Copy configuration files (if not already present):"
    echo "   - docker-compose.yml"
    echo "   - backend/.env"
    echo "   - frontend/.env"
    echo "   - mineru/.env"
    echo ""
    echo "2. Start the services:"
    echo "   docker-compose up -d"
    echo ""
    echo "3. Verify services are running:"
    echo "   docker-compose ps"
    echo ""
    echo "4. Test the endpoints:"
    echo "   - Frontend: http://localhost:3000"
    echo "   - Backend API: http://localhost:8000"
    echo "   - Mineru API: http://localhost:8001"
    echo ""
    echo "5. View logs if needed:"
    echo "   docker-compose logs -f [service-name]"
}

# Function to handle compressed archive
handle_compressed_archive() {
    if ls legal-doc-masker-images-*.tar.gz 1> /dev/null 2>&1; then
        echo "🗜️  Found compressed archive, extracting..."
        tar -xzf legal-doc-masker-images-*.tar.gz
        echo "✅ Archive extracted"
    fi
}

# Main execution
main() {
    check_docker
    handle_compressed_archive
    check_tar_files
    check_disk_space
    import_images
    verify_images
    show_imported_images
    create_directories
    check_required_files
    show_next_steps
}

# Run main function
main "$@"
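The `check_disk_space` function above has to work around `stat` flag differences between BSD/macOS (`-f%z`) and GNU coreutils (`-c%s`). The same check can be sketched portably in Python (a minimal illustration, assuming the `.tar` files sit in the directory being checked):

```python
import glob
import os
import shutil

def has_space_for_tars(directory: str = ".") -> bool:
    """Compare the combined size of *.tar files against free space, like check_disk_space."""
    required = sum(os.path.getsize(p) for p in glob.glob(os.path.join(directory, "*.tar")))
    available = shutil.disk_usage(directory).free  # portable: no stat flag differences
    return required <= available
```

`shutil.disk_usage` already reports free space in bytes, so no unit conversion like the script's `* 1024` is needed.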
@@ -0,0 +1,46 @@
FROM python:3.12-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    libreoffice \
    wget \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN pip install uv

# Configure uv and install mineru
ENV UV_SYSTEM_PYTHON=1
RUN uv pip install --system -U "mineru[core]"

# Copy requirements first to leverage Docker cache
# COPY requirements.txt .
# RUN pip install huggingface_hub
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py
# RUN python download_models_hf.py
RUN mineru-models-download -s modelscope -m pipeline

# RUN pip install --no-cache-dir -r requirements.txt
# RUN pip install -U magic-pdf[full]

# Copy the rest of the application
# COPY . .

# Create storage directories
# RUN mkdir -p storage/uploads storage/processed

# Expose the port the app runs on
EXPOSE 8000

# Command to run the application
CMD ["mineru-api", "--host", "0.0.0.0", "--port", "8000"]
@@ -0,0 +1,27 @@
version: '3.8'

services:
  mineru-api:
    build:
      context: .
      dockerfile: Dockerfile
    platform: linux/arm64
    ports:
      - "8001:8000"
    volumes:
      - ./storage/uploads:/app/storage/uploads
      - ./storage/processed:/app/storage/processed
    environment:
      - PYTHONUNBUFFERED=1
      - MINERU_MODEL_SOURCE=local
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

volumes:
  uploads:
  processed:
@@ -1,11 +0,0 @@
# Base dependencies
pydantic-settings>=2.0.0
python-dotenv==1.0.0
watchdog==2.1.6
requests==2.28.1

# Document processing
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0
magic-pdf[full]
@@ -0,0 +1,43 @@
# 北京市第三中级人民法院民事判决书

(2022)京 03 民终 3852 号

上诉人(原审原告):北京丰复久信营销科技有限公司,住所地北京市海淀区北小马厂6 号1 号楼华天大厦1306 室。

法定代表人:郭东军,执行董事、经理。委托诉讼代理人:周大海,北京市康达律师事务所律师。委托诉讼代理人:王乃哲,北京市康达律师事务所律师。

被上诉人(原审被告):中研智创区块链技术有限公司,住所地天津市津南区双港镇工业园区优谷产业园5 号楼-1505。

法定代表人:王欢子,总经理。

委托诉讼代理人:魏鑫,北京市昊衡律师事务所律师。

1.上诉人北京丰复久信营销科技有限公司(以下简称丰复久信公司)因与被上诉人中研智创区块链技术有限公司(以下简称中研智创公司)服务合同纠纷一案,不服北京市朝阳区人民法院(2020)京0105 民初69754 号民事判决,向本院提起上诉。本院立案后,依法组成合议庭开庭进行了审理。上诉人丰复久信公司之委托诉讼代理人周大海、王乃哲,被上诉人中研智创公司之委托诉讼代理人魏鑫到庭参加诉讼。本案现已审理终结。

2.丰复久信公司上诉请求:1.撤销一审判决,发回重审或依法改判支持丰复久信公司一审全部诉讼请求;2.或在维持原判的同时判令中研智创公司向丰复久信公司返还 1000 万元款项,并赔偿丰复久信公司因此支付的律师费 220 万元;3.判令中研智创公司承担本案一审、二审全部诉讼费用。事实与理由:一、根据2019 年的政策导向,丰复久信公司的投资行为并无任何法律或政策瑕疵。丰复久信公司仅投资挖矿,没有购买比特币,故在当时国家、政府层面有相关政策支持甚至鼓励的前提下,一审法院仅凭“挖矿”行为就得出丰复久信公司扰乱金融秩序的结论,是错误的。二、一审法院没有全面、深入审查相关事实,且遗漏了最核心的数据调查工作。三、本案一审判决适用法律错误。涉案合同成立及履行期间并无合同无效的情形,当属有效。一审法院以挖矿活动耗能巨大、不利于我国产业结构调整为依据之一,作出合同无效的判决,实属牵强。最高人民法院发布的全国法院系统2020 年度优秀案例分析评选活动获奖名单中,由上海市第一中级人民法院刘江法官编写的“李圣艳、布兰登·斯密特诉闫向东、李敏等财产损害赔偿纠纷案— —比特币的法律属性及其司法救济”一案入选,该案同样发生在丰复久信公司与中研智创公司合同履行过程中,一审法院认定同时期同类型的涉案合同无效,与上述最高人民法院的优秀案例相悖。四、一审法院径行认定合同无效,未向丰复久信公司进行释明构成程序违法。

3.中研智创公司辩称,同意一审判决,不同意丰复久信公司的上诉请求。首先,一审法院曾在庭审中询问丰复久信公司关于机器返还的问题,一审法院进行了释明。其次,如二审法院对其该项上诉请求进行判决,会剥夺中研智创公司针对该部分请求再行上诉的权利。

4.丰复久信公司向一审法院起诉请求:1.中研智创公司交付278.1654976 个比特币,或者按照 2021 年 1 月 25 日比特币的价格交付9550812.36 美元;2.中研智创公司赔偿丰复久信公司服务期到期后占用微型存储空间服务器的损失(自2020 年7 月1日起至实际返还服务器时止,按照bitinfocharts 网站公布的相关日产比特币数据,计算应赔偿比特币数量或按照2021 年1 月25 日比特币的价格交付美元)。

5.一审法院查明事实:2019 年5 月6 日,丰复久信公司作为甲方(买方)与乙方(卖方)中研智创公司签订《计算机设备采购合同》,约定:货物名称为计算机设备,型号规格及数量为T2T-30T 规格型号的微型存储空间服务器1542 台,单价5040/ 台合同金额为 7 771 680 元;交货期 2019 年 8 月 31 日前;交货方式为乙方自行送货到甲方所在地,并提供安装服务,运输工具及运费由乙方负责;交货地点北京;签订购货合同,设备安装完毕后一次性支付项目总货款;乙方提供货物的质量保证期为自交货验收结束之日起不少于十二个月(具体按清单要求);乙方交货前应对产品作出全面检查和对验收文件进行整理,并列出清单,作为甲方收货验收和使用的技术条件依据,检验的结果应随货物交甲方,甲方对乙方提供的货物在使用前进行调试时,乙方协助甲方一起调试,直到符合技术要求,甲方才做最终验收,验收时乙方必须在现场,验收完毕后作出验收结果报告,并经双方签字生效。

6.同日,丰复久信公司作为甲方(客户方)与乙方中研智创公司(服务方)签订《服务合同书》,约定:乙方同意就采购合同中的微型存储空间服务器向甲方提供特定服务;服务的内容包括质保、维修、服务器设备代为运行管理、代为缴纳服务器相关用度花费如电费等,详细内容见附件一;如果乙方在工作中因自身过错而发生任何错误或遗漏,应无条件更正,不另外收费,并对因此而对甲方造成的损失承担赔偿责任,赔偿额以本合同约定的服务费为限;若因甲方原因造成工作延误,将由甲方承担相应的损失;服务费总金额为2 228 320 元,甲乙双方一致同意项目服务费以人民币形式,于本合同签订后3 日内一次性支付;甲方可以提前10 个工作日以书面形式要求变更或增加所提供的服务,该等变更最终应由双方商定认可,其中包括与该等变更有关的任何费用调整等。合同后附附件一以表格形式列明,1.1542 台T2T-30T 微型存储空间服务器的质保、维修,时限12 个月,完成标准为完成甲方指定的运行量;2.服务器的日常运行管理,时限12 个月;3.代扣代缴电费;4.其他(空白)。

24. 2021 年9 月3 日国家发展和改革委员会等部门《关于整治虚拟货币“挖矿”活动的通知》显示,虚拟货币挖矿活动能源消耗和碳排放量大,对国民经济贡献度低,对产业发展、科技进步等带动作用有限,加之虚拟货币生产、交易环节衍生的风险越发突出,其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响。故以电力资源、碳排放量为代价的“挖矿”行为,与经济社会高质量发展和碳达峰、碳中和目标相悖,与公共利益相悖。

26. 综上,相关部门整治虚拟货币“挖矿”活动、认定虚拟货币相关业务活动属于非法金融活动,有利于保障我国发展利益和金融安全。从“挖矿”行为的高能耗以及比特币交易活动对国家金融秩序和社会秩序的影响来看,一审法院认定涉案合同无效是正确的。双方作为社会主义市场经济主体,既应遵守市场经济规则,亦应承担起相应的社会责任,推动经济社会高质量发展、可持续发展。

27. 关于合同无效后的返还问题,一审法院未予处理,双方可另行解决。

28. 综上所述,丰复久信公司的上诉请求不能成立,应予驳回;一审判决并无不当,应予维持。依照《中华人民共和国民事诉讼法》第一百七十七条第一款第一项规定,判决如下:

驳回上诉,维持原判。

二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。

29. 本判决为终审判决。

审 判 长 史晓霞审 判 员 邓青菁审 判 员 李 淼二〇二二年七月七日法 官 助 理 黎 铧书 记 员 郑海兴
@@ -0,0 +1,110 @@
#!/bin/bash

# Unified Docker Compose Setup Script
# This script helps set up the unified Docker Compose environment

set -e

echo "🚀 Setting up Unified Docker Compose Environment"

# Function to check if Docker is running
check_docker() {
    if ! docker info > /dev/null 2>&1; then
        echo "❌ Docker is not running. Please start Docker and try again."
        exit 1
    fi
    echo "✅ Docker is running"
}

# Function to stop existing individual services
stop_individual_services() {
    echo "🛑 Stopping individual Docker Compose services..."

    # Run each teardown in a subshell so a failure can't leave us in the wrong directory
    if [ -f "backend/docker-compose.yml" ]; then
        echo "Stopping backend services..."
        (cd backend && docker-compose down 2>/dev/null) || true
    fi

    if [ -f "frontend/docker-compose.yml" ]; then
        echo "Stopping frontend services..."
        (cd frontend && docker-compose down 2>/dev/null) || true
    fi

    if [ -f "mineru/docker-compose.yml" ]; then
        echo "Stopping mineru services..."
        (cd mineru && docker-compose down 2>/dev/null) || true
    fi

    echo "✅ Individual services stopped"
}

# Function to create necessary directories
create_directories() {
    echo "📁 Creating necessary directories..."

    mkdir -p backend/storage
    mkdir -p mineru/storage/uploads
    mkdir -p mineru/storage/processed

    echo "✅ Directories created"
}

# Function to check if unified docker-compose.yml exists
check_unified_compose() {
    if [ ! -f "docker-compose.yml" ]; then
        echo "❌ Unified docker-compose.yml not found in current directory"
        echo "Please run this script from the project root directory"
        exit 1
    fi
    echo "✅ Unified docker-compose.yml found"
}

# Function to build and start services
start_unified_services() {
    echo "🔨 Building and starting unified services..."

    # Build all services
    docker-compose build

    # Start services
    docker-compose up -d

    echo "✅ Unified services started"
}

# Function to check service status
check_service_status() {
    echo "📊 Checking service status..."

    docker-compose ps

    echo ""
    echo "🌐 Service URLs:"
    echo "Frontend: http://localhost:3000"
    echo "Backend API: http://localhost:8000"
    echo "Mineru API: http://localhost:8001"
    echo ""
    echo "📝 To view logs: docker-compose logs -f [service-name]"
    echo "📝 To stop services: docker-compose down"
}

# Main execution
main() {
    echo "=========================================="
    echo "Unified Docker Compose Setup"
    echo "=========================================="

    check_docker
    check_unified_compose
    stop_individual_services
    create_directories
    start_unified_services
    check_service_status

    echo ""
    echo "🎉 Setup complete! Your unified Docker environment is ready."
    echo "Check the DOCKER_COMPOSE_README.md for more information."
}

# Run main function
main "$@"
@@ -1,31 +0,0 @@
# settings.py

from pydantic_settings import BaseSettings
from typing import Optional

class Settings(BaseSettings):
    # Storage paths
    OBJECT_STORAGE_PATH: str = ""
    TARGET_DIRECTORY_PATH: str = ""

    # Ollama API settings
    OLLAMA_API_URL: str = "https://api.ollama.com"
    OLLAMA_API_KEY: str = ""
    OLLAMA_MODEL: str = "llama2"

    # File monitoring settings
    MONITOR_INTERVAL: int = 5

    # Logging settings
    LOG_LEVEL: str = "INFO"
    LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
    LOG_FILE: str = "app.log"

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"
        extra = "allow"

# Create settings instance
settings = Settings()
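The deleted `Settings` class resolved every field from the environment (and `.env`) with a typed default. That precedence can be sketched without pydantic (a hypothetical helper for illustration only; pydantic-settings additionally handles `.env` file parsing and validation):

```python
import os

def env_setting(name, default, cast=str):
    """Return the environment value for `name`, cast like the Settings field, else the default."""
    raw = os.environ.get(name)
    return cast(raw) if raw is not None else default

# Environment value wins and is cast to the field's type
os.environ["MONITOR_INTERVAL"] = "10"
print(env_setting("MONITOR_INTERVAL", 5, int))

# Unset variables fall back to the typed default, as in the Settings class
print(env_setting("OLLAMA_MODEL_FALLBACK_DEMO", "llama2"))
```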
@ -1,190 +0,0 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import Any, Dict
|
||||
from prompts.masking_prompts import get_masking_mapping_prompt
|
||||
import logging
|
||||
import json
|
||||
from services.ollama_client import OllamaClient
|
||||
from config.settings import settings
|
||||
from utils.json_extractor import LLMJsonExtractor
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class DocumentProcessor(ABC):
|
||||
def __init__(self):
|
||||
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
|
||||
self.max_chunk_size = 1000 # Maximum number of characters per chunk
|
||||
self.max_retries = 3 # Maximum number of retries for mapping generation
|
||||
|
||||
@abstractmethod
|
||||
def read_content(self) -> str:
|
||||
"""Read document content"""
|
||||
pass
|
||||
|
||||
def _split_into_chunks(self, sentences: list[str]) -> list[str]:
|
||||
"""Split sentences into chunks that don't exceed max_chunk_size"""
|
||||
chunks = []
|
||||
current_chunk = ""
|
||||
|
||||
for sentence in sentences:
|
||||
if not sentence.strip():
|
||||
continue
|
||||
|
||||
# If adding this sentence would exceed the limit, save current chunk and start new one
|
||||
if len(current_chunk) + len(sentence) > self.max_chunk_size and current_chunk:
|
||||
chunks.append(current_chunk)
|
||||
current_chunk = sentence
|
||||
else:
|
||||
if current_chunk:
|
||||
current_chunk += "。" + sentence
|
||||
else:
|
||||
current_chunk = sentence
|
||||
|
||||
# Add the last chunk if it's not empty
|
||||
if current_chunk:
|
||||
chunks.append(current_chunk)
|
||||
|
||||
return chunks
|
||||
|
||||
def _validate_mapping_format(self, mapping: Dict[str, Any]) -> bool:
|
||||
"""
|
||||
Validate that the mapping follows the required format:
|
||||
{
|
||||
"原文1": "脱敏后1",
|
||||
"原文2": "脱敏后2",
|
||||
...
|
||||
}
|
||||
"""
|
||||
if not isinstance(mapping, dict):
|
||||
logger.warning("Mapping is not a dictionary")
|
||||
return False
|
||||
|
||||
# Check if any key or value is not a string
|
||||
for key, value in mapping.items():
|
||||
if not isinstance(key, str) or not isinstance(value, str):
|
||||
logger.warning(f"Invalid mapping format - key or value is not a string: {key}: {value}")
|
||||
return False
|
||||
|
||||
# Check if the mapping has any nested structures
|
||||
if any(isinstance(v, (dict, list)) for v in mapping.values()):
|
||||
logger.warning("Invalid mapping format - contains nested structures")
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def _build_mapping(self, chunk: str) -> Dict[str, str]:
|
||||
"""Build mapping for a single chunk of text with retry logic"""
|
||||
for attempt in range(self.max_retries):
|
||||
try:
|
||||
formatted_prompt = get_masking_mapping_prompt(chunk)
|
||||
logger.info(f"Calling ollama to generate mapping for chunk (attempt {attempt + 1}/{self.max_retries}): {formatted_prompt}")
|
||||
response = self.ollama_client.generate(formatted_prompt)
|
||||
logger.info(f"Raw response from LLM: {response}")
|
||||
|
||||
# Parse the JSON response into a dictionary
|
||||
mapping = LLMJsonExtractor.parse_raw_json_str(response)
|
||||
logger.info(f"Parsed mapping: {mapping}")
|
||||
|
||||
if mapping and self._validate_mapping_format(mapping):
|
||||
return mapping
|
||||
else:
|
||||
logger.warning(f"Invalid mapping format received on attempt {attempt + 1}, retrying...")
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating mapping on attempt {attempt + 1}: {e}")
|
||||
if attempt < self.max_retries - 1:
|
||||
logger.info("Retrying...")
|
||||
else:
|
||||
logger.error("Max retries reached, returning empty mapping")
|
||||
return {}
|
||||
|
||||
def _apply_mapping(self, text: str, mapping: Dict[str, str]) -> str:
|
||||
"""Apply the mapping to replace sensitive information"""
|
||||
masked_text = text
|
||||
for original, masked in mapping.items():
|
Continuation of the abstract `DocumentProcessor` base class (this chunk opens mid-method, apparently inside `_apply_mapping`):

```python
            # Ensure the masked value is a string
            if isinstance(masked, dict):
                # If it's a dict, use its first value or a default
                masked = next(iter(masked.values()), "某")
            elif not isinstance(masked, str):
                # Otherwise convert it to a string, or fall back to a default
                masked = str(masked) if masked is not None else "某"
            masked_text = masked_text.replace(original, masked)
        return masked_text

    def _get_next_suffix(self, value: str) -> str:
        """Get the next available suffix for a value that already has one."""
        # The ordered sequence of suffixes
        suffixes = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']

        # Check whether the value already ends with a suffix
        for suffix in suffixes:
            if value.endswith(suffix):
                # Advance to the next suffix in the sequence
                current_index = suffixes.index(suffix)
                if current_index + 1 < len(suffixes):
                    return value[:-1] + suffixes[current_index + 1]
                else:
                    # All suffixes are used; wrap around to the first one
                    return value[:-1] + suffixes[0]

        # No suffix found: append the first suffix
        return value + '甲'

    def _merge_mappings(self, existing: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
        """
        Merge two mappings following these rules:
        1. If a key exists in `existing`, keep the existing value.
        2. If a value already exists in `existing`:
           - If it ends with a suffix (甲乙丙丁...), use the next suffix.
           - If it has no suffix, append '甲'.
        """
        result = existing.copy()

        # Track all values already in use
        existing_values = set(result.values())

        for key, value in new.items():
            if key in result:
                # Rule 1: keep the existing value when the key is known
                continue

            if value in existing_values:
                # Rule 2: resolve a duplicate value with the next suffix
                new_value = self._get_next_suffix(value)
                result[key] = new_value
                existing_values.add(new_value)
            else:
                # No conflict; add as-is
                result[key] = value
                existing_values.add(value)

        return result

    def process_content(self, content: str) -> str:
        """Process document content by masking sensitive information."""
        # Split the content into sentences
        sentences = content.split("。")

        # Group the sentences into manageable chunks
        chunks = self._split_into_chunks(sentences)
        logger.info(f"Split content into {len(chunks)} chunks")

        # Build a mapping for each chunk and merge them
        combined_mapping = {}
        for i, chunk in enumerate(chunks):
            logger.info(f"Processing chunk {i + 1}/{len(chunks)}")
            chunk_mapping = self._build_mapping(chunk)
            if chunk_mapping:  # Only merge if we got a valid mapping
                combined_mapping = self._merge_mappings(combined_mapping, chunk_mapping)
            else:
                logger.warning(f"Failed to generate mapping for chunk {i + 1}")

        # Apply the combined mapping to the entire content
        masked_content = self._apply_mapping(content, combined_mapping)
        logger.info("Successfully masked content")

        return masked_content

    @abstractmethod
    def save_content(self, content: str) -> None:
        """Save the processed content."""
        pass
```
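The suffix and merge rules above can be exercised standalone. This is a minimal sketch that reimplements `_get_next_suffix` and `_merge_mappings` as free functions (`next_suffix` and `merge_mappings` are illustrative names; the logic mirrors the methods above):

```python
from typing import Dict

SUFFIXES = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']

def next_suffix(value: str) -> str:
    # If the value already ends with a suffix, advance to the next one
    # (wrapping around after the last)
    for suffix in SUFFIXES:
        if value.endswith(suffix):
            i = SUFFIXES.index(suffix)
            return value[:-1] + SUFFIXES[(i + 1) % len(SUFFIXES)]
    # Otherwise append the first suffix
    return value + '甲'

def merge_mappings(existing: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
    result = existing.copy()
    seen = set(result.values())
    for key, value in new.items():
        if key in result:
            continue  # rule 1: keep the existing value for a known key
        if value in seen:
            value = next_suffix(value)  # rule 2: resolve the collision
        result[key] = value
        seen.add(value)
    return result

m = merge_mappings({'张三': '张某'}, {'张三': '张某某', '李四': '张某'})
m2 = merge_mappings(m, {'王五': '张某甲'})
```

Note that this (like the methods above) resolves only one collision level: if the advanced value itself already exists in the mapping, two keys can end up with the same masked form.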
The processors package `__init__.py` (removed; `@@ -1,6 +0,0 @@`):

```python
from document_handlers.processors.txt_processor import TxtDocumentProcessor
from document_handlers.processors.docx_processor import DocxDocumentProcessor
from document_handlers.processors.pdf_processor import PdfDocumentProcessor
from document_handlers.processors.md_processor import MarkdownDocumentProcessor

__all__ = [
    'TxtDocumentProcessor',
    'DocxDocumentProcessor',
    'PdfDocumentProcessor',
    'MarkdownDocumentProcessor',
]
```
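The package exports four processors, which suggests dispatch by file extension. The selection logic is not shown in this diff, so the following is a hypothetical sketch (the `PROCESSORS` table and `processor_name_for` helper are illustrative, not part of the codebase):

```python
import os

# Hypothetical extension -> processor-class-name table (illustrative only)
PROCESSORS = {
    '.txt': 'TxtDocumentProcessor',
    '.docx': 'DocxDocumentProcessor',
    '.pdf': 'PdfDocumentProcessor',
    '.md': 'MarkdownDocumentProcessor',
}

def processor_name_for(path: str) -> str:
    """Pick a processor class name for a file, matching its extension case-insensitively."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in PROCESSORS:
        raise ValueError(f"Unsupported file type: {ext!r}")
    return PROCESSORS[ext]
```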
`pdf_processor.py` (removed; `@@ -1,105 +0,0 @@`):

```python
import logging
import os

import PyPDF2
from magic_pdf.config.enums import SupportedPdfParseMethod
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

from config.settings import settings
from document_handlers.document_processor import DocumentProcessor
from prompts.masking_prompts import get_masking_prompt, get_masking_mapping_prompt
from services.ollama_client import OllamaClient

logger = logging.getLogger(__name__)


class PdfDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        super().__init__()  # Call the parent class's __init__
        self.input_path = input_path
        self.output_path = output_path
        self.output_dir = os.path.dirname(output_path)
        self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]

        # Set up the output image directory
        self.local_image_dir = os.path.join(self.output_dir, "images")
        self.image_dir = os.path.basename(self.local_image_dir)
        os.makedirs(self.local_image_dir, exist_ok=True)

        # Set up a hidden work directory under the output directory
        self.work_dir = os.path.join(
            os.path.dirname(output_path),
            ".work",
            self.name_without_suff,
        )
        os.makedirs(self.work_dir, exist_ok=True)

        self.work_local_image_dir = os.path.join(self.work_dir, "images")
        self.work_image_dir = os.path.basename(self.work_local_image_dir)
        os.makedirs(self.work_local_image_dir, exist_ok=True)

        self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)

    def read_content(self) -> str:
        logger.info("Starting PDF content processing")

        # Read the PDF file as raw bytes
        with open(self.input_path, 'rb') as file:
            content = file.read()

        # Initialize writers for images and markdown output
        image_writer = FileBasedDataWriter(self.work_local_image_dir)
        md_writer = FileBasedDataWriter(self.work_dir)

        # Create the dataset instance
        ds = PymuDocDataset(content)

        pdf_type = ds.classify()
        logger.info("Classifying PDF type: %s", pdf_type)
        # Choose OCR or text extraction based on the PDF type
        if pdf_type == SupportedPdfParseMethod.OCR:
            infer_result = ds.apply(doc_analyze, ocr=True)
            pipe_result = infer_result.pipe_ocr_mode(image_writer)
        else:
            infer_result = ds.apply(doc_analyze, ocr=False)
            pipe_result = infer_result.pipe_txt_mode(image_writer)

        logger.info("Generating all outputs")
        infer_result.draw_model(os.path.join(self.work_dir, f"{self.name_without_suff}_model.pdf"))
        model_inference_result = infer_result.get_infer_res()

        pipe_result.draw_layout(os.path.join(self.work_dir, f"{self.name_without_suff}_layout.pdf"))
        pipe_result.draw_span(os.path.join(self.work_dir, f"{self.name_without_suff}_spans.pdf"))

        md_content = pipe_result.get_markdown(self.work_image_dir)
        pipe_result.dump_md(md_writer, f"{self.name_without_suff}.md", self.work_image_dir)

        content_list = pipe_result.get_content_list(self.work_image_dir)
        pipe_result.dump_content_list(md_writer, f"{self.name_without_suff}_content_list.json", self.work_image_dir)

        middle_json = pipe_result.get_middle_json()
        pipe_result.dump_middle_json(md_writer, f"{self.name_without_suff}_middle.json")

        return md_content

    # An earlier per-sentence masking implementation, kept commented out:
    # def process_content(self, content: str) -> str:
    #     logger.info("Starting content masking process")
    #     sentences = content.split("。")
    #     final_md = ""
    #     for sentence in sentences:
    #         if not sentence.strip():  # Skip empty sentences
    #             continue
    #         formatted_prompt = get_masking_mapping_prompt(sentence)
    #         logger.info("Calling ollama to generate response, prompt: %s", formatted_prompt)
    #         response = self.ollama_client.generate(formatted_prompt)
    #         logger.info(f"Response generated: {response}")
    #         final_md += response + "。"
    #     return final_md

    def save_content(self, content: str) -> None:
        # Ensure the output path has a .md extension
        output_dir = os.path.dirname(self.output_path)
        base_name = os.path.splitext(os.path.basename(self.output_path))[0]
        md_output_path = os.path.join(output_dir, f"{base_name}.md")

        logger.info(f"Saving masked content to: {md_output_path}")
        with open(md_output_path, 'w', encoding='utf-8') as file:
            file.write(content)
```
src/main.py (removed; `@@ -1,22 +0,0 @@`):

```python
from config.logging_config import setup_logging


def main():
    # Configure logging before importing modules that create loggers
    setup_logging()

    from services.file_monitor import FileMonitor
    from config.settings import settings

    import logging
    logger = logging.getLogger(__name__)
    logger.info("Starting the application")
    logger.info(f"Monitoring directory: {settings.OBJECT_STORAGE_PATH}")
    logger.info(f"Target directory: {settings.TARGET_DIRECTORY_PATH}")

    # Initialize the file monitor
    file_monitor = FileMonitor(settings.OBJECT_STORAGE_PATH, settings.TARGET_DIRECTORY_PATH)

    # Start monitoring the directory for new files
    file_monitor.start_monitoring()


if __name__ == "__main__":
    main()
```
`prompts/masking_prompts.py` (removed; `@@ -1,81 +0,0 @@`). The prompt bodies are Chinese string literals addressed to the model and are kept verbatim:

```python
import textwrap


def get_masking_prompt(text: str) -> str:
    """
    Return the prompt for masking sensitive information in legal documents.

    Args:
        text (str): The input text to be masked

    Returns:
        str: The formatted prompt with the input text
    """
    prompt = textwrap.dedent("""
        您是一位专业的法律文档脱敏专家。请按照以下规则对文本进行脱敏处理:

        规则:
        1. 人名:
           - 两字名改为"姓+某"(如:张三 → 张某)
           - 三字名改为"姓+某某"(如:张三丰 → 张某某)
        2. 公司名:
           - 保留地理位置信息(如:北京、上海等)
           - 保留公司类型(如:有限公司、股份公司等)
           - 用"某"替换核心名称
        3. 保持原文其他部分不变
        4. 确保脱敏后的文本保持原有的语言流畅性和可读性

        输入文本:
        {text}

        请直接输出脱敏后的文本,无需解释或其他备注。
    """)

    return prompt.format(text=text)


def get_masking_mapping_prompt(text: str) -> str:
    """
    Return a prompt that generates a mapping from original names/companies
    to their masked versions.

    Args:
        text (str): The input text to be analyzed for masking

    Returns:
        str: The formatted prompt that will generate a mapping dictionary
    """
    prompt = textwrap.dedent("""
        您是一位专业的法律文档脱敏专家。请分析文本并生成一个脱敏映射表,遵循以下规则:

        规则:
        1. 人名映射规则:
           - 对于同一姓氏的不同人名,使用字母区分:
             * 第一个出现的用"姓+某"(如:张三 → 张某)
             * 第二个出现的用"姓+某A"(如:张四 → 张某A)
             * 第三个出现的用"姓+某B"(如:张五 → 张某B)
             依此类推
           - 三字名同样遵循此规则(如:张三丰 → 张某某,张四海 → 张某某A)

        2. 公司名映射规则:
           - 保留地理位置信息(如:北京、上海等)
           - 保留公司类型(如:有限公司、股份公司等)
           - 用"某"替换核心名称,但保留首尾字(如:北京智慧科技有限公司 → 北京智某科技有限公司)
           - 对于多个相似公司名,使用字母区分(如:
             北京智慧科技有限公司 → 北京某科技有限公司
             北京智能科技有限公司 → 北京某科技有限公司A)

        3. 公权机关不做脱敏处理(如:公安局、法院、检察院、中国人民银行、银监会及其他未列明的公权机关)

        请分析以下文本,并生成一个JSON格式的映射表,包含所有需要脱敏的名称及其对应的脱敏后的形式:

        {text}

        请直接输出JSON格式的映射表,格式如下:
        {{
            "原文1": "脱敏后1",
            "原文2": "脱敏后2",
            ...
        }}
        如无需要输出的映射,请输出空json,如下:
        {{}}
    """)

    return prompt.format(text=text)
```
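`get_masking_mapping_prompt` asks the model for a raw JSON object, so the caller has to parse the response defensively. `_build_mapping` itself is not shown in this diff; a tolerant parser might look like the following sketch (the `parse_mapping_response` helper is hypothetical, not the project's actual implementation):

```python
import json
import re
from typing import Dict

def parse_mapping_response(response: str) -> Dict[str, str]:
    """Extract a JSON mapping from a model response, tolerating surrounding prose."""
    # Grab the first {...} span, since models sometimes wrap JSON in commentary
    match = re.search(r'\{.*\}', response, re.DOTALL)
    if not match:
        return {}
    try:
        mapping = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    if not isinstance(mapping, dict):
        return {}
    # Keep only string -> string pairs
    return {k: v for k, v in mapping.items()
            if isinstance(k, str) and isinstance(v, str)}
```

Returning `{}` on any failure matches the behavior `process_content` expects: an empty mapping is logged as a failed chunk and skipped rather than crashing the run.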
`services/file_monitor.py` (removed; `@@ -1,54 +0,0 @@`):

```python
import logging
import os
import time

from config.settings import settings
from services.document_service import DocumentService
from services.ollama_client import OllamaClient

logger = logging.getLogger(__name__)


class FileMonitor:
    def __init__(self, input_directory: str, output_directory: str):
        self.input_directory = input_directory
        self.output_directory = output_directory

        # Create an OllamaClient instance from settings
        ollama_client = OllamaClient(
            model_name=settings.OLLAMA_MODEL,
            base_url=settings.OLLAMA_API_URL,
        )
        # Inject the OllamaClient into DocumentService
        self.document_service = DocumentService(ollama_client=ollama_client)

    def process_new_file(self, file_path: str) -> None:
        try:
            # Build the output path from the input filename
            filename = os.path.basename(file_path)
            output_path = os.path.join(self.output_directory, filename)

            logger.info(f"Processing file: {file_path}")
            # Process the document using the document service
            self.document_service.process_document(file_path, output_path)
            logger.info(f"File processed successfully: {file_path}")

        except Exception as e:
            logger.error(f"Error processing file {file_path}: {str(e)}")

    def start_monitoring(self):
        # Ensure the output directory exists
        os.makedirs(self.output_directory, exist_ok=True)

        already_seen = set(os.listdir(self.input_directory))
        while True:
            time.sleep(1)  # Poll every second
            current_files = set(os.listdir(self.input_directory))
            new_files = current_files - already_seen

            for new_file in new_files:
                file_path = os.path.join(self.input_directory, new_file)
                logger.info(f"New file found: {new_file}")
                self.process_new_file(file_path)

            already_seen = current_files
```
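The polling loop in `start_monitoring` can be factored into a single testable step. This sketch isolates the set-difference logic (the `poll_once` helper is illustrative, not part of the codebase):

```python
import os
import tempfile

def poll_once(directory: str, already_seen: set) -> tuple:
    """One polling step: return newly appeared filenames and the updated snapshot."""
    current = set(os.listdir(directory))
    return sorted(current - already_seen), current

# Simulate two snapshots with a file appearing in between
with tempfile.TemporaryDirectory() as d:
    seen = set(os.listdir(d))  # initial snapshot: empty directory
    open(os.path.join(d, 'new.txt'), 'w').close()
    new_files, seen = poll_once(d, seen)
```

As in `start_monitoring`, a file that appears between polls is reported exactly once, but a file still being written may be picked up before it is complete; the original loop shares that race.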