Initial commit
This commit is contained in:
parent
0904ab5073
commit
7233176ab9
19
.env.example
|
|
@@ -1,19 +0,0 @@
|
|||
# Storage paths
|
||||
OBJECT_STORAGE_PATH=/path/to/mounted/object/storage
|
||||
TARGET_DIRECTORY_PATH=/path/to/target/directory
|
||||
|
||||
# Ollama API Configuration
|
||||
OLLAMA_API_URL=https://api.ollama.com
|
||||
OLLAMA_API_KEY=your_api_key_here
|
||||
OLLAMA_MODEL=llama2
|
||||
|
||||
# Application Settings
|
||||
MONITOR_INTERVAL=5
|
||||
|
||||
# Logging Configuration
|
||||
LOG_LEVEL=INFO
|
||||
LOG_FILE=app.log
|
||||
|
||||
# Optional: Additional security settings
|
||||
# MAX_FILE_SIZE=10485760 # 10MB in bytes
|
||||
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf
|
||||
|
|
@@ -0,0 +1,76 @@
|
|||
# Python
|
||||
__pycache__/
|
||||
*.py[cod]
|
||||
*$py.class
|
||||
*.so
|
||||
.Python
|
||||
build/
|
||||
develop-eggs/
|
||||
dist/
|
||||
downloads/
|
||||
eggs/
|
||||
.eggs/
|
||||
lib/
|
||||
lib64/
|
||||
parts/
|
||||
sdist/
|
||||
var/
|
||||
wheels/
|
||||
*.egg-info/
|
||||
.installed.cfg
|
||||
*.egg
|
||||
|
||||
# Virtual Environment
|
||||
venv/
|
||||
env/
|
||||
ENV/
|
||||
.env
|
||||
|
||||
# IDE
|
||||
.idea/
|
||||
.vscode/
|
||||
*.swp
|
||||
*.swo
|
||||
.DS_Store
|
||||
|
||||
# Logs
|
||||
*.log
|
||||
logs/
|
||||
|
||||
# Testing
|
||||
.coverage
|
||||
.pytest_cache/
|
||||
htmlcov/
|
||||
.tox/
|
||||
|
||||
# Project specific
|
||||
target_folder/
|
||||
output/
|
||||
temp/
|
||||
|
||||
# Jupyter Notebook
|
||||
.ipynb_checkpoints
|
||||
|
||||
# mypy
|
||||
.mypy_cache/
|
||||
|
||||
# Distribution / packaging
|
||||
.Python
|
||||
*.pyc
|
||||
|
||||
# Local development settings
|
||||
.env.local
|
||||
.env.development.local
|
||||
.env.test.local
|
||||
.env.production.local
|
||||
|
||||
src_folder
|
||||
target_folder
|
||||
app.log
|
||||
__pycache__
|
||||
data/doc_dest
|
||||
data/doc_src
|
||||
data/doc_intermediate
|
||||
|
||||
node_modules
|
||||
backend/storage/
|
||||
|
|
@@ -0,0 +1,206 @@
|
|||
# Unified Docker Compose Setup
|
||||
|
||||
This project now includes a unified Docker Compose configuration that allows all services (mineru, backend, frontend) to run together and communicate using service names.
|
||||
|
||||
## Architecture
|
||||
|
||||
The unified setup includes the following services:
|
||||
|
||||
- **mineru-api**: Document processing service (port 8001)
|
||||
- **backend-api**: Main API service (port 8000)
|
||||
- **celery-worker**: Background task processor
|
||||
- **redis**: Message broker for Celery
|
||||
- **frontend**: React frontend application (port 3000)
|
||||
|
||||
## Network Configuration
|
||||
|
||||
All services are connected through a custom bridge network called `app-network`, allowing them to communicate using service names:
|
||||
|
||||
- Backend → Mineru: `http://mineru-api:8000`
|
||||
- Frontend → Backend: `http://localhost:8000/api/v1` (external access)
|
||||
- Backend → Redis: `redis://redis:6379/0`
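A minimal sketch of what this looks like in `docker-compose.yml` (only the network-related keys are shown; the build context and environment values here are illustrative, not copied from the actual file):

```yaml
# Illustrative fragment only - the real compose file may differ in detail
networks:
  app-network:
    driver: bridge

services:
  backend-api:
    build: ./backend          # assumed build context
    networks:
      - app-network
    environment:
      # service names resolve through Docker's embedded DNS on app-network
      - MINERU_API_URL=http://mineru-api:8000
      - CELERY_BROKER_URL=redis://redis:6379/0
```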
|
||||
|
||||
## Usage
|
||||
|
||||
### Starting all services
|
||||
|
||||
```bash
|
||||
# From the root directory
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
### Starting specific services
|
||||
|
||||
```bash
|
||||
# Start only backend and mineru
|
||||
docker-compose up -d backend-api mineru-api redis
|
||||
|
||||
# Start only frontend and backend
|
||||
docker-compose up -d frontend backend-api redis
|
||||
```
|
||||
|
||||
### Stopping services
|
||||
|
||||
```bash
|
||||
# Stop all services
|
||||
docker-compose down
|
||||
|
||||
# Stop and remove volumes
|
||||
docker-compose down -v
|
||||
```
|
||||
|
||||
### Viewing logs
|
||||
|
||||
```bash
|
||||
# View all logs
|
||||
docker-compose logs -f
|
||||
|
||||
# View specific service logs
|
||||
docker-compose logs -f backend-api
|
||||
docker-compose logs -f mineru-api
|
||||
docker-compose logs -f frontend
|
||||
```
|
||||
|
||||
## Building Services
|
||||
|
||||
### Building all services
|
||||
|
||||
```bash
|
||||
# Build all services
|
||||
docker-compose build
|
||||
|
||||
# Build and start all services
|
||||
docker-compose up -d --build
|
||||
```
|
||||
|
||||
### Building individual services
|
||||
|
||||
```bash
|
||||
# Build only backend
|
||||
docker-compose build backend-api
|
||||
|
||||
# Build only frontend
|
||||
docker-compose build frontend
|
||||
|
||||
# Build only mineru
|
||||
docker-compose build mineru-api
|
||||
|
||||
# Build multiple specific services
|
||||
docker-compose build backend-api frontend
|
||||
```
|
||||
|
||||
### Building and restarting specific services
|
||||
|
||||
```bash
|
||||
# Build and restart only backend
|
||||
docker-compose build backend-api
|
||||
docker-compose up -d backend-api
|
||||
|
||||
# Or combine in one command
|
||||
docker-compose up -d --build backend-api
|
||||
|
||||
# Build and restart backend and celery worker
|
||||
docker-compose up -d --build backend-api celery-worker
|
||||
```
|
||||
|
||||
### Force rebuild (no cache)
|
||||
|
||||
```bash
|
||||
# Force rebuild all services
|
||||
docker-compose build --no-cache
|
||||
|
||||
# Force rebuild specific service
|
||||
docker-compose build --no-cache backend-api
|
||||
```
|
||||
|
||||
## Environment Variables
|
||||
|
||||
The unified setup uses environment variables from the individual service `.env` files:
|
||||
|
||||
- `./backend/.env` - Backend configuration
|
||||
- `./frontend/.env` - Frontend configuration
|
||||
- `./mineru/.env` - Mineru configuration (if exists)
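In the compose file this typically maps to per-service `env_file` entries, roughly like the following sketch (paths taken from the list above):

```yaml
services:
  backend-api:
    env_file:
      - ./backend/.env
  frontend:
    env_file:
      - ./frontend/.env
```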
|
||||
|
||||
### Key Configuration Changes
|
||||
|
||||
1. **Backend Configuration** (`backend/app/core/config.py`):
|
||||
```python
|
||||
MINERU_API_URL: str = "http://mineru-api:8000"
|
||||
```
|
||||
|
||||
2. **Frontend Configuration**:
|
||||
```javascript
|
||||
REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
|
||||
```
|
||||
|
||||
## Service Dependencies
|
||||
|
||||
- `backend-api` depends on `redis` and `mineru-api`
|
||||
- `celery-worker` depends on `redis` and `backend-api`
|
||||
- `frontend` depends on `backend-api`
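Expressed as compose `depends_on` entries, these dependencies look roughly like the sketch below (the actual file may also attach health-check conditions):

```yaml
services:
  backend-api:
    depends_on:
      - redis
      - mineru-api
  celery-worker:
    depends_on:
      - redis
      - backend-api
  frontend:
    depends_on:
      - backend-api
```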
|
||||
|
||||
## Port Mapping
|
||||
|
||||
- **Frontend**: `http://localhost:3000`
|
||||
- **Backend API**: `http://localhost:8000`
|
||||
- **Mineru API**: `http://localhost:8001`
|
||||
- **Redis**: `localhost:6379`
|
||||
|
||||
## Health Checks
|
||||
|
||||
The mineru-api service includes a health check that verifies the service is running properly.
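The exact probe lives in `docker-compose.yml`; a typical healthcheck of this kind is sketched below (the `/health` path matches the endpoint tested elsewhere in this repo, while the intervals and the use of `wget` are assumptions):

```yaml
services:
  mineru-api:
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```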
|
||||
|
||||
## Development vs Production
|
||||
|
||||
For development, you can still use the individual docker-compose files in each service directory. The unified setup is ideal for:
|
||||
|
||||
- Production deployments
|
||||
- End-to-end testing
|
||||
- Simplified development environment
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Service Communication Issues
|
||||
|
||||
If services can't communicate:
|
||||
|
||||
1. Check if all services are running: `docker-compose ps`
|
||||
2. Verify network connectivity: `docker network ls`
|
||||
3. Check service logs: `docker-compose logs [service-name]`
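If those checks pass but name resolution still fails, it can help to inspect the shared network and resolve a service name from inside a container (service and network names as used above):

```bash
# Confirm the containers are attached to the shared bridge network
docker network inspect app-network

# Resolve a service name from inside the backend container
docker-compose exec backend-api getent hosts mineru-api
```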
|
||||
|
||||
### Port Conflicts
|
||||
|
||||
If you get port conflicts, you can modify the port mappings in the `docker-compose.yml` file:
|
||||
|
||||
```yaml
|
||||
ports:
|
||||
- "8002:8000" # Change external port
|
||||
```
|
||||
|
||||
### Volume Issues
|
||||
|
||||
Make sure the storage directories exist:
|
||||
|
||||
```bash
|
||||
mkdir -p backend/storage
|
||||
mkdir -p mineru/storage/uploads
|
||||
mkdir -p mineru/storage/processed
|
||||
```
|
||||
|
||||
## Migration from Individual Compose Files
|
||||
|
||||
If you were previously using individual docker-compose files:
|
||||
|
||||
1. Stop all individual services:
|
||||
```bash
|
||||
cd backend && docker-compose down
|
||||
cd ../frontend && docker-compose down
|
||||
cd ../mineru && docker-compose down
|
||||
```
|
||||
|
||||
2. Start the unified setup:
|
||||
```bash
|
||||
cd .. && docker-compose up -d
|
||||
```
|
||||
|
||||
The unified setup maintains the same functionality while providing better service discovery and networking.
|
||||
|
|
@@ -0,0 +1,399 @@
|
|||
# Docker Image Migration Guide
|
||||
|
||||
This guide explains how to export your built Docker images, transfer them to another environment, and run them without rebuilding.
|
||||
|
||||
## Overview
|
||||
|
||||
The migration process involves:
|
||||
1. **Export**: Save built images to tar files
|
||||
2. **Transfer**: Copy tar files to target environment
|
||||
3. **Import**: Load images on target environment
|
||||
4. **Run**: Start services with imported images
|
||||
|
||||
## Prerequisites
|
||||
|
||||
### Source Environment (where images are built)
|
||||
- Docker installed and running
|
||||
- All services built and working
|
||||
- Sufficient disk space for image export
|
||||
|
||||
### Target Environment (where images will run)
|
||||
- Docker installed and running
|
||||
- Sufficient disk space for image import
|
||||
- Network access to the source environment (or a USB drive)
|
||||
|
||||
## Step 1: Export Docker Images
|
||||
|
||||
### 1.1 List Current Images
|
||||
|
||||
First, check what images you have:
|
||||
|
||||
```bash
|
||||
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
|
||||
```
|
||||
|
||||
You should see images like:
|
||||
- `legal-doc-masker-backend-api`
|
||||
- `legal-doc-masker-frontend`
|
||||
- `legal-doc-masker-mineru-api`
|
||||
- `redis:alpine`
|
||||
|
||||
### 1.2 Export Individual Images
|
||||
|
||||
Create a directory for exports:
|
||||
|
||||
```bash
|
||||
mkdir -p docker-images-export
|
||||
cd docker-images-export
|
||||
```
|
||||
|
||||
Export each image:
|
||||
|
||||
```bash
|
||||
# Export backend image
|
||||
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
|
||||
|
||||
# Export frontend image
|
||||
docker save legal-doc-masker-frontend:latest -o frontend.tar
|
||||
|
||||
# Export mineru image
|
||||
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
|
||||
|
||||
# Export redis image (if not using official)
|
||||
docker save redis:alpine -o redis.tar
|
||||
```
|
||||
|
||||
### 1.3 Export All Images at Once (Alternative)
|
||||
|
||||
If you want to export all images in one command:
|
||||
|
||||
```bash
|
||||
# Export all project images
|
||||
docker save \
|
||||
legal-doc-masker-backend-api:latest \
|
||||
legal-doc-masker-frontend:latest \
|
||||
legal-doc-masker-mineru-api:latest \
|
||||
redis:alpine \
|
||||
-o legal-doc-masker-all.tar
|
||||
```
|
||||
|
||||
### 1.4 Verify Export Files
|
||||
|
||||
Check the exported files:
|
||||
|
||||
```bash
|
||||
ls -lh *.tar
|
||||
```
|
||||
|
||||
You should see files like:
|
||||
- `backend-api.tar` (~200-500MB)
|
||||
- `frontend.tar` (~100-300MB)
|
||||
- `mineru-api.tar` (~1-3GB)
|
||||
- `redis.tar` (~30-50MB)
|
||||
|
||||
## Step 2: Transfer Images
|
||||
|
||||
### 2.1 Transfer via Network (SCP/RSYNC)
|
||||
|
||||
```bash
|
||||
# Transfer to remote server
|
||||
scp *.tar user@remote-server:/path/to/destination/
|
||||
|
||||
# Or using rsync (more efficient for large files)
|
||||
rsync -avz --progress *.tar user@remote-server:/path/to/destination/
|
||||
```
|
||||
|
||||
### 2.2 Transfer via USB Drive
|
||||
|
||||
```bash
|
||||
# Copy to USB drive
|
||||
cp *.tar /Volumes/USB_DRIVE/docker-images/
|
||||
|
||||
# Or create a compressed archive
|
||||
tar -czf legal-doc-masker-images.tar.gz *.tar
|
||||
cp legal-doc-masker-images.tar.gz /Volumes/USB_DRIVE/
|
||||
```
|
||||
|
||||
### 2.3 Transfer via Cloud Storage
|
||||
|
||||
```bash
|
||||
# Upload to cloud storage (example with AWS S3)
|
||||
aws s3 cp *.tar s3://your-bucket/docker-images/
|
||||
|
||||
# Or using Google Cloud Storage
|
||||
gsutil cp *.tar gs://your-bucket/docker-images/
|
||||
```
|
||||
|
||||
## Step 3: Import Images on Target Environment
|
||||
|
||||
### 3.1 Prepare Target Environment
|
||||
|
||||
```bash
|
||||
# Create directory for images
|
||||
mkdir -p docker-images-import
|
||||
cd docker-images-import
|
||||
|
||||
# Copy images from transfer method
|
||||
# (SCP, USB, or download from cloud storage)
|
||||
```
|
||||
|
||||
### 3.2 Import Individual Images
|
||||
|
||||
```bash
|
||||
# Import backend image
|
||||
docker load -i backend-api.tar
|
||||
|
||||
# Import frontend image
|
||||
docker load -i frontend.tar
|
||||
|
||||
# Import mineru image
|
||||
docker load -i mineru-api.tar
|
||||
|
||||
# Import redis image
|
||||
docker load -i redis.tar
|
||||
```
|
||||
|
||||
### 3.3 Import All Images at Once (if exported together)
|
||||
|
||||
```bash
|
||||
docker load -i legal-doc-masker-all.tar
|
||||
```
|
||||
|
||||
### 3.4 Verify Imported Images
|
||||
|
||||
```bash
|
||||
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
|
||||
```
|
||||
|
||||
## Step 4: Prepare Target Environment
|
||||
|
||||
### 4.1 Copy Project Files
|
||||
|
||||
Transfer the following files to the target environment:
|
||||
|
||||
```bash
|
||||
# Essential files to copy
|
||||
docker-compose.yml
|
||||
DOCKER_COMPOSE_README.md
|
||||
setup-unified-docker.sh
|
||||
|
||||
# Environment files (if they exist)
|
||||
backend/.env
|
||||
frontend/.env
|
||||
mineru/.env
|
||||
|
||||
# Storage directories (if you want to preserve data)
|
||||
backend/storage/
|
||||
mineru/storage/
|
||||
backend/legal_doc_masker.db
|
||||
```
|
||||
|
||||
### 4.2 Create Directory Structure
|
||||
|
||||
```bash
|
||||
# Create necessary directories
|
||||
mkdir -p backend/storage
|
||||
mkdir -p mineru/storage/uploads
|
||||
mkdir -p mineru/storage/processed
|
||||
```
|
||||
|
||||
## Step 5: Run Services
|
||||
|
||||
### 5.1 Start All Services
|
||||
|
||||
```bash
|
||||
# Start all services using imported images
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
### 5.2 Verify Services
|
||||
|
||||
```bash
|
||||
# Check service status
|
||||
docker-compose ps
|
||||
|
||||
# Check service logs
|
||||
docker-compose logs -f
|
||||
```
|
||||
|
||||
### 5.3 Test Endpoints
|
||||
|
||||
```bash
|
||||
# Test frontend
|
||||
curl -I http://localhost:3000
|
||||
|
||||
# Test backend API
|
||||
curl -I http://localhost:8000/api/v1
|
||||
|
||||
# Test mineru API
|
||||
curl -I http://localhost:8001/health
|
||||
```
|
||||
|
||||
## Automation Scripts
|
||||
|
||||
### Export Script
|
||||
|
||||
Create `export-images.sh`:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "🚀 Exporting Docker Images"
|
||||
|
||||
# Create export directory
|
||||
mkdir -p docker-images-export
|
||||
cd docker-images-export
|
||||
|
||||
# Export images
|
||||
echo "📦 Exporting backend-api image..."
|
||||
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
|
||||
|
||||
echo "📦 Exporting frontend image..."
|
||||
docker save legal-doc-masker-frontend:latest -o frontend.tar
|
||||
|
||||
echo "📦 Exporting mineru-api image..."
|
||||
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
|
||||
|
||||
echo "📦 Exporting redis image..."
|
||||
docker save redis:alpine -o redis.tar
|
||||
|
||||
# Show file sizes
|
||||
echo "📊 Export complete. File sizes:"
|
||||
ls -lh *.tar
|
||||
|
||||
echo "✅ Images exported successfully!"
|
||||
```
|
||||
|
||||
### Import Script
|
||||
|
||||
Create `import-images.sh`:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "🚀 Importing Docker Images"
|
||||
|
||||
# Check if tar files exist
|
||||
if [ ! -f "backend-api.tar" ]; then
|
||||
echo "❌ backend-api.tar not found"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Import images
|
||||
echo "📦 Importing backend-api image..."
|
||||
docker load -i backend-api.tar
|
||||
|
||||
echo "📦 Importing frontend image..."
|
||||
docker load -i frontend.tar
|
||||
|
||||
echo "📦 Importing mineru-api image..."
|
||||
docker load -i mineru-api.tar
|
||||
|
||||
echo "📦 Importing redis image..."
|
||||
docker load -i redis.tar
|
||||
|
||||
# Verify imports
|
||||
echo "📊 Imported images:"
|
||||
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker
|
||||
|
||||
echo "✅ Images imported successfully!"
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **Image not found during import**
|
||||
```bash
|
||||
# Check if image exists
|
||||
docker images | grep image-name
|
||||
|
||||
# Re-export if needed
|
||||
docker save image-name:tag -o image-name.tar
|
||||
```
|
||||
|
||||
2. **Port conflicts on target environment**
|
||||
```bash
|
||||
# Check what's using the ports
|
||||
lsof -i :8000
|
||||
lsof -i :8001
|
||||
lsof -i :3000
|
||||
|
||||
# Modify docker-compose.yml if needed
|
||||
ports:
|
||||
- "8002:8000" # Change external port
|
||||
```
|
||||
|
||||
3. **Permission issues**
|
||||
```bash
|
||||
# Fix file permissions
|
||||
chmod +x setup-unified-docker.sh
|
||||
chmod +x export-images.sh
|
||||
chmod +x import-images.sh
|
||||
```
|
||||
|
||||
4. **Storage directory issues**
|
||||
```bash
|
||||
# Create directories with proper permissions
|
||||
sudo mkdir -p backend/storage
|
||||
sudo mkdir -p mineru/storage/uploads
|
||||
sudo mkdir -p mineru/storage/processed
|
||||
sudo chown -R $USER:$USER backend/storage mineru/storage
|
||||
```
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
1. **Compress images for transfer**
|
||||
```bash
|
||||
# Compress before transfer
|
||||
gzip *.tar
|
||||
|
||||
# Decompress on target
|
||||
gunzip *.tar.gz
|
||||
```
|
||||
|
||||
2. **Use parallel transfer**
|
||||
```bash
|
||||
# Transfer multiple files in parallel
|
||||
parallel scp {} user@server:/path/ ::: *.tar
|
||||
```
|
||||
|
||||
3. **Use Docker registry (alternative)**
|
||||
```bash
|
||||
# Push to registry
|
||||
docker tag legal-doc-masker-backend-api:latest your-registry/backend-api:latest
|
||||
docker push your-registry/backend-api:latest
|
||||
|
||||
# Pull on target
|
||||
docker pull your-registry/backend-api:latest
|
||||
```
|
||||
|
||||
## Complete Migration Checklist
|
||||
|
||||
- [ ] Export all Docker images
|
||||
- [ ] Transfer image files to target environment
|
||||
- [ ] Transfer project configuration files
|
||||
- [ ] Import images on target environment
|
||||
- [ ] Create necessary directories
|
||||
- [ ] Start services
|
||||
- [ ] Verify all services are running
|
||||
- [ ] Test all endpoints
|
||||
- [ ] Update any environment-specific configurations
|
||||
|
||||
## Security Considerations
|
||||
|
||||
1. **Secure transfer**: Use encrypted transfer methods (SCP, SFTP)
|
||||
2. **Image verification**: Verify image integrity after transfer
|
||||
3. **Environment isolation**: Ensure target environment is properly secured
|
||||
4. **Access control**: Limit access to Docker daemon on target environment
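For item 2, a simple way to verify integrity is to checksum the tar files before and after transfer, for example:

```bash
# On the source environment (use `shasum -a 256` on macOS)
sha256sum *.tar > images.sha256

# On the target environment, after transfer
sha256sum -c images.sha256
```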
|
||||
|
||||
## Cost Optimization
|
||||
|
||||
1. **Image size**: Remove unnecessary layers before export
|
||||
2. **Compression**: Use compression for large images
|
||||
3. **Selective transfer**: Only transfer images you need
|
||||
4. **Cleanup**: Remove old images after successful migration
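For items 1 and 4, the layer breakdown of an image can be inspected and unused images cleaned up before and after the migration, for example:

```bash
# See which layers contribute most to an image's size
docker history legal-doc-masker-mineru-api:latest

# Remove dangling images and unused build cache
docker image prune
docker builder prune
```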
|
||||
48
Dockerfile
|
|
@@ -1,48 +0,0 @@
|
|||
# Build stage
|
||||
FROM python:3.12-slim AS builder
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
# Install build dependencies
|
||||
RUN apt-get update && \
|
||||
apt-get install -y --no-install-recommends \
|
||||
build-essential \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Copy requirements first to leverage Docker cache
|
||||
COPY requirements.txt .
|
||||
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt
|
||||
|
||||
# Final stage
|
||||
FROM python:3.12-slim
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
# Create non-root user
|
||||
RUN useradd -m -r appuser && \
|
||||
chown appuser:appuser /app
|
||||
|
||||
# Copy wheels from builder
|
||||
COPY --from=builder /app/wheels /wheels
|
||||
COPY --from=builder /app/requirements.txt .
|
||||
|
||||
# Install dependencies
|
||||
RUN pip install --no-cache-dir /wheels/*
|
||||
|
||||
# Copy application code
|
||||
COPY src/ ./src/
|
||||
|
||||
# Create directories for mounted volumes
|
||||
RUN mkdir -p /data/input /data/output && \
|
||||
chown -R appuser:appuser /data
|
||||
|
||||
# Switch to non-root user
|
||||
USER appuser
|
||||
|
||||
# Environment variables
|
||||
ENV PYTHONPATH=/app \
|
||||
OBJECT_STORAGE_PATH=/data/input \
|
||||
TARGET_DIRECTORY_PATH=/data/output
|
||||
|
||||
# Run the application
|
||||
CMD ["python", "src/main.py"]
|
||||
|
|
@@ -0,0 +1,178 @@
|
|||
# Docker Migration Quick Reference
|
||||
|
||||
## 🚀 Quick Migration Process
|
||||
|
||||
### Source Environment (Export)
|
||||
|
||||
```bash
|
||||
# 1. Build images first (if not already built)
|
||||
docker-compose build
|
||||
|
||||
# 2. Export all images
|
||||
./export-images.sh
|
||||
|
||||
# 3. Transfer files to target environment
|
||||
# Option A: SCP
|
||||
scp -r docker-images-export-*/ user@target-server:/path/to/destination/
|
||||
|
||||
# Option B: USB Drive
|
||||
cp -r docker-images-export-*/ /Volumes/USB_DRIVE/
|
||||
|
||||
# Option C: Compressed archive
|
||||
scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/
|
||||
```
|
||||
|
||||
### Target Environment (Import)
|
||||
|
||||
```bash
|
||||
# 1. Copy project files
|
||||
scp docker-compose.yml user@target-server:/path/to/destination/
|
||||
scp DOCKER_COMPOSE_README.md user@target-server:/path/to/destination/
|
||||
|
||||
# 2. Import images
|
||||
./import-images.sh
|
||||
|
||||
# 3. Start services
|
||||
docker-compose up -d
|
||||
|
||||
# 4. Verify
|
||||
docker-compose ps
|
||||
```
|
||||
|
||||
## 📋 Essential Files to Transfer
|
||||
|
||||
### Required Files
|
||||
- `docker-compose.yml` - Unified compose configuration
|
||||
- `DOCKER_COMPOSE_README.md` - Documentation
|
||||
- `backend/.env` - Backend environment variables
|
||||
- `frontend/.env` - Frontend environment variables
|
||||
- `mineru/.env` - Mineru environment variables (if exists)
|
||||
|
||||
### Optional Files (for data preservation)
|
||||
- `backend/storage/` - Backend storage directory
|
||||
- `mineru/storage/` - Mineru storage directory
|
||||
- `backend/legal_doc_masker.db` - Database file
|
||||
|
||||
## 🔧 Common Commands
|
||||
|
||||
### Export Commands
|
||||
```bash
|
||||
# Manual export
|
||||
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
|
||||
docker save legal-doc-masker-frontend:latest -o frontend.tar
|
||||
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
|
||||
docker save redis:alpine -o redis.tar
|
||||
|
||||
# Compress for transfer
|
||||
tar -czf legal-doc-masker-images.tar.gz *.tar
|
||||
```
|
||||
|
||||
### Import Commands
|
||||
```bash
|
||||
# Manual import
|
||||
docker load -i backend-api.tar
|
||||
docker load -i frontend.tar
|
||||
docker load -i mineru-api.tar
|
||||
docker load -i redis.tar
|
||||
|
||||
# Extract compressed archive
|
||||
tar -xzf legal-doc-masker-images.tar.gz
|
||||
```
|
||||
|
||||
### Service Management
|
||||
```bash
|
||||
# Start all services
|
||||
docker-compose up -d
|
||||
|
||||
# Stop all services
|
||||
docker-compose down
|
||||
|
||||
# View logs
|
||||
docker-compose logs -f [service-name]
|
||||
|
||||
# Check status
|
||||
docker-compose ps
|
||||
```
|
||||
|
||||
### Building Individual Services
|
||||
```bash
|
||||
# Build specific service only
|
||||
docker-compose build backend-api
|
||||
docker-compose build frontend
|
||||
docker-compose build mineru-api
|
||||
|
||||
# Build and restart specific service
|
||||
docker-compose up -d --build backend-api
|
||||
|
||||
# Force rebuild (no cache)
|
||||
docker-compose build --no-cache backend-api
|
||||
|
||||
# Using the build script
|
||||
./build-service.sh backend-api --restart
|
||||
./build-service.sh frontend --no-cache
|
||||
./build-service.sh backend-api celery-worker
|
||||
```
|
||||
|
||||
## 🌐 Service URLs
|
||||
|
||||
After successful migration:
|
||||
- **Frontend**: http://localhost:3000
|
||||
- **Backend API**: http://localhost:8000
|
||||
- **Mineru API**: http://localhost:8001
|
||||
|
||||
## ⚠️ Troubleshooting
|
||||
|
||||
### Port Conflicts
|
||||
```bash
|
||||
# Check what's using ports
|
||||
lsof -i :8000
|
||||
lsof -i :8001
|
||||
lsof -i :3000
|
||||
|
||||
# Modify docker-compose.yml if needed
|
||||
ports:
|
||||
- "8002:8000" # Change external port
|
||||
```
|
||||
|
||||
### Permission Issues
|
||||
```bash
|
||||
# Fix script permissions
|
||||
chmod +x export-images.sh
|
||||
chmod +x import-images.sh
|
||||
chmod +x setup-unified-docker.sh
|
||||
|
||||
# Fix directory permissions
|
||||
sudo chown -R $USER:$USER backend/storage mineru/storage
|
||||
```
|
||||
|
||||
### Disk Space Issues
|
||||
```bash
|
||||
# Check available space
|
||||
df -h
|
||||
|
||||
# Clean up Docker
|
||||
docker system prune -a
|
||||
```
|
||||
|
||||
## 📊 Expected File Sizes
|
||||
|
||||
- `backend-api.tar`: ~200-500MB
|
||||
- `frontend.tar`: ~100-300MB
|
||||
- `mineru-api.tar`: ~1-3GB
|
||||
- `redis.tar`: ~30-50MB
|
||||
- `legal-doc-masker-images.tar.gz`: ~1-2GB (compressed)
|
||||
|
||||
## 🔒 Security Notes
|
||||
|
||||
1. Use encrypted transfer (SCP, SFTP) for sensitive environments
|
||||
2. Verify image integrity after transfer
|
||||
3. Update environment variables for target environment
|
||||
4. Ensure proper network security on target environment
|
||||
|
||||
## 📞 Support
|
||||
|
||||
If you encounter issues:
|
||||
1. Check the full `DOCKER_MIGRATION_GUIDE.md`
|
||||
2. Verify all required files are present
|
||||
3. Check Docker logs: `docker-compose logs -f`
|
||||
4. Ensure sufficient disk space and permissions
|
||||
12
README.md
|
|
@@ -35,14 +35,20 @@ doc-processing-app
|
|||
cd doc-processing-app
|
||||
```
|
||||
|
||||
2. Install the required dependencies:
|
||||
2. Install LibreOffice (required for document processing):
|
||||
```
|
||||
brew install libreoffice
|
||||
```
|
||||
|
||||
3. Install the required dependencies:
|
||||
```
|
||||
pip install -r requirements.txt
|
||||
pip install -U magic-pdf[full]
|
||||
```
|
||||
|
||||
3. Configure the application by editing the `src/config/settings.py` file to set the paths for the object storage and target directory.
|
||||
4. Configure the application by editing the `src/config/settings.py` file to set the paths for the object storage and target directory.
|
||||
|
||||
4. Create a `.env` file based on the `.env.example` file to set up necessary environment variables.
|
||||
5. Create a `.env` file based on the `.env.example` file to set up necessary environment variables.
|
||||
|
||||
## Usage
|
||||
|
||||
|
|
|
|||
1
app.log
|
|
@@ -1 +0,0 @@
|
|||
2025-04-20 20:14:00 - services.file_monitor - INFO - monitor: new file found: README.md
|
||||
|
|
@@ -0,0 +1,20 @@
|
|||
# Storage paths
|
||||
OBJECT_STORAGE_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_src
|
||||
TARGET_DIRECTORY_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_dest
|
||||
INTERMEDIATE_DIR_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_intermediate
|
||||
|
||||
# Ollama API Configuration
|
||||
OLLAMA_API_URL=http://192.168.2.245:11434
|
||||
# OLLAMA_API_KEY=your_api_key_here
|
||||
OLLAMA_MODEL=qwen3:8b
|
||||
|
||||
# Application Settings
|
||||
MONITOR_INTERVAL=5
|
||||
|
||||
# Logging Configuration
|
||||
LOG_LEVEL=INFO
|
||||
LOG_FILE=app.log
|
||||
|
||||
# Optional: Additional security settings
|
||||
# MAX_FILE_SIZE=10485760 # 10MB in bytes
|
||||
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf
|
||||
|
|
@@ -0,0 +1,36 @@
|
|||
FROM python:3.11-slim
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
# Install system dependencies
|
||||
RUN apt-get update && apt-get install -y \
|
||||
build-essential \
|
||||
libreoffice \
|
||||
wget \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
|
||||
# Copy requirements first to leverage Docker cache
|
||||
COPY requirements.txt .
|
||||
# RUN pip install huggingface_hub
|
||||
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
|
||||
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py
|
||||
|
||||
# RUN python download_models_hf.py
|
||||
|
||||
|
||||
RUN pip install --no-cache-dir -r requirements.txt
|
||||
# RUN pip install -U magic-pdf[full]
|
||||
|
||||
|
||||
# Copy the rest of the application
|
||||
COPY . .
|
||||
|
||||
# Create storage directories
|
||||
RUN mkdir -p storage/uploads storage/processed
|
||||
|
||||
# Expose the port the app runs on
|
||||
EXPOSE 8000
|
||||
|
||||
# Command to run the application
|
||||
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
|
||||
|
|
@@ -0,0 +1,202 @@
|
|||
# PDF Processor with Mineru API
|
||||
|
||||
## Overview
|
||||
|
||||
The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.
|
||||
|
||||
## Changes Made
|
||||
|
||||
### 1. Removed Dependencies
|
||||
- Removed all `magic_pdf` imports and dependencies
|
||||
- Removed `PyPDF2` direct usage (though kept in requirements for potential other uses)
|
||||
|
||||
### 2. New Implementation
|
||||
- **REST API Integration**: Uses HTTP requests to call Mineru's API
|
||||
- **Configurable Settings**: Mineru API URL and timeout are configurable
|
||||
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
|
||||
- **Flexible Response Parsing**: Handles multiple possible response formats from Mineru API
|
||||
|
||||
### 3. Configuration
|
||||
|
||||
Add the following settings to your environment or `.env` file:
|
||||
|
||||
```bash
|
||||
# Mineru API Configuration
|
||||
MINERU_API_URL=http://mineru-api:8000
|
||||
MINERU_TIMEOUT=300
|
||||
MINERU_LANG_LIST=["ch"]
|
||||
MINERU_BACKEND=pipeline
|
||||
MINERU_PARSE_METHOD=auto
|
||||
MINERU_FORMULA_ENABLE=true
|
||||
MINERU_TABLE_ENABLE=true
|
||||
```
|
||||
|
||||
### 4. API Endpoint
|
||||
|
||||
The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.
|
||||
|
||||
#### Expected Request Format:
|
||||
```
|
||||
POST /file_parse
|
||||
Content-Type: multipart/form-data
|
||||
|
||||
files: [PDF file]
|
||||
output_dir: ./output
|
||||
lang_list: ["ch"]
|
||||
backend: pipeline
|
||||
parse_method: auto
|
||||
formula_enable: true
|
||||
table_enable: true
|
||||
return_md: true
|
||||
return_middle_json: false
|
||||
return_model_output: false
|
||||
return_content_list: false
|
||||
return_images: false
|
||||
start_page_id: 0
|
||||
end_page_id: 99999
|
||||
```
|
||||
|
||||
#### Expected Response Format:
|
||||
The processor can handle multiple response formats:
|
||||
|
||||
```json
|
||||
{
|
||||
"markdown": "# Document Title\n\nContent here..."
|
||||
}
|
||||
```
|
||||
|
||||
OR
|
||||
|
||||
```json
|
||||
{
|
||||
"md": "# Document Title\n\nContent here..."
|
||||
}
|
||||
```
|
||||
|
||||
OR
|
||||
|
||||
```json
|
||||
{
|
||||
"content": "# Document Title\n\nContent here..."
|
||||
}
|
||||
```
|
||||
|
||||
OR
|
||||
|
||||
```json
|
||||
{
|
||||
"result": {
|
||||
"markdown": "# Document Title\n\nContent here..."
|
||||
}
|
||||
}
|
||||
```
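For reference, the kind of call the processor makes can be sketched with `requests` as follows. This is a simplified illustration, not the actual processor code: the field names follow the request format above, the fallback keys follow the response formats above, how list/boolean form fields are encoded may differ, and retries plus detailed error handling are omitted.

```python
import requests

MINERU_API_URL = "http://mineru-api:8000"   # settings.MINERU_API_URL
MINERU_TIMEOUT = 300                        # settings.MINERU_TIMEOUT


def parse_pdf_to_markdown(pdf_path: str) -> str:
    """Sketch of a /file_parse call that extracts markdown from the response."""
    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            f"{MINERU_API_URL}/file_parse",
            files={"files": (pdf_path, pdf, "application/pdf")},
            data={
                "backend": "pipeline",
                "parse_method": "auto",
                "return_md": "true",
            },
            timeout=MINERU_TIMEOUT,
        )
    response.raise_for_status()
    payload = response.json()
    # Fall back across the documented response variants.
    return (
        payload.get("markdown")
        or payload.get("md")
        or payload.get("content")
        or payload.get("result", {}).get("markdown", "")
    )
```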
|
||||
|
||||
## Usage
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor
|
||||
|
||||
# Create processor instance
|
||||
processor = PdfDocumentProcessor("input.pdf", "output.md")
|
||||
|
||||
# Read and convert PDF to markdown
|
||||
content = processor.read_content()
|
||||
|
||||
# Process content (apply masking)
|
||||
processed_content = processor.process_content(content)
|
||||
|
||||
# Save processed content
|
||||
processor.save_content(processed_content)
|
||||
```
|
||||
|
||||
### Through Document Service
|
||||
|
||||
```python
|
||||
from app.core.services.document_service import DocumentService
|
||||
|
||||
service = DocumentService()
|
||||
success = service.process_document("input.pdf", "output.md")
|
||||
```
|
||||
|
||||
## Testing
|
||||
|
||||
Run the test script to verify the implementation:
|
||||
|
||||
```bash
|
||||
cd backend
|
||||
python test_pdf_processor.py
|
||||
```
|
||||
|
||||
Make sure you have:
|
||||
1. A sample PDF file in the `sample_doc/` directory
|
||||
2. Mineru API service running and accessible
|
||||
3. Proper network connectivity between services
|
||||
|
||||
## Error Handling
|
||||
|
||||
The processor handles various error scenarios:
|
||||
|
||||
- **Network Timeouts**: Configurable timeout (default: 5 minutes)
|
||||
- **API Errors**: HTTP status code errors are logged and handled
|
||||
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
|
||||
- **File Operations**: Proper error handling for file reading/writing
|
||||
|
||||
## Logging
|
||||
|
||||
The processor provides detailed logging for debugging:
|
||||
|
||||
- API call attempts and responses
|
||||
- Content extraction results
|
||||
- Error conditions and stack traces
|
||||
- Processing statistics
|
||||
|
||||
## Deployment
|
||||
|
||||
### Docker Compose
|
||||
|
||||
Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.
|
||||
|
||||
### Environment Variables
|
||||
|
||||
Set the following environment variables in your deployment:
|
||||
|
||||
```bash
|
||||
MINERU_API_URL=http://your-mineru-service:8000
|
||||
MINERU_TIMEOUT=300
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **Connection Refused**: Check if Mineru service is running and accessible
|
||||
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
|
||||
3. **Empty Content**: Check Mineru API response format and logs
|
||||
4. **Network Issues**: Verify network connectivity between services
|
||||
|
||||
### Debug Mode
|
||||
|
||||
Enable debug logging to see detailed API interactions:
|
||||
|
||||
```python
|
||||
import logging
|
||||
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
|
||||
```
|
||||
|
||||
## Migration from magic_pdf
|
||||
|
||||
If you were previously using magic_pdf:
|
||||
|
||||
1. **No Code Changes Required**: The interface remains the same
|
||||
2. **Configuration Update**: Add Mineru API settings
|
||||
3. **Service Dependencies**: Ensure Mineru service is running
|
||||
4. **Testing**: Run the test script to verify functionality
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
- **Timeout**: Large PDFs may require longer timeouts
|
||||
- **Memory**: The processor loads the entire PDF into memory for API calls
|
||||
- **Network**: API calls add network latency to processing time
|
||||
- **Caching**: Consider implementing caching for frequently processed documents
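As a sketch of the caching idea in the last bullet (not part of the current code; the cache location and keying scheme are assumptions), a simple content-addressed cache could look like:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("storage/markdown_cache")  # illustrative location


def cached_markdown(pdf_path: str, convert) -> str:
    """Return cached markdown for a PDF, calling convert(pdf_path) on a miss."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = convert(pdf_path)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```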
|
||||
|
|
@@ -0,0 +1,103 @@
|
|||
# Legal Document Masker API
|
||||
|
||||
This is the backend API for the Legal Document Masking system. It provides endpoints for file upload, processing status tracking, and file download.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Python 3.8+
|
||||
- Redis (for Celery)
|
||||
|
||||
## File Storage
|
||||
|
||||
Files are stored in the following structure:
|
||||
```
|
||||
backend/
|
||||
├── storage/
|
||||
│ ├── uploads/ # Original uploaded files
|
||||
│ └── processed/ # Masked/processed files
|
||||
```
|
||||
|
||||
## Setup
|
||||
|
||||
### Option 1: Local Development
|
||||
|
||||
1. Create a virtual environment:
|
||||
```bash
|
||||
python -m venv venv
|
||||
source venv/bin/activate # On Windows: venv\Scripts\activate
|
||||
```
|
||||
|
||||
2. Install dependencies:
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. Set up environment variables:
|
||||
Create a `.env` file in the backend directory with the following variables:
|
||||
```env
|
||||
SECRET_KEY=your-secret-key-here
|
||||
```
|
||||
|
||||
The database (SQLite) will be automatically created when you first run the application.
|
||||
|
||||
4. Start Redis (required for Celery):
|
||||
```bash
|
||||
redis-server
|
||||
```
|
||||
|
||||
5. Start Celery worker:
|
||||
```bash
|
||||
celery -A app.services.file_service worker --loglevel=info
|
||||
```
|
||||
|
||||
6. Start the FastAPI server:
|
||||
```bash
|
||||
uvicorn app.main:app --reload
|
||||
```
|
||||
|
||||
### Option 2: Docker Deployment
|
||||
|
||||
1. Build and start the services:
|
||||
```bash
|
||||
docker-compose up --build
|
||||
```
|
||||
|
||||
This will start:
|
||||
- FastAPI server on port 8000
|
||||
- Celery worker for background processing
|
||||
- Redis for task queue
|
||||
|
||||
## API Documentation
|
||||
|
||||
Once the server is running, you can access:
|
||||
- Swagger UI: `http://localhost:8000/docs`
|
||||
- ReDoc: `http://localhost:8000/redoc`
|
||||
|
||||
## API Endpoints
|
||||
|
||||
- `POST /api/v1/files/upload` - Upload a new file
|
||||
- `GET /api/v1/files` - List all files
|
||||
- `GET /api/v1/files/{file_id}` - Get file details
|
||||
- `GET /api/v1/files/{file_id}/download` - Download processed file
|
||||
- `WS /api/v1/files/ws/status/{file_id}` - WebSocket for real-time status updates
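As a quick smoke test against these endpoints (the sample file path is a placeholder), a file can be uploaded and the file list queried with curl:

```bash
# Upload a document (the multipart field name is "file")
curl -F "file=@sample_doc/example.pdf" http://localhost:8000/api/v1/files/upload

# List files and check processing status
curl http://localhost:8000/api/v1/files
```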
|
||||
|
||||
## Development
|
||||
|
||||
### Running Tests
|
||||
```bash
|
||||
pytest
|
||||
```
|
||||
|
||||
### Code Style
|
||||
The project uses Black for code formatting:
|
||||
```bash
|
||||
black .
|
||||
```
|
||||
|
||||
### Docker Commands
|
||||
|
||||
- Start services: `docker-compose up`
|
||||
- Start in background: `docker-compose up -d`
|
||||
- Stop services: `docker-compose down`
|
||||
- View logs: `docker-compose logs -f`
|
||||
- Rebuild: `docker-compose up --build`
|
||||
|
|
@@ -0,0 +1,166 @@
|
|||
from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, WebSocket, Response
|
||||
from fastapi.responses import FileResponse
|
||||
from sqlalchemy.orm import Session
|
||||
from typing import List
|
||||
import os
|
||||
from ...core.config import settings
|
||||
from ...core.database import get_db
|
||||
from ...models.file import File as FileModel, FileStatus
|
||||
from ...services.file_service import process_file, delete_file
|
||||
from ...schemas.file import FileResponse as FileResponseSchema, FileList
|
||||
import asyncio
|
||||
from fastapi import WebSocketDisconnect
|
||||
import uuid
|
||||
|
||||
router = APIRouter()
|
||||
|
||||
@router.post("/upload", response_model=FileResponseSchema)
|
||||
async def upload_file(
|
||||
file: UploadFile = File(...),
|
||||
db: Session = Depends(get_db)
|
||||
):
|
||||
if not file.filename:
|
||||
raise HTTPException(status_code=400, detail="No file provided")
|
||||
|
||||
if not any(file.filename.lower().endswith(ext) for ext in settings.ALLOWED_EXTENSIONS):
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"File type not allowed. Allowed types: {', '.join(settings.ALLOWED_EXTENSIONS)}"
|
||||
)
|
||||
|
||||
# Generate unique file ID
|
||||
file_id = str(uuid.uuid4())
|
||||
file_extension = os.path.splitext(file.filename)[1]
|
||||
unique_filename = f"{file_id}{file_extension}"
|
||||
|
||||
# Save file with unique name
|
||||
file_path = settings.UPLOAD_FOLDER / unique_filename
|
||||
with open(file_path, "wb") as buffer:
|
||||
content = await file.read()
|
||||
buffer.write(content)
|
||||
|
||||
# Create database entry
|
||||
db_file = FileModel(
|
||||
id=file_id,
|
||||
filename=file.filename,
|
||||
original_path=str(file_path),
|
||||
status=FileStatus.NOT_STARTED
|
||||
)
|
||||
db.add(db_file)
|
||||
db.commit()
|
||||
db.refresh(db_file)
|
||||
|
||||
# Start processing
|
||||
process_file.delay(str(db_file.id))
|
||||
|
||||
return db_file
|
||||
|
||||
@router.get("/files", response_model=List[FileResponseSchema])
|
||||
def list_files(
|
||||
skip: int = 0,
|
||||
limit: int = 100,
|
||||
db: Session = Depends(get_db)
|
||||
):
|
||||
files = db.query(FileModel).offset(skip).limit(limit).all()
|
||||
return files
|
||||
|
||||
@router.get("/files/{file_id}", response_model=FileResponseSchema)
|
||||
def get_file(
|
||||
file_id: str,
|
||||
db: Session = Depends(get_db)
|
||||
):
|
||||
file = db.query(FileModel).filter(FileModel.id == file_id).first()
|
||||
if not file:
|
||||
raise HTTPException(status_code=404, detail="File not found")
|
||||
return file
|
||||
|
||||
@router.get("/files/{file_id}/download")
|
||||
async def download_file(
|
||||
file_id: str,
|
||||
db: Session = Depends(get_db)
|
||||
):
|
||||
print(f"=== DOWNLOAD REQUEST ===")
|
||||
print(f"File ID: {file_id}")
|
||||
|
||||
file = db.query(FileModel).filter(FileModel.id == file_id).first()
|
||||
if not file:
|
||||
print(f"❌ File not found for ID: {file_id}")
|
||||
raise HTTPException(status_code=404, detail="File not found")
|
||||
|
||||
print(f"✅ File found: {file.filename}")
|
||||
print(f"File status: {file.status}")
|
||||
print(f"Original path: {file.original_path}")
|
||||
print(f"Processed path: {file.processed_path}")
|
||||
|
||||
if file.status != FileStatus.SUCCESS:
|
||||
print(f"❌ File not ready for download. Status: {file.status}")
|
||||
raise HTTPException(status_code=400, detail="File is not ready for download")
|
||||
|
||||
if not os.path.exists(file.processed_path):
|
||||
print(f"❌ Processed file not found at: {file.processed_path}")
|
||||
raise HTTPException(status_code=404, detail="Processed file not found")
|
||||
|
||||
print(f"✅ Processed file exists at: {file.processed_path}")
|
||||
|
||||
# Get the original filename without extension and add .md extension
|
||||
original_filename = file.filename
|
||||
filename_without_ext = os.path.splitext(original_filename)[0]
|
||||
download_filename = f"{filename_without_ext}.md"
|
||||
|
||||
print(f"Original filename: {original_filename}")
|
||||
print(f"Filename without extension: {filename_without_ext}")
|
||||
print(f"Download filename: {download_filename}")
|
||||
|
||||
|
||||
response = FileResponse(
|
||||
path=file.processed_path,
|
||||
filename=download_filename,
|
||||
media_type="text/markdown"
|
||||
)
|
||||
|
||||
print(f"Response headers: {dict(response.headers)}")
|
||||
print(f"=== END DOWNLOAD REQUEST ===")
|
||||
|
||||
return response
|
||||
|
||||
@router.websocket("/ws/status/{file_id}")
|
||||
async def websocket_endpoint(websocket: WebSocket, file_id: str, db: Session = Depends(get_db)):
|
||||
await websocket.accept()
|
||||
try:
|
||||
while True:
|
||||
file = db.query(FileModel).filter(FileModel.id == file_id).first()
|
||||
if not file:
|
||||
await websocket.send_json({"error": "File not found"})
|
||||
break
|
||||
|
||||
await websocket.send_json({
|
||||
"status": file.status,
|
||||
"error": file.error_message
|
||||
})
|
||||
|
||||
if file.status in [FileStatus.SUCCESS, FileStatus.FAILED]:
|
||||
break
|
||||
|
||||
await asyncio.sleep(1)
|
||||
except WebSocketDisconnect:
|
||||
pass
|
||||
|
||||
@router.delete("/files/{file_id}")
|
||||
async def delete_file_endpoint(
|
||||
file_id: str,
|
||||
db: Session = Depends(get_db)
|
||||
):
|
||||
"""
|
||||
Delete a file and its associated records.
|
||||
This will remove:
|
||||
1. The database record
|
||||
2. The original uploaded file
|
||||
3. The processed markdown file (if it exists)
|
||||
"""
|
||||
try:
|
||||
delete_file(file_id)
|
||||
return {"message": "File deleted successfully"}
|
||||
except HTTPException as e:
|
||||
raise e
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=str(e))
|
||||
|
|
@@ -0,0 +1,65 @@
|
|||
from pydantic_settings import BaseSettings
|
||||
from typing import Optional
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
class Settings(BaseSettings):
|
||||
# API Settings
|
||||
API_V1_STR: str = "/api/v1"
|
||||
PROJECT_NAME: str = "Legal Document Masker API"
|
||||
|
||||
# Security
|
||||
SECRET_KEY: str = "your-secret-key-here" # Change in production
|
||||
ACCESS_TOKEN_EXPIRE_MINUTES: int = 60 * 24 * 8 # 8 days
|
||||
|
||||
# Database
|
||||
BASE_DIR: Path = Path(__file__).parent.parent.parent
|
||||
DATABASE_URL: str = f"sqlite:///{BASE_DIR}/storage/legal_doc_masker.db"
|
||||
|
||||
# File Storage
|
||||
UPLOAD_FOLDER: Path = BASE_DIR / "storage" / "uploads"
|
||||
PROCESSED_FOLDER: Path = BASE_DIR / "storage" / "processed"
|
||||
MAX_FILE_SIZE: int = 50 * 1024 * 1024 # 50MB
|
||||
ALLOWED_EXTENSIONS: set = {"pdf", "docx", "doc", "md"}
|
||||
|
||||
# Celery
|
||||
CELERY_BROKER_URL: str = "redis://redis:6379/0"
|
||||
CELERY_RESULT_BACKEND: str = "redis://redis:6379/0"
|
||||
|
||||
# Ollama API settings
|
||||
OLLAMA_API_URL: str = "https://api.ollama.com"
|
||||
OLLAMA_API_KEY: str = ""
|
||||
OLLAMA_MODEL: str = "llama2"
|
||||
|
||||
# Mineru API settings
|
||||
MINERU_API_URL: str = "http://mineru-api:8000"
|
||||
# MINERU_API_URL: str = "http://host.docker.internal:8001"
|
||||
|
||||
MINERU_TIMEOUT: int = 300 # 5 minutes timeout
|
||||
MINERU_LANG_LIST: list = ["ch"] # Language list for parsing
|
||||
MINERU_BACKEND: str = "pipeline" # Backend to use
|
||||
MINERU_PARSE_METHOD: str = "auto" # Parse method
|
||||
MINERU_FORMULA_ENABLE: bool = True # Enable formula parsing
|
||||
MINERU_TABLE_ENABLE: bool = True # Enable table parsing
|
||||
|
||||
# Logging settings
|
||||
LOG_LEVEL: str = "INFO"
|
||||
LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
|
||||
LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
|
||||
LOG_FILE: str = "app.log"
|
||||
|
||||
class Config:
|
||||
case_sensitive = True
|
||||
env_file = ".env"
|
||||
env_file_encoding = "utf-8"
|
||||
extra = "allow"
|
||||
|
||||
def __init__(self, **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
# Create storage directories if they don't exist
|
||||
self.UPLOAD_FOLDER.mkdir(parents=True, exist_ok=True)
|
||||
self.PROCESSED_FOLDER.mkdir(parents=True, exist_ok=True)
|
||||
# Create storage directory for database
|
||||
(self.BASE_DIR / "storage").mkdir(parents=True, exist_ok=True)
|
||||
|
||||
settings = Settings()
|
||||
|
|
@@ -1,5 +1,6 @@
|
|||
import logging.config
|
||||
from config.settings import settings
|
||||
# from config.settings import settings
|
||||
from .settings import settings
|
||||
|
||||
LOGGING_CONFIG = {
|
||||
"version": 1,
|
||||
|
|
@@ -0,0 +1,21 @@
|
|||
from sqlalchemy import create_engine
|
||||
from sqlalchemy.ext.declarative import declarative_base
|
||||
from sqlalchemy.orm import sessionmaker
|
||||
from .config import settings
|
||||
|
||||
# Create SQLite engine with check_same_thread=False for FastAPI
|
||||
engine = create_engine(
|
||||
settings.DATABASE_URL,
|
||||
connect_args={"check_same_thread": False}
|
||||
)
|
||||
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
|
||||
|
||||
Base = declarative_base()
|
||||
|
||||
# Dependency
|
||||
def get_db():
|
||||
db = SessionLocal()
|
||||
try:
|
||||
yield db
|
||||
finally:
|
||||
db.close()
|
||||
|
|
@@ -1,10 +1,11 @@
|
|||
import os
|
||||
from typing import Optional
|
||||
from models.document_processor import DocumentProcessor
|
||||
from models.processors import (
|
||||
from .document_processor import DocumentProcessor
|
||||
from .processors import (
|
||||
TxtDocumentProcessor,
|
||||
DocxDocumentProcessor,
|
||||
PdfDocumentProcessor
|
||||
# DocxDocumentProcessor,
|
||||
PdfDocumentProcessor,
|
||||
MarkdownDocumentProcessor
|
||||
)
|
||||
|
||||
class DocumentProcessorFactory:
|
||||
|
|
@@ -14,9 +15,11 @@ class DocumentProcessorFactory:
|
|||
|
||||
processors = {
|
||||
'.txt': TxtDocumentProcessor,
|
||||
'.docx': DocxDocumentProcessor,
|
||||
'.doc': DocxDocumentProcessor,
|
||||
'.pdf': PdfDocumentProcessor
|
||||
# '.docx': DocxDocumentProcessor,
|
||||
# '.doc': DocxDocumentProcessor,
|
||||
'.pdf': PdfDocumentProcessor,
|
||||
'.md': MarkdownDocumentProcessor,
|
||||
'.markdown': MarkdownDocumentProcessor
|
||||
}
|
||||
|
||||
processor_class = processors.get(file_extension)
|
||||
|
|
@@ -0,0 +1,71 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import Any, Dict
|
||||
import logging
|
||||
from .ner_processor import NerProcessor
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class DocumentProcessor(ABC):
|
||||
|
||||
def __init__(self):
|
||||
self.max_chunk_size = 1000 # Maximum number of characters per chunk
|
||||
self.ner_processor = NerProcessor()
|
||||
|
||||
@abstractmethod
|
||||
def read_content(self) -> str:
|
||||
"""Read document content"""
|
||||
pass
|
||||
|
||||
def _split_into_chunks(self, sentences: list[str]) -> list[str]:
|
||||
"""Split sentences into chunks that don't exceed max_chunk_size"""
|
||||
chunks = []
|
||||
current_chunk = ""
|
||||
|
||||
for sentence in sentences:
|
||||
if not sentence.strip():
|
||||
continue
|
||||
|
||||
if len(current_chunk) + len(sentence) > self.max_chunk_size and current_chunk:
|
||||
chunks.append(current_chunk)
|
||||
current_chunk = sentence
|
||||
else:
|
||||
if current_chunk:
|
||||
current_chunk += "。" + sentence
|
||||
else:
|
||||
current_chunk = sentence
|
||||
|
||||
if current_chunk:
|
||||
chunks.append(current_chunk)
|
||||
logger.info(f"Split content into {len(chunks)} chunks")
|
||||
|
||||
return chunks
|
||||
|
||||
def _apply_mapping(self, text: str, mapping: Dict[str, str]) -> str:
|
||||
"""Apply the mapping to replace sensitive information"""
|
||||
masked_text = text
|
||||
for original, masked in mapping.items():
|
||||
if isinstance(masked, dict):
|
||||
masked = next(iter(masked.values()), "某")
|
||||
elif not isinstance(masked, str):
|
||||
masked = str(masked) if masked is not None else "某"
|
||||
masked_text = masked_text.replace(original, masked)
|
||||
return masked_text
|
||||
|
||||
def process_content(self, content: str) -> str:
|
||||
"""Process document content by masking sensitive information"""
|
||||
sentences = content.split("。")
|
||||
|
||||
chunks = self._split_into_chunks(sentences)
|
||||
logger.info(f"Split content into {len(chunks)} chunks")
|
||||
|
||||
final_mapping = self.ner_processor.process(chunks)
|
||||
|
||||
masked_content = self._apply_mapping(content, final_mapping)
|
||||
logger.info("Successfully masked content")
|
||||
|
||||
return masked_content
|
||||
|
||||
@abstractmethod
|
||||
def save_content(self, content: str) -> None:
|
||||
"""Save processed content"""
|
||||
pass
|
||||
|
|
@@ -0,0 +1,305 @@
|
|||
from typing import Any, Dict
|
||||
from ..prompts.masking_prompts import get_ner_name_prompt, get_ner_company_prompt, get_ner_address_prompt, get_ner_project_prompt, get_ner_case_number_prompt, get_entity_linkage_prompt
|
||||
import logging
|
||||
import json
|
||||
from ..services.ollama_client import OllamaClient
|
||||
from ...core.config import settings
|
||||
from ..utils.json_extractor import LLMJsonExtractor
|
||||
from ..utils.llm_validator import LLMResponseValidator
|
||||
import re
|
||||
from .regs.entity_regex import extract_id_number_entities, extract_social_credit_code_entities
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class NerProcessor:
|
||||
def __init__(self):
|
||||
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
|
||||
self.max_retries = 3
|
||||
|
||||
def _validate_mapping_format(self, mapping: Dict[str, Any]) -> bool:
|
||||
return LLMResponseValidator.validate_entity_extraction(mapping)
|
||||
|
||||
def _process_entity_type(self, chunk: str, prompt_func, entity_type: str) -> Dict[str, str]:
|
||||
for attempt in range(self.max_retries):
|
||||
try:
|
||||
formatted_prompt = prompt_func(chunk)
|
||||
logger.info(f"Calling ollama to generate {entity_type} mapping for chunk (attempt {attempt + 1}/{self.max_retries}): {formatted_prompt}")
|
||||
response = self.ollama_client.generate(formatted_prompt)
|
||||
logger.info(f"Raw response from LLM: {response}")
|
||||
|
||||
mapping = LLMJsonExtractor.parse_raw_json_str(response)
|
||||
logger.info(f"Parsed mapping: {mapping}")
|
||||
|
||||
if mapping and self._validate_mapping_format(mapping):
|
||||
return mapping
|
||||
else:
|
||||
logger.warning(f"Invalid mapping format received on attempt {attempt + 1}, retrying...")
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating {entity_type} mapping on attempt {attempt + 1}: {e}")
|
||||
if attempt < self.max_retries - 1:
|
||||
logger.info("Retrying...")
|
||||
else:
|
||||
logger.error(f"Max retries reached for {entity_type}, returning empty mapping")
|
||||
|
||||
return {}
|
||||
|
||||
def build_mapping(self, chunk: str) -> list[Dict[str, str]]:
|
||||
mapping_pipeline = []
|
||||
|
||||
entity_configs = [
|
||||
(get_ner_name_prompt, "people names"),
|
||||
(get_ner_company_prompt, "company names"),
|
||||
(get_ner_address_prompt, "addresses"),
|
||||
(get_ner_project_prompt, "project names"),
|
||||
(get_ner_case_number_prompt, "case numbers")
|
||||
]
|
||||
for prompt_func, entity_type in entity_configs:
|
||||
mapping = self._process_entity_type(chunk, prompt_func, entity_type)
|
||||
if mapping:
|
||||
mapping_pipeline.append(mapping)
|
||||
|
||||
regex_entity_extractors = [
|
||||
extract_id_number_entities,
|
||||
extract_social_credit_code_entities
|
||||
]
|
||||
for extractor in regex_entity_extractors:
|
||||
mapping = extractor(chunk)
|
||||
if mapping and LLMResponseValidator.validate_regex_entity(mapping):
|
||||
mapping_pipeline.append(mapping)
|
||||
elif mapping:
|
||||
logger.warning(f"Invalid regex entity mapping format: {mapping}")
|
||||
|
||||
return mapping_pipeline
|
||||
|
||||
def _merge_entity_mappings(self, chunk_mappings: list[Dict[str, Any]]) -> list[Dict[str, str]]:
|
||||
all_entities = []
|
||||
for mapping in chunk_mappings:
|
||||
if isinstance(mapping, dict) and 'entities' in mapping:
|
||||
entities = mapping['entities']
|
||||
if isinstance(entities, list):
|
||||
all_entities.extend(entities)
|
||||
|
||||
unique_entities = []
|
||||
seen_texts = set()
|
||||
|
||||
for entity in all_entities:
|
||||
if isinstance(entity, dict) and 'text' in entity:
|
||||
text = entity['text'].strip()
|
||||
if text and text not in seen_texts:
|
||||
seen_texts.add(text)
|
||||
unique_entities.append(entity)
|
||||
elif text and text in seen_texts:
|
||||
# For now, just log entities whose text duplicates one already seen (possible conflict)
|
||||
logger.info(f"Duplicate entity found: {entity}")
|
||||
continue
|
||||
|
||||
logger.info(f"Merged {len(unique_entities)} unique entities")
|
||||
return unique_entities
|
||||
|
||||
def _generate_masked_mapping(self, unique_entities: list[Dict[str, str]], linkage: Dict[str, Any]) -> Dict[str, str]:
|
||||
"""
|
||||
结合 linkage 信息,按实体分组映射同一脱敏名,并实现如下规则:
|
||||
1. 人名/简称:保留姓,名变为某,同姓编号;
|
||||
2. 公司名:同组公司名映射为大写字母公司(A公司、B公司...);
|
||||
3. 英文人名:每个单词首字母+***;
|
||||
4. 英文公司名:替换为所属行业名称,英文大写(如无行业信息,默认 COMPANY);
|
||||
5. 项目名:项目名称变为小写英文字母(如 a项目、b项目...);
|
||||
6. 案号:只替换案号中的数字部分为***,保留前后结构和“号”字,支持中间有空格;
|
||||
7. 身份证号:6位X;
|
||||
8. 社会信用代码:8位X;
|
||||
9. 地址:保留区级及以上行政区划,去除详细位置;
|
||||
10. 其他类型按原有逻辑。
|
||||
"""
|
||||
import re
|
||||
entity_mapping = {}
|
||||
used_masked_names = set()
|
||||
group_mask_map = {}
|
||||
surname_counter = {}
|
||||
company_letter = ord('A')
|
||||
project_letter = ord('a')
|
||||
# Prefer district/county-level units first, then city, province, etc.
|
||||
admin_keywords = [
|
||||
'市辖区', '自治县', '自治旗', '林区', '区', '县', '旗', '州', '盟', '地区', '自治州',
|
||||
'市', '省', '自治区', '特别行政区'
|
||||
]
|
||||
admin_pattern = r"^(.*?(?:" + '|'.join(admin_keywords) + r"))"
|
||||
for group in linkage.get('entity_groups', []):
|
||||
group_type = group.get('group_type', '')
|
||||
entities = group.get('entities', [])
|
||||
if '公司' in group_type or 'Company' in group_type:
|
||||
masked = chr(company_letter) + '公司'
|
||||
company_letter += 1
|
||||
for entity in entities:
|
||||
group_mask_map[entity['text']] = masked
|
||||
elif '人名' in group_type:
|
||||
surname_local_counter = {}
|
||||
for entity in entities:
|
||||
name = entity['text']
|
||||
if not name:
|
||||
continue
|
||||
surname = name[0]
|
||||
surname_local_counter.setdefault(surname, 0)
|
||||
surname_local_counter[surname] += 1
|
||||
if surname_local_counter[surname] == 1:
|
||||
masked = f"{surname}某"
|
||||
else:
|
||||
masked = f"{surname}某{surname_local_counter[surname]}"
|
||||
group_mask_map[name] = masked
|
||||
elif '英文人名' in group_type:
|
||||
for entity in entities:
|
||||
name = entity['text']
|
||||
if not name:
|
||||
continue
|
||||
masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
|
||||
group_mask_map[name] = masked
|
||||
for entity in unique_entities:
|
||||
text = entity['text']
|
||||
entity_type = entity.get('type', '')
|
||||
if text in group_mask_map:
|
||||
entity_mapping[text] = group_mask_map[text]
|
||||
used_masked_names.add(group_mask_map[text])
|
||||
elif '英文公司名' in entity_type or 'English Company' in entity_type:
|
||||
industry = entity.get('industry', 'COMPANY')
|
||||
masked = industry.upper()
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
elif '项目名' in entity_type:
|
||||
masked = chr(project_letter) + '项目'
|
||||
project_letter += 1
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
elif '案号' in entity_type:
|
||||
masked = re.sub(r'(\d[\d\s]*)(号)', r'***\2', text)
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
elif '身份证号' in entity_type:
|
||||
masked = 'X' * 6
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
elif '社会信用代码' in entity_type:
|
||||
masked = 'X' * 8
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
elif '地址' in entity_type:
|
||||
# Keep the district-level (or higher) administrative division and drop the detailed location
|
||||
match = re.match(admin_pattern, text)
|
||||
if match:
|
||||
masked = match.group(1)
|
||||
else:
|
||||
masked = text # fallback
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
elif '人名' in entity_type:
|
||||
name = text
|
||||
if not name:
|
||||
masked = '某'
|
||||
else:
|
||||
surname = name[0]
|
||||
surname_counter.setdefault(surname, 0)
|
||||
surname_counter[surname] += 1
|
||||
if surname_counter[surname] == 1:
|
||||
masked = f"{surname}某"
|
||||
else:
|
||||
masked = f"{surname}某{surname_counter[surname]}"
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
elif '公司' in entity_type or 'Company' in entity_type:
|
||||
masked = chr(company_letter) + '公司'
|
||||
company_letter += 1
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
elif '英文人名' in entity_type:
|
||||
name = text
|
||||
masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
else:
|
||||
base_name = '某'
|
||||
masked = base_name
|
||||
counter = 1
|
||||
while masked in used_masked_names:
|
||||
if counter <= 10:
|
||||
suffixes = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']
|
||||
masked = base_name + suffixes[counter - 1]
|
||||
else:
|
||||
masked = f"{base_name}{counter}"
|
||||
counter += 1
|
||||
entity_mapping[text] = masked
|
||||
used_masked_names.add(masked)
|
||||
return entity_mapping
|
||||
|
||||
def _validate_linkage_format(self, linkage: Dict[str, Any]) -> bool:
|
||||
return LLMResponseValidator.validate_entity_linkage(linkage)
|
||||
|
||||
def _create_entity_linkage(self, unique_entities: list[Dict[str, str]]) -> Dict[str, Any]:
|
||||
linkable_entities = []
|
||||
for entity in unique_entities:
|
||||
entity_type = entity.get('type', '')
|
||||
if any(keyword in entity_type for keyword in ['公司', 'Company', '人名', '英文人名']):
|
||||
linkable_entities.append(entity)
|
||||
|
||||
if not linkable_entities:
|
||||
logger.info("No linkable entities found")
|
||||
return {"entity_groups": []}
|
||||
|
||||
entities_text = "\n".join([
|
||||
f"- {entity['text']} (类型: {entity['type']})"
|
||||
for entity in linkable_entities
|
||||
])
|
||||
|
||||
for attempt in range(self.max_retries):
|
||||
try:
|
||||
formatted_prompt = get_entity_linkage_prompt(entities_text)
|
||||
logger.info(f"Calling ollama to generate entity linkage (attempt {attempt + 1}/{self.max_retries})")
|
||||
response = self.ollama_client.generate(formatted_prompt)
|
||||
logger.info(f"Raw entity linkage response from LLM: {response}")
|
||||
|
||||
linkage = LLMJsonExtractor.parse_raw_json_str(response)
|
||||
logger.info(f"Parsed entity linkage: {linkage}")
|
||||
|
||||
if linkage and self._validate_linkage_format(linkage):
|
||||
logger.info(f"Successfully created entity linkage with {len(linkage.get('entity_groups', []))} groups")
|
||||
return linkage
|
||||
else:
|
||||
logger.warning(f"Invalid entity linkage format received on attempt {attempt + 1}, retrying...")
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating entity linkage on attempt {attempt + 1}: {e}")
|
||||
if attempt < self.max_retries - 1:
|
||||
logger.info("Retrying...")
|
||||
else:
|
||||
logger.error("Max retries reached for entity linkage, returning empty linkage")
|
||||
|
||||
return {"entity_groups": []}
|
||||
|
||||
def _apply_entity_linkage_to_mapping(self, entity_mapping: Dict[str, str], entity_linkage: Dict[str, Any]) -> Dict[str, str]:
|
||||
"""
|
||||
Linkage is already handled in _generate_masked_mapping, so the entity_mapping is returned unchanged here.
|
||||
"""
|
||||
return entity_mapping
|
||||
|
||||
def process(self, chunks: list[str]) -> Dict[str, str]:
|
||||
chunk_mappings = []
|
||||
for i, chunk in enumerate(chunks):
|
||||
logger.info(f"Processing chunk {i+1}/{len(chunks)}")
|
||||
chunk_mapping = self.build_mapping(chunk)
|
||||
logger.info(f"Chunk mapping: {chunk_mapping}")
|
||||
chunk_mappings.extend(chunk_mapping)
|
||||
|
||||
logger.info(f"Final chunk mappings: {chunk_mappings}")
|
||||
|
||||
unique_entities = self._merge_entity_mappings(chunk_mappings)
|
||||
logger.info(f"Unique entities: {unique_entities}")
|
||||
|
||||
entity_linkage = self._create_entity_linkage(unique_entities)
|
||||
logger.info(f"Entity linkage: {entity_linkage}")
|
||||
|
||||
# for quick test
|
||||
# unique_entities = [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
|
||||
# entity_linkage = {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
|
||||
combined_mapping = self._generate_masked_mapping(unique_entities, entity_linkage)
|
||||
logger.info(f"Combined mapping: {combined_mapping}")
|
||||
|
||||
final_mapping = self._apply_entity_linkage_to_mapping(combined_mapping, entity_linkage)
|
||||
logger.info(f"Final mapping: {final_mapping}")
|
||||
|
||||
return final_mapping
|
||||
|
|
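For reference, a minimal sketch of the mapping shape that process() is expected to produce; the entities below are hypothetical, and the exact letters and numbers depend on the linkage groups and on the order in which entities are encountered.

```python
# Hypothetical illustration of the final masked mapping returned by process():
# keys are original entity strings, values are their masked replacements.
example_mapping = {
    "张三": "张某",                        # person name: keep surname, given name becomes 某
    "李四": "李某",
    "示例科技有限公司": "A公司",             # companies linked as full name + abbreviation
    "示例科技": "A公司",                    # share one capital-letter mask
    "Brandon Smith": "B*** S***",          # English person name: initial + ***
    "(2022)京03民终1234号": "(2022)京03民终***号",  # case number: digits before 号 masked
}
```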
@@ -0,0 +1,7 @@
|
|||
from .txt_processor import TxtDocumentProcessor
|
||||
# from .docx_processor import DocxDocumentProcessor
|
||||
from .pdf_processor import PdfDocumentProcessor
|
||||
from .md_processor import MarkdownDocumentProcessor
|
||||
|
||||
# __all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
|
||||
__all__ = ['TxtDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
|
||||
|
|
@@ -0,0 +1,77 @@
|
|||
import os
|
||||
import docx
|
||||
from ...document_handlers.document_processor import DocumentProcessor
|
||||
from magic_pdf.data.data_reader_writer import FileBasedDataWriter
|
||||
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
|
||||
from magic_pdf.data.read_api import read_local_office
|
||||
import logging
|
||||
from ...services.ollama_client import OllamaClient
|
||||
from ...config import settings
|
||||
from ...prompts.masking_prompts import get_masking_mapping_prompt
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class DocxDocumentProcessor(DocumentProcessor):
|
||||
def __init__(self, input_path: str, output_path: str):
|
||||
super().__init__() # Call parent class's __init__
|
||||
self.input_path = input_path
|
||||
self.output_path = output_path
|
||||
self.output_dir = os.path.dirname(output_path)
|
||||
self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]
|
||||
|
||||
# Setup output directories
|
||||
self.local_image_dir = os.path.join(self.output_dir, "images")
|
||||
self.image_dir = os.path.basename(self.local_image_dir)
|
||||
os.makedirs(self.local_image_dir, exist_ok=True)
|
||||
|
||||
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
|
||||
|
||||
def read_content(self) -> str:
|
||||
try:
|
||||
# Initialize writers
|
||||
image_writer = FileBasedDataWriter(self.local_image_dir)
|
||||
md_writer = FileBasedDataWriter(self.output_dir)
|
||||
|
||||
# Create Dataset Instance and process
|
||||
ds = read_local_office(self.input_path)[0]
|
||||
pipe_result = ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer)
|
||||
|
||||
# Generate markdown
|
||||
md_content = pipe_result.get_markdown(self.image_dir)
|
||||
pipe_result.dump_md(md_writer, f"{self.name_without_suff}.md", self.image_dir)
|
||||
|
||||
return md_content
|
||||
except Exception as e:
|
||||
logger.error(f"Error converting DOCX to MD: {e}")
|
||||
raise
|
||||
|
||||
# def process_content(self, content: str) -> str:
|
||||
# logger.info("Processing DOCX content")
|
||||
|
||||
# # Split content into sentences and apply masking
|
||||
# sentences = content.split("。")
|
||||
# final_md = ""
|
||||
# for sentence in sentences:
|
||||
# if sentence.strip(): # Only process non-empty sentences
|
||||
# formatted_prompt = get_masking_mapping_prompt(sentence)
|
||||
# logger.info("Calling ollama to generate response, prompt: %s", formatted_prompt)
|
||||
# response = self.ollama_client.generate(formatted_prompt)
|
||||
# logger.info(f"Response generated: {response}")
|
||||
# final_md += response + "。"
|
||||
|
||||
# return final_md
|
||||
|
||||
def save_content(self, content: str) -> None:
|
||||
# Ensure output path has .md extension
|
||||
output_dir = os.path.dirname(self.output_path)
|
||||
base_name = os.path.splitext(os.path.basename(self.output_path))[0]
|
||||
md_output_path = os.path.join(output_dir, f"{base_name}.md")
|
||||
|
||||
logger.info(f"Saving masked content to: {md_output_path}")
|
||||
try:
|
||||
with open(md_output_path, 'w', encoding='utf-8') as file:
|
||||
file.write(content)
|
||||
logger.info(f"Successfully saved content to {md_output_path}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error saving content: {e}")
|
||||
raise
|
||||
|
|
@@ -0,0 +1,39 @@
|
|||
import os
|
||||
from ...document_handlers.document_processor import DocumentProcessor
|
||||
from ...services.ollama_client import OllamaClient
|
||||
import logging
|
||||
from ...config import settings
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class MarkdownDocumentProcessor(DocumentProcessor):
|
||||
def __init__(self, input_path: str, output_path: str):
|
||||
super().__init__() # Call parent class's __init__
|
||||
self.input_path = input_path
|
||||
self.output_path = output_path
|
||||
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
|
||||
|
||||
def read_content(self) -> str:
|
||||
"""Read markdown content from file"""
|
||||
try:
|
||||
with open(self.input_path, 'r', encoding='utf-8') as file:
|
||||
content = file.read()
|
||||
logger.info(f"Successfully read markdown content from {self.input_path}")
|
||||
return content
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading markdown file {self.input_path}: {e}")
|
||||
raise
|
||||
|
||||
def save_content(self, content: str) -> None:
|
||||
"""Save processed markdown content"""
|
||||
try:
|
||||
# Ensure output directory exists
|
||||
output_dir = os.path.dirname(self.output_path)
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
|
||||
with open(self.output_path, 'w', encoding='utf-8') as file:
|
||||
file.write(content)
|
||||
logger.info(f"Successfully saved masked content to {self.output_path}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error saving content to {self.output_path}: {e}")
|
||||
raise
|
||||
|
|
@@ -0,0 +1,204 @@
|
|||
import os
|
||||
import requests
|
||||
import logging
|
||||
from typing import Dict, Any, Optional
|
||||
from ...document_handlers.document_processor import DocumentProcessor
|
||||
from ...services.ollama_client import OllamaClient
|
||||
from ...config import settings
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class PdfDocumentProcessor(DocumentProcessor):
|
||||
def __init__(self, input_path: str, output_path: str):
|
||||
super().__init__() # Call parent class's __init__
|
||||
self.input_path = input_path
|
||||
self.output_path = output_path
|
||||
self.output_dir = os.path.dirname(output_path)
|
||||
self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]
|
||||
|
||||
# Setup work directory for temporary files
|
||||
self.work_dir = os.path.join(
|
||||
os.path.dirname(output_path),
|
||||
".work",
|
||||
os.path.splitext(os.path.basename(input_path))[0]
|
||||
)
|
||||
os.makedirs(self.work_dir, exist_ok=True)
|
||||
|
||||
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
|
||||
|
||||
# Mineru API configuration
|
||||
self.mineru_base_url = getattr(settings, 'MINERU_API_URL', 'http://mineru-api:8000')
|
||||
self.mineru_timeout = getattr(settings, 'MINERU_TIMEOUT', 300) # 5 minutes timeout
|
||||
self.mineru_lang_list = getattr(settings, 'MINERU_LANG_LIST', ['ch'])
|
||||
self.mineru_backend = getattr(settings, 'MINERU_BACKEND', 'pipeline')
|
||||
self.mineru_parse_method = getattr(settings, 'MINERU_PARSE_METHOD', 'auto')
|
||||
self.mineru_formula_enable = getattr(settings, 'MINERU_FORMULA_ENABLE', True)
|
||||
self.mineru_table_enable = getattr(settings, 'MINERU_TABLE_ENABLE', True)
|
||||
|
||||
def _call_mineru_api(self, file_path: str) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
Call Mineru API to convert PDF to markdown
|
||||
|
||||
Args:
|
||||
file_path: Path to the PDF file
|
||||
|
||||
Returns:
|
||||
API response as dictionary or None if failed
|
||||
"""
|
||||
try:
|
||||
url = f"{self.mineru_base_url}/file_parse"
|
||||
|
||||
with open(file_path, 'rb') as file:
|
||||
files = {'files': (os.path.basename(file_path), file, 'application/pdf')}
|
||||
|
||||
# Prepare form data according to Mineru API specification
|
||||
data = {
|
||||
'output_dir': './output',
|
||||
'lang_list': self.mineru_lang_list,
|
||||
'backend': self.mineru_backend,
|
||||
'parse_method': self.mineru_parse_method,
|
||||
'formula_enable': self.mineru_formula_enable,
|
||||
'table_enable': self.mineru_table_enable,
|
||||
'return_md': True,
|
||||
'return_middle_json': False,
|
||||
'return_model_output': False,
|
||||
'return_content_list': False,
|
||||
'return_images': False,
|
||||
'start_page_id': 0,
|
||||
'end_page_id': 99999
|
||||
}
|
||||
|
||||
logger.info(f"Calling Mineru API at {url}")
|
||||
response = requests.post(
|
||||
url,
|
||||
files=files,
|
||||
data=data,
|
||||
timeout=self.mineru_timeout
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
result = response.json()
|
||||
logger.info("Successfully received response from Mineru API")
|
||||
return result
|
||||
else:
|
||||
logger.error(f"Mineru API returned status code {response.status_code}: {response.text}")
|
||||
return None
|
||||
|
||||
except requests.exceptions.Timeout:
|
||||
logger.error(f"Mineru API request timed out after {self.mineru_timeout} seconds")
|
||||
return None
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error calling Mineru API: {str(e)}")
|
||||
return None
|
||||
except Exception as e:
|
||||
logger.error(f"Unexpected error calling Mineru API: {str(e)}")
|
||||
return None
|
||||
|
||||
def _extract_markdown_from_response(self, response: Dict[str, Any]) -> str:
|
||||
"""
|
||||
Extract markdown content from Mineru API response
|
||||
|
||||
Args:
|
||||
response: Mineru API response dictionary
|
||||
|
||||
Returns:
|
||||
Extracted markdown content as string
|
||||
"""
|
||||
try:
|
||||
logger.debug(f"Mineru API response structure: {response}")
|
||||
|
||||
# Try different possible response formats based on Mineru API
|
||||
if 'markdown' in response:
|
||||
return response['markdown']
|
||||
elif 'md' in response:
|
||||
return response['md']
|
||||
elif 'content' in response:
|
||||
return response['content']
|
||||
elif 'text' in response:
|
||||
return response['text']
|
||||
elif 'result' in response and isinstance(response['result'], dict):
|
||||
result = response['result']
|
||||
if 'markdown' in result:
|
||||
return result['markdown']
|
||||
elif 'md' in result:
|
||||
return result['md']
|
||||
elif 'content' in result:
|
||||
return result['content']
|
||||
elif 'text' in result:
|
||||
return result['text']
|
||||
elif 'data' in response and isinstance(response['data'], dict):
|
||||
data = response['data']
|
||||
if 'markdown' in data:
|
||||
return data['markdown']
|
||||
elif 'md' in data:
|
||||
return data['md']
|
||||
elif 'content' in data:
|
||||
return data['content']
|
||||
elif 'text' in data:
|
||||
return data['text']
|
||||
elif isinstance(response, list) and len(response) > 0:
|
||||
# If response is a list, try to extract from first item
|
||||
first_item = response[0]
|
||||
if isinstance(first_item, dict):
|
||||
return self._extract_markdown_from_response(first_item)
|
||||
elif isinstance(first_item, str):
|
||||
return first_item
|
||||
else:
|
||||
# If no standard format found, try to extract from the response structure
|
||||
logger.warning("Could not find standard markdown field in Mineru response")
|
||||
|
||||
# Return the response as string if it's simple, or empty string
|
||||
if isinstance(response, str):
|
||||
return response
|
||||
elif isinstance(response, dict):
|
||||
# Try to find any text-like content
|
||||
for key, value in response.items():
|
||||
if isinstance(value, str) and len(value) > 100: # Likely content
|
||||
return value
|
||||
elif isinstance(value, dict):
|
||||
# Recursively search in nested dictionaries
|
||||
nested_content = self._extract_markdown_from_response(value)
|
||||
if nested_content:
|
||||
return nested_content
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error extracting markdown from Mineru response: {str(e)}")
|
||||
return ""
|
||||
|
||||
def read_content(self) -> str:
|
||||
logger.info("Starting PDF content processing with Mineru API")
|
||||
|
||||
# Call Mineru API to convert PDF to markdown
|
||||
mineru_response = self._call_mineru_api(self.input_path)
|
||||
|
||||
if not mineru_response:
|
||||
raise Exception("Failed to get response from Mineru API")
|
||||
|
||||
# Extract markdown content from the response
|
||||
markdown_content = self._extract_markdown_from_response(mineru_response)
|
||||
|
||||
if not markdown_content:
|
||||
raise Exception("No markdown content found in Mineru API response")
|
||||
|
||||
logger.info(f"Successfully extracted {len(markdown_content)} characters of markdown content")
|
||||
|
||||
# Save the raw markdown content to work directory for reference
|
||||
md_output_path = os.path.join(self.work_dir, f"{self.name_without_suff}.md")
|
||||
with open(md_output_path, 'w', encoding='utf-8') as file:
|
||||
file.write(markdown_content)
|
||||
|
||||
logger.info(f"Saved raw markdown content to {md_output_path}")
|
||||
|
||||
return markdown_content
|
||||
|
||||
def save_content(self, content: str) -> None:
|
||||
# Ensure output path has .md extension
|
||||
output_dir = os.path.dirname(self.output_path)
|
||||
base_name = os.path.splitext(os.path.basename(self.output_path))[0]
|
||||
md_output_path = os.path.join(output_dir, f"{base_name}.md")
|
||||
|
||||
logger.info(f"Saving masked content to: {md_output_path}")
|
||||
with open(md_output_path, 'w', encoding='utf-8') as file:
|
||||
file.write(content)
|
||||
|
|
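A minimal usage sketch for the processor above, assuming the Mineru service is reachable at the configured MINERU_API_URL; the paths are hypothetical.

```python
# Hypothetical usage of PdfDocumentProcessor: convert a PDF via the Mineru API
# and write the resulting markdown next to the requested output path.
processor = PdfDocumentProcessor(
    input_path="/data/doc_src/sample.pdf",   # hypothetical input path
    output_path="/data/doc_dest/sample.md",  # hypothetical output path
)
markdown = processor.read_content()   # POSTs the file to the /file_parse endpoint
processor.save_content(markdown)      # writes <base_name>.md under the output directory
```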
@@ -0,0 +1,28 @@
|
|||
from ...document_handlers.document_processor import DocumentProcessor
|
||||
from ...services.ollama_client import OllamaClient
|
||||
import logging
|
||||
# from ...prompts.masking_prompts import get_masking_prompt
|
||||
from ...config import settings
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
class TxtDocumentProcessor(DocumentProcessor):
|
||||
def __init__(self, input_path: str, output_path: str):
|
||||
super().__init__()
|
||||
self.input_path = input_path
|
||||
self.output_path = output_path
|
||||
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
|
||||
|
||||
def read_content(self) -> str:
|
||||
with open(self.input_path, 'r', encoding='utf-8') as file:
|
||||
return file.read()
|
||||
|
||||
# def process_content(self, content: str) -> str:
|
||||
|
||||
# formatted_prompt = get_masking_prompt(content)
|
||||
# response = self.ollama_client.generate(formatted_prompt)
|
||||
# logger.debug(f"Processed content: {response}")
|
||||
# return response
|
||||
|
||||
def save_content(self, content: str) -> None:
|
||||
with open(self.output_path, 'w', encoding='utf-8') as file:
|
||||
file.write(content)
|
||||
|
|
@@ -0,0 +1,18 @@
|
|||
import re
|
||||
|
||||
def extract_id_number_entities(chunk: str) -> dict:
|
||||
"""Extract Chinese ID numbers and return in entity mapping format."""
|
||||
id_pattern = r'\b\d{17}[\dXx]\b'
|
||||
entities = []
|
||||
for match in re.findall(id_pattern, chunk):
|
||||
entities.append({"text": match, "type": "身份证号"})
|
||||
return {"entities": entities} if entities else {}
|
||||
|
||||
|
||||
def extract_social_credit_code_entities(chunk: str) -> dict:
|
||||
"""Extract social credit codes and return in entity mapping format."""
|
||||
credit_pattern = r'\b[0-9A-Z]{18}\b'
|
||||
entities = []
|
||||
for match in re.findall(credit_pattern, chunk):
|
||||
entities.append({"text": match, "type": "统一社会信用代码"})
|
||||
return {"entities": entities} if entities else {}
|
||||
|
|
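A short sketch of how the two regex extractors behave; both identifiers below are fabricated, and the \b word boundaries mean the numbers have to be delimited by punctuation or whitespace.

```python
# Illustrative calls to the regex-based extractors (fabricated identifiers).
print(extract_id_number_entities("身份证号:11010519491231002X。"))
# -> {'entities': [{'text': '11010519491231002X', 'type': '身份证号'}]}
print(extract_social_credit_code_entities("统一社会信用代码:91110108MA01C8Y34F。"))
# -> {'entities': [{'text': '91110108MA01C8Y34F', 'type': '统一社会信用代码'}]}
```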
@@ -0,0 +1,225 @@
|
|||
import textwrap
|
||||
|
||||
|
||||
def get_ner_name_prompt(text: str) -> str:
|
||||
"""
|
||||
Returns a prompt that generates a mapping of original names/companies to their masked versions.
|
||||
|
||||
Args:
|
||||
text (str): The input text to be analyzed for masking
|
||||
|
||||
Returns:
|
||||
str: The formatted prompt that will generate a mapping dictionary
|
||||
"""
|
||||
prompt = textwrap.dedent("""
|
||||
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||
|
||||
实体类别包括:
|
||||
- 人名 (不包括律师、法官、书记员、检察官等公职人员)
|
||||
- 英文人名
|
||||
|
||||
|
||||
待处理文本:
|
||||
{text}
|
||||
|
||||
输出格式:
|
||||
{{
|
||||
"entities": [
|
||||
{{"text": "原始文本内容", "type": "人名"}},
|
||||
{{"text": "原始文本内容", "type": "英文人名"}},
|
||||
...
|
||||
]
|
||||
}}
|
||||
|
||||
|
||||
|
||||
请严格按照JSON格式输出结果。
|
||||
|
||||
""")
|
||||
return prompt.format(text=text)
|
||||
|
||||
|
||||
def get_ner_company_prompt(text: str) -> str:
|
||||
"""
|
||||
Returns a prompt that generates a mapping of original companies to their masked versions.
|
||||
|
||||
Args:
|
||||
text (str): The input text to be analyzed for masking
|
||||
|
||||
Returns:
|
||||
str: The formatted prompt that will generate a mapping dictionary
|
||||
"""
|
||||
prompt = textwrap.dedent("""
|
||||
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||
|
||||
实体类别包括:
|
||||
- 公司名称
|
||||
- 英文公司名称
|
||||
- Company with English name
|
||||
- 公司名称简称
|
||||
- 公司英文名称简称
|
||||
|
||||
|
||||
待处理文本:
|
||||
{text}
|
||||
|
||||
输出格式:
|
||||
{{
|
||||
"entities": [
|
||||
{{"text": "原始文本内容", "type": "公司名称"}},
|
||||
{{"text": "原始文本内容", "type": "英文公司名称"}},
|
||||
{{"text": "原始文本内容", "type": "公司名称简称"}},
|
||||
{{"text": "原始文本内容", "type": "公司英文名称简称"}},
|
||||
...
|
||||
]
|
||||
}}
|
||||
|
||||
请严格按照JSON格式输出结果。
|
||||
""")
|
||||
return prompt.format(text=text)
|
||||
|
||||
|
||||
def get_ner_address_prompt(text: str) -> str:
|
||||
"""
|
||||
Returns a prompt that generates a mapping of original addresses to their masked versions.
|
||||
|
||||
Args:
|
||||
text (str): The input text to be analyzed for masking
|
||||
|
||||
Returns:
|
||||
str: The formatted prompt that will generate a mapping dictionary
|
||||
"""
|
||||
prompt = textwrap.dedent("""
|
||||
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||
|
||||
实体类别包括:
|
||||
- 地址
|
||||
|
||||
|
||||
待处理文本:
|
||||
{text}
|
||||
|
||||
输出格式:
|
||||
{{
|
||||
"entities": [
|
||||
{{"text": "原始文本内容", "type": "地址"}},
|
||||
...
|
||||
]
|
||||
}}
|
||||
|
||||
请严格按照JSON格式输出结果。
|
||||
""")
|
||||
return prompt.format(text=text)
|
||||
|
||||
|
||||
def get_ner_project_prompt(text: str) -> str:
|
||||
"""
|
||||
Returns a prompt that generates a mapping of original project names to their masked versions.
|
||||
"""
|
||||
prompt = textwrap.dedent("""
|
||||
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||
|
||||
实体类别包括:
|
||||
- 项目名(此处项目特指商业、工程、合同等项目)
|
||||
|
||||
待处理文本:
|
||||
{text}
|
||||
|
||||
输出格式:
|
||||
{{
|
||||
"entities": [
|
||||
{{"text": "原始文本内容", "type": "项目名"}},
|
||||
...
|
||||
]
|
||||
}}
|
||||
|
||||
请严格按照JSON格式输出结果。
|
||||
""")
|
||||
return prompt.format(text=text)
|
||||
|
||||
|
||||
def get_ner_case_number_prompt(text: str) -> str:
|
||||
"""
|
||||
Returns a prompt that generates a mapping of original case numbers to their masked versions.
|
||||
"""
|
||||
prompt = textwrap.dedent("""
|
||||
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||
|
||||
实体类别包括:
|
||||
- 案号
|
||||
|
||||
待处理文本:
|
||||
{text}
|
||||
|
||||
输出格式:
|
||||
{{
|
||||
"entities": [
|
||||
{{"text": "原始文本内容", "type": "案号"}},
|
||||
...
|
||||
]
|
||||
}}
|
||||
|
||||
请严格按照JSON格式输出结果。
|
||||
""")
|
||||
return prompt.format(text=text)
|
||||
|
||||
|
||||
def get_entity_linkage_prompt(entities_text: str) -> str:
|
||||
"""
|
||||
Returns a prompt that identifies related entities and groups them together.
|
||||
|
||||
Args:
|
||||
entities_text (str): The list of entities to be analyzed for linkage
|
||||
|
||||
Returns:
|
||||
str: The formatted prompt that will generate entity linkage information
|
||||
"""
|
||||
prompt = textwrap.dedent("""
|
||||
你是一个专业的法律文本实体关联分析助手。请分析以下实体列表,识别出相互关联的实体(如全称与简称、中文名与英文名等),并将它们分组。
|
||||
|
||||
关联规则:
|
||||
1. 公司名称关联:
|
||||
- 全称与简称(如:"阿里巴巴集团控股有限公司" 与 "阿里巴巴")
|
||||
- 中文名与英文名(如:"腾讯科技有限公司" 与 "Tencent Technology Ltd.")
|
||||
- 母公司与子公司(如:"腾讯" 与 "腾讯音乐")
|
||||
|
||||
|
||||
2. 每个组中应指定一个主要实体(is_primary: true),通常是:
|
||||
- 对于公司:选择最正式的全称
|
||||
- 对于人名:选择最常用的称呼
|
||||
|
||||
待分析实体列表:
|
||||
{entities_text}
|
||||
|
||||
输出格式:
|
||||
{{
|
||||
"entity_groups": [
|
||||
{{
|
||||
"group_id": "group_1",
|
||||
"group_type": "公司名称",
|
||||
"entities": [
|
||||
{{
|
||||
"text": "阿里巴巴集团控股有限公司",
|
||||
"type": "公司名称",
|
||||
"is_primary": true
|
||||
}},
|
||||
{{
|
||||
"text": "阿里巴巴",
|
||||
"type": "公司名称简称",
|
||||
"is_primary": false
|
||||
}}
|
||||
]
|
||||
}}
|
||||
]
|
||||
}}
|
||||
|
||||
注意事项:
|
||||
1. 只对确实有关联的实体进行分组
|
||||
2. 每个实体只能属于一个组
|
||||
3. 每个组必须有且仅有一个主要实体(is_primary: true)
|
||||
4. 如果实体之间没有明显关联,不要强制分组
|
||||
5. group_type 应该是 "公司名称"
|
||||
|
||||
请严格按照JSON格式输出结果。
|
||||
""")
|
||||
return prompt.format(entities_text=entities_text)
|
||||
|
|
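One detail worth calling out in the templates above: because the expected JSON output is embedded in the prompt, literal braces are doubled ({{ and }}) so that str.format only substitutes the named placeholder. A minimal sketch of the same pattern:

```python
import textwrap

# Doubled braces survive str.format; only {text} is substituted.
template = textwrap.dedent("""
    待处理文本:
    {text}

    输出格式:
    {{"entities": [{{"text": "...", "type": "..."}}]}}
""")
print(template.format(text="示例文本"))
```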
@@ -1,12 +1,12 @@
|
|||
import logging
|
||||
from models.document_factory import DocumentProcessorFactory
|
||||
from services.ollama_client import OllamaClient
|
||||
from ..document_handlers.document_factory import DocumentProcessorFactory
|
||||
from ..services.ollama_client import OllamaClient
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class DocumentService:
|
||||
def __init__(self, ollama_client: OllamaClient):
|
||||
self.ollama_client = ollama_client
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
def process_document(self, input_path: str, output_path: str) -> bool:
|
||||
try:
|
||||
|
|
@@ -19,10 +19,10 @@ class DocumentService:
|
|||
content = processor.read_content()
|
||||
|
||||
# Process with Ollama
|
||||
processed_content = self.ollama_client.process_document(content)
|
||||
masked_content = processor.process_content(content)
|
||||
|
||||
# Save processed content
|
||||
processor.save_content(processed_content)
|
||||
processor.save_content(masked_content)
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
|
|
@@ -0,0 +1,91 @@
|
|||
import requests
|
||||
import logging
|
||||
from typing import Dict, Any
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class OllamaClient:
|
||||
def __init__(self, model_name: str, base_url: str = "http://localhost:11434"):
|
||||
"""Initialize Ollama client.
|
||||
|
||||
Args:
|
||||
model_name (str): Name of the Ollama model to use
|
||||
base_url (str): Base URL of the Ollama server, e.g. "http://localhost:11434"
|
||||
"""
|
||||
self.model_name = model_name
|
||||
self.base_url = base_url
|
||||
self.headers = {"Content-Type": "application/json"}
|
||||
|
||||
def generate(self, prompt: str, strip_think: bool = True) -> str:
|
||||
"""Process a document using the Ollama API.
|
||||
|
||||
Args:
|
||||
prompt (str): The prompt text to send to the model
strip_think (bool): If True, strip any <think>...</think> block from the response
|
||||
|
||||
Returns:
|
||||
str: Processed text response from the model
|
||||
|
||||
Raises:
|
||||
RequestException: If the API call fails
|
||||
"""
|
||||
try:
|
||||
url = f"{self.base_url}/api/generate"
|
||||
payload = {
|
||||
"model": self.model_name,
|
||||
"prompt": prompt,
|
||||
"stream": False
|
||||
}
|
||||
|
||||
logger.debug(f"Sending request to Ollama API: {url}")
|
||||
response = requests.post(url, json=payload, headers=self.headers)
|
||||
response.raise_for_status()
|
||||
|
||||
result = response.json()
|
||||
logger.debug(f"Received response from Ollama API: {result}")
|
||||
if strip_think:
|
||||
# Remove the "thinking" part from the response
|
||||
# the response is expected to be <think>...</think>response_text
|
||||
# Check if the response contains <think> tag
|
||||
if "<think>" in result.get("response", ""):
|
||||
# Split the response and take the part after </think>
|
||||
response_parts = result["response"].split("</think>")
|
||||
if len(response_parts) > 1:
|
||||
# Return the part after </think>
|
||||
return response_parts[1].strip()
|
||||
else:
|
||||
# If no closing tag, return the full response
|
||||
return result.get("response", "").strip()
|
||||
else:
|
||||
# If no <think> tag, return the full response
|
||||
return result.get("response", "").strip()
|
||||
else:
|
||||
# If strip_think is False, return the full response
|
||||
return result.get("response", "")
|
||||
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error calling Ollama API: {str(e)}")
|
||||
raise
|
||||
|
||||
def get_model_info(self) -> Dict[str, Any]:
|
||||
"""Get information about the current model.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: Model information
|
||||
|
||||
Raises:
|
||||
RequestException: If the API call fails
|
||||
"""
|
||||
try:
|
||||
url = f"{self.base_url}/api/show"
|
||||
payload = {"name": self.model_name}
|
||||
|
||||
response = requests.post(url, json=payload, headers=self.headers)
|
||||
response.raise_for_status()
|
||||
|
||||
return response.json()
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error getting model info: {str(e)}")
|
||||
raise
|
||||
|
|
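A minimal usage sketch for the client above, assuming an Ollama server is reachable at the given URL; the model name and prompt are placeholders.

```python
# Hypothetical usage of OllamaClient; in the real code the model name and URL
# come from settings rather than literals.
client = OllamaClient(model_name="your-model-name", base_url="http://localhost:11434")
answer = client.generate("用一句话总结这份判决书。")  # <think> blocks are stripped by default
print(answer)
print(client.get_model_info())  # raw /api/show payload for the configured model
```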
@@ -0,0 +1,141 @@
|
|||
import json
|
||||
import re
|
||||
from typing import Any, Optional, Dict, TypeVar, Type
|
||||
|
||||
|
||||
T = TypeVar('T')
|
||||
|
||||
class LLMJsonExtractor:
|
||||
"""Utility class for extracting and parsing JSON from LLM outputs"""
|
||||
|
||||
@staticmethod
|
||||
def extract_json(text: str) -> Optional[str]:
|
||||
"""
|
||||
Extracts JSON string from text using regex pattern matching.
|
||||
Handles multiple JSON objects in the text, but only one level of nested braces (use extract_json_max for deeper nesting).
|
||||
|
||||
Args:
|
||||
text (str): Raw text containing JSON
|
||||
|
||||
Returns:
|
||||
Optional[str]: Extracted JSON string or None if no valid JSON found
|
||||
"""
|
||||
# Pattern to match JSON objects with balanced braces
|
||||
pattern = r'{[^{}]*(?:{[^{}]*}[^{}]*)*}'
|
||||
matches = re.findall(pattern, text)
|
||||
|
||||
if not matches:
|
||||
return None
|
||||
|
||||
# Return the first valid JSON match
|
||||
for match in matches:
|
||||
try:
|
||||
# Verify it's valid JSON
|
||||
json.loads(match)
|
||||
return match
|
||||
except json.JSONDecodeError:
|
||||
continue
|
||||
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def parse_json(text: str) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
Extracts and parses JSON from text into a Python dictionary.
|
||||
|
||||
Args:
|
||||
text (str): Raw text containing JSON
|
||||
|
||||
Returns:
|
||||
Optional[Dict[str, Any]]: Parsed JSON as dictionary or None if parsing fails
|
||||
"""
|
||||
try:
|
||||
json_str = LLMJsonExtractor.extract_json(text)
|
||||
if json_str:
|
||||
return json.loads(json_str)
|
||||
return None
|
||||
except json.JSONDecodeError:
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def parse_to_dataclass(text: str, dataclass_type: Type[T]) -> Optional[T]:
|
||||
"""
|
||||
Extracts JSON and converts it to a specified dataclass type.
|
||||
|
||||
Args:
|
||||
text (str): Raw text containing JSON
|
||||
dataclass_type (Type[T]): Target dataclass type
|
||||
|
||||
Returns:
|
||||
Optional[T]: Instance of specified dataclass or None if conversion fails
|
||||
"""
|
||||
try:
|
||||
data = LLMJsonExtractor.parse_json(text)
|
||||
if data:
|
||||
return dataclass_type(**data)
|
||||
return None
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def parse_raw_json_str(text: str) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
Extracts and parses JSON from text into a Python dictionary.
|
||||
|
||||
Args:
|
||||
text (str): Raw text containing JSON
|
||||
|
||||
Returns:
|
||||
Optional[Dict[str, Any]]: Parsed JSON as dictionary or None if parsing fails
|
||||
"""
|
||||
try:
|
||||
json_str = LLMJsonExtractor.extract_json_max(text)
|
||||
if json_str:
|
||||
return json.loads(json_str)
|
||||
return None
|
||||
except json.JSONDecodeError:
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def extract_json_max(text: str) -> Optional[str]:
|
||||
"""
|
||||
Extracts the maximum valid JSON object from text using stack-based brace matching.
|
||||
|
||||
Args:
|
||||
text (str): Raw text containing JSON
|
||||
|
||||
Returns:
|
||||
Optional[str]: Maximum valid JSON object as string or None if no valid JSON found
|
||||
"""
|
||||
max_json = None
|
||||
max_length = 0
|
||||
|
||||
# Iterate through each character as a potential start of JSON
|
||||
for start in range(len(text)):
|
||||
if text[start] != '{':
|
||||
continue
|
||||
|
||||
stack = []
|
||||
for end in range(start, len(text)):
|
||||
if text[end] == '{':
|
||||
stack.append(end)
|
||||
elif text[end] == '}':
|
||||
if not stack: # Unmatched closing brace
|
||||
break
|
||||
|
||||
opening_pos = stack.pop()
|
||||
|
||||
# If stack is empty, we have a complete JSON object
|
||||
if not stack:
|
||||
json_candidate = text[opening_pos:end + 1]
|
||||
try:
|
||||
# Verify it's valid JSON
|
||||
json.loads(json_candidate)
|
||||
if len(json_candidate) > max_length:
|
||||
max_length = len(json_candidate)
|
||||
max_json = json_candidate
|
||||
except json.JSONDecodeError:
|
||||
continue
|
||||
|
||||
return max_json
|
||||
|
||||
|
|
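A short sketch of the two extraction strategies above on a response that wraps the JSON in extra chatter (the surrounding text is made up):

```python
# Illustrative input: a JSON object embedded in model chatter, nested one level deep.
raw = '模型输出如下:{"entities": [{"text": "张三", "type": "人名"}]} 以上。'
print(LLMJsonExtractor.extract_json(raw))        # regex-based; handles one nesting level
print(LLMJsonExtractor.parse_raw_json_str(raw))  # stack-based extract_json_max + json.loads
```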
@@ -0,0 +1,240 @@
|
|||
import logging
|
||||
from typing import Any, Dict, Optional
|
||||
from jsonschema import validate, ValidationError
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class LLMResponseValidator:
|
||||
"""Validator for LLM JSON responses with different schemas for different entity types"""
|
||||
|
||||
# Schema for basic entity extraction responses
|
||||
ENTITY_EXTRACTION_SCHEMA = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"entities": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"text": {"type": "string"},
|
||||
"type": {"type": "string"}
|
||||
},
|
||||
"required": ["text", "type"]
|
||||
}
|
||||
}
|
||||
},
|
||||
"required": ["entities"]
|
||||
}
|
||||
|
||||
# Schema for entity linkage responses
|
||||
ENTITY_LINKAGE_SCHEMA = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"entity_groups": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"group_id": {"type": "string"},
|
||||
"group_type": {"type": "string"},
|
||||
"entities": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"text": {"type": "string"},
|
||||
"type": {"type": "string"},
|
||||
"is_primary": {"type": "boolean"}
|
||||
},
|
||||
"required": ["text", "type", "is_primary"]
|
||||
}
|
||||
}
|
||||
},
|
||||
"required": ["group_id", "group_type", "entities"]
|
||||
}
|
||||
}
|
||||
},
|
||||
"required": ["entity_groups"]
|
||||
}
|
||||
|
||||
# Schema for regex-based entity extraction (from entity_regex.py)
|
||||
REGEX_ENTITY_SCHEMA = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"entities": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"text": {"type": "string"},
|
||||
"type": {"type": "string"}
|
||||
},
|
||||
"required": ["text", "type"]
|
||||
}
|
||||
}
|
||||
},
|
||||
"required": ["entities"]
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def validate_entity_extraction(cls, response: Dict[str, Any]) -> bool:
|
||||
"""
|
||||
Validate entity extraction response from LLM.
|
||||
|
||||
Args:
|
||||
response: The parsed JSON response from LLM
|
||||
|
||||
Returns:
|
||||
bool: True if valid, False otherwise
|
||||
"""
|
||||
try:
|
||||
validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
|
||||
logger.debug(f"Entity extraction validation passed for response: {response}")
|
||||
return True
|
||||
except ValidationError as e:
|
||||
logger.warning(f"Entity extraction validation failed: {e}")
|
||||
logger.warning(f"Response that failed validation: {response}")
|
||||
return False
|
||||
|
||||
@classmethod
|
||||
def validate_entity_linkage(cls, response: Dict[str, Any]) -> bool:
|
||||
"""
|
||||
Validate entity linkage response from LLM.
|
||||
|
||||
Args:
|
||||
response: The parsed JSON response from LLM
|
||||
|
||||
Returns:
|
||||
bool: True if valid, False otherwise
|
||||
"""
|
||||
try:
|
||||
validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
|
||||
content_valid = cls._validate_linkage_content(response)
|
||||
if content_valid:
|
||||
logger.debug(f"Entity linkage validation passed for response: {response}")
|
||||
return True
|
||||
else:
|
||||
logger.warning(f"Entity linkage content validation failed for response: {response}")
|
||||
return False
|
||||
except ValidationError as e:
|
||||
logger.warning(f"Entity linkage validation failed: {e}")
|
||||
logger.warning(f"Response that failed validation: {response}")
|
||||
return False
|
||||
|
||||
@classmethod
|
||||
def validate_regex_entity(cls, response: Dict[str, Any]) -> bool:
|
||||
"""
|
||||
Validate regex-based entity extraction response.
|
||||
|
||||
Args:
|
||||
response: The parsed JSON response from regex extractors
|
||||
|
||||
Returns:
|
||||
bool: True if valid, False otherwise
|
||||
"""
|
||||
try:
|
||||
validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
|
||||
logger.debug(f"Regex entity validation passed for response: {response}")
|
||||
return True
|
||||
except ValidationError as e:
|
||||
logger.warning(f"Regex entity validation failed: {e}")
|
||||
logger.warning(f"Response that failed validation: {response}")
|
||||
return False
|
||||
|
||||
@classmethod
|
||||
def _validate_linkage_content(cls, response: Dict[str, Any]) -> bool:
|
||||
"""
|
||||
Additional content validation for entity linkage responses.
|
||||
|
||||
Args:
|
||||
response: The parsed JSON response from LLM
|
||||
|
||||
Returns:
|
||||
bool: True if content is valid, False otherwise
|
||||
"""
|
||||
entity_groups = response.get('entity_groups', [])
|
||||
|
||||
for group in entity_groups:
|
||||
# Validate group type
|
||||
group_type = group.get('group_type', '')
|
||||
if group_type not in ['公司名称', '人名']:
|
||||
logger.warning(f"Invalid group_type: {group_type}")
|
||||
return False
|
||||
|
||||
# Validate entities in group
|
||||
entities = group.get('entities', [])
|
||||
if not entities:
|
||||
logger.warning("Empty entity group found")
|
||||
return False
|
||||
|
||||
# Check that exactly one entity is marked as primary
|
||||
primary_count = sum(1 for entity in entities if entity.get('is_primary', False))
|
||||
if primary_count != 1:
|
||||
logger.warning(f"Group must have exactly one primary entity, found {primary_count}")
|
||||
return False
|
||||
|
||||
# Validate entity types within group
|
||||
for entity in entities:
|
||||
entity_type = entity.get('type', '')
|
||||
if group_type == '公司名称' and not any(keyword in entity_type for keyword in ['公司', 'Company']):
|
||||
logger.warning(f"Company group contains non-company entity: {entity_type}")
|
||||
return False
|
||||
elif group_type == '人名' and not any(keyword in entity_type for keyword in ['人名', '英文人名']):
|
||||
logger.warning(f"Person group contains non-person entity: {entity_type}")
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
@classmethod
|
||||
def validate_response_by_type(cls, response: Dict[str, Any], response_type: str) -> bool:
|
||||
"""
|
||||
Generic validator that routes to appropriate validation method based on response type.
|
||||
|
||||
Args:
|
||||
response: The parsed JSON response from LLM
|
||||
response_type: Type of response ('entity_extraction', 'entity_linkage', 'regex_entity')
|
||||
|
||||
Returns:
|
||||
bool: True if valid, False otherwise
|
||||
"""
|
||||
validators = {
|
||||
'entity_extraction': cls.validate_entity_extraction,
|
||||
'entity_linkage': cls.validate_entity_linkage,
|
||||
'regex_entity': cls.validate_regex_entity
|
||||
}
|
||||
|
||||
validator = validators.get(response_type)
|
||||
if not validator:
|
||||
logger.error(f"Unknown response type: {response_type}")
|
||||
return False
|
||||
|
||||
return validator(response)
|
||||
|
||||
@classmethod
|
||||
def get_validation_errors(cls, response: Dict[str, Any], response_type: str) -> Optional[str]:
|
||||
"""
|
||||
Get detailed validation errors for debugging.
|
||||
|
||||
Args:
|
||||
response: The parsed JSON response from LLM
|
||||
response_type: Type of response
|
||||
|
||||
Returns:
|
||||
Optional[str]: Error message or None if valid
|
||||
"""
|
||||
try:
|
||||
if response_type == 'entity_extraction':
|
||||
validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
|
||||
elif response_type == 'entity_linkage':
|
||||
validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
|
||||
if not cls._validate_linkage_content(response):
|
||||
return "Content validation failed for entity linkage"
|
||||
elif response_type == 'regex_entity':
|
||||
validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
|
||||
else:
|
||||
return f"Unknown response type: {response_type}"
|
||||
|
||||
return None
|
||||
except ValidationError as e:
|
||||
return f"Schema validation error: {e}"
|
||||
|
|
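A small sketch of how a parsed response would be checked against the schemas above before being used (the entity is made up):

```python
# Hypothetical validation of a parsed extraction response.
candidate = {"entities": [{"text": "张三", "type": "人名"}]}
if LLMResponseValidator.validate_entity_extraction(candidate):
    print("accepted")
else:
    print("rejected:", LLMResponseValidator.get_validation_errors(candidate, "entity_extraction"))
```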
@@ -0,0 +1,33 @@
|
|||
from fastapi import FastAPI
|
||||
from fastapi.middleware.cors import CORSMiddleware
|
||||
from .core.config import settings
|
||||
from .api.endpoints import files
|
||||
from .core.database import engine, Base
|
||||
|
||||
# Create database tables
|
||||
Base.metadata.create_all(bind=engine)
|
||||
|
||||
app = FastAPI(
|
||||
title=settings.PROJECT_NAME,
|
||||
openapi_url=f"{settings.API_V1_STR}/openapi.json"
|
||||
)
|
||||
|
||||
# Set up CORS
|
||||
app.add_middleware(
|
||||
CORSMiddleware,
|
||||
allow_origins=["*"], # In production, replace with specific origins
|
||||
allow_credentials=True,
|
||||
allow_methods=["*"],
|
||||
allow_headers=["*"],
|
||||
)
|
||||
|
||||
# Include routers
|
||||
app.include_router(
|
||||
files.router,
|
||||
prefix=f"{settings.API_V1_STR}/files",
|
||||
tags=["files"]
|
||||
)
|
||||
|
||||
@app.get("/")
|
||||
async def root():
|
||||
return {"message": "Welcome to Legal Document Masker API"}
|
||||
|
|
@@ -0,0 +1,22 @@
|
|||
from sqlalchemy import Column, String, DateTime, Text
|
||||
from datetime import datetime
|
||||
import uuid
|
||||
from ..core.database import Base
|
||||
|
||||
class FileStatus(str):
|
||||
NOT_STARTED = "not_started"
|
||||
PROCESSING = "processing"
|
||||
SUCCESS = "success"
|
||||
FAILED = "failed"
|
||||
|
||||
class File(Base):
|
||||
__tablename__ = "files"
|
||||
|
||||
id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
|
||||
filename = Column(String(255), nullable=False)
|
||||
original_path = Column(String(255), nullable=False)
|
||||
processed_path = Column(String(255))
|
||||
status = Column(String(20), nullable=False, default=FileStatus.NOT_STARTED)
|
||||
error_message = Column(Text)
|
||||
created_at = Column(DateTime, nullable=False, default=datetime.utcnow)
|
||||
updated_at = Column(DateTime, nullable=False, default=datetime.utcnow, onupdate=datetime.utcnow)
|
||||
|
|
@@ -0,0 +1,21 @@
|
|||
from pydantic import BaseModel
|
||||
from datetime import datetime
|
||||
from typing import Optional
|
||||
from uuid import UUID
|
||||
|
||||
class FileBase(BaseModel):
|
||||
filename: str
|
||||
status: str
|
||||
error_message: Optional[str] = None
|
||||
|
||||
class FileResponse(FileBase):
|
||||
id: UUID
|
||||
created_at: datetime
|
||||
updated_at: datetime
|
||||
|
||||
class Config:
|
||||
from_attributes = True
|
||||
|
||||
class FileList(BaseModel):
|
||||
files: list[FileResponse]
|
||||
total: int
|
||||
|
|
@@ -0,0 +1,87 @@
|
|||
from celery import Celery
|
||||
from ..core.config import settings
|
||||
from ..models.file import File, FileStatus
|
||||
from sqlalchemy.orm import Session
|
||||
from ..core.database import SessionLocal
|
||||
import sys
|
||||
import os
|
||||
from ..core.services.document_service import DocumentService
|
||||
from pathlib import Path
|
||||
from fastapi import HTTPException
|
||||
|
||||
|
||||
celery = Celery(
|
||||
'file_service',
|
||||
broker=settings.CELERY_BROKER_URL,
|
||||
backend=settings.CELERY_RESULT_BACKEND
|
||||
)
|
||||
|
||||
def delete_file(file_id: str):
|
||||
"""
|
||||
Delete a file and its associated records.
|
||||
This will:
|
||||
1. Delete the database record
|
||||
2. Delete the original uploaded file
|
||||
3. Delete the processed markdown file (if it exists)
|
||||
"""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
# Get the file record
|
||||
file = db.query(File).filter(File.id == file_id).first()
|
||||
if not file:
|
||||
raise HTTPException(status_code=404, detail="File not found")
|
||||
|
||||
# Delete the original file if it exists
|
||||
if file.original_path and os.path.exists(file.original_path):
|
||||
os.remove(file.original_path)
|
||||
|
||||
# Delete the processed file if it exists
|
||||
if file.processed_path and os.path.exists(file.processed_path):
|
||||
os.remove(file.processed_path)
|
||||
|
||||
# Delete the database record
|
||||
db.delete(file)
|
||||
db.commit()
|
||||
|
||||
except Exception as e:
|
||||
db.rollback()
|
||||
raise HTTPException(status_code=500, detail=f"Error deleting file: {str(e)}")
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
@celery.task
|
||||
def process_file(file_id: str):
|
||||
db = SessionLocal()
|
||||
try:
|
||||
file = db.query(File).filter(File.id == file_id).first()
|
||||
if not file:
|
||||
return
|
||||
|
||||
# Update status to processing
|
||||
file.status = FileStatus.PROCESSING
|
||||
db.commit()
|
||||
|
||||
try:
|
||||
# Process the file using your existing masking system
|
||||
process_service = DocumentService()
|
||||
|
||||
# Determine output path using file_id with .md extension
|
||||
output_filename = f"{file_id}.md"
|
||||
output_path = str(settings.PROCESSED_FOLDER / output_filename)
|
||||
|
||||
# Process document with both input and output paths
|
||||
process_service.process_document(file.original_path, output_path)
|
||||
|
||||
# Update file record with processed path
|
||||
file.processed_path = output_path
|
||||
file.status = FileStatus.SUCCESS
|
||||
db.commit()
|
||||
|
||||
except Exception as e:
|
||||
file.status = FileStatus.FAILED
|
||||
file.error_message = str(e)
|
||||
db.commit()
|
||||
raise
|
||||
|
||||
finally:
|
||||
db.close()
|
||||
|
|
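For reference, the task above would normally be queued from an API endpoint rather than called directly; a minimal sketch (the file id is hypothetical):

```python
# Hypothetical enqueueing of the processing task; the Celery worker picks it up via Redis.
from app.services.file_service import process_file

process_file.delay("3f2c9b6e-1234-4cde-9abc-0d9e8f7a6b5c")  # hypothetical file id
```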
@@ -0,0 +1,37 @@
|
|||
version: '3.8'
|
||||
|
||||
services:
|
||||
api:
|
||||
build: .
|
||||
ports:
|
||||
- "8000:8000"
|
||||
volumes:
|
||||
- ./storage:/app/storage
|
||||
- ./legal_doc_masker.db:/app/legal_doc_masker.db
|
||||
env_file:
|
||||
- .env
|
||||
environment:
|
||||
- CELERY_BROKER_URL=redis://redis:6379/0
|
||||
- CELERY_RESULT_BACKEND=redis://redis:6379/0
|
||||
depends_on:
|
||||
- redis
|
||||
|
||||
celery_worker:
|
||||
build: .
|
||||
command: celery -A app.services.file_service worker --loglevel=info
|
||||
volumes:
|
||||
- ./storage:/app/storage
|
||||
- ./legal_doc_masker.db:/app/legal_doc_masker.db
|
||||
env_file:
|
||||
- .env
|
||||
environment:
|
||||
- CELERY_BROKER_URL=redis://redis:6379/0
|
||||
- CELERY_RESULT_BACKEND=redis://redis:6379/0
|
||||
depends_on:
|
||||
- redis
|
||||
- api
|
||||
|
||||
redis:
|
||||
image: redis:alpine
|
||||
ports:
|
||||
- "6379:6379"
|
||||
|
|
@@ -0,0 +1,127 @@
|
|||
[2025-07-14 14:20:19,015: INFO/ForkPoolWorker-4] Raw response from LLM: {
|
||||
celery_worker-1 | "entities": []
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | [2025-07-14 14:20:19,016: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
|
||||
celery_worker-1 | [2025-07-14 14:20:19,020: INFO/ForkPoolWorker-4] Calling ollama to generate case numbers mapping for chunk (attempt 1/3):
|
||||
celery_worker-1 | 你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||
celery_worker-1 |
|
||||
celery_worker-1 | 实体类别包括:
|
||||
celery_worker-1 | - 案号
|
||||
celery_worker-1 |
|
||||
celery_worker-1 | 待处理文本:
|
||||
celery_worker-1 |
|
||||
celery_worker-1 |
|
||||
celery_worker-1 | 二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
|
||||
celery_worker-1 |
|
||||
celery_worker-1 | 29. 本判决为终审判决。
|
||||
celery_worker-1 |
|
||||
celery_worker-1 | 审 判 长 史晓霞审 判 员 邓青菁审 判 员 李 淼二〇二二年七月七日法 官 助 理 黎 铧书 记 员 郑海兴
|
||||
celery_worker-1 |
|
||||
celery_worker-1 | 输出格式:
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "entities": [
|
||||
celery_worker-1 | {"text": "原始文本内容", "type": "案号"},
|
||||
celery_worker-1 | ...
|
||||
celery_worker-1 | ]
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 |
|
||||
celery_worker-1 | 请严格按照JSON格式输出结果。
|
||||
celery_worker-1 |
|
||||
api-1 | INFO: 192.168.65.1:60045 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:22084 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
celery_worker-1 | [2025-07-14 14:20:31,279: INFO/ForkPoolWorker-4] Raw response from LLM: {
|
||||
celery_worker-1 | "entities": []
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | [2025-07-14 14:20:31,281: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
|
||||
celery_worker-1 | [2025-07-14 14:20:31,287: INFO/ForkPoolWorker-4] Chunk mapping: [{'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
|
||||
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Final chunk mappings: [{'entities': [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}]}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}]}, {'entities': [{'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}]}, {'entities': [{'text': '服务合同', 'type': '项目名'}]}, {'entities': [{'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}]}, {'entities': [{'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}]}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}]}, {'entities': [{'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}]}, {'entities': [{'text': '《计算机设备采购合同》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': [{'text': '《服务合同书》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
|
||||
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
|
||||
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
|
||||
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '中研智创公司', 'type': '公司名称'}
|
||||
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}
|
||||
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Merged 22 unique entities
|
||||
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Unique entities: [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
|
||||
celery_worker-1 | [2025-07-14 14:20:31,289: INFO/ForkPoolWorker-4] Calling ollama to generate entity linkage (attempt 1/3)
|
||||
api-1 | INFO: 192.168.65.1:52168 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:61426 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:30702 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:48159 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:16860 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:21262 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:45564 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:32142 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:27769 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:21196 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
celery_worker-1 | [2025-07-14 14:21:21,436: INFO/ForkPoolWorker-4] Raw entity linkage response from LLM: {
|
||||
celery_worker-1 | "entity_groups": [
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "group_id": "group_1",
|
||||
celery_worker-1 | "group_type": "公司名称",
|
||||
celery_worker-1 | "entities": [
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "北京丰复久信营销科技有限公司",
|
||||
celery_worker-1 | "type": "公司名称",
|
||||
celery_worker-1 | "is_primary": true
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "丰复久信公司",
|
||||
celery_worker-1 | "type": "公司名称简称",
|
||||
celery_worker-1 | "is_primary": false
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "丰复久信",
|
||||
celery_worker-1 | "type": "公司名称简称",
|
||||
celery_worker-1 | "is_primary": false
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | ]
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "group_id": "group_2",
|
||||
celery_worker-1 | "group_type": "公司名称",
|
||||
celery_worker-1 | "entities": [
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "中研智创区块链技术有限公司",
|
||||
celery_worker-1 | "type": "公司名称",
|
||||
celery_worker-1 | "is_primary": true
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "中研智创公司",
|
||||
celery_worker-1 | "type": "公司名称简称",
|
||||
celery_worker-1 | "is_primary": false
|
||||
celery_worker-1 | },
|
||||
celery_worker-1 | {
|
||||
celery_worker-1 | "text": "中研智创",
|
||||
celery_worker-1 | "type": "公司名称简称",
|
||||
celery_worker-1 | "is_primary": false
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | ]
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | ]
|
||||
celery_worker-1 | }
|
||||
celery_worker-1 | [2025-07-14 14:21:21,437: INFO/ForkPoolWorker-4] Parsed entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
|
||||
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Successfully created entity linkage with 2 groups
|
||||
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Generated masked mapping for 22 entities
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Combined mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司甲', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '(2020)京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司丁', '丰复久信': '某公司戊', '中研智创': '某公司己', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '北京丰复久信营销科技有限公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信' to '北京丰复久信营销科技有限公司' with masked name '某公司'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创区块链技术有限公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创' to '中研智创区块链技术有限公司' with masked name '某公司乙'
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Final mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '(2020)京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司乙', '丰复久信': '某公司', '中研智创': '某公司乙', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
|
||||
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Successfully masked content
|
||||
celery_worker-1 | [2025-07-14 14:21:21,449: INFO/ForkPoolWorker-4] Successfully saved masked content to /app/storage/processed/47522ea9-c259-4304-bfe4-1d3ed6902ede.md
|
||||
celery_worker-1 | [2025-07-14 14:21:21,470: INFO/ForkPoolWorker-4] Task app.services.file_service.process_file[5cfbca4c-0f6f-4c71-a66b-b22ee2d28139] succeeded in 311.847165101s: None
|
||||
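The worker trace above walks through the masking pipeline: per-chunk entities are merged with duplicates dropped, the LLM returns an entity-linkage grouping, a per-entity masked mapping is generated, and every alias in a linkage group is then collapsed onto its primary entity's mask (which is why '丰复久信公司' moves from '某公司甲' in the combined mapping to '某公司' in the final mapping). A minimal Python sketch of the merge and collapse steps follows, assuming that reading of the logs; the function names and data shapes are illustrative, not the actual NerProcessor API.

```python
def merge_unique_entities(chunk_mappings):
    """Flatten per-chunk entity lists, keeping the first occurrence of each text.

    Keyed on the entity text only, which matches the trace: later duplicates of
    '丰复久信公司' (typed '公司名称') are dropped in favour of the first occurrence.
    """
    seen, merged = set(), []
    for chunk in chunk_mappings:
        for entity in chunk.get("entities", []):
            if entity["text"] in seen:
                continue  # "Duplicate entity found" in the log
            seen.add(entity["text"])
            merged.append(entity)
    return merged


def apply_entity_linkage(masked_mapping, entity_linkage):
    """Remap every alias in a linkage group onto the primary entity's masked name."""
    final_mapping = dict(masked_mapping)
    for group in entity_linkage.get("entity_groups", []):
        primary = next((e for e in group["entities"] if e.get("is_primary")), None)
        if primary is None or primary["text"] not in masked_mapping:
            continue
        primary_mask = masked_mapping[primary["text"]]
        for entity in group["entities"]:
            final_mapping[entity["text"]] = primary_mask
    return final_mapping
```

With the values shown above, `apply_entity_linkage` maps '北京丰复久信营销科技有限公司', '丰复久信公司' and '丰复久信' all to '某公司', matching the "Final mapping" line in the log.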
api-1 | INFO: 192.168.65.1:33432 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:40073 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:29550 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:61350 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:61755 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:63726 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:43446 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:45624 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:25256 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
api-1 | INFO: 192.168.65.1:43464 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||
|
|
@ -0,0 +1,6 @@
|
|||
{
|
||||
"name": "backend",
|
||||
"lockfileVersion": 3,
|
||||
"requires": true,
|
||||
"packages": {}
|
||||
}
|
||||
|
|
@ -0,0 +1,32 @@
|
|||
# FastAPI and server
|
||||
fastapi>=0.104.0
|
||||
uvicorn>=0.24.0
|
||||
python-multipart>=0.0.6
|
||||
websockets>=12.0
|
||||
|
||||
# Database
|
||||
sqlalchemy>=2.0.0
|
||||
alembic>=1.12.0
|
||||
|
||||
# Background tasks
|
||||
celery>=5.3.0
|
||||
redis>=5.0.0
|
||||
|
||||
# Security
|
||||
python-jose[cryptography]>=3.3.0
|
||||
passlib[bcrypt]>=1.7.4
|
||||
python-dotenv>=1.0.0
|
||||
|
||||
# Testing
|
||||
pytest>=7.4.0
|
||||
httpx>=0.25.0
|
||||
|
||||
# Existing project dependencies
|
||||
pydantic-settings>=2.0.0
|
||||
watchdog==2.1.6
|
||||
requests==2.28.1
|
||||
python-docx>=0.8.11
|
||||
PyPDF2>=3.0.0
|
||||
pandas>=2.0.0
|
||||
# magic-pdf[full]
|
||||
jsonschema>=4.20.0
|
||||
|
|
@ -0,0 +1 @@
|
|||
关于张三天和北京易见天树有限公司的劳动纠纷
|
||||
|
|
@ -0,0 +1,62 @@
|
|||
import pytest
|
||||
from app.core.document_handlers.ner_processor import NerProcessor
|
||||
|
||||
def test_generate_masked_mapping():
|
||||
processor = NerProcessor()
|
||||
unique_entities = [
|
||||
{'text': '李雷', 'type': '人名'},
|
||||
{'text': '李明', 'type': '人名'},
|
||||
{'text': '王强', 'type': '人名'},
|
||||
{'text': 'Acme Manufacturing Inc.', 'type': '英文公司名', 'industry': 'manufacturing'},
|
||||
{'text': 'Google LLC', 'type': '英文公司名'},
|
||||
{'text': 'A公司', 'type': '公司名称'},
|
||||
{'text': 'B公司', 'type': '公司名称'},
|
||||
{'text': 'John Smith', 'type': '英文人名'},
|
||||
{'text': 'Elizabeth Windsor', 'type': '英文人名'},
|
||||
{'text': '华梦龙光伏项目', 'type': '项目名'},
|
||||
{'text': '案号12345', 'type': '案号'},
|
||||
{'text': '310101198802080000', 'type': '身份证号'},
|
||||
{'text': '9133021276453538XT', 'type': '社会信用代码'},
|
||||
]
|
||||
linkage = {
|
||||
'entity_groups': [
|
||||
{
|
||||
'group_id': 'g1',
|
||||
'group_type': '公司名称',
|
||||
'entities': [
|
||||
{'text': 'A公司', 'type': '公司名称', 'is_primary': True},
|
||||
{'text': 'B公司', 'type': '公司名称', 'is_primary': False},
|
||||
]
|
||||
},
|
||||
{
|
||||
'group_id': 'g2',
|
||||
'group_type': '人名',
|
||||
'entities': [
|
||||
{'text': '李雷', 'type': '人名', 'is_primary': True},
|
||||
{'text': '李明', 'type': '人名', 'is_primary': False},
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
mapping = processor._generate_masked_mapping(unique_entities, linkage)
|
||||
# Chinese person names
|
||||
assert mapping['李雷'].startswith('李某')
|
||||
assert mapping['李明'].startswith('李某')
|
||||
assert mapping['王强'].startswith('王某')
|
||||
# English company names
|
||||
assert mapping['Acme Manufacturing Inc.'] == 'MANUFACTURING'
|
||||
assert mapping['Google LLC'] == 'COMPANY'
|
||||
# Company names in the same linkage group share a mask
|
||||
assert mapping['A公司'] == mapping['B公司']
|
||||
assert mapping['A公司'].endswith('公司')
|
||||
# English person names
|
||||
assert mapping['John Smith'] == 'J*** S***'
|
||||
assert mapping['Elizabeth Windsor'] == 'E*** W***'
|
||||
# Project names
|
||||
assert mapping['华梦龙光伏项目'].endswith('项目')
|
||||
# Case numbers
|
||||
assert mapping['案号12345'] == '***'
|
||||
# National ID numbers
|
||||
assert mapping['310101198802080000'] == 'XXXXXX'
|
||||
# Unified social credit codes
|
||||
assert mapping['9133021276453538XT'] == 'XXXXXXXX'
|
||||
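The assertions above pin down the masking conventions per entity type: Chinese person names keep their surname plus '某', English person names keep initials plus '***', English company names become their industry in upper case (or 'COMPANY' when no industry is given), linked Chinese company names share one mask ending in '公司', project names end in '项目', and case numbers, ID numbers and credit codes are replaced by fixed placeholders. Below is a minimal sketch of rules that would satisfy exactly these assertions; it is not the project's `_generate_masked_mapping` implementation, and the real processor presumably also disambiguates collisions with 甲/乙/丙 suffixes, as the worker log earlier in this commit shows.

```python
# Hypothetical rule set, written only to satisfy the assertions in the test above.
def generate_masked_mapping(unique_entities, linkage):
    mapping = {}

    # Every company name inside a '公司名称' linkage group shares a single mask.
    linked_company_mask = {}
    for group in linkage.get('entity_groups', []):
        if group.get('group_type') == '公司名称':
            for e in group['entities']:
                linked_company_mask[e['text']] = '某公司'

    for entity in unique_entities:
        text, etype = entity['text'], entity['type']
        if etype == '人名':
            mapping[text] = text[0] + '某'  # surname + 某
        elif etype == '英文人名':
            mapping[text] = ' '.join(w[0] + '***' for w in text.split())
        elif etype == '英文公司名':
            industry = entity.get('industry')
            mapping[text] = industry.upper() if industry else 'COMPANY'
        elif etype == '公司名称':
            mapping[text] = linked_company_mask.get(text, '某公司')
        elif etype == '项目名':
            mapping[text] = '某项目'
        elif etype == '案号':
            mapping[text] = '***'
        elif etype == '身份证号':
            mapping[text] = 'XXXXXX'
        elif etype == '社会信用代码':
            mapping[text] = 'XXXXXXXX'
        else:
            mapping[text] = '***'
    return mapping
```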
|
|
@ -0,0 +1,105 @@
|
|||
version: '3.8'
|
||||
|
||||
services:
|
||||
# Mineru API Service
|
||||
mineru-api:
|
||||
build:
|
||||
context: ./mineru
|
||||
dockerfile: Dockerfile
|
||||
platform: linux/arm64
|
||||
ports:
|
||||
- "8001:8000"
|
||||
volumes:
|
||||
- ./mineru/storage/uploads:/app/storage/uploads
|
||||
- ./mineru/storage/processed:/app/storage/processed
|
||||
environment:
|
||||
- PYTHONUNBUFFERED=1
|
||||
- MINERU_MODEL_SOURCE=local
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 60s
|
||||
networks:
|
||||
- app-network
|
||||
|
||||
# Backend API Service
|
||||
backend-api:
|
||||
build:
|
||||
context: ./backend
|
||||
dockerfile: Dockerfile
|
||||
ports:
|
||||
- "8000:8000"
|
||||
volumes:
|
||||
- ./backend/storage:/app/storage
|
||||
- ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
|
||||
env_file:
|
||||
- ./backend/.env
|
||||
environment:
|
||||
- CELERY_BROKER_URL=redis://redis:6379/0
|
||||
- CELERY_RESULT_BACKEND=redis://redis:6379/0
|
||||
- MINERU_API_URL=http://mineru-api:8000
|
||||
depends_on:
|
||||
- redis
|
||||
- mineru-api
|
||||
networks:
|
||||
- app-network
|
||||
|
||||
# Celery Worker
|
||||
celery-worker:
|
||||
build:
|
||||
context: ./backend
|
||||
dockerfile: Dockerfile
|
||||
command: celery -A app.services.file_service worker --loglevel=info
|
||||
volumes:
|
||||
- ./backend/storage:/app/storage
|
||||
- ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
|
||||
env_file:
|
||||
- ./backend/.env
|
||||
environment:
|
||||
- CELERY_BROKER_URL=redis://redis:6379/0
|
||||
- CELERY_RESULT_BACKEND=redis://redis:6379/0
|
||||
- MINERU_API_URL=http://mineru-api:8000
|
||||
depends_on:
|
||||
- redis
|
||||
- backend-api
|
||||
networks:
|
||||
- app-network
|
||||
|
||||
# Redis Service
|
||||
redis:
|
||||
image: redis:alpine
|
||||
ports:
|
||||
- "6379:6379"
|
||||
networks:
|
||||
- app-network
|
||||
|
||||
# Frontend Service
|
||||
frontend:
|
||||
build:
|
||||
context: ./frontend
|
||||
dockerfile: Dockerfile
|
||||
args:
|
||||
- REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
|
||||
ports:
|
||||
- "3000:80"
|
||||
env_file:
|
||||
- ./frontend/.env
|
||||
environment:
|
||||
- NODE_ENV=production
|
||||
- REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
|
||||
restart: unless-stopped
|
||||
depends_on:
|
||||
- backend-api
|
||||
networks:
|
||||
- app-network
|
||||
|
||||
networks:
|
||||
app-network:
|
||||
driver: bridge
|
||||
|
||||
volumes:
|
||||
uploads:
|
||||
processed:
|
||||
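The environment blocks above are the only contract between the containers: the backend and the Celery worker find Redis and the mineru service purely through these variables. Since the backend's requirements include pydantic-settings, a minimal sketch of how such variables could be consumed is shown below; the variable names are taken from the compose file, but the Settings class itself is hypothetical and not part of this commit.

```python
from pydantic_settings import BaseSettings


class ComposeSettings(BaseSettings):
    # Defaults mirror the docker-compose.yml environment blocks above;
    # the real backend Settings class is not shown in this excerpt.
    CELERY_BROKER_URL: str = "redis://redis:6379/0"
    CELERY_RESULT_BACKEND: str = "redis://redis:6379/0"
    MINERU_API_URL: str = "http://mineru-api:8000"


settings = ComposeSettings()  # container environment overrides these defaults
```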
|
|
@ -0,0 +1,67 @@
|
|||
import json
|
||||
import shutil
|
||||
import os
|
||||
|
||||
import requests
|
||||
from modelscope import snapshot_download
|
||||
|
||||
|
||||
def download_json(url):
|
||||
# Download the JSON file
|
||||
response = requests.get(url)
|
||||
response.raise_for_status()  # Raise an error if the request failed
|
||||
return response.json()
|
||||
|
||||
|
||||
def download_and_modify_json(url, local_filename, modifications):
|
||||
if os.path.exists(local_filename):
|
||||
data = json.load(open(local_filename))
|
||||
config_version = data.get('config_version', '0.0.0')
|
||||
if config_version < '1.2.0':
|
||||
data = download_json(url)
|
||||
else:
|
||||
data = download_json(url)
|
||||
|
||||
# Apply the requested modifications
|
||||
for key, value in modifications.items():
|
||||
data[key] = value
|
||||
|
||||
# Save the modified content
|
||||
with open(local_filename, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=4)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
mineru_patterns = [
|
||||
# "models/Layout/LayoutLMv3/*",
|
||||
"models/Layout/YOLO/*",
|
||||
"models/MFD/YOLO/*",
|
||||
"models/MFR/unimernet_hf_small_2503/*",
|
||||
"models/OCR/paddleocr_torch/*",
|
||||
# "models/TabRec/TableMaster/*",
|
||||
# "models/TabRec/StructEqTable/*",
|
||||
]
|
||||
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
|
||||
layoutreader_model_dir = snapshot_download('ppaanngggg/layoutreader')
|
||||
model_dir = model_dir + '/models'
|
||||
print(f'model_dir is: {model_dir}')
|
||||
print(f'layoutreader_model_dir is: {layoutreader_model_dir}')
|
||||
|
||||
# paddleocr_model_dir = model_dir + '/OCR/paddleocr'
|
||||
# user_paddleocr_dir = os.path.expanduser('~/.paddleocr')
|
||||
# if os.path.exists(user_paddleocr_dir):
|
||||
# shutil.rmtree(user_paddleocr_dir)
|
||||
# shutil.copytree(paddleocr_model_dir, user_paddleocr_dir)
|
||||
|
||||
json_url = 'https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/magic-pdf.template.json'
|
||||
config_file_name = 'magic-pdf.json'
|
||||
home_dir = os.path.expanduser('~')
|
||||
config_file = os.path.join(home_dir, config_file_name)
|
||||
|
||||
json_mods = {
|
||||
'models-dir': model_dir,
|
||||
'layoutreader-model-dir': layoutreader_model_dir,
|
||||
}
|
||||
|
||||
download_and_modify_json(json_url, config_file, json_mods)
|
||||
print(f'The configuration file has been configured successfully, the path is: {config_file}')
|
||||
|
|
@ -0,0 +1,168 @@
|
|||
#!/bin/bash
|
||||
|
||||
# Docker Image Export Script
|
||||
# Exports all project Docker images for migration to another environment
|
||||
|
||||
set -e
|
||||
|
||||
echo "🚀 Legal Document Masker - Docker Image Export"
|
||||
echo "=============================================="
|
||||
|
||||
# Function to check if Docker is running
|
||||
check_docker() {
|
||||
if ! docker info > /dev/null 2>&1; then
|
||||
echo "❌ Docker is not running. Please start Docker and try again."
|
||||
exit 1
|
||||
fi
|
||||
echo "✅ Docker is running"
|
||||
}
|
||||
|
||||
# Function to check if images exist
|
||||
check_images() {
|
||||
echo "🔍 Checking for required images..."
|
||||
|
||||
local missing_images=()
|
||||
|
||||
if ! docker images | grep -q "legal-doc-masker-backend-api"; then
|
||||
missing_images+=("legal-doc-masker-backend-api")
|
||||
fi
|
||||
|
||||
if ! docker images | grep -q "legal-doc-masker-frontend"; then
|
||||
missing_images+=("legal-doc-masker-frontend")
|
||||
fi
|
||||
|
||||
if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
|
||||
missing_images+=("legal-doc-masker-mineru-api")
|
||||
fi
|
||||
|
||||
if ! docker images | grep -q "redis:alpine"; then
|
||||
missing_images+=("redis:alpine")
|
||||
fi
|
||||
|
||||
if [ ${#missing_images[@]} -ne 0 ]; then
|
||||
echo "❌ Missing images: ${missing_images[*]}"
|
||||
echo "Please build the images first using: docker-compose build"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "✅ All required images found"
|
||||
}
|
||||
|
||||
# Function to create export directory
|
||||
create_export_dir() {
local export_dir="docker-images-export-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$export_dir"
# Print the status message to stderr so that the command substitution in main()
# captures only the directory name on stdout; the cd happens in main(), since a
# cd inside $(...) would only affect the subshell.
echo "📁 Created export directory: $export_dir" >&2
echo "$export_dir"
}
|
||||
|
||||
# Function to export images
|
||||
export_images() {
|
||||
local export_dir="$1"
|
||||
|
||||
echo "📦 Exporting Docker images..."
|
||||
|
||||
# Export backend image
|
||||
echo " 📦 Exporting backend-api image..."
|
||||
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
|
||||
|
||||
# Export frontend image
|
||||
echo " 📦 Exporting frontend image..."
|
||||
docker save legal-doc-masker-frontend:latest -o frontend.tar
|
||||
|
||||
# Export mineru image
|
||||
echo " 📦 Exporting mineru-api image..."
|
||||
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
|
||||
|
||||
# Export redis image
|
||||
echo " 📦 Exporting redis image..."
|
||||
docker save redis:alpine -o redis.tar
|
||||
|
||||
echo "✅ All images exported successfully!"
|
||||
}
|
||||
|
||||
# Function to show export summary
|
||||
show_summary() {
|
||||
echo ""
|
||||
echo "📊 Export Summary:"
|
||||
echo "=================="
|
||||
ls -lh *.tar
|
||||
|
||||
echo ""
|
||||
echo "📋 Files to transfer:"
|
||||
echo "===================="
|
||||
for file in *.tar; do
|
||||
echo " - $file"
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "💾 Total size: $(du -sh . | cut -f1)"
|
||||
}
|
||||
|
||||
# Function to create compressed archive
|
||||
create_archive() {
|
||||
echo ""
|
||||
echo "🗜️ Creating compressed archive..."
|
||||
|
||||
local archive_name="legal-doc-masker-images-$(date +%Y%m%d-%H%M%S).tar.gz"
|
||||
tar -czf "$archive_name" *.tar
|
||||
|
||||
echo "✅ Created archive: $archive_name"
|
||||
echo "📊 Archive size: $(du -sh "$archive_name" | cut -f1)"
|
||||
|
||||
echo ""
|
||||
echo "📋 Transfer options:"
|
||||
echo "==================="
|
||||
echo "1. Transfer individual .tar files"
|
||||
echo "2. Transfer compressed archive: $archive_name"
|
||||
}
|
||||
|
||||
# Function to show transfer instructions
|
||||
show_transfer_instructions() {
|
||||
echo ""
|
||||
echo "📤 Transfer Instructions:"
|
||||
echo "========================"
|
||||
echo ""
|
||||
echo "Option 1: Transfer individual files"
|
||||
echo "-----------------------------------"
|
||||
echo "scp *.tar user@target-server:/path/to/destination/"
|
||||
echo ""
|
||||
echo "Option 2: Transfer compressed archive"
|
||||
echo "-------------------------------------"
|
||||
echo "scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/"
|
||||
echo ""
|
||||
echo "Option 3: USB Drive"
|
||||
echo "-------------------"
|
||||
echo "cp *.tar /Volumes/USB_DRIVE/docker-images/"
|
||||
echo "cp legal-doc-masker-images-*.tar.gz /Volumes/USB_DRIVE/"
|
||||
echo ""
|
||||
echo "Option 4: Cloud Storage"
|
||||
echo "----------------------"
|
||||
echo "aws s3 cp *.tar s3://your-bucket/docker-images/"
|
||||
echo "aws s3 cp legal-doc-masker-images-*.tar.gz s3://your-bucket/docker-images/"
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
check_docker
|
||||
check_images
|
||||
|
||||
local export_dir=$(create_export_dir)
cd "$export_dir"
export_images "$export_dir"
|
||||
show_summary
|
||||
create_archive
|
||||
show_transfer_instructions
|
||||
|
||||
echo ""
|
||||
echo "🎉 Export completed successfully!"
|
||||
echo "📁 Export location: $(pwd)"
|
||||
echo ""
|
||||
echo "Next steps:"
|
||||
echo "1. Transfer the files to your target environment"
|
||||
echo "2. Use import-images.sh on the target environment"
|
||||
echo "3. Copy docker-compose.yml and other config files"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
|
|
@ -0,0 +1,11 @@
|
|||
node_modules
|
||||
npm-debug.log
|
||||
build
|
||||
.git
|
||||
.gitignore
|
||||
README.md
|
||||
.env
|
||||
.env.local
|
||||
.env.development.local
|
||||
.env.test.local
|
||||
.env.production.local
|
||||
|
|
@ -0,0 +1,2 @@
|
|||
# REACT_APP_API_BASE_URL=http://192.168.2.203:8000/api/v1
|
||||
REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
|
||||
|
|
@ -0,0 +1,33 @@
|
|||
# Build stage
|
||||
FROM node:18-alpine as build
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
# Copy package files
|
||||
COPY package*.json ./
|
||||
|
||||
# Install dependencies
|
||||
RUN npm ci
|
||||
|
||||
# Copy source code
|
||||
COPY . .
|
||||
|
||||
# Build the app with environment variables
|
||||
ARG REACT_APP_API_BASE_URL
|
||||
ENV REACT_APP_API_BASE_URL=$REACT_APP_API_BASE_URL
|
||||
RUN npm run build
|
||||
|
||||
# Production stage
|
||||
FROM nginx:alpine
|
||||
|
||||
# Copy built assets from build stage
|
||||
COPY --from=build /app/build /usr/share/nginx/html
|
||||
|
||||
# Copy nginx configuration
|
||||
COPY nginx.conf /etc/nginx/conf.d/default.conf
|
||||
|
||||
# Expose port 80
|
||||
EXPOSE 80
|
||||
|
||||
# Start nginx
|
||||
CMD ["nginx", "-g", "daemon off;"]
|
||||
|
|
@ -0,0 +1,55 @@
|
|||
# Legal Document Masker Frontend
|
||||
|
||||
This is the frontend application for the Legal Document Masker service. It provides a user interface for uploading legal documents, monitoring their processing status, and downloading the masked versions.
|
||||
|
||||
## Features
|
||||
|
||||
- Drag and drop file upload
|
||||
- Real-time status updates
|
||||
- File list with processing status
|
||||
- Multi-file selection and download
|
||||
- Modern Material-UI interface
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Node.js (v14 or higher)
|
||||
- npm (v6 or higher)
|
||||
|
||||
## Installation
|
||||
|
||||
1. Install dependencies:
|
||||
```bash
|
||||
npm install
|
||||
```
|
||||
|
||||
2. Start the development server:
|
||||
```bash
|
||||
npm start
|
||||
```
|
||||
|
||||
The application will be available at http://localhost:3000
|
||||
|
||||
## Development
|
||||
|
||||
The frontend is built with:
|
||||
- React 18
|
||||
- TypeScript
|
||||
- Material-UI
|
||||
- React Query for data fetching
|
||||
- React Dropzone for file uploads
|
||||
|
||||
## Building for Production
|
||||
|
||||
To create a production build:
|
||||
|
||||
```bash
|
||||
npm run build
|
||||
```
|
||||
|
||||
The build artifacts will be stored in the `build/` directory.
|
||||
|
||||
## Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `REACT_APP_API_BASE_URL`: The base URL of the backend API (default: http://localhost:8000/api/v1)
|
||||
|
|
@ -0,0 +1,24 @@
|
|||
version: '3.8'
|
||||
|
||||
services:
|
||||
frontend:
|
||||
build:
|
||||
context: .
|
||||
dockerfile: Dockerfile
|
||||
args:
|
||||
- REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
|
||||
ports:
|
||||
- "3000:80"
|
||||
env_file:
|
||||
- .env
|
||||
environment:
|
||||
- NODE_ENV=production
|
||||
- REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- app-network
|
||||
|
||||
|
||||
networks:
|
||||
app-network:
|
||||
driver: bridge
|
||||
|
|
@ -0,0 +1,25 @@
|
|||
server {
|
||||
listen 80;
|
||||
server_name localhost;
|
||||
|
||||
location / {
|
||||
root /usr/share/nginx/html;
|
||||
index index.html;
|
||||
try_files $uri $uri/ /index.html;
|
||||
}
|
||||
|
||||
# Cache static assets
|
||||
location /static/ {
|
||||
root /usr/share/nginx/html;
|
||||
expires 1y;
|
||||
add_header Cache-Control "public, no-transform";
|
||||
}
|
||||
|
||||
# Enable gzip compression
|
||||
gzip on;
|
||||
gzip_vary on;
|
||||
gzip_min_length 10240;
|
||||
gzip_proxied expired no-cache no-store private auth;
|
||||
gzip_types text/plain text/css text/xml text/javascript application/x-javascript application/xml application/javascript;
|
||||
gzip_disable "MSIE [1-6]\.";
|
||||
}
|
||||
File diff suppressed because it is too large
|
|
@ -0,0 +1,50 @@
|
|||
{
|
||||
"name": "legal-doc-masker-frontend",
|
||||
"version": "0.1.0",
|
||||
"private": true,
|
||||
"dependencies": {
|
||||
"@emotion/react": "^11.11.3",
|
||||
"@emotion/styled": "^11.11.0",
|
||||
"@mui/icons-material": "^5.15.10",
|
||||
"@mui/material": "^5.15.10",
|
||||
"@testing-library/jest-dom": "^5.17.0",
|
||||
"@testing-library/react": "^13.4.0",
|
||||
"@testing-library/user-event": "^13.5.0",
|
||||
"@types/jest": "^27.5.2",
|
||||
"@types/node": "^16.18.80",
|
||||
"@types/react": "^18.2.55",
|
||||
"@types/react-dom": "^18.2.19",
|
||||
"axios": "^1.6.7",
|
||||
"react": "^18.2.0",
|
||||
"react-dom": "^18.2.0",
|
||||
"react-dropzone": "^14.2.3",
|
||||
"react-query": "^3.39.3",
|
||||
"react-scripts": "5.0.1",
|
||||
"typescript": "^4.9.5",
|
||||
"web-vitals": "^2.1.4"
|
||||
},
|
||||
"scripts": {
|
||||
"start": "react-scripts start",
|
||||
"build": "react-scripts build",
|
||||
"test": "react-scripts test",
|
||||
"eject": "react-scripts eject"
|
||||
},
|
||||
"eslintConfig": {
|
||||
"extends": [
|
||||
"react-app",
|
||||
"react-app/jest"
|
||||
]
|
||||
},
|
||||
"browserslist": {
|
||||
"production": [
|
||||
">0.2%",
|
||||
"not dead",
|
||||
"not op_mini all"
|
||||
],
|
||||
"development": [
|
||||
"last 1 chrome version",
|
||||
"last 1 firefox version",
|
||||
"last 1 safari version"
|
||||
]
|
||||
}
|
||||
}
|
||||
|
|
@ -0,0 +1,20 @@
|
|||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<link rel="icon" href="%PUBLIC_URL%/favicon.ico" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="theme-color" content="#000000" />
|
||||
<meta
|
||||
name="description"
|
||||
content="Legal Document Masker - Upload and process legal documents"
|
||||
/>
|
||||
<link rel="apple-touch-icon" href="%PUBLIC_URL%/logo192.png" />
|
||||
<link rel="manifest" href="%PUBLIC_URL%/manifest.json" />
|
||||
<title>Legal Document Masker</title>
|
||||
</head>
|
||||
<body>
|
||||
<noscript>You need to enable JavaScript to run this app.</noscript>
|
||||
<div id="root"></div>
|
||||
</body>
|
||||
</html>
|
||||
|
|
@ -0,0 +1,15 @@
|
|||
{
|
||||
"short_name": "Legal Doc Masker",
|
||||
"name": "Legal Document Masker",
|
||||
"icons": [
|
||||
{
|
||||
"src": "favicon.ico",
|
||||
"sizes": "64x64 32x32 24x24 16x16",
|
||||
"type": "image/x-icon"
|
||||
}
|
||||
],
|
||||
"start_url": ".",
|
||||
"display": "standalone",
|
||||
"theme_color": "#000000",
|
||||
"background_color": "#ffffff"
|
||||
}
|
||||
|
|
@ -0,0 +1,58 @@
|
|||
import React, { useEffect, useState } from 'react';
|
||||
import { Container, Typography, Box } from '@mui/material';
|
||||
import { useQuery, useQueryClient } from 'react-query';
|
||||
import FileUpload from './components/FileUpload';
|
||||
import FileList from './components/FileList';
|
||||
import { File } from './types/file';
|
||||
import { api } from './services/api';
|
||||
|
||||
function App() {
|
||||
const queryClient = useQueryClient();
|
||||
const [files, setFiles] = useState<File[]>([]);
|
||||
|
||||
const { data, isLoading, error } = useQuery<File[]>('files', api.listFiles, {
|
||||
refetchInterval: 5000, // Poll every 5 seconds
|
||||
});
|
||||
|
||||
useEffect(() => {
|
||||
if (data) {
|
||||
setFiles(data);
|
||||
}
|
||||
}, [data]);
|
||||
|
||||
const handleUploadComplete = () => {
|
||||
queryClient.invalidateQueries('files');
|
||||
};
|
||||
|
||||
if (isLoading) {
|
||||
return (
|
||||
<Container>
|
||||
<Typography>Loading...</Typography>
|
||||
</Container>
|
||||
);
|
||||
}
|
||||
|
||||
if (error) {
|
||||
return (
|
||||
<Container>
|
||||
<Typography color="error">Error loading files</Typography>
|
||||
</Container>
|
||||
);
|
||||
}
|
||||
|
||||
return (
|
||||
<Container maxWidth="lg">
|
||||
<Box sx={{ my: 4 }}>
|
||||
<Typography variant="h4" component="h1" gutterBottom>
|
||||
Legal Document Masker
|
||||
</Typography>
|
||||
<Box sx={{ mb: 4 }}>
|
||||
<FileUpload onUploadComplete={handleUploadComplete} />
|
||||
</Box>
|
||||
<FileList files={files} onFileStatusChange={handleUploadComplete} />
|
||||
</Box>
|
||||
</Container>
|
||||
);
|
||||
}
|
||||
|
||||
export default App;
|
||||
|
|
@ -0,0 +1,230 @@
|
|||
import React, { useState } from 'react';
|
||||
import {
|
||||
Table,
|
||||
TableBody,
|
||||
TableCell,
|
||||
TableContainer,
|
||||
TableHead,
|
||||
TableRow,
|
||||
Paper,
|
||||
IconButton,
|
||||
Checkbox,
|
||||
Button,
|
||||
Chip,
|
||||
Dialog,
|
||||
DialogTitle,
|
||||
DialogContent,
|
||||
DialogActions,
|
||||
Typography,
|
||||
} from '@mui/material';
|
||||
import { Download as DownloadIcon, Delete as DeleteIcon } from '@mui/icons-material';
|
||||
import { File, FileStatus } from '../types/file';
|
||||
import { api } from '../services/api';
|
||||
|
||||
interface FileListProps {
|
||||
files: File[];
|
||||
onFileStatusChange: () => void;
|
||||
}
|
||||
|
||||
const FileList: React.FC<FileListProps> = ({ files, onFileStatusChange }) => {
|
||||
const [selectedFiles, setSelectedFiles] = useState<string[]>([]);
|
||||
const [deleteDialogOpen, setDeleteDialogOpen] = useState(false);
|
||||
const [fileToDelete, setFileToDelete] = useState<string | null>(null);
|
||||
|
||||
const handleSelectFile = (fileId: string) => {
|
||||
setSelectedFiles((prev) =>
|
||||
prev.includes(fileId)
|
||||
? prev.filter((id) => id !== fileId)
|
||||
: [...prev, fileId]
|
||||
);
|
||||
};
|
||||
|
||||
const handleSelectAll = () => {
|
||||
setSelectedFiles((prev) =>
|
||||
prev.length === files.length ? [] : files.map((file) => file.id)
|
||||
);
|
||||
};
|
||||
|
||||
const handleDownload = async (fileId: string) => {
|
||||
try {
|
||||
console.log('=== FRONTEND DOWNLOAD START ===');
|
||||
console.log('File ID:', fileId);
|
||||
|
||||
const file = files.find((f) => f.id === fileId);
|
||||
console.log('File object:', file);
|
||||
|
||||
const blob = await api.downloadFile(fileId);
|
||||
console.log('Blob received:', blob);
|
||||
console.log('Blob type:', blob.type);
|
||||
console.log('Blob size:', blob.size);
|
||||
|
||||
const url = window.URL.createObjectURL(blob);
|
||||
const a = document.createElement('a');
|
||||
a.href = url;
|
||||
|
||||
// Match backend behavior: change extension to .md
|
||||
const originalFilename = file?.filename || 'downloaded-file';
|
||||
const filenameWithoutExt = originalFilename.replace(/\.[^/.]+$/, ''); // Remove extension
|
||||
const downloadFilename = `${filenameWithoutExt}.md`;
|
||||
|
||||
console.log('Original filename:', originalFilename);
|
||||
console.log('Filename without extension:', filenameWithoutExt);
|
||||
console.log('Download filename:', downloadFilename);
|
||||
|
||||
a.download = downloadFilename;
|
||||
document.body.appendChild(a);
|
||||
a.click();
|
||||
window.URL.revokeObjectURL(url);
|
||||
document.body.removeChild(a);
|
||||
|
||||
console.log('=== FRONTEND DOWNLOAD END ===');
|
||||
} catch (error) {
|
||||
console.error('Error downloading file:', error);
|
||||
}
|
||||
};
|
||||
|
||||
const handleDownloadSelected = async () => {
|
||||
for (const fileId of selectedFiles) {
|
||||
await handleDownload(fileId);
|
||||
}
|
||||
};
|
||||
|
||||
const handleDeleteClick = (fileId: string) => {
|
||||
setFileToDelete(fileId);
|
||||
setDeleteDialogOpen(true);
|
||||
};
|
||||
|
||||
const handleDeleteConfirm = async () => {
|
||||
if (fileToDelete) {
|
||||
try {
|
||||
await api.deleteFile(fileToDelete);
|
||||
onFileStatusChange();
|
||||
} catch (error) {
|
||||
console.error('Error deleting file:', error);
|
||||
}
|
||||
}
|
||||
setDeleteDialogOpen(false);
|
||||
setFileToDelete(null);
|
||||
};
|
||||
|
||||
const handleDeleteCancel = () => {
|
||||
setDeleteDialogOpen(false);
|
||||
setFileToDelete(null);
|
||||
};
|
||||
|
||||
const getStatusColor = (status: FileStatus) => {
|
||||
switch (status) {
|
||||
case FileStatus.SUCCESS:
|
||||
return 'success';
|
||||
case FileStatus.FAILED:
|
||||
return 'error';
|
||||
case FileStatus.PROCESSING:
|
||||
return 'warning';
|
||||
default:
|
||||
return 'default';
|
||||
}
|
||||
};
|
||||
|
||||
return (
|
||||
<div>
|
||||
<div style={{ marginBottom: '1rem' }}>
|
||||
<Button
|
||||
variant="contained"
|
||||
color="primary"
|
||||
onClick={handleDownloadSelected}
|
||||
disabled={selectedFiles.length === 0}
|
||||
sx={{ mr: 1 }}
|
||||
>
|
||||
Download Selected
|
||||
</Button>
|
||||
</div>
|
||||
<TableContainer component={Paper}>
|
||||
<Table>
|
||||
<TableHead>
|
||||
<TableRow>
|
||||
<TableCell padding="checkbox">
|
||||
<Checkbox
|
||||
checked={selectedFiles.length === files.length}
|
||||
indeterminate={selectedFiles.length > 0 && selectedFiles.length < files.length}
|
||||
onChange={handleSelectAll}
|
||||
/>
|
||||
</TableCell>
|
||||
<TableCell>Filename</TableCell>
|
||||
<TableCell>Status</TableCell>
|
||||
<TableCell>Created At</TableCell>
|
||||
<TableCell>Finished At</TableCell>
|
||||
<TableCell>Actions</TableCell>
|
||||
</TableRow>
|
||||
</TableHead>
|
||||
<TableBody>
|
||||
{files.map((file) => (
|
||||
<TableRow key={file.id}>
|
||||
<TableCell padding="checkbox">
|
||||
<Checkbox
|
||||
checked={selectedFiles.includes(file.id)}
|
||||
onChange={() => handleSelectFile(file.id)}
|
||||
/>
|
||||
</TableCell>
|
||||
<TableCell>{file.filename}</TableCell>
|
||||
<TableCell>
|
||||
<Chip
|
||||
label={file.status}
|
||||
color={getStatusColor(file.status) as any}
|
||||
size="small"
|
||||
/>
|
||||
</TableCell>
|
||||
<TableCell>
|
||||
{new Date(file.created_at).toLocaleString()}
|
||||
</TableCell>
|
||||
<TableCell>
|
||||
{(file.status === FileStatus.SUCCESS || file.status === FileStatus.FAILED)
|
||||
? new Date(file.updated_at).toLocaleString()
|
||||
: '—'}
|
||||
</TableCell>
|
||||
<TableCell>
|
||||
<IconButton
|
||||
onClick={() => handleDeleteClick(file.id)}
|
||||
size="small"
|
||||
color="error"
|
||||
sx={{ mr: 1 }}
|
||||
>
|
||||
<DeleteIcon />
|
||||
</IconButton>
|
||||
{file.status === FileStatus.SUCCESS && (
|
||||
<IconButton
|
||||
onClick={() => handleDownload(file.id)}
|
||||
size="small"
|
||||
color="primary"
|
||||
>
|
||||
<DownloadIcon />
|
||||
</IconButton>
|
||||
)}
|
||||
</TableCell>
|
||||
</TableRow>
|
||||
))}
|
||||
</TableBody>
|
||||
</Table>
|
||||
</TableContainer>
|
||||
|
||||
<Dialog
|
||||
open={deleteDialogOpen}
|
||||
onClose={handleDeleteCancel}
|
||||
>
|
||||
<DialogTitle>Confirm Delete</DialogTitle>
|
||||
<DialogContent>
|
||||
<Typography>
|
||||
Are you sure you want to delete this file? This action cannot be undone.
|
||||
</Typography>
|
||||
</DialogContent>
|
||||
<DialogActions>
|
||||
<Button onClick={handleDeleteCancel}>Cancel</Button>
|
||||
<Button onClick={handleDeleteConfirm} color="error" variant="contained">
|
||||
Delete
|
||||
</Button>
|
||||
</DialogActions>
|
||||
</Dialog>
|
||||
</div>
|
||||
);
|
||||
};
|
||||
|
||||
export default FileList;
|
||||
|
|
@ -0,0 +1,66 @@
|
|||
import React, { useCallback } from 'react';
|
||||
import { useDropzone } from 'react-dropzone';
|
||||
import { Box, Typography, CircularProgress } from '@mui/material';
|
||||
import { api } from '../services/api';
|
||||
|
||||
interface FileUploadProps {
|
||||
onUploadComplete: () => void;
|
||||
}
|
||||
|
||||
const FileUpload: React.FC<FileUploadProps> = ({ onUploadComplete }) => {
|
||||
const [isUploading, setIsUploading] = React.useState(false);
|
||||
|
||||
const onDrop = useCallback(async (acceptedFiles: File[]) => {
|
||||
setIsUploading(true);
|
||||
try {
|
||||
for (const file of acceptedFiles) {
|
||||
await api.uploadFile(file);
|
||||
}
|
||||
onUploadComplete();
|
||||
} catch (error) {
|
||||
console.error('Error uploading files:', error);
|
||||
} finally {
|
||||
setIsUploading(false);
|
||||
}
|
||||
}, [onUploadComplete]);
|
||||
|
||||
const { getRootProps, getInputProps, isDragActive } = useDropzone({
|
||||
onDrop,
|
||||
accept: {
|
||||
'application/pdf': ['.pdf'],
|
||||
'application/msword': ['.doc'],
|
||||
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
|
||||
'text/markdown': ['.md'],
|
||||
},
|
||||
});
|
||||
|
||||
return (
|
||||
<Box
|
||||
{...getRootProps()}
|
||||
sx={{
|
||||
border: '2px dashed #ccc',
|
||||
borderRadius: 2,
|
||||
p: 3,
|
||||
textAlign: 'center',
|
||||
cursor: 'pointer',
|
||||
bgcolor: isDragActive ? 'action.hover' : 'background.paper',
|
||||
'&:hover': {
|
||||
bgcolor: 'action.hover',
|
||||
},
|
||||
}}
|
||||
>
|
||||
<input {...getInputProps()} />
|
||||
{isUploading ? (
|
||||
<CircularProgress />
|
||||
) : (
|
||||
<Typography>
|
||||
{isDragActive
|
||||
? 'Drop the files here...'
|
||||
: 'Drag and drop files here, or click to select files'}
|
||||
</Typography>
|
||||
)}
|
||||
</Box>
|
||||
);
|
||||
};
|
||||
|
||||
export default FileUpload;
|
||||
|
|
@ -0,0 +1,8 @@
|
|||
/// <reference types="react-scripts" />
|
||||
|
||||
declare namespace NodeJS {
|
||||
interface ProcessEnv {
|
||||
readonly REACT_APP_API_BASE_URL: string;
|
||||
// Add other environment variables here
|
||||
}
|
||||
}
|
||||
|
|
@ -0,0 +1,29 @@
|
|||
import React from 'react';
|
||||
import ReactDOM from 'react-dom/client';
|
||||
import { QueryClient, QueryClientProvider } from 'react-query';
|
||||
import { ThemeProvider, createTheme } from '@mui/material';
|
||||
import CssBaseline from '@mui/material/CssBaseline';
|
||||
import App from './App';
|
||||
|
||||
const queryClient = new QueryClient();
|
||||
|
||||
const theme = createTheme({
|
||||
palette: {
|
||||
mode: 'light',
|
||||
},
|
||||
});
|
||||
|
||||
const root = ReactDOM.createRoot(
|
||||
document.getElementById('root') as HTMLElement
|
||||
);
|
||||
|
||||
root.render(
|
||||
<React.StrictMode>
|
||||
<QueryClientProvider client={queryClient}>
|
||||
<ThemeProvider theme={theme}>
|
||||
<CssBaseline />
|
||||
<App />
|
||||
</ThemeProvider>
|
||||
</QueryClientProvider>
|
||||
</React.StrictMode>
|
||||
);
|
||||
|
|
@ -0,0 +1,44 @@
|
|||
import axios from 'axios';
|
||||
import { File, FileUploadResponse } from '../types/file';
|
||||
|
||||
const API_BASE_URL = process.env.REACT_APP_API_BASE_URL || 'http://localhost:8000/api/v1';
|
||||
|
||||
// Create axios instance with default config
|
||||
const axiosInstance = axios.create({
|
||||
baseURL: API_BASE_URL,
|
||||
timeout: 30000, // 30 seconds timeout
|
||||
});
|
||||
|
||||
export const api = {
|
||||
uploadFile: async (file: globalThis.File): Promise<FileUploadResponse> => {
|
||||
const formData = new FormData();
|
||||
formData.append('file', file);
|
||||
const response = await axiosInstance.post('/files/upload', formData, {
|
||||
headers: {
|
||||
'Content-Type': 'multipart/form-data',
|
||||
},
|
||||
});
|
||||
return response.data;
|
||||
},
|
||||
|
||||
listFiles: async (): Promise<File[]> => {
|
||||
const response = await axiosInstance.get('/files/files');
|
||||
return response.data;
|
||||
},
|
||||
|
||||
getFile: async (fileId: string): Promise<File> => {
|
||||
const response = await axiosInstance.get(`/files/files/${fileId}`);
|
||||
return response.data;
|
||||
},
|
||||
|
||||
downloadFile: async (fileId: string): Promise<Blob> => {
|
||||
const response = await axiosInstance.get(`/files/files/${fileId}/download`, {
|
||||
responseType: 'blob',
|
||||
});
|
||||
return response.data;
|
||||
},
|
||||
|
||||
deleteFile: async (fileId: string): Promise<void> => {
|
||||
await axiosInstance.delete(`/files/files/${fileId}`);
|
||||
},
|
||||
};
|
||||
|
|
@ -0,0 +1,23 @@
|
|||
export enum FileStatus {
|
||||
NOT_STARTED = "not_started",
|
||||
PROCESSING = "processing",
|
||||
SUCCESS = "success",
|
||||
FAILED = "failed"
|
||||
}
|
||||
|
||||
export interface File {
|
||||
id: string;
|
||||
filename: string;
|
||||
status: FileStatus;
|
||||
error_message?: string;
|
||||
created_at: string;
|
||||
updated_at: string;
|
||||
}
|
||||
|
||||
export interface FileUploadResponse {
|
||||
id: string;
|
||||
filename: string;
|
||||
status: FileStatus;
|
||||
created_at: string;
|
||||
updated_at: string;
|
||||
}
|
||||
|
|
@ -0,0 +1,26 @@
|
|||
{
|
||||
"compilerOptions": {
|
||||
"target": "es5",
|
||||
"lib": [
|
||||
"dom",
|
||||
"dom.iterable",
|
||||
"esnext"
|
||||
],
|
||||
"allowJs": true,
|
||||
"skipLibCheck": true,
|
||||
"esModuleInterop": true,
|
||||
"allowSyntheticDefaultImports": true,
|
||||
"strict": true,
|
||||
"forceConsistentCasingInFileNames": true,
|
||||
"noFallthroughCasesInSwitch": true,
|
||||
"module": "esnext",
|
||||
"moduleResolution": "node",
|
||||
"resolveJsonModule": true,
|
||||
"isolatedModules": true,
|
||||
"noEmit": true,
|
||||
"jsx": "react-jsx"
|
||||
},
|
||||
"include": [
|
||||
"src"
|
||||
]
|
||||
}
|
||||
|
|
@ -0,0 +1,232 @@
|
|||
#!/bin/bash
|
||||
|
||||
# Docker Image Import Script
|
||||
# Imports Docker images on the target environment for migration
|
||||
|
||||
set -e
|
||||
|
||||
echo "🚀 Legal Document Masker - Docker Image Import"
|
||||
echo "=============================================="
|
||||
|
||||
# Function to check if Docker is running
|
||||
check_docker() {
|
||||
if ! docker info > /dev/null 2>&1; then
|
||||
echo "❌ Docker is not running. Please start Docker and try again."
|
||||
exit 1
|
||||
fi
|
||||
echo "✅ Docker is running"
|
||||
}
|
||||
|
||||
# Function to check for tar files
|
||||
check_tar_files() {
|
||||
echo "🔍 Checking for Docker image files..."
|
||||
|
||||
local missing_files=()
|
||||
|
||||
if [ ! -f "backend-api.tar" ]; then
|
||||
missing_files+=("backend-api.tar")
|
||||
fi
|
||||
|
||||
if [ ! -f "frontend.tar" ]; then
|
||||
missing_files+=("frontend.tar")
|
||||
fi
|
||||
|
||||
if [ ! -f "mineru-api.tar" ]; then
|
||||
missing_files+=("mineru-api.tar")
|
||||
fi
|
||||
|
||||
if [ ! -f "redis.tar" ]; then
|
||||
missing_files+=("redis.tar")
|
||||
fi
|
||||
|
||||
if [ ${#missing_files[@]} -ne 0 ]; then
|
||||
echo "❌ Missing files: ${missing_files[*]}"
|
||||
echo ""
|
||||
echo "Please ensure all .tar files are in the current directory."
|
||||
echo "If you have a compressed archive, extract it first:"
|
||||
echo " tar -xzf legal-doc-masker-images-*.tar.gz"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "✅ All required files found"
|
||||
}
|
||||
|
||||
# Function to check available disk space
|
||||
check_disk_space() {
|
||||
echo "💾 Checking available disk space..."
|
||||
|
||||
local required_space=0
|
||||
for file in *.tar; do
|
||||
local file_size=$(stat -f%z "$file" 2>/dev/null || stat -c%s "$file" 2>/dev/null || echo 0)
|
||||
required_space=$((required_space + file_size))
|
||||
done
|
||||
|
||||
local available_space=$(df . | awk 'NR==2 {print $4}')
|
||||
available_space=$((available_space * 1024)) # Convert to bytes
|
||||
|
||||
if [ $required_space -gt $available_space ]; then
|
||||
echo "❌ Insufficient disk space"
|
||||
echo "Required: $(numfmt --to=iec $required_space)"
|
||||
echo "Available: $(numfmt --to=iec $available_space)"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "✅ Sufficient disk space available"
|
||||
}
|
||||
|
||||
# Function to import images
|
||||
import_images() {
|
||||
echo "📦 Importing Docker images..."
|
||||
|
||||
# Import backend image
|
||||
echo " 📦 Importing backend-api image..."
|
||||
docker load -i backend-api.tar
|
||||
|
||||
# Import frontend image
|
||||
echo " 📦 Importing frontend image..."
|
||||
docker load -i frontend.tar
|
||||
|
||||
# Import mineru image
|
||||
echo " 📦 Importing mineru-api image..."
|
||||
docker load -i mineru-api.tar
|
||||
|
||||
# Import redis image
|
||||
echo " 📦 Importing redis image..."
|
||||
docker load -i redis.tar
|
||||
|
||||
echo "✅ All images imported successfully!"
|
||||
}
|
||||
|
||||
# Function to verify imported images
|
||||
verify_images() {
|
||||
echo "🔍 Verifying imported images..."
|
||||
|
||||
local missing_images=()
|
||||
|
||||
if ! docker images | grep -q "legal-doc-masker-backend-api"; then
|
||||
missing_images+=("legal-doc-masker-backend-api")
|
||||
fi
|
||||
|
||||
if ! docker images | grep -q "legal-doc-masker-frontend"; then
|
||||
missing_images+=("legal-doc-masker-frontend")
|
||||
fi
|
||||
|
||||
if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
|
||||
missing_images+=("legal-doc-masker-mineru-api")
|
||||
fi
|
||||
|
||||
if ! docker images | grep -q "redis:alpine"; then
|
||||
missing_images+=("redis:alpine")
|
||||
fi
|
||||
|
||||
if [ ${#missing_images[@]} -ne 0 ]; then
|
||||
echo "❌ Missing imported images: ${missing_images[*]}"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "✅ All images verified successfully!"
|
||||
}
|
||||
|
||||
# Function to show imported images
|
||||
show_imported_images() {
|
||||
echo ""
|
||||
echo "📊 Imported Images:"
|
||||
echo "==================="
|
||||
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker
|
||||
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep redis
|
||||
}
|
||||
|
||||
# Function to create necessary directories
|
||||
create_directories() {
|
||||
echo ""
|
||||
echo "📁 Creating necessary directories..."
|
||||
|
||||
mkdir -p backend/storage
|
||||
mkdir -p mineru/storage/uploads
|
||||
mkdir -p mineru/storage/processed
|
||||
|
||||
echo "✅ Directories created"
|
||||
}
|
||||
|
||||
# Function to check for required files
|
||||
check_required_files() {
|
||||
echo ""
|
||||
echo "🔍 Checking for required configuration files..."
|
||||
|
||||
local missing_files=()
|
||||
|
||||
if [ ! -f "docker-compose.yml" ]; then
|
||||
missing_files+=("docker-compose.yml")
|
||||
fi
|
||||
|
||||
if [ ! -f "DOCKER_COMPOSE_README.md" ]; then
|
||||
missing_files+=("DOCKER_COMPOSE_README.md")
|
||||
fi
|
||||
|
||||
if [ ${#missing_files[@]} -ne 0 ]; then
|
||||
echo "⚠️ Missing files: ${missing_files[*]}"
|
||||
echo "Please copy these files from the source environment:"
|
||||
echo " - docker-compose.yml"
|
||||
echo " - DOCKER_COMPOSE_README.md"
|
||||
echo " - backend/.env (if exists)"
|
||||
echo " - frontend/.env (if exists)"
|
||||
echo " - mineru/.env (if exists)"
|
||||
else
|
||||
echo "✅ All required configuration files found"
|
||||
fi
|
||||
}
|
||||
|
||||
# Function to show next steps
|
||||
show_next_steps() {
|
||||
echo ""
|
||||
echo "🎉 Import completed successfully!"
|
||||
echo ""
|
||||
echo "📋 Next Steps:"
|
||||
echo "=============="
|
||||
echo ""
|
||||
echo "1. Copy configuration files (if not already present):"
|
||||
echo " - docker-compose.yml"
|
||||
echo " - backend/.env"
|
||||
echo " - frontend/.env"
|
||||
echo " - mineru/.env"
|
||||
echo ""
|
||||
echo "2. Start the services:"
|
||||
echo " docker-compose up -d"
|
||||
echo ""
|
||||
echo "3. Verify services are running:"
|
||||
echo " docker-compose ps"
|
||||
echo ""
|
||||
echo "4. Test the endpoints:"
|
||||
echo " - Frontend: http://localhost:3000"
|
||||
echo " - Backend API: http://localhost:8000"
|
||||
echo " - Mineru API: http://localhost:8001"
|
||||
echo ""
|
||||
echo "5. View logs if needed:"
|
||||
echo " docker-compose logs -f [service-name]"
|
||||
}
|
||||
|
||||
# Function to handle compressed archive
|
||||
handle_compressed_archive() {
|
||||
if ls legal-doc-masker-images-*.tar.gz 1> /dev/null 2>&1; then
|
||||
echo "🗜️ Found compressed archive, extracting..."
|
||||
tar -xzf legal-doc-masker-images-*.tar.gz
|
||||
echo "✅ Archive extracted"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
check_docker
|
||||
handle_compressed_archive
|
||||
check_tar_files
|
||||
check_disk_space
|
||||
import_images
|
||||
verify_images
|
||||
show_imported_images
|
||||
create_directories
|
||||
check_required_files
|
||||
show_next_steps
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
|
|
@ -0,0 +1,46 @@
|
|||
FROM python:3.12-slim
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
# Install system dependencies
|
||||
RUN apt-get update && apt-get install -y \
|
||||
build-essential \
|
||||
libreoffice \
|
||||
wget \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
RUN pip install --upgrade pip
|
||||
RUN pip install uv
|
||||
|
||||
# Configure uv and install mineru
|
||||
ENV UV_SYSTEM_PYTHON=1
|
||||
RUN uv pip install --system -U "mineru[core]"
|
||||
|
||||
|
||||
|
||||
# Copy requirements first to leverage Docker cache
|
||||
# COPY requirements.txt .
|
||||
# RUN pip install huggingface_hub
|
||||
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
|
||||
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py
|
||||
|
||||
# RUN python download_models_hf.py
|
||||
RUN mineru-models-download -s modelscope -m pipeline
|
||||
|
||||
|
||||
|
||||
|
||||
# RUN pip install --no-cache-dir -r requirements.txt
|
||||
# RUN pip install -U magic-pdf[full]
|
||||
|
||||
|
||||
# Copy the rest of the application
|
||||
# COPY . .
|
||||
|
||||
# Create storage directories
|
||||
# RUN mkdir -p storage/uploads storage/processed
|
||||
|
||||
# Expose the port the app runs on
|
||||
EXPOSE 8000
|
||||
|
||||
# Command to run the application
|
||||
CMD ["mineru-api", "--host", "0.0.0.0", "--port", "8000"]
|
||||
|
|
@ -0,0 +1,27 @@
|
|||
version: '3.8'
|
||||
|
||||
services:
|
||||
mineru-api:
|
||||
build:
|
||||
context: .
|
||||
dockerfile: Dockerfile
|
||||
platform: linux/arm64
|
||||
ports:
|
||||
- "8001:8000"
|
||||
volumes:
|
||||
- ./storage/uploads:/app/storage/uploads
|
||||
- ./storage/processed:/app/storage/processed
|
||||
environment:
|
||||
- PYTHONUNBUFFERED=1
|
||||
- MINERU_MODEL_SOURCE=local
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 60s
|
||||
|
||||
volumes:
|
||||
uploads:
|
||||
processed:
|
||||
|
|
@ -1,10 +0,0 @@
|
|||
# Base dependencies
|
||||
pydantic-settings>=2.0.0
|
||||
python-dotenv==1.0.0
|
||||
watchdog==2.1.6
|
||||
requests==2.26.0
|
||||
|
||||
# Document processing
|
||||
python-docx>=0.8.11
|
||||
PyPDF2>=3.0.0
|
||||
pandas>=2.0.0
|
||||
Binary file not shown.
|
|
@ -0,0 +1,101 @@
|
|||
# 北京市第三中级人民法院民事判决书
|
||||
|
||||
(2022)京 03 民终 3852 号
|
||||
|
||||
上诉人(原审原告):北京丰复久信营销科技有限公司,住所地北京市海淀区北小马厂6 号1 号楼华天大厦1306 室。
|
||||
|
||||
法定代表人:郭东军,执行董事、经理。委托诉讼代理人:周大海,北京市康达律师事务所律师。委托诉讼代理人:王乃哲,北京市康达律师事务所律师。
|
||||
|
||||
被上诉人(原审被告):中研智创区块链技术有限公司,住所地天津市津南区双港镇工业园区优谷产业园5 号楼-1505。
|
||||
|
||||
法定代表人:王欢子,总经理。
|
||||
|
||||
委托诉讼代理人:魏鑫,北京市昊衡律师事务所律师。
|
||||
|
||||
1.上诉人北京丰复久信营销科技有限公司(以下简称丰复久信公司)因与被上诉人中研智创区块链技术有限公司(以下简称中研智创公司)服务合同纠纷一案,不服北京市朝阳区人民法院(2020)京0105 民初69754 号民事判决,向本院提起上诉。本院立案后,依法组成合议庭开庭进行了审理。上诉人丰复久信公司之委托诉讼代理人周大海、王乃哲,被上诉人中研智创公司之委托诉讼代理人魏鑫到庭参加诉讼。本案现已审理终结。
|
||||
|
||||
2.丰复久信公司上诉请求:1.撤销一审判决,发回重审或依法改判支持丰复久信公司一审全部诉讼请求;2.或在维持原判的同时判令中研智创公司向丰复久信公司返还 1000 万元款项,并赔偿丰复久信公司因此支付的律师费 220 万元;3.判令中研智创公司承担本案一审、二审全部诉讼费用。事实与理由:一、根据2019 年的政策导向,丰复久信公司的投资行为并无任何法律或政策瑕疵。丰复久信公司仅投资挖矿,没有购买比特币,故在当时国家、政府层面有相关政策支持甚至鼓励的前提下,一审法院仅凭“挖矿”行为就得出丰复久信公司扰乱金融秩序的结论,是错误的。二、一审法院没有全面、深入审查相关事实,且遗漏了最核心的数据调查工作。三、本案一审判决适用法律错误。涉案合同成立及履行期间并无合同无效的情形,当属有效。一审法院以挖矿活动耗能巨大、不利于我国产业结构调整为依据之一,作出合同无效的判决,实属牵强。最高人民法院发布的全国法院系统2020 年度优秀案例分析评选活动获奖名单中,由上海市第一中级人民法院刘江法官编写的“李圣艳、布兰登·斯密特诉闫向东、李敏等财产损害赔偿纠纷案— —比特币的法律属性及其司法救济”一案入选,该案同样发生在丰复久信公司与中研智创公司合同履行过程中,一审法院认定同时期同类型的涉案合同无效,与上述最高人民法院的优秀案例相悖。四、一审法院径行认定合同无效,未向丰复久信公司进行释明构成程序违法。
|
||||
|
||||
3.中研智创公司辩称,同意一审判决,不同意丰复久信公司的上诉请求。首先,一审法院曾在庭审中询问丰复久信公司关于机器返还的问题,一审法院进行了释明。其次,如二审法院对其该项上诉请求进行判决,会剥夺中研智创公司针对该部分请求再行上诉的权利。
|
||||
|
||||
4.丰复久信公司向一审法院起诉请求:1.中研智创公司交付278.1654976 个比特币,或者按照 2021 年 1 月 25 日比特币的价格交付9550812.36 美元;2.中研智创公司赔偿丰复久信公司服务期到期后占用微型存储空间服务器的损失(自2020 年7 月1日起至实际返还服务器时止,按照bitinfocharts 网站公布的相关日产比特币数据,计算应赔偿比特币数量或按照2021 年1 月25 日比特币的价格交付美元)。
|
||||
|
||||
5.一审法院查明事实:2019 年5 月6 日,丰复久信公司作为甲方(买方)与乙方(卖方)中研智创公司签订《计算机设备采购合
|
||||
|
||||
同》,约定:货物名称为计算机设备,型号规格及数量为T2T-30T 规格型号的微型存储空间服务器1542 台,单价5040/ 台合同金额为 7 771 680 元;交货期 2019 年 8 月 31 日前;交货方式为乙方自行送货到甲方所在地,并提供安装服务,运输工具及运费由乙方负责;交货地点北京;签订购货合同,设备安装完毕后一次性支付项目总货款;乙方提供货物的质量保证期为自交货验收结束之日起不少于十二个月(具体按清单要求);乙方交货前应对产品作出全面检查和对验收文件进行整理,并列出清单,作为甲方收货验收和使用的技术条件依据,检验的结果应随货物交甲方,甲方对乙方提供的货物在使用前进行调试时,乙方协助甲方一起调试,直到符合技术要求,甲方才做最终验收,验收时乙方必须在现场,验收完毕后作出验收结果报告,并经双方签字生效。
|
||||
|
||||
6.同日,丰复久信公司作为甲方(客户方)与乙方中研智创公司(服务方)签订《服务合同书》,约定:乙方同意就采购合同中的微型存储空间服务器向甲方提供特定服务;服务的内容包括质保、维修、服务器设备代为运行管理、代为缴纳服务器相关用度花费如电费等,详细内容见附件一;如果乙方在工作中因自身过错而发生任何错误或遗漏,应无条件更正,不另外收费,并对因此而对甲方造成的损失承担赔偿责任,赔偿额以本合同约定的服务费为限;若因甲方原因造成工作延误,将由甲方承担相应的损失;服务费总金额为2 228 320 元,甲乙双方一致同意项目服务费以人民币形式,于本合同签订后3 日内一次性支付;甲方可以提前10 个工作日以书面形式要求变更或增加所提供的服务,该等变更最终应由双方商定认可,其中包括与该等变更有关的任何费用调整等。合同后附附件一以表格形式列明,1.1542 台T2T-30T 微型存储空间服务器的质保、维修,时限12 个月,完成标准为完成甲方指定的运行量;2.服务器的日常运行管理,时限12 个月;3.代扣代缴电费;4.其他(空白)。
|
||||
|
||||
7.2019 年5 月,双方签订《增值服务协议》,约定:甲方将自有的T2T-30 规格的微型存储空间服务器1542 台委托乙方管理,由甲方向乙方支付一定管理费用;由乙方向甲方提供相关数据增值服务,对于增值服务产生的收益,扣除运行成本后,甲乙双方按照一定比例进行分配(备注:增值服务收益与微型存储空间服务的单位TH/s 相关,分配收益方式不限于人民币支付);甲方最多可将托管的云数据服务器的单位(TH/s)的 $50 \%$ 进行拆分,委托乙方代为出售,用户购买后的单位(TH/s)所产生的收益归购买用户所有,结算价格按照当天实际的市场价格进行结算,扣除市场销售成本后,实时转入甲方提供的收益地址;相关费用及支付,数据增值服务的电费成本,由甲方自行承担,按日计算,具体价格根据实际上架的数据中心的价格进行计算,由后续的《数据增值服务电费计价协议作为补充》;云数据服务器上架后2 天内,甲方应当向乙方预付498 196 元,用于预付部分云数据服务器的电费,后续每日的电费支出按当天24 时云数据服务器的增值部分的价值扣除,扣除完成后的增值服务收益部分当日划入甲方提供的收益地址;单台云数据服务器的放置建设成本为300 元,由甲方承担;数据增值服务产生的收益,按照 $7 \%$ 的比例分配给乙方,作为云数据服务器托管过程中乙方的管理和运营收益;数据增值服务产生的收益,当天进行结算,转入甲方提供的接收地址;乙方保证将按照厂家提供的环境标准(包括但不限于:电压、用电环境、温度、湿度、网络带宽、机房密度)使用、维护本合同项下云数据服务器;在正常使用过程中因不可归责于乙方的原因导致服务器损坏的,乙方不承担责任,乙方应协助甲方维修或更换服务器设备,相关费用由甲方承担;甲方云数据服务器根据机型按实测功耗计算电费,各服务器机型到现场进行测量功耗后,乙方告知甲方;经双方认可后,固定每月耗电量;未经甲方同意,乙方不得将托管的云数据服务器挪作他用,且不得将同类设备进行调换;如云数据服务器出现宕机或TH/s 为零的情况下,乙方必须 30分钟内对云数据服务器设备进行充气或其他处理,以保障甲方的利益;若检查发现系硬件原因无法解决,乙方负责将故障设备进行打包、返场维修,产生的费用由甲方承担;如因突发情况,如供电公司线路检修、机组维护、网络运营商意外断网等,导致数据服务器中断运行,乙方负责协调处理故障;在乙方可控范围内,甲方云数据服务器中断运行时间原则上每月不超过48 小时,如停电超出约定时间,乙方将在合同约定的管理时间基础上延长服务时间,并承担服务器的机器放置费用;乙方因自身原因导致托管的甲方服务器损害或者灭失的,应当向甲方承担赔偿责任;合同期限为 2019 年 6 月 30 日至 2020 年 6 月 30日。
|
||||
|
||||
8.上述合同签订后,中研智创公司购买并委托第三方矿场实际运营“矿机”。
|
||||
|
||||
019 年7 月15 日,甲方中研智创公司与乙方成都毛球华数科技合伙企业(有限合伙)签订两份《矿机托管服务合同(包运维)》,约定:甲方将其所拥有的“矿机”置于乙方算力服务中心,乙方对甲方矿机提供运维管理服务,“矿机”名称为芯动T2T,数量分别为1350 台、502 台,全新,算力26T、30T,功耗2200W;托管期限以甲方矿机到达乙方算力服务中心并开始运行之日起算,分别暂定自2019 年7 月15 日至2019 年10月 25 日止、自 2019 年 6 月 28 日至 2019 年 10 月 25 日止,乙方算力服务中心地址分别为四川省凉山州木里县水洛乡、沙湾乡;托管服务费计量方式均为,按照乙方上架运行机型实测功耗进行核算耗电量 $+ 3 \%$ 电损,电费单价按0.239 元/度计算;甲方应在本合同签订之日起两个工作日内,向乙方支付半个月托管服务费作为本合同履约保证金,分别为人民币251 242 元、
|
||||
|
||||
94000 元,履约保证金可用于抵扣协议最后一个结算周期的托管服务费,托管服务费支付周期为每半月支付。
|
||||
|
||||
10. 合同实际履行过程中,2019 年5 月20 日,丰复久信公司向中研智创公司支付1000 万元,用途备注为货款。中研智创公司曾于2019 年向丰复久信公司交付了18.3463 个比特币。此后未再进行比特币交付,双方故此产生争议,并产生大量微信沟通记录。
|
||||
|
||||
微信聊天记录中,关于核实设备及比特币产量情况。2019年11 月8 日,丰复久信公司称,“我们是应该自己有个矿池账号了吧!这样是不是我们也可远程监控管理”“现在可以登不?”,中研智创公司称“可以的”“之前走的是另外的体系,我们看看怎样把矿池的账号直接对接给你这边吧”“现在不行,因为所有机器都是统一管理,放在同一个大的账号里面,需要切割出来”“两天吧,周一可以给你搞好”。11 月12 日,中研智创公司微信称“郭总,请你升级一下注册一下普惠矿场 App,挖矿收益以后都在这里查看和提取,原来的APP 不再更新了”,丰复久信公司回复称“我不清楚你们要干什么,现在不能你们让我干嘛我干嘛,基本信任已经不存在了。所以,任何一个动作,我现在都需要做尽调”。11 月25 日,双方微信群聊天记录显示,中研智创公司员工介绍,“这是我们在四川木里那边的现场管理人员”,此后双方互相交换了联系电话,并沟通丰复久信公司应以何种交通方式前往四川木里。11 月27 日,丰复久信公司称“我到了矿场,我要看下后台,需要个链接”,中研智创公司给出网页链接称“这是矿场的链接”。该链接地址网站名称为“币印”,现点击链接显示“该观察者链接已失效,请重新创建”。12 月7 日,丰复久信公司称“什么时候可以提供你的原始资料?抓紧核实!”。12 月19 日,双方沟通矿机搬动情况,中研智创公司称“在昭通这边,等通知进场”,2020 年3 月20 日称矿机在“乐山这边”,丰复久信公司称“这几个月挖了多少币了?郭总的软件登不上去没有信息!什么时间通个电话”。中研智创公司未在微信聊天中明确答复。
|
||||
|
||||
12. 关于丰复久信公司向中研智创公司催要比特币情况。微信聊天记录显示,2020 年4 月9 日,丰复久信公司询问中研智创公司,“我想知道一下我的机器到底挖了几个币,就这么难吗?”,中研智创公司回复“放心吧,我已经打电话给唐宇了,他会安排老潘落实,我今天没有联系到潘,我答应你的事情算数的”。
|
||||
|
||||
4 月 10 日、4 月 17 日、4 月 30 日、6 月 22 日、6 月 23 日、6 月27 日,丰复久信公司分别再次询问称“还是这情况呀,APP 还是 $0 ^ { \mathfrak { s } }$ “咱们这事,您准备怎么收场啊,币也不给,钱也不还,算力也不卖,各种理由,您是逼我报官了吧”等,中研智创公司未回复,6 月28 日回复称“稍晚一会给你打电话”。
|
||||
|
||||
13. 关于比特币等虚拟货币及“挖矿”活动的风险防范、整治,国家相关部门曾多次发布《通知》《公告》《风险提示》等政策文件:
|
||||
|
||||
1.2013 年12 月,中国人民银行等五部委发布《关于防范比特币风险的通知》指出,比特币不是由货币当局发行,不具有法偿性与强制性等货币属性,并不是真正意义的货币。从性质上看,比特币应当是一种特定的虚拟商品,不具有与货币等同的法律地位,不能且不应作为货币在市场上流通使用。各金融机构和支付机构不得以比特币为产品或服务定价,不得买卖或作为中央对手买卖比特币,不得承保与比特币相关的保险业务或将比特币纳入保险责任范围,不得直接或间接为客户提供其他与比特币相关的服务。
|
||||
|
||||
2.2017 年9 月,《中国人民银行、中央网信办、工业和信息化部、工商总局、银监会、证监会、保监会、关于防范代币发行融资风险的公告》,再次强调比特币不具有法偿性与强制性等货币属性,不具有与货币等同的法律地位,不能也不应作为货币在市场上流通使用,并提示,代币发行融资与交易存在多重风险,包括虚假资产风险、经营失败风险、投资炒作风险等,投资者须自行承担投资风险,希望广大投资者谨防上当受骗。
|
||||
|
||||
3.2018 年8 月《中国银行保险监督管理委员会、中央网络安全和信息化领导小组办公室、公安部、中国人民银行、国家市场监督管理总局关于防范以“虚拟货币”“区块链”名义进行非法集资的风险提示》也再次明确作出风险提示。
|
||||
|
||||
4.2021 年5 月18 日,中国互联网金融协会、中国银行业协会、中国支付清算协会联合发布《关于防范虚拟货币交易炒作风险的公告》,再次强调正确认识虚拟货币及相关业务活动的本质属性,有关机构不得开展与虚拟货币相关的业务,并特别指出,消费者要提高风险防范意识,“从我国现有司法实践看,虚拟货币交易合同不受法律保护,投资交易造成的后果和引发的损失由相关方自行承担”。
|
||||
|
||||
5.2021 年9 月3 日,国家发展和改革委员会等部门发布《关于整治虚拟货币“挖矿”活动的通知》(发改运行〔2021〕1283号)指出,“虚拟货币‘挖矿’活动指通过专用‘矿机’计算生产虚拟货币的过程,能源消耗和碳排放量大,对国民经济贡献度低,对产业发展、科技进步等带动作用有限,加之虚拟货币生产、交易环节衍生的风险越发突出,其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响。整治虚拟货币‘挖矿’活动对促进我国产业结构优化、推动节能减排、如期实现碳达峰、碳中和目标具有重要意义。”“严禁投资建设增量项目,禁止以任何名义发展虚拟货币‘挖矿’项目;加快有序退出存量项目。”
|
||||
|
||||
“严格执行有关法律法规和规章制度,严肃查处整治各地违规虚拟货币‘挖矿’活动”。
|
||||
|
||||
6.2021 年9 月15 日,中国人民银行、中央网信办、最高人民法院等部门联合发布《关于进一步防范和处置虚拟货币交易炒作风险的通知》指出,虚拟货币相关业务活动属于非法金融活动,境外虚拟货币交易所通过互联网向我国境内居民提供服务同样属于非法金融活动,并再次提示,参与虚拟货币投资交易活动存在法律风险,任何法人、非法人组织和自然人投资虚拟货币及相关衍生品,违背公序良俗的,相关民事法律行为无效,由此引发的损失由其自行承担。
|
||||
|
||||
14. 上述事实,有丰复久信公司提交的《计算机设备采购合同》《服务合同书》《增值服务协议》、银行转账记录、网页截图、微信聊天记录,有中研智创公司提交的《矿机托管服务合同(包运维)》、微信聊天记录等证据及当事人陈述等在案佐证。
|
||||
|
||||
一审法院认为,本案事实发生于民法典实施前,根据《最高人民法院关于适用<中华人民共和国民法典>时间效力的若干规定》,民法典施行前的法律事实引起的民事纠纷案件,适用当时的法律、司法解释的规定,因此本案应适用《中华人民共和国合同法》的相关规定。
|
||||
|
||||
15. 根据2021 年9 月3 日国家发展和改革委员会等部门《关于整治虚拟货币“挖矿”活动的通知》,虚拟货币“挖矿”活动指通过专用“矿机”计算生产虚拟货币的过程。本案中,丰复久信公司与中研智创公司签订《计算机设备采购合同》《服务合同书》《增值服务协议》,约定丰复久信公司委托中研智创公司采购微型存储空间服务器,并由中研智创公司对计算机服务器进行管理,丰复久信公司向中研智创公司支付管理费用,中研智创公司提供相关数据增值服务,支付增值服务收益。诉讼中,中研智创公司陈述,其按照三份合同约定代丰复久信公司购买了“矿机”,并与第三方公司即“矿场”签订委托合同,将“矿机”在“矿场”运行,并曾向丰复久信公司交付过18.3463 个比特币。根据上述履约过程及三份合同约定的主要内容,双方的交易模式实际上即为丰复久信公司委托中研智创公司购买并管理专用“矿机”计算生产比特币的“挖矿”行为。三份合同系有机整体,合同目的均系双方为了最终进行“挖矿”活动而签订,双方成立合同关系。该比特币“挖矿”的交易模式,属于国家相关行政机关管控范围,需要严格遵守相关法律法规和规章制度。
|
||||
|
||||
《中华人民共和国合同法》第七条规定,“当事人订立、履行合同,应当遵守法律、行政法规,尊重社会公德,不得扰乱社会经济秩序,损害社会公共利益”;第五十二条规定,“有下列情形之一的,合同无效:(一)一方以欺诈、胁迫的手段订立合同,损害国家利益;(二)恶意串通,损害国家、集体或者第三人利益;(三)以合法形式掩盖非法目的;(四)损害社会公共利益;(五)违反法律、行政法规的强制性规定。” 社会公共利益一般指关系到全体社会成员或者社会不特定多数人的利益,主要包括社会公共秩序以及社会善良风俗等,是明确国家和个人权利的行使边界、判断民事法律行为正当性和合法性的重要标准之一。能源安全、金融安全、经济安全等都是国家安全的重要组成部分,防范化解相关风险、深化整治相关市场乱象,均关系到我国的产业结构优化、金融秩序稳定、社会经济平稳运行和高质量发展,故社会经济秩序、金融秩序等均涉及社会公共利益。
|
||||
|
||||
17. 根据上述虚拟货币相关《通知》《公告》《风险提示》等文件,本案涉及的比特币为网络虚拟货币,并非国家有权机关发行的法定货币,不具有与法定货币等同的法律地位,不具有法偿性,不应且不能作为货币在市场上流通使用,相关部门多次发布《风险公告》《通知》等文件,提示消费者提高风险防范意识,投资交易虚拟货币造成的后果和引发的损失由相关方自行承担。且本案的交易模式系“挖矿”活动,随着虚拟货币交易的发展,“挖矿”行为的危害日渐凸显。
|
||||
|
||||
“挖矿”活动能源消耗和碳排放量大,不利于我国产业结构优化、节能减排,不利于我国实现碳达峰、碳中和目标。加之虚拟货币相关交易活动无真实价值支撑,价格极易被操纵,“挖矿”行为也进一步衍生虚假资产风险、经营失败风险、投资炒作风险等相关金融风险,危害外汇管理秩序、金融秩序,甚至容易引发违法犯罪活动、影响社会稳定。正因“挖矿”行为危害大、风险高,其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响,相关政策明确拟将虚拟货币“挖矿”活动增补列入《产业结构调整指导目录(2019 年本)》“淘汰类”目录,要求采取有效措施,全面整治虚拟货币“挖矿”活动。本案中,丰复久信公司和中研智创公司在明知“挖矿”及比特币交易存在风险,且相关部门明确禁止比特币相关交易的情况下,仍然签订协议形成委托“挖矿”关系。“挖矿”活动及虚拟货币的相关交易行为存在上文论述的诸多风险和危害,干扰了正常的金融秩序、经济发展秩序,故该“挖矿”合同损害社会公共利益,应属无效。
|
||||
|
||||
18. 《中华人民共和国合同法》第五十八条规定,“合同无效或者被撤销后,因该合同取得的财产,应当予以返还;不能返还或者没有必要返还的,应当折价补偿。有过错的一方应当赔偿对方因此所受到的损失,双方都有过错的,应当各自承担相应的责任。”本案中,丰复久信公司第一项诉讼请求系基于合同项下权利义务要求中研智创公司支付比特币收益,因“挖矿” 合同自始无效,丰复久信公司通过履行无效合同主张获得的利益不应受到法律保护,对其相应诉讼请求,一审法院不予支持。丰复久信公司第二项诉讼请求主张占用“矿机”设备期间的比特币损失,该损失系丰复久信公司基于持续利用“矿机”从事 “挖矿”活动产生比特币的损失,不应受到法律保护,对其相应诉讼请求,一审法院亦不予支持。
|
||||
|
||||
19. 关于本案中“矿机”的处理,因现相关计算机设备仍由中研智创公司保管,但诉讼中,丰复久信公司明确表示其将另行主张,不在本案中要求处理“矿机”返还问题。故一审法院在本案中不再予以处理。但同时需要提醒双方当事人,均应遵守国家相关法律规定和产业政策,案涉计算机等设备不得继续用于比特币等虚拟货币“挖矿”活动,当事人应防范虚拟货币交易风险,自觉维护市场秩序和社会公共利益。
|
||||
|
||||
20. 综上,一审法院判决驳回北京丰复久信营销科技有限公司的全部诉讼请求。
|
||||
|
||||
21. 二审中,各方均未提交新的证据。本院对一审查明的事实予以确认。
|
||||
|
||||
22. 本院认为,比特币及相关经济活动是新型、复杂的,我国监管机构对比特币生产、交易等方面的监管措施建立在对其客观认识的基础上,并不断完善。本案双方挖矿合同从签订至履行后发生争议,纠纷延续至今,亦处于这一过程中。对合同效力的认定,应建立在当下对挖矿活动的客观认识基础上。
|
||||
|
||||
23. 2013 年,中国人民银行等五部委发布通知,禁止金融机构对比特币进行定价,不得买卖或作为中央对手买卖比特币,不得直接或间接为客户提供其他与比特币相关的服务。2017 年,中国人民银行等七部门联合发布《关于防范代币发行融资风险的公告》进一步提出任何所谓的代币融资交易平台不得从事法定货币与代币、“虚拟货币”相互之间的兑换业务,不得买卖或作为中央对手方买卖代币或“虚拟货币”,不得为代币或“虚拟货币”提供定价、信息中介等服务。上述两个文件实质上禁止了比特币在我国相关平台的兑付、交易。2021 年,中国人民银行等部门《关于进一步防范和处置虚拟货币交易炒作风险的通知》显示,虚拟货币交易炒作活动扰乱经济金融秩序,滋生赌博、非法集资、诈骗、传销、洗钱等违法犯罪活动,严重危害人民群众财产安全和国家金融安全。
|
||||
|
||||
24. 2021 年9 月3 日国家发展和改革委员会等部门《关于整治虚拟货币“挖矿”活动的通知》显示,虚拟货币挖矿活动能源消耗和碳排放量大,对国民经济贡献度低,对产业发展、科技进步等带动作用有限,加之虚拟货币生产、交易环节衍生的风险越发突出,其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响。故以电力资源、碳排放量为代价的“挖矿”行为,与经济社会高质量发展和碳达峰、碳中和目标相悖,与公共利益相悖。
|
||||
|
||||
25. 丰复久信公司主张双方合同签订时并无明确的法律规范禁止比特币“挖矿”活动,故应保障当事人的信赖利益,认定涉案合同有效一节,本院认为,当事人之间基于投资目的进行“挖矿”,并通过电子方式转让、储存以及交易的行为,实际经济追求是为了通过比特币与法定货币的兑换直接获取法定货币体系下的利益。丰复久信公司作为营利法人,在庭审中表示投资比特币仅系持有,本院难以采信。在监管机构禁止了比特币在我国相关平台的兑付、交易,且数次提示比特币投资风险的情况下,双方为获取高额利润,仍从事“挖矿”行为,现丰复久信公司以保障其信赖利益主张合同有效依据不足,本院不予采纳。
|
||||
|
||||
26. 综上,相关部门整治虚拟货币“挖矿”活动、认定虚拟货币相关业务活动属于非法金融活动,有利于保障我国发展利益和金融安全。从“挖矿”行为的高能耗以及比特币交易活动对国家金融秩序和社会秩序的影响来看,一审法院认定涉案合同无效是正确的。双方作为社会主义市场经济主体,既应遵守市场经济规则,亦应承担起相应的社会责任,推动经济社会高质量发展、可持续发展。
|
||||
|
||||
27. 关于合同无效后的返还问题,一审法院未予处理,双方可另行解决。
|
||||
|
||||
28. 综上所述,丰复久信公司的上诉请求不能成立,应予驳回;一审判决并无不当,应予维持。依照《中华人民共和国民事诉讼法》第一百七十七条第一款第一项规定,判决如下:
|
||||
|
||||
驳回上诉,维持原判。
|
||||
|
||||
二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
|
||||
|
||||
29. 本判决为终审判决。
|
||||
|
||||
审 判 长 史晓霞
审 判 员 邓青菁
审 判 员 李 淼
二〇二二年七月七日
法 官 助 理 黎 铧
书 记 员 郑海兴
|
||||
Binary file not shown.
|
|
@ -0,0 +1,43 @@
|
|||
# 北京市第三中级人民法院民事判决书
|
||||
|
||||
(2022)京 03 民终 3852 号
|
||||
|
||||
上诉人(原审原告):北京丰复久信营销科技有限公司,住所地北京市海淀区北小马厂6 号1 号楼华天大厦1306 室。
|
||||
|
||||
法定代表人:郭东军,执行董事、经理。委托诉讼代理人:周大海,北京市康达律师事务所律师。委托诉讼代理人:王乃哲,北京市康达律师事务所律师。
|
||||
|
||||
被上诉人(原审被告):中研智创区块链技术有限公司,住所地天津市津南区双港镇工业园区优谷产业园5 号楼-1505。
|
||||
|
||||
法定代表人:王欢子,总经理。
|
||||
|
||||
委托诉讼代理人:魏鑫,北京市昊衡律师事务所律师。
|
||||
|
||||
1.上诉人北京丰复久信营销科技有限公司(以下简称丰复久信公司)因与被上诉人中研智创区块链技术有限公司(以下简称中研智创公司)服务合同纠纷一案,不服北京市朝阳区人民法院(2020)京0105 民初69754 号民事判决,向本院提起上诉。本院立案后,依法组成合议庭开庭进行了审理。上诉人丰复久信公司之委托诉讼代理人周大海、王乃哲,被上诉人中研智创公司之委托诉讼代理人魏鑫到庭参加诉讼。本案现已审理终结。
|
||||
|
||||
2.丰复久信公司上诉请求:1.撤销一审判决,发回重审或依法改判支持丰复久信公司一审全部诉讼请求;2.或在维持原判的同时判令中研智创公司向丰复久信公司返还 1000 万元款项,并赔偿丰复久信公司因此支付的律师费 220 万元;3.判令中研智创公司承担本案一审、二审全部诉讼费用。事实与理由:一、根据2019 年的政策导向,丰复久信公司的投资行为并无任何法律或政策瑕疵。丰复久信公司仅投资挖矿,没有购买比特币,故在当时国家、政府层面有相关政策支持甚至鼓励的前提下,一审法院仅凭“挖矿”行为就得出丰复久信公司扰乱金融秩序的结论,是错误的。二、一审法院没有全面、深入审查相关事实,且遗漏了最核心的数据调查工作。三、本案一审判决适用法律错误。涉案合同成立及履行期间并无合同无效的情形,当属有效。一审法院以挖矿活动耗能巨大、不利于我国产业结构调整为依据之一,作出合同无效的判决,实属牵强。最高人民法院发布的全国法院系统2020 年度优秀案例分析评选活动获奖名单中,由上海市第一中级人民法院刘江法官编写的“李圣艳、布兰登·斯密特诉闫向东、李敏等财产损害赔偿纠纷案— —比特币的法律属性及其司法救济”一案入选,该案同样发生在丰复久信公司与中研智创公司合同履行过程中,一审法院认定同时期同类型的涉案合同无效,与上述最高人民法院的优秀案例相悖。四、一审法院径行认定合同无效,未向丰复久信公司进行释明构成程序违法。
|
||||
|
||||
3.中研智创公司辩称,同意一审判决,不同意丰复久信公司的上诉请求。首先,一审法院曾在庭审中询问丰复久信公司关于机器返还的问题,一审法院进行了释明。其次,如二审法院对其该项上诉请求进行判决,会剥夺中研智创公司针对该部分请求再行上诉的权利。
|
||||
|
||||
4.丰复久信公司向一审法院起诉请求:1.中研智创公司交付278.1654976 个比特币,或者按照 2021 年 1 月 25 日比特币的价格交付9550812.36 美元;2.中研智创公司赔偿丰复久信公司服务期到期后占用微型存储空间服务器的损失(自2020 年7 月1日起至实际返还服务器时止,按照bitinfocharts 网站公布的相关日产比特币数据,计算应赔偿比特币数量或按照2021 年1 月25 日比特币的价格交付美元)。
|
||||
|
||||
5.一审法院查明事实:2019 年5 月6 日,丰复久信公司作为甲方(买方)与乙方(卖方)中研智创公司签订《计算机设备采购合
|
||||
|
||||
同》,约定:货物名称为计算机设备,型号规格及数量为T2T-30T 规格型号的微型存储空间服务器1542 台,单价5040 元/台,合同金额为 7 771 680 元;交货期 2019 年 8 月 31 日前;交货方式为乙方自行送货到甲方所在地,并提供安装服务,运输工具及运费由乙方负责;交货地点北京;签订购货合同,设备安装完毕后一次性支付项目总货款;乙方提供货物的质量保证期为自交货验收结束之日起不少于十二个月(具体按清单要求);乙方交货前应对产品作出全面检查和对验收文件进行整理,并列出清单,作为甲方收货验收和使用的技术条件依据,检验的结果应随货物交甲方,甲方对乙方提供的货物在使用前进行调试时,乙方协助甲方一起调试,直到符合技术要求,甲方才做最终验收,验收时乙方必须在现场,验收完毕后作出验收结果报告,并经双方签字生效。
|
||||
|
||||
6.同日,丰复久信公司作为甲方(客户方)与乙方中研智创公司(服务方)签订《服务合同书》,约定:乙方同意就采购合同中的微型存储空间服务器向甲方提供特定服务;服务的内容包括质保、维修、服务器设备代为运行管理、代为缴纳服务器相关用度花费如电费等,详细内容见附件一;如果乙方在工作中因自身过错而发生任何错误或遗漏,应无条件更正,不另外收费,并对因此而对甲方造成的损失承担赔偿责任,赔偿额以本合同约定的服务费为限;若因甲方原因造成工作延误,将由甲方承担相应的损失;服务费总金额为2 228 320 元,甲乙双方一致同意项目服务费以人民币形式,于本合同签订后3 日内一次性支付;甲方可以提前10 个工作日以书面形式要求变更或增加所提供的服务,该等变更最终应由双方商定认可,其中包括与该等变更有关的任何费用调整等。合同后附附件一以表格形式列明,1.1542 台T2T-30T 微型存储空间服务器的质保、维修,时限12 个月,完成标准为完成甲方指定的运行量;2.服务器的日常运行管理,时限12 个月;3.代扣代缴电费;4.其他(空白)。
|
||||
|
||||
24. 2021 年9 月3 日国家发展和改革委员会等部门《关于整治虚拟货币“挖矿”活动的通知》显示,虚拟货币挖矿活动能源消耗和碳排放量大,对国民经济贡献度低,对产业发展、科技进步等带动作用有限,加之虚拟货币生产、交易环节衍生的风险越发突出,其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响。故以电力资源、碳排放量为代价的“挖矿”行为,与经济社会高质量发展和碳达峰、碳中和目标相悖,与公共利益相悖。
|
||||
|
||||
26. 综上,相关部门整治虚拟货币“挖矿”活动、认定虚拟货币相关业务活动属于非法金融活动,有利于保障我国发展利益和金融安全。从“挖矿”行为的高能耗以及比特币交易活动对国家金融秩序和社会秩序的影响来看,一审法院认定涉案合同无效是正确的。双方作为社会主义市场经济主体,既应遵守市场经济规则,亦应承担起相应的社会责任,推动经济社会高质量发展、可持续发展。
|
||||
|
||||
27. 关于合同无效后的返还问题,一审法院未予处理,双方可另行解决。
|
||||
|
||||
28. 综上所述,丰复久信公司的上诉请求不能成立,应予驳回;一审判决并无不当,应予维持。依照《中华人民共和国民事诉讼法》第一百七十七条第一款第一项规定,判决如下:
|
||||
|
||||
驳回上诉,维持原判。
|
||||
|
||||
二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
|
||||
|
||||
29. 本判决为终审判决。
|
||||
|
||||
审 判 长 史晓霞
审 判 员 邓青菁
审 判 员 李 淼
二〇二二年七月七日
法 官 助 理 黎 铧
书 记 员 郑海兴
|
||||
|
|
@ -0,0 +1,110 @@
|
|||
#!/bin/bash
|
||||
|
||||
# Unified Docker Compose Setup Script
|
||||
# This script helps set up the unified Docker Compose environment
|
||||
|
||||
set -e
|
||||
|
||||
echo "🚀 Setting up Unified Docker Compose Environment"
|
||||
|
||||
# Function to check if Docker is running
|
||||
check_docker() {
|
||||
if ! docker info > /dev/null 2>&1; then
|
||||
echo "❌ Docker is not running. Please start Docker and try again."
|
||||
exit 1
|
||||
fi
|
||||
echo "✅ Docker is running"
|
||||
}
|
||||
|
||||
# Function to stop existing individual services
|
||||
stop_individual_services() {
|
||||
echo "🛑 Stopping individual Docker Compose services..."
|
||||
|
||||
if [ -f "backend/docker-compose.yml" ]; then
|
||||
echo "Stopping backend services..."
|
||||
(cd backend && docker-compose down 2>/dev/null) || true
|
||||
fi
|
||||
|
||||
if [ -f "frontend/docker-compose.yml" ]; then
|
||||
echo "Stopping frontend services..."
|
||||
(cd frontend && docker-compose down 2>/dev/null) || true
|
||||
fi
|
||||
|
||||
if [ -f "mineru/docker-compose.yml" ]; then
|
||||
echo "Stopping mineru services..."
|
||||
(cd mineru && docker-compose down 2>/dev/null) || true
|
||||
fi
|
||||
|
||||
echo "✅ Individual services stopped"
|
||||
}
|
||||
|
||||
# Function to create necessary directories
|
||||
create_directories() {
|
||||
echo "📁 Creating necessary directories..."
|
||||
|
||||
mkdir -p backend/storage
|
||||
mkdir -p mineru/storage/uploads
|
||||
mkdir -p mineru/storage/processed
|
||||
|
||||
echo "✅ Directories created"
|
||||
}
|
||||
|
||||
# Function to check if unified docker-compose.yml exists
|
||||
check_unified_compose() {
|
||||
if [ ! -f "docker-compose.yml" ]; then
|
||||
echo "❌ Unified docker-compose.yml not found in current directory"
|
||||
echo "Please run this script from the project root directory"
|
||||
exit 1
|
||||
fi
|
||||
echo "✅ Unified docker-compose.yml found"
|
||||
}
|
||||
|
||||
# Function to build and start services
|
||||
start_unified_services() {
|
||||
echo "🔨 Building and starting unified services..."
|
||||
|
||||
# Build all services
|
||||
docker-compose build
|
||||
|
||||
# Start services
|
||||
docker-compose up -d
|
||||
|
||||
echo "✅ Unified services started"
|
||||
}
|
||||
|
||||
# Function to check service status
|
||||
check_service_status() {
|
||||
echo "📊 Checking service status..."
|
||||
|
||||
docker-compose ps
|
||||
|
||||
echo ""
|
||||
echo "🌐 Service URLs:"
|
||||
echo "Frontend: http://localhost:3000"
|
||||
echo "Backend API: http://localhost:8000"
|
||||
echo "Mineru API: http://localhost:8001"
|
||||
echo ""
|
||||
echo "📝 To view logs: docker-compose logs -f [service-name]"
|
||||
echo "📝 To stop services: docker-compose down"
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
echo "=========================================="
|
||||
echo "Unified Docker Compose Setup"
|
||||
echo "=========================================="
|
||||
|
||||
check_docker
|
||||
check_unified_compose
|
||||
stop_individual_services
|
||||
create_directories
|
||||
start_unified_services
|
||||
check_service_status
|
||||
|
||||
echo ""
|
||||
echo "🎉 Setup complete! Your unified Docker environment is ready."
|
||||
echo "Check the DOCKER_COMPOSE_README.md for more information."
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
|
|
@ -1,31 +0,0 @@
|
|||
# settings.py
|
||||
|
||||
from pydantic_settings import BaseSettings
|
||||
from typing import Optional
|
||||
|
||||
class Settings(BaseSettings):
|
||||
# Storage paths
|
||||
OBJECT_STORAGE_PATH: str = "/Users/tigeren/Dev/digisky/legal-doc-masker/src_folder"
|
||||
TARGET_DIRECTORY_PATH: str = "/Users/tigeren/Dev/digisky/legal-doc-masker/target_folder"
|
||||
|
||||
# Ollama API settings
|
||||
OLLAMA_API_URL: str = "https://api.ollama.com"
|
||||
OLLAMA_API_KEY: str = ""
|
||||
OLLAMA_MODEL: str = "llama2"
|
||||
|
||||
# File monitoring settings
|
||||
MONITOR_INTERVAL: int = 5
|
||||
|
||||
# Logging settings
|
||||
LOG_LEVEL: str = "INFO"
|
||||
LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
|
||||
LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
|
||||
LOG_FILE: str = "app.log"
|
||||
|
||||
class Config:
|
||||
env_file = ".env"
|
||||
env_file_encoding = "utf-8"
|
||||
extra = "allow"
|
||||
|
||||
# Create settings instance
|
||||
settings = Settings()
|
||||
17
src/main.py
17
src/main.py
|
|
@ -1,17 +0,0 @@
|
|||
from config.logging_config import setup_logging
|
||||
|
||||
def main():
|
||||
# Setup logging first
|
||||
setup_logging()
|
||||
|
||||
from services.file_monitor import FileMonitor
|
||||
from config.settings import settings
|
||||
|
||||
# Initialize the file monitor
|
||||
file_monitor = FileMonitor(settings.OBJECT_STORAGE_PATH, settings.TARGET_DIRECTORY_PATH)
|
||||
|
||||
# Start monitoring the directory for new files
|
||||
file_monitor.start_monitoring()
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -1,18 +0,0 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import Any
|
||||
|
||||
class DocumentProcessor(ABC):
|
||||
@abstractmethod
|
||||
def read_content(self) -> str:
|
||||
"""Read document content"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def process_content(self, content: str) -> str:
|
||||
"""Process document content"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def save_content(self, content: str) -> None:
|
||||
"""Save processed content"""
|
||||
pass
|
||||
|
|
@ -1,5 +0,0 @@
|
|||
from models.processors.txt_processor import TxtDocumentProcessor
|
||||
from models.processors.docx_processor import DocxDocumentProcessor
|
||||
from models.processors.pdf_processor import PdfDocumentProcessor
|
||||
|
||||
__all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor']
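This commit does not show how one of these processors gets chosen for an incoming file. As a rough sketch of such a dispatcher, keyed on the file extension (the `create_processor` name and the error handling are illustrative assumptions, not code from the project):

```python
import os

from models.processors.txt_processor import TxtDocumentProcessor
from models.processors.docx_processor import DocxDocumentProcessor
from models.processors.pdf_processor import PdfDocumentProcessor

# Map of supported extensions to their DocumentProcessor implementations.
_PROCESSORS = {
    '.txt': TxtDocumentProcessor,
    '.docx': DocxDocumentProcessor,
    '.pdf': PdfDocumentProcessor,
}

def create_processor(input_path: str, output_path: str):
    """Return a processor instance chosen by the input file's extension."""
    ext = os.path.splitext(input_path)[1].lower()
    try:
        return _PROCESSORS[ext](input_path, output_path)
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}")
```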
|
||||
|
|
@ -1,20 +0,0 @@
|
|||
import docx
|
||||
from models.document_processor import DocumentProcessor
|
||||
|
||||
class DocxDocumentProcessor(DocumentProcessor):
|
||||
def __init__(self, input_path: str, output_path: str):
|
||||
self.input_path = input_path
|
||||
self.output_path = output_path
|
||||
|
||||
def read_content(self) -> str:
|
||||
doc = docx.Document(self.input_path)
|
||||
return '\n'.join([paragraph.text for paragraph in doc.paragraphs])
|
||||
|
||||
def process_content(self, content: str) -> str:
|
||||
# Implementation for processing docx content
|
||||
return content
|
||||
|
||||
def save_content(self, content: str) -> None:
|
||||
doc = docx.Document()
|
||||
doc.add_paragraph(content)
|
||||
doc.save(self.output_path)
|
||||
|
|
@ -1,20 +0,0 @@
|
|||
import PyPDF2
|
||||
from models.document_processor import DocumentProcessor
|
||||
|
||||
class PdfDocumentProcessor(DocumentProcessor):
|
||||
def __init__(self, input_path: str, output_path: str):
|
||||
self.input_path = input_path
|
||||
self.output_path = output_path
|
||||
|
||||
def read_content(self) -> str:
|
||||
with open(self.input_path, 'rb') as file:
|
||||
pdf_reader = PyPDF2.PdfReader(file)
|
||||
return ' '.join([page.extract_text() for page in pdf_reader.pages])
|
||||
|
||||
def process_content(self, content: str) -> str:
|
||||
# Implementation for processing PDF content
|
||||
return content
|
||||
|
||||
def save_content(self, content: str) -> None:
|
||||
# Implementation for saving as PDF
|
||||
pass
|
||||
|
|
@ -1,18 +0,0 @@
|
|||
from models.document_processor import DocumentProcessor
|
||||
|
||||
class TxtDocumentProcessor(DocumentProcessor):
|
||||
def __init__(self, input_path: str, output_path: str):
|
||||
self.input_path = input_path
|
||||
self.output_path = output_path
|
||||
|
||||
def read_content(self) -> str:
|
||||
with open(self.input_path, 'r', encoding='utf-8') as file:
|
||||
return file.read()
|
||||
|
||||
def process_content(self, content: str) -> str:
|
||||
# Implementation for processing text content
|
||||
return content
|
||||
|
||||
def save_content(self, content: str) -> None:
|
||||
with open(self.output_path, 'w', encoding='utf-8') as file:
|
||||
file.write(content)
|
||||
Binary file not shown.
|
|
@ -1,24 +0,0 @@
|
|||
import logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class FileMonitor:
|
||||
def __init__(self, directory, callback):
|
||||
self.directory = directory
|
||||
self.callback = callback
|
||||
|
||||
def start_monitoring(self):
|
||||
import time
|
||||
import os
|
||||
|
||||
already_seen = set(os.listdir(self.directory))
|
||||
while True:
|
||||
time.sleep(1) # Check every second
|
||||
current_files = set(os.listdir(self.directory))
|
||||
new_files = current_files - already_seen
|
||||
|
||||
for new_file in new_files:
|
||||
logger.info(f"monitor: new file found: {new_file}")
|
||||
self.callback(os.path.join(self.directory, new_file))
|
||||
|
||||
already_seen = current_files
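Note that `main.py` above calls `FileMonitor(settings.OBJECT_STORAGE_PATH, settings.TARGET_DIRECTORY_PATH)`, while this constructor expects a callback as its second argument. A minimal sketch of one way to reconcile the two, assuming the target path is meant to be captured by the callback (the `make_on_new_file` helper is hypothetical, not part of the repository):

```python
import os

def make_on_new_file(target_directory):
    """Build a callback that routes each newly detected file toward the target directory."""
    def on_new_file(source_path):
        # Placeholder for the real pipeline: pick a processor, mask the text, save it.
        destination = os.path.join(target_directory, os.path.basename(source_path))
        print(f"New file {source_path} -> would be written to {destination}")
    return on_new_file

# monitor = FileMonitor(settings.OBJECT_STORAGE_PATH,
#                       make_on_new_file(settings.TARGET_DIRECTORY_PATH))
# monitor.start_monitoring()
```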
|
||||
|
|
@ -1,15 +0,0 @@
|
|||
class OllamaClient:
|
||||
def __init__(self, model_name):
|
||||
self.model_name = model_name
|
||||
|
||||
def process_document(self, document_text):
|
||||
# Here you would implement the logic to interact with the Ollama API
|
||||
# and process the document text using the specified model.
|
||||
# This is a placeholder for the actual API call.
|
||||
processed_text = self._mock_api_call(document_text)
|
||||
return processed_text
|
||||
|
||||
def _mock_api_call(self, document_text):
|
||||
# Mock processing: In a real implementation, this would call the Ollama API.
|
||||
# For now, it just returns the input text with a note indicating it was processed.
|
||||
return f"Processed with {self.model_name}: {document_text}"
|
||||
|
|
@ -1,58 +0,0 @@
|
|||
# README.md
|
||||
|
||||
# Document Processing App
|
||||
|
||||
This project is designed to process legal documents by hiding sensitive information such as names and company names. It utilizes the Ollama API with selected models for text processing. The application monitors a specified directory for new files, processes them automatically, and saves the results to a target path.
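The exact prompt sent to the model is not part of this commit; purely as an illustration, a masking instruction could be composed along these lines (the wording and the `build_masking_prompt` helper are assumptions, not code from the project):

```python
def build_masking_prompt(document_text: str) -> str:
    """Illustrative only: ask the model to mask names while leaving the rest intact."""
    return (
        "Replace every personal name and company name in the legal text below "
        "with a neutral placeholder such as [NAME] or [COMPANY], and return the "
        "text otherwise unchanged.\n\n" + document_text
    )
```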
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
doc-processing-app
|
||||
├── src
|
||||
│ ├── main.py # Entry point of the application
|
||||
│ ├── config
|
||||
│ │ └── settings.py # Configuration settings for paths
|
||||
│ ├── services
|
||||
│ │ ├── file_monitor.py # Monitors directory for new files
|
||||
│ │ ├── document_processor.py # Handles document processing logic
|
||||
│ │ └── ollama_client.py # Interacts with the Ollama API
|
||||
│ ├── utils
|
||||
│ │ └── file_utils.py # Utility functions for file operations
|
||||
│ └── models
|
||||
│ └── document.py # Represents the structure of a document
|
||||
├── tests
|
||||
│ └── test_document_processor.py # Unit tests for DocumentProcessor
|
||||
├── requirements.txt # Project dependencies
|
||||
├── .env.example # Example environment variables
|
||||
└── README.md # Project documentation
|
||||
```
|
||||
|
||||
## Setup Instructions
|
||||
|
||||
1. Clone the repository:
|
||||
```
|
||||
git clone <repository-url>
|
||||
cd doc-processing-app
|
||||
```
|
||||
|
||||
2. Install the required dependencies:
|
||||
```
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. Configure the application by editing the `src/config/settings.py` file to set the paths for the object storage and target directory.
|
||||
|
||||
4. Create a `.env` file based on the `.env.example` file to set up necessary environment variables.
|
||||
|
||||
## Usage
|
||||
|
||||
To run the application, execute the following command:
|
||||
```
|
||||
python src/main.py
|
||||
```
|
||||
|
||||
The application will start monitoring the specified directory for new documents. Once a new document is added, it will be processed automatically.
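For example, a document can be dropped into the monitored folder programmatically (the file name below is a placeholder; the paths come from `src/config/settings.py`):

```python
import shutil
from config.settings import settings

# Copy a sample file into the monitored folder; the running monitor should pick it
# up and write the processed result into the target directory.
shutil.copy("sample_document.txt", settings.OBJECT_STORAGE_PATH)
```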
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
|
||||