Merge pull request 'feature-ner-keyword-detect' (#1) from feature-ner-keyword-detect into main

Reviewed-on: #1

This commit is contained in commit 56c718d658

.env.example | 19
@@ -1,19 +0,0 @@
# Storage paths
OBJECT_STORAGE_PATH=/path/to/mounted/object/storage
TARGET_DIRECTORY_PATH=/path/to/target/directory

# Ollama API Configuration
OLLAMA_API_URL=https://api.ollama.com
OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=llama2

# Application Settings
MONITOR_INTERVAL=5

# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log

# Optional: Additional security settings
# MAX_FILE_SIZE=10485760  # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf

@@ -71,3 +71,6 @@ __pycache__
data/doc_dest
data/doc_src
data/doc_intermediate
node_modules
backend/storage/
@@ -0,0 +1,206 @@
# Unified Docker Compose Setup

This project now includes a unified Docker Compose configuration that allows all services (mineru, backend, frontend) to run together and communicate using service names.

## Architecture

The unified setup includes the following services:

- **mineru-api**: Document processing service (port 8001)
- **backend-api**: Main API service (port 8000)
- **celery-worker**: Background task processor
- **redis**: Message broker for Celery
- **frontend**: React frontend application (port 3000)

## Network Configuration

All services are connected through a custom bridge network called `app-network`, allowing them to communicate using service names:

- Backend → Mineru: `http://mineru-api:8000`
- Frontend → Backend: `http://localhost:8000/api/v1` (external access)
- Backend → Redis: `redis://redis:6379/0`

## Usage

### Starting all services

```bash
# From the root directory
docker-compose up -d
```

### Starting specific services

```bash
# Start only backend and mineru
docker-compose up -d backend-api mineru-api redis

# Start only frontend and backend
docker-compose up -d frontend backend-api redis
```

### Stopping services

```bash
# Stop all services
docker-compose down

# Stop and remove volumes
docker-compose down -v
```

### Viewing logs

```bash
# View all logs
docker-compose logs -f

# View specific service logs
docker-compose logs -f backend-api
docker-compose logs -f mineru-api
docker-compose logs -f frontend
```

## Building Services

### Building all services

```bash
# Build all services
docker-compose build

# Build and start all services
docker-compose up -d --build
```

### Building individual services

```bash
# Build only backend
docker-compose build backend-api

# Build only frontend
docker-compose build frontend

# Build only mineru
docker-compose build mineru-api

# Build multiple specific services
docker-compose build backend-api frontend
```

### Building and restarting specific services

```bash
# Build and restart only backend
docker-compose build backend-api
docker-compose up -d backend-api

# Or combine in one command
docker-compose up -d --build backend-api

# Build and restart backend and celery worker
docker-compose up -d --build backend-api celery-worker
```

### Force rebuild (no cache)

```bash
# Force rebuild all services
docker-compose build --no-cache

# Force rebuild specific service
docker-compose build --no-cache backend-api
```

## Environment Variables

The unified setup uses environment variables from the individual service `.env` files:

- `./backend/.env` - Backend configuration
- `./frontend/.env` - Frontend configuration
- `./mineru/.env` - Mineru configuration (if it exists)

### Key Configuration Changes

1. **Backend Configuration** (`backend/app/core/config.py`):

   ```python
   MINERU_API_URL: str = "http://mineru-api:8000"
   ```

2. **Frontend Configuration**:

   ```bash
   REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
   ```

## Service Dependencies

- `backend-api` depends on `redis` and `mineru-api`
- `celery-worker` depends on `redis` and `backend-api`
- `frontend` depends on `backend-api`

## Port Mapping

- **Frontend**: `http://localhost:3000`
- **Backend API**: `http://localhost:8000`
- **Mineru API**: `http://localhost:8001`
- **Redis**: `localhost:6379`

## Health Checks

The mineru-api service includes a health check that verifies the service is running properly.
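For reference, a minimal sketch of exercising such a check from outside the containers is shown below. It assumes mineru-api answers `GET /health` on its published port 8001 (the same endpoint probed with `curl` in the migration guide); it is an illustrative probe, not the healthcheck definition from `docker-compose.yml`.

```python
# Minimal sketch, assuming mineru-api exposes GET /health on the published port 8001.
import time
import requests

def wait_for_mineru(url: str = "http://localhost:8001/health",
                    retries: int = 10, delay: float = 3.0) -> bool:
    """Poll the health endpoint until it answers 200 or the retries run out."""
    for _ in range(retries):
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # service not up yet; keep polling
        time.sleep(delay)
    return False

if __name__ == "__main__":
    print("mineru-api healthy:", wait_for_mineru())
```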
## Development vs Production

For development, you can still use the individual docker-compose files in each service directory. The unified setup is ideal for:

- Production deployments
- End-to-end testing
- A simplified development environment

## Troubleshooting

### Service Communication Issues

If services can't communicate:

1. Check that all services are running: `docker-compose ps`
2. Verify network connectivity: `docker network ls`
3. Check service logs: `docker-compose logs [service-name]`

### Port Conflicts

If you get port conflicts, you can modify the port mappings in the `docker-compose.yml` file:

```yaml
ports:
  - "8002:8000"  # Change external port
```

### Volume Issues

Make sure the storage directories exist:

```bash
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
```

## Migration from Individual Compose Files

If you were previously using individual docker-compose files:

1. Stop all individual services:

   ```bash
   cd backend && docker-compose down
   cd ../frontend && docker-compose down
   cd ../mineru && docker-compose down
   ```

2. Start the unified setup:

   ```bash
   cd .. && docker-compose up -d
   ```

The unified setup maintains the same functionality while providing better service discovery and networking.
@@ -0,0 +1,399 @@
# Docker Image Migration Guide

This guide explains how to export your built Docker images, transfer them to another environment, and run them without rebuilding.

## Overview

The migration process involves:

1. **Export**: Save built images to tar files
2. **Transfer**: Copy tar files to the target environment
3. **Import**: Load images on the target environment
4. **Run**: Start services with the imported images

## Prerequisites

### Source Environment (where images are built)

- Docker installed and running
- All services built and working
- Sufficient disk space for image export

### Target Environment (where images will run)

- Docker installed and running
- Sufficient disk space for image import
- Network access to the source environment (or a USB drive)

## Step 1: Export Docker Images

### 1.1 List Current Images

First, check what images you have:

```bash
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
```

You should see images like:

- `legal-doc-masker-backend-api`
- `legal-doc-masker-frontend`
- `legal-doc-masker-mineru-api`
- `redis:alpine`

### 1.2 Export Individual Images

Create a directory for exports:

```bash
mkdir -p docker-images-export
cd docker-images-export
```

Export each image:

```bash
# Export backend image
docker save legal-doc-masker-backend-api:latest -o backend-api.tar

# Export frontend image
docker save legal-doc-masker-frontend:latest -o frontend.tar

# Export mineru image
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar

# Export redis image (if not using official)
docker save redis:alpine -o redis.tar
```

### 1.3 Export All Images at Once (Alternative)

If you want to export all images in one command:

```bash
# Export all project images
docker save \
  legal-doc-masker-backend-api:latest \
  legal-doc-masker-frontend:latest \
  legal-doc-masker-mineru-api:latest \
  redis:alpine \
  -o legal-doc-masker-all.tar
```

### 1.4 Verify Export Files

Check the exported files:

```bash
ls -lh *.tar
```

You should see files like:

- `backend-api.tar` (~200-500MB)
- `frontend.tar` (~100-300MB)
- `mineru-api.tar` (~1-3GB)
- `redis.tar` (~30-50MB)

## Step 2: Transfer Images

### 2.1 Transfer via Network (SCP/RSYNC)

```bash
# Transfer to remote server
scp *.tar user@remote-server:/path/to/destination/

# Or using rsync (more efficient for large files)
rsync -avz --progress *.tar user@remote-server:/path/to/destination/
```

### 2.2 Transfer via USB Drive

```bash
# Copy to USB drive
cp *.tar /Volumes/USB_DRIVE/docker-images/

# Or create a compressed archive
tar -czf legal-doc-masker-images.tar.gz *.tar
cp legal-doc-masker-images.tar.gz /Volumes/USB_DRIVE/
```

### 2.3 Transfer via Cloud Storage

```bash
# Upload to cloud storage (example with AWS S3)
aws s3 cp *.tar s3://your-bucket/docker-images/

# Or using Google Cloud Storage
gsutil cp *.tar gs://your-bucket/docker-images/
```

## Step 3: Import Images on Target Environment

### 3.1 Prepare Target Environment

```bash
# Create directory for images
mkdir -p docker-images-import
cd docker-images-import

# Copy images from transfer method
# (SCP, USB, or download from cloud storage)
```

### 3.2 Import Individual Images

```bash
# Import backend image
docker load -i backend-api.tar

# Import frontend image
docker load -i frontend.tar

# Import mineru image
docker load -i mineru-api.tar

# Import redis image
docker load -i redis.tar
```

### 3.3 Import All Images at Once (if exported together)

```bash
docker load -i legal-doc-masker-all.tar
```

### 3.4 Verify Imported Images

```bash
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}"
```

## Step 4: Prepare Target Environment

### 4.1 Copy Project Files

Transfer the following files to the target environment:

```bash
# Essential files to copy
docker-compose.yml
DOCKER_COMPOSE_README.md
setup-unified-docker.sh

# Environment files (if they exist)
backend/.env
frontend/.env
mineru/.env

# Storage directories (if you want to preserve data)
backend/storage/
mineru/storage/
backend/legal_doc_masker.db
```

### 4.2 Create Directory Structure

```bash
# Create necessary directories
mkdir -p backend/storage
mkdir -p mineru/storage/uploads
mkdir -p mineru/storage/processed
```

## Step 5: Run Services

### 5.1 Start All Services

```bash
# Start all services using imported images
docker-compose up -d
```

### 5.2 Verify Services

```bash
# Check service status
docker-compose ps

# Check service logs
docker-compose logs -f
```

### 5.3 Test Endpoints

```bash
# Test frontend
curl -I http://localhost:3000

# Test backend API
curl -I http://localhost:8000/api/v1

# Test mineru API
curl -I http://localhost:8001/health
```

## Automation Scripts

### Export Script

Create `export-images.sh`:

```bash
#!/bin/bash

set -e

echo "🚀 Exporting Docker Images"

# Create export directory
mkdir -p docker-images-export
cd docker-images-export

# Export images
echo "📦 Exporting backend-api image..."
docker save legal-doc-masker-backend-api:latest -o backend-api.tar

echo "📦 Exporting frontend image..."
docker save legal-doc-masker-frontend:latest -o frontend.tar

echo "📦 Exporting mineru-api image..."
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar

echo "📦 Exporting redis image..."
docker save redis:alpine -o redis.tar

# Show file sizes
echo "📊 Export complete. File sizes:"
ls -lh *.tar

echo "✅ Images exported successfully!"
```

### Import Script

Create `import-images.sh`:

```bash
#!/bin/bash

set -e

echo "🚀 Importing Docker Images"

# Check if tar files exist
if [ ! -f "backend-api.tar" ]; then
    echo "❌ backend-api.tar not found"
    exit 1
fi

# Import images
echo "📦 Importing backend-api image..."
docker load -i backend-api.tar

echo "📦 Importing frontend image..."
docker load -i frontend.tar

echo "📦 Importing mineru-api image..."
docker load -i mineru-api.tar

echo "📦 Importing redis image..."
docker load -i redis.tar

# Verify imports
echo "📊 Imported images:"
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker

echo "✅ Images imported successfully!"
```

## Troubleshooting

### Common Issues

1. **Image not found during import**

   ```bash
   # Check if image exists
   docker images | grep image-name

   # Re-export if needed
   docker save image-name:tag -o image-name.tar
   ```

2. **Port conflicts on target environment**

   ```bash
   # Check what's using the ports
   lsof -i :8000
   lsof -i :8001
   lsof -i :3000

   # Modify docker-compose.yml if needed
   ports:
     - "8002:8000"  # Change external port
   ```

3. **Permission issues**

   ```bash
   # Fix file permissions
   chmod +x setup-unified-docker.sh
   chmod +x export-images.sh
   chmod +x import-images.sh
   ```

4. **Storage directory issues**

   ```bash
   # Create directories with proper permissions
   sudo mkdir -p backend/storage
   sudo mkdir -p mineru/storage/uploads
   sudo mkdir -p mineru/storage/processed
   sudo chown -R $USER:$USER backend/storage mineru/storage
   ```

### Performance Optimization

1. **Compress images for transfer**

   ```bash
   # Compress before transfer
   gzip *.tar

   # Decompress on target
   gunzip *.tar.gz
   ```

2. **Use parallel transfer**

   ```bash
   # Transfer multiple files in parallel
   parallel scp {} user@server:/path/ ::: *.tar
   ```

3. **Use Docker registry (alternative)**

   ```bash
   # Push to registry
   docker tag legal-doc-masker-backend-api:latest your-registry/backend-api:latest
   docker push your-registry/backend-api:latest

   # Pull on target
   docker pull your-registry/backend-api:latest
   ```

## Complete Migration Checklist

- [ ] Export all Docker images
- [ ] Transfer image files to target environment
- [ ] Transfer project configuration files
- [ ] Import images on target environment
- [ ] Create necessary directories
- [ ] Start services
- [ ] Verify all services are running
- [ ] Test all endpoints
- [ ] Update any environment-specific configurations

## Security Considerations

1. **Secure transfer**: Use encrypted transfer methods (SCP, SFTP)
2. **Image verification**: Verify image integrity after transfer (see the sketch below)
3. **Environment isolation**: Ensure the target environment is properly secured
4. **Access control**: Limit access to the Docker daemon on the target environment
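A minimal way to check integrity is to compare checksums of the exported tar files on both sides of the transfer. The sketch below assumes you run it from the directory holding the `.tar` files on each machine and compare the printed digests by hand; it is an illustration, not part of the repository's scripts.

```python
# Minimal integrity-check sketch: compute SHA-256 digests of the exported tars
# on the source machine, rerun on the target, and compare the output by hand.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

for tar_file in sorted(Path(".").glob("*.tar")):
    print(f"{sha256_of(tar_file)}  {tar_file.name}")
```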
## Cost Optimization

1. **Image size**: Remove unnecessary layers before export
2. **Compression**: Use compression for large images
3. **Selective transfer**: Only transfer the images you need
4. **Cleanup**: Remove old images after a successful migration
Dockerfile | 48
@@ -1,48 +0,0 @@
# Build stage
FROM python:3.12-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first to leverage Docker cache
COPY requirements.txt .
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt

# Final stage
FROM python:3.12-slim

WORKDIR /app

# Create non-root user
RUN useradd -m -r appuser && \
    chown appuser:appuser /app

# Copy wheels from builder
COPY --from=builder /app/wheels /wheels
COPY --from=builder /app/requirements.txt .

# Install dependencies
RUN pip install --no-cache /wheels/*

# Copy application code
COPY src/ ./src/

# Create directories for mounted volumes
RUN mkdir -p /data/input /data/output && \
    chown -R appuser:appuser /data

# Switch to non-root user
USER appuser

# Environment variables
ENV PYTHONPATH=/app \
    OBJECT_STORAGE_PATH=/data/input \
    TARGET_DIRECTORY_PATH=/data/output

# Run the application
CMD ["python", "src/main.py"]
@@ -0,0 +1,178 @@
# Docker Migration Quick Reference

## 🚀 Quick Migration Process

### Source Environment (Export)

```bash
# 1. Build images first (if not already built)
docker-compose build

# 2. Export all images
./export-images.sh

# 3. Transfer files to target environment
# Option A: SCP
scp -r docker-images-export-*/ user@target-server:/path/to/destination/

# Option B: USB Drive
cp -r docker-images-export-*/ /Volumes/USB_DRIVE/

# Option C: Compressed archive
scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/
```

### Target Environment (Import)

```bash
# 1. Copy project files
scp docker-compose.yml user@target-server:/path/to/destination/
scp DOCKER_COMPOSE_README.md user@target-server:/path/to/destination/

# 2. Import images
./import-images.sh

# 3. Start services
docker-compose up -d

# 4. Verify
docker-compose ps
```

## 📋 Essential Files to Transfer

### Required Files

- `docker-compose.yml` - Unified compose configuration
- `DOCKER_COMPOSE_README.md` - Documentation
- `backend/.env` - Backend environment variables
- `frontend/.env` - Frontend environment variables
- `mineru/.env` - Mineru environment variables (if it exists)

### Optional Files (for data preservation)

- `backend/storage/` - Backend storage directory
- `mineru/storage/` - Mineru storage directory
- `backend/legal_doc_masker.db` - Database file

## 🔧 Common Commands

### Export Commands

```bash
# Manual export
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
docker save legal-doc-masker-frontend:latest -o frontend.tar
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
docker save redis:alpine -o redis.tar

# Compress for transfer
tar -czf legal-doc-masker-images.tar.gz *.tar
```

### Import Commands

```bash
# Manual import
docker load -i backend-api.tar
docker load -i frontend.tar
docker load -i mineru-api.tar
docker load -i redis.tar

# Extract compressed archive
tar -xzf legal-doc-masker-images.tar.gz
```

### Service Management

```bash
# Start all services
docker-compose up -d

# Stop all services
docker-compose down

# View logs
docker-compose logs -f [service-name]

# Check status
docker-compose ps
```

### Building Individual Services

```bash
# Build specific service only
docker-compose build backend-api
docker-compose build frontend
docker-compose build mineru-api

# Build and restart specific service
docker-compose up -d --build backend-api

# Force rebuild (no cache)
docker-compose build --no-cache backend-api

# Using the build script
./build-service.sh backend-api --restart
./build-service.sh frontend --no-cache
./build-service.sh backend-api celery-worker
```

## 🌐 Service URLs

After successful migration:

- **Frontend**: http://localhost:3000
- **Backend API**: http://localhost:8000
- **Mineru API**: http://localhost:8001

## ⚠️ Troubleshooting

### Port Conflicts

```bash
# Check what's using ports
lsof -i :8000
lsof -i :8001
lsof -i :3000

# Modify docker-compose.yml if needed
ports:
  - "8002:8000"  # Change external port
```

### Permission Issues

```bash
# Fix script permissions
chmod +x export-images.sh
chmod +x import-images.sh
chmod +x setup-unified-docker.sh

# Fix directory permissions
sudo chown -R $USER:$USER backend/storage mineru/storage
```

### Disk Space Issues

```bash
# Check available space
df -h

# Clean up Docker
docker system prune -a
```

## 📊 Expected File Sizes

- `backend-api.tar`: ~200-500MB
- `frontend.tar`: ~100-300MB
- `mineru-api.tar`: ~1-3GB
- `redis.tar`: ~30-50MB
- `legal-doc-masker-images.tar.gz`: ~1-2GB (compressed)

## 🔒 Security Notes

1. Use encrypted transfer (SCP, SFTP) for sensitive environments
2. Verify image integrity after transfer
3. Update environment variables for the target environment
4. Ensure proper network security on the target environment

## 📞 Support

If you encounter issues:

1. Check the full `DOCKER_MIGRATION_GUIDE.md`
2. Verify all required files are present
3. Check Docker logs: `docker-compose logs -f`
4. Ensure sufficient disk space and permissions
@@ -0,0 +1,20 @@
# Storage paths
OBJECT_STORAGE_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_src
TARGET_DIRECTORY_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_dest
INTERMEDIATE_DIR_PATH=/Users/tigeren/Dev/digisky/legal-doc-masker/data/doc_intermediate

# Ollama API Configuration
OLLAMA_API_URL=http://192.168.2.245:11434
# OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=qwen3:8b

# Application Settings
MONITOR_INTERVAL=5

# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log

# Optional: Additional security settings
# MAX_FILE_SIZE=10485760  # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf
@@ -0,0 +1,36 @@
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    libreoffice \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first to leverage Docker cache
COPY requirements.txt .
# RUN pip install huggingface_hub
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py

# RUN python download_models_hf.py

RUN pip install --no-cache-dir -r requirements.txt
# RUN pip install -U magic-pdf[full]

# Copy the rest of the application
COPY . .

# Create storage directories
RUN mkdir -p storage/uploads storage/processed

# Expose the port the app runs on
EXPOSE 8000

# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -0,0 +1,202 @@
# PDF Processor with Mineru API

## Overview

The PDF processor has been rewritten to use Mineru's REST API instead of the magic_pdf library. This provides better separation of concerns and allows for more flexible deployment options.

## Changes Made

### 1. Removed Dependencies

- Removed all `magic_pdf` imports and dependencies
- Removed direct `PyPDF2` usage (though it is kept in requirements for potential other uses)

### 2. New Implementation

- **REST API Integration**: Uses HTTP requests to call Mineru's API
- **Configurable Settings**: The Mineru API URL and timeout are configurable
- **Error Handling**: Comprehensive error handling for network issues, timeouts, and API errors
- **Flexible Response Parsing**: Handles multiple possible response formats from the Mineru API

### 3. Configuration

Add the following settings to your environment or `.env` file:

```bash
# Mineru API Configuration
MINERU_API_URL=http://mineru-api:8000
MINERU_TIMEOUT=300
MINERU_LANG_LIST=["ch"]
MINERU_BACKEND=pipeline
MINERU_PARSE_METHOD=auto
MINERU_FORMULA_ENABLE=true
MINERU_TABLE_ENABLE=true
```

### 4. API Endpoint

The processor expects Mineru to provide a REST API endpoint at `/file_parse` that accepts PDF files via multipart form data and returns JSON with markdown content.

#### Expected Request Format:

```
POST /file_parse
Content-Type: multipart/form-data

files: [PDF file]
output_dir: ./output
lang_list: ["ch"]
backend: pipeline
parse_method: auto
formula_enable: true
table_enable: true
return_md: true
return_middle_json: false
return_model_output: false
return_content_list: false
return_images: false
start_page_id: 0
end_page_id: 99999
```
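For illustration, a minimal client call matching the request format above might look like the sketch below. It uses the `requests` library and only the fields listed in this section; the URL and timeout values mirror the configuration shown earlier, and how the list-valued field is encoded is an assumption, so the actual payload sent by the backend may differ.

```python
# Minimal sketch of posting a PDF to Mineru's /file_parse endpoint.
# Field names follow the "Expected Request Format" above; the URL, timeout,
# and JSON-encoding of lang_list are assumptions for illustration.
import json
import requests

MINERU_API_URL = "http://mineru-api:8000"
MINERU_TIMEOUT = 300  # seconds

def parse_pdf(pdf_path: str) -> dict:
    with open(pdf_path, "rb") as pdf:
        files = {"files": (pdf_path, pdf, "application/pdf")}
        data = {
            "output_dir": "./output",
            "lang_list": json.dumps(["ch"]),
            "backend": "pipeline",
            "parse_method": "auto",
            "formula_enable": "true",
            "table_enable": "true",
            "return_md": "true",
        }
        response = requests.post(
            f"{MINERU_API_URL}/file_parse",
            files=files,
            data=data,
            timeout=MINERU_TIMEOUT,
        )
    response.raise_for_status()
    return response.json()
```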
#### Expected Response Format:

The processor can handle multiple response formats:

```json
{
  "markdown": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "md": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "content": "# Document Title\n\nContent here..."
}
```

OR

```json
{
  "result": {
    "markdown": "# Document Title\n\nContent here..."
  }
}
```
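A sketch of the kind of fallback extraction this implies is shown below; it simply tries each of the documented keys in turn and is an illustration of the strategy, not the processor's actual code.

```python
# Illustration of the fallback strategy for the response shapes listed above:
# try "markdown", "md", "content", then a nested result["markdown"].
from typing import Any, Dict, Optional

def extract_markdown(payload: Dict[str, Any]) -> Optional[str]:
    for key in ("markdown", "md", "content"):
        value = payload.get(key)
        if isinstance(value, str) and value.strip():
            return value
    nested = payload.get("result")
    if isinstance(nested, dict):
        value = nested.get("markdown")
        if isinstance(value, str) and value.strip():
            return value
    return None  # nothing usable found; the caller decides how to handle it
```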
## Usage

### Basic Usage

```python
from app.core.document_handlers.processors.pdf_processor import PdfDocumentProcessor

# Create processor instance
processor = PdfDocumentProcessor("input.pdf", "output.md")

# Read and convert PDF to markdown
content = processor.read_content()

# Process content (apply masking)
processed_content = processor.process_content(content)

# Save processed content
processor.save_content(processed_content)
```

### Through Document Service

```python
from app.core.services.document_service import DocumentService

service = DocumentService()
success = service.process_document("input.pdf", "output.md")
```

## Testing

Run the test script to verify the implementation:

```bash
cd backend
python test_pdf_processor.py
```

Make sure you have:

1. A sample PDF file in the `sample_doc/` directory
2. The Mineru API service running and accessible
3. Proper network connectivity between services

## Error Handling

The processor handles various error scenarios:

- **Network Timeouts**: Configurable timeout (default: 5 minutes)
- **API Errors**: HTTP status code errors are logged and handled
- **Response Parsing**: Multiple fallback strategies for extracting markdown content
- **File Operations**: Proper error handling for file reading/writing

## Logging

The processor provides detailed logging for debugging:

- API call attempts and responses
- Content extraction results
- Error conditions and stack traces
- Processing statistics

## Deployment

### Docker Compose

Ensure your Mineru service is running and accessible. The default configuration expects it at `http://mineru-api:8000`.

### Environment Variables

Set the following environment variables in your deployment:

```bash
MINERU_API_URL=http://your-mineru-service:8000
MINERU_TIMEOUT=300
```

## Troubleshooting

### Common Issues

1. **Connection Refused**: Check that the Mineru service is running and accessible
2. **Timeout Errors**: Increase `MINERU_TIMEOUT` for large PDF files
3. **Empty Content**: Check the Mineru API response format and logs
4. **Network Issues**: Verify network connectivity between services

### Debug Mode

Enable debug logging to see detailed API interactions:

```python
import logging
logging.getLogger('app.core.document_handlers.processors.pdf_processor').setLevel(logging.DEBUG)
```

## Migration from magic_pdf

If you were previously using magic_pdf:

1. **No Code Changes Required**: The interface remains the same
2. **Configuration Update**: Add the Mineru API settings
3. **Service Dependencies**: Ensure the Mineru service is running
4. **Testing**: Run the test script to verify functionality

## Performance Considerations

- **Timeout**: Large PDFs may require longer timeouts
- **Memory**: The processor loads the entire PDF into memory for API calls
- **Network**: API calls add network latency to processing time
- **Caching**: Consider implementing caching for frequently processed documents (one possible approach is sketched below)
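As one possible approach to that last point, the sketch below caches Mineru results on disk keyed by the SHA-256 of the input PDF. The cache directory and the `parse_pdf` callable it wraps are assumptions for illustration, not existing code in this repository.

```python
# Hypothetical content-addressed cache for Mineru results: reuse the markdown
# for a PDF whose bytes have been processed before. `parse_pdf` stands in for
# whatever function actually calls the Mineru API and returns markdown text.
import hashlib
from pathlib import Path
from typing import Callable

CACHE_DIR = Path("storage/mineru_cache")  # assumed location

def parse_pdf_cached(pdf_path: str, parse_pdf: Callable[[str], str]) -> str:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cached = CACHE_DIR / f"{digest}.md"
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    markdown = parse_pdf(pdf_path)          # expensive API call
    cached.write_text(markdown, encoding="utf-8")
    return markdown
```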
@@ -0,0 +1,103 @@
# Legal Document Masker API

This is the backend API for the Legal Document Masking system. It provides endpoints for file upload, processing status tracking, and file download.

## Prerequisites

- Python 3.8+
- Redis (for Celery)

## File Storage

Files are stored in the following structure:

```
backend/
├── storage/
│   ├── uploads/    # Original uploaded files
│   └── processed/  # Masked/processed files
```

## Setup

### Option 1: Local Development

1. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables:

   Create a `.env` file in the backend directory with the following variables:

   ```env
   SECRET_KEY=your-secret-key-here
   ```

   The database (SQLite) will be created automatically when you first run the application.

4. Start Redis (required for Celery):

   ```bash
   redis-server
   ```

5. Start the Celery worker:

   ```bash
   celery -A app.services.file_service worker --loglevel=info
   ```

6. Start the FastAPI server:

   ```bash
   uvicorn app.main:app --reload
   ```

### Option 2: Docker Deployment

1. Build and start the services:

   ```bash
   docker-compose up --build
   ```

This will start:

- FastAPI server on port 8000
- Celery worker for background processing
- Redis for the task queue

## API Documentation

Once the server is running, you can access:

- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`

## API Endpoints

- `POST /api/v1/files/upload` - Upload a new file
- `GET /api/v1/files` - List all files
- `GET /api/v1/files/{file_id}` - Get file details
- `GET /api/v1/files/{file_id}/download` - Download processed file
- `WS /api/v1/files/ws/status/{file_id}` - WebSocket for real-time status updates (see the sketch below)
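A minimal client-side sketch of the upload-then-track flow is shown below. It assumes the API is reachable at `localhost:8000` as in the Docker setup, that the upload response exposes the file's `id`, and that the status strings match the `FileStatus` names; it also uses the third-party `requests` and `websockets` packages, so treat it as an illustration rather than an official client.

```python
# Illustrative client for the endpoints above: upload a document, then follow
# its processing status over the WebSocket until a terminal state is reported.
# Assumes `pip install requests websockets`; the "id" field name and the
# SUCCESS/FAILED status strings are assumptions based on the code shown below.
import asyncio
import json

import requests
import websockets

BASE = "http://localhost:8000/api/v1"

async def upload_and_wait(path: str) -> None:
    with open(path, "rb") as fh:
        resp = requests.post(f"{BASE}/files/upload", files={"file": fh})
    resp.raise_for_status()
    file_id = resp.json()["id"]

    ws_url = f"ws://localhost:8000/api/v1/files/ws/status/{file_id}"
    async with websockets.connect(ws_url) as ws:
        async for message in ws:
            status = json.loads(message)
            print(status)
            if status.get("status") in ("SUCCESS", "FAILED"):
                break

# Example usage with a hypothetical sample document:
# asyncio.run(upload_and_wait("sample_doc/example.pdf"))
```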
## Development

### Running Tests

```bash
pytest
```

### Code Style

The project uses Black for code formatting:

```bash
black .
```

### Docker Commands

- Start services: `docker-compose up`
- Start in background: `docker-compose up -d`
- Stop services: `docker-compose down`
- View logs: `docker-compose logs -f`
- Rebuild: `docker-compose up --build`
@@ -0,0 +1,166 @@
from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, WebSocket, Response
from fastapi.responses import FileResponse
from sqlalchemy.orm import Session
from typing import List
import os
from ...core.config import settings
from ...core.database import get_db
from ...models.file import File as FileModel, FileStatus
from ...services.file_service import process_file, delete_file
from ...schemas.file import FileResponse as FileResponseSchema, FileList
import asyncio
from fastapi import WebSocketDisconnect
import uuid

router = APIRouter()

@router.post("/upload", response_model=FileResponseSchema)
async def upload_file(
    file: UploadFile = File(...),
    db: Session = Depends(get_db)
):
    if not file.filename:
        raise HTTPException(status_code=400, detail="No file provided")

    if not any(file.filename.lower().endswith(ext) for ext in settings.ALLOWED_EXTENSIONS):
        raise HTTPException(
            status_code=400,
            detail=f"File type not allowed. Allowed types: {', '.join(settings.ALLOWED_EXTENSIONS)}"
        )

    # Generate unique file ID
    file_id = str(uuid.uuid4())
    file_extension = os.path.splitext(file.filename)[1]
    unique_filename = f"{file_id}{file_extension}"

    # Save file with unique name
    file_path = settings.UPLOAD_FOLDER / unique_filename
    with open(file_path, "wb") as buffer:
        content = await file.read()
        buffer.write(content)

    # Create database entry
    db_file = FileModel(
        id=file_id,
        filename=file.filename,
        original_path=str(file_path),
        status=FileStatus.NOT_STARTED
    )
    db.add(db_file)
    db.commit()
    db.refresh(db_file)

    # Start processing
    process_file.delay(str(db_file.id))

    return db_file

@router.get("/files", response_model=List[FileResponseSchema])
def list_files(
    skip: int = 0,
    limit: int = 100,
    db: Session = Depends(get_db)
):
    files = db.query(FileModel).offset(skip).limit(limit).all()
    return files

@router.get("/files/{file_id}", response_model=FileResponseSchema)
def get_file(
    file_id: str,
    db: Session = Depends(get_db)
):
    file = db.query(FileModel).filter(FileModel.id == file_id).first()
    if not file:
        raise HTTPException(status_code=404, detail="File not found")
    return file

@router.get("/files/{file_id}/download")
async def download_file(
    file_id: str,
    db: Session = Depends(get_db)
):
    print(f"=== DOWNLOAD REQUEST ===")
    print(f"File ID: {file_id}")

    file = db.query(FileModel).filter(FileModel.id == file_id).first()
    if not file:
        print(f"❌ File not found for ID: {file_id}")
        raise HTTPException(status_code=404, detail="File not found")

    print(f"✅ File found: {file.filename}")
    print(f"File status: {file.status}")
    print(f"Original path: {file.original_path}")
    print(f"Processed path: {file.processed_path}")

    if file.status != FileStatus.SUCCESS:
        print(f"❌ File not ready for download. Status: {file.status}")
        raise HTTPException(status_code=400, detail="File is not ready for download")

    if not os.path.exists(file.processed_path):
        print(f"❌ Processed file not found at: {file.processed_path}")
        raise HTTPException(status_code=404, detail="Processed file not found")

    print(f"✅ Processed file exists at: {file.processed_path}")

    # Get the original filename without extension and add .md extension
    original_filename = file.filename
    filename_without_ext = os.path.splitext(original_filename)[0]
    download_filename = f"{filename_without_ext}.md"

    print(f"Original filename: {original_filename}")
    print(f"Filename without extension: {filename_without_ext}")
    print(f"Download filename: {download_filename}")

    response = FileResponse(
        path=file.processed_path,
        filename=download_filename,
        media_type="text/markdown"
    )

    print(f"Response headers: {dict(response.headers)}")
    print(f"=== END DOWNLOAD REQUEST ===")

    return response

@router.websocket("/ws/status/{file_id}")
async def websocket_endpoint(websocket: WebSocket, file_id: str, db: Session = Depends(get_db)):
    await websocket.accept()
    try:
        while True:
            file = db.query(FileModel).filter(FileModel.id == file_id).first()
            if not file:
                await websocket.send_json({"error": "File not found"})
                break

            await websocket.send_json({
                "status": file.status,
                "error": file.error_message
            })

            if file.status in [FileStatus.SUCCESS, FileStatus.FAILED]:
                break

            await asyncio.sleep(1)
    except WebSocketDisconnect:
        pass

@router.delete("/files/{file_id}")
async def delete_file_endpoint(
    file_id: str,
    db: Session = Depends(get_db)
):
    """
    Delete a file and its associated records.
    This will remove:
    1. The database record
    2. The original uploaded file
    3. The processed markdown file (if it exists)
    """
    try:
        delete_file(file_id)
        return {"message": "File deleted successfully"}
    except HTTPException as e:
        raise e
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
@@ -0,0 +1,65 @@
from pydantic_settings import BaseSettings
from typing import Optional
import os
from pathlib import Path

class Settings(BaseSettings):
    # API Settings
    API_V1_STR: str = "/api/v1"
    PROJECT_NAME: str = "Legal Document Masker API"

    # Security
    SECRET_KEY: str = "your-secret-key-here"  # Change in production
    ACCESS_TOKEN_EXPIRE_MINUTES: int = 60 * 24 * 8  # 8 days

    # Database
    BASE_DIR: Path = Path(__file__).parent.parent.parent
    DATABASE_URL: str = f"sqlite:///{BASE_DIR}/storage/legal_doc_masker.db"

    # File Storage
    UPLOAD_FOLDER: Path = BASE_DIR / "storage" / "uploads"
    PROCESSED_FOLDER: Path = BASE_DIR / "storage" / "processed"
    MAX_FILE_SIZE: int = 50 * 1024 * 1024  # 50MB
    ALLOWED_EXTENSIONS: set = {"pdf", "docx", "doc", "md"}

    # Celery
    CELERY_BROKER_URL: str = "redis://redis:6379/0"
    CELERY_RESULT_BACKEND: str = "redis://redis:6379/0"

    # Ollama API settings
    OLLAMA_API_URL: str = "https://api.ollama.com"
    OLLAMA_API_KEY: str = ""
    OLLAMA_MODEL: str = "llama2"

    # Mineru API settings
    MINERU_API_URL: str = "http://mineru-api:8000"
    # MINERU_API_URL: str = "http://host.docker.internal:8001"

    MINERU_TIMEOUT: int = 300  # 5 minutes timeout
    MINERU_LANG_LIST: list = ["ch"]  # Language list for parsing
    MINERU_BACKEND: str = "pipeline"  # Backend to use
    MINERU_PARSE_METHOD: str = "auto"  # Parse method
    MINERU_FORMULA_ENABLE: bool = True  # Enable formula parsing
    MINERU_TABLE_ENABLE: bool = True  # Enable table parsing

    # Logging settings
    LOG_LEVEL: str = "INFO"
    LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
    LOG_FILE: str = "app.log"

    class Config:
        case_sensitive = True
        env_file = ".env"
        env_file_encoding = "utf-8"
        extra = "allow"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Create storage directories if they don't exist
        self.UPLOAD_FOLDER.mkdir(parents=True, exist_ok=True)
        self.PROCESSED_FOLDER.mkdir(parents=True, exist_ok=True)
        # Create storage directory for database
        (self.BASE_DIR / "storage").mkdir(parents=True, exist_ok=True)

settings = Settings()
@@ -1,5 +1,6 @@
 import logging.config
-from config.settings import settings
+# from config.settings import settings
+from .settings import settings
 
 LOGGING_CONFIG = {
     "version": 1,
@@ -0,0 +1,21 @@
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from .config import settings

# Create SQLite engine with check_same_thread=False for FastAPI
engine = create_engine(
    settings.DATABASE_URL,
    connect_args={"check_same_thread": False}
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

Base = declarative_base()

# Dependency
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()
@@ -1,9 +1,9 @@
 import os
 from typing import Optional
-from document_handlers.document_processor import DocumentProcessor
+from .document_processor import DocumentProcessor
-from document_handlers.processors import (
+from .processors import (
     TxtDocumentProcessor,
-    DocxDocumentProcessor,
+    # DocxDocumentProcessor,
     PdfDocumentProcessor,
     MarkdownDocumentProcessor
 )

@@ -15,8 +15,8 @@ class DocumentProcessorFactory:
     processors = {
         '.txt': TxtDocumentProcessor,
-        '.docx': DocxDocumentProcessor,
+        # '.docx': DocxDocumentProcessor,
-        '.doc': DocxDocumentProcessor,
+        # '.doc': DocxDocumentProcessor,
         '.pdf': PdfDocumentProcessor,
         '.md': MarkdownDocumentProcessor,
         '.markdown': MarkdownDocumentProcessor
@@ -0,0 +1,71 @@
from abc import ABC, abstractmethod
from typing import Any, Dict
import logging
from .ner_processor import NerProcessor

logger = logging.getLogger(__name__)

class DocumentProcessor(ABC):

    def __init__(self):
        self.max_chunk_size = 1000  # Maximum number of characters per chunk
        self.ner_processor = NerProcessor()

    @abstractmethod
    def read_content(self) -> str:
        """Read document content"""
        pass

    def _split_into_chunks(self, sentences: list[str]) -> list[str]:
        """Split sentences into chunks that don't exceed max_chunk_size"""
        chunks = []
        current_chunk = ""

        for sentence in sentences:
            if not sentence.strip():
                continue

            if len(current_chunk) + len(sentence) > self.max_chunk_size and current_chunk:
                chunks.append(current_chunk)
                current_chunk = sentence
            else:
                if current_chunk:
                    current_chunk += "。" + sentence
                else:
                    current_chunk = sentence

        if current_chunk:
            chunks.append(current_chunk)
        logger.info(f"Split content into {len(chunks)} chunks")

        return chunks

    def _apply_mapping(self, text: str, mapping: Dict[str, str]) -> str:
        """Apply the mapping to replace sensitive information"""
        masked_text = text
        for original, masked in mapping.items():
            if isinstance(masked, dict):
                masked = next(iter(masked.values()), "某")
            elif not isinstance(masked, str):
                masked = str(masked) if masked is not None else "某"
            masked_text = masked_text.replace(original, masked)
        return masked_text

    def process_content(self, content: str) -> str:
        """Process document content by masking sensitive information"""
        sentences = content.split("。")

        chunks = self._split_into_chunks(sentences)
        logger.info(f"Split content into {len(chunks)} chunks")

        final_mapping = self.ner_processor.process(chunks)

        masked_content = self._apply_mapping(content, final_mapping)
        logger.info("Successfully masked content")

        return masked_content

    @abstractmethod
    def save_content(self, content: str) -> None:
        """Save processed content"""
        pass
@ -0,0 +1,305 @@
|
||||||
|
from typing import Any, Dict
|
||||||
|
from ..prompts.masking_prompts import get_ner_name_prompt, get_ner_company_prompt, get_ner_address_prompt, get_ner_project_prompt, get_ner_case_number_prompt, get_entity_linkage_prompt
|
||||||
|
import logging
|
||||||
|
import json
|
||||||
|
from ..services.ollama_client import OllamaClient
|
||||||
|
from ...core.config import settings
|
||||||
|
from ..utils.json_extractor import LLMJsonExtractor
|
||||||
|
from ..utils.llm_validator import LLMResponseValidator
|
||||||
|
import re
|
||||||
|
from .regs.entity_regex import extract_id_number_entities, extract_social_credit_code_entities
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class NerProcessor:
|
||||||
|
def __init__(self):
|
||||||
|
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
|
||||||
|
self.max_retries = 3
|
||||||
|
|
||||||
|
def _validate_mapping_format(self, mapping: Dict[str, Any]) -> bool:
|
||||||
|
return LLMResponseValidator.validate_entity_extraction(mapping)
|
||||||
|
|
||||||
|
def _process_entity_type(self, chunk: str, prompt_func, entity_type: str) -> Dict[str, str]:
|
||||||
|
for attempt in range(self.max_retries):
|
||||||
|
try:
|
||||||
|
formatted_prompt = prompt_func(chunk)
|
||||||
|
logger.info(f"Calling ollama to generate {entity_type} mapping for chunk (attempt {attempt + 1}/{self.max_retries}): {formatted_prompt}")
|
||||||
|
response = self.ollama_client.generate(formatted_prompt)
|
||||||
|
logger.info(f"Raw response from LLM: {response}")
|
||||||
|
|
||||||
|
mapping = LLMJsonExtractor.parse_raw_json_str(response)
|
||||||
|
logger.info(f"Parsed mapping: {mapping}")
|
||||||
|
|
||||||
|
if mapping and self._validate_mapping_format(mapping):
|
||||||
|
return mapping
|
||||||
|
else:
|
||||||
|
logger.warning(f"Invalid mapping format received on attempt {attempt + 1}, retrying...")
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error generating {entity_type} mapping on attempt {attempt + 1}: {e}")
|
||||||
|
if attempt < self.max_retries - 1:
|
||||||
|
logger.info("Retrying...")
|
||||||
|
else:
|
||||||
|
logger.error(f"Max retries reached for {entity_type}, returning empty mapping")
|
||||||
|
|
||||||
|
return {}
|
||||||
|
|
||||||
|
def build_mapping(self, chunk: str) -> list[Dict[str, str]]:
|
||||||
|
mapping_pipeline = []
|
||||||
|
|
||||||
|
entity_configs = [
|
||||||
|
(get_ner_name_prompt, "people names"),
|
||||||
|
(get_ner_company_prompt, "company names"),
|
||||||
|
(get_ner_address_prompt, "addresses"),
|
||||||
|
(get_ner_project_prompt, "project names"),
|
||||||
|
(get_ner_case_number_prompt, "case numbers")
|
||||||
|
]
|
||||||
|
for prompt_func, entity_type in entity_configs:
|
||||||
|
mapping = self._process_entity_type(chunk, prompt_func, entity_type)
|
||||||
|
if mapping:
|
||||||
|
mapping_pipeline.append(mapping)
|
||||||
|
|
||||||
|
regex_entity_extractors = [
|
||||||
|
extract_id_number_entities,
|
||||||
|
extract_social_credit_code_entities
|
||||||
|
]
|
||||||
|
for extractor in regex_entity_extractors:
|
||||||
|
mapping = extractor(chunk)
|
||||||
|
if mapping and LLMResponseValidator.validate_regex_entity(mapping):
|
||||||
|
mapping_pipeline.append(mapping)
|
||||||
|
elif mapping:
|
||||||
|
logger.warning(f"Invalid regex entity mapping format: {mapping}")
|
||||||
|
|
||||||
|
return mapping_pipeline
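# Illustrative shape of the value returned above, taken from the worker logs included in
# this PR (one dict per prompt/regex pass, each with an "entities" list):
#   [
#       {"entities": [{"text": "郭东军", "type": "人名"}]},
#       {"entities": [{"text": "北京丰复久信营销科技有限公司", "type": "公司名称"}]},
#       {"entities": []},
#   ]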
|
||||||
|
|
||||||
|
def _merge_entity_mappings(self, chunk_mappings: list[Dict[str, Any]]) -> list[Dict[str, str]]:
|
||||||
|
all_entities = []
|
||||||
|
for mapping in chunk_mappings:
|
||||||
|
if isinstance(mapping, dict) and 'entities' in mapping:
|
||||||
|
entities = mapping['entities']
|
||||||
|
if isinstance(entities, list):
|
||||||
|
all_entities.extend(entities)
|
||||||
|
|
||||||
|
unique_entities = []
|
||||||
|
seen_texts = set()
|
||||||
|
|
||||||
|
for entity in all_entities:
|
||||||
|
if isinstance(entity, dict) and 'text' in entity:
|
||||||
|
text = entity['text'].strip()
|
||||||
|
if text and text not in seen_texts:
|
||||||
|
seen_texts.add(text)
|
||||||
|
unique_entities.append(entity)
|
||||||
|
elif text and text in seen_texts:
|
||||||
|
# 暂时记录下可能存在冲突的entity
|
||||||
|
logger.info(f"Duplicate entity found: {entity}")
|
||||||
|
continue
|
||||||
|
|
||||||
|
logger.info(f"Merged {len(unique_entities)} unique entities")
|
||||||
|
return unique_entities
|
||||||
|
|
||||||
|
def _generate_masked_mapping(self, unique_entities: list[Dict[str, str]], linkage: Dict[str, Any]) -> Dict[str, str]:
|
||||||
|
"""
|
||||||
|
结合 linkage 信息,按实体分组映射同一脱敏名,并实现如下规则:
|
||||||
|
1. 人名/简称:保留姓,名变为某,同姓编号;
|
||||||
|
2. 公司名:同组公司名映射为大写字母公司(A公司、B公司...);
|
||||||
|
3. 英文人名:每个单词首字母+***;
|
||||||
|
4. 英文公司名:替换为所属行业名称,英文大写(如无行业信息,默认 COMPANY);
|
||||||
|
5. 项目名:项目名称变为小写英文字母(如 a项目、b项目...);
|
||||||
|
6. 案号:只替换案号中的数字部分为***,保留前后结构和“号”字,支持中间有空格;
|
||||||
|
7. 身份证号:6位X;
|
||||||
|
8. 社会信用代码:8位X;
|
||||||
|
9. 地址:保留区级及以上行政区划,去除详细位置;
|
||||||
|
10. 其他类型按原有逻辑。
|
||||||
|
"""
|
||||||
|
|
||||||
|
entity_mapping = {}
|
||||||
|
used_masked_names = set()
|
||||||
|
group_mask_map = {}
|
||||||
|
surname_counter = {}
|
||||||
|
company_letter = ord('A')
|
||||||
|
project_letter = ord('a')
|
||||||
|
# 优先区县级单位,后市、省等
|
||||||
|
admin_keywords = [
|
||||||
|
'市辖区', '自治县', '自治旗', '林区', '区', '县', '旗', '州', '盟', '地区', '自治州',
|
||||||
|
'市', '省', '自治区', '特别行政区'
|
||||||
|
]
|
||||||
|
admin_pattern = r"^(.*?(?:" + '|'.join(admin_keywords) + r"))"
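# Note: because ".*?" is non-greedy, the match stops at the FIRST administrative keyword,
# e.g. "北京市海淀区北小马厂6号..." -> "北京市" rather than "北京市海淀区"; the keyword
# order above only decides ties at the same position.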
|
||||||
|
for group in linkage.get('entity_groups', []):
|
||||||
|
group_type = group.get('group_type', '')
|
||||||
|
entities = group.get('entities', [])
|
||||||
|
if '公司' in group_type or 'Company' in group_type:
|
||||||
|
masked = chr(company_letter) + '公司'
|
||||||
|
company_letter += 1
|
||||||
|
for entity in entities:
|
||||||
|
group_mask_map[entity['text']] = masked
|
||||||
|
elif '人名' in group_type:
|
||||||
|
surname_local_counter = {}
|
||||||
|
for entity in entities:
|
||||||
|
name = entity['text']
|
||||||
|
if not name:
|
||||||
|
continue
|
||||||
|
surname = name[0]
|
||||||
|
surname_local_counter.setdefault(surname, 0)
|
||||||
|
surname_local_counter[surname] += 1
|
||||||
|
if surname_local_counter[surname] == 1:
|
||||||
|
masked = f"{surname}某"
|
||||||
|
else:
|
||||||
|
masked = f"{surname}某{surname_local_counter[surname]}"
|
||||||
|
group_mask_map[name] = masked
|
||||||
|
elif '英文人名' in group_type:
|
||||||
|
for entity in entities:
|
||||||
|
name = entity['text']
|
||||||
|
if not name:
|
||||||
|
continue
|
||||||
|
masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
|
||||||
|
group_mask_map[name] = masked
|
||||||
|
for entity in unique_entities:
|
||||||
|
text = entity['text']
|
||||||
|
entity_type = entity.get('type', '')
|
||||||
|
if text in group_mask_map:
|
||||||
|
entity_mapping[text] = group_mask_map[text]
|
||||||
|
used_masked_names.add(group_mask_map[text])
|
||||||
|
elif '英文公司名' in entity_type or 'English Company' in entity_type:
|
||||||
|
industry = entity.get('industry', 'COMPANY')
|
||||||
|
masked = industry.upper()
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
elif '项目名' in entity_type:
|
||||||
|
masked = chr(project_letter) + '项目'
|
||||||
|
project_letter += 1
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
elif '案号' in entity_type:
|
||||||
|
masked = re.sub(r'(\d[\d\s]*)(号)', r'***\2', text)
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
elif '身份证号' in entity_type:
|
||||||
|
masked = 'X' * 6
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
elif '社会信用代码' in entity_type:
|
||||||
|
masked = 'X' * 8
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
elif '地址' in entity_type:
|
||||||
|
# 保留区级及以上行政区划,去除详细位置
|
||||||
|
match = re.match(admin_pattern, text)
|
||||||
|
if match:
|
||||||
|
masked = match.group(1)
|
||||||
|
else:
|
||||||
|
masked = text # fallback
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
elif '人名' in entity_type:
|
||||||
|
name = text
|
||||||
|
if not name:
|
||||||
|
masked = '某'
|
||||||
|
else:
|
||||||
|
surname = name[0]
|
||||||
|
surname_counter.setdefault(surname, 0)
|
||||||
|
surname_counter[surname] += 1
|
||||||
|
if surname_counter[surname] == 1:
|
||||||
|
masked = f"{surname}某"
|
||||||
|
else:
|
||||||
|
masked = f"{surname}某{surname_counter[surname]}"
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
elif '公司' in entity_type or 'Company' in entity_type:
|
||||||
|
masked = chr(company_letter) + '公司'
|
||||||
|
company_letter += 1
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
elif '英文人名' in entity_type:
|
||||||
|
name = text
|
||||||
|
masked = ' '.join([n[0] + '***' if n else '' for n in name.split()])
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
else:
|
||||||
|
base_name = '某'
|
||||||
|
masked = base_name
|
||||||
|
counter = 1
|
||||||
|
while masked in used_masked_names:
|
||||||
|
if counter <= 10:
|
||||||
|
suffixes = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']
|
||||||
|
masked = base_name + suffixes[counter - 1]
|
||||||
|
else:
|
||||||
|
masked = f"{base_name}{counter}"
|
||||||
|
counter += 1
|
||||||
|
entity_mapping[text] = masked
|
||||||
|
used_masked_names.add(masked)
|
||||||
|
return entity_mapping
|
||||||
|
|
||||||
|
def _validate_linkage_format(self, linkage: Dict[str, Any]) -> bool:
|
||||||
|
return LLMResponseValidator.validate_entity_linkage(linkage)
|
||||||
|
|
||||||
|
def _create_entity_linkage(self, unique_entities: list[Dict[str, str]]) -> Dict[str, Any]:
|
||||||
|
linkable_entities = []
|
||||||
|
for entity in unique_entities:
|
||||||
|
entity_type = entity.get('type', '')
|
||||||
|
if any(keyword in entity_type for keyword in ['公司', 'Company', '人名', '英文人名']):
|
||||||
|
linkable_entities.append(entity)
|
||||||
|
|
||||||
|
if not linkable_entities:
|
||||||
|
logger.info("No linkable entities found")
|
||||||
|
return {"entity_groups": []}
|
||||||
|
|
||||||
|
entities_text = "\n".join([
|
||||||
|
f"- {entity['text']} (类型: {entity['type']})"
|
||||||
|
for entity in linkable_entities
|
||||||
|
])
|
||||||
|
|
||||||
|
for attempt in range(self.max_retries):
|
||||||
|
try:
|
||||||
|
formatted_prompt = get_entity_linkage_prompt(entities_text)
|
||||||
|
logger.info(f"Calling ollama to generate entity linkage (attempt {attempt + 1}/{self.max_retries})")
|
||||||
|
response = self.ollama_client.generate(formatted_prompt)
|
||||||
|
logger.info(f"Raw entity linkage response from LLM: {response}")
|
||||||
|
|
||||||
|
linkage = LLMJsonExtractor.parse_raw_json_str(response)
|
||||||
|
logger.info(f"Parsed entity linkage: {linkage}")
|
||||||
|
|
||||||
|
if linkage and self._validate_linkage_format(linkage):
|
||||||
|
logger.info(f"Successfully created entity linkage with {len(linkage.get('entity_groups', []))} groups")
|
||||||
|
return linkage
|
||||||
|
else:
|
||||||
|
logger.warning(f"Invalid entity linkage format received on attempt {attempt + 1}, retrying...")
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error generating entity linkage on attempt {attempt + 1}: {e}")
|
||||||
|
if attempt < self.max_retries - 1:
|
||||||
|
logger.info("Retrying...")
|
||||||
|
else:
|
||||||
|
logger.error("Max retries reached for entity linkage, returning empty linkage")
|
||||||
|
|
||||||
|
return {"entity_groups": []}
|
||||||
|
|
||||||
|
def _apply_entity_linkage_to_mapping(self, entity_mapping: Dict[str, str], entity_linkage: Dict[str, Any]) -> Dict[str, str]:
|
||||||
|
"""
|
||||||
|
linkage 已在 _generate_masked_mapping 中处理,此处直接返回 entity_mapping。
|
||||||
|
"""
|
||||||
|
return entity_mapping
|
||||||
|
|
||||||
|
def process(self, chunks: list[str]) -> Dict[str, str]:
|
||||||
|
chunk_mappings = []
|
||||||
|
for i, chunk in enumerate(chunks):
|
||||||
|
logger.info(f"Processing chunk {i+1}/{len(chunks)}")
|
||||||
|
chunk_mapping = self.build_mapping(chunk)
|
||||||
|
logger.info(f"Chunk mapping: {chunk_mapping}")
|
||||||
|
chunk_mappings.extend(chunk_mapping)
|
||||||
|
|
||||||
|
logger.info(f"Final chunk mappings: {chunk_mappings}")
|
||||||
|
|
||||||
|
unique_entities = self._merge_entity_mappings(chunk_mappings)
|
||||||
|
logger.info(f"Unique entities: {unique_entities}")
|
||||||
|
|
||||||
|
entity_linkage = self._create_entity_linkage(unique_entities)
|
||||||
|
logger.info(f"Entity linkage: {entity_linkage}")
|
||||||
|
|
||||||
|
# for quick test
|
||||||
|
# unique_entities = [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
|
||||||
|
# entity_linkage = {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
|
||||||
|
combined_mapping = self._generate_masked_mapping(unique_entities, entity_linkage)
|
||||||
|
logger.info(f"Combined mapping: {combined_mapping}")
|
||||||
|
|
||||||
|
final_mapping = self._apply_entity_linkage_to_mapping(combined_mapping, entity_linkage)
|
||||||
|
logger.info(f"Final mapping: {final_mapping}")
|
||||||
|
|
||||||
|
return final_mapping
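# Usage sketch (chunks come from DocumentProcessor._split_into_chunks); the masked values
# shown are what the rules in _generate_masked_mapping aim for, actual output depends on
# the LLM extraction for a given document:
#   mapping = NerProcessor().process(["原告郭东军与被告王欢子……", "……"])
#   # -> {"郭东军": "郭某", "王欢子": "王某", "北京丰复久信营销科技有限公司": "A公司", ...}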
|
||||||
|
|
@ -0,0 +1,7 @@
from .txt_processor import TxtDocumentProcessor
# from .docx_processor import DocxDocumentProcessor
from .pdf_processor import PdfDocumentProcessor
from .md_processor import MarkdownDocumentProcessor

# __all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
__all__ = ['TxtDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
@ -1,13 +1,13 @@
 import os
 import docx
-from document_handlers.document_processor import DocumentProcessor
+from ...document_handlers.document_processor import DocumentProcessor
 from magic_pdf.data.data_reader_writer import FileBasedDataWriter
 from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
 from magic_pdf.data.read_api import read_local_office
 import logging
-from services.ollama_client import OllamaClient
+from ...services.ollama_client import OllamaClient
-from config.settings import settings
+from ...config import settings
-from prompts.masking_prompts import get_masking_mapping_prompt
+from ...prompts.masking_prompts import get_masking_mapping_prompt

 logger = logging.getLogger(__name__)
@ -1,8 +1,8 @@
 import os
-from document_handlers.document_processor import DocumentProcessor
+from ...document_handlers.document_processor import DocumentProcessor
-from services.ollama_client import OllamaClient
+from ...services.ollama_client import OllamaClient
 import logging
-from config.settings import settings
+from ...config import settings

 logger = logging.getLogger(__name__)
@ -0,0 +1,204 @@
|
||||||
|
import os
|
||||||
|
import requests
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Any, Optional
|
||||||
|
from ...document_handlers.document_processor import DocumentProcessor
|
||||||
|
from ...services.ollama_client import OllamaClient
|
||||||
|
from ...config import settings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class PdfDocumentProcessor(DocumentProcessor):
|
||||||
|
def __init__(self, input_path: str, output_path: str):
|
||||||
|
super().__init__() # Call parent class's __init__
|
||||||
|
self.input_path = input_path
|
||||||
|
self.output_path = output_path
|
||||||
|
self.output_dir = os.path.dirname(output_path)
|
||||||
|
self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]
|
||||||
|
|
||||||
|
# Setup work directory for temporary files
|
||||||
|
self.work_dir = os.path.join(
|
||||||
|
os.path.dirname(output_path),
|
||||||
|
".work",
|
||||||
|
os.path.splitext(os.path.basename(input_path))[0]
|
||||||
|
)
|
||||||
|
os.makedirs(self.work_dir, exist_ok=True)
|
||||||
|
|
||||||
|
self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
|
||||||
|
|
||||||
|
# Mineru API configuration
|
||||||
|
self.mineru_base_url = getattr(settings, 'MINERU_API_URL', 'http://mineru-api:8000')
|
||||||
|
self.mineru_timeout = getattr(settings, 'MINERU_TIMEOUT', 300) # 5 minutes timeout
|
||||||
|
self.mineru_lang_list = getattr(settings, 'MINERU_LANG_LIST', ['ch'])
|
||||||
|
self.mineru_backend = getattr(settings, 'MINERU_BACKEND', 'pipeline')
|
||||||
|
self.mineru_parse_method = getattr(settings, 'MINERU_PARSE_METHOD', 'auto')
|
||||||
|
self.mineru_formula_enable = getattr(settings, 'MINERU_FORMULA_ENABLE', True)
|
||||||
|
self.mineru_table_enable = getattr(settings, 'MINERU_TABLE_ENABLE', True)
|
||||||
|
|
||||||
|
def _call_mineru_api(self, file_path: str) -> Optional[Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
Call Mineru API to convert PDF to markdown
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Path to the PDF file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
API response as dictionary or None if failed
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
url = f"{self.mineru_base_url}/file_parse"
|
||||||
|
|
||||||
|
with open(file_path, 'rb') as file:
|
||||||
|
files = {'files': (os.path.basename(file_path), file, 'application/pdf')}
|
||||||
|
|
||||||
|
# Prepare form data according to Mineru API specification
|
||||||
|
data = {
|
||||||
|
'output_dir': './output',
|
||||||
|
'lang_list': self.mineru_lang_list,
|
||||||
|
'backend': self.mineru_backend,
|
||||||
|
'parse_method': self.mineru_parse_method,
|
||||||
|
'formula_enable': self.mineru_formula_enable,
|
||||||
|
'table_enable': self.mineru_table_enable,
|
||||||
|
'return_md': True,
|
||||||
|
'return_middle_json': False,
|
||||||
|
'return_model_output': False,
|
||||||
|
'return_content_list': False,
|
||||||
|
'return_images': False,
|
||||||
|
'start_page_id': 0,
|
||||||
|
'end_page_id': 99999
|
||||||
|
}
|
||||||
|
|
||||||
|
logger.info(f"Calling Mineru API at {url}")
|
||||||
|
response = requests.post(
|
||||||
|
url,
|
||||||
|
files=files,
|
||||||
|
data=data,
|
||||||
|
timeout=self.mineru_timeout
|
||||||
|
)
|
||||||
|
|
||||||
|
if response.status_code == 200:
|
||||||
|
result = response.json()
|
||||||
|
logger.info("Successfully received response from Mineru API")
|
||||||
|
return result
|
||||||
|
else:
|
||||||
|
logger.error(f"Mineru API returned status code {response.status_code}: {response.text}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
except requests.exceptions.Timeout:
|
||||||
|
logger.error(f"Mineru API request timed out after {self.mineru_timeout} seconds")
|
||||||
|
return None
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
logger.error(f"Error calling Mineru API: {str(e)}")
|
||||||
|
return None
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Unexpected error calling Mineru API: {str(e)}")
|
||||||
|
return None
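# For manual testing, the same request can be issued with curl (field names mirror the
# data dict above; host and port assume the docker-compose defaults):
#   curl -F "files=@sample.pdf" -F "backend=pipeline" -F "parse_method=auto" \
#        -F "return_md=true" http://mineru-api:8000/file_parse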
|
||||||
|
|
||||||
|
def _extract_markdown_from_response(self, response: Dict[str, Any]) -> str:
|
||||||
|
"""
|
||||||
|
Extract markdown content from Mineru API response
|
||||||
|
|
||||||
|
Args:
|
||||||
|
response: Mineru API response dictionary
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Extracted markdown content as string
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
logger.debug(f"Mineru API response structure: {response}")
|
||||||
|
|
||||||
|
# Try different possible response formats based on Mineru API
|
||||||
|
if 'markdown' in response:
|
||||||
|
return response['markdown']
|
||||||
|
elif 'md' in response:
|
||||||
|
return response['md']
|
||||||
|
elif 'content' in response:
|
||||||
|
return response['content']
|
||||||
|
elif 'text' in response:
|
||||||
|
return response['text']
|
||||||
|
elif 'result' in response and isinstance(response['result'], dict):
|
||||||
|
result = response['result']
|
||||||
|
if 'markdown' in result:
|
||||||
|
return result['markdown']
|
||||||
|
elif 'md' in result:
|
||||||
|
return result['md']
|
||||||
|
elif 'content' in result:
|
||||||
|
return result['content']
|
||||||
|
elif 'text' in result:
|
||||||
|
return result['text']
|
||||||
|
elif 'data' in response and isinstance(response['data'], dict):
|
||||||
|
data = response['data']
|
||||||
|
if 'markdown' in data:
|
||||||
|
return data['markdown']
|
||||||
|
elif 'md' in data:
|
||||||
|
return data['md']
|
||||||
|
elif 'content' in data:
|
||||||
|
return data['content']
|
||||||
|
elif 'text' in data:
|
||||||
|
return data['text']
|
||||||
|
elif isinstance(response, list) and len(response) > 0:
|
||||||
|
# If response is a list, try to extract from first item
|
||||||
|
first_item = response[0]
|
||||||
|
if isinstance(first_item, dict):
|
||||||
|
return self._extract_markdown_from_response(first_item)
|
||||||
|
elif isinstance(first_item, str):
|
||||||
|
return first_item
|
||||||
|
else:
|
||||||
|
# If no standard format found, try to extract from the response structure
|
||||||
|
logger.warning("Could not find standard markdown field in Mineru response")
|
||||||
|
|
||||||
|
# Return the response as string if it's simple, or empty string
|
||||||
|
if isinstance(response, str):
|
||||||
|
return response
|
||||||
|
elif isinstance(response, dict):
|
||||||
|
# Try to find any text-like content
|
||||||
|
for key, value in response.items():
|
||||||
|
if isinstance(value, str) and len(value) > 100: # Likely content
|
||||||
|
return value
|
||||||
|
elif isinstance(value, dict):
|
||||||
|
# Recursively search in nested dictionaries
|
||||||
|
nested_content = self._extract_markdown_from_response(value)
|
||||||
|
if nested_content:
|
||||||
|
return nested_content
|
||||||
|
|
||||||
|
return ""
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error extracting markdown from Mineru response: {str(e)}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def read_content(self) -> str:
|
||||||
|
logger.info("Starting PDF content processing with Mineru API")
|
||||||
|
|
||||||
|
# Call Mineru API to convert PDF to markdown
|
||||||
|
mineru_response = self._call_mineru_api(self.input_path)
|
||||||
|
|
||||||
|
if not mineru_response:
|
||||||
|
raise Exception("Failed to get response from Mineru API")
|
||||||
|
|
||||||
|
# Extract markdown content from the response
|
||||||
|
markdown_content = self._extract_markdown_from_response(mineru_response)
|
||||||
|
|
||||||
|
if not markdown_content:
|
||||||
|
raise Exception("No markdown content found in Mineru API response")
|
||||||
|
|
||||||
|
logger.info(f"Successfully extracted {len(markdown_content)} characters of markdown content")
|
||||||
|
|
||||||
|
# Save the raw markdown content to work directory for reference
|
||||||
|
md_output_path = os.path.join(self.work_dir, f"{self.name_without_suff}.md")
|
||||||
|
with open(md_output_path, 'w', encoding='utf-8') as file:
|
||||||
|
file.write(markdown_content)
|
||||||
|
|
||||||
|
logger.info(f"Saved raw markdown content to {md_output_path}")
|
||||||
|
|
||||||
|
return markdown_content
|
||||||
|
|
||||||
|
def save_content(self, content: str) -> None:
|
||||||
|
# Ensure output path has .md extension
|
||||||
|
output_dir = os.path.dirname(self.output_path)
|
||||||
|
base_name = os.path.splitext(os.path.basename(self.output_path))[0]
|
||||||
|
md_output_path = os.path.join(output_dir, f"{base_name}.md")
|
||||||
|
|
||||||
|
logger.info(f"Saving masked content to: {md_output_path}")
|
||||||
|
with open(md_output_path, 'w', encoding='utf-8') as file:
|
||||||
|
file.write(content)
|
||||||
|
|
@ -1,8 +1,8 @@
-from document_handlers.document_processor import DocumentProcessor
+from ...document_handlers.document_processor import DocumentProcessor
-from services.ollama_client import OllamaClient
+from ...services.ollama_client import OllamaClient
 import logging
-from prompts.masking_prompts import get_masking_prompt
+# from ...prompts.masking_prompts import get_masking_prompt
-from config.settings import settings
+from ...config import settings

 logger = logging.getLogger(__name__)
 class TxtDocumentProcessor(DocumentProcessor):
@ -0,0 +1,18 @@
import re


def extract_id_number_entities(chunk: str) -> dict:
    """Extract Chinese ID numbers and return in entity mapping format."""
    id_pattern = r'\b\d{17}[\dXx]\b'
    entities = []
    for match in re.findall(id_pattern, chunk):
        entities.append({"text": match, "type": "身份证号"})
    return {"entities": entities} if entities else {}


def extract_social_credit_code_entities(chunk: str) -> dict:
    """Extract social credit codes and return in entity mapping format."""
    credit_pattern = r'\b[0-9A-Z]{18}\b'
    entities = []
    for match in re.findall(credit_pattern, chunk):
        entities.append({"text": match, "type": "统一社会信用代码"})
    return {"entities": entities} if entities else {}
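# Example (the number below is deliberately fake). Note that \b requires a non-word
# neighbour, and Chinese characters count as word characters, so a space or punctuation
# mark must separate the number from surrounding Chinese text:
#   extract_id_number_entities("当事人身份证号码为 11010119900101123X 。")
#   -> {"entities": [{"text": "11010119900101123X", "type": "身份证号"}]}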
@ -0,0 +1,225 @@
|
||||||
|
import textwrap
|
||||||
|
|
||||||
|
|
||||||
|
def get_ner_name_prompt(text: str) -> str:
|
||||||
|
"""
|
||||||
|
Returns a prompt that generates a mapping of original names/companies to their masked versions.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text (str): The input text to be analyzed for masking
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
str: The formatted prompt that will generate a mapping dictionary
|
||||||
|
"""
|
||||||
|
prompt = textwrap.dedent("""
|
||||||
|
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||||
|
|
||||||
|
实体类别包括:
|
||||||
|
- 人名 (不包括律师、法官、书记员、检察官等公职人员)
|
||||||
|
- 英文人名
|
||||||
|
|
||||||
|
|
||||||
|
待处理文本:
|
||||||
|
{text}
|
||||||
|
|
||||||
|
输出格式:
|
||||||
|
{{
|
||||||
|
"entities": [
|
||||||
|
{{"text": "原始文本内容", "type": "人名"}},
|
||||||
|
{{"text": "原始文本内容", "type": "英文人名"}},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
}}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
请严格按照JSON格式输出结果。
|
||||||
|
|
||||||
|
""")
|
||||||
|
return prompt.format(text=text)
|
||||||
|
|
||||||
|
|
||||||
|
def get_ner_company_prompt(text: str) -> str:
|
||||||
|
"""
|
||||||
|
Returns a prompt that generates a mapping of original companies to their masked versions.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text (str): The input text to be analyzed for masking
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
str: The formatted prompt that will generate a mapping dictionary
|
||||||
|
"""
|
||||||
|
prompt = textwrap.dedent("""
|
||||||
|
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||||
|
|
||||||
|
实体类别包括:
|
||||||
|
- 公司名称
|
||||||
|
- 英文公司名称
|
||||||
|
- Company with English name
|
||||||
|
- 公司名称简称
|
||||||
|
- 公司英文名称简称
|
||||||
|
|
||||||
|
|
||||||
|
待处理文本:
|
||||||
|
{text}
|
||||||
|
|
||||||
|
输出格式:
|
||||||
|
{{
|
||||||
|
"entities": [
|
||||||
|
{{"text": "原始文本内容", "type": "公司名称"}},
|
||||||
|
{{"text": "原始文本内容", "type": "英文公司名称"}},
|
||||||
|
{{"text": "原始文本内容", "type": "公司名称简称"}},
|
||||||
|
{{"text": "原始文本内容", "type": "公司英文名称简称"}},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
}}
|
||||||
|
|
||||||
|
请严格按照JSON格式输出结果。
|
||||||
|
""")
|
||||||
|
return prompt.format(text=text)
|
||||||
|
|
||||||
|
|
||||||
|
def get_ner_address_prompt(text: str) -> str:
|
||||||
|
"""
|
||||||
|
Returns a prompt that generates a mapping of original addresses to their masked versions.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text (str): The input text to be analyzed for masking
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
str: The formatted prompt that will generate a mapping dictionary
|
||||||
|
"""
|
||||||
|
prompt = textwrap.dedent("""
|
||||||
|
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||||
|
|
||||||
|
实体类别包括:
|
||||||
|
- 地址
|
||||||
|
|
||||||
|
|
||||||
|
待处理文本:
|
||||||
|
{text}
|
||||||
|
|
||||||
|
输出格式:
|
||||||
|
{{
|
||||||
|
"entities": [
|
||||||
|
{{"text": "原始文本内容", "type": "地址"}},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
}}
|
||||||
|
|
||||||
|
请严格按照JSON格式输出结果。
|
||||||
|
""")
|
||||||
|
return prompt.format(text=text)
|
||||||
|
|
||||||
|
|
||||||
|
def get_ner_project_prompt(text: str) -> str:
|
||||||
|
"""
|
||||||
|
Returns a prompt that generates a mapping of original project names to their masked versions.
|
||||||
|
"""
|
||||||
|
prompt = textwrap.dedent("""
|
||||||
|
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||||
|
|
||||||
|
实体类别包括:
|
||||||
|
- 项目名(此处项目特指商业、工程、合同等项目)
|
||||||
|
|
||||||
|
待处理文本:
|
||||||
|
{text}
|
||||||
|
|
||||||
|
输出格式:
|
||||||
|
{{
|
||||||
|
"entities": [
|
||||||
|
{{"text": "原始文本内容", "type": "项目名"}},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
}}
|
||||||
|
|
||||||
|
请严格按照JSON格式输出结果。
|
||||||
|
""")
|
||||||
|
return prompt.format(text=text)
|
||||||
|
|
||||||
|
|
||||||
|
def get_ner_case_number_prompt(text: str) -> str:
|
||||||
|
"""
|
||||||
|
Returns a prompt that generates a mapping of original case numbers to their masked versions.
|
||||||
|
"""
|
||||||
|
prompt = textwrap.dedent("""
|
||||||
|
你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||||
|
|
||||||
|
实体类别包括:
|
||||||
|
- 案号
|
||||||
|
|
||||||
|
待处理文本:
|
||||||
|
{text}
|
||||||
|
|
||||||
|
输出格式:
|
||||||
|
{{
|
||||||
|
"entities": [
|
||||||
|
{{"text": "原始文本内容", "type": "案号"}},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
}}
|
||||||
|
|
||||||
|
请严格按照JSON格式输出结果。
|
||||||
|
""")
|
||||||
|
return prompt.format(text=text)
|
||||||
|
|
||||||
|
|
||||||
|
def get_entity_linkage_prompt(entities_text: str) -> str:
|
||||||
|
"""
|
||||||
|
Returns a prompt that identifies related entities and groups them together.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
entities_text (str): The list of entities to be analyzed for linkage
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
str: The formatted prompt that will generate entity linkage information
|
||||||
|
"""
|
||||||
|
prompt = textwrap.dedent("""
|
||||||
|
你是一个专业的法律文本实体关联分析助手。请分析以下实体列表,识别出相互关联的实体(如全称与简称、中文名与英文名等),并将它们分组。
|
||||||
|
|
||||||
|
关联规则:
|
||||||
|
1. 公司名称关联:
|
||||||
|
- 全称与简称(如:"阿里巴巴集团控股有限公司" 与 "阿里巴巴")
|
||||||
|
- 中文名与英文名(如:"腾讯科技有限公司" 与 "Tencent Technology Ltd.")
|
||||||
|
- 母公司与子公司(如:"腾讯" 与 "腾讯音乐")
|
||||||
|
|
||||||
|
|
||||||
|
2. 每个组中应指定一个主要实体(is_primary: true),通常是:
|
||||||
|
- 对于公司:选择最正式的全称
|
||||||
|
- 对于人名:选择最常用的称呼
|
||||||
|
|
||||||
|
待分析实体列表:
|
||||||
|
{entities_text}
|
||||||
|
|
||||||
|
输出格式:
|
||||||
|
{{
|
||||||
|
"entity_groups": [
|
||||||
|
{{
|
||||||
|
"group_id": "group_1",
|
||||||
|
"group_type": "公司名称",
|
||||||
|
"entities": [
|
||||||
|
{{
|
||||||
|
"text": "阿里巴巴集团控股有限公司",
|
||||||
|
"type": "公司名称",
|
||||||
|
"is_primary": true
|
||||||
|
}},
|
||||||
|
{{
|
||||||
|
"text": "阿里巴巴",
|
||||||
|
"type": "公司名称简称",
|
||||||
|
"is_primary": false
|
||||||
|
}}
|
||||||
|
]
|
||||||
|
}}
|
||||||
|
]
|
||||||
|
}}
|
||||||
|
|
||||||
|
注意事项:
|
||||||
|
1. 只对确实有关联的实体进行分组
|
||||||
|
2. 每个实体只能属于一个组
|
||||||
|
3. 每个组必须有且仅有一个主要实体(is_primary: true)
|
||||||
|
4. 如果实体之间没有明显关联,不要强制分组
|
||||||
|
5. group_type 应该是 "公司名称"
|
||||||
|
|
||||||
|
请严格按照JSON格式输出结果。
|
||||||
|
""")
|
||||||
|
return prompt.format(entities_text=entities_text)
|
||||||
|
|
@ -1,12 +1,12 @@
 import logging
-from document_handlers.document_factory import DocumentProcessorFactory
+from ..document_handlers.document_factory import DocumentProcessorFactory
-from services.ollama_client import OllamaClient
+from ..services.ollama_client import OllamaClient

 logger = logging.getLogger(__name__)

 class DocumentService:
-    def __init__(self, ollama_client: OllamaClient):
-        self.ollama_client = ollama_client
+    def __init__(self):
+        pass

     def process_document(self, input_path: str, output_path: str) -> bool:
         try:
@ -0,0 +1,240 @@
|
||||||
|
import logging
|
||||||
|
from typing import Any, Dict, Optional
|
||||||
|
from jsonschema import validate, ValidationError
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class LLMResponseValidator:
|
||||||
|
"""Validator for LLM JSON responses with different schemas for different entity types"""
|
||||||
|
|
||||||
|
# Schema for basic entity extraction responses
|
||||||
|
ENTITY_EXTRACTION_SCHEMA = {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"entities": {
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"text": {"type": "string"},
|
||||||
|
"type": {"type": "string"}
|
||||||
|
},
|
||||||
|
"required": ["text", "type"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"required": ["entities"]
|
||||||
|
}
|
||||||
|
|
||||||
|
# Schema for entity linkage responses
|
||||||
|
ENTITY_LINKAGE_SCHEMA = {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"entity_groups": {
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"group_id": {"type": "string"},
|
||||||
|
"group_type": {"type": "string"},
|
||||||
|
"entities": {
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"text": {"type": "string"},
|
||||||
|
"type": {"type": "string"},
|
||||||
|
"is_primary": {"type": "boolean"}
|
||||||
|
},
|
||||||
|
"required": ["text", "type", "is_primary"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"required": ["group_id", "group_type", "entities"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"required": ["entity_groups"]
|
||||||
|
}
|
||||||
|
|
||||||
|
# Schema for regex-based entity extraction (from entity_regex.py)
|
||||||
|
REGEX_ENTITY_SCHEMA = {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"entities": {
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"text": {"type": "string"},
|
||||||
|
"type": {"type": "string"}
|
||||||
|
},
|
||||||
|
"required": ["text", "type"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"required": ["entities"]
|
||||||
|
}
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def validate_entity_extraction(cls, response: Dict[str, Any]) -> bool:
|
||||||
|
"""
|
||||||
|
Validate entity extraction response from LLM.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
response: The parsed JSON response from LLM
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
bool: True if valid, False otherwise
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
|
||||||
|
logger.debug(f"Entity extraction validation passed for response: {response}")
|
||||||
|
return True
|
||||||
|
except ValidationError as e:
|
||||||
|
logger.warning(f"Entity extraction validation failed: {e}")
|
||||||
|
logger.warning(f"Response that failed validation: {response}")
|
||||||
|
return False
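# Example: responses of the shape below pass / fail ENTITY_EXTRACTION_SCHEMA validation:
#   LLMResponseValidator.validate_entity_extraction({"entities": [{"text": "张三", "type": "人名"}]})  # True
#   LLMResponseValidator.validate_entity_extraction({"entities": [{"text": "张三"}]})                  # False, "type" missing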
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def validate_entity_linkage(cls, response: Dict[str, Any]) -> bool:
|
||||||
|
"""
|
||||||
|
Validate entity linkage response from LLM.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
response: The parsed JSON response from LLM
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
bool: True if valid, False otherwise
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
|
||||||
|
content_valid = cls._validate_linkage_content(response)
|
||||||
|
if content_valid:
|
||||||
|
logger.debug(f"Entity linkage validation passed for response: {response}")
|
||||||
|
return True
|
||||||
|
else:
|
||||||
|
logger.warning(f"Entity linkage content validation failed for response: {response}")
|
||||||
|
return False
|
||||||
|
except ValidationError as e:
|
||||||
|
logger.warning(f"Entity linkage validation failed: {e}")
|
||||||
|
logger.warning(f"Response that failed validation: {response}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def validate_regex_entity(cls, response: Dict[str, Any]) -> bool:
|
||||||
|
"""
|
||||||
|
Validate regex-based entity extraction response.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
response: The parsed JSON response from regex extractors
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
bool: True if valid, False otherwise
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
|
||||||
|
logger.debug(f"Regex entity validation passed for response: {response}")
|
||||||
|
return True
|
||||||
|
except ValidationError as e:
|
||||||
|
logger.warning(f"Regex entity validation failed: {e}")
|
||||||
|
logger.warning(f"Response that failed validation: {response}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def _validate_linkage_content(cls, response: Dict[str, Any]) -> bool:
|
||||||
|
"""
|
||||||
|
Additional content validation for entity linkage responses.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
response: The parsed JSON response from LLM
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
bool: True if content is valid, False otherwise
|
||||||
|
"""
|
||||||
|
entity_groups = response.get('entity_groups', [])
|
||||||
|
|
||||||
|
for group in entity_groups:
|
||||||
|
# Validate group type
|
||||||
|
group_type = group.get('group_type', '')
|
||||||
|
if group_type not in ['公司名称', '人名']:
|
||||||
|
logger.warning(f"Invalid group_type: {group_type}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Validate entities in group
|
||||||
|
entities = group.get('entities', [])
|
||||||
|
if not entities:
|
||||||
|
logger.warning("Empty entity group found")
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Check that exactly one entity is marked as primary
|
||||||
|
primary_count = sum(1 for entity in entities if entity.get('is_primary', False))
|
||||||
|
if primary_count != 1:
|
||||||
|
logger.warning(f"Group must have exactly one primary entity, found {primary_count}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Validate entity types within group
|
||||||
|
for entity in entities:
|
||||||
|
entity_type = entity.get('type', '')
|
||||||
|
if group_type == '公司名称' and not any(keyword in entity_type for keyword in ['公司', 'Company']):
|
||||||
|
logger.warning(f"Company group contains non-company entity: {entity_type}")
|
||||||
|
return False
|
||||||
|
elif group_type == '人名' and not any(keyword in entity_type for keyword in ['人名', '英文人名']):
|
||||||
|
logger.warning(f"Person group contains non-person entity: {entity_type}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def validate_response_by_type(cls, response: Dict[str, Any], response_type: str) -> bool:
|
||||||
|
"""
|
||||||
|
Generic validator that routes to appropriate validation method based on response type.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
response: The parsed JSON response from LLM
|
||||||
|
response_type: Type of response ('entity_extraction', 'entity_linkage', 'regex_entity')
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
bool: True if valid, False otherwise
|
||||||
|
"""
|
||||||
|
validators = {
|
||||||
|
'entity_extraction': cls.validate_entity_extraction,
|
||||||
|
'entity_linkage': cls.validate_entity_linkage,
|
||||||
|
'regex_entity': cls.validate_regex_entity
|
||||||
|
}
|
||||||
|
|
||||||
|
validator = validators.get(response_type)
|
||||||
|
if not validator:
|
||||||
|
logger.error(f"Unknown response type: {response_type}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
return validator(response)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_validation_errors(cls, response: Dict[str, Any], response_type: str) -> Optional[str]:
|
||||||
|
"""
|
||||||
|
Get detailed validation errors for debugging.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
response: The parsed JSON response from LLM
|
||||||
|
response_type: Type of response
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Optional[str]: Error message or None if valid
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
if response_type == 'entity_extraction':
|
||||||
|
validate(instance=response, schema=cls.ENTITY_EXTRACTION_SCHEMA)
|
||||||
|
elif response_type == 'entity_linkage':
|
||||||
|
validate(instance=response, schema=cls.ENTITY_LINKAGE_SCHEMA)
|
||||||
|
if not cls._validate_linkage_content(response):
|
||||||
|
return "Content validation failed for entity linkage"
|
||||||
|
elif response_type == 'regex_entity':
|
||||||
|
validate(instance=response, schema=cls.REGEX_ENTITY_SCHEMA)
|
||||||
|
else:
|
||||||
|
return f"Unknown response type: {response_type}"
|
||||||
|
|
||||||
|
return None
|
||||||
|
except ValidationError as e:
|
||||||
|
return f"Schema validation error: {e}"
|
||||||
|
|
@ -0,0 +1,33 @@
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from .core.config import settings
from .api.endpoints import files
from .core.database import engine, Base

# Create database tables
Base.metadata.create_all(bind=engine)

app = FastAPI(
    title=settings.PROJECT_NAME,
    openapi_url=f"{settings.API_V1_STR}/openapi.json"
)

# Set up CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production, replace with specific origins
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Include routers
app.include_router(
    files.router,
    prefix=f"{settings.API_V1_STR}/files",
    tags=["files"]
)

@app.get("/")
async def root():
    return {"message": "Welcome to Legal Document Masker API"}
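# Local run outside docker-compose (module path assumed from the package-relative imports
# above, i.e. this file living at app/main.py):
#   uvicorn app.main:app --reload --port 8000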
@ -0,0 +1,22 @@
from sqlalchemy import Column, String, DateTime, Text
from datetime import datetime
import uuid
from ..core.database import Base

class FileStatus(str):
    NOT_STARTED = "not_started"
    PROCESSING = "processing"
    SUCCESS = "success"
    FAILED = "failed"

class File(Base):
    __tablename__ = "files"

    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    filename = Column(String(255), nullable=False)
    original_path = Column(String(255), nullable=False)
    processed_path = Column(String(255))
    status = Column(String(20), nullable=False, default=FileStatus.NOT_STARTED)
    error_message = Column(Text)
    created_at = Column(DateTime, nullable=False, default=datetime.utcnow)
    updated_at = Column(DateTime, nullable=False, default=datetime.utcnow, onupdate=datetime.utcnow)
@ -0,0 +1,21 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional
from uuid import UUID

class FileBase(BaseModel):
    filename: str
    status: str
    error_message: Optional[str] = None

class FileResponse(FileBase):
    id: UUID
    created_at: datetime
    updated_at: datetime

    class Config:
        from_attributes = True

class FileList(BaseModel):
    files: list[FileResponse]
    total: int
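# Usage sketch: with from_attributes enabled, an ORM File row converts directly
# (pydantic v2 API; on pydantic v1 this would be orm_mode plus from_orm instead):
#   FileResponse.model_validate(file_row)
#   FileList(files=[FileResponse.model_validate(f) for f in rows], total=len(rows))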
@ -0,0 +1,87 @@
from celery import Celery
from ..core.config import settings
from ..models.file import File, FileStatus
from sqlalchemy.orm import Session
from ..core.database import SessionLocal
import sys
import os
from ..core.services.document_service import DocumentService
from pathlib import Path
from fastapi import HTTPException


celery = Celery(
    'file_service',
    broker=settings.CELERY_BROKER_URL,
    backend=settings.CELERY_RESULT_BACKEND
)

def delete_file(file_id: str):
    """
    Delete a file and its associated records.
    This will:
    1. Delete the database record
    2. Delete the original uploaded file
    3. Delete the processed markdown file (if it exists)
    """
    db = SessionLocal()
    try:
        # Get the file record
        file = db.query(File).filter(File.id == file_id).first()
        if not file:
            raise HTTPException(status_code=404, detail="File not found")

        # Delete the original file if it exists
        if file.original_path and os.path.exists(file.original_path):
            os.remove(file.original_path)

        # Delete the processed file if it exists
        if file.processed_path and os.path.exists(file.processed_path):
            os.remove(file.processed_path)

        # Delete the database record
        db.delete(file)
        db.commit()

    except Exception as e:
        db.rollback()
        raise HTTPException(status_code=500, detail=f"Error deleting file: {str(e)}")
    finally:
        db.close()

@celery.task
def process_file(file_id: str):
    db = SessionLocal()
    try:
        file = db.query(File).filter(File.id == file_id).first()
        if not file:
            return

        # Update status to processing
        file.status = FileStatus.PROCESSING
        db.commit()

        try:
            # Process the file using your existing masking system
            process_service = DocumentService()

            # Determine output path using file_id with .md extension
            output_filename = f"{file_id}.md"
            output_path = str(settings.PROCESSED_FOLDER / output_filename)

            # Process document with both input and output paths
            process_service.process_document(file.original_path, output_path)

            # Update file record with processed path
            file.processed_path = output_path
            file.status = FileStatus.SUCCESS
            db.commit()

        except Exception as e:
            file.status = FileStatus.FAILED
            file.error_message = str(e)
            db.commit()
            raise

    finally:
        db.close()
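# Enqueueing from the API layer is a one-liner; the worker started by docker-compose
# ("celery -A app.services.file_service worker") picks it up via the Redis broker:
#   process_file.delay(file.id)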
@ -0,0 +1,37 @@
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./storage:/app/storage
      - ./legal_doc_masker.db:/app/legal_doc_masker.db
    env_file:
      - .env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
    depends_on:
      - redis

  celery_worker:
    build: .
    command: celery -A app.services.file_service worker --loglevel=info
    volumes:
      - ./storage:/app/storage
      - ./legal_doc_masker.db:/app/legal_doc_masker.db
    env_file:
      - .env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
    depends_on:
      - redis
      - api

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
@ -0,0 +1,127 @@
|
||||||
|
[2025-07-14 14:20:19,015: INFO/ForkPoolWorker-4] Raw response from LLM: {
|
||||||
|
celery_worker-1 | "entities": []
|
||||||
|
celery_worker-1 | }
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:19,016: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:19,020: INFO/ForkPoolWorker-4] Calling ollama to generate case numbers mapping for chunk (attempt 1/3):
|
||||||
|
celery_worker-1 | 你是一个专业的法律文本实体识别助手。请从以下文本中抽取出所有需要脱敏的敏感信息,并按照指定的类别进行分类。请严格按照JSON格式输出结果。
|
||||||
|
celery_worker-1 |
|
||||||
|
celery_worker-1 | 实体类别包括:
|
||||||
|
celery_worker-1 | - 案号
|
||||||
|
celery_worker-1 |
|
||||||
|
celery_worker-1 | 待处理文本:
|
||||||
|
celery_worker-1 |
|
||||||
|
celery_worker-1 |
|
||||||
|
celery_worker-1 | 二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
|
||||||
|
celery_worker-1 |
|
||||||
|
celery_worker-1 | 29. 本判决为终审判决。
|
||||||
|
celery_worker-1 |
|
||||||
|
celery_worker-1 | 审 判 长 史晓霞审 判 员 邓青菁审 判 员 李 淼二〇二二年七月七日法 官 助 理 黎 铧书 记 员 郑海兴
|
||||||
|
celery_worker-1 |
|
||||||
|
celery_worker-1 | 输出格式:
|
||||||
|
celery_worker-1 | {
|
||||||
|
celery_worker-1 | "entities": [
|
||||||
|
celery_worker-1 | {"text": "原始文本内容", "type": "案号"},
|
||||||
|
celery_worker-1 | ...
|
||||||
|
celery_worker-1 | ]
|
||||||
|
celery_worker-1 | }
|
||||||
|
celery_worker-1 |
|
||||||
|
celery_worker-1 | 请严格按照JSON格式输出结果。
|
||||||
|
celery_worker-1 |
|
||||||
|
api-1 | INFO: 192.168.65.1:60045 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||||
|
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||||
|
api-1 | INFO: 192.168.65.1:34054 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||||
|
api-1 | INFO: 192.168.65.1:22084 - "GET /api/v1/files/files HTTP/1.1" 200 OK
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,279: INFO/ForkPoolWorker-4] Raw response from LLM: {
|
||||||
|
celery_worker-1 | "entities": []
|
||||||
|
celery_worker-1 | }
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,281: INFO/ForkPoolWorker-4] Parsed mapping: {'entities': []}
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,287: INFO/ForkPoolWorker-4] Chunk mapping: [{'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Final chunk mappings: [{'entities': [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}]}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}]}, {'entities': [{'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}]}, {'entities': [{'text': '服务合同', 'type': '项目名'}]}, {'entities': [{'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}]}, {'entities': [{'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}]}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}]}, {'entities': [{'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}]}, {'entities': [{'text': '《计算机设备采购合同》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '丰复久信公司', 'type': '公司名称'}, {'text': '中研智创公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': [{'text': '《服务合同书》', 'type': '项目名'}]}, {'entities': []}, {'entities': []}, {'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}]}, {'entities': []}, {'entities': []}, {'entities': []}]
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '丰复久信公司', 'type': '公司名称'}
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '中研智创公司', 'type': '公司名称'}
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Duplicate entity found: {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Merged 22 unique entities
|
||||||
|
celery_worker-1 | [2025-07-14 14:20:31,288: INFO/ForkPoolWorker-4] Unique entities: [{'text': '郭东军', 'type': '人名'}, {'text': '王欢子', 'type': '人名'}, {'text': '北京丰复久信营销科技有限公司', 'type': '公司名称'}, {'text': '丰复久信公司', 'type': '公司名称简称'}, {'text': '中研智创区块链技术有限公司', 'type': '公司名称'}, {'text': '中研智才公司', 'type': '公司名称简称'}, {'text': '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室', 'type': '地址'}, {'text': '天津市津南区双港镇工业园区优谷产业园5 号楼-1505', 'type': '地址'}, {'text': '服务合同', 'type': '项目名'}, {'text': '(2022)京 03 民终 3852 号', 'type': '案号'}, {'text': '(2020)京0105 民初69754 号', 'type': '案号'}, {'text': '李圣艳', 'type': '人名'}, {'text': '闫向东', 'type': '人名'}, {'text': '李敏', 'type': '人名'}, {'text': '布兰登·斯密特', 'type': '英文人名'}, {'text': '中研智创公司', 'type': '公司名称'}, {'text': '丰复久信', 'type': '公司名称简称'}, {'text': '中研智创', 'type': '公司名称简称'}, {'text': '上海市', 'type': '地址'}, {'text': '北京', 'type': '地址'}, {'text': '《计算机设备采购合同》', 'type': '项目名'}, {'text': '《服务合同书》', 'type': '项目名'}]
|
celery_worker-1 | [2025-07-14 14:20:31,289: INFO/ForkPoolWorker-4] Calling ollama to generate entity linkage (attempt 1/3)
api-1 | INFO: 192.168.65.1:52168 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61426 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:30702 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:48159 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:16860 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:21262 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:45564 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:32142 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:27769 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:21196 - "GET /api/v1/files/files HTTP/1.1" 200 OK
celery_worker-1 | [2025-07-14 14:21:21,436: INFO/ForkPoolWorker-4] Raw entity linkage response from LLM: {
celery_worker-1 |   "entity_groups": [
celery_worker-1 |     {
celery_worker-1 |       "group_id": "group_1",
celery_worker-1 |       "group_type": "公司名称",
celery_worker-1 |       "entities": [
celery_worker-1 |         {
celery_worker-1 |           "text": "北京丰复久信营销科技有限公司",
celery_worker-1 |           "type": "公司名称",
celery_worker-1 |           "is_primary": true
celery_worker-1 |         },
celery_worker-1 |         {
celery_worker-1 |           "text": "丰复久信公司",
celery_worker-1 |           "type": "公司名称简称",
celery_worker-1 |           "is_primary": false
celery_worker-1 |         },
celery_worker-1 |         {
celery_worker-1 |           "text": "丰复久信",
celery_worker-1 |           "type": "公司名称简称",
celery_worker-1 |           "is_primary": false
celery_worker-1 |         }
celery_worker-1 |       ]
celery_worker-1 |     },
celery_worker-1 |     {
celery_worker-1 |       "group_id": "group_2",
celery_worker-1 |       "group_type": "公司名称",
celery_worker-1 |       "entities": [
celery_worker-1 |         {
celery_worker-1 |           "text": "中研智创区块链技术有限公司",
celery_worker-1 |           "type": "公司名称",
celery_worker-1 |           "is_primary": true
celery_worker-1 |         },
celery_worker-1 |         {
celery_worker-1 |           "text": "中研智创公司",
celery_worker-1 |           "type": "公司名称简称",
celery_worker-1 |           "is_primary": false
celery_worker-1 |         },
celery_worker-1 |         {
celery_worker-1 |           "text": "中研智创",
celery_worker-1 |           "type": "公司名称简称",
celery_worker-1 |           "is_primary": false
celery_worker-1 |         }
celery_worker-1 |       ]
celery_worker-1 |     }
celery_worker-1 |   ]
celery_worker-1 | }
celery_worker-1 | [2025-07-14 14:21:21,437: INFO/ForkPoolWorker-4] Parsed entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Successfully created entity linkage with 2 groups
celery_worker-1 | [2025-07-14 14:21:21,445: INFO/ForkPoolWorker-4] Entity linkage: {'entity_groups': [{'group_id': 'group_1', 'group_type': '公司名称', 'entities': [{'text': '北京丰复久信营销科技有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '丰复久信公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '丰复久信', 'type': '公司名称简称', 'is_primary': False}]}, {'group_id': 'group_2', 'group_type': '公司名称', 'entities': [{'text': '中研智创区块链技术有限公司', 'type': '公司名称', 'is_primary': True}, {'text': '中研智创公司', 'type': '公司名称简称', 'is_primary': False}, {'text': '中研智创', 'type': '公司名称简称', 'is_primary': False}]}]}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Generated masked mapping for 22 entities
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Combined mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司甲', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '(2020)京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司丁', '丰复久信': '某公司戊', '中研智创': '某公司己', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '北京丰复久信营销科技有限公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信公司' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '丰复久信' to '北京丰复久信营销科技有限公司' with masked name '某公司'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创区块链技术有限公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创公司' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Linked entity '中研智创' to '中研智创区块链技术有限公司' with masked name '某公司乙'
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Final mapping: {'郭东军': '某', '王欢子': '某甲', '北京丰复久信营销科技有限公司': '某公司', '丰复久信公司': '某公司', '中研智创区块链技术有限公司': '某公司乙', '中研智才公司': '某公司丙', '北京市海淀区北小马厂6 号1 号楼华天大厦1306 室': '某乙', '天津市津南区双港镇工业园区优谷产业园5 号楼-1505': '某丙', '服务合同': '某丁', '(2022)京 03 民终 3852 号': '某戊', '(2020)京0105 民初69754 号': '某己', '李圣艳': '某庚', '闫向东': '某辛', '李敏': '某壬', '布兰登·斯密特': '某癸', '中研智创公司': '某公司乙', '丰复久信': '某公司', '中研智创': '某公司乙', '上海市': '某11', '北京': '某12', '《计算机设备采购合同》': '某13', '《服务合同书》': '某14'}
celery_worker-1 | [2025-07-14 14:21:21,446: INFO/ForkPoolWorker-4] Successfully masked content
celery_worker-1 | [2025-07-14 14:21:21,449: INFO/ForkPoolWorker-4] Successfully saved masked content to /app/storage/processed/47522ea9-c259-4304-bfe4-1d3ed6902ede.md
celery_worker-1 | [2025-07-14 14:21:21,470: INFO/ForkPoolWorker-4] Task app.services.file_service.process_file[5cfbca4c-0f6f-4c71-a66b-b22ee2d28139] succeeded in 311.847165101s: None
api-1 | INFO: 192.168.65.1:33432 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:40073 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:29550 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61350 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:61755 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:63726 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:43446 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:45624 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:25256 - "GET /api/v1/files/files HTTP/1.1" 200 OK
api-1 | INFO: 192.168.65.1:43464 - "GET /api/v1/files/files HTTP/1.1" 200 OK
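The log above shows the entity-linkage step collapsing aliases such as 丰复久信公司 and 丰复久信 into one group that shares the full company name's masked form. As a rough sketch only (not the repository's actual NerProcessor code, whose masking internals are not shown here), applying such a final mapping with longest-match-first substitution could look like this:

```python
# Minimal sketch, assuming a plain dict mapping entity text -> masked name.
# Not the project's NerProcessor implementation.
def apply_mapping(text: str, mapping: dict) -> str:
    # Replace longer surface forms first so a full company name is masked
    # before its shorter abbreviations pre-empt the match.
    for entity in sorted(mapping, key=len, reverse=True):
        text = text.replace(entity, mapping[entity])
    return text


if __name__ == "__main__":
    final_mapping = {
        "北京丰复久信营销科技有限公司": "某公司",
        "丰复久信公司": "某公司",
        "丰复久信": "某公司",
        "郭东军": "某",
    }
    sample = "上诉人北京丰复久信营销科技有限公司(以下简称丰复久信公司),法定代表人:郭东军。"
    print(apply_mapping(sample, final_mapping))
    # -> 上诉人某公司(以下简称某公司),法定代表人:某。
```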
@ -0,0 +1,6 @@
{
  "name": "backend",
  "lockfileVersion": 3,
  "requires": true,
  "packages": {}
}
@ -0,0 +1,32 @@
# FastAPI and server
fastapi>=0.104.0
uvicorn>=0.24.0
python-multipart>=0.0.6
websockets>=12.0

# Database
sqlalchemy>=2.0.0
alembic>=1.12.0

# Background tasks
celery>=5.3.0
redis>=5.0.0

# Security
python-jose[cryptography]>=3.3.0
passlib[bcrypt]>=1.7.4
python-dotenv>=1.0.0

# Testing
pytest>=7.4.0
httpx>=0.25.0

# Existing project dependencies
pydantic-settings>=2.0.0
watchdog==2.1.6
requests==2.28.1
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0
# magic-pdf[full]
jsonschema>=4.20.0
@ -0,0 +1,62 @@
import pytest
from app.core.document_handlers.ner_processor import NerProcessor


def test_generate_masked_mapping():
    processor = NerProcessor()
    unique_entities = [
        {'text': '李雷', 'type': '人名'},
        {'text': '李明', 'type': '人名'},
        {'text': '王强', 'type': '人名'},
        {'text': 'Acme Manufacturing Inc.', 'type': '英文公司名', 'industry': 'manufacturing'},
        {'text': 'Google LLC', 'type': '英文公司名'},
        {'text': 'A公司', 'type': '公司名称'},
        {'text': 'B公司', 'type': '公司名称'},
        {'text': 'John Smith', 'type': '英文人名'},
        {'text': 'Elizabeth Windsor', 'type': '英文人名'},
        {'text': '华梦龙光伏项目', 'type': '项目名'},
        {'text': '案号12345', 'type': '案号'},
        {'text': '310101198802080000', 'type': '身份证号'},
        {'text': '9133021276453538XT', 'type': '社会信用代码'},
    ]
    linkage = {
        'entity_groups': [
            {
                'group_id': 'g1',
                'group_type': '公司名称',
                'entities': [
                    {'text': 'A公司', 'type': '公司名称', 'is_primary': True},
                    {'text': 'B公司', 'type': '公司名称', 'is_primary': False},
                ]
            },
            {
                'group_id': 'g2',
                'group_type': '人名',
                'entities': [
                    {'text': '李雷', 'type': '人名', 'is_primary': True},
                    {'text': '李明', 'type': '人名', 'is_primary': False},
                ]
            }
        ]
    }
    mapping = processor._generate_masked_mapping(unique_entities, linkage)
    # 人名 (Chinese person names keep the surname)
    assert mapping['李雷'].startswith('李某')
    assert mapping['李明'].startswith('李某')
    assert mapping['王强'].startswith('王某')
    # 英文公司名 (English company names)
    assert mapping['Acme Manufacturing Inc.'] == 'MANUFACTURING'
    assert mapping['Google LLC'] == 'COMPANY'
    # 公司名同组 (linked company names share one masked name)
    assert mapping['A公司'] == mapping['B公司']
    assert mapping['A公司'].endswith('公司')
    # 英文人名 (English person names)
    assert mapping['John Smith'] == 'J*** S***'
    assert mapping['Elizabeth Windsor'] == 'E*** W***'
    # 项目名 (project names)
    assert mapping['华梦龙光伏项目'].endswith('项目')
    # 案号 (case numbers)
    assert mapping['案号12345'] == '***'
    # 身份证号 (ID numbers)
    assert mapping['310101198802080000'] == 'XXXXXX'
    # 社会信用代码 (unified social credit codes)
    assert mapping['9133021276453538XT'] == 'XXXXXXXX'
@ -1,2 +0,0 @@
rm ./doc_src/*.md
cp ./doc/*.md ./doc_src/
@ -0,0 +1,105 @@
version: '3.8'

services:
  # Mineru API Service
  mineru-api:
    build:
      context: ./mineru
      dockerfile: Dockerfile
    platform: linux/arm64
    ports:
      - "8001:8000"
    volumes:
      - ./mineru/storage/uploads:/app/storage/uploads
      - ./mineru/storage/processed:/app/storage/processed
    environment:
      - PYTHONUNBUFFERED=1
      - MINERU_MODEL_SOURCE=local
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    networks:
      - app-network

  # Backend API Service
  backend-api:
    build:
      context: ./backend
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    volumes:
      - ./backend/storage:/app/storage
      - ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
    env_file:
      - ./backend/.env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
      - MINERU_API_URL=http://mineru-api:8000
    depends_on:
      - redis
      - mineru-api
    networks:
      - app-network

  # Celery Worker
  celery-worker:
    build:
      context: ./backend
      dockerfile: Dockerfile
    command: celery -A app.services.file_service worker --loglevel=info
    volumes:
      - ./backend/storage:/app/storage
      - ./backend/legal_doc_masker.db:/app/legal_doc_masker.db
    env_file:
      - ./backend/.env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
      - MINERU_API_URL=http://mineru-api:8000
    depends_on:
      - redis
      - backend-api
    networks:
      - app-network

  # Redis Service
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
    networks:
      - app-network

  # Frontend Service
  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
      args:
        - REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
    ports:
      - "3000:80"
    env_file:
      - ./frontend/.env
    environment:
      - NODE_ENV=production
      - REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
    restart: unless-stopped
    depends_on:
      - backend-api
    networks:
      - app-network

networks:
  app-network:
    driver: bridge

volumes:
  uploads:
  processed:
@ -0,0 +1,168 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# Docker Image Export Script
|
||||||
|
# Exports all project Docker images for migration to another environment
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
echo "🚀 Legal Document Masker - Docker Image Export"
|
||||||
|
echo "=============================================="
|
||||||
|
|
||||||
|
# Function to check if Docker is running
|
||||||
|
check_docker() {
|
||||||
|
if ! docker info > /dev/null 2>&1; then
|
||||||
|
echo "❌ Docker is not running. Please start Docker and try again."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "✅ Docker is running"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to check if images exist
|
||||||
|
check_images() {
|
||||||
|
echo "🔍 Checking for required images..."
|
||||||
|
|
||||||
|
local missing_images=()
|
||||||
|
|
||||||
|
if ! docker images | grep -q "legal-doc-masker-backend-api"; then
|
||||||
|
missing_images+=("legal-doc-masker-backend-api")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! docker images | grep -q "legal-doc-masker-frontend"; then
|
||||||
|
missing_images+=("legal-doc-masker-frontend")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
|
||||||
|
missing_images+=("legal-doc-masker-mineru-api")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! docker images | grep -q "redis:alpine"; then
|
||||||
|
missing_images+=("redis:alpine")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${#missing_images[@]} -ne 0 ]; then
|
||||||
|
echo "❌ Missing images: ${missing_images[*]}"
|
||||||
|
echo "Please build the images first using: docker-compose build"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "✅ All required images found"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to create export directory
|
||||||
|
create_export_dir() {
|
||||||
|
local export_dir="docker-images-export-$(date +%Y%m%d-%H%M%S)"
|
||||||
|
mkdir -p "$export_dir"
|
||||||
|
cd "$export_dir"
|
||||||
|
echo "📁 Created export directory: $export_dir"
|
||||||
|
echo "$export_dir"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to export images
|
||||||
|
export_images() {
|
||||||
|
local export_dir="$1"
|
||||||
|
|
||||||
|
echo "📦 Exporting Docker images..."
|
||||||
|
|
||||||
|
# Export backend image
|
||||||
|
echo " 📦 Exporting backend-api image..."
|
||||||
|
docker save legal-doc-masker-backend-api:latest -o backend-api.tar
|
||||||
|
|
||||||
|
# Export frontend image
|
||||||
|
echo " 📦 Exporting frontend image..."
|
||||||
|
docker save legal-doc-masker-frontend:latest -o frontend.tar
|
||||||
|
|
||||||
|
# Export mineru image
|
||||||
|
echo " 📦 Exporting mineru-api image..."
|
||||||
|
docker save legal-doc-masker-mineru-api:latest -o mineru-api.tar
|
||||||
|
|
||||||
|
# Export redis image
|
||||||
|
echo " 📦 Exporting redis image..."
|
||||||
|
docker save redis:alpine -o redis.tar
|
||||||
|
|
||||||
|
echo "✅ All images exported successfully!"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to show export summary
|
||||||
|
show_summary() {
|
||||||
|
echo ""
|
||||||
|
echo "📊 Export Summary:"
|
||||||
|
echo "=================="
|
||||||
|
ls -lh *.tar
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "📋 Files to transfer:"
|
||||||
|
echo "===================="
|
||||||
|
for file in *.tar; do
|
||||||
|
echo " - $file"
|
||||||
|
done
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "💾 Total size: $(du -sh . | cut -f1)"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to create compressed archive
|
||||||
|
create_archive() {
|
||||||
|
echo ""
|
||||||
|
echo "🗜️ Creating compressed archive..."
|
||||||
|
|
||||||
|
local archive_name="legal-doc-masker-images-$(date +%Y%m%d-%H%M%S).tar.gz"
|
||||||
|
tar -czf "$archive_name" *.tar
|
||||||
|
|
||||||
|
echo "✅ Created archive: $archive_name"
|
||||||
|
echo "📊 Archive size: $(du -sh "$archive_name" | cut -f1)"
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "📋 Transfer options:"
|
||||||
|
echo "==================="
|
||||||
|
echo "1. Transfer individual .tar files"
|
||||||
|
echo "2. Transfer compressed archive: $archive_name"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to show transfer instructions
|
||||||
|
show_transfer_instructions() {
|
||||||
|
echo ""
|
||||||
|
echo "📤 Transfer Instructions:"
|
||||||
|
echo "========================"
|
||||||
|
echo ""
|
||||||
|
echo "Option 1: Transfer individual files"
|
||||||
|
echo "-----------------------------------"
|
||||||
|
echo "scp *.tar user@target-server:/path/to/destination/"
|
||||||
|
echo ""
|
||||||
|
echo "Option 2: Transfer compressed archive"
|
||||||
|
echo "-------------------------------------"
|
||||||
|
echo "scp legal-doc-masker-images-*.tar.gz user@target-server:/path/to/destination/"
|
||||||
|
echo ""
|
||||||
|
echo "Option 3: USB Drive"
|
||||||
|
echo "-------------------"
|
||||||
|
echo "cp *.tar /Volumes/USB_DRIVE/docker-images/"
|
||||||
|
echo "cp legal-doc-masker-images-*.tar.gz /Volumes/USB_DRIVE/"
|
||||||
|
echo ""
|
||||||
|
echo "Option 4: Cloud Storage"
|
||||||
|
echo "----------------------"
|
||||||
|
echo "aws s3 cp *.tar s3://your-bucket/docker-images/"
|
||||||
|
echo "aws s3 cp legal-doc-masker-images-*.tar.gz s3://your-bucket/docker-images/"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Main execution
|
||||||
|
main() {
|
||||||
|
check_docker
|
||||||
|
check_images
|
||||||
|
|
||||||
|
local export_dir=$(create_export_dir)
|
||||||
|
export_images "$export_dir"
|
||||||
|
show_summary
|
||||||
|
create_archive
|
||||||
|
show_transfer_instructions
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "🎉 Export completed successfully!"
|
||||||
|
echo "📁 Export location: $(pwd)"
|
||||||
|
echo ""
|
||||||
|
echo "Next steps:"
|
||||||
|
echo "1. Transfer the files to your target environment"
|
||||||
|
echo "2. Use import-images.sh on the target environment"
|
||||||
|
echo "3. Copy docker-compose.yml and other config files"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run main function
|
||||||
|
main "$@"
|
||||||
|
|
@ -0,0 +1,11 @@
node_modules
npm-debug.log
build
.git
.gitignore
README.md
.env
.env.local
.env.development.local
.env.test.local
.env.production.local
@ -0,0 +1,2 @@
# REACT_APP_API_BASE_URL=http://192.168.2.203:8000/api/v1
REACT_APP_API_BASE_URL=http://localhost:8000/api/v1
@ -0,0 +1,33 @@
# Build stage
FROM node:18-alpine as build

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies
RUN npm ci

# Copy source code
COPY . .

# Build the app with environment variables
ARG REACT_APP_API_BASE_URL
ENV REACT_APP_API_BASE_URL=$REACT_APP_API_BASE_URL
RUN npm run build

# Production stage
FROM nginx:alpine

# Copy built assets from build stage
COPY --from=build /app/build /usr/share/nginx/html

# Copy nginx configuration
COPY nginx.conf /etc/nginx/conf.d/default.conf

# Expose port 80
EXPOSE 80

# Start nginx
CMD ["nginx", "-g", "daemon off;"]
@ -0,0 +1,55 @@
# Legal Document Masker Frontend

This is the frontend application for the Legal Document Masker service. It provides a user interface for uploading legal documents, monitoring their processing status, and downloading the masked versions.

## Features

- Drag and drop file upload
- Real-time status updates
- File list with processing status
- Multi-file selection and download
- Modern Material-UI interface

## Prerequisites

- Node.js (v14 or higher)
- npm (v6 or higher)

## Installation

1. Install dependencies:
```bash
npm install
```

2. Start the development server:
```bash
npm start
```

The application will be available at http://localhost:3000

## Development

The frontend is built with:
- React 18
- TypeScript
- Material-UI
- React Query for data fetching
- React Dropzone for file uploads

## Building for Production

To create a production build:

```bash
npm run build
```

The build artifacts will be stored in the `build/` directory.

## Environment Variables

The following environment variables can be configured:

- `REACT_APP_API_BASE_URL`: The base URL of the backend API (default: http://localhost:8000/api/v1); see the build example below
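As an illustration only (the IP address is the example value from the sample `.env`, not a required host), the value can be baked into the image at build time through the Dockerfile's build argument:

```bash
# Example: build the frontend image against a non-default backend URL
docker build --build-arg REACT_APP_API_BASE_URL=http://192.168.2.203:8000/api/v1 \
  -t legal-doc-masker-frontend ./frontend
```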
@ -0,0 +1,24 @@
version: '3.8'

services:
  frontend:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        - REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
    ports:
      - "3000:80"
    env_file:
      - .env
    environment:
      - NODE_ENV=production
      - REACT_APP_API_BASE_URL=${REACT_APP_API_BASE_URL}
    restart: unless-stopped
    networks:
      - app-network

networks:
  app-network:
    driver: bridge
@ -0,0 +1,25 @@
server {
    listen 80;
    server_name localhost;

    location / {
        root /usr/share/nginx/html;
        index index.html;
        try_files $uri $uri/ /index.html;
    }

    # Cache static assets
    location /static/ {
        root /usr/share/nginx/html;
        expires 1y;
        add_header Cache-Control "public, no-transform";
    }

    # Enable gzip compression
    gzip on;
    gzip_vary on;
    gzip_min_length 10240;
    gzip_proxied expired no-cache no-store private auth;
    gzip_types text/plain text/css text/xml text/javascript application/x-javascript application/xml application/javascript;
    gzip_disable "MSIE [1-6]\.";
}

File diff suppressed because it is too large
@ -0,0 +1,50 @@
{
  "name": "legal-doc-masker-frontend",
  "version": "0.1.0",
  "private": true,
  "dependencies": {
    "@emotion/react": "^11.11.3",
    "@emotion/styled": "^11.11.0",
    "@mui/icons-material": "^5.15.10",
    "@mui/material": "^5.15.10",
    "@testing-library/jest-dom": "^5.17.0",
    "@testing-library/react": "^13.4.0",
    "@testing-library/user-event": "^13.5.0",
    "@types/jest": "^27.5.2",
    "@types/node": "^16.18.80",
    "@types/react": "^18.2.55",
    "@types/react-dom": "^18.2.19",
    "axios": "^1.6.7",
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-dropzone": "^14.2.3",
    "react-query": "^3.39.3",
    "react-scripts": "5.0.1",
    "typescript": "^4.9.5",
    "web-vitals": "^2.1.4"
  },
  "scripts": {
    "start": "react-scripts start",
    "build": "react-scripts build",
    "test": "react-scripts test",
    "eject": "react-scripts eject"
  },
  "eslintConfig": {
    "extends": [
      "react-app",
      "react-app/jest"
    ]
  },
  "browserslist": {
    "production": [
      ">0.2%",
      "not dead",
      "not op_mini all"
    ],
    "development": [
      "last 1 chrome version",
      "last 1 firefox version",
      "last 1 safari version"
    ]
  }
}
@ -0,0 +1,20 @@
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <link rel="icon" href="%PUBLIC_URL%/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <meta name="theme-color" content="#000000" />
    <meta
      name="description"
      content="Legal Document Masker - Upload and process legal documents"
    />
    <link rel="apple-touch-icon" href="%PUBLIC_URL%/logo192.png" />
    <link rel="manifest" href="%PUBLIC_URL%/manifest.json" />
    <title>Legal Document Masker</title>
  </head>
  <body>
    <noscript>You need to enable JavaScript to run this app.</noscript>
    <div id="root"></div>
  </body>
</html>
@ -0,0 +1,15 @@
{
  "short_name": "Legal Doc Masker",
  "name": "Legal Document Masker",
  "icons": [
    {
      "src": "favicon.ico",
      "sizes": "64x64 32x32 24x24 16x16",
      "type": "image/x-icon"
    }
  ],
  "start_url": ".",
  "display": "standalone",
  "theme_color": "#000000",
  "background_color": "#ffffff"
}
@ -0,0 +1,58 @@
import React, { useEffect, useState } from 'react';
import { Container, Typography, Box } from '@mui/material';
import { useQuery, useQueryClient } from 'react-query';
import FileUpload from './components/FileUpload';
import FileList from './components/FileList';
import { File } from './types/file';
import { api } from './services/api';

function App() {
  const queryClient = useQueryClient();
  const [files, setFiles] = useState<File[]>([]);

  const { data, isLoading, error } = useQuery<File[]>('files', api.listFiles, {
    refetchInterval: 5000, // Poll every 5 seconds
  });

  useEffect(() => {
    if (data) {
      setFiles(data);
    }
  }, [data]);

  const handleUploadComplete = () => {
    queryClient.invalidateQueries('files');
  };

  if (isLoading) {
    return (
      <Container>
        <Typography>Loading...</Typography>
      </Container>
    );
  }

  if (error) {
    return (
      <Container>
        <Typography color="error">Error loading files</Typography>
      </Container>
    );
  }

  return (
    <Container maxWidth="lg">
      <Box sx={{ my: 4 }}>
        <Typography variant="h4" component="h1" gutterBottom>
          Legal Document Masker
        </Typography>
        <Box sx={{ mb: 4 }}>
          <FileUpload onUploadComplete={handleUploadComplete} />
        </Box>
        <FileList files={files} onFileStatusChange={handleUploadComplete} />
      </Box>
    </Container>
  );
}

export default App;
@ -0,0 +1,230 @@
import React, { useState } from 'react';
import {
  Table,
  TableBody,
  TableCell,
  TableContainer,
  TableHead,
  TableRow,
  Paper,
  IconButton,
  Checkbox,
  Button,
  Chip,
  Dialog,
  DialogTitle,
  DialogContent,
  DialogActions,
  Typography,
} from '@mui/material';
import { Download as DownloadIcon, Delete as DeleteIcon } from '@mui/icons-material';
import { File, FileStatus } from '../types/file';
import { api } from '../services/api';

interface FileListProps {
  files: File[];
  onFileStatusChange: () => void;
}

const FileList: React.FC<FileListProps> = ({ files, onFileStatusChange }) => {
  const [selectedFiles, setSelectedFiles] = useState<string[]>([]);
  const [deleteDialogOpen, setDeleteDialogOpen] = useState(false);
  const [fileToDelete, setFileToDelete] = useState<string | null>(null);

  const handleSelectFile = (fileId: string) => {
    setSelectedFiles((prev) =>
      prev.includes(fileId)
        ? prev.filter((id) => id !== fileId)
        : [...prev, fileId]
    );
  };

  const handleSelectAll = () => {
    setSelectedFiles((prev) =>
      prev.length === files.length ? [] : files.map((file) => file.id)
    );
  };

  const handleDownload = async (fileId: string) => {
    try {
      console.log('=== FRONTEND DOWNLOAD START ===');
      console.log('File ID:', fileId);

      const file = files.find((f) => f.id === fileId);
      console.log('File object:', file);

      const blob = await api.downloadFile(fileId);
      console.log('Blob received:', blob);
      console.log('Blob type:', blob.type);
      console.log('Blob size:', blob.size);

      const url = window.URL.createObjectURL(blob);
      const a = document.createElement('a');
      a.href = url;

      // Match backend behavior: change extension to .md
      const originalFilename = file?.filename || 'downloaded-file';
      const filenameWithoutExt = originalFilename.replace(/\.[^/.]+$/, ''); // Remove extension
      const downloadFilename = `${filenameWithoutExt}.md`;

      console.log('Original filename:', originalFilename);
      console.log('Filename without extension:', filenameWithoutExt);
      console.log('Download filename:', downloadFilename);

      a.download = downloadFilename;
      document.body.appendChild(a);
      a.click();
      window.URL.revokeObjectURL(url);
      document.body.removeChild(a);

      console.log('=== FRONTEND DOWNLOAD END ===');
    } catch (error) {
      console.error('Error downloading file:', error);
    }
  };

  const handleDownloadSelected = async () => {
    for (const fileId of selectedFiles) {
      await handleDownload(fileId);
    }
  };

  const handleDeleteClick = (fileId: string) => {
    setFileToDelete(fileId);
    setDeleteDialogOpen(true);
  };

  const handleDeleteConfirm = async () => {
    if (fileToDelete) {
      try {
        await api.deleteFile(fileToDelete);
        onFileStatusChange();
      } catch (error) {
        console.error('Error deleting file:', error);
      }
    }
    setDeleteDialogOpen(false);
    setFileToDelete(null);
  };

  const handleDeleteCancel = () => {
    setDeleteDialogOpen(false);
    setFileToDelete(null);
  };

  const getStatusColor = (status: FileStatus) => {
    switch (status) {
      case FileStatus.SUCCESS:
        return 'success';
      case FileStatus.FAILED:
        return 'error';
      case FileStatus.PROCESSING:
        return 'warning';
      default:
        return 'default';
    }
  };

  return (
    <div>
      <div style={{ marginBottom: '1rem' }}>
        <Button
          variant="contained"
          color="primary"
          onClick={handleDownloadSelected}
          disabled={selectedFiles.length === 0}
          sx={{ mr: 1 }}
        >
          Download Selected
        </Button>
      </div>
      <TableContainer component={Paper}>
        <Table>
          <TableHead>
            <TableRow>
              <TableCell padding="checkbox">
                <Checkbox
                  checked={selectedFiles.length === files.length}
                  indeterminate={selectedFiles.length > 0 && selectedFiles.length < files.length}
                  onChange={handleSelectAll}
                />
              </TableCell>
              <TableCell>Filename</TableCell>
              <TableCell>Status</TableCell>
              <TableCell>Created At</TableCell>
              <TableCell>Finished At</TableCell>
              <TableCell>Actions</TableCell>
            </TableRow>
          </TableHead>
          <TableBody>
            {files.map((file) => (
              <TableRow key={file.id}>
                <TableCell padding="checkbox">
                  <Checkbox
                    checked={selectedFiles.includes(file.id)}
                    onChange={() => handleSelectFile(file.id)}
                  />
                </TableCell>
                <TableCell>{file.filename}</TableCell>
                <TableCell>
                  <Chip
                    label={file.status}
                    color={getStatusColor(file.status) as any}
                    size="small"
                  />
                </TableCell>
                <TableCell>
                  {new Date(file.created_at).toLocaleString()}
                </TableCell>
                <TableCell>
                  {(file.status === FileStatus.SUCCESS || file.status === FileStatus.FAILED)
                    ? new Date(file.updated_at).toLocaleString()
                    : '—'}
                </TableCell>
                <TableCell>
                  <IconButton
                    onClick={() => handleDeleteClick(file.id)}
                    size="small"
                    color="error"
                    sx={{ mr: 1 }}
                  >
                    <DeleteIcon />
                  </IconButton>
                  {file.status === FileStatus.SUCCESS && (
                    <IconButton
                      onClick={() => handleDownload(file.id)}
                      size="small"
                      color="primary"
                    >
                      <DownloadIcon />
                    </IconButton>
                  )}
                </TableCell>
              </TableRow>
            ))}
          </TableBody>
        </Table>
      </TableContainer>

      <Dialog
        open={deleteDialogOpen}
        onClose={handleDeleteCancel}
      >
        <DialogTitle>Confirm Delete</DialogTitle>
        <DialogContent>
          <Typography>
            Are you sure you want to delete this file? This action cannot be undone.
          </Typography>
        </DialogContent>
        <DialogActions>
          <Button onClick={handleDeleteCancel}>Cancel</Button>
          <Button onClick={handleDeleteConfirm} color="error" variant="contained">
            Delete
          </Button>
        </DialogActions>
      </Dialog>
    </div>
  );
};

export default FileList;
@ -0,0 +1,66 @@
import React, { useCallback } from 'react';
import { useDropzone } from 'react-dropzone';
import { Box, Typography, CircularProgress } from '@mui/material';
import { api } from '../services/api';

interface FileUploadProps {
  onUploadComplete: () => void;
}

const FileUpload: React.FC<FileUploadProps> = ({ onUploadComplete }) => {
  const [isUploading, setIsUploading] = React.useState(false);

  const onDrop = useCallback(async (acceptedFiles: File[]) => {
    setIsUploading(true);
    try {
      for (const file of acceptedFiles) {
        await api.uploadFile(file);
      }
      onUploadComplete();
    } catch (error) {
      console.error('Error uploading files:', error);
    } finally {
      setIsUploading(false);
    }
  }, [onUploadComplete]);

  const { getRootProps, getInputProps, isDragActive } = useDropzone({
    onDrop,
    accept: {
      'application/pdf': ['.pdf'],
      'application/msword': ['.doc'],
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
      'text/markdown': ['.md'],
    },
  });

  return (
    <Box
      {...getRootProps()}
      sx={{
        border: '2px dashed #ccc',
        borderRadius: 2,
        p: 3,
        textAlign: 'center',
        cursor: 'pointer',
        bgcolor: isDragActive ? 'action.hover' : 'background.paper',
        '&:hover': {
          bgcolor: 'action.hover',
        },
      }}
    >
      <input {...getInputProps()} />
      {isUploading ? (
        <CircularProgress />
      ) : (
        <Typography>
          {isDragActive
            ? 'Drop the files here...'
            : 'Drag and drop files here, or click to select files'}
        </Typography>
      )}
    </Box>
  );
};

export default FileUpload;
@ -0,0 +1,8 @@
/// <reference types="react-scripts" />

declare namespace NodeJS {
  interface ProcessEnv {
    readonly REACT_APP_API_BASE_URL: string;
    // Add other environment variables here
  }
}
@ -0,0 +1,29 @@
import React from 'react';
import ReactDOM from 'react-dom/client';
import { QueryClient, QueryClientProvider } from 'react-query';
import { ThemeProvider, createTheme } from '@mui/material';
import CssBaseline from '@mui/material/CssBaseline';
import App from './App';

const queryClient = new QueryClient();

const theme = createTheme({
  palette: {
    mode: 'light',
  },
});

const root = ReactDOM.createRoot(
  document.getElementById('root') as HTMLElement
);

root.render(
  <React.StrictMode>
    <QueryClientProvider client={queryClient}>
      <ThemeProvider theme={theme}>
        <CssBaseline />
        <App />
      </ThemeProvider>
    </QueryClientProvider>
  </React.StrictMode>
);
@ -0,0 +1,44 @@
import axios from 'axios';
import { File, FileUploadResponse } from '../types/file';

const API_BASE_URL = process.env.REACT_APP_API_BASE_URL || 'http://localhost:8000/api/v1';

// Create axios instance with default config
const axiosInstance = axios.create({
  baseURL: API_BASE_URL,
  timeout: 30000, // 30 seconds timeout
});

export const api = {
  uploadFile: async (file: globalThis.File): Promise<FileUploadResponse> => {
    const formData = new FormData();
    formData.append('file', file);
    const response = await axiosInstance.post('/files/upload', formData, {
      headers: {
        'Content-Type': 'multipart/form-data',
      },
    });
    return response.data;
  },

  listFiles: async (): Promise<File[]> => {
    const response = await axiosInstance.get('/files/files');
    return response.data;
  },

  getFile: async (fileId: string): Promise<File> => {
    const response = await axiosInstance.get(`/files/files/${fileId}`);
    return response.data;
  },

  downloadFile: async (fileId: string): Promise<Blob> => {
    const response = await axiosInstance.get(`/files/files/${fileId}/download`, {
      responseType: 'blob',
    });
    return response.data;
  },

  deleteFile: async (fileId: string): Promise<void> => {
    await axiosInstance.delete(`/files/files/${fileId}`);
  },
};
@ -0,0 +1,23 @@
export enum FileStatus {
  NOT_STARTED = "not_started",
  PROCESSING = "processing",
  SUCCESS = "success",
  FAILED = "failed"
}

export interface File {
  id: string;
  filename: string;
  status: FileStatus;
  error_message?: string;
  created_at: string;
  updated_at: string;
}

export interface FileUploadResponse {
  id: string;
  filename: string;
  status: FileStatus;
  created_at: string;
  updated_at: string;
}
@ -0,0 +1,26 @@
{
  "compilerOptions": {
    "target": "es5",
    "lib": [
      "dom",
      "dom.iterable",
      "esnext"
    ],
    "allowJs": true,
    "skipLibCheck": true,
    "esModuleInterop": true,
    "allowSyntheticDefaultImports": true,
    "strict": true,
    "forceConsistentCasingInFileNames": true,
    "noFallthroughCasesInSwitch": true,
    "module": "esnext",
    "moduleResolution": "node",
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx"
  },
  "include": [
    "src"
  ]
}
@ -0,0 +1,232 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# Docker Image Import Script
|
||||||
|
# Imports Docker images on target environment for migration
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
echo "🚀 Legal Document Masker - Docker Image Import"
|
||||||
|
echo "=============================================="
|
||||||
|
|
||||||
|
# Function to check if Docker is running
|
||||||
|
check_docker() {
|
||||||
|
if ! docker info > /dev/null 2>&1; then
|
||||||
|
echo "❌ Docker is not running. Please start Docker and try again."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "✅ Docker is running"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to check for tar files
|
||||||
|
check_tar_files() {
|
||||||
|
echo "🔍 Checking for Docker image files..."
|
||||||
|
|
||||||
|
local missing_files=()
|
||||||
|
|
||||||
|
if [ ! -f "backend-api.tar" ]; then
|
||||||
|
missing_files+=("backend-api.tar")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ! -f "frontend.tar" ]; then
|
||||||
|
missing_files+=("frontend.tar")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ! -f "mineru-api.tar" ]; then
|
||||||
|
missing_files+=("mineru-api.tar")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ! -f "redis.tar" ]; then
|
||||||
|
missing_files+=("redis.tar")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${#missing_files[@]} -ne 0 ]; then
|
||||||
|
echo "❌ Missing files: ${missing_files[*]}"
|
||||||
|
echo ""
|
||||||
|
echo "Please ensure all .tar files are in the current directory."
|
||||||
|
echo "If you have a compressed archive, extract it first:"
|
||||||
|
echo " tar -xzf legal-doc-masker-images-*.tar.gz"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "✅ All required files found"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to check available disk space
|
||||||
|
check_disk_space() {
|
||||||
|
echo "💾 Checking available disk space..."
|
||||||
|
|
||||||
|
local required_space=0
|
||||||
|
for file in *.tar; do
|
||||||
|
local file_size=$(stat -f%z "$file" 2>/dev/null || stat -c%s "$file" 2>/dev/null || echo 0)
|
||||||
|
required_space=$((required_space + file_size))
|
||||||
|
done
|
||||||
|
|
||||||
|
local available_space=$(df . | awk 'NR==2 {print $4}')
|
||||||
|
available_space=$((available_space * 1024)) # Convert to bytes
|
||||||
|
|
||||||
|
if [ $required_space -gt $available_space ]; then
|
||||||
|
echo "❌ Insufficient disk space"
|
||||||
|
echo "Required: $(numfmt --to=iec $required_space)"
|
||||||
|
echo "Available: $(numfmt --to=iec $available_space)"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "✅ Sufficient disk space available"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to import images
|
||||||
|
import_images() {
|
||||||
|
echo "📦 Importing Docker images..."
|
||||||
|
|
||||||
|
# Import backend image
|
||||||
|
echo " 📦 Importing backend-api image..."
|
||||||
|
docker load -i backend-api.tar
|
||||||
|
|
||||||
|
# Import frontend image
|
||||||
|
echo " 📦 Importing frontend image..."
|
||||||
|
docker load -i frontend.tar
|
||||||
|
|
||||||
|
# Import mineru image
|
||||||
|
echo " 📦 Importing mineru-api image..."
|
||||||
|
docker load -i mineru-api.tar
|
||||||
|
|
||||||
|
# Import redis image
|
||||||
|
echo " 📦 Importing redis image..."
|
||||||
|
docker load -i redis.tar
|
||||||
|
|
||||||
|
echo "✅ All images imported successfully!"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to verify imported images
|
||||||
|
verify_images() {
|
||||||
|
echo "🔍 Verifying imported images..."
|
||||||
|
|
||||||
|
local missing_images=()
|
||||||
|
|
||||||
|
if ! docker images | grep -q "legal-doc-masker-backend-api"; then
|
||||||
|
missing_images+=("legal-doc-masker-backend-api")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! docker images | grep -q "legal-doc-masker-frontend"; then
|
||||||
|
missing_images+=("legal-doc-masker-frontend")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! docker images | grep -q "legal-doc-masker-mineru-api"; then
|
||||||
|
missing_images+=("legal-doc-masker-mineru-api")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! docker images | grep -q "redis:alpine"; then
|
||||||
|
missing_images+=("redis:alpine")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${#missing_images[@]} -ne 0 ]; then
|
||||||
|
echo "❌ Missing imported images: ${missing_images[*]}"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "✅ All images verified successfully!"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to show imported images
|
||||||
|
show_imported_images() {
|
||||||
|
echo ""
|
||||||
|
echo "📊 Imported Images:"
|
||||||
|
echo "==================="
|
||||||
|
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep legal-doc-masker
|
||||||
|
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep redis
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to create necessary directories
|
||||||
|
create_directories() {
|
||||||
|
echo ""
|
||||||
|
echo "📁 Creating necessary directories..."
|
||||||
|
|
||||||
|
mkdir -p backend/storage
|
||||||
|
mkdir -p mineru/storage/uploads
|
||||||
|
mkdir -p mineru/storage/processed
|
||||||
|
|
||||||
|
echo "✅ Directories created"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to check for required files
|
||||||
|
check_required_files() {
|
||||||
|
echo ""
|
||||||
|
echo "🔍 Checking for required configuration files..."
|
||||||
|
|
||||||
|
local missing_files=()
|
||||||
|
|
||||||
|
if [ ! -f "docker-compose.yml" ]; then
|
||||||
|
missing_files+=("docker-compose.yml")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ! -f "DOCKER_COMPOSE_README.md" ]; then
|
||||||
|
missing_files+=("DOCKER_COMPOSE_README.md")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${#missing_files[@]} -ne 0 ]; then
|
||||||
|
echo "⚠️ Missing files: ${missing_files[*]}"
|
||||||
|
echo "Please copy these files from the source environment:"
|
||||||
|
echo " - docker-compose.yml"
|
||||||
|
echo " - DOCKER_COMPOSE_README.md"
|
||||||
|
echo " - backend/.env (if exists)"
|
||||||
|
echo " - frontend/.env (if exists)"
|
||||||
|
echo " - mineru/.env (if exists)"
|
||||||
|
else
|
||||||
|
echo "✅ All required configuration files found"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to show next steps
|
||||||
|
show_next_steps() {
|
||||||
|
echo ""
|
||||||
|
echo "🎉 Import completed successfully!"
|
||||||
|
echo ""
|
||||||
|
echo "📋 Next Steps:"
|
||||||
|
echo "=============="
|
||||||
|
echo ""
|
||||||
|
echo "1. Copy configuration files (if not already present):"
|
||||||
|
echo " - docker-compose.yml"
|
||||||
|
echo " - backend/.env"
|
||||||
|
echo " - frontend/.env"
|
||||||
|
echo " - mineru/.env"
|
||||||
|
echo ""
|
||||||
|
echo "2. Start the services:"
|
||||||
|
echo " docker-compose up -d"
|
||||||
|
echo ""
|
||||||
|
echo "3. Verify services are running:"
|
||||||
|
echo " docker-compose ps"
|
||||||
|
echo ""
|
||||||
|
echo "4. Test the endpoints:"
|
||||||
|
echo " - Frontend: http://localhost:3000"
|
||||||
|
echo " - Backend API: http://localhost:8000"
|
||||||
|
echo " - Mineru API: http://localhost:8001"
|
||||||
|
echo ""
|
||||||
|
echo "5. View logs if needed:"
|
||||||
|
echo " docker-compose logs -f [service-name]"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Function to handle compressed archive
|
||||||
|
handle_compressed_archive() {
|
||||||
|
if ls legal-doc-masker-images-*.tar.gz 1> /dev/null 2>&1; then
|
||||||
|
echo "🗜️ Found compressed archive, extracting..."
|
||||||
|
tar -xzf legal-doc-masker-images-*.tar.gz
|
||||||
|
echo "✅ Archive extracted"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Main execution
|
||||||
|
main() {
|
||||||
|
check_docker
|
||||||
|
handle_compressed_archive
|
||||||
|
check_tar_files
|
||||||
|
check_disk_space
|
||||||
|
import_images
|
||||||
|
verify_images
|
||||||
|
show_imported_images
|
||||||
|
create_directories
|
||||||
|
check_required_files
|
||||||
|
show_next_steps
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run main function
|
||||||
|
main "$@"
|
||||||
|
|
@ -0,0 +1,46 @@
FROM python:3.12-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    libreoffice \
    wget \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN pip install uv

# Configure uv and install mineru
ENV UV_SYSTEM_PYTHON=1
RUN uv pip install --system -U "mineru[core]"

# Copy requirements first to leverage Docker cache
# COPY requirements.txt .
# RUN pip install huggingface_hub
# RUN wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
# RUN wget https://raw.githubusercontent.com/opendatalab/MinerU/refs/heads/release-1.3.1/scripts/download_models_hf.py -O download_models_hf.py

# RUN python download_models_hf.py
RUN mineru-models-download -s modelscope -m pipeline

# RUN pip install --no-cache-dir -r requirements.txt
# RUN pip install -U magic-pdf[full]

# Copy the rest of the application
# COPY . .

# Create storage directories
# RUN mkdir -p storage/uploads storage/processed

# Expose the port the app runs on
EXPOSE 8000

# Command to run the application
CMD ["mineru-api", "--host", "0.0.0.0", "--port", "8000"]
@ -0,0 +1,27 @@
version: '3.8'

services:
  mineru-api:
    build:
      context: .
      dockerfile: Dockerfile
    platform: linux/arm64
    ports:
      - "8001:8000"
    volumes:
      - ./storage/uploads:/app/storage/uploads
      - ./storage/processed:/app/storage/processed
    environment:
      - PYTHONUNBUFFERED=1
      - MINERU_MODEL_SOURCE=local
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

volumes:
  uploads:
  processed:
@ -1,11 +0,0 @@
# Base dependencies
pydantic-settings>=2.0.0
python-dotenv==1.0.0
watchdog==2.1.6
requests==2.28.1

# Document processing
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0
magic-pdf[full]
@ -0,0 +1,43 @@
|
||||||
|
# 北京市第三中级人民法院民事判决书
|
||||||
|
|
||||||
|
(2022)京 03 民终 3852 号
|
||||||
|
|
||||||
|
上诉人(原审原告):北京丰复久信营销科技有限公司,住所地北京市海淀区北小马厂6 号1 号楼华天大厦1306 室。
|
||||||
|
|
||||||
|
法定代表人:郭东军,执行董事、经理。委托诉讼代理人:周大海,北京市康达律师事务所律师。委托诉讼代理人:王乃哲,北京市康达律师事务所律师。
|
||||||
|
|
||||||
|
被上诉人(原审被告):中研智创区块链技术有限公司,住所地天津市津南区双港镇工业园区优谷产业园5 号楼-1505。
|
||||||
|
|
||||||
|
法定代表人:王欢子,总经理。
|
||||||
|
|
||||||
|
委托诉讼代理人:魏鑫,北京市昊衡律师事务所律师。
|
||||||
|
|
||||||
|
1.上诉人北京丰复久信营销科技有限公司(以下简称丰复久信公司)因与被上诉人中研智创区块链技术有限公司(以下简称中研智创公司)服务合同纠纷一案,不服北京市朝阳区人民法院(2020)京0105 民初69754 号民事判决,向本院提起上诉。本院立案后,依法组成合议庭开庭进行了审理。上诉人丰复久信公司之委托诉讼代理人周大海、王乃哲,被上诉人中研智创公司之委托诉讼代理人魏鑫到庭参加诉讼。本案现已审理终结。
|
||||||
|
|
||||||
|
2.丰复久信公司上诉请求:1.撤销一审判决,发回重审或依法改判支持丰复久信公司一审全部诉讼请求;2.或在维持原判的同时判令中研智创公司向丰复久信公司返还 1000 万元款项,并赔偿丰复久信公司因此支付的律师费 220 万元;3.判令中研智创公司承担本案一审、二审全部诉讼费用。事实与理由:一、根据2019 年的政策导向,丰复久信公司的投资行为并无任何法律或政策瑕疵。丰复久信公司仅投资挖矿,没有购买比特币,故在当时国家、政府层面有相关政策支持甚至鼓励的前提下,一审法院仅凭“挖矿”行为就得出丰复久信公司扰乱金融秩序的结论,是错误的。二、一审法院没有全面、深入审查相关事实,且遗漏了最核心的数据调查工作。三、本案一审判决适用法律错误。涉案合同成立及履行期间并无合同无效的情形,当属有效。一审法院以挖矿活动耗能巨大、不利于我国产业结构调整为依据之一,作出合同无效的判决,实属牵强。最高人民法院发布的全国法院系统2020 年度优秀案例分析评选活动获奖名单中,由上海市第一中级人民法院刘江法官编写的“李圣艳、布兰登·斯密特诉闫向东、李敏等财产损害赔偿纠纷案— —比特币的法律属性及其司法救济”一案入选,该案同样发生在丰复久信公司与中研智创公司合同履行过程中,一审法院认定同时期同类型的涉案合同无效,与上述最高人民法院的优秀案例相悖。四、一审法院径行认定合同无效,未向丰复久信公司进行释明构成程序违法。
|
||||||
|
|
||||||
|
3.中研智创公司辩称,同意一审判决,不同意丰复久信公司的上诉请求。首先,一审法院曾在庭审中询问丰复久信公司关于机器返还的问题,一审法院进行了释明。其次,如二审法院对其该项上诉请求进行判决,会剥夺中研智创公司针对该部分请求再行上诉的权利。
|
||||||
|
|
||||||
|
4.丰复久信公司向一审法院起诉请求:1.中研智创公司交付278.1654976 个比特币,或者按照 2021 年 1 月 25 日比特币的价格交付9550812.36 美元;2.中研智创公司赔偿丰复久信公司服务期到期后占用微型存储空间服务器的损失(自2020 年7 月1日起至实际返还服务器时止,按照bitinfocharts 网站公布的相关日产比特币数据,计算应赔偿比特币数量或按照2021 年1 月25 日比特币的价格交付美元)。
|
||||||
|
|
||||||
|
5.一审法院查明事实:2019 年5 月6 日,丰复久信公司作为甲方(买方)与乙方(卖方)中研智创公司签订《计算机设备采购合
|
||||||
|
|
||||||
|
同》,约定:货物名称为计算机设备,型号规格及数量为T2T-30T 规格型号的微型存储空间服务器1542 台,单价5040/ 台合同金额为 7 771 680 元;交货期 2019 年 8 月 31 日前;交货方式为乙方自行送货到甲方所在地,并提供安装服务,运输工具及运费由乙方负责;交货地点北京;签订购货合同,设备安装完毕后一次性支付项目总货款;乙方提供货物的质量保证期为自交货验收结束之日起不少于十二个月(具体按清单要求);乙方交货前应对产品作出全面检查和对验收文件进行整理,并列出清单,作为甲方收货验收和使用的技术条件依据,检验的结果应随货物交甲方,甲方对乙方提供的货物在使用前进行调试时,乙方协助甲方一起调试,直到符合技术要求,甲方才做最终验收,验收时乙方必须在现场,验收完毕后作出验收结果报告,并经双方签字生效。
|
||||||
|
|
||||||
|
6.同日,丰复久信公司作为甲方(客户方)与乙方中研智创公司(服务方)签订《服务合同书》,约定:乙方同意就采购合同中的微型存储空间服务器向甲方提供特定服务;服务的内容包括质保、维修、服务器设备代为运行管理、代为缴纳服务器相关用度花费如电费等,详细内容见附件一;如果乙方在工作中因自身过错而发生任何错误或遗漏,应无条件更正,不另外收费,并对因此而对甲方造成的损失承担赔偿责任,赔偿额以本合同约定的服务费为限;若因甲方原因造成工作延误,将由甲方承担相应的损失;服务费总金额为2 228 320 元,甲乙双方一致同意项目服务费以人民币形式,于本合同签订后3 日内一次性支付;甲方可以提前10 个工作日以书面形式要求变更或增加所提供的服务,该等变更最终应由双方商定认可,其中包括与该等变更有关的任何费用调整等。合同后附附件一以表格形式列明,1.1542 台T2T-30T 微型存储空间服务器的质保、维修,时限12 个月,完成标准为完成甲方指定的运行量;2.服务器的日常运行管理,时限12 个月;3.代扣代缴电费;4.其他(空白)。
|
||||||
|
|
||||||
|
24. 2021 年9 月3 日国家发展和改革委员会等部门《关于整治虚拟货币“挖矿”活动的通知》显示,虚拟货币挖矿活动能源消耗和碳排放量大,对国民经济贡献度低,对产业发展、科技进步等带动作用有限,加之虚拟货币生产、交易环节衍生的风险越发突出,其盲目无序发展对推动经济社会高质量发展和节能减排带来不利影响。故以电力资源、碳排放量为代价的“挖矿”行为,与经济社会高质量发展和碳达峰、碳中和目标相悖,与公共利益相悖。
|
||||||
|
|
||||||
|
26. 综上,相关部门整治虚拟货币“挖矿”活动、认定虚拟货币相关业务活动属于非法金融活动,有利于保障我国发展利益和金融安全。从“挖矿”行为的高能耗以及比特币交易活动对国家金融秩序和社会秩序的影响来看,一审法院认定涉案合同无效是正确的。双方作为社会主义市场经济主体,既应遵守市场经济规则,亦应承担起相应的社会责任,推动经济社会高质量发展、可持续发展。
|
||||||
|
|
||||||
|
27. 关于合同无效后的返还问题,一审法院未予处理,双方可另行解决。
|
||||||
|
|
||||||
|
28. 综上所述,丰复久信公司的上诉请求不能成立,应予驳回;一审判决并无不当,应予维持。依照《中华人民共和国民事诉讼法》第一百七十七条第一款第一项规定,判决如下:
|
||||||
|
|
||||||
|
驳回上诉,维持原判。
|
||||||
|
|
||||||
|
二审案件受理费450892 元,由北京丰复久信营销科技有限公司负担(已交纳)。
|
||||||
|
|
||||||
|
29. 本判决为终审判决。
|
||||||
|
|
||||||
|
审 判 长 史晓霞审 判 员 邓青菁审 判 员 李 淼二〇二二年七月七日法 官 助 理 黎 铧书 记 员 郑海兴
|
||||||
|
|
@ -0,0 +1,110 @@
#!/bin/bash

# Unified Docker Compose Setup Script
# This script helps set up the unified Docker Compose environment

set -e

echo "🚀 Setting up Unified Docker Compose Environment"

# Function to check if Docker is running
check_docker() {
    if ! docker info > /dev/null 2>&1; then
        echo "❌ Docker is not running. Please start Docker and try again."
        exit 1
    fi
    echo "✅ Docker is running"
}

# Function to stop existing individual services
stop_individual_services() {
    echo "🛑 Stopping individual Docker Compose services..."

    if [ -f "backend/docker-compose.yml" ]; then
        echo "Stopping backend services..."
        cd backend && docker-compose down 2>/dev/null || true && cd ..
    fi

    if [ -f "frontend/docker-compose.yml" ]; then
        echo "Stopping frontend services..."
        cd frontend && docker-compose down 2>/dev/null || true && cd ..
    fi

    if [ -f "mineru/docker-compose.yml" ]; then
        echo "Stopping mineru services..."
        cd mineru && docker-compose down 2>/dev/null || true && cd ..
    fi

    echo "✅ Individual services stopped"
}

# Function to create necessary directories
create_directories() {
    echo "📁 Creating necessary directories..."

    mkdir -p backend/storage
    mkdir -p mineru/storage/uploads
    mkdir -p mineru/storage/processed

    echo "✅ Directories created"
}

# Function to check if unified docker-compose.yml exists
check_unified_compose() {
    if [ ! -f "docker-compose.yml" ]; then
        echo "❌ Unified docker-compose.yml not found in current directory"
        echo "Please run this script from the project root directory"
        exit 1
    fi
    echo "✅ Unified docker-compose.yml found"
}

# Function to build and start services
start_unified_services() {
    echo "🔨 Building and starting unified services..."

    # Build all services
    docker-compose build

    # Start services
    docker-compose up -d

    echo "✅ Unified services started"
}

# Function to check service status
check_service_status() {
    echo "📊 Checking service status..."

    docker-compose ps

    echo ""
    echo "🌐 Service URLs:"
    echo "Frontend: http://localhost:3000"
    echo "Backend API: http://localhost:8000"
    echo "Mineru API: http://localhost:8001"
    echo ""
    echo "📝 To view logs: docker-compose logs -f [service-name]"
    echo "📝 To stop services: docker-compose down"
}

# Main execution
main() {
    echo "=========================================="
    echo "Unified Docker Compose Setup"
    echo "=========================================="

    check_docker
    check_unified_compose
    stop_individual_services
    create_directories
    start_unified_services
    check_service_status

    echo ""
    echo "🎉 Setup complete! Your unified Docker environment is ready."
    echo "Check the DOCKER_COMPOSE_README.md for more information."
}

# Run main function
main "$@"
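After the script finishes, a quick reachability check can confirm that the three URLs it prints are actually serving. The sketch below is only an illustration and not part of the repository; it probes the root paths, which may or may not be the services' real health endpoints.

```python
# quick_check.py -- hypothetical helper, not part of the repo.
import urllib.error
import urllib.request

SERVICES = {
    "frontend": "http://localhost:3000",
    "backend-api": "http://localhost:8000",
    "mineru-api": "http://localhost:8001",
}

for name, url in SERVICES.items():
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name}: reachable (HTTP {resp.status})")
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a 2xx status -- still reachable.
        print(f"{name}: reachable (HTTP {exc.code})")
    except (urllib.error.URLError, OSError) as exc:
        print(f"{name}: not reachable ({exc})")
```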
@ -1,31 +0,0 @@
# settings.py

from pydantic_settings import BaseSettings
from typing import Optional


class Settings(BaseSettings):
    # Storage paths
    OBJECT_STORAGE_PATH: str = ""
    TARGET_DIRECTORY_PATH: str = ""

    # Ollama API settings
    OLLAMA_API_URL: str = "https://api.ollama.com"
    OLLAMA_API_KEY: str = ""
    OLLAMA_MODEL: str = "llama2"

    # File monitoring settings
    MONITOR_INTERVAL: int = 5

    # Logging settings
    LOG_LEVEL: str = "INFO"
    LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
    LOG_FILE: str = "app.log"

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"
        extra = "allow"


# Create settings instance
settings = Settings()
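Because `Settings` is a pydantic `BaseSettings` subclass and the module-level `settings` instance is created at import time, values can come from the `.env` file or from process environment variables, which take precedence. A minimal sketch, assuming it runs from the source tree where `config.settings` is importable and that the model name is only an example:

```python
import os

# Environment variables override the class defaults (and the .env file).
# "llama2:13b" is purely illustrative.
os.environ["OLLAMA_MODEL"] = "llama2:13b"

from config.settings import settings

print(settings.OLLAMA_MODEL)      # -> "llama2:13b"
print(settings.MONITOR_INTERVAL)  # -> 5, the class default, unless .env overrides it
```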
@ -1,190 +0,0 @@
from abc import ABC, abstractmethod
from typing import Any, Dict
from prompts.masking_prompts import get_masking_mapping_prompt
import logging
import json
from services.ollama_client import OllamaClient
from config.settings import settings
from utils.json_extractor import LLMJsonExtractor

logger = logging.getLogger(__name__)


class DocumentProcessor(ABC):
    def __init__(self):
        self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)
        self.max_chunk_size = 1000  # Maximum number of characters per chunk
        self.max_retries = 3  # Maximum number of retries for mapping generation

    @abstractmethod
    def read_content(self) -> str:
        """Read document content"""
        pass

    def _split_into_chunks(self, sentences: list[str]) -> list[str]:
        """Split sentences into chunks that don't exceed max_chunk_size"""
        chunks = []
        current_chunk = ""

        for sentence in sentences:
            if not sentence.strip():
                continue

            # If adding this sentence would exceed the limit, save current chunk and start new one
            if len(current_chunk) + len(sentence) > self.max_chunk_size and current_chunk:
                chunks.append(current_chunk)
                current_chunk = sentence
            else:
                if current_chunk:
                    current_chunk += "。" + sentence
                else:
                    current_chunk = sentence

        # Add the last chunk if it's not empty
        if current_chunk:
            chunks.append(current_chunk)

        return chunks

    def _validate_mapping_format(self, mapping: Dict[str, Any]) -> bool:
        """
        Validate that the mapping follows the required format:
        {
            "原文1": "脱敏后1",
            "原文2": "脱敏后2",
            ...
        }
        """
        if not isinstance(mapping, dict):
            logger.warning("Mapping is not a dictionary")
            return False

        # Check if any key or value is not a string
        for key, value in mapping.items():
            if not isinstance(key, str) or not isinstance(value, str):
                logger.warning(f"Invalid mapping format - key or value is not a string: {key}: {value}")
                return False

        # Check if the mapping has any nested structures
        if any(isinstance(v, (dict, list)) for v in mapping.values()):
            logger.warning("Invalid mapping format - contains nested structures")
            return False

        return True

    def _build_mapping(self, chunk: str) -> Dict[str, str]:
        """Build mapping for a single chunk of text with retry logic"""
        for attempt in range(self.max_retries):
            try:
                formatted_prompt = get_masking_mapping_prompt(chunk)
                logger.info(f"Calling ollama to generate mapping for chunk (attempt {attempt + 1}/{self.max_retries}): {formatted_prompt}")
                response = self.ollama_client.generate(formatted_prompt)
                logger.info(f"Raw response from LLM: {response}")

                # Parse the JSON response into a dictionary
                mapping = LLMJsonExtractor.parse_raw_json_str(response)
                logger.info(f"Parsed mapping: {mapping}")

                if mapping and self._validate_mapping_format(mapping):
                    return mapping
                else:
                    logger.warning(f"Invalid mapping format received on attempt {attempt + 1}, retrying...")
            except Exception as e:
                logger.error(f"Error generating mapping on attempt {attempt + 1}: {e}")
                if attempt < self.max_retries - 1:
                    logger.info("Retrying...")
                else:
                    logger.error("Max retries reached, returning empty mapping")
        return {}

    def _apply_mapping(self, text: str, mapping: Dict[str, str]) -> str:
        """Apply the mapping to replace sensitive information"""
        masked_text = text
        for original, masked in mapping.items():
            # Ensure masked value is a string
            if isinstance(masked, dict):
                # If it's a dict, use the first value or a default
                masked = next(iter(masked.values()), "某")
            elif not isinstance(masked, str):
                # If it's not a string, convert to string or use default
                masked = str(masked) if masked is not None else "某"
            masked_text = masked_text.replace(original, masked)
        return masked_text

    def _get_next_suffix(self, value: str) -> str:
        """Get the next available suffix for a value that already has a suffix"""
        # Define the sequence of suffixes
        suffixes = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']

        # Check if the value already has a suffix
        for suffix in suffixes:
            if value.endswith(suffix):
                # Find the next suffix in the sequence
                current_index = suffixes.index(suffix)
                if current_index + 1 < len(suffixes):
                    return value[:-1] + suffixes[current_index + 1]
                else:
                    # If we've used all suffixes, start over with the first one
                    return value[:-1] + suffixes[0]

        # If no suffix found, return the value with the first suffix
        return value + '甲'

    def _merge_mappings(self, existing: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
        """
        Merge two mappings following the rules:
        1. If key exists in existing, keep existing value
        2. If value exists in existing:
           - If value ends with a suffix (甲乙丙丁...), add next suffix
           - If no suffix, add '甲'
        """
        result = existing.copy()

        # Get all existing values
        existing_values = set(result.values())

        for key, value in new.items():
            if key in result:
                # Rule 1: Keep existing value if key exists
                continue

            if value in existing_values:
                # Rule 2: Handle duplicate values
                new_value = self._get_next_suffix(value)
                result[key] = new_value
                existing_values.add(new_value)
            else:
                # No conflict, add as is
                result[key] = value
                existing_values.add(value)

        return result

    def process_content(self, content: str) -> str:
        """Process document content by masking sensitive information"""
        # Split content into sentences
        sentences = content.split("。")

        # Split sentences into manageable chunks
        chunks = self._split_into_chunks(sentences)
        logger.info(f"Split content into {len(chunks)} chunks")

        # Build mapping for each chunk
        combined_mapping = {}
        for i, chunk in enumerate(chunks):
            logger.info(f"Processing chunk {i+1}/{len(chunks)}")
            chunk_mapping = self._build_mapping(chunk)
            if chunk_mapping:  # Only update if we got a valid mapping
                combined_mapping = self._merge_mappings(combined_mapping, chunk_mapping)
            else:
                logger.warning(f"Failed to generate mapping for chunk {i+1}")

        # Apply the combined mapping to the entire content
        masked_content = self._apply_mapping(content, combined_mapping)
        logger.info("Successfully masked content")

        return masked_content

    @abstractmethod
    def save_content(self, content: str) -> None:
        """Save processed content"""
        pass
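To make the merge rules above concrete, here is a small standalone sketch that mirrors `_get_next_suffix` and `_merge_mappings` on toy data (standalone so it runs without an Ollama client; the names are invented):

```python
from typing import Dict

SUFFIXES = ['甲', '乙', '丙', '丁', '戊', '己', '庚', '辛', '壬', '癸']

def next_suffix(value: str) -> str:
    # Mirrors _get_next_suffix: advance an existing suffix, else append '甲'.
    for i, s in enumerate(SUFFIXES):
        if value.endswith(s):
            return value[:-1] + SUFFIXES[(i + 1) % len(SUFFIXES)]
    return value + '甲'

def merge(existing: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
    # Mirrors _merge_mappings: keep existing keys, disambiguate duplicate values.
    result = dict(existing)
    used = set(result.values())
    for key, value in new.items():
        if key in result:
            continue
        if value in used:
            value = next_suffix(value)
        result[key] = value
        used.add(value)
    return result

chunk1 = {"张三": "张某"}
chunk2 = {"张三": "张某", "张四": "张某"}  # a second person masked to the same value
print(merge(chunk1, chunk2))              # {'张三': '张某', '张四': '张某甲'}
```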
@ -1,6 +0,0 @@
from document_handlers.processors.txt_processor import TxtDocumentProcessor
from document_handlers.processors.docx_processor import DocxDocumentProcessor
from document_handlers.processors.pdf_processor import PdfDocumentProcessor
from document_handlers.processors.md_processor import MarkdownDocumentProcessor

__all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor', 'MarkdownDocumentProcessor']
@ -1,105 +0,0 @@
import os
import PyPDF2
from document_handlers.document_processor import DocumentProcessor
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod
from prompts.masking_prompts import get_masking_prompt, get_masking_mapping_prompt
import logging
from services.ollama_client import OllamaClient
from config.settings import settings

logger = logging.getLogger(__name__)


class PdfDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        super().__init__()  # Call parent class's __init__
        self.input_path = input_path
        self.output_path = output_path
        self.output_dir = os.path.dirname(output_path)
        self.name_without_suff = os.path.splitext(os.path.basename(input_path))[0]

        # Setup output directories
        self.local_image_dir = os.path.join(self.output_dir, "images")
        self.image_dir = os.path.basename(self.local_image_dir)
        os.makedirs(self.local_image_dir, exist_ok=True)

        # Setup work directory under output directory
        self.work_dir = os.path.join(
            os.path.dirname(output_path),
            ".work",
            os.path.splitext(os.path.basename(input_path))[0]
        )
        os.makedirs(self.work_dir, exist_ok=True)

        self.work_local_image_dir = os.path.join(self.work_dir, "images")
        self.work_image_dir = os.path.basename(self.work_local_image_dir)
        os.makedirs(self.work_local_image_dir, exist_ok=True)
        self.ollama_client = OllamaClient(model_name=settings.OLLAMA_MODEL, base_url=settings.OLLAMA_API_URL)

    def read_content(self) -> str:
        logger.info("Starting PDF content processing")

        # Read the PDF file
        with open(self.input_path, 'rb') as file:
            content = file.read()

        # Initialize writers
        image_writer = FileBasedDataWriter(self.work_local_image_dir)
        md_writer = FileBasedDataWriter(self.work_dir)

        # Create Dataset Instance
        ds = PymuDocDataset(content)

        logger.info("Classifying PDF type: %s", ds.classify())
        # Process based on PDF type
        if ds.classify() == SupportedPdfParseMethod.OCR:
            infer_result = ds.apply(doc_analyze, ocr=True)
            pipe_result = infer_result.pipe_ocr_mode(image_writer)
        else:
            infer_result = ds.apply(doc_analyze, ocr=False)
            pipe_result = infer_result.pipe_txt_mode(image_writer)

        logger.info("Generating all outputs")
        # Generate all outputs
        infer_result.draw_model(os.path.join(self.work_dir, f"{self.name_without_suff}_model.pdf"))
        model_inference_result = infer_result.get_infer_res()

        pipe_result.draw_layout(os.path.join(self.work_dir, f"{self.name_without_suff}_layout.pdf"))
        pipe_result.draw_span(os.path.join(self.work_dir, f"{self.name_without_suff}_spans.pdf"))

        md_content = pipe_result.get_markdown(self.work_image_dir)
        pipe_result.dump_md(md_writer, f"{self.name_without_suff}.md", self.work_image_dir)

        content_list = pipe_result.get_content_list(self.work_image_dir)
        pipe_result.dump_content_list(md_writer, f"{self.name_without_suff}_content_list.json", self.work_image_dir)

        middle_json = pipe_result.get_middle_json()
        pipe_result.dump_middle_json(md_writer, f'{self.name_without_suff}_middle.json')

        return md_content

    # def process_content(self, content: str) -> str:
    #     logger.info("Starting content masking process")
    #     sentences = content.split("。")
    #     final_md = ""
    #     for sentence in sentences:
    #         if not sentence.strip():  # Skip empty sentences
    #             continue
    #         formatted_prompt = get_masking_mapping_prompt(sentence)
    #         logger.info("Calling ollama to generate response, prompt: %s", formatted_prompt)
    #         response = self.ollama_client.generate(formatted_prompt)
    #         logger.info(f"Response generated: {response}")
    #         final_md += response + "。"
    #     return final_md

    def save_content(self, content: str) -> None:
        # Ensure output path has .md extension
        output_dir = os.path.dirname(self.output_path)
        base_name = os.path.splitext(os.path.basename(self.output_path))[0]
        md_output_path = os.path.join(output_dir, f"{base_name}.md")

        logger.info(f"Saving masked content to: {md_output_path}")
        with open(md_output_path, 'w', encoding='utf-8') as file:
            file.write(content)
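A hedged usage sketch for the processor above, following the read → process → save flow defined on the base class; the paths are placeholders, and it assumes a reachable Ollama endpoint configured via `config.settings`:

```python
from document_handlers.processors.pdf_processor import PdfDocumentProcessor

# Illustrative paths only.
processor = PdfDocumentProcessor("data/doc_src/judgment.pdf", "data/doc_dest/judgment.md")

raw_markdown = processor.read_content()                    # PDF -> markdown via magic_pdf
masked_markdown = processor.process_content(raw_markdown)  # chunked LLM mapping applied
processor.save_content(masked_markdown)                    # written as <basename>.md
```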
src/main.py
@ -1,22 +0,0 @@
from config.logging_config import setup_logging


def main():
    # Setup logging first
    setup_logging()

    from services.file_monitor import FileMonitor
    from config.settings import settings

    import logging
    logger = logging.getLogger(__name__)
    logger.info("Starting the application")
    logger.info(f"Monitoring directory: {settings.OBJECT_STORAGE_PATH}")
    logger.info(f"Target directory: {settings.TARGET_DIRECTORY_PATH}")
    # Initialize the file monitor
    file_monitor = FileMonitor(settings.OBJECT_STORAGE_PATH, settings.TARGET_DIRECTORY_PATH)

    # Start monitoring the directory for new files
    file_monitor.start_monitoring()


if __name__ == "__main__":
    main()
@ -1,81 +0,0 @@
import textwrap


def get_masking_prompt(text: str) -> str:
    """
    Returns the prompt for masking sensitive information in legal documents.

    Args:
        text (str): The input text to be masked

    Returns:
        str: The formatted prompt with the input text
    """
    prompt = textwrap.dedent("""
        您是一位专业的法律文档脱敏专家。请按照以下规则对文本进行脱敏处理:

        规则:
        1. 人名:
           - 两字名改为"姓+某"(如:张三 → 张某)
           - 三字名改为"姓+某某"(如:张三丰 → 张某某)
        2. 公司名:
           - 保留地理位置信息(如:北京、上海等)
           - 保留公司类型(如:有限公司、股份公司等)
           - 用"某"替换核心名称
        3. 保持原文其他部分不变
        4. 确保脱敏后的文本保持原有的语言流畅性和可读性

        输入文本:
        {text}

        请直接输出脱敏后的文本,无需解释或其他备注。
    """)

    return prompt.format(text=text)


def get_masking_mapping_prompt(text: str) -> str:
    """
    Returns a prompt that generates a mapping of original names/companies to their masked versions.

    Args:
        text (str): The input text to be analyzed for masking

    Returns:
        str: The formatted prompt that will generate a mapping dictionary
    """
    prompt = textwrap.dedent("""
        您是一位专业的法律文档脱敏专家。请分析文本并生成一个脱敏映射表,遵循以下规则:

        规则:
        1. 人名映射规则:
           - 对于同一姓氏的不同人名,使用字母区分:
             * 第一个出现的用"姓+某"(如:张三 → 张某)
             * 第二个出现的用"姓+某A"(如:张四 → 张某A)
             * 第三个出现的用"姓+某B"(如:张五 → 张某B)
             依此类推
           - 三字名同样遵循此规则(如:张三丰 → 张某某,张四海 → 张某某A)

        2. 公司名映射规则:
           - 保留地理位置信息(如:北京、上海等)
           - 保留公司类型(如:有限公司、股份公司等)
           - 用"某"替换核心名称,但保留首尾字(如:北京智慧科技有限公司 → 北京智某科技有限公司)
           - 对于多个相似公司名,使用字母区分(如:
             北京智慧科技有限公司 → 北京某科技有限公司
             北京智能科技有限公司 → 北京某科技有限公司A)

        3. 公权机关不做脱敏处理(如:公安局、法院、检察院、中国人民银行、银监会及其他未列明的公权机关)

        请分析以下文本,并生成一个JSON格式的映射表,包含所有需要脱敏的名称及其对应的脱敏后的形式:

        {text}

        请直接输出JSON格式的映射表,格式如下:
        {{
            "原文1": "脱敏后1",
            "原文2": "脱敏后2",
            ...
        }}
        如无需要输出的映射,请输出空json,如下:
        {{}}
    """)

    return prompt.format(text=text)
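As a rough illustration of the contract these prompts define (the actual reply depends on the model), a call and a plausible well-formed response might look like this; the sample sentence and mapping values are invented:

```python
from prompts.masking_prompts import get_masking_mapping_prompt

sample = "原告张三与被告北京智慧科技有限公司签订了买卖合同。"
prompt = get_masking_mapping_prompt(sample)

# A well-formed model reply is a flat JSON object, e.g.:
# {
#     "张三": "张某",
#     "北京智慧科技有限公司": "北京智某科技有限公司"
# }
# which _validate_mapping_format accepts and _apply_mapping then substitutes
# back into the original text.
```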
@ -1,54 +0,0 @@
import logging
import os
from services.document_service import DocumentService
from services.ollama_client import OllamaClient
from config.settings import settings

logger = logging.getLogger(__name__)


class FileMonitor:
    def __init__(self, input_directory: str, output_directory: str):
        self.input_directory = input_directory
        self.output_directory = output_directory

        # Create OllamaClient instance using settings
        ollama_client = OllamaClient(
            model_name=settings.OLLAMA_MODEL,
            base_url=settings.OLLAMA_API_URL
        )
        # Inject OllamaClient into DocumentService
        self.document_service = DocumentService(ollama_client=ollama_client)

    def process_new_file(self, file_path: str) -> None:
        try:
            # Get the filename without directory path
            filename = os.path.basename(file_path)
            # Create output path
            output_path = os.path.join(self.output_directory, filename)

            logger.info(f"Processing file: {filename}")
            # Process the document using document service
            self.document_service.process_document(file_path, output_path)
            logger.info(f"File processed successfully: {filename}")

        except Exception as e:
            logger.error(f"Error processing file {file_path}: {str(e)}")

    def start_monitoring(self):
        import time

        # Ensure output directory exists
        os.makedirs(self.output_directory, exist_ok=True)

        already_seen = set(os.listdir(self.input_directory))
        while True:
            time.sleep(1)  # Check every second
            current_files = set(os.listdir(self.input_directory))
            new_files = current_files - already_seen

            for new_file in new_files:
                file_path = os.path.join(self.input_directory, new_file)
                logger.info(f"New file found: {new_file}")
                self.process_new_file(file_path)

            already_seen = current_files