Initial commit: Document processing app with Ollama integration

tigermren 2025-04-23 00:02:10 +08:00
commit 0904ab5073
26 changed files with 501 additions and 0 deletions

13
.dockerignore Normal file
View File

@ -0,0 +1,13 @@
__pycache__
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
.env
*.log
.git
.gitignore
.pytest_cache
tests/

19
.env.example Normal file
View File

@ -0,0 +1,19 @@
# Storage paths
OBJECT_STORAGE_PATH=/path/to/mounted/object/storage
TARGET_DIRECTORY_PATH=/path/to/target/directory
# Ollama API Configuration
OLLAMA_API_URL=https://api.ollama.com
OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=llama2
# Application Settings
MONITOR_INTERVAL=5
# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log
# Optional: Additional security settings
# MAX_FILE_SIZE=10485760 # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf
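The two optional settings above are commented out, and nothing in this commit reads them. A minimal sketch of how they might be enforced before a file is handed to a processor; `parse_allowed_types` and `is_allowed` are hypothetical helpers, not part of the app:

```python
import os
import tempfile


def parse_allowed_types(raw: str) -> set[str]:
    """Turn a value like '.txt,.doc,.docx,.pdf' into a set of lowercase extensions."""
    return {ext.strip().lower() for ext in raw.split(",") if ext.strip()}


def is_allowed(path: str, allowed_types: set[str], max_file_size: int) -> bool:
    """Return True if the file has an allowed extension and fits the size limit."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in allowed_types:
        return False
    return os.path.getsize(path) <= max_file_size


# Usage with the example values from .env.example
allowed = parse_allowed_types(".txt,.doc,.docx,.pdf")
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello")
sample_path = handle.name
sample_ok = is_allowed(sample_path, allowed, max_file_size=10485760)
os.remove(sample_path)
print(sample_ok)
```

Checking the extension first avoids a `stat` call on files that would be rejected anyway.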

48
Dockerfile Normal file
View File

@ -0,0 +1,48 @@
# Build stage
FROM python:3.12-slim AS builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements first to leverage Docker cache
COPY requirements.txt .
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt
# Final stage
FROM python:3.12-slim
WORKDIR /app
# Create non-root user
RUN useradd -m -r appuser && \
    chown appuser:appuser /app
# Copy wheels from builder
COPY --from=builder /app/wheels /wheels
COPY --from=builder /app/requirements.txt .
# Install dependencies
RUN pip install --no-cache-dir /wheels/*
# Copy application code
COPY src/ ./src/
# Create directories for mounted volumes
RUN mkdir -p /data/input /data/output && \
    chown -R appuser:appuser /data
# Switch to non-root user
USER appuser
# Environment variables
ENV PYTHONPATH=/app \
    OBJECT_STORAGE_PATH=/data/input \
    TARGET_DIRECTORY_PATH=/data/output
# Run the application
CMD ["python", "src/main.py"]

58
README.md Normal file
View File

@ -0,0 +1,58 @@
# README.md
# Document Processing App
This project processes legal documents by masking sensitive information such as personal and company names. It uses the Ollama API with a configurable model for text processing. The application monitors a directory for new files, processes them automatically, and saves the results to a target directory.
## Project Structure
```
doc-processing-app
├── src
│   ├── main.py                      # Entry point of the application
│   ├── config
│   │   └── settings.py              # Configuration settings for paths
│   ├── services
│   │   ├── file_monitor.py          # Monitors directory for new files
│   │   ├── document_processor.py    # Handles document processing logic
│   │   └── ollama_client.py         # Interacts with the Ollama API
│   ├── utils
│   │   └── file_utils.py            # Utility functions for file operations
│   └── models
│       └── document.py              # Represents the structure of a document
├── tests
│   └── test_document_processor.py   # Unit tests for DocumentProcessor
├── requirements.txt                 # Project dependencies
├── .env.example                     # Example environment variables
└── README.md                        # Project documentation
```
## Setup Instructions
1. Clone the repository:
```
git clone <repository-url>
cd doc-processing-app
```
2. Install the required dependencies:
```
pip install -r requirements.txt
```
3. Set the paths for the object storage and target directory. The defaults live in `src/config/settings.py`; values set in `.env` or in the environment override them.
4. Create a `.env` file based on `.env.example` to set the remaining environment variables, such as the Ollama API URL and model.
## Usage
To run the application, execute the following command:
```
python src/main.py
```
The application will start monitoring the specified directory for new documents. Once a new document is added, it will be processed automatically.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

1
app.log Normal file
View File

@ -0,0 +1 @@
2025-04-20 20:14:00 - services.file_monitor - INFO - monitor: new file found: README.md

10
requirements.txt Normal file
View File

@ -0,0 +1,10 @@
# Base dependencies
pydantic-settings>=2.0.0
python-dotenv==1.0.0
watchdog==2.1.6
requests==2.26.0
# Document processing
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0

Binary file not shown.


View File

@ -0,0 +1,39 @@
import logging.config
from config.settings import settings

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "standard": {
            "format": settings.LOG_FORMAT,
            "datefmt": settings.LOG_DATE_FORMAT
        },
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "standard",
            "level": settings.LOG_LEVEL,
            "stream": "ext://sys.stdout"
        },
        "file": {
            "class": "logging.FileHandler",
            "formatter": "standard",
            "level": settings.LOG_LEVEL,
            "filename": settings.LOG_FILE,
            "mode": "a",
        }
    },
    "loggers": {
        "": {  # root logger
            "handlers": ["console", "file"],
            "level": settings.LOG_LEVEL,
            "propagate": True
        }
    }
}


def setup_logging():
    """Initialize logging configuration"""
    logging.config.dictConfig(LOGGING_CONFIG)

31
src/config/settings.py Normal file
View File

@ -0,0 +1,31 @@
# settings.py
from pydantic_settings import BaseSettings
from typing import Optional


class Settings(BaseSettings):
    # Storage paths
    OBJECT_STORAGE_PATH: str = "/Users/tigeren/Dev/digisky/legal-doc-masker/src_folder"
    TARGET_DIRECTORY_PATH: str = "/Users/tigeren/Dev/digisky/legal-doc-masker/target_folder"
    # Ollama API settings
    OLLAMA_API_URL: str = "https://api.ollama.com"
    OLLAMA_API_KEY: str = ""
    OLLAMA_MODEL: str = "llama2"
    # File monitoring settings
    MONITOR_INTERVAL: int = 5
    # Logging settings
    LOG_LEVEL: str = "INFO"
    LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
    LOG_FILE: str = "app.log"

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"
        extra = "allow"


# Create settings instance
settings = Settings()

17
src/main.py Normal file
View File

@ -0,0 +1,17 @@
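import os

from config.logging_config import setup_logging


def main():
    # Setup logging first
    setup_logging()
    from services.file_monitor import FileMonitor
    from services.ollama_client import OllamaClient
    from config.settings import settings
    # Assumption: DocumentService is importable as services.document_service;
    # adjust this import to the module's actual name.
    from services.document_service import DocumentService

    service = DocumentService(OllamaClient(settings.OLLAMA_MODEL))

    def on_new_file(input_path):
        # Write the result under the target directory, keeping the file name
        output_path = os.path.join(settings.TARGET_DIRECTORY_PATH, os.path.basename(input_path))
        service.process_document(input_path, output_path)

    # FileMonitor takes (directory, callback), so pass a callable rather than the target path
    file_monitor = FileMonitor(settings.OBJECT_STORAGE_PATH, on_new_file)
    # Start monitoring the directory for new files
    file_monitor.start_monitoring()


if __name__ == "__main__":
    main()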

12
src/models/document.py Normal file
View File

@ -0,0 +1,12 @@
class Document:
    def __init__(self, file_path):
        self.file_path = file_path
        self.content = ""

    def load(self):
        # Explicit encoding avoids platform-dependent defaults
        with open(self.file_path, 'r', encoding='utf-8') as file:
            self.content = file.read()

    def save(self, target_path):
        with open(target_path, 'w', encoding='utf-8') as file:
            file.write(self.content)

View File

@ -0,0 +1,25 @@
import os
from typing import Optional
from models.document_processor import DocumentProcessor
from models.processors import (
    TxtDocumentProcessor,
    DocxDocumentProcessor,
    PdfDocumentProcessor
)


class DocumentProcessorFactory:
    @staticmethod
    def create_processor(input_path: str, output_path: str) -> Optional[DocumentProcessor]:
        file_extension = os.path.splitext(input_path)[1].lower()
        # Legacy '.doc' is not mapped: python-docx reads only '.docx' files
        processors = {
            '.txt': TxtDocumentProcessor,
            '.docx': DocxDocumentProcessor,
            '.pdf': PdfDocumentProcessor
        }
        processor_class = processors.get(file_extension)
        if processor_class:
            return processor_class(input_path, output_path)
        return None
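The factory keys on the lowercased file extension and returns `None` for anything it has no entry for. A self-contained sketch of the same dispatch pattern, where the stub classes `StubTxt` and `StubPdf` are hypothetical stand-ins for the real processors:

```python
import os
from typing import Optional


class StubTxt:
    """Hypothetical stand-in for a processor class."""

    def __init__(self, input_path: str, output_path: str):
        self.input_path = input_path
        self.output_path = output_path


class StubPdf(StubTxt):
    """Hypothetical stand-in for a second processor class."""


# Maps lowercase extensions to processor classes, as the factory does
PROCESSORS = {".txt": StubTxt, ".pdf": StubPdf}


def create_processor(input_path: str, output_path: str) -> Optional[StubTxt]:
    extension = os.path.splitext(input_path)[1].lower()
    processor_class = PROCESSORS.get(extension)
    return processor_class(input_path, output_path) if processor_class else None


print(type(create_processor("contract.TXT", "out.txt")).__name__)  # StubTxt
print(create_processor("archive.zip", "out.zip"))  # None
```

Because the extension is lowercased before the lookup, `contract.TXT` and `contract.txt` resolve to the same class.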

View File

@ -0,0 +1,18 @@
from abc import ABC, abstractmethod
from typing import Any


class DocumentProcessor(ABC):
    @abstractmethod
    def read_content(self) -> str:
        """Read document content"""
        pass

    @abstractmethod
    def process_content(self, content: str) -> str:
        """Process document content"""
        pass

    @abstractmethod
    def save_content(self, content: str) -> None:
        """Save processed content"""
        pass

View File

@ -0,0 +1,5 @@
from models.processors.txt_processor import TxtDocumentProcessor
from models.processors.docx_processor import DocxDocumentProcessor
from models.processors.pdf_processor import PdfDocumentProcessor
__all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor']

View File

@ -0,0 +1,20 @@
import docx
from models.document_processor import DocumentProcessor


class DocxDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        self.input_path = input_path
        self.output_path = output_path

    def read_content(self) -> str:
        doc = docx.Document(self.input_path)
        return '\n'.join([paragraph.text for paragraph in doc.paragraphs])

    def process_content(self, content: str) -> str:
        # Implementation for processing docx content
        return content

    def save_content(self, content: str) -> None:
        doc = docx.Document()
        # One paragraph per line, mirroring how read_content() joins paragraphs
        for line in content.split('\n'):
            doc.add_paragraph(line)
        doc.save(self.output_path)

View File

@ -0,0 +1,20 @@
import PyPDF2
from models.document_processor import DocumentProcessor


class PdfDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        self.input_path = input_path
        self.output_path = output_path

    def read_content(self) -> str:
        with open(self.input_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            # Guard against pages that yield no extractable text
            return ' '.join([page.extract_text() or '' for page in pdf_reader.pages])

    def process_content(self, content: str) -> str:
        # Implementation for processing PDF content
        return content

    def save_content(self, content: str) -> None:
        # Saving as PDF is not implemented yet; fail loudly instead of
        # silently writing nothing and reporting success
        raise NotImplementedError("Saving processed content as PDF is not implemented")

View File

@ -0,0 +1,18 @@
from models.document_processor import DocumentProcessor


class TxtDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        self.input_path = input_path
        self.output_path = output_path

    def read_content(self) -> str:
        with open(self.input_path, 'r', encoding='utf-8') as file:
            return file.read()

    def process_content(self, content: str) -> str:
        # Implementation for processing text content
        return content

    def save_content(self, content: str) -> None:
        with open(self.output_path, 'w', encoding='utf-8') as file:
            file.write(content)

View File

@ -0,0 +1,30 @@
import logging
from models.document_factory import DocumentProcessorFactory
from services.ollama_client import OllamaClient

logger = logging.getLogger(__name__)


class DocumentService:
    def __init__(self, ollama_client: OllamaClient):
        self.ollama_client = ollama_client

    def process_document(self, input_path: str, output_path: str) -> bool:
        try:
            processor = DocumentProcessorFactory.create_processor(input_path, output_path)
            if not processor:
                logger.error(f"Unsupported file format: {input_path}")
                return False
            # Read content
            content = processor.read_content()
            # Process with Ollama
            processed_content = self.ollama_client.process_document(content)
            # Save processed content
            processor.save_content(processed_content)
            return True
        except Exception as e:
            logger.error(f"Error processing document {input_path}: {str(e)}")
            return False
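The service reads, transforms, and saves, and turns any exception into a `False` return. A rough sketch of how that flow could be exercised without files or a model; `FakeClient`, `BrokenClient`, `InMemoryProcessor`, and `run_pipeline` are hypothetical names introduced only for this illustration:

```python
class FakeClient:
    """Stands in for OllamaClient; upper-cases text instead of calling a model."""

    def process_document(self, text: str) -> str:
        return text.upper()


class BrokenClient:
    """Simulates a client whose backend is unreachable."""

    def process_document(self, text: str) -> str:
        raise RuntimeError("model unavailable")


class InMemoryProcessor:
    """Stands in for a DocumentProcessor; keeps content in memory."""

    def __init__(self, content: str):
        self.content = content
        self.saved = None

    def read_content(self) -> str:
        return self.content

    def save_content(self, content: str) -> None:
        self.saved = content


def run_pipeline(processor, client) -> bool:
    """Read, transform, save; report failure as False instead of raising."""
    try:
        processor.save_content(client.process_document(processor.read_content()))
        return True
    except Exception:
        return False


doc = InMemoryProcessor("party a")
print(run_pipeline(doc, FakeClient()), doc.saved)  # True PARTY A
print(run_pipeline(InMemoryProcessor("x"), BrokenClient()))  # False
```

Passing the client in through the constructor, as `DocumentService` does, is what makes this kind of substitution possible.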

View File

@ -0,0 +1,24 @@
import logging
import os
import time

logger = logging.getLogger(__name__)


class FileMonitor:
    def __init__(self, directory, callback):
        self.directory = directory
        self.callback = callback

    def start_monitoring(self):
        already_seen = set(os.listdir(self.directory))
        while True:
            time.sleep(1)  # Check every second
            current_files = set(os.listdir(self.directory))
            new_files = current_files - already_seen
            for new_file in new_files:
                logger.info(f"monitor: new file found: {new_file}")
                self.callback(os.path.join(self.directory, new_file))
            already_seen = current_files
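The monitor detects new files by taking the set difference between two directory listings. A minimal sketch of one polling pass pulled out into a function, so it can be tried without the infinite loop; `scan_once` is a hypothetical helper, not part of the commit:

```python
import os
import tempfile


def scan_once(directory: str, already_seen: set[str]) -> tuple[set[str], set[str]]:
    """Return (new_files, current_files) for one polling pass."""
    current_files = set(os.listdir(directory))
    return current_files - already_seen, current_files


with tempfile.TemporaryDirectory() as directory:
    _, seen = scan_once(directory, set())
    open(os.path.join(directory, "contract.txt"), "w").close()
    first_pass, seen = scan_once(directory, seen)
    print(sorted(first_pass))  # ['contract.txt']
    second_pass, seen = scan_once(directory, seen)
    print(sorted(second_pass))  # []
```

Because the listing replaces `already_seen` on every pass, a file that is deleted and later re-created under the same name is reported as new again.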

View File

@ -0,0 +1,15 @@
class OllamaClient:
    def __init__(self, model_name):
        self.model_name = model_name

    def process_document(self, document_text):
        # Here you would implement the logic to interact with the Ollama API
        # and process the document text using the specified model.
        # This is a placeholder for the actual API call.
        processed_text = self._mock_api_call(document_text)
        return processed_text

    def _mock_api_call(self, document_text):
        # Mock processing: In a real implementation, this would call the Ollama API.
        # For now, it just returns the input text with a note indicating it was processed.
        return f"Processed with {self.model_name}: {document_text}"
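The client above only mocks the API call. One way the placeholder might be filled in is with Ollama's `/api/generate` endpoint, sketched here with the standard library only. The base URL assumes a local Ollama server at its default address (not the URL in `.env.example`), and the function names and the prompt text are hypothetical:

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


# Building the request needs no running server; generate() does.
request = build_generate_request("http://localhost:11434", "llama2", "Mask the names in: ...")
print(request.full_url)  # http://localhost:11434/api/generate
```

Keeping request construction separate from the network call lets the payload be checked in a unit test without a model running.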

20
src/utils/file_utils.py Normal file
View File

@ -0,0 +1,20 @@
import os


def read_file(file_path):
    with open(file_path, 'r') as file:
        return file.read()


def write_file(file_path, content):
    with open(file_path, 'w') as file:
        file.write(content)


def file_exists(file_path):
    return os.path.isfile(file_path)


def delete_file(file_path):
    if file_exists(file_path):
        os.remove(file_path)


def list_files_in_directory(directory_path):
    return [f for f in os.listdir(directory_path) if os.path.isfile(os.path.join(directory_path, f))]

58
src_folder/README.md Normal file
View File

@ -0,0 +1,58 @@
# README.md
# Document Processing App
This project is designed to process legal documents by hiding sensitive information such as names and company names. It utilizes the Ollama API with selected models for text processing. The application monitors a specified directory for new files, processes them automatically, and saves the results to a target path.
## Project Structure
```
doc-processing-app
├── src
│   ├── main.py                      # Entry point of the application
│   ├── config
│   │   └── settings.py              # Configuration settings for paths
│   ├── services
│   │   ├── file_monitor.py          # Monitors directory for new files
│   │   ├── document_processor.py    # Handles document processing logic
│   │   └── ollama_client.py         # Interacts with the Ollama API
│   ├── utils
│   │   └── file_utils.py            # Utility functions for file operations
│   └── models
│       └── document.py              # Represents the structure of a document
├── tests
│   └── test_document_processor.py   # Unit tests for DocumentProcessor
├── requirements.txt                 # Project dependencies
├── .env.example                     # Example environment variables
└── README.md                        # Project documentation
```
## Setup Instructions
1. Clone the repository:
```
git clone <repository-url>
cd doc-processing-app
```
2. Install the required dependencies:
```
pip install -r requirements.txt
```
3. Configure the application by editing the `src/config/settings.py` file to set the paths for the object storage and target directory.
4. Create a `.env` file based on the `.env.example` file to set up necessary environment variables.
## Usage
To run the application, execute the following command:
```
python src/main.py
```
The application will start monitoring the specified directory for new documents. Once a new document is added, it will be processed automatically.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.