Initial commit: Document processing app with Ollama integration
commit 0904ab5073
.gitignore
@@ -0,0 +1,13 @@
__pycache__
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
.env
*.log
.git
.gitignore
.pytest_cache
tests/

.env.example
@@ -0,0 +1,19 @@
# Storage paths
OBJECT_STORAGE_PATH=/path/to/mounted/object/storage
TARGET_DIRECTORY_PATH=/path/to/target/directory

# Ollama API Configuration
OLLAMA_API_URL=https://api.ollama.com
OLLAMA_API_KEY=your_api_key_here
OLLAMA_MODEL=llama2

# Application Settings
MONITOR_INTERVAL=5

# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=app.log

# Optional: Additional security settings
# MAX_FILE_SIZE=10485760  # 10MB in bytes
# ALLOWED_FILE_TYPES=.txt,.doc,.docx,.pdf

Dockerfile
@@ -0,0 +1,48 @@
# Build stage
FROM python:3.12-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first to leverage Docker cache
COPY requirements.txt .
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt

# Final stage
FROM python:3.12-slim

WORKDIR /app

# Create non-root user
RUN useradd -m -r appuser && \
    chown appuser:appuser /app

# Copy wheels from builder
COPY --from=builder /app/wheels /wheels
COPY --from=builder /app/requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir /wheels/*

# Copy application code
COPY src/ ./src/

# Create directories for mounted volumes
RUN mkdir -p /data/input /data/output && \
    chown -R appuser:appuser /data

# Switch to non-root user
USER appuser

# Environment variables
ENV PYTHONPATH=/app \
    OBJECT_STORAGE_PATH=/data/input \
    TARGET_DIRECTORY_PATH=/data/output

# Run the application
CMD ["python", "src/main.py"]

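The image expects the input and output directories to be supplied as bind mounts over `/data/input` and `/data/output`, which the ENV lines above point the application at. A minimal sketch of building and running it; the image tag, host paths, and the use of `--env-file .env` are illustrative, not part of this commit:

```
docker build -t doc-processing-app .
docker run --rm \
  --env-file .env \
  -v /path/on/host/input:/data/input \
  -v /path/on/host/output:/data/output \
  doc-processing-app
```
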
README.md
@@ -0,0 +1,58 @@
# README.md

# Document Processing App

This project processes legal documents by masking sensitive information such as personal names and company names. It uses the Ollama API with selected models for text processing. The application monitors a specified directory for new files, processes them automatically, and saves the results to a target path.

## Project Structure

```
doc-processing-app
├── src
│   ├── main.py                     # Entry point of the application
│   ├── config
│   │   └── settings.py             # Configuration settings for paths
│   ├── services
│   │   ├── file_monitor.py         # Monitors directory for new files
│   │   ├── document_processor.py   # Handles document processing logic
│   │   └── ollama_client.py        # Interacts with the Ollama API
│   ├── utils
│   │   └── file_utils.py           # Utility functions for file operations
│   └── models
│       └── document.py             # Represents the structure of a document
├── tests
│   └── test_document_processor.py  # Unit tests for DocumentProcessor
├── requirements.txt                # Project dependencies
├── .env.example                    # Example environment variables
└── README.md                       # Project documentation
```

## Setup Instructions

1. Clone the repository:
   ```
   git clone <repository-url>
   cd doc-processing-app
   ```

2. Install the required dependencies:
   ```
   pip install -r requirements.txt
   ```

3. Configure the application by editing `src/config/settings.py` to set the paths for the object storage and target directory.

4. Create a `.env` file based on `.env.example` to set the necessary environment variables.

## Usage

To run the application, execute:
```
python src/main.py
```

The application will start monitoring the specified directory for new documents. Once a new document is added, it will be processed automatically.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

app.log
@@ -0,0 +1 @@
2025-04-20 20:14:00 - services.file_monitor - INFO - monitor: new file found: README.md

requirements.txt
@@ -0,0 +1,10 @@
# Base dependencies
pydantic-settings>=2.0.0
python-dotenv==1.0.0
watchdog==2.1.6
requests==2.26.0

# Document processing
python-docx>=0.8.11
PyPDF2>=3.0.0
pandas>=2.0.0

Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.

src/config/logging_config.py
@@ -0,0 +1,39 @@
import logging.config
from config.settings import settings

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "standard": {
            "format": settings.LOG_FORMAT,
            "datefmt": settings.LOG_DATE_FORMAT
        },
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "standard",
            "level": settings.LOG_LEVEL,
            "stream": "ext://sys.stdout"
        },
        "file": {
            "class": "logging.FileHandler",
            "formatter": "standard",
            "level": settings.LOG_LEVEL,
            "filename": settings.LOG_FILE,
            "mode": "a",
        }
    },
    "loggers": {
        "": {  # root logger
            "handlers": ["console", "file"],
            "level": settings.LOG_LEVEL,
            "propagate": True
        }
    }
}

def setup_logging():
    """Initialize logging configuration"""
    logging.config.dictConfig(LOGGING_CONFIG)

src/config/settings.py
@@ -0,0 +1,31 @@
# settings.py

from pydantic_settings import BaseSettings
from typing import Optional

class Settings(BaseSettings):
    # Storage paths
    OBJECT_STORAGE_PATH: str = "/Users/tigeren/Dev/digisky/legal-doc-masker/src_folder"
    TARGET_DIRECTORY_PATH: str = "/Users/tigeren/Dev/digisky/legal-doc-masker/target_folder"

    # Ollama API settings
    OLLAMA_API_URL: str = "https://api.ollama.com"
    OLLAMA_API_KEY: str = ""
    OLLAMA_MODEL: str = "llama2"

    # File monitoring settings
    MONITOR_INTERVAL: int = 5

    # Logging settings
    LOG_LEVEL: str = "INFO"
    LOG_FORMAT: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    LOG_DATE_FORMAT: str = "%Y-%m-%d %H:%M:%S"
    LOG_FILE: str = "app.log"

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"
        extra = "allow"

# Create settings instance
settings = Settings()

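Because `Settings` inherits from pydantic-settings' `BaseSettings`, every field above can be overridden through the process environment or the `.env` file without code changes. A small illustrative check; the model name used here is arbitrary:

```
import os

# Illustrative override: environment variables take precedence over field defaults
os.environ["OLLAMA_MODEL"] = "mistral"

from config.settings import Settings

print(Settings().OLLAMA_MODEL)  # -> "mistral"
```
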
src/main.py
@@ -0,0 +1,17 @@
import os

from config.logging_config import setup_logging

def main():
    # Set up logging before importing modules that create module-level loggers
    setup_logging()

    from config.settings import settings
    from services.file_monitor import FileMonitor
    from services.document_service import DocumentService  # module path assumed for the DocumentService class below
    from services.ollama_client import OllamaClient

    ollama_client = OllamaClient(settings.OLLAMA_MODEL)
    document_service = DocumentService(ollama_client)

    def handle_new_file(input_path: str) -> None:
        # Write the processed document to the target directory under the same file name
        output_path = os.path.join(settings.TARGET_DIRECTORY_PATH, os.path.basename(input_path))
        document_service.process_document(input_path, output_path)

    # FileMonitor expects (directory, callback); the handler processes each new file
    file_monitor = FileMonitor(settings.OBJECT_STORAGE_PATH, handle_new_file)

    # Start monitoring the directory for new files
    file_monitor.start_monitoring()

if __name__ == "__main__":
    main()

src/models/document.py
@@ -0,0 +1,12 @@
class Document:
    def __init__(self, file_path):
        self.file_path = file_path
        self.content = ""

    def load(self):
        with open(self.file_path, 'r') as file:
            self.content = file.read()

    def save(self, target_path):
        with open(target_path, 'w') as file:
            file.write(self.content)

src/models/document_factory.py
@@ -0,0 +1,25 @@
import os
from typing import Optional
from models.document_processor import DocumentProcessor
from models.processors import (
    TxtDocumentProcessor,
    DocxDocumentProcessor,
    PdfDocumentProcessor
)

class DocumentProcessorFactory:
    @staticmethod
    def create_processor(input_path: str, output_path: str) -> Optional[DocumentProcessor]:
        file_extension = os.path.splitext(input_path)[1].lower()

        processors = {
            '.txt': TxtDocumentProcessor,
            '.docx': DocxDocumentProcessor,
            '.doc': DocxDocumentProcessor,
            '.pdf': PdfDocumentProcessor
        }

        processor_class = processors.get(file_extension)
        if processor_class:
            return processor_class(input_path, output_path)
        return None

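A brief usage sketch of the factory; the file names are illustrative, and it assumes the package is imported the same way the service layer imports it (`models.document_factory`):

```
from models.document_factory import DocumentProcessorFactory

# Illustrative paths; any of .txt, .docx, .doc, .pdf resolves to a processor
processor = DocumentProcessorFactory.create_processor("contract.docx", "contract_masked.docx")
if processor is None:
    print("Unsupported file format")
else:
    text = processor.read_content()
    processor.save_content(processor.process_content(text))
```
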
src/models/document_processor.py
@@ -0,0 +1,18 @@
from abc import ABC, abstractmethod
from typing import Any

class DocumentProcessor(ABC):
    @abstractmethod
    def read_content(self) -> str:
        """Read document content"""
        pass

    @abstractmethod
    def process_content(self, content: str) -> str:
        """Process document content"""
        pass

    @abstractmethod
    def save_content(self, content: str) -> None:
        """Save processed content"""
        pass

src/models/processors/__init__.py
@@ -0,0 +1,5 @@
from models.processors.txt_processor import TxtDocumentProcessor
from models.processors.docx_processor import DocxDocumentProcessor
from models.processors.pdf_processor import PdfDocumentProcessor

__all__ = ['TxtDocumentProcessor', 'DocxDocumentProcessor', 'PdfDocumentProcessor']

src/models/processors/docx_processor.py
@@ -0,0 +1,20 @@
import docx
from models.document_processor import DocumentProcessor

class DocxDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        self.input_path = input_path
        self.output_path = output_path

    def read_content(self) -> str:
        doc = docx.Document(self.input_path)
        return '\n'.join([paragraph.text for paragraph in doc.paragraphs])

    def process_content(self, content: str) -> str:
        # Implementation for processing docx content
        return content

    def save_content(self, content: str) -> None:
        doc = docx.Document()
        doc.add_paragraph(content)
        doc.save(self.output_path)

src/models/processors/pdf_processor.py
@@ -0,0 +1,20 @@
import PyPDF2
from models.document_processor import DocumentProcessor

class PdfDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        self.input_path = input_path
        self.output_path = output_path

    def read_content(self) -> str:
        with open(self.input_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            return ' '.join([page.extract_text() for page in pdf_reader.pages])

    def process_content(self, content: str) -> str:
        # Implementation for processing PDF content
        return content

    def save_content(self, content: str) -> None:
        # Implementation for saving as PDF
        pass

src/models/processors/txt_processor.py
@@ -0,0 +1,18 @@
from models.document_processor import DocumentProcessor

class TxtDocumentProcessor(DocumentProcessor):
    def __init__(self, input_path: str, output_path: str):
        self.input_path = input_path
        self.output_path = output_path

    def read_content(self) -> str:
        with open(self.input_path, 'r', encoding='utf-8') as file:
            return file.read()

    def process_content(self, content: str) -> str:
        # Implementation for processing text content
        return content

    def save_content(self, content: str) -> None:
        with open(self.output_path, 'w', encoding='utf-8') as file:
            file.write(content)

Binary file not shown.

src/services/document_service.py
@@ -0,0 +1,30 @@
import logging
from models.document_factory import DocumentProcessorFactory
from services.ollama_client import OllamaClient

logger = logging.getLogger(__name__)

class DocumentService:
    def __init__(self, ollama_client: OllamaClient):
        self.ollama_client = ollama_client

    def process_document(self, input_path: str, output_path: str) -> bool:
        try:
            processor = DocumentProcessorFactory.create_processor(input_path, output_path)
            if not processor:
                logger.error(f"Unsupported file format: {input_path}")
                return False

            # Read content
            content = processor.read_content()

            # Process with Ollama
            processed_content = self.ollama_client.process_document(content)

            # Save processed content
            processor.save_content(processed_content)
            return True

        except Exception as e:
            logger.error(f"Error processing document {input_path}: {str(e)}")
            return False

src/services/file_monitor.py
@@ -0,0 +1,24 @@
import logging

logger = logging.getLogger(__name__)

class FileMonitor:
    def __init__(self, directory, callback):
        self.directory = directory
        self.callback = callback

    def start_monitoring(self):
        import time
        import os

        already_seen = set(os.listdir(self.directory))
        while True:
            time.sleep(1)  # Check every second
            current_files = set(os.listdir(self.directory))
            new_files = current_files - already_seen

            for new_file in new_files:
                logger.info(f"monitor: new file found: {new_file}")
                self.callback(os.path.join(self.directory, new_file))

            already_seen = current_files

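requirements.txt pins `watchdog`, but the monitor above uses a simple polling loop and never imports it. If event-driven watching is the eventual intent, a minimal sketch with watchdog could look like the following; the handler class and its wiring are illustrative, not part of this commit:

```
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewFileHandler(FileSystemEventHandler):
    """Illustrative handler that forwards newly created files to a callback."""

    def __init__(self, callback):
        self.callback = callback

    def on_created(self, event):
        # Skip directories; pass regular files to the callback, like FileMonitor does
        if not event.is_directory:
            self.callback(event.src_path)

def watch_directory(directory, callback):
    observer = Observer()
    observer.schedule(NewFileHandler(callback), directory, recursive=False)
    observer.start()
    try:
        observer.join()  # Block until interrupted
    except KeyboardInterrupt:
        observer.stop()
        observer.join()
```
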
src/services/ollama_client.py
@@ -0,0 +1,15 @@
class OllamaClient:
    def __init__(self, model_name):
        self.model_name = model_name

    def process_document(self, document_text):
        # Here you would implement the logic to interact with the Ollama API
        # and process the document text using the specified model.
        # This is a placeholder for the actual API call.
        processed_text = self._mock_api_call(document_text)
        return processed_text

    def _mock_api_call(self, document_text):
        # Mock processing: in a real implementation, this would call the Ollama API.
        # For now, it just returns the input text with a note indicating it was processed.
        return f"Processed with {self.model_name}: {document_text}"

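The client above only mocks the call. For reference, a hedged sketch of what `_mock_api_call` could be replaced with, assuming the configured endpoint exposes Ollama's standard `/api/generate` route and that `OLLAMA_API_KEY` is meant to be sent as a bearer token; the masking prompt wording is purely illustrative:

```
import requests

from config.settings import settings

def _call_ollama(model_name, document_text):
    # Non-streaming generate request against an Ollama-style API
    response = requests.post(
        f"{settings.OLLAMA_API_URL}/api/generate",
        headers={"Authorization": f"Bearer {settings.OLLAMA_API_KEY}"},
        json={
            "model": model_name,
            "prompt": f"Mask all personal and company names in the following text:\n\n{document_text}",
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    # The non-streaming response carries the generated text in the "response" field
    return response.json()["response"]
```
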
src/utils/file_utils.py
@@ -0,0 +1,20 @@
import os

def read_file(file_path):
    with open(file_path, 'r') as file:
        return file.read()

def write_file(file_path, content):
    with open(file_path, 'w') as file:
        file.write(content)

def file_exists(file_path):
    return os.path.isfile(file_path)

def delete_file(file_path):
    if file_exists(file_path):
        os.remove(file_path)

def list_files_in_directory(directory_path):
    return [f for f in os.listdir(directory_path) if os.path.isfile(os.path.join(directory_path, f))]
