- Introduced `MarkdownDocumentProcessor` for handling markdown files, including reading and saving content.
- Updated `DocumentProcessorFactory` to include support for markdown file types.
- Refactored existing document processors to share a single initialization method for `OllamaClient`.
- Implemented chunking and mapping logic in `DocumentProcessor` for improved content processing and masking.
- Added utility class `LLMJsonExtractor` for extracting and parsing JSON from LLM outputs.
- Added `get_masking_mapping_prompt`, which generates prompts for mapping original person and company names to their masked versions.
- Updated `PdfDocumentProcessor` to use the new mapping prompt and process each sentence individually for improved content masking.
- Updated the `read_content` method to return raw bytes instead of extracted text.
- Modified the `process_content` method to handle bytes and generate multiple output files, including markdown, JSON, and processed PDFs.
- Implemented directory setup for image storage and output management.
- Integrated `PymuDocDataset` for PDF classification and processing based on OCR capabilities.
- Added `OllamaClient` for document processing in `TxtDocumentProcessor`.
- Updated the `process_content` method to use the Ollama API for content masking.
- Refactored `FileMonitor` to use `DocumentService` with `OllamaClient`.
- Removed unnecessary log files and Python cache files.
- Added a test file for validating document processing.
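
For reviewers, the JSON extraction and name-mapping steps listed above can be pictured roughly as follows. This is a minimal sketch, not the code in this change: `extract_json` and `apply_mapping` are hypothetical names, and the assumption that the LLM returns a flat JSON object of original-to-masked strings (possibly wrapped in a fenced block or surrounded by chatter) is mine.

```python
import json
import re
from typing import Any, Dict, Optional

# Matches a fenced block, optionally tagged "json", and captures its body.
_FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def extract_json(text: str) -> Optional[Any]:
    """Return the first JSON object or array found in text, or None."""
    # Prefer the body of a fenced block if the model emitted one.
    match = _FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    decoder = json.JSONDecoder()
    # Try decoding from every "{" or "[" until one parses cleanly.
    for index, char in enumerate(candidate):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(candidate, index)
                return value
            except json.JSONDecodeError:
                continue
    return None


def apply_mapping(text: str, mapping: Dict[str, str]) -> str:
    """Replace every original name in text with its masked version."""
    # Longest keys first, so "Acme Corp" is preferred over "Acme".
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    if not keys:
        return text
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # A single regex pass avoids re-masking text that was already replaced.
    return pattern.sub(lambda m: mapping[m.group(0)], text)
```

Substituting in one regex pass, rather than calling `str.replace` once per key, keeps a masked value from being rewritten again when it happens to contain another key.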