Commit Graph

12 Commits

Author SHA1 Message Date
oliviamn 47e78c35bb Add Markdown document processing support and enhance document handling
- Introduced `MarkdownDocumentProcessor` for handling markdown files, including reading and saving content.
- Updated `DocumentProcessorFactory` to include support for markdown file types.
- Enhanced existing document processors to utilize a shared initialization method for OllamaClient.
- Implemented chunking and mapping logic in `DocumentProcessor` for improved content processing and masking.
- Added utility class `LLMJsonExtractor` for extracting and parsing JSON from LLM outputs.
2025-05-24 21:05:48 +08:00
oliviamn caa4d6d2ef Update README.md to clarify installation steps and add LibreOffice dependency 2025-05-24 14:55:04 +08:00
oliviamn 5abfa4998d Implement docx-to-md conversion 2025-05-21 00:15:01 +08:00
oliviamn 0f158c159b Enhance PDF content masking by introducing mapping prompts
- Added a new function `get_masking_mapping_prompt` to generate prompts for creating a mapping of original names/companies to their masked versions.
- Updated `PdfDocumentProcessor` to utilize the new mapping prompt, processing each sentence individually for improved content masking.
2025-05-08 00:04:50 +08:00
oliviamn 7d0be5aa8a Abstract out the prompts 2025-05-06 00:13:19 +08:00
oliviamn 815427a509 Write files to the hidden .work directory under the output folder 2025-05-05 23:34:10 +08:00
oliviamn e6fb9b9a83 Adjust directory structure 2025-05-05 20:33:08 +08:00
oliviamn edca9a87a0 Refactor PdfDocumentProcessor to enhance PDF content processing
- Updated read_content method to return raw bytes instead of extracted text.
- Modified process_content method to handle bytes and generate multiple output files including markdown, JSON, and processed PDFs.
- Implemented directory setup for image storage and output management.
- Integrated PymuDocDataset for PDF classification and processing based on OCR capabilities.
2025-05-05 19:15:03 +08:00
oliviamn 6acf3e5423 Update requirements.txt to upgrade requests and add magic-pdf dependency 2025-05-05 18:53:22 +08:00
tigermren 592fb66f40 Enhance document processing with Ollama integration and update .gitignore
- Added OllamaClient for document processing in TxtDocumentProcessor.
- Updated process_content method to use Ollama API for content masking.
- Refactored FileMonitor to utilize DocumentService with OllamaClient.
- Removed unnecessary log files and Python cache files.
- Added test file for document processing validation.
2025-04-23 01:09:33 +08:00
tigermren fc68c243bb add gitignore 2025-04-23 00:06:39 +08:00
tigermren 0904ab5073 Initial commit: Document processing app with Ollama integration 2025-04-23 00:02:10 +08:00