Commit Graph

25 Commits

Author SHA1 Message Date
oliviamn c554bd0c2f feat: add Dockerfile 2025-05-28 01:51:58 +08:00
oliviamn 3afe01f5f2 fix: set baseUrl via environment variable 2025-05-27 23:39:01 +08:00
oliviamn c3fc9459b8 fix: make baseUrl configurable 2025-05-27 23:29:25 +08:00
oliviamn fbdeba5088 feat: add file deletion feature 2025-05-26 23:19:43 +08:00
oliviamn dea3a6bd6a feat: integrate MinerU into Docker and fix incorrect download file names 2025-05-26 23:07:10 +08:00
oliviamn 345fd05a2b fix: resolve issue where md files could not be uploaded 2025-05-26 00:06:37 +08:00
oliviamn b3cf9f98a7 refine 2025-05-25 16:45:48 +08:00
oliviamn 24c5bbd5d7 refine: remove the document data folder and replace it with sample_doc 2025-05-25 16:43:32 +08:00
oliviamn 13ef24a3da feat: add frontend 2025-05-25 00:37:20 +08:00
oliviamn 900a614b09 refine: fix import path issues 2025-05-25 00:04:19 +08:00
oliviamn 3e9c44e8c4 refine: copy the contents of the original src into backend/app/core 2025-05-24 23:28:33 +08:00
oliviamn e0695e7f0e refine: rename src to core 2025-05-24 22:13:20 +08:00
oliviamn 76b0351f8f feat: add backend 2025-05-24 22:06:28 +08:00
oliviamn 47e78c35bb Add Markdown document processing support and enhance document handling
- Introduced `MarkdownDocumentProcessor` for handling markdown files, including reading and saving content.
- Updated `DocumentProcessorFactory` to include support for markdown file types.
- Enhanced existing document processors to utilize a shared initialization method for OllamaClient.
- Implemented chunking and mapping logic in `DocumentProcessor` for improved content processing and masking.
- Added utility class `LLMJsonExtractor` for extracting and parsing JSON from LLM outputs.
2025-05-24 21:05:48 +08:00
oliviamn caa4d6d2ef Update README.md to clarify installation steps and add LibreOffice dependency 2025-05-24 14:55:04 +08:00
oliviamn 5abfa4998d Implement docx-to-md conversion 2025-05-21 00:15:01 +08:00
oliviamn 0f158c159b Enhance PDF content masking by introducing mapping prompts
- Added a new function `get_masking_mapping_prompt` to generate prompts for creating a mapping of original names/companies to their masked versions.
- Updated `PdfDocumentProcessor` to utilize the new mapping prompt, processing each sentence individually for improved content masking.
2025-05-08 00:04:50 +08:00
oliviamn 7d0be5aa8a Abstract out the prompts 2025-05-06 00:13:19 +08:00
oliviamn 815427a509 Write files to the hidden .work directory under the output folder 2025-05-05 23:34:10 +08:00
oliviamn e6fb9b9a83 Adjust directory structure 2025-05-05 20:33:08 +08:00
oliviamn edca9a87a0 Refactor PdfDocumentProcessor to enhance PDF content processing
- Updated read_content method to return raw bytes instead of extracted text.
- Modified process_content method to handle bytes and generate multiple output files including markdown, JSON, and processed PDFs.
- Implemented directory setup for image storage and output management.
- Integrated PymuDocDataset for PDF classification and processing based on OCR capabilities.
2025-05-05 19:15:03 +08:00
oliviamn 6acf3e5423 Update requirements.txt to upgrade requests and add magic-pdf dependency 2025-05-05 18:53:22 +08:00
tigermren 592fb66f40 Enhance document processing with Ollama integration and update .gitignore
- Added OllamaClient for document processing in TxtDocumentProcessor.
- Updated process_content method to use Ollama API for content masking.
- Refactored FileMonitor to utilize DocumentService with OllamaClient.
- Removed unnecessary log files and Python cache files.
- Added test file for document processing validation.
2025-04-23 01:09:33 +08:00
tigermren fc68c243bb add gitignore 2025-04-23 00:06:39 +08:00
tigermren 0904ab5073 Initial commit: Document processing app with Ollama integration 2025-04-23 00:02:10 +08:00