Commit Graph

47 Commits

Author SHA1 Message Date
oliviamn edad8e7322 fix: resolve the downloaded file suffix issue 2025-07-17 01:07:13 +08:00
oliviamn 68765ab45f fix: resolve the downloaded file extension issue 2025-07-17 00:20:11 +08:00
oliviamn 19d8e4a0b1 Image import/export workflow and scripts 2025-07-15 00:47:44 +08:00
oliviamn 4689fade84 Add a unified docker compose 2025-07-15 00:36:59 +08:00
oliviamn 88b790dd6b Update pdf_processor to work with mineru 2025-07-15 00:29:34 +08:00
oliviamn d3e1927bc5 Re-enable pdf_processor 2025-07-14 23:49:28 +08:00
oliviamn e8cb7b1a04 feat: adjust the NER mask rules 2025-07-14 23:48:55 +08:00
oliviamn 1ba4f3cc02 feat: add logging for building the mapping 2025-07-14 22:24:43 +08:00
tigerenwork daf316bb92 add mineru docker file 2025-07-13 17:48:18 +08:00
tigerenwork 94e500c990 add log for ner 2025-07-13 17:48:08 +08:00
tigerenwork a4d4a7608b feat: add some logging 2025-07-12 17:39:06 +08:00
oliviamn f2e6ab44c0 add .env 2025-07-12 16:36:56 +08:00
oliviamn 1649a9328b WIP: refactor NER processor 2025-07-10 00:14:16 +08:00
oliviamn 1cf3c45cee WIP 2025-07-06 21:11:23 +08:00
oliviamn a949902367 Complete all matching rules 2025-07-03 23:58:30 +08:00
oliviamn 5b1b8f8e9c feat: Enhance NER processing by adding company name mapping and refactoring prompt functions 2025-06-27 00:39:38 +08:00
oliviamn 5ddef90e8b feat: run NER separately on names 2025-06-25 01:31:12 +08:00
oliviamn ee95f1daa7 WIP: temporarily disable docx and pdf parsing 2025-06-25 01:30:43 +08:00
oliviamn 12c1b5f75e feat: display completion time 2025-06-01 15:47:58 +08:00
oliviamn e2ebd2fb09 feat: port 3000 2025-05-28 02:25:45 +08:00
oliviamn 2947743d28 feat: set env 2025-05-28 02:23:10 +08:00
oliviamn 329610088d feat: remove the development docker file 2025-05-28 02:05:46 +08:00
oliviamn c554bd0c2f feat: add dockerfile 2025-05-28 01:51:58 +08:00
oliviamn 3afe01f5f2 fix: set baseurl via an environment variable 2025-05-27 23:39:01 +08:00
oliviamn c3fc9459b8 fix: make baseUrl configurable 2025-05-27 23:29:25 +08:00
oliviamn fbdeba5088 feat: add file deletion 2025-05-26 23:19:43 +08:00
oliviamn dea3a6bd6a feat: integrate mineru in docker and fix incorrect download file names 2025-05-26 23:07:10 +08:00
oliviamn 345fd05a2b fix: resolve the issue where md files could not be uploaded 2025-05-26 00:06:37 +08:00
oliviamn b3cf9f98a7 refine 2025-05-25 16:45:48 +08:00
oliviamn 24c5bbd5d7 refine: remove the document data folder and replace it with sample_doc 2025-05-25 16:43:32 +08:00
oliviamn 13ef24a3da feat: add frontend 2025-05-25 00:37:20 +08:00
oliviamn 900a614b09 refine: fix the import path issue 2025-05-25 00:04:19 +08:00
oliviamn 3e9c44e8c4 refine: copy the original src contents to backend/app/core 2025-05-24 23:28:33 +08:00
oliviamn e0695e7f0e refine: src rename to core 2025-05-24 22:13:20 +08:00
oliviamn 76b0351f8f feat: add backend 2025-05-24 22:06:28 +08:00
oliviamn 47e78c35bb Add Markdown document processing support and enhance document handling
- Introduced `MarkdownDocumentProcessor` for handling markdown files, including reading and saving content.
- Updated `DocumentProcessorFactory` to include support for markdown file types.
- Enhanced existing document processors to utilize a shared initialization method for OllamaClient.
- Implemented chunking and mapping logic in `DocumentProcessor` for improved content processing and masking.
- Added utility class `LLMJsonExtractor` for extracting and parsing JSON from LLM outputs.
2025-05-24 21:05:48 +08:00
oliviamn caa4d6d2ef Update README.md to clarify installation steps and add LibreOffice dependency 2025-05-24 14:55:04 +08:00
oliviamn 5abfa4998d Implement docx to md conversion 2025-05-21 00:15:01 +08:00
oliviamn 0f158c159b Enhance PDF content masking by introducing mapping prompts
- Added a new function `get_masking_mapping_prompt` to generate prompts for creating a mapping of original names/companies to their masked versions.
- Updated `PdfDocumentProcessor` to utilize the new mapping prompt, processing each sentence individually for improved content masking.
2025-05-08 00:04:50 +08:00
oliviamn 7d0be5aa8a Abstract out the prompts 2025-05-06 00:13:19 +08:00
oliviamn 815427a509 Write files to the hidden .work directory under the output folder 2025-05-05 23:34:10 +08:00
oliviamn e6fb9b9a83 Adjust directory structure 2025-05-05 20:33:08 +08:00
oliviamn edca9a87a0 Refactor PdfDocumentProcessor to enhance PDF content processing
- Updated read_content method to return raw bytes instead of extracted text.
- Modified process_content method to handle bytes and generate multiple output files including markdown, JSON, and processed PDFs.
- Implemented directory setup for image storage and output management.
- Integrated PymuDocDataset for PDF classification and processing based on OCR capabilities.
2025-05-05 19:15:03 +08:00
oliviamn 6acf3e5423 Update requirements.txt to upgrade requests and add magic-pdf dependency 2025-05-05 18:53:22 +08:00
tigermren 592fb66f40 Enhance document processing with Ollama integration and update .gitignore
- Added OllamaClient for document processing in TxtDocumentProcessor.
- Updated process_content method to use Ollama API for content masking.
- Refactored FileMonitor to utilize DocumentService with OllamaClient.
- Removed unnecessary log files and Python cache files.
- Added test file for document processing validation.
2025-04-23 01:09:33 +08:00
tigermren fc68c243bb add gitignore 2025-04-23 00:06:39 +08:00
tigermren 0904ab5073 Initial commit: Document processing app with Ollama integration 2025-04-23 00:02:10 +08:00