Releases: opendatalab/MinerU
magic_pdf-0.10.6-released
What's Changed
- perf(model): optimize model initialization by @myhloli in #1198
- fix: update notify by @dt-yy in #1201
- fix(model): simplify model initialization logic by @myhloli in #1207
- feat: update test case by @dt-yy in #1209
- build(deps): specify minimum version for ultralytics by @myhloli in #1212
- Refactor/add user api by @icecraft in #1178
- fix(dict2md): add space for inline equations in CJK contexts by @myhloli in #1222
- fix: OCR txt mode error and lost pdf_parse_type field by @icecraft in #1224
- fix: add parse_pdf_type and version by @icecraft in #1228
- fix: unicode decode error by @icecraft in #1231
- fix(detect_invalid_chars): fix the stack error caused by multiple memory releases in PyMuPDF by @myhloli in #1252
- fix: dup classify pdf type by @icecraft in #1258
- feat(layout): improve layout detection for DocLayout_YOLO model by @myhloli in #1259
- refactor(draw_bbox): remove redundant '_line_sort' suffix from output filename by @myhloli in #1263
- build(docker): add torch and torchvision dependencies by @myhloli in #1264
Full Changelog: magic_pdf-0.10.5-released...magic_pdf-0.10.6-released
magic_pdf-0.10.5-released
What's Changed
- fix: fix filename error by @LollipopsAndWine in #1154
- refactor(para): adjust line height multiplier for block splitting by @myhloli in #1156
- fix(pre_proc): prevent errors when imageWriter is None by @myhloli in #1166
Full Changelog: magic_pdf-0.10.4-released...magic_pdf-0.10.5-released
magic_pdf-0.10.4-released
What's Changed
Full Changelog: magic_pdf-0.10.3-released...magic_pdf-0.10.4-released
magic_pdf-0.10.3-released
What's Changed
- fix(Hybrid OCR): enable Hybrid OCR for empty spans that contain a certain number of placeholders but no actual text by @myhloli in #1132
- refactor(para): improve language detection and block splitting by @myhloli in #1134
- feat(pdf_parse): filter out skewed text lines by @myhloli in #1135
- refactor(ocr): improve text processing and span handling by @myhloli in #1136
- refactor(pdf_check): improve character detection using PyMuPDF by @myhloli in #1137
- feat(pdf_parse): add line start flag detection and optimize line stop flag logic by @myhloli in #1138
- fix(ocr_mkcontent): handle empty paragraphs on pages by @myhloli in #1139
- refactor(pdf_parse): adjust character-axis alignment algorithm by @myhloli in #1140
- refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment by @myhloli in #1141
Full Changelog: magic_pdf-0.10.2-released...magic_pdf-0.10.3-released
magic_pdf-0.10.2-released
What's Changed
- fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block. by @myhloli in #1082
- refactor(txt_spans_extract_v2): optimize span processing and OCR logic by @myhloli in #1086
- feat(ocr): filter out low confidence ocr results by @myhloli in #1088
- feat(pdf_parse): add OCR score to span data by @myhloli in #1089
- fix: test_rag by @icecraft in #1105
- perf(image_processing): reduce maximum image size for analysis by @myhloli in #1106
- fix: test_tools unittest by @icecraft in #1104
- refactor(libs): remove unused imports and functions by @myhloli in #1112
- Feat/add s3 read write example by @icecraft in #1117
Full Changelog: magic_pdf-0.10.1-released...magic_pdf-0.10.2-released
magic_pdf-0.10.1-released
magic_pdf-0.10.0-released
What's Changed
- fix: fix issue #715 by @LollipopsAndWine in #971
- docs(README): update GPU hardware recommendations and table recognition options by @myhloli in #973
- docs: improve GPU support list formatting in README_zh-CN.md by @myhloli in #974
- docs: update feature description for table conversion by @myhloli in #975
- docs: update readme by @myhloli in #977
- update ci by @dt-yy in #986
- test(unittest): Restore unit test cases by @myhloli in #998
- refactor(tests): extract common test utilities into test_commons.py by @myhloli in #1001
- feat(ocr): improve handling of angled text boxes by @myhloli in #1010
- refactor(para): improve paragraph splitting logic by @myhloli in #1013
- build(setup): add old_linux specific dependencies by @myhloli in #1016
- refactor(para): adjust right margin threshold based on block width by @myhloli in #1018
- fix: using new data api replace old rw api by @icecraft in #1006
- delete unused pipeline file by @liugongjian in #1024
- refactor: move some constants or enums defs to config folder by @icecraft in #1027
- fix: remove test code by @icecraft in #1036
- fix(tools): handle empty language string in common.py by @myhloli in #1045
- refactor(ocr_dict_merge): add threshold parameter for line merging by @myhloli in #1046
- fix(ocr_mkcontent): improve hyphen handling at line ends by @myhloli in #1047
- fix(remove_overlaps_min_spans): optimize overlap detection in OCR span list modification by @myhloli in #1048
- feat(ocr): improve text detection and OCR accuracy by @myhloli in #1049
- refactor(txt_parse): improve text extraction accuracy with new algorithm by @myhloli in #1050
- fix: use concrete class instead of abstract class by @icecraft in #1052
- fix(pdf_parse): improve line stop flag detection accuracy by @myhloli in #1053
- test: comment out assertions for metascan classify and meta scan tests by @myhloli in #1054
- Add test cases to json compressor util by @liugongjian in #1056
- refactor(para): improve line stop flag and remove unused debug mode by @myhloli in #1058
- fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1060
- refactor(model): move page total time logging to custom model analysis by @myhloli in #1061
- fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1062
- fix(pdf_parse): improve OCR result handling by @myhloli in #1064
New Contributors
- @liugongjian made their first contribution in #1024
Full Changelog: magic_pdf-0.9.3-released...magic_pdf-0.10.0-released
magic_pdf-0.9.3-released
What's Changed
- feat(model): add xycut algorithm for block sorting by @myhloli in #898
- refactor(pdf_parse): adjust line count threshold for layoutreader by @myhloli in #902
- Feat/add en docs by @icecraft in #906
- feat: using next_docs by @icecraft in #907
- feat(table): integrate RapidTable model for table recognition by @myhloli in #910
- fix(gradio-app): add missing file type in upload by @myhloli in #911
- refactor(magic_pdf_parse_main): optimize model data handling and JSON output by @myhloli in #912
- Modify the test directory by @DTwz in #913
- test(table): improve ppTableModel test coverage by @myhloli in #914
- feat(table): add RapidOCR support for RapidTable model by @myhloli in #915
- Add DocLayout-YOLO hyperlink by @qiangqiang199 in #889
- fix: remove classes hierarchy diagram by @icecraft in #919
- refactor(model download script) by @myhloli in #922
- docs(readme): update table recognition configuration and documentation by @myhloli in #924
- docs(README_ja-JP.md): update warning message and remove outdated content by @myhloli in #925
- Update para_split_v3.py by @hyastar in #923
- Style/docs by @icecraft in #927
- docs: rewrite zh_cn docs without translate by @icecraft in #928
- fix: typo by @icecraft in #931
- fix: fix the download_models.py script path in the Dockerfile by @kimi360 in #938
- build(Dockerfile): update model download script and dependencies by @myhloli in #941
- fix(ocr_mkcontent): improve handling of single-character content #937 by @myhloli in #943
- feat: tune docs by @icecraft in #948
- fix(parse_pipeline): Resolve post-processing exceptions caused by partial PDFs due to file corruption or non-standard format by forcing a re-print. by @myhloli in #957
- refactor(model): rename and restructure model modules by @myhloli in #964
- docs: update docs for 0.9.3 by @myhloli in #965
- docs(README): update project references and translations by @myhloli in #967
New Contributors
- @DTwz made their first contribution in #913
- @qiangqiang199 made their first contribution in #889
- @hyastar made their first contribution in #923
- @kimi360 made their first contribution in #938
Full Changelog: magic_pdf-0.9.2-released...magic_pdf-0.9.3-released
magic_pdf-0.9.2-released
magic_pdf-0.9.1-released
What's Changed
- Feat/tune docs by @icecraft in #833
- fix(ocr_mkcontent): improve content handling for different languages and equation types by @myhloli in #839
- feat(list): improve list detection algorithm & fix(list): improve list identification accuracy by @myhloli in #843
- docs(tutorial): update magic-pdf command with output directory by @myhloli in #844
- feat(para_split_v3): improve list identification with block aspect ratio by @myhloli in #845
- fix(dict2md): improve text concatenation logic by @myhloli in #847
- Update pdf_extract_kit.py by @CiaranYoung in #853
- feat(table): upgrade StructEqTable model and integrate into PDF Extract Kit by @myhloli in #854
- feat(model): add HTML minification to StructTableModel by @myhloli in #855
- chore: add .gitattributes to configure file linguist attributes by @myhloli in #856
- fix(merge_text): add ligature replacement functionality #305 #241 by @myhloli in #857
- chore: add CSS and SCSS files to linguist-vendored; update .gitattributes to mark CSS and SCSS files as vendored by @myhloli in #858
- docs(README): update Colab demo link by @myhloli in #860
- fix(table): improve table image processing by @myhloli in #866
- docs(faq): add troubleshooting for illegal instruction error on Linux servers by @myhloli in #867
- feat: replace the mineru_demo API documentation with a link by @LollipopsAndWine in #871
- test(table): improve HTML validation for table extraction by @myhloli in #874
- docs: update arXiv paper link in README files by @myhloli in #875
- docs(README): update changelog for v0.9.1 release by @myhloli in #877
New Contributors
- @CiaranYoung made their first contribution in #853
Full Changelog: magic_pdf-0.9.0-released...magic_pdf-0.9.1-released