Release 1.1.0 #1614

Merged
merged 77 commits into from
Jan 23, 2025

Conversation

myhloli
Collaborator

@myhloli myhloli commented Jan 23, 2025

Thanks for your contribution; we appreciate it a lot. Following the instructions below will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

Update pdf_parse_union_core_v2.py
- Set PyMuPDF version to <= 1.24.14 in all requirements files
- Prevent potential compatibility issues with future versions
build(deps): add upper version limit for PyMuPDF
- Merge title blocks that are close to each other horizontally
- Adjust line insertion logic for title blocks
- Increase image size and decrease confidence threshold for layout detection
- Update DocLayoutYOLO model weights
- Refactor drawing of bounding boxes for different block types
feat(layout): improve title block handling and layout detection
- Add average line height calculation for title blocks
- Include page number in title dictionary
- Improve title optimization prompt for better hierarchy
- Implement retry mechanism for JSON decoding errors
- Add error logging for title count mismatch
feat(post_proc): enhance title block processing with average line height
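The average line height mentioned above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `average_line_height` and the block layout (a list of lines, each with a `bbox` of `[x0, y0, x1, y1]`) are assumptions made for the example.

```python
from statistics import mean


def average_line_height(lines):
    """Return the mean height of the given lines, or 0.0 if there are none.

    Assumption: each line is a dict with a 'bbox' of [x0, y0, x1, y1].
    """
    heights = [line["bbox"][3] - line["bbox"][1] for line in lines]
    return mean(heights) if heights else 0.0


# Hypothetical title block with two lines of height 24 and 22.
title_block = {
    "lines": [
        {"bbox": [50, 100, 400, 124]},
        {"bbox": [50, 130, 380, 152]},
    ],
}
print(average_line_height(title_block["lines"]))
```

A per-block average like this gives a font-size proxy that can be compared across titles, which is presumably why it is passed along with each title.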
- Modified the IOU threshold in ocr_span_list_modify.py from 0.9 to 0.35
- This change aims to improve the detection of overlapping characters in OCR processed PDFs
refactor(pre_proc): adjust IOU threshold for character overlap detection
- Clarify the expected format for the optimized title list JSON output
- Emphasize the need to return only the title levels in the specified format
docs(magic_pdf): update llm_aided.py prompt for title list optimization
- Add `remove_invalid_surrogates` function to filter out invalid UTF-16 surrogate pairs
- Integrate the new function into the `detect_lang` workflow
- Include a test case with UTF-16 surrogates to verify the fix
fix(language): remove invalid UTF-16 surrogate pairs from input text
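A function with the name used in this commit could look roughly like the sketch below. The body is an assumption based on the commit description: it drops every code point in the UTF-16 surrogate range (U+D800 to U+DFFF), which in a Python `str` can only appear as unpaired, invalid surrogates.

```python
def remove_invalid_surrogates(text: str) -> str:
    """Drop every code point in the surrogate range U+D800..U+DFFF."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


dirty = "MinerU\ud83d parses PDFs"  # contains a lone high surrogate
clean = remove_invalid_surrogates(dirty)
clean.encode("utf-8")  # encoding `dirty` instead would raise UnicodeEncodeError
print(clean)
```

Characters outside the Basic Multilingual Plane are stored as single code points in Python strings, so valid emoji and similar characters pass through unchanged.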
- Remove doclayout_yolo==0.0.2b1 and doclayout-yolo==0.0.2
- Add doclayout-yolo==0.0.2b1 to all requirements files
build(docker): update doclayout-yolo dependency
- Add support for NPU (Neural Processing Unit) when available
- Implement batch analysis for GPU and NPU devices
- Optimize memory usage and improve performance
- Update logging and error handling
feat(model): improve batch analysis logic and support npu
- Rename and update merge_title_blocks function
- Implement merge_two_bbox helper function
- Refactor merging logic to preserve original block structure
- Update function calls and integrate with existing pipeline
refactor(magic_pdf): improve title block merging logic
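As an illustration of what a helper like `merge_two_bbox` typically does, here is a minimal sketch. The signature and the `[x0, y0, x1, y1]` box layout are assumptions; the helper in the PR may take whole blocks rather than bare boxes.

```python
def merge_two_bbox(bbox1, bbox2):
    """Return the smallest [x0, y0, x1, y1] box that encloses both inputs."""
    return [
        min(bbox1[0], bbox2[0]),
        min(bbox1[1], bbox2[1]),
        max(bbox1[2], bbox2[2]),
        max(bbox1[3], bbox2[3]),
    ]


# Two hypothetical title fragments sitting side by side on the same row.
left = [50, 100, 200, 130]
right = [210, 102, 420, 128]
merged = merge_two_bbox(left, right)
print(merged)  # [50, 100, 420, 130]
```

Returning a new list instead of mutating either input keeps the original boxes available, which fits the stated goal of preserving the original block structure.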
- Adjust end_page_id calculation to prevent IndexError when accessing pages
- Enhance error handling in LLM post-processing by specifically catching JSONDecodeError
fix(magic_pdf): correct end page index and improve error handling
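The retry-on-`JSONDecodeError` idea from these commits can be sketched as below. The helper name `parse_json_with_retry`, the `request_fn` callback, and the retry count are hypothetical; only `json.loads` and `json.JSONDecodeError` are real standard-library names.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_json_with_retry(request_fn, max_retries=3):
    """Call request_fn() until its reply parses as JSON or retries run out.

    Returns the parsed object, or None if every attempt produced invalid JSON.
    """
    for attempt in range(1, max_retries + 1):
        raw = request_fn()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            # Catch only decoding errors so unrelated failures still surface.
            logger.warning("attempt %d/%d: invalid JSON (%s)",
                           attempt, max_retries, exc)
    return None


# Stand-in for an LLM call: one truncated reply, then a valid one.
replies = iter(['{"0": 1, "1": 2', '{"0": 1, "1": 2}'])
result = parse_json_with_retry(lambda: next(replies))
print(result)
```

Catching `json.JSONDecodeError` specifically, rather than a bare `Exception`, means network or programming errors are not silently retried.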
- Update OpenDataLab badge to new design
docs(README): update demo badges
- Update RapidTable dependency to version 1.0.3
- Add support for sub-models in RapidTable
- Update magic-pdf configuration to include table sub-model
- Modify table model initialization to support sub-models
- Update table prediction logic to handle new output format
- Update model path from 'unimernet_small' to 'unimernet_small_2501' in multiple scripts and configuration files
- This change affects download_models.py, download_models_hf.py, and model_configs.yaml
fix(models): update unimernet_small model path
- Reduce YOLO_LAYOUT_BASE_BATCH_SIZE from 4 to 1
- Simplify batch ratio calculation for formula detection
- Remove unused conditional logic in batch ratio determination
- Update GPU memory check and batch ratio calculation logic
- Add support for virtual VRAM size environment variable
- Improve logging for GPU memory and batch ratio
…e VRAM allocation logic to use 'VIRTUAL_VRAM_SIZE' environment variable

- Reduce MFR (Math Formula Recognition) batch size from 64 to 32
perf(magic_pdf): optimize batch ratio calculation for GPU
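One way an override such as `VIRTUAL_VRAM_SIZE` can be read is sketched below. The helper name `effective_vram_gb`, the assumption that the variable holds an integer number of gigabytes, and the fallback behaviour for malformed values are all illustrative choices, not details confirmed by the commit messages.

```python
import os


def effective_vram_gb(detected_gb, env=os.environ):
    """Return the VRAM size in GB, preferring VIRTUAL_VRAM_SIZE when it is set.

    Assumption: the variable holds an integer number of gigabytes.
    """
    override = env.get("VIRTUAL_VRAM_SIZE")
    if override is None:
        return detected_gb
    try:
        return int(override)
    except ValueError:
        # Ignore a malformed override and fall back to the detected value.
        return detected_gb


print(effective_vram_gb(24, {"VIRTUAL_VRAM_SIZE": "8"}))  # 8
print(effective_vram_gb(24, {}))                          # 24
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.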
- Reduce batch_ratio by 1 for better performance and stability
- This change ensures more consistent memory usage when processing documents
perf(magic_pdf): adjust batch ratio calculation for GPU memory
- Improve batch ratio calculation based on GPU memory
- Enhance performance for devices with 8GB or more VRAM
- Update conditions for batch ratio assignment:
  - 8 <= gpu_memory < 10: batch_ratio = 2
  - 10 <= gpu_memory <= 12: batch_ratio = 4
- This fix ensures proper batch ratio selection for GPU memory sizes
perf(magic_pdf): optimize batch processing for GPU
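The two conditions listed in this commit can be written as a small lookup. This sketch covers only the ranges the commit message states; the function name is hypothetical, and returning `None` for any other memory size is a placeholder, since the source does not say what ratio applies there.

```python
def batch_ratio_for(gpu_memory):
    """Map GPU memory in GB to a batch ratio for the two documented ranges."""
    if 8 <= gpu_memory < 10:
        return 2
    if 10 <= gpu_memory <= 12:
        return 4
    # Outside the ranges listed in the commit message: not specified there.
    return None


for gb in (8, 10, 12):
    print(gb, batch_ratio_for(gb))
```

Note that the first range excludes 10 and the second includes both 10 and 12, so every value from 8 to 12 falls into exactly one branch.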
- Restore commented code for filtering out characters with invalid bounding boxes
- This change may affect the filtering of unnecessary characters in PDF parsing
- Add a check to return 0 when either bbox1_area or bbox2_area is zero
- This prevents division by zero errors when calculating IoU
refactor(pdf_parse): uncomment char bbox validation logic
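The zero-area check described above can be illustrated with a small IoU function. The variable names `bbox1_area` and `bbox2_area` follow the commit message; the function name `safe_iou` and the `[x0, y0, x1, y1]` box layout are assumptions for this sketch.

```python
def safe_iou(bbox1, bbox2):
    """Intersection over union of two [x0, y0, x1, y1] boxes.

    Returns 0 when either box has zero area.
    """
    bbox1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    bbox2_area = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
    # Guard from the commit: if both areas were zero, the union used as the
    # divisor below would be zero as well.
    if bbox1_area == 0 or bbox2_area == 0:
        return 0

    inter_w = min(bbox1[2], bbox2[2]) - max(bbox1[0], bbox2[0])
    inter_h = min(bbox1[3], bbox2[3]) - max(bbox1[1], bbox2[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0  # the boxes do not overlap
    intersection = inter_w * inter_h
    return intersection / (bbox1_area + bbox2_area - intersection)


print(safe_iou([0, 0, 10, 10], [5, 0, 15, 10]))  # intersection 50, union 150
print(safe_iou([3, 3, 3, 3], [3, 3, 3, 3]))      # degenerate boxes give 0
```

Without the guard, the second call would divide by a zero union and raise `ZeroDivisionError`.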
- Add timing measurement for formula, text, and title optimization using LLM
- Log the execution time for each LLM aided process
docs(readme): update readme for 1.1.0
docs(url): update Miners links in header
docs(url): update Miners links in header
- Add sub_model configuration option for rapid_table model
- Provide two sub_model options: slanet_plus and unitable
feat(table-config): add sub_model configuration for rapid_table
feat(table-config): add sub_model configuration for rapid_table
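A configuration reader for the new option might validate the value as sketched here. The two allowed names come from the commit message; the helper `resolve_sub_model`, the dictionary layout, and the choice of `slanet_plus` as the default are assumptions for illustration.

```python
ALLOWED_SUB_MODELS = ("slanet_plus", "unitable")


def resolve_sub_model(table_config, default="slanet_plus"):
    """Return the configured rapid_table sub-model, validating its name.

    Assumption: the option lives under the key 'sub_model' and falls back
    to `default` when it is missing.
    """
    sub_model = table_config.get("sub_model", default)
    if sub_model not in ALLOWED_SUB_MODELS:
        raise ValueError(
            f"unknown sub_model {sub_model!r}; expected one of {ALLOWED_SUB_MODELS}"
        )
    return sub_model


# Hypothetical table configuration selecting the second option.
table_config = {"model": "rapid_table", "sub_model": "unitable"}
print(resolve_sub_model(table_config))
```

Failing early on an unknown name gives a clearer error than letting the table model initialization fail later.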
…ilities: upgrade to latest doclayout_yolo(2501) and unimernet(2501) models

- Improve performance: optimize resource usage and processing pipeline for faster parsing on high-end devices
- Enhance parsing effects: add new heading classification feature to online demo
- Refactor changelog structure for better readability and organization
docs(readme): update changelog for v1.1.0 release
…ability

- Update online demo links in both English and Chinese README files
docs(README): update online demo links and enhance documentation readability
docs(README): update online demo links
@myhloli myhloli merged commit 19f72c2 into master Jan 23, 2025
1 of 2 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 23, 2025