How to extract parts of a page while using layout analysis #1064

Open
s-natsubori opened this issue Nov 22, 2024 · 0 comments

Is there a way to extract text from only part of a page while keeping the layout grouping that HTMLConverter produces?

Text extraction with HTMLConverter gives very good results, including text grouping.
However, I want to extract only specific parts of a PDF page (for example, the upper half).
The output from HTMLConverter no longer carries the positional information of the text, so I cannot filter by region.
Extracting elements with extract_pages does give detailed positional information, but in my case the text is not grouped and every element comes back as an LTChar.

Is there a solution for such cases?
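
To make the request concrete, here is a minimal sketch of the kind of region filtering described above. It assumes that `extract_pages` is called with an `LAParams` object so that layout analysis groups characters into `LTTextContainer` elements, and that each element's `bbox` is `(x0, y0, x1, y1)` with the origin at the bottom-left of the page. The helper names `in_region`, `upper_half`, and `text_in_upper_half` are hypothetical and not part of pdfminer.six.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in PDF coordinates; origin assumed at the bottom-left corner
BBox = Tuple[float, float, float, float]


def in_region(bbox: BBox, region: BBox) -> bool:
    """Return True if bbox lies entirely inside region."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


def upper_half(page_bbox: BBox) -> BBox:
    """Return the bounding box covering the upper half of a page."""
    x0, y0, x1, y1 = page_bbox
    return (x0, (y0 + y1) / 2, x1, y1)


def text_in_upper_half(pdf_path: str) -> List[str]:
    """Collect grouped text elements that fall inside the upper half of each page."""
    # Imported here so the helpers above work without pdfminer.six installed
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    results: List[str] = []
    for page in extract_pages(pdf_path, laparams=LAParams()):
        region = upper_half(page.bbox)
        for element in page:
            # Keep only grouped text elements fully contained in the region
            if isinstance(element, LTTextContainer) and in_region(element.bbox, region):
                results.append(element.get_text())
    return results
```

This only keeps elements that sit completely inside the region; a text box straddling the midline would be dropped, so an overlap test may be a better fit depending on the document.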
