How to extract parts of a page while using layout analysis #1064

s-natsubori · 2024-11-22T08:46:11Z

Is there a way to partially extract text while maintaining the layout in HTMLConverter?

Text extraction with HTMLConverter gives very good results, including text grouping.
However, I want to extract only specific parts of the PDF (such as the upper half of a page).
The output from HTMLConverter no longer carries the positional information of the text.
Extracting elements with extract_pages gives detailed positional information, but the text is not grouped: every element comes back as an individual LTChar.

Is there a solution for such cases?
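
To make the goal concrete, here is a minimal sketch of the kind of region filter I have in mind. It assumes that extract_pages, when given an explicit LAParams, yields grouped LTTextContainer objects for the PDF in question (which is not what I am seeing), and that element bounding boxes use PDF coordinates with the origin at the bottom-left. The helper names `inside`, `upper_half`, and `text_in_upper_half` are hypothetical and not part of pdfminer.six.

```python
from typing import Iterator, Tuple

# (x0, y0, x1, y1) in PDF coordinates; the origin is assumed to be bottom-left.
BBox = Tuple[float, float, float, float]


def inside(inner: BBox, outer: BBox) -> bool:
    """Return True if `inner` lies entirely within `outer`."""
    return (
        inner[0] >= outer[0]
        and inner[1] >= outer[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def upper_half(page_bbox: BBox) -> BBox:
    """Return the bounding box covering the top half of a page."""
    x0, y0, x1, y1 = page_bbox
    return (x0, (y0 + y1) / 2, x1, y1)


def text_in_upper_half(pdf_path: str) -> Iterator[str]:
    """Yield the text of each grouped text element in the upper half of every page."""
    # Imported here so the geometry helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    for page in extract_pages(pdf_path, laparams=LAParams()):
        region = upper_half(page.bbox)
        for element in page:
            # Keep only grouped text elements that sit fully inside the region.
            if isinstance(element, LTTextContainer) and inside(element.bbox, region):
                yield element.get_text()
```

Usage would look like `for block in text_in_upper_half("sample.pdf"): print(block)`, where `sample.pdf` is a placeholder path. Requiring full containment drops any text box that straddles the midline; an overlap test would keep those instead.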
