Im newbie. #113

Open
Mohanrajkarnan opened this issue Feb 22, 2021 · 1 comment

Mohanrajkarnan commented Feb 22, 2021

I need to convert a PDF to HTML5.

I tried the code below. It extracted the text from the PDF and created an HTML file, but the output is not structured like the PDF:
- Images are missing.
- Text positioning is lost.

pdftotree.parse(pdf_file, html_path=htmlPath, favor_figures=True, model_type=None, model_path=None, visualize=False)

Please help me understand what I am missing.

Thanks
Mohan

Contributor

lukehsiao commented Feb 23, 2021

Hi Mohan,

It sounds like you're trying to get an HTML representation that focuses on visually looking like the source PDF, is that correct? If so, pdftotree most likely isn't for you. The focus here is more on structural accuracy (e.g., tables end up in HTML tables), not faithfully representing a PDF document visually. Many PDF to HTML tools have a similar focus.

If my assumption is correct, then I'd suggest trying some other tools. I think pdftohtml.org is one that emphasizes visual accuracy, but I'm sure there are others as well.
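
To make the structural-versus-visual distinction concrete, here is a minimal sketch that tallies structural tags in a generated HTML file using only the Python standard library. The helper name `count_structure` and the sample HTML are hypothetical and written by hand for illustration; the sample is not actual pdftotree output, and the real output format may differ.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags so structural elements can be tallied."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_structure(html_text):
    """Return how many <table>, <img>, and <figure> tags the HTML contains."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return {tag: parser.counts[tag] for tag in ("table", "img", "figure")}


# Hypothetical, hand-written sample; NOT actual pdftotree output.
SAMPLE_HTML = """
<html><body>
<p>Quarterly results</p>
<table><tr><td>Q1</td><td>10</td></tr></table>
</body></html>
"""

print(count_structure(SAMPLE_HTML))
```

A result with tables but no images would match the behavior described above: the structure is captured, while the page's visual layout is not reproduced.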
