Future Road Map #154

Open
tataganesh opened this issue Jun 16, 2018 · 30 comments

Comments

@tataganesh
Member

tataganesh commented Jun 16, 2018

Hi everyone!

Since the inception of pdfminer.six, a lot of improvements have been made, and several issues have been fixed. I would first like to thank every contributor for having kept this project alive! We are all well aware of the difficulties of parsing PDF documents, and I am sure pdfminer.six has made it easier for developers to extract information from PDFs.

But, there are more issues cropping up, and a lot of PRs are pending as well. Documentation, too, is pending. With the increase in these incomplete tasks, it is time that we decide on how to take this project forward. There needs to be a road map created for the future development of this project. It is not necessary to have people completely dedicated to it, but we at least need to create some specific targets / goals, so that the project becomes more concrete and we can ensure its stability.

For starters, I have reached out to an audience through the dev.to platform so that more people can become aware of this wonderful project and start contributing.

I myself am new to open source, and I have been the admin of this project for some time. Sadly, I haven't been able to give it much time, but I am sure that if we can make a good plan for the future of this project, the quality of the project can be improved. I would LOVE to hear your thoughts on this!
Update - I have created a Gitter chatroom for discussions regarding the project.

@pietermarsman
Member

pietermarsman commented Oct 13, 2019

Things we could focus on

This is just a list of major things we can do.

  • Fix bugs and add tests

    Most open issues are about bugs in pdfminer (59). Most of them are small, specific and disjoint. Fixing them is also a great opportunity to extend the test suite and consequently make it easier to add new features. The goal is zero known bugs!

  • Improve documentation for users

    Some of the issues are about adding documentation (5) or are actually questions on how to use pdfminer (17). StackOverflow also has many questions on how to use pdfminer. I know from my own experience that using pdfminer is confusing at first. Creating a place with up-to-date documentation for the most basic usage scenarios will help enormously. The goal is to make it pleasant for developers who are new to pdfminer to do basic things with it!

  • Define stable high-level entry points (i.e. an API)

    Currently, pdfminer does not have a stable and easy-to-use API. Most code examples use five or more classes from pdfminer to do something as simple as extracting text from a PDF. I think that having a set of well-defined functions/classes that can handle most basic tasks makes using pdfminer easier and also more consistent (through time, and across different people). The goal is to create an API that is stable so that we can change the internals with no (or fewer) people noticing it.

  • Add new features

    Some of the issues are about "enhancing" pdfminer (20), e.g. adding new features or changing default values. I think pdfminer already has many features and that most feature requests can wait a while.

  • Improve code documentation

    Almost all code is undocumented and that makes it harder to contribute, especially for newcomers. I think the best way to improve this is to consistently add/improve code documentation for all code that is touched by any PR. The goal is to easily understand the responsibility of each part of pdfminer.

  • Drop Python2 support

    Python2 is no longer supported by the Python development team as of January 2020. We should also drop Python2 support at that very moment (Drop python 2 support #194).

  • Introduce best open-source practices

    Like using semver (one day, user semver? #255), keeping the changelog up to date, using Git Large File Storage (Start using GIT LFS for binaries #114), automating the release process using Travis, and adding code-style enforcement (PEP-8 #92).

Where to start

I think we should focus first on fixing bugs and adding tests. The number of reported bugs is too large, some of them are really old, and the test coverage is too low.

I think we should focus less on major changes until we have fixed most of the bugs. This includes adding new features, dropping python 2 support, improving documentation, creating a stable API, etc.

Since Python2 is no longer supported as of January 2020, I think we should postpone the major changes until then. In 2019 we can do one or more releases that don't contain any breaking changes and use the old versioning system. These releases will be useful to everybody. With most of the bugs fixed, we can start making some major breaking changes from the beginning of 2020, e.g. drop Python2 support, change the versioning system and deprecate/remove some of the old APIs.

@pietermarsman
Member

This week I will work on README.md and read-the-docs documentation.

@Recursing
Contributor

Since python 2 is dead, is there a way to merge this project with pdfminer and pdfminer3, to prevent duplicate work?

@pietermarsman
Member

A quote from @igormp:

we've discussed it before, and sadly it doesn't seem like it's possible. pdfminer3 seems to be abandoned, and euske seems to have no interest in merging both projects.

@pietermarsman
Member

pietermarsman commented Jan 21, 2020

Status update:

  • Fix bugs and add tests

A lot of bugs were fixed. The CHANGELOG.md lists 15 fixes since 2019-10-20. For most of those bugs, tests were added. There are still 27 issues that are labeled as bugs (compared to 59 earlier).

  • Improve documentation for users

We have a readthedocs now, but it does not contain a lot of examples. We should probably add more.

  • Define stable high-level entry points (i.e. an API)

pdf2txt.py and dumppdf.py have always been stable. Two functions were added to the high-level API: extract_text() and extract_pages(). These high-level API functions serve many needs but could be more widely used (instead of the composable API).

  • Add new features

Since October 2019, 3 new features were added according to the CHANGELOG.md. There are 11 issues that are labelled as enhancement. Still some work to do here...

  • Improve code documentation

The command-line utilities are better documented and the high-level functions got better documentation, but there is still a lot of work to do on all the classes and functions.

  • Drop Python2 support

Done!

  • Introduce best open-source practices

We are not going to use semver (because PyPI would get untenably confused). There is also no progress on Git LFS or on automatic releasing using Travis. On the bright side, the CHANGELOG.md is always up to date and code-style enforcement is in use.

What to do next

I still think we should focus first on fixing bugs and adding tests. The number of reported bugs is too large, and some of them are really old.

But I also noticed in the last months that some reported bugs are not actually bugs but rather questions. These issues, and also e.g. questions on stackoverflow, indicate that it is difficult to use pdfminer(.six). So improving the documentation is key.

I think we should focus less on adding new features until we have fixed most of the bugs.

@KunalGehlot
Contributor

A small addition to what @pietermarsman already mentioned

I think documenting the code and improving read-the-docs is one of the most important things.
After using this library for more than a year, I noticed most people are using the library (including myself) by Frankenstein-ing the code from read-the-docs tutorials and StackOverflow answers without completely understanding what each class/method is doing.

We should start by explaining all the methods and classes used in the Tutorial demos. That will solve many problems people face by relating the errors and their understanding of the library.

Being new to open source and this code base, I'll start contributing by helping with the issues and trying to improve the documentation.

@julie777

I myself am new to open source, and I have been the admin of this project for some time. Sadly, I haven't been able to give it much time, but I am sure that if we can make a good plan for the future of this project, the quality of the project can be improved. I would LOVE to hear your thoughts on this! Update - I have created a Gitter chatroom for discussions regarding the project.

I would much prefer using github discussions instead of a chatroom. That way the discussions are part of the project. The wiki could also be used to capture the results of the discussion and the resulting roadmap.

@igormp
Contributor

igormp commented Sep 19, 2022

I would much prefer using github discussions instead of a chatroom. That way the discussions are part of the project. The wiki could also be used to capture the results of the discussion and the resulting roadmap.

IIRC, the discussions feature wasn't a thing back then. Seeing how the gitter isn't that active, I guess that would be a good idea in order to properly organize any discussion into threads without needing to search through all of the chatroom history.

@vilabho

vilabho commented Feb 28, 2023

I would like to update the current documentation of pdfminer, but whom should I tag for PR approval? It seems this repo has been dormant for months... If anybody is maintaining it, please mention them.

@pietermarsman
Member

Hi @vilabho,

I'm the dormant maintainer with merge permissions. I've been meaning to do some work over the last months / year but haven't gotten to it. Help with proper documentation is very much appreciated.

@dhdaines
Contributor

IIRC, the discussions feature wasn't a thing back then. Seeing how the gitter isn't that active, I guess that would be a good idea in order to properly organize any discussion into threads without needing to search through all of the chatroom history.

The chatroom doesn't actually seem to exist anymore! So searching its history is no longer an option :(

But more on topic ... I have been submitting PRs to pdfplumber to properly support tagged PDFs that should really be features in pdfminer.six, e.g. jsvine/pdfplumber#961 and jsvine/pdfplumber#963. The reason I haven't done this is that it doesn't appear that pdfminer.six will be maintained at this point, so it doesn't seem worthwhile to put in the extra effort to create a fork/PR that can't actually be depended upon in the foreseeable future.

Is there any possibility that bugfixes, optimizations, and documentation enhancements will be merged at any point soon, let alone new features?

@pietermarsman
Member

Long story short: we are looking for new maintainers

@dhdaines I am sorry that I was not more active in the last years. Unfortunately, I cannot be as active as I was when I started as a maintainer of pdfminer.six. The current situation is much like when I took over from @goulu.

For the future of pdfminer.six it would be very beneficial if we had a maintainer again with time and energy to guide this project. I'm tagging all potential candidates below. But feel free to respond here as well if you are not in the list.

Right now there are 4 owners:

There are also 5 other members of the pdfminer.six organization:

There are also some people that contributed more than once (all 3 commits or more):

(Have not thought of a procedure for picking a new maintainer yet).

@sergei-maertens
Contributor

I'm sorry, but I don't use the project anymore nor do I have time to step in. Good luck finding a candidate though!

@dhdaines
Contributor

Long story short: we are looking for new maintainers

@dhdaines I am sorry that I was not more active in the last years. Unfortunately, I cannot be as active as I was when I started as a maintainer of pdfminer.six. The current situation is much like when I took over from @goulu.

Thank you for the quick reply... as the maintainer of a rather old project I totally understand!

I think the underlying question is whether the project is still relevant enough and used enough to be maintained - I have to admit that I only actually use it via pdfplumber which has a somewhat more Pythonic API while still giving low-level access to the PDF structure.

Since there are a variety of other options for high-level manipulation and text extraction (if you only want text...) from PDFs, I wonder if it would make sense to simply merge the two projects.

@NickFabry

NickFabry commented Oct 20, 2023

FWIW, I use pdfminer every day; it's still been the only PDF library I've encountered which attempts to account accurately for whitespace and actual page position of text elements in a consistent way. Sometimes the structured data you need is buried in a PDF, and you don't have an alternative source...

I'd love to keep it going, but I don't know if I have the skills to maintain it. A long time ago (15+ years?) I worked a little with @euske on improving PDFMiner, so I'm quite fond of it. I'd put up my hand if nobody else would.

@pettzilla1
Contributor

Hi, happy to say it's still incredibly relevant. pdfminer.six is incredibly useful for PDF parsing, with a permissive license that most other libraries don't have. We still use it daily.

@dhdaines
Contributor

FWIW, I use pdfminer every day; it's still been the only PDF library I've encountered which attempts to account accurately for whitespace and actual page position of text elements in a consistent way. Sometimes the structured data you need is buried in a PDF, and you don't have an alternative source...

Yes, exactly - from what I've seen most PDF libraries work hard to hide the hideous, horrible complexity of the PDF format from you, which is fine if you just want to dump a load of text into a large language model, not so great if you want to use layout information. This, plus pure-Python and permissive license, make pdfminer (and by extension pdfplumber) relevant in my opinion.

I would be willing to help out with maintenance as well. I could also definitely contribute some improvements to documentation and performance.

@WolfgangFahl

The CI is currently broken. This might be the first area of improvement, to make sure commits can be made without breaking the current state of affairs; see https://github.com/pdfminer/pdfminer.six/actions/runs/6793184760. If you invite me, I might try some fixes that get the CI working again, e.g. simple things such as code formatting.

@FriedrichFroebel

Getting the CI working again should be something which can be ensured on a fork and then submitted as a PR. If there really is some ongoing activity on this repository, merging the CI fixes first from the maintainer side is still possible - no need to directly grant you write permissions.

@dhdaines
Contributor

Getting the CI working again should be something which can be ensured on a fork and then submitted as a PR. If there really is some ongoing activity on this repository, merging the CI fixes first from the maintainer side is still possible - no need to directly grant you write permissions.

Looks like it's mainly a case of looking for obsolete Python versions on Ubuntu latest. I'll take a look right now on my fork.

@dhdaines
Contributor

Well, it's a bit more than just Python versions, because there's an unversioned dependency on black among other things. Working on this now here: #921

@dhdaines
Contributor

And now CI passes: https://github.com/pdfminer/pdfminer.six/actions/runs/6883805593?pr=921

I did this with the minimal amount of code changes, but there are things that will need to be fixed so we can actually use the latest pip and setuptools for instance. They should go in a separate PR.

Now the $921 question! Can someone merge this? @pietermarsman ?

@dhdaines
Contributor

A secondary PR to also fix building with current pip/setuptools (in Python 3.12): #923

@pietermarsman
Member

Thanks @dhdaines for the work! I merged #921 and will try to look at #923 in the coming days.

@pietermarsman
Member

@WolfgangFahl I'm positive about new contributors, but hesitant about handing out permissions quickly. I see you have not yet contributed to issues or PRs; that would be a great start for any contributor. From your profile it looks like you are an active coder, and we could definitely benefit from your knowledge when triaging issues and PRs.

For PRs I prefer to have at least one review, and not to commit directly to master.

@WolfgangFahl

@pietermarsman thanks for looking into my offer again. It was an if sentence and the decision was not to invite me. I accepted that decision.

@suryavaddiraju

A secondary PR to also fix building with current pip/setuptools (in Python 3.12): #923

Yes, I can build the setuptools integration. Also, with the new PyPI trusted publishers, it's now very easy to build Python packages with GitHub workflow automation. But for a change, we could eliminate setuptools and adopt the new standard Python packaging procedures using hatchling. From now on I will lend a hand and support this package, and make sure it sets a new standard for Python PDF users.

@madhubandru

Hi, I want to check whether pdfminer.six uses any external API under the hood for processing PDFs, or whether everything happens in local code. Please comment if someone knows. I appreciate any help you can provide.

@FriedrichFroebel

As you can see from having a look at the dependencies and the code, pdfminer.six implements its own PDF parser and API. No remote tools are required to run it, although it has some dependencies of course.
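
One way to look at the dependencies yourself is to ask the installed package metadata, using only the standard library. This is a small sketch; the helper name `declared_dependencies` is an assumption for illustration, and the output depends on the installed version.

```python
from importlib.metadata import PackageNotFoundError, requires


def declared_dependencies(distribution="pdfminer.six"):
    """Return the requirement strings a distribution declares.

    Returns None when the distribution is not installed, and an empty
    list when it declares no requirements. Hypothetical helper name.
    """
    try:
        return requires(distribution) or []
    except PackageNotFoundError:
        return None


# Usage:
#   for requirement in declared_dependencies() or []:
#       print(requirement)
```

This only lists what the package declares it needs; confirming that nothing leaves your environment still means reading the code or monitoring network traffic.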

@madhubandru

Thank you @FriedrichFroebel for your quick response. I am thinking in terms of production: I do not want my PDFs or PDF data to go anywhere for processing, and I expect everything to happen within my environment.
