Extra entries with full text in plain_text list #96

Open
malicialab opened this issue Apr 6, 2021 · 1 comment

malicialab commented Apr 6, 2021

Some HTML files produce extra entries in the plain_text key of the JSON output that contain the full text, in addition to the entries holding the text of each paragraph; i.e., the same paragraph appears both as its own entry and as part of these extra entries.
This behavior only manifests when using Readability.js; it does not happen with the Python-based parser.

I am attaching one HTML file that shows this behavior:

readabilipy -V
0.2.0

readabilipy -i ef94fca40c96ebf85c2217855fe6382364b75da0d8029be5ee395f607886bd9e.html -o tmp.json

The first entry in the plain_text field of tmp.json contains the full text; the other entries contain the per-paragraph text.

readabilipy -i ef94fca40c96ebf85c2217855fe6382364b75da0d8029be5ee395f607886bd9e.html -o tmp2.json -p

tmp2.json does not have the extra entry in the plain_text field
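
To see exactly which entries differ between the two runs, the two output files can be compared. This is only a minimal sketch: it assumes each JSON file has a top-level plain_text list whose items carry a "text" key (as the workaround later in this thread also assumes), and the helper names plain_texts, extra_entries and load_article are made up for illustration, not part of ReadabiliPy.

```python
import json


def plain_texts(article):
    """Return the text of every plain_text entry in a parsed article dict."""
    # Assumption: entries look like {"text": "..."}; a missing or null
    # plain_text key is treated as an empty list.
    return [entry["text"] for entry in article.get("plain_text") or []]


def extra_entries(with_readability, with_python_parser):
    """Texts in the first article's plain_text that the second one lacks."""
    baseline = set(plain_texts(with_python_parser))
    return [text for text in plain_texts(with_readability) if text not in baseline]


def load_article(path):
    """Load a JSON file written by the readabilipy command line tool."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For the files above this would be called as extra_entries(load_article("tmp.json"), load_article("tmp2.json")). The two parsers may also extract slightly different paragraphs, so the result can include entries other than the full-text one.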

I am wondering if this would disappear by using the latest Readability.js instead of the embedded version. Any chance that pull request #95 is going to be incorporated soon? It would be great to avoid reporting issues already fixed in the latest Readability.js.

Thanks!

ef94fca40c96ebf85c2217855fe6382364b75da0d8029be5ee395f607886bd9e.html.gz

kinow added a commit to kinow/wandering-inn-tts that referenced this issue Apr 21, 2023

tpai commented May 1, 2023

I found an interesting situation. The output differs between macOS and Linux:

  • When I used macOS with Readabilipy 0.2.0 and Node.js 18, it worked without extra entries.
  • When I used Docker to wrap the Python application with Linux, Readabilipy 0.2.0 and Node.js 18, the extra entries appeared.

This library also no longer seems to be maintained, so I worked around the bug by deduplicating the entries with a dictionary.

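import readabilipy
# req: an HTTP response object whose .text holds the page HTML
article = readabilipy.simple_json_from_html_string(req.text, use_readability=True)
text_array = [obj['text'] for obj in article['plain_text']]
# dict.fromkeys removes exact duplicate strings while preserving order
article_content = list(dict.fromkeys(text_array))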
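
The dictionary trick only removes entries that are exact duplicates. If the extra entry is instead one long string containing several paragraphs, as described in the original report, a substring check might catch it. The following is a rough sketch under that assumption; drop_aggregate_entries and its min_parts parameter are hypothetical names, not something ReadabiliPy provides.

```python
def drop_aggregate_entries(texts, min_parts=2):
    """Drop exact duplicates, then drop any text that contains at least
    min_parts of the other remaining texts as substrings."""
    unique = list(dict.fromkeys(texts))  # order-preserving de-duplication
    kept = []
    for candidate in unique:
        # Count how many other non-empty entries appear inside this one.
        parts = sum(
            1
            for other in unique
            if other and other != candidate and other in candidate
        )
        if parts < min_parts:
            kept.append(candidate)
    return kept
```

It could be applied to the text_array list from the snippet above. Note that a genuine paragraph that happens to quote two or more shorter paragraphs in full would be dropped as well, so the threshold may need tuning for a given corpus.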

Hope this can help people who see this comment.
