ReadabiliPy vs Readability.js #81

kjoshi · 2019-07-31T17:44:43Z

Apologies if this is a stupid question, since I've not had a proper read through the source of ReadabiliPy or Readability.js, but is the pure-Python implementation in ReadabiliPy intended to exactly reproduce the results of Readability.js?

In other words, should I get the exact same results when calling:

readabilipy.simple_json_from_html_string(html, use_readability=False)
and
readabilipy.simple_json_from_html_string(html, use_readability=True)
?

I ask because for certain articles I find that ReadabiliPy gives me extra HTML elements and text that I'm not at all interested in, for example:

> import requests
> from readabilipy import simple_json_from_html_string

> url = 'https://analytics.jiscinvolve.org/wp/2019/02/12/my-algorithmic-friend-by-andrew-cormack/'
> html = requests.get(url).text
> article = simple_json_from_html_string(html, use_readability=False)
> article['plain_text']
...
{'text': 'If you have comments on the draft Wellbeing Analytics Code of Practice, please...'}
...
{'text': 'Archives'},
 {'text': '* July 2019, * June 2019, * February 2019, * December 2018, * November 2018, ........'}
...

whereas Readability.js manages to avoid extracting all of those links in the side bar:

> article = simple_json_from_html_string(html, use_readability=True)
> article['plain_text']
...
{'text': 'If you have comments on the draft Wellbeing Analytics Code of Practice, please...'}
<end>
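
For anyone wanting to see exactly which paragraphs differ between the two modes, here is a minimal sketch. `texts_only_in_first` is a hypothetical helper (not part of ReadabiliPy), and the sample lists are stand-in data shaped like the `article['plain_text']` output shown above, so it runs without network access or Node.

```python
def texts_only_in_first(first, second):
    """Return paragraph texts that appear in `first` but not in `second`.

    Both arguments are assumed to be lists of dicts with a 'text' key,
    matching the shape of article['plain_text'] in the output above.
    """
    seen = {paragraph["text"] for paragraph in second}
    return [p["text"] for p in first if p["text"] not in seen]


# Stand-in data (hypothetical), not real extraction results
python_paragraphs = [
    {"text": "If you have comments on the draft..."},
    {"text": "Archives"},
]
js_paragraphs = [
    {"text": "If you have comments on the draft..."},
]

print(texts_only_in_first(python_paragraphs, js_paragraphs))
```

With real data, pass the `plain_text` lists from the `use_readability=False` and `use_readability=True` calls as the first and second arguments respectively.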

Is there anything I can do to get ReadabiliPy to give me results more like Readability.js? I'd like to use ReadabiliPy inside an AWS Lambda function and would like to avoid using both Node and Python (if that's even possible in a single function..?)

Thanks

(Hi @jemrobinson - small world..!)

jemrobinson · 2019-08-01T17:16:33Z

Hi @kjoshi!

No, it's not meant to be identical.

The original idea was that this would just be a Python wrapper around Readability.js, and you can still use it that way if you want to. However, we found that Readability.js sometimes gives HTML that doesn't strictly adhere to the standard (although it renders in browsers without issue). The downstream application that we're using this package for cares more about standards compliance, so we focused on that.

We are (were?) planning to work on getting them to be feature equivalent (if not completely identical) but we haven't got much budget for that at the moment.

I think that Readability.js uses some complex heuristics to decide which part of the page to pull out as the main content element, and we haven't had a chance to look into that. If you're interested in doing so, you can try diving into the JavaScript to work out what it's doing...
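
To give a flavour of the kind of heuristic involved: one signal commonly used by readability-style extractors is link density, the fraction of a block's text that sits inside `<a>` tags. The sketch below is only an illustration using the standard library. It is an assumption that this signal explains the sidebar difference above, and `LinkDensityParser` and `link_density` are hypothetical names, not part of ReadabiliPy or Readability.js.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters overall and those inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while inside at least one <a>
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the ratio
        length = len(data.strip())
        self.total_chars += length
        if self.anchor_depth > 0:
            self.link_chars += length


def link_density(html_fragment):
    """Return the share of text characters that are link text (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


# Hypothetical fragments: an archive-style sidebar and an article paragraph
sidebar = '<ul><li><a href="#">July 2019</a></li><li><a href="#">June 2019</a></li></ul>'
body = '<p>Some long text with <a href="#">one link</a> in it.</p>'

print(link_density(sidebar), link_density(body))
```

A block such as an archive list scores close to 1.0, while ordinary article text scores much lower, so a threshold on this ratio is one plausible way to drop sidebars. Any real cut-off value would need tuning against actual pages.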

PS. Whereabouts are you working these days?

kjoshi · 2019-08-14T08:16:54Z

Ok, great, thanks for confirming.

I had a quick look at the Readability.js code, but it was a bit more complicated than I assumed it would be, and I don't have enough time to go through it in detail at the moment, so I'm just going to stick with your ReadabiliPy wrapper for now.

PS. I'm currently a Data Science Developer at Jisc - still based in Manchester

jemrobinson added the "future" label (Needs revisiting in the future) on Aug 1, 2019
