Apologies if this is a stupid question, since I've not had a proper read through the source of ReadabiliPy or Readability.js, but is the pure-Python implementation of ReadabiliPy intended to exactly reproduce the results of Readability.js?
In other words, should I get the exact same results when calling:
readabilipy.simple_json_from_html_string(html, use_readability=False)
and readabilipy.simple_json_from_html_string(html, use_readability=True)
?
Because for certain articles I find that ReadabiliPy gives me extra HTML elements and text that I'm not at all interested in, for example:
> import requests
> from readabilipy import simple_json_from_html_string
> url = 'https://analytics.jiscinvolve.org/wp/2019/02/12/my-algorithmic-friend-by-andrew-cormack/'
> html = requests.get(url).text
> article = simple_json_from_html_string(html, use_readability=False)
> article['plain_text']
...
{'text': 'If you have comments on the draft Wellbeing Analytics Code of Practice, please...'}
...
{'text': 'Archives'},
{'text': '* July 2019, * June 2019, * February 2019, * December 2018, * November 2018, ........'}
...
whereas Readability.js manages to avoid extracting all of those links in the sidebar:
> article = simple_json_from_html_string(html, use_readability=True)
> article['plain_text']
...
{'text': 'If you have comments on the draft Wellbeing Analytics Code of Practice, please...'}
<end>
Is there anything I can do to get ReadabiliPy to give me results more like Readability.js? I'd like to use ReadabiliPy inside an AWS Lambda function and would like to avoid using both Node and Python (if that's even possible in a single function).
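To pin down exactly which paragraphs differ between the two modes, a small comparison helper may be useful. This is only a sketch: `extra_paragraphs` is a hypothetical helper, not part of ReadabiliPy, and it assumes each mode's `plain_text` is a list of dicts with a `'text'` key, as in the output shown above. The sample data stands in for the two real results.

```python
def extra_paragraphs(pure_python_plain_text, readability_js_plain_text):
    """Return paragraph texts present in the first list but absent from the second.

    Hypothetical helper; assumes both arguments are lists of {'text': ...} dicts.
    """
    js_texts = {paragraph["text"] for paragraph in readability_js_plain_text}
    return [
        paragraph["text"]
        for paragraph in pure_python_plain_text
        if paragraph["text"] not in js_texts
    ]


# Made-up sample data shaped like article['plain_text'] from each mode.
pure = [
    {"text": "Main article paragraph."},
    {"text": "Archives"},
    {"text": "* July 2019, * June 2019"},
]
js = [{"text": "Main article paragraph."}]

# Paragraphs the pure-Python mode kept that the Readability.js mode dropped.
print(extra_paragraphs(pure, js))
```

In practice, `pure` and `js` would be replaced with `article['plain_text']` from the `use_readability=False` and `use_readability=True` calls respectively.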
The original idea was that this would just be a Python wrapper around Readability.js, and you can still use it as that if you want to. However, we found that Readability.js sometimes produces HTML that doesn't strictly adhere to the standard (although it renders in browsers without issue). The downstream application that we're using this package for cares more about standards compliance, so we focused on that.
We are (were?) planning to work on getting them to be feature-equivalent (if not completely identical), but we haven't got much budget for that at the moment.
I think that Readability.js uses some complex heuristics to decide which part of the page to pull out as the main content element, and we haven't had a chance to look into that. If you're interested in doing so, you can try diving into the JavaScript to work out what it's doing...
I had a quick look at the Readability.js code, but it was more complicated than I assumed it would be, and I don't have enough time to go through it in detail at the moment. I'm just going to stick with your ReadabiliPy wrapper for now.
PS. I'm currently a Data Science Developer at Jisc - still based in Manchester
Thanks
(Hi @jemrobinson - small world..!)