How to allow extracting YouTube videos or <iframe> tags? #93

Open
cayolblake opened this issue Feb 18, 2021 · 5 comments
Labels
future Needs revisiting in the future

Comments

@cayolblake

Hello,

Is there a way to allow extracting YouTube videos and iframe tags, similar to how image extraction is done?

@martintoreilly
Member

Hi @cayolblake - welcome 👋. Are you using the Python library (which uses our HTML simplification code) or calling from the command line with default options (which uses Mozilla's Readability.js package to simplify the HTML)? Thanks.

@cayolblake
Author

Hi @martintoreilly

I'm using the Python library :)

@martintoreilly
Member

martintoreilly commented Feb 19, 2021

@cayolblake I'm afraid that, if this isn't fixed by updating to the latest version of Mozilla's Readability.js, then we won't have the bandwidth to look into it anytime soon. Sensibly extracting images and videos using our Python-based HTML simplifier is something we've talked about supporting before, but until we're next working on a project that parses web articles, we'll struggle to carve out time to work on this further.

I think for your use case, adding support in our Python HTML simplifier won't be enough: we're not currently as good as Readability.js at stripping out non-content elements, so I think it would not be suitable for you even if its output did contain image and video tags.

I'm tagging this with a future tag rather than closing it to keep it visible for when we're next working on this.

martintoreilly added the "future" (Needs revisiting in the future) label on Feb 19, 2021
@martintoreilly
Member

Linked to issue #31, which considers iframe handling more generally.

@cayolblake
Author

Hi @martintoreilly

That's perfectly understood. I'm planning to take a dive into your project and understand how it works - any docs that help explain or simplify things further would be appreciated - hopefully after doing so I'll be able to find the best candidate for applying a modification, if possible.

I think Readability.js's main strength is that it gets battle-tested daily by Firefox users everywhere, which gives it the chance to improve its heuristic algorithms as it goes.

Have you thought about splitting your own simplifier and the Readability wrapper into two different projects? I guess that could encourage a healthier focus on your own simplifier while still having something that works dependably on its own, and maybe using it as a reference or a benchmark? Just a humble thought 🤔
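
As a stopgap for the original question, one option is to collect the iframe `src` values in a separate pass over the raw HTML, before any simplification runs. Below is a minimal sketch using only Python's standard-library `html.parser`. It is not part of this project's API: the class name `IframeCollector`, the helper `extract_youtube_embeds`, and the set of hosts treated as YouTube are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hosts treated as YouTube embeds (an assumption; extend as needed).
YOUTUBE_HOSTS = {
    "youtube.com",
    "www.youtube.com",
    "www.youtube-nocookie.com",
    "youtu.be",
}


class IframeCollector(HTMLParser):
    """Collects the src attribute of every <iframe> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names already lowercased.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_youtube_embeds(html):
    """Return iframe src URLs whose host is in YOUTUBE_HOSTS, in document order."""
    collector = IframeCollector()
    collector.feed(html)
    collector.close()
    embeds = []
    for src in collector.sources:
        # hostname is lowercased by urlparse and is None for relative URLs.
        host = urlparse(src).hostname
        if host in YOUTUBE_HOSTS:
            embeds.append(src)
    return embeds
```

Running this on the original HTML and keeping the result alongside the simplified article avoids changing the simplifier itself, at the cost of losing each video's position within the article text.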
