Support for flexible focus language crawling framework #139

Open
thammegowda opened this issue Nov 15, 2017 · 3 comments
@thammegowda
Member

The first task is defining and expressing the focus crawling specification.
The second subtask will be implementing that specification in Sparkler.

Currently, we have support for URL-based focus/filters.
This has to be extended with content-based focus.
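
To make the URL-based vs. content-based distinction concrete, here is a minimal sketch. The names `UrlFilter`, `ContentFilter`, `make_url_filter`, and `make_keyword_filter` are hypothetical and are not Sparkler's actual API; the keyword match merely stands in for whatever content analysis ends up being used.

```python
import re
from typing import Callable, List

# Hypothetical filter types for illustration only; not Sparkler's actual API.
UrlFilter = Callable[[str], bool]           # decides from the URL alone
ContentFilter = Callable[[str, str], bool]  # decides from the URL plus fetched text


def make_url_filter(pattern: str) -> UrlFilter:
    """URL-based focus: accept a page if its URL matches a regex."""
    regex = re.compile(pattern)
    return lambda url: regex.search(url) is not None


def make_keyword_filter(keywords: List[str]) -> ContentFilter:
    """Content-based focus: accept a page if its text mentions any keyword."""
    lowered = [k.lower() for k in keywords]

    def accept(url: str, text: str) -> bool:
        body = text.lower()
        return any(k in body for k in lowered)

    return accept


# Example filters (the host and the keywords are made up).
url_ok = make_url_filter(r"^https?://[^/]*example\.org/")
content_ok = make_keyword_filter(["earthquake", "aftershock"])
```

The point of the sketch is only that a content-based filter needs the fetched document, so it has to run after the fetch, while a URL filter can run before it.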

Example tasks:

  1. "Crawl top news in Kannada language"
  2. "Crawl sports news in XYZ language"
  3. "Crawl cooking blogs that are in XYZ language"
  4. "Crawl poetry or song lyrics in XYZ language"
  5. "Crawl news about earthquakes in XYZ language"

Sparkler should be able to express and accept this first 'focus' requirement, which is a combination of two filters:

  1. Language filter, often for rare languages (i.e. languages that are not supported by Google Translate). There are over a few thousand of these.
  2. Domain filter, such as cooking, news, sports news, etc. Maybe a few tens or hundreds at most.
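
One possible way to express such a two-part focus is sketched below. Everything here is an assumption for illustration: `FocusSpec` and its fields are hypothetical names, the language check is only a Unicode script-range heuristic (the Kannada block is U+0C80 to U+0CFF, and a script is a proxy for a language, not real language identification), and the domain check is a plain keyword match.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FocusSpec:
    """Hypothetical focus specification: a language filter plus a domain filter."""
    script_range: Tuple[int, int]      # inclusive Unicode code point range of the script
    domain_keywords: Tuple[str, ...]   # words that signal the target domain
    min_script_ratio: float = 0.5      # share of letters that must be in the script

    def language_matches(self, text: str) -> bool:
        # Heuristic only: count alphabetic characters inside the script range.
        letters = [c for c in text if c.isalpha()]
        if not letters:
            return False
        lo, hi = self.script_range
        in_script = sum(1 for c in letters if lo <= ord(c) <= hi)
        return in_script / len(letters) >= self.min_script_ratio

    def domain_matches(self, text: str) -> bool:
        # Naive stand-in for a real domain/topic classifier.
        return any(k in text for k in self.domain_keywords)

    def accepts(self, text: str) -> bool:
        # The focus is the conjunction of the two filters.
        return self.language_matches(text) and self.domain_matches(text)


# "Crawl news in Kannada": Kannada script range plus the Kannada word for "news".
KANNADA_NEWS = FocusSpec(script_range=(0x0C80, 0x0CFF), domain_keywords=("ಸುದ್ದಿ",))
```

Keeping the two filters as separate methods means the language side (thousands of values) and the domain side (tens or hundreds) can be swapped out independently.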
@chrismattmann
Contributor

Great job, Thamme! If I may, this is "focused language crawling" as opposed to, e.g., "focused multimedia crawling" or "web page crawling". We should update the issue title to reflect that. Great job filing the issue.

@thammegowda thammegowda changed the title Support for flexible focus crawling framework Support for flexible focus language crawling framework Nov 15, 2017
@thammegowda
Member Author

thammegowda commented Nov 15, 2017

Thanks for the suggestion. The title is now updated 👍
Focused crawling is needed by everybody, but no existing crawler seems to do it right.
We (Sparkler) now have our thinking caps on for this task; we will propose a good solution for languages, multimedia, etc.

@wmburke

wmburke commented Nov 15, 2017

Yeah - this could be really cool!
