aniruddhasd/elixir-scrape

Scrape

An Elixir package to scrape websites. This is an attempt to rewrite meteor-scrape from scratch, leveraging the expressiveness and power of Elixir. Current features:

  • handles non-UTF-8 sources.
  • handles timezones.
  • parses RSS/Atom feeds.
  • parses common websites.
  • parses advanced content websites ("articles").

Installation

Add scrape to your mixfile:

{:scrape, "~> 1.2"}

and add :scrape to your applications list in your mixfile.
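
The two steps above can be sketched as a single mixfile. This is a minimal sketch, not the project's own file: `MyApp.Mixfile`, `:my_app`, and the version string are placeholder names, and `:logger` is only there as a typical neighbour in the applications list.

```elixir
# mix.exs (sketch; MyApp.Mixfile and :my_app are placeholders)
defmodule MyApp.Mixfile do
  use Mix.Project

  def project do
    [app: :my_app,
     version: "0.1.0",
     deps: deps()]
  end

  # List :scrape here so it is started together with your application.
  def application do
    [applications: [:logger, :scrape]]
  end

  # The dependency tuple from the installation step.
  defp deps do
    [{:scrape, "~> 1.2"}]
  end
end
```

After editing the file, `mix deps.get` fetches the package.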

Usage

# Feed scraping:
Scrape.feed "http://feeds.feedburner.com/venturebeat/SZYF"

# result (list of items):
[
  %{
    description: "GUEST: For years, many have believed the startup world would be doomed by the “Series A Crunch,” the natural result of an explosion of seed funding paired with an increasingly high bar required to earn a Series A. Industry observers believed we’d be witnessing a train wreck of epic proportions as companies died off. But the […]",
    image: "http://i1.wp.com/venturebeat.com/wp-content/uploads/2015/11/seed-extensions.jpg?resize=160%2C140",
    pubdate: #<DateTime(4016-07-03T22:10:33Z)>,
    tags: [
      %{accuracy: 0.9, name: "micah rosenbloom"},
      %{accuracy: 0.9, name: "deals"},
      %{accuracy: 0.9, name: "seed funding"},
      %{accuracy: 0.9, name: "series a crunch"},
      %{accuracy: 0.9, name: "business"}
    ],
    title: "Why seed ‘extensions’ are becoming the new normal in fundraising",
    url: "http://venturebeat.com/2015/11/07/why-seed-extensions-are-becoming-the-new-normal-in-fundraising/"
  },
  %{...},
  ...
]

# Scrape a website:
Scrape.website "http://www.latimes.com"

# Result (basic metadata):
%Scrape.Website{
  description: "The LA Times is a leading source of breaking news, entertainment, sports, politics, and more for Southern California and the world.",
  favicon: "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png",
  feeds: ["http://www.latimes.com/rss2.0.xml"],
  image: nil,
  tags: [
    %{accuracy: 0.9, name: "california"},
    %{accuracy: 0.9, name: "california news"},
    %{accuracy: 0.9, name: "lakers coverage"},
    %{accuracy: 0.9, name: "west coast news"},
    ...
  ],
  title: "Los Angeles Times - California, national and world news - Los Angeles Times",
  url: "http://www.latimes.com/"
}

# Scrape an article (aka "content website"):
Scrape.article "http://www.bbc.com/news/world-europe-34753464"

# Result
%Scrape.Article{
  description: "The Russian plane crash in Egypt was not due to technical failures, say French aviation officials, adding that the flight data recorder suggests a \"violent, sudden\" explosion.",
  favicon: "http://static.bbci.co.uk/news/1.96.1453/apple-touch-icon.png",
  fulltext: "Other French officials said the flight data recorder suggested a \"violent, sudden\" explosion caused the crash, killing all 224 people on board.\n\nThe Metrojet Airbus A321 was flying [...shortened...]",
  image: "http://ichef.bbci.co.uk/news/1024/cpsprodpb/A4F2/production/_86562224_86562223.jpg",
  tags: [
    %{accuracy: 0.7628205128205128, name: "french"},
    %{accuracy: 0.6730769230769231, name: "technical"},
    %{accuracy: 0.6730769230769231, name: "plane"},
    %{accuracy: 0.5384615384615385, name: "bbc"},
    %{accuracy: 0.40384615384615385, name: "newsrussian"},
    %{accuracy: 0.358974358974359, name: "flight"},
    %{accuracy: 0.358974358974359, name: "egypt"},
    %{accuracy: 0.3141025641025641, name: "russian"},
    %{accuracy: 0.3141025641025641, name: "data"},
    %{accuracy: 0.3141025641025641, name: "recorder"},
    ...
  ],
  title: "Russian plane crash: French 'rule out technical failure' - BBC News",
  url: "http://www.bbc.com/news/world-europe-34753464"
}

# Scrape a feed and return only its item URLs:
Scrape.feed "http://example.com/feed", :minimal

# Result
["url1", "url2", ...]
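
The feed, website, and article results above all carry a `tags` list of `%{accuracy: ..., name: ...}` maps. A minimal sketch of post-processing such a list follows. `TagFilter` and `names_above/2` are hypothetical helpers, not part of Scrape, and the sample list is copied from the article result above so the snippet runs without network access.

```elixir
defmodule TagFilter do
  # Hypothetical helper (not part of Scrape): returns the names of all
  # tags whose accuracy is at least `min_accuracy`, in their original order.
  # Entries that lack an :accuracy or :name key are skipped silently.
  def names_above(tags, min_accuracy) do
    for %{accuracy: acc, name: name} <- tags, acc >= min_accuracy, do: name
  end
end

# Sample data in the shape shown in the article result above.
tags = [
  %{accuracy: 0.7628205128205128, name: "french"},
  %{accuracy: 0.6730769230769231, name: "technical"},
  %{accuracy: 0.3141025641025641, name: "data"}
]

names = TagFilter.names_above(tags, 0.5)
IO.inspect(names)
# => ["french", "technical"]
```

With a live result, the same call would presumably take `article.tags`, assuming the struct fields match the output shown above.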

License

LGPLv3. Use this library however you want, but I want improvements & bugfixes to flow back into this package.
