Keksov/html-parsers-benchmark

HTML parsers benchmark

Simple HTML DOM parser benchmark.

Competitors

Erlang

CPython

PyPy

  • BeautifulSoup 3
  • BeautifulSoup 4
  • html5lib

Node.JS

Ruby

C

Google Go

Preparation

Install the OS dependencies: python-virtualenv, Erlang, PyPy, Node.js with npm, a C compiler and the libxml2 development packages:

sudo apt-get install python-virtualenv python-lxml erlang-base pypy \
    libxml2-dev libxslt1-dev build-essential nodejs npm

Then run the preparation script, which sets up virtual environments, fetches dependencies, compiles sources, etc.:

./prepare.sh

In case of errors, I recommend also installing cython and python-dev, then retrying.

To prepare only some of the platforms, set the PLATFORMS environment variable:

PLATFORMS="pypy python" ./prepare.sh

Run

Just run

./run.sh <number of parser iterations>

e.g.

./run.sh 5000

To run tests only for some of the platforms, set the PLATFORMS environment variable:

PLATFORMS="pypy python" ./run.sh 5000
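
The selection logic behind PLATFORMS can be pictured with a small sketch. This is not the code used by run.sh or prepare.sh (those are shell scripts); it only illustrates the idea, assuming that the variable holds whitespace-separated platform names and that an unset or empty variable means "all platforms". The function name selected_platforms and the example platform list are hypothetical.

```python
import os


def selected_platforms(environ, all_platforms):
    """Return the platforms named in PLATFORMS, or all of them if it is unset.

    Assumption: PLATFORMS is a whitespace-separated list of directory names.
    """
    names = environ.get("PLATFORMS", "").split()
    return names if names else list(all_platforms)


if __name__ == "__main__":
    # Hypothetical full list; the real one lives in platforms.txt.
    everything = ["erlang", "python", "pypy"]
    for name in selected_platforms(os.environ, everything):
        print(name)
```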

Results

To convert the results to a CSV file, use to_csv.py:

./run.sh 5000 2>&1 | ./to_csv.py

or something like

./run.sh 5000 2>&1 | tee output.txt
./to_csv.py < output.txt

There is also an R script that can build graphs from the results: stats/main.r.

How to add my %platformname% to the benchmark set?

Create a directory named %platformname%:

mkdir %platformname%

Create run.sh and prepare.sh scripts:

  • run.sh - called every time the benchmark starts. It must use the print_header() and timeit() functions from lib.sh to format the output of each test. It must accept 2 arguments, the HTML file path and the number of iterations, and pass them unchanged to the benchmark scripts.
  • prepare.sh - called only once, before running any benchmarks. It can download dependencies, compile sources, etc.

Create your benchmark scripts. Requirements:

  • Must accept 2 arguments: the path to the HTML file and the number of iterations
  • Must read the HTML file once, then perform "number of iterations" parse cycles
  • Must print the parser-loop runtime in seconds, calculated like start = time(); do_n_iterations(N); print time() - start
  • On each iteration, must build the full DOM tree in memory
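
The requirements above can be sketched as a small Python script. This is only an illustration, not one of the benchmark scripts from this repository: the file name bench.py, the Node and TreeBuilder classes and the hand-rolled tree are all hypothetical, and the standard library's html.parser stands in for whichever parser you actually want to measure.

```python
#!/usr/bin/env python3
"""Hypothetical benchmark script: bench.py <html_file> <iterations>."""
import sys
import time
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal DOM node: tag name, attributes, children (nodes or text)."""
    __slots__ = ("tag", "attrs", "children")

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a full in-memory tree from the parser's tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    """One parse cycle: build a fresh tree and return its root."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def run(html, iterations):
    """Time only the parser loop, as the requirements ask."""
    start = time.time()
    for _ in range(iterations):
        parse(html)
    return time.time() - start


def main(path, iterations):
    with open(path, encoding="utf-8", errors="replace") as f:
        html = f.read()  # read the file once, outside the timed loop
    print(run(html, iterations))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], int(sys.argv[2]))
    else:
        print("usage: bench.py <html_file> <iterations>", file=sys.stderr)
```

File reading and argument handling sit outside run(), so the printed number covers the parse cycles only.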

Add %platformname% to the platforms.txt file.

How to add a new HTML page to the benchmark?

Just create an HTML file named page_<some_page_name>.html.
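
As an illustration of this naming convention, the following sketch lists the files that match the page_*.html pattern in a directory. It is an assumption that the benchmark discovers pages by this pattern; the helper names find_pages and page_name are hypothetical and not part of the repository.

```python
from pathlib import Path


def find_pages(directory):
    """Return all page_<name>.html files in the directory, sorted by name."""
    return sorted(Path(directory).glob("page_*.html"))


def page_name(path):
    """Extract <name> from a page_<name>.html path."""
    return path.stem[len("page_"):]


if __name__ == "__main__":
    for page in find_pages("."):
        print(page_name(page), page)
```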
