forked from bgamari/html-parse
-
Notifications
You must be signed in to change notification settings - Fork 0
/
html-parse.cabal
105 lines (99 loc) · 3.95 KB
/
html-parse.cabal
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
name: html-parse
version: 0.2.1.0
synopsis: A high-performance HTML tokenizer
description:
This package provides a fast and reasonably robust HTML5 tokenizer built
upon the @attoparsec@ library. The parsing strategy is based upon the HTML5
parsing specification with few deviations.
.
For instance,
.
>>> parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
[TagOpen "div" [],
TagOpen "h1" [Attr "class" "widget"],
ContentText "Hello World",
TagClose "h1",
TagSelfClose "br" []]
.
The package targets similar use-cases to the venerable @tagsoup@ library,
but is significantly more efficient, achieving parsing speeds of over 80
megabytes per second on modern hardware and typical web documents.
Here are some typical performance numbers taken from parsing a Wikipedia
article of moderate length:
.
@
benchmarking Forced/tagsoup fast Text
time 186.1 ms (175.3 ms .. 194.6 ms)
0.999 R² (0.995 R² .. 1.000 R²)
mean 191.7 ms (188.9 ms .. 198.3 ms)
std dev 5.053 ms (1.092 ms .. 6.809 ms)
variance introduced by outliers: 14% (moderately inflated)
.
benchmarking Forced/tagsoup normal Text
time 189.7 ms (182.8 ms .. 197.7 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 196.5 ms (193.1 ms .. 202.1 ms)
std dev 5.481 ms (2.141 ms .. 7.383 ms)
variance introduced by outliers: 14% (moderately inflated)
.
benchmarking Forced/html-parser
time 15.81 ms (15.75 ms .. 15.89 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 15.72 ms (15.66 ms .. 15.77 ms)
std dev 140.9 μs (113.6 μs .. 174.5 μs)
@
homepage: http://github.com/bgamari/html-parse
license: BSD3
license-file: LICENSE
author: Ben Gamari
maintainer: [email protected]
copyright: (c) 2016 Ben Gamari
category: Text
build-type: Simple
cabal-version: >=1.10
tested-with: GHC==8.4.*, GHC==8.6.*, GHC==8.8.*, GHC==8.10.*, GHC==9.0.*, GHC==9.2.*, GHC==9.4.*
extra-source-files: changelog.md
source-repository head
type: git
location: git://github.com/bgamari/html-parse
library
exposed-modules: Text.HTML.Parser, Text.HTML.Tree
other-modules: Text.HTML.Parser.Entities, Data.Trie
ghc-options: -Wall
hs-source-dirs: src
other-extensions: OverloadedStrings, DeriveGeneric
build-depends: base >=4.7 && <4.18,
deepseq >=1.3 && <1.5,
attoparsec >=0.13 && <0.15,
text >=1.2 && <2.1,
containers >=0.5 && <0.7
default-language: Haskell2010
benchmark bench
type: exitcode-stdio-1.0
main-is: Benchmark.hs
other-extensions: OverloadedStrings, DeriveGeneric
build-depends: base,
deepseq,
attoparsec,
text,
tagsoup >= 0.13,
criterion >= 1.1,
html-parse
default-language: Haskell2010
test-suite spec
type: exitcode-stdio-1.0
hs-source-dirs: tests
main-is: Spec.hs
other-modules: Text.HTML.ParserSpec, Text.HTML.TreeSpec
ghc-options: -Wall -with-rtsopts=-T
build-tool-depends: hspec-discover:hspec-discover
build-depends: base,
containers,
hspec,
hspec-discover,
html-parse,
QuickCheck,
quickcheck-instances,
string-conversions,
text
default-language: Haskell2010