- json-ld parsing issue is fixed;
- deprecation warning for url argument points to caller code;
- better Python 3.7 support (fixed warnings, setup running 3.7 tests on CI).
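A warning can be attributed to the caller rather than to library code with the stacklevel argument of warnings.warn. The sketch below is a minimal illustration of that technique, not extruct's actual implementation: the extract function here is a hypothetical stand-in, and the message text is an assumption.

```python
import warnings


def extract(htmlstring, base_url=None, url=None):
    """Hypothetical stand-in for a function with a deprecated ``url`` argument."""
    if url is not None:
        # stacklevel=2 attributes the warning to the line that called
        # extract(), not to this line inside the library function.
        warnings.warn(
            '"url" argument is deprecated, please use "base_url"',
            DeprecationWarning,
            stacklevel=2,
        )
        base_url = url
    return base_url  # a real extractor would parse htmlstring here


# Record warnings so the example can inspect what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = extract("<html></html>", url="http://example.com/page")
```

With stacklevel=2 the reported file and line number are those of the calling code, which is what makes the warning actionable for users.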
In this release OpenGraph parsing is improved:
- known OpenGraph namespaces (og, music, video, article, book, profile) work without an explicitly defined prefix;
- prefix is extracted both from <head> and <html> element attributes, not only from <head>;
- prefix parsing is more permissive.
Other changes:
- pypi version badge is added to the README;
- html parsing code is cleaned up.
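The OpenGraph prefix handling described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: parse_prefixes, KNOWN_NAMESPACES and the regular expression are hypothetical names, and "permissive" is interpreted here as tolerating missing or extra whitespace around the colon.

```python
import re

# Namespaces assumed to be known without an explicit prefix declaration.
KNOWN_NAMESPACES = {
    "og": "http://ogp.me/ns#",
    "music": "http://ogp.me/ns/music#",
    "video": "http://ogp.me/ns/video#",
    "article": "http://ogp.me/ns/article#",
    "book": "http://ogp.me/ns/book#",
    "profile": "http://ogp.me/ns/profile#",
}

# "name: uri" pairs, tolerating optional whitespace around the colon.
_PREFIX_RE = re.compile(r"([\w-]+)\s*:\s*(\S+)")


def parse_prefixes(*attribute_values):
    """Merge prefix attributes (e.g. from <html> and <head>) over the defaults."""
    namespaces = dict(KNOWN_NAMESPACES)
    for value in attribute_values:
        if value:  # an element may have no prefix attribute at all
            namespaces.update(_PREFIX_RE.findall(value))
    return namespaces


html_prefix = "og: http://ogp.me/ns#"
head_prefix = "fb:http://ogp.me/ns/fb#"  # no space after the colon
namespaces = parse_prefixes(html_prefix, head_prefix)
```

Passing None for a missing attribute leaves the defaults untouched, so a page without any prefix declaration still resolves the known namespaces.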
- JSON-LD parsing is less strict now: control characters are allowed.
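For context, Python's standard json module rejects raw control characters inside strings by default and accepts them when strict=False is passed. The sketch below only demonstrates that standard-library behaviour. Whether extruct relies on this exact switch is an assumption, and the payload is made up for illustration.

```python
import json

# The \t below is a literal tab character inside the JSON string value.
payload = '{"@type": "Product", "name": "Blue\tshirt"}'

try:
    json.loads(payload)  # default strict mode rejects raw control characters
    strict_accepts = True
except json.JSONDecodeError:
    strict_accepts = False

# Non-strict mode keeps the control character in the decoded value.
lenient = json.loads(payload, strict=False)
```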
- Add OpenGraph and Microformat extractors.
- Add argument syntaxes to extract and command line function; it selects which syntaxes to extract.
- Add argument uniform to extract and command line function; if True, it maps the output of Microdata, OpenGraph, Microformat and JSON-LD to the same template.
- Add argument errors to extract and command line function; it defines whether errors should be raised, logged or ignored.
- Fix RDFa memory leak: RDFaExtractor now resets _lookups after each extraction.
- Fixed regex pattern in JsonLdExtractor to avoid removing comments from within valid JSON.
- In w3microdata strip whitespace, newlines, etc. from URLs extracted from HTML nodes.
- base_url substitutes url in MicroformatExtractor, JsonLdExtractor, OpenGraphExtractor, RDFaExtractor and MicrodataExtractor
- individual extractors accept base_url instead of url, unused keyword arguments are removed.
- In w3microdata.extract_items, items_seen and url are no longer class variables but are passed as arguments.
- In w3microdata the following functions are now private: extract_item, extract_property_value, extract_textContent, _extract_property, _extract_properties, _extract_property_refs and _extract_textContent.
- In w3microdata, _extract_properties, _extract_property_refs, _extract_property, _extract_property_value and _extract_item now need items_seen and url to be passed as arguments.
- Add argument return_html_node to extract; it returns the HTML node along with the result of metadata extraction. It is supported only by the microdata syntax.
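The syntaxes, uniform and errors arguments listed above might be combined as in the sketch below. It assumes extruct is installed. The syntax names and the "log" value for errors reflect my understanding of the library's README and should be checked against the installed version. extract_metadata and EXTRACT_OPTIONS are hypothetical names.

```python
# Keyword arguments described in the notes above; the values are illustrative.
EXTRACT_OPTIONS = {
    "syntaxes": ["microdata", "opengraph", "json-ld"],  # skip the other syntaxes
    "uniform": True,   # map the output of each syntax to the same template
    "errors": "log",   # log extraction errors instead of raising them
}


def extract_metadata(html, base_url=None):
    """Run extruct with the options above (requires the extruct package)."""
    import extruct  # imported lazily so this module loads without it

    return extruct.extract(html, base_url=base_url, **EXTRACT_OPTIONS)
```

Restricting syntaxes to the ones actually needed avoids running extractors whose output would be discarded.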
Warning: backward-incompatible change:
- base_url is used instead of url in extruct.extract; url is still supported but deprecated.
- In extruct.extract default base_url is now None to avoid wrong results with urljoin.
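To see why a None default is safer, consider what urljoin does with a placeholder base. The sketch below uses only the standard library. The resolve helper and the example hosts are hypothetical and only illustrate the reasoning, not how extruct resolves links internally.

```python
from urllib.parse import urljoin


def resolve(link, base_url=None):
    """Join link against base_url, or leave it untouched when no base is known."""
    if base_url is None:
        return link  # better a relative URL than a wrong absolute one
    return urljoin(base_url, link)


# With no base, the relative link is returned as found in the page.
relative = resolve("/img/logo.png")

# With the real page URL, the link resolves against the right host.
absolute = resolve("/img/logo.png", base_url="https://shop.example/products/1")

# A placeholder base yields a well-formed but wrong absolute URL.
misleading = resolve("/img/logo.png", base_url="http://placeholder.invalid")
```

The last case is the failure mode a non-None default invites: the result looks valid, so the error goes unnoticed downstream.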
- New extruct command line tool to fetch a page and extract its metadata. Works either via extruct directly or python -m extruct.
- Accept leading HTML comment in JSON-LD payload.
- rdflib log messages were silenced to avoid the noise when importing extruct.
- Fix dependencies and support RDFa by default (hence depend on rdflib by default).
- Update README with all-in-one extractor examples.
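Accepting a leading HTML comment in a JSON-LD payload can be approached by stripping a comment only when it appears at the very start of the script text, so that comment-like sequences inside valid JSON strings are left alone. The following is a minimal sketch of that idea with hypothetical names (parse_jsonld, _LEADING_COMMENT), not the pattern the library uses.

```python
import json
import re

# Matches one HTML comment only at the start of the text (no MULTILINE flag).
_LEADING_COMMENT = re.compile(r"^\s*<!--.*?-->\s*", re.DOTALL)


def parse_jsonld(script_text):
    """Drop a leading HTML comment, if any, then decode the JSON payload."""
    cleaned = _LEADING_COMMENT.sub("", script_text, count=1)
    return json.loads(cleaned)


script_text = '<!-- generated by CMS --> {"@type": "Person", "note": "a <!-- b --> c"}'
data = parse_jsonld(script_text)
```

Because the pattern is anchored at the start, the comment-like text inside the "note" value survives decoding unchanged.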
- All extractors have an .extract_items() method, taking an lxml-parsed document as input, if you want to reuse one you already have.
- Add generic extraction: use extruct.extract() to call all extractors at once.
Warning: backward-incompatible change:
.extract() methods now return a list of Python dicts (the items) instead of a dict with an "items" key having this list as value.
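Code that has to work on both sides of this change could normalize the return value before using it. The helper below is a hypothetical compatibility shim, and the item contents are made up for illustration. It assumes only the two shapes described above.

```python
def items_from(result):
    """Return the list of items from either the old or the new return shape."""
    if isinstance(result, dict):
        # Old shape: {"items": [...]}
        return result.get("items", [])
    # New shape: the list of item dicts itself.
    return list(result)


old_style = {"items": [{"type": "http://schema.org/Person"}]}
new_style = [{"type": "http://schema.org/Person"}]

old_items = items_from(old_style)
new_items = items_from(new_style)
```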
- Use rdflib's pyRdfa directly instead of pyRdfa3 code copy.
- (Very) Experimental support for RDFa extraction using rdflib+lxml
- Web service response content-type set to 'application/json'
- Web service Python 3 compatibility
- Code coverage reports
- Fix extraction of <object> "data" URL with microdata
- Handle textContent mixed with <script> and <style> tags
- Add JSON-LD extraction example to README
- Tests added for non-nested microdata output
- Tests added for text content option
- Tests added for "meter" and "data" attributes
- First release on PyPI.