All notable changes to this project will be documented in this file. This project adheres to Semantic Versioning. For details about upgrading and BC breaks, have a look at the UPGRADE document.
- Added catch for `Throwable` when parsing PDFs; also updated to the latest version of `smalot/pdfparser`.
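  A minimal sketch of catching `Throwable` around the PDF parsing step; the variable names are assumptions, only the `smalot/pdfparser` calls reflect that library's public API:

  ```php
  use Smalot\PdfParser\Parser;

  try {
      // $responseContent is assumed to hold the raw PDF body of the response
      $document = (new Parser())->parseContent($responseContent);
      $text = $document->getText();
  } catch (\Throwable $e) {
      // a malformed or unsupported PDF no longer aborts the whole crawl
      $text = '';
  }
  ```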
- #20 Improve the strip tags behavior of the HTML parser in order to generate cleaner and more readable output when `$stripTags` is enabled. Markup like `<p>foo</p><p>bar</p>` is now handled as `foo bar` instead of `foobar`.
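  The difference is easy to reproduce with plain PHP. The snippet below only illustrates the general technique (padding tags with a space before stripping), not the parser's actual implementation:

  ```php
  // strip_tags() alone glues the contents of adjacent block elements together:
  echo strip_tags('<p>foo</p><p>bar</p>'); // foobar

  // padding each tag with a space before stripping keeps the words separated:
  $html = str_replace('<', ' <', '<p>foo</p><p>bar</p>');
  echo trim(preg_replace('/\s+/', ' ', strip_tags($html))); // foo bar
  ```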
- #18 Fix issue with pages that have UTF-8 characters in the title tag.
- #17 Fix issue where the crawler group was not generated correctly.
- #15 Do not follow links which have `rel="nofollow"` by default. This can be configured in the `HtmlParser::$ignoreRels` property.
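  A sketch of how the new default could be reverted; only the `HtmlParser::$ignoreRels` name comes from this entry, the value shape is an assumption:

  ```php
  // assuming HtmlParser is already imported from the crawler package
  $parser = new HtmlParser();

  // assumption: the property holds a list of rel values to ignore, so an
  // empty list would make the crawler follow rel="nofollow" links again
  $parser->ignoreRels = [];
  ```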
- #14 Pass the status code of the response to the parsers and only process HTML and PDF responses with status code 200 (OK).
- #13 New Crawler method `getCycles()` returns the number of times the `run()` method was called.
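  For example (assuming `$crawler` is a fully configured `Crawler` instance):

  ```php
  $crawler->run();

  echo $crawler->getCycles(); // number of times run() was called
  ```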
- #10 Add relative URL check to the `Url` class.
- #8 Merge the path of a URL when only a query param is provided (for example, a link `?page=2` found on `https://example.com/list` resolves to `https://example.com/list?page=2`).
- #9 Fix issue where the `CRAWL_IGNORE` tag had no effect. Trim the array value for found links, which equals the link title.
- #7 By default, response content bigger than 5MB won't be passed to the parsers. To turn off this behavior use `'maxSize' => false`, or increase the limit, e.g. `'maxSize' => 15000000` (which is 15MB). The value must be provided in bytes. The main goal is to ensure that the PDF parser won't run into very large memory consumption. This restriction won't stop the crawler from downloading the URL (whether or not it exceeds the maxSize definition), but it prevents memory problems when the parsers start to interact with the response content.
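  A configuration sketch; the option name and values come from this entry, while the place where the options are passed is an assumption:

  ```php
  // passing the options as a constructor argument is an assumption;
  // $storage and $runner stand for the crawler's storage and runner setup
  $crawler = new Crawler('https://example.com', $storage, $runner, [
      'maxSize' => 15000000, // 15MB, in bytes; use false to disable the limit
  ]);
  ```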
- Decrease the cURL request timeout. A cURL request for a given URL will now time out after 5 seconds.
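  In terms of PHP's cURL extension, such a timeout maps to the `CURLOPT_TIMEOUT` option; whether the crawler sets exactly these options is an assumption:

  ```php
  $curl = curl_init('https://example.com');
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($curl, CURLOPT_TIMEOUT, 5); // give up after 5 seconds
  $content = curl_exec($curl);
  curl_close($curl);
  ```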
- #5 Fix a bug in the unfinished `isValid()` function, which checks whether a URL is a `mailto:` link or similar.
- #4 Add an option to encode the URL paths.
- First stable release.