I guess I found a bug in the way the scoring is done.
For example, an article page from CNN:
DEBUG:root:Candidate p#cnnContentContainer.cnn_storyarea with score 163.5
DEBUG:root:Candidate p#.cnn_contentarea with score 138.0
DEBUG:root:Candidate p#cnnContainer. with score 118.5
DEBUG:root:Candidate body#. with score 113.5
DEBUG:root:Candidate p#.cnn_strycntntlft with score 111.0
All five of those candidates are nested inside one another (body# -> p.*), so the result ends up showing too much text that is not needed.
An idea would be to remove child nodes from the parent before calculating the score.
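To make the idea concrete, here is a minimal sketch of scoring a node by its own text only. It is not the library's code: it uses the standard-library `xml.etree.ElementTree` instead of the library's parser, plain text length stands in for the real score, and the helper names and sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def own_text_length(elem):
    """Length of the text directly inside elem, ignoring child elements' text."""
    parts = [elem.text or ""]
    # A child's .tail is the text that follows the child but still belongs to elem.
    parts.extend(child.tail or "" for child in elem)
    return len("".join(parts).strip())


def total_text_length(elem):
    """Length of all text inside elem, including every descendant."""
    return len("".join(elem.itertext()).strip())


# Hypothetical markup: a wrapper that contains the real story node.
doc = ET.fromstring(
    "<body><div id='outer'>nav <div id='story'>Article text here.</div> footer</div></body>"
)
outer = doc.find("div")
story = outer.find("div")

print(own_text_length(outer), total_text_length(outer), own_text_length(story))
```

With child text excluded, the wrapper is credited only for its own "nav" and "footer" text, so it no longer outranks the story node just because it contains it.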
Thanks for the report. To be honest, I haven't looked too closely into the scoring of nodes (I didn't write this library, I merely ported it to Python).
I do know that it's unfortunately not as simple as disregarding children from the scoring calculation, because then you lose good content candidates which are composed of multiple children - imagine a "body" div which has very little text inside it, but contains 5 large child tags comprising the article. You'd want to select the containing div, rather than any individual child tag.
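A rough sketch of that failure case, under the same caveats: this is not the library's scoring, text length is only a stand-in for the real score, `xml.etree.ElementTree` replaces the library's parser, and the markup and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical article: a "body" div with no text of its own and five child paragraphs.
paragraphs = "".join(f"<p>Paragraph {i} of the article body.</p>" for i in range(5))
root = ET.fromstring(f"<div id='body'>{paragraphs}</div>")


def own_len(elem):
    """Score that ignores children: only text sitting directly inside elem."""
    return len(((elem.text or "") + "".join(c.tail or "" for c in elem)).strip())


def full_len(elem):
    """Score that includes children: all text inside elem and its descendants."""
    return len("".join(elem.itertext()).strip())


def best(score):
    # root.iter() yields the root followed by every descendant element.
    return max(root.iter(), key=score)


best_own = best(own_len)
best_full = best(full_len)
print(best_own.tag, best_full.tag)
```

When children are ignored, the div scores zero and a single paragraph wins, which would return one fifth of the article. When descendants count, the containing div wins, but so does every ancestor above it, which is the nesting problem from the original report.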