Document plugin - add error checking for collected links #251
Comments
It's my feeling that we shouldn't even resolve the URLs unless there's a need to get the data stored there.
@edsu want to weigh in?
Agreed, it definitely shouldn't throw an error. I'll take a look at a fix.
It's actually the link element for the copyright URL that it seems to be examining, not the meta tag: <link rel="copyright" title="Szerzői jogok" href="/copyright/" /> I'm still looking to see where the problem is.
I noticed this in the console when trying to persist an annotation for this page:
The POST to the API fails with a 500. I took a look, and the JSON that is being POSTed looks like this:
{
"document": {
"dc": {},
"eprints": {},
"facebook": {
"description": "[\"Egy tus kellett az utols\u00c3\u00b3 p\u00c3\u00a1rban Szil\u00c3\u00a1gyi \u00c3\u0081ronnak, de zsin\u00c3\u00b3rban \u00c3\u00b6t\u00c3\u00b6t kapott. Oda az \u00c3\u00bajabb vb-\u00c3\u00a9rem.\"]",
"image": "[\"http://kep.cdn.index.hu/1/0/463/4632/46327/4632792_360be8786b7c3e436f9f4e5e2910ace8_wm.jpg\"]",
"title": "[\"Iszony\u00c3\u00ba f\u00c3\u00a1jdalmas veres\u00c3\u00a9g a rom\u00c3\u00a1nokt\u00c3\u00b3l\"]",
"type": "[\"article\"]",
"url": "[\"http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/\"]"
},
"favicon": "http://sportgeza.hu/assets/images/favicon_big.ico",
"highwire": {},
"link": "[{\"href\": \"http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/\"}]",
"prism": {},
"title": "[\"Iszony\u00c3\u00ba f\u00c3\u00a1jdalmas veres\u00c3\u00a9g a rom\u00c3\u00a1nokt\u00c3\u00b3l\"]",
"twitter": {}
},
"permissions": {
"admin": [
"acct:edsu@localhost"
],
"delete": [
"acct:edsu@localhost"
],
"read": [
"group:__world__",
"acct:edsu@localhost"
],
"update": [
"acct:edsu@localhost"
]
},
"quote": "argentinok",
"ranges": "[{\"startContainer\": \"/div[7]/div[2]/div[1]/div[1]/div[3]/p[1]/span[1]/span[1]/span[1]\", \"startOffset\": 33, \"endContainer\": \"/div[7]/div[2]/div[1]/div[1]/div[3]/p[1]/span[1]/span[1]/span[1]\", \"endOffset\": 43}]",
"target": "[{\"source\": \"http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/\", \"selector\": [{\"type\": \"RangeSelector\", \"startContainer\": \"/div[7]/div[2]/div[1]/div[1]/div[3]/p[1]/span[1]/span[1]/span[1]\", \"startOffset\": 33, \"endContainer\": \"/div[7]/div[2]/div[1]/div[1]/div[3]/p[1]/span[1]/span[1]/span[1]\", \"endOffset\": 43}, {\"type\": \"TextQuoteSelector\", \"exact\": \"argentinok\", \"prefix\": \"magyar f\u00c3\u00a9rfi kardv\u00c3\u00a1logatott az\", \"suffix\": \"elleni k\u00c3\u00b6nnyed bemeleg\u00c3\u00adt\u00c3\u00a9s (45-\"}, {\"type\": \"TextPositionSelector\", \"start\": 491, \"end\": 501}], \"quote\": \"argentinok\"}]",
"text": "test",
"uri": "http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/",
"user": "acct:edsu@localhost"
}
For some reason the
That symptom is very similar to what I found in hypothesis/h#608; it is probably caused by the same problem. (I am also seeing an array being flattened into a string. No idea why.)
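For illustration, the flattened values in the payload above (for example "type": "[\"article\"]") have the shape you get when an array is serialized to a JSON string on its own and the enclosing object is then serialized again. The sketch below only reproduces that shape. Whether double serialization is the actual cause here is an assumption, and `unflatten` is a hypothetical helper that a consuming application might use, not part of Annotator.

```javascript
// Hypothetical illustration: an array serialized on its own, then embedded
// as a string in an object that is serialized a second time.
const flattened = { type: JSON.stringify(["article"]) };
const wire = JSON.stringify(flattened);

// Hypothetical helper: turn a string that holds a JSON array back into an
// array, and return any other value unchanged.
function unflatten(value) {
  if (typeof value !== "string" || !value.trim().startsWith("[")) {
    return value;
  }
  try {
    const parsed = JSON.parse(value);
    return Array.isArray(parsed) ? parsed : value;
  } catch (e) {
    return value; // not valid JSON, leave it alone
  }
}
```

A consumer could run `unflatten` over fields such as `title`, `link`, `ranges`, and `target` before using them; it leaves ordinary strings like "test" untouched.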
I agree with @tilgovi that it might be best to let the Document plugin simply collect the data as best it can, and leave it to consuming applications to validate the data for their own purposes. For example, we could write a program that walks the annotation-store and scrubs the links, archives the documents, etc. If we made the Document plugin ensure that URLs are resolvable before returning them, it would slow things down significantly, and it could omit URLs that were temporarily unavailable for whatever reason.

Currently the Document plugin converts relative URLs to absolute URLs, since we need to query them as absolute URLs later. This is done in the _absoluteUrl function, which is a bit of a hack, but a fairly common trick that gets the browser to make them absolute. The only downside is that the browser will report URLs that failed to load in the console. The failed GET request happens asynchronously and does not halt document scanning or otherwise affect the normal functioning of the annotator; at least, I found no evidence of that. To verify this I disabled
Currently, when the Document plugin collects all links in the getLinks() function, it does not check whether these links are dead when calling _absoluteUrl() for them. Take this page as an example: http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/
It throws back the following error:
Of course, it is the page itself that provides the wrong metadata, here:
<meta name="copyright" content="http://sportgeza.hu/copyright/" />
But we should not throw errors in these cases (as far as I can see, it then aborts the whole document scan).
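A minimal sketch of the kind of error handling asked for here, assuming an environment that provides the standard `URL` constructor (modern browsers, Node.js). This is not the plugin's `_absoluteUrl` implementation: `toAbsoluteUrl` and `collectLinks` are hypothetical names, and `new URL(href, base)` is swapped in for the browser trick because it resolves a relative URL without issuing a request. A link that cannot be resolved is skipped, so the remaining links are still collected.

```javascript
// Hypothetical helper: resolve href against base and return null instead of
// throwing when the URL cannot be parsed.
function toAbsoluteUrl(href, base) {
  try {
    return new URL(href, base).href;
  } catch (e) {
    return null; // malformed href or base
  }
}

// Hypothetical collector: takes plain { rel, href } records (for example
// read from <link> elements) and keeps only the ones that resolve.
function collectLinks(rawLinks, base) {
  const links = [];
  for (const raw of rawLinks) {
    const href = toAbsoluteUrl(raw.href, base);
    if (href === null) {
      continue; // skip the bad link, keep scanning the rest
    }
    links.push({ rel: raw.rel, href });
  }
  return links;
}

const base =
  "http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/";
const links = collectLinks(
  [
    { rel: "copyright", href: "/copyright/" },
    { rel: "broken", href: "http://[not-a-host" }, // deliberately malformed
  ],
  base
);
```

Here the `/copyright/` link resolves to `http://sportgeza.hu/copyright/`, and the malformed entry is dropped rather than raising. Note that this only checks that a URL is well formed; it does not check that the target can be fetched, which matches the suggestion above to leave that validation to consuming applications.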