Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document plugin - add error checking for collected links #251

Closed
gergely-ujvari opened this issue Aug 10, 2013 · 7 comments
Closed

Document plugin - add error checking for collected links #251

gergely-ujvari opened this issue Aug 10, 2013 · 7 comments

Comments

@gergely-ujvari
Copy link
Contributor

Currently, when the document plugins collects all links in the getLinks() function it does not check if these links are not head when calling the _absoluteUrl() for them.

Take these page as an example: http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/

It throws back the following error:

GET http://sportgeza.hu/copyright/ 404 (Not Found) jquery-1.10.2.js?91515770:6569
jQuery.extend.buildFragment jquery-1.10.2.js?91515770:6569
jQuery.extend.parseHTML jquery-1.10.2.js?91515770:540
jQuery.fn.jQuery.init jquery-1.10.2.js?91515770:150
jQuery jquery-1.10.2.js?91515770:63
Annotator.Plugin.Document.Document._absoluteUrl annotator.document.js?d8714605:296
Annotator.Plugin.Document.Document._getLinks annotator.document.js?d8714605:209
(anonymous function) annotator.document.js?d8714605:15
Annotator.Plugin.Document.Document.getDocumentMetadata annotator.document.js?d8714605:95
(anonymous function) annotator.document.js?d8714605:15
Annotator.Plugin.Document.Document.pluginInit annotator.document.js?d8714605:45
Annotator.addPlugin annotator.js?c4672558:1353
Annotator.Host.Host._setupXDM host.js?9bed4e07:103
(anonymous function) host.js?9bed4e07:60
jQuery.event.dispatch jquery-1.10.2.js?91515770:5095
elemData.handle

And of course it is the page which gives the wrong metadata, here:
<meta name="copyright" content="http://sportgeza.hu/copyright/" />

But we should not throw errors in these cases (as I see it then aborts the whole document scanning).

@tilgovi
Copy link
Member

tilgovi commented Aug 10, 2013

It's my feeling that we shouldn't even resolve the URLs unless there's a need to get the data stored there.

@tilgovi
Copy link
Member

tilgovi commented Aug 10, 2013

@edsu want to weigh in?

@edsu
Copy link
Contributor

edsu commented Aug 10, 2013

Agreed, it definitely shouldn't throw an error. I'll take a look at a fix.

@edsu
Copy link
Contributor

edsu commented Aug 13, 2013

It's actually the link element for the copyright URL that it seems to be examining, not the meta tag:

<link rel="copyright" title="Szerzői jogok" href="/copyright/" />

I'm still looking to see where the problem is.

@edsu
Copy link
Contributor

edsu commented Aug 13, 2013

I noticed this in the console when trying to persist an annotation for this page:

2013-08-13 08:01:43,830 ERROR [annotator] Exception on /annotations [POST]
Traceback (most recent call last):
  File "/Users/ed/Projects/h/lib/python2.7/site-packages/flask/app.py", line 1687, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/ed/Projects/h/lib/python2.7/site-packages/flask/app.py", line 1360, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/ed/Projects/h/lib/python2.7/site-packages/flask/app.py", line 1358, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/ed/Projects/h/lib/python2.7/site-packages/flask/app.py", line 1344, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/ed/Projects/h/src/annotator/annotator/store.py", line 148, in create_annotation
    annotation.save(refresh=refresh)
  File "/Users/ed/Projects/h/src/annotator/annotator/annotation.py", line 96, in save
    uris = [link['href'] for link in d['link']]
TypeError: string indices must be integers, not str

The POST to the API fails with a 500. I took a look and the JSON that is being POSTed looks like this:

{
    "document": {
        "dc": {}, 
        "eprints": {}, 
        "facebook": {
            "description": "[\"Egy tus kellett az utols\u00c3\u00b3 p\u00c3\u00a1rban Szil\u00c3\u00a1gyi \u00c3\u0081ronnak, de zsin\u00c3\u00b3rban \u00c3\u00b6t\u00c3\u00b6t kapott. Oda az \u00c3\u00bajabb vb-\u00c3\u00a9rem.\"]", 
            "image": "[\"http://kep.cdn.index.hu/1/0/463/4632/46327/4632792_360be8786b7c3e436f9f4e5e2910ace8_wm.jpg\"]", 
            "title": "[\"Iszony\u00c3\u00ba f\u00c3\u00a1jdalmas veres\u00c3\u00a9g a rom\u00c3\u00a1nokt\u00c3\u00b3l\"]", 
            "type": "[\"article\"]", 
            "url": "[\"http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/\"]"
        }, 
        "favicon": "http://sportgeza.hu/assets/images/favicon_big.ico", 
        "highwire": {}, 
        "link": "[{\"href\": \"http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/\"}]", 
        "prism": {}, 
        "title": "[\"Iszony\u00c3\u00ba f\u00c3\u00a1jdalmas veres\u00c3\u00a9g a rom\u00c3\u00a1nokt\u00c3\u00b3l\"]", 
        "twitter": {}
    }, 
    "permissions": {
        "admin": [
            "acct:edsu@localhost"
        ], 
        "delete": [
            "acct:edsu@localhost"
        ], 
        "read": [
            "group:__world__", 
            "acct:edsu@localhost"
        ], 
        "update": [
            "acct:edsu@localhost"
        ]
    }, 
    "quote": "argentinok", 
    "ranges": "[{\"startContainer\": \"/div[7]/div[2]/div[1]/div[1]/div[3]/p[1]/span[1]/span[1]/span[1]\", \"startOffset\": 33, \"endContainer\": \"/div[7]/div[2]/div[1]/div[1]/div[3]/p[1]/span[1]/span[1]/span[1]\", \"endOffset\": 43}]", 
    "target": "[{\"source\": \"http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/\", \"selector\": [{\"type\": \"RangeSelector\", \"startContainer\": \"/div[7]/div[2]/div[1]/div[1]/div[3]/p[1]/span[1]/span[1]/span[1]\", \"startOffset\": 33, \"endContainer\": \"/div[7]/div[2]/div[1]/div[1]/div[3]/p[1]/span[1]/span[1]/span[1]\", \"endOffset\": 43}, {\"type\": \"TextQuoteSelector\", \"exact\": \"argentinok\", \"prefix\": \"magyar f\u00c3\u00a9rfi kardv\u00c3\u00a1logatott az\", \"suffix\": \"elleni k\u00c3\u00b6nnyed bemeleg\u00c3\u00adt\u00c3\u00a9s (45-\"}, {\"type\": \"TextPositionSelector\", \"start\": 491, \"end\": 501}], \"quote\": \"argentinok\"}]", 
    "text": "test", 
    "uri": "http://sportgeza.hu/sport/vivovb/2013/08/10/iszonyu_fajdalmas_vereseg_a_romanoktol/", 
    "user": "acct:edsu@localhost"
}

For some reason the link key is getting flattened into a string--it should be an array. Which is odd because I can see in the debugger that window.annotator.plugins.Document.metadata.link is an array. It's even stranger because an array is posted when annotating other pages ...

@csillag
Copy link
Contributor

csillag commented Aug 13, 2013

That symptom is very similar to what I have found in hypothesis/h#608; probably caused by the same problem. (I am also seeing an array being flattened into a string. No idea why.)

@edsu
Copy link
Contributor

edsu commented Aug 15, 2013

I agree with @tilgovi that it might be best to let the Document plugin simply collect the data as best it can, and leave it up to consuming applications that want to use the data to validate it for their purposes. For example we could write a program that walked the annotation-store and scrubbed the links, or archived the documents, etc. If we make the Document plugin ensure that URLs are resolvable before returning them it would slow things down significantly, and could fail to include some URLs that were temporarily unavailable for whatever reason.

Currently the Document plugin converts relative URLs to absolute URLs since we need to query them as absolute URLs later. This is done in the _absoluteUrl function, which is a bit of a hack, but a fairly common trick that gets the browser to make them absolute. The only downside is that the browser will report URLs that failed to load in the console. The failed GET request happens asynchronously, and do not halt document scanning, or otherwise effect the normal functioning of the annotator, at least I found no evidence of that.

To verify this I disabled _absoluteUrl by simply returning what was passed in. This caused the console error @gergely-ujvari included above to go away, since the browser was no longer trying to load the broken URL. However the annotation still failed to save, because the document links were flattened from an array into a string. This demonstrates that _absoluteUrl has nothing to do with hypothesis/h#608 and that the problem of list flattening (which isn't limited to annotation.document.link by the way) lies elsewhere.

@edsu edsu closed this as completed Aug 15, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants