While Twitter (#167) is nice for reaching a broad audience, many of the widely linked notebooks we serve are scholarly in nature. At present, none of these are findable on Google Scholar, one of the best tools for getting the right eyeballs on a notebook and broadening its impact on science and engineering.
While Google does index nbviewer (really, the only way to find things until we hammer out #405), Google Scholar remains aloof.
Because Google Scholar's metadata fields are the de facto standard way to publish this kind of metadata, other tools like Mendeley and Zotero would immediately work as well.
It looks like we have to do a few things for this to work.
Metadata
The most arduous task is getting users to add and maintain rich metadata on each notebook we serve. Looking at Google Scholar's inclusion guidelines, we would need to encourage content authors to provide, at a minimum, the following fields (a sketch of what this might look like follows the list):
citation_title or DC.title
citation_author or DC.creator
citation_publication_date or DC.issued
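For concreteness, here is a rough sketch of what a minimum-viable metadata block might look like if it lived under a hypothetical scholar key in the notebook-level metadata; the key name, field names, and values below are all illustrative, not an agreed-upon schema:

```python
# Illustrative sketch only: a minimum-viable scholarly metadata block stored
# in the notebook-level "metadata" dict of an .ipynb file. The "scholar" key
# and the field names are hypothetical placeholders, not an existing convention.
notebook_metadata = {
    "scholar": {
        "title": "An Example Title for a Scholarly Notebook",
        "authors": ["First Author", "Second Author"],
        "publication_date": "2014/09/30",  # Scholar asks for full dates
        "abstract": "A one-paragraph summary of the notebook's contents.",
    }
}
```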
The challenge here is improving the notebook metadata UI, a la ipython/ipython#6073, so that users can select, populate, and see the status of their metadata. Echoing my comments there, raising the reward by adding in-notebook UI awareness of the metadata (i.e. special formatting for title, authors, abstract, and publication date that nbviewer would recognize) would make this feel less like drudge work, and would help ensure that the metadata stays accurate through forking, etc.
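On the nbviewer side, the mapping from that metadata to the tags Scholar looks for is mechanical. A minimal sketch, assuming the hypothetical metadata block above; in practice this would live in the page template, and the function name here is made up:

```python
# Sketch: turn the hypothetical scholarly metadata into the Highwire-style
# <meta> tags Google Scholar reads from a page's <head>.
from html import escape

def scholar_meta_tags(meta):
    """Return citation_* <meta> tags for a rendered notebook's <head>."""
    tags = [
        '<meta name="citation_title" content="%s">' % escape(meta["title"]),
        '<meta name="citation_publication_date" content="%s">'
        % escape(meta["publication_date"]),
    ]
    # Scholar expects one citation_author tag per author, in order.
    for author in meta["authors"]:
        tags.append('<meta name="citation_author" content="%s">' % escape(author))
    return "\n".join(tags)

print(scholar_meta_tags({
    "title": "An Example Title for a Scholarly Notebook",
    "authors": ["First Author", "Second Author"],
    "publication_date": "2014/09/30",
}))
```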
References
Allowing the crawler to find the references used in a notebook would add considerably to its impact. Oddly, Scholar doesn't specify a metadata-driven way to do this, instead relying on semantic markup:
Mark the section of the paper that contains references to other works with a standard heading, such as "References" or "Bibliography", on a line just by itself. Individual references inside this section should be either numbered "1. - 2. - 3." or "[1] - [2] - [3]" in PDF, or put inside an "<ol>" list in HTML. The text of each reference must be a formal bibliographic citation in a commonly used format, without free-form commentary.
This suggests that something like takluyver/cite2c#10 would be valuable.
Discoverability
It could well be that notebooks linked in other places would get picked up by Scholar on their own.
However, if we want to optimize the downstream experience, we need to create a place where the crawler can actually find things in a reasonable fashion. If successful, we would likely immediately fall into the "more than 100k papers" club. Here's the guidance from the site:
For websites with more than a hundred thousand papers, we recommend that you create an additional browse interface that lists only the articles added in the last two weeks. This smaller set of webpages can be recrawled more frequently than your entire browse interface, which will facilitate timely coverage of your recent papers by the search robots.
So perhaps the first step toward indexed notebooks would be to index those that meet the minimum viable metadata standard, and to make "recently published" documents easily discoverable from the front page.
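A minimal sketch of the "recently published" piece, assuming we had some index of (notebook URL, publication date) pairs to draw from, which nbviewer does not have today:

```python
# Sketch: select notebooks "published" within the last two weeks so that a
# small, frequently-recrawled browse page can list just those. The `index`
# argument (an iterable of (url, publication_date) pairs) is a placeholder
# for whatever metadata store we end up building.
from datetime import datetime, timedelta

def recently_published(index, window_days=14, now=None):
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=window_days)
    return [url for url, published in index if published >= cutoff]

# Example with placeholder URLs: only the first falls inside the window.
print(recently_published([
    ("https://nbviewer.example/notebook-a.ipynb", datetime.utcnow() - timedelta(days=3)),
    ("https://nbviewer.example/notebook-b.ipynb", datetime.utcnow() - timedelta(days=40)),
]))
```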
Quality
Undoubtedly, this could become an avenue for spam or just junk. A simple robot could generate an endless stream of properly formatted, difficult-to-detect junk notebooks, complete with MathJax and figures... in fact more easily than humans could!
I have no suggestion for how we might combat this, as we obviously don't have the manpower to validate every link, or even to put a sensible PR-based, crowd-sourced mechanism in place that would be robust against even a trivial amount of it.