Optimize notebook presentation on Google Scholar #437

Open
bollwyvl opened this issue Mar 26, 2015 · 0 comments
Labels
type:Enhancement A proposed extension to the behavior of the project

While Twitter (#167) is nice for reaching a broad audience, many of the widely linked notebooks we serve are scholarly in nature. At present, none of these are findable on Google Scholar, one of the best tools for getting the right eyeballs on a notebook and giving it a broader impact on science and engineering.

While Google does index nbviewer (really, the only way to find things until we hammer out #405), Google Scholar remains aloof.

Because Google Scholar's fields are the de facto way to publish scholarly metadata, other tools like Mendeley and Zotero would immediately work as well.

It looks like we have to do a few things for this to work.

Metadata

The most arduous is getting users to add and maintain rich metadata for each notebook we serve. Looking at Scholar's inclusion guidelines, it looks like we would need to encourage content authors to provide, at a minimum, the following fields:

  • citation_title or DC.title
  • citation_author or DC.creator
  • citation_publication_date or DC.issued
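As a rough sketch of what nbviewer could emit once those fields exist, the function below turns a notebook-level metadata dict into `<meta>` tags. The keys `title`, `authors` and `publication_date` are hypothetical, since no such convention exists in the notebook format yet, and the date is assumed to already be a string in the slash-separated style Scholar's guidelines use in their examples.

```python
from html import escape


def scholar_meta_tags(metadata):
    """Build citation_* <meta> tags from a notebook metadata dict.

    The keys "title", "authors" and "publication_date" are hypothetical;
    they stand in for whatever convention is eventually agreed on.
    Returns an empty list if any of the minimum fields is missing.
    """
    title = metadata.get("title")
    authors = metadata.get("authors") or []
    published = metadata.get("publication_date")  # e.g. "2015/03/26"
    if not (title and authors and published):
        return []

    # One citation_author tag per author, in the order given.
    fields = [("citation_title", title)]
    fields += [("citation_author", author) for author in authors]
    fields.append(("citation_publication_date", published))

    return [
        '<meta name="{}" content="{}">'.format(name, escape(value, quote=True))
        for name, value in fields
    ]
```

Returning nothing when a field is missing keeps half-described notebooks from advertising themselves to the crawler.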

The challenge here is improving the notebook metadata UI, à la ipython/ipython#6073, to help users select, populate, and check the status of their metadata. Echoing my comments there: we could raise the reward by making the notebook UI itself aware of the metadata, i.e. special formatting for title, authors, abstract, and publication date that nbviewer would recognize. That would make this seem less like drudge-work and help ensure that the metadata stays accurate through forking, etc.

References

Allowing the crawler to find references used in the notebook would add considerably to their impact. Oddly, Scholar doesn't specify a metadata-driven way to do this, instead relying on some semantic markup:

Mark the section of the paper that contains references to other works with a standard heading, such as "References" or "Bibliography", on a line just by itself. Individual references inside this section should be either numbered "1. - 2. - 3." or "[1] - [2] - [3]" in PDF, or put inside an "<ol>" list in HTML. The text of each reference must be a formal bibliographic citation in a commonly used format, without free-form commentary.

This suggests that something like takluyver/cite2c#10 would be valuable.
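To make the target markup concrete, here is a minimal sketch of rendering a list of already-formatted citation strings the way the guidance above describes: a standalone "References" heading followed by an `<ol>`. The function name and the idea that citations arrive as plain strings are assumptions for illustration; this is not how cite2c or nbviewer work today.

```python
from html import escape


def render_references(citations):
    """Render formatted citation strings as a heading plus an <ol> list.

    `citations` is assumed to be a list of plain-text bibliographic
    citations, one per reference, with no free-form commentary.
    Returns an empty string when there is nothing to list.
    """
    if not citations:
        return ""
    items = "\n".join(
        "  <li>{}</li>".format(escape(citation)) for citation in citations
    )
    return "<h2>References</h2>\n<ol>\n{}\n</ol>".format(items)
```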

Discoverability

It could well be that notebooks linked in other places would get picked up by Scholar.

However, if we want to optimize the downstream experience, then we need to create a place where the crawler can actually find things in a reasonable fashion. If successful, we would likely fall immediately into the "more than 100k" club. Here's the guidance from the site:

For websites with more than a hundred thousand papers, we recommend that you create an additional browse interface that lists only the articles added in the last two weeks. This smaller set of webpages can be recrawled more frequently than your entire browse interface, which will facilitate timely coverage of your recent papers by the search robots.

So perhaps the first step of getting to indexed notebooks would be to index those that meet the minimum viable metadata standard, and make "recently published" documents easily discoverable from the front page.
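A minimal sketch of that selection step, assuming each notebook is represented as a dict with hypothetical `title`, `authors` and `publication_date` keys (the last one a `datetime.date`), and using the two-week window from the guidance above:

```python
from datetime import date, timedelta

# Hypothetical minimum viable metadata; mirrors the three fields listed earlier.
REQUIRED_FIELDS = ("title", "authors", "publication_date")


def recently_published(notebooks, today, window_days=14):
    """Return notebooks with complete metadata published in the window.

    `notebooks` is an iterable of dicts; `publication_date` is assumed
    to be a datetime.date. Results are sorted newest first.
    """
    cutoff = today - timedelta(days=window_days)
    recent = [
        nb
        for nb in notebooks
        if all(nb.get(field) for field in REQUIRED_FIELDS)
        and cutoff <= nb["publication_date"] <= today
    ]
    return sorted(recent, key=lambda nb: nb["publication_date"], reverse=True)
```

A front-page "recently published" listing could then be built from `recently_published(candidates, date.today())`, where `candidates` is whatever set of notebooks nbviewer knows about.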

Quality

Undoubtedly, this could become an avenue for spam or just junk. No doubt a simple robot could generate an infinite stream of properly-formatted, difficult-to-detect junk notebooks, with proper MathJax and figures... in fact more easily than humans!

I have no suggestion for how we might combat this: we obviously don't have the manpower to validate every link, or even to put in place a sensible PR-based, crowd-sourced mechanism that would be robust against even a trivial amount of spam.

@bollwyvl bollwyvl added the type:Enhancement A proposed extension to the behavior of the project label Sep 1, 2015