consider schema.org as the normalized schema? #33

Open
svanzoest opened this issue Dec 30, 2016 · 8 comments

Comments

@svanzoest

This project makes a lot of sense to me, as there are so many different ways this information is embedded these days. I was wondering whether it would make sense to leverage some of the work at schema.org and standardize the JSON blob that comes back on the schemas already defined there.

@danbri any thoughts, guidance?

@blakeembrey
Member

I've honestly thought a lot about it and couldn't think of any way to reconcile all the different formats into a better standardisation. Currently I'm parsing as many as possible to JSON-LD (RDFa, microdata and JSON-LD) but the others don't really fit. I could make custom identifiers for them, I suppose, but I also couldn't figure out how to resolve multiple representations of the data into the single JSON-LD format.

FYI, JSON-LD is the format. Schema.org is just a vocabulary.

@blakeembrey
Member

One example is that the extracted result (not the direct parse result) could be a manual combination of the different results into a standard format like JSON-LD. That's the primary motivator I've seen, but I haven't yet put in the effort to squash the multiple formats down into a single valid JSON-LD object.
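To picture the squashing step, here is a minimal sketch. It assumes each parser already yields a flat dict keyed by schema.org-style property names, and that conflicts are settled by a fixed source priority. The name `squash`, the priority order, and the sample values are all hypothetical and are not what this module does.

```python
from typing import Any, Dict

# Hypothetical priority: earlier sources win when two formats disagree.
SOURCE_PRIORITY = ["jsonld", "microdata", "rdfa"]


def squash(results: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-format results into one JSON-LD-shaped dict.

    For each property, the first non-empty value found when walking the
    sources in priority order is kept.
    """
    merged: Dict[str, Any] = {"@context": "https://schema.org"}
    for source in SOURCE_PRIORITY:
        for key, value in results.get(source, {}).items():
            if value in (None, "", [], {}):
                continue  # empty values must never shadow real data
            merged.setdefault(key, value)
    return merged


example = squash({
    "rdfa": {"name": "Example page", "description": "From RDFa"},
    "jsonld": {"@type": "Article", "name": "Example article"},
})
```

This only covers flat properties. Deciding what to do when two formats describe different `@type` values or nested entities is the hard part, and the sketch does not attempt it.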

@blakeembrey
Member

Oh, but if you have ideas, let me know! I'm definitely very interested in being able to squash the snippets down into a standard object instead of defining a bunch of new interfaces. It seems to me like it'd mostly be like what I do today though, and I got stuck thinking through the squashing algorithm in this case. One big issue I've had is understanding how the result should properly be represented. E.g. should every webpage be represented as a https://schema.org/WebPage, with everything else placed under the about property or something else? Any guidance would be appreciated here; I've mostly been learning that side of the semantic web through lots and lots of reading.
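To make the question concrete, here is a sketch of one possible shape. It assumes every page becomes a schema.org `WebPage` node, with the first extracted entity under `mainEntity` and the rest under `about`. Whether that is the right modelling is exactly the open question; `wrap_in_webpage` and the sample data are hypothetical.

```python
import json
from typing import Any, Dict, List


def wrap_in_webpage(url: str, entities: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap extracted entities in a single schema.org WebPage node."""
    page: Dict[str, Any] = {
        "@context": "https://schema.org",
        "@type": "WebPage",
        "url": url,
    }
    if entities:
        # Assumption: the first entity is treated as the page's main entity.
        page["mainEntity"] = entities[0]
    if len(entities) > 1:
        # Everything else is attached as what the page is "about".
        page["about"] = entities[1:]
    return page


page = wrap_in_webpage(
    "https://example.com/post",
    [
        {"@type": "Article", "headline": "Hello"},
        {"@type": "Person", "name": "Jane Doe"},
    ],
)
print(json.dumps(page, indent=2))
```

Picking the "first" entity as the main one is arbitrary here. A real implementation would need some rule for choosing it.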

@svanzoest changed the title from "consider schema.org as the normalized format?" to "consider schema.org as the normalized schema?" on Dec 30, 2016
@svanzoest
Author

svanzoest commented Dec 30, 2016

Sweet! It sounds like we are on the same page. I will think through this more as I get more familiar with the code. I added @danbri in particular as he would likely be the best person to give the guidance you seek.

Just a quick note: I think the new format including canonicalUrl is a much improved version compared to the current release.

@blakeembrey
Member

So I did give this some more thought. It'd be great to move everything to using some sort of vocabulary and JSON-LD. What that looks like still seems quite messy though, since it'll probably end up as a mishmash of vocabularies and graph entities in JSON-LD. I'm still not sure how we'd merge them automatically, or even if that's a real issue for users since I've never really consumed RDF or JSON from a service.

I did find some related projects though. There's Apache Tika (which does not use a schema and instead outputs the proprietary metadata in HTTP-header style) and Apache Any23 (which does use RDF vocabularies, but I couldn't find where or how they merge them, if at all, since my offline tests failed to extract RDFa or microdata).

Currently I'm aiming for a 0.3 release this week that improves the current scrape result format a little. The current extraction format will likely remain the same.

@svanzoest
Author

I think doing a release with what's in master is a good idea. We can think this through more over the longer term.

@blakeembrey
Member

I did a bit (lots) more refactoring of how this module parses and generates metadata/snippets internally and released it as 0.3. The model is hopefully a bit simpler and getting closer to schema.org overall, since it's largely inspired by that. The only things missing at this point, I believe, are the main entity tidy-up for web pages, figuring out how to handle the non-schema.org entities, and then adding @context. I'm thinking I'll just prefix the non-schema.org entities with this repo as the vocabulary prefix and use the # style to link to places in the README for custom type documentation - what do you think?
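A sketch of what that prefixing might look like, assuming a made-up prefix `meta` mapped to a `#`-anchored README URL. The URL, the prefix name, and the `meta:TwitterCard` type are placeholders for illustration, not anything this project defines.

```python
from typing import Any, Dict

# Placeholder standing in for the repo's README URL (an assumption).
README_URL = "https://example.com/repo/README.md"

CONTEXT: Dict[str, str] = {
    "@vocab": "https://schema.org/",  # unprefixed terms resolve to schema.org
    "meta": README_URL + "#",         # custom types resolve to README anchors
}


def expand_type(type_name: str, context: Dict[str, str] = CONTEXT) -> str:
    """Expand a compact @type into a full IRI using the context.

    Simplified on purpose: real JSON-LD expansion has many more rules.
    """
    prefix, sep, local = type_name.partition(":")
    if sep:
        # Known prefix: expand it. Anything else (e.g. an absolute IRI) is
        # returned unchanged.
        return context[prefix] + local if prefix in context else type_name
    return context["@vocab"] + type_name


doc: Dict[str, Any] = {
    "@context": CONTEXT,
    "@type": "WebPage",
    "mainEntity": {"@type": "meta:TwitterCard", "meta:card": "summary"},
}
```

With this context, `expand_type("meta:TwitterCard")` yields a URL ending in `README.md#TwitterCard`, so each custom type points at its documentation anchor.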

@danbri

danbri commented Feb 7, 2017 via email
