consider schema.org as the normalized schema? #33

svanzoest · 2016-12-30T00:10:58Z

This project makes a lot of sense to me, as there are so many different ways this information is embedded these days. I was wondering if it would make sense to leverage some of the work at schema.org to standardize the JSON blob that comes back, based on the schemas already defined there.

@danbri any thoughts, guidance?

blakeembrey · 2016-12-30T00:36:42Z

I've honestly thought a lot about it and couldn't think of any way to reconcile all the different formats into a better standardisation. Currently I'm parsing as many as possible to JSON-LD (RDFa, microdata and JSON-LD) but the others don't really fit. I could make custom identifiers for them, I suppose, but I also couldn't figure out how to resolve multiple representations of the data into the single JSON-LD format.

FYI, JSON-LD is the format. Schema.org is just a vocabulary.
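
To illustrate the distinction, here is a minimal sketch (not code from this module). The JSON structure and the `@context`/`@type` keywords are JSON-LD; the term names `Article` and `headline` come from the schema.org vocabulary. The `expandTerm` helper and the example values are hypothetical and only handle the simple `@vocab` case.

```typescript
// A JSON-LD node: the "@..." keywords belong to the format,
// every other key is a term resolved against a vocabulary.
interface JsonLdNode {
  "@context"?: Record<string, string>;
  "@type"?: string;
  [property: string]: unknown;
}

const article: JsonLdNode = {
  // "@vocab" makes unprefixed terms resolve against schema.org.
  "@context": { "@vocab": "http://schema.org/" },
  "@type": "Article",
  headline: "Example headline", // hypothetical value
};

// Hypothetical helper: expand a plain term to a full IRI using "@vocab".
function expandTerm(node: JsonLdNode, term: string): string {
  const context = node["@context"];
  const vocab = context === undefined ? undefined : context["@vocab"];
  return vocab === undefined ? term : vocab + term;
}

const headlineIri = expandTerm(article, "headline");
// headlineIri is "http://schema.org/headline"
```

Swapping the `@vocab` value would change what `headline` means without changing the format at all.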

blakeembrey · 2016-12-30T00:38:29Z

One example is that the extracted result (not the direct parse result) could be a manual combination of the different results into a standard format like JSON-LD. That's the primary motivator I've seen, but I haven't yet put in the effort to squash the multiple formats down into a single valid JSON-LD object.

blakeembrey · 2016-12-30T00:43:32Z

Oh, but if you have ideas, let me know! I'm definitely very interested in being able to squash the snippets down into a standard object instead of defining a bunch of new interfaces. It seems to me like it'd mostly be like what I do today though, and I got stuck thinking through the squashing algorithm in this case. One big issue I've had is understanding how it should properly be represented. E.g. should every web page be represented as a https://schema.org/WebPage, with everything else placed under about or some other property? Any guidance would be appreciated here, I've mostly been learning that side of the semantic web through lots and lots of reading.
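
One possible shape for the squashing step, sketched under loud assumptions: every parsed entity carries an `@id`, entities with the same `@id` describe the same thing, and conflicting values are kept side by side as an array (JSON-LD allows multiple values per property). The `Entity` type and the `squash`, `mergeValues` and `toArray` names are hypothetical and this is not what the module currently does.

```typescript
type Entity = { "@id"?: string; [key: string]: unknown };

function toArray(value: unknown): unknown[] {
  if (value === undefined) return [];
  return Array.isArray(value) ? value : [value];
}

// Union two property values, dropping duplicates (compared by JSON text).
function mergeValues(a: unknown, b: unknown): unknown {
  const merged: unknown[] = [];
  for (const value of toArray(a).concat(toArray(b))) {
    const seen = merged.some((m) => JSON.stringify(m) === JSON.stringify(value));
    if (!seen) merged.push(value);
  }
  return merged.length === 1 ? merged[0] : merged;
}

// Merge entities that share an "@id" into one node of a JSON-LD "@graph".
// Entities without an "@id" are passed through untouched.
function squash(entities: Entity[]): { "@graph": Entity[] } {
  const graph: Entity[] = [];
  const indexById: Record<string, number> = Object.create(null);
  for (const entity of entities) {
    const id = entity["@id"];
    const index = id === undefined ? undefined : indexById[id];
    if (id === undefined || index === undefined) {
      if (id !== undefined) indexById[id] = graph.length;
      graph.push({ ...entity });
      continue;
    }
    const target = graph[index];
    for (const key of Object.keys(entity)) {
      target[key] = mergeValues(target[key], entity[key]);
    }
  }
  return { "@graph": graph };
}

// Hypothetical input: the same article found once as microdata, once as RDFa.
const squashed = squash([
  { "@id": "https://example.com/post", "@type": "Article", headline: "Hello" },
  { "@id": "https://example.com/post", "@type": "Article", author: "Jane" },
]);
```

Here `squashed["@graph"]` holds a single `Article` node carrying both `headline` and `author`. The sketch leaves open the harder question raised above, namely how entities without a shared identifier should relate to the page itself.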

svanzoest · 2016-12-30T00:54:14Z

Sweet! It sounds like we are on the same page. I will think through this more as I get more familiar with the code. I added @danbri in particular as he would likely be the best person to give the guidance you seek.

Just a quick note: I think the new format, including canonicalUrl, is a much improved version compared to the current release.

blakeembrey · 2017-02-01T01:24:56Z

So I did give this some more thought. It'd be great to move everything to using some sort of vocabulary and JSON-LD. What that looks like still seems quite messy though, since it'll probably end up as a mishmash of vocabularies and graph entities in JSON-LD. I'm still not sure how we'd merge them automatically, or even if that's a real issue for users since I've never really consumed RDF or JSON from a service.

I did find some related projects though. There's Apache Tika (which does not use a schema and instead outputs the proprietary metadata in HTTP-header style) and Apache Any23 (which does use RDF vocabularies, but I couldn't find where or how they merge them, if at all, since my offline tests failed to extract RDFa or microdata).

Currently I'm aiming for a 0.3 release this week that improves the current scrape result format a little. The current extraction format will likely remain the same.

svanzoest · 2017-02-01T02:14:10Z

I think doing a release with what's in master is a good idea. We can think this through more over the longer term.

blakeembrey · 2017-02-07T19:49:57Z

I did a bit (lots) more refactoring of how this module parses and generates metadata/snippets internally and released it as 0.3. The model is hopefully a bit simpler and getting closer to schema.org overall, since it's largely inspired by that. The only things missing at this point, I believe, are the main entity tidy-up for web pages, figuring out how to handle the non-schema.org entities, and then adding @context. I'm thinking I'll just prefix the non-schema.org entities with this repo as the vocabulary prefix and use the # style to link to places in the README for custom type documentation - what do you think?
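
A rough sketch of what that prefixing could look like, assuming a JSON-LD `@context` with schema.org as `@vocab` plus one extra prefix whose IRI ends in `#`. The `repo` prefix name, the `https://example.com/README.md#` IRI and the `TwitterCard` type name are placeholders for illustration, not the module's real identifiers.

```typescript
// Hypothetical context: schema.org for plain terms, a README anchor
// namespace for everything schema.org does not cover.
const context: Record<string, string> = {
  "@vocab": "http://schema.org/",
  repo: "https://example.com/README.md#", // placeholder IRI
};

// Expand a type name the way a JSON-LD processor would for these two
// simple cases: "prefix:Name" compact IRIs and plain "@vocab" terms.
function expandType(type: string, ctx: Record<string, string>): string {
  const colon = type.indexOf(":");
  if (colon > 0) {
    const base = ctx[type.slice(0, colon)];
    // Unknown prefixes (e.g. "http") mean the value is already an IRI.
    return base === undefined ? type : base + type.slice(colon + 1);
  }
  return ctx["@vocab"] + type;
}

const schemaType = expandType("Article", context);
// schemaType is "http://schema.org/Article"
const customType = expandType("repo:TwitterCard", context);
// customType is "https://example.com/README.md#TwitterCard"
```

With this layout the expanded IRI of a custom type doubles as a link to its documentation anchor in the README.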

danbri · 2017-02-07T20:15:13Z

Sounds good re non-schema parts.

blakeembrey added enhancement help wanted research labels Dec 30, 2016

svanzoest changed the title ~~consider schema.org as the normalized format?~~ consider schema.org as the normalized schema? Dec 30, 2016
