consider schema.org as the normalized schema? #33
Comments
I've honestly thought a lot about it and couldn't think of any way to reconcile all the different formats into a better standardisation. Currently I'm parsing as many as possible to JSON-LD (RDFa, microdata and JSON-LD) but the others don't really fit. I could make custom identifiers for them, I suppose, but I also couldn't figure out how to resolve multiple representations of the data into the single JSON-LD format. FYI, JSON-LD is the format. Schema.org is just a vocabulary. |
One example is that the extracted result (not the direct parse result) could be a manual combination of the different results into a standard vocabulary like JSON-LD. That's the primary motivator I've seen, but I currently haven't put in the effort to squash down the multiple formats into a single valid JSON-LD object. |
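The "squashing" step discussed above could be pictured roughly as follows. This is only a minimal sketch in Python, assuming each parser's output (RDFa, microdata, JSON-LD) has already been converted to a list of JSON-LD node objects. The function name `squash_nodes` and the merge policy (group nodes by `@id`, keep every distinct value rather than picking a winner) are assumptions for illustration, not what this module does today.

```python
from __future__ import annotations

from typing import Any

SCHEMA_CONTEXT = "https://schema.org"


def _as_list(value: Any) -> list:
    """Return a new list so scalars and lists can be merged uniformly."""
    return list(value) if isinstance(value, list) else [value]


def squash_nodes(*sources: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge nodes from several parse results into one JSON-LD document.

    Nodes sharing an "@id" are merged property by property. Nodes without
    an "@id" cannot be matched, so they are kept as separate entries.
    """
    by_id: dict[str, dict[str, Any]] = {}
    anonymous: list[dict[str, Any]] = []

    for nodes in sources:
        for node in nodes:
            node_id = node.get("@id")
            if node_id is None:
                anonymous.append(dict(node))
                continue
            merged = by_id.setdefault(node_id, {"@id": node_id})
            for key, value in node.items():
                if key == "@id":
                    continue
                if key not in merged:
                    merged[key] = value
                    continue
                # Conflict policy (an assumption): keep all distinct values.
                combined = _as_list(merged[key])
                for item in _as_list(value):
                    if item not in combined:
                        combined.append(item)
                merged[key] = combined[0] if len(combined) == 1 else combined

    return {"@context": SCHEMA_CONTEXT, "@graph": list(by_id.values()) + anonymous}
```

Under this policy, two sources describing the same `@id` with different `name` values would yield one node whose `name` is a list of both. Whether that is what a consumer actually wants is exactly the open question in this thread.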
Oh, but if you have ideas, let me know! I'm definitely very interested in being able to squash the snippets down into a standard object instead of defining a bunch of new interfaces. It seems to me like it'd mostly be like what I do today though, and I got stuck thinking through the squashing algorithm in this case. One big issue I've had is understanding how it should properly be represented. E.g. should all webpages be put into https://schema.org/WebPage and attempt to put everything else into |
Sweet! It sounds like we are on the same page and will think through this more as I get more familiar with the code. I added @danbri in particular as he likely would be the best person to give the guidance you seek. Just a quick note: I think the new format including |
So I did give this some more thought. It'd be great to move everything to using some sort of vocabulary and JSON-LD. What that looks like still seems quite messy though, since it'll probably end up as a mishmash of vocabularies and graph entities in JSON-LD. I'm still not sure how we'd merge them automatically, or even if that's a real issue for users, since I've never really consumed RDF or JSON-LD from a service. I did find some related projects though. There's Apache Tika (which does not use a schema and instead outputs the proprietary metadata in HTTP-header style) and Apache Any23 (which does use RDF vocabularies, but I couldn't find where or how they merge it, if at all, since my offline tests failed to extract RDFa or microdata). Currently I'm aiming for a 0.3 release this week that improves the current scrape result format a little. The current extraction format will likely remain the same. |
I think doing a release with what's in master is a good idea. We can think this through more over the longer term. |
I did a bit (lots) more refactoring of how this module parses and generates metadata/snippets internally and released it as |
Sounds good re non-schema parts
On Feb 7, 2017 14:49, "Blake Embrey" wrote:
I did a bit (lots) more refactoring of how this module parses and generates
metadata/snippets internally and released it as 0.3. The model is hopefully
a bit simpler and getting closer to schema.org overall, since it's largely
inspired by that. The only things missing at this point, I believe, are the
main entity tidy up for web pages, figuring out how to handle the
non-schema.org entities and then adding @context. I'm thinking I'll just
prefix the non-schema.org entities with this repo as the vocabulary prefix
and use the # style to link to places in the README for custom type
documentation - what do you think?
|
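The prefixing idea in the comment above (schema.org as the default vocabulary, plus a repo-specific prefix whose `#` fragments point into the README) might look something like this. It is a hedged sketch only: the thread does not give the real prefix or README URL, so `ex` and `https://example.com/project/README.md#` are placeholders, `CustomSnippet` is a made-up type name, and `expand` is a deliberately simplified stand-in for full JSON-LD term expansion.

```python
# Hypothetical values: the real prefix and README URL are not given in this thread.
CUSTOM_PREFIX = "ex"
CUSTOM_VOCAB = "https://example.com/project/README.md#"


def build_context() -> dict:
    """Build an @context with schema.org as default vocabulary plus one custom prefix."""
    return {"@vocab": "https://schema.org/", CUSTOM_PREFIX: CUSTOM_VOCAB}


def custom_type(name: str) -> str:
    """Return a compact IRI for a non-schema.org type, e.g. "ex:CustomSnippet"."""
    return f"{CUSTOM_PREFIX}:{name}"


def expand(term: str, context: dict) -> str:
    """Expand a term against the context (simplified, not the full JSON-LD algorithm)."""
    if term.startswith("@"):
        return term  # JSON-LD keywords are left untouched
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix  # compact IRI with a known prefix
    if sep:
        return term  # assume it is already an absolute IRI
    return context["@vocab"] + term  # bare term resolves against schema.org


context = build_context()
document = {
    "@context": context,
    "@graph": [
        {"@type": "WebPage", "name": "Example"},
        {"@type": custom_type("CustomSnippet"), "name": "Example"},
    ],
}
```

With this context, `expand("WebPage", context)` resolves to a schema.org IRI, while `expand(custom_type("CustomSnippet"), context)` resolves to a README anchor, so each custom type links to its own documentation.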
This project makes a lot of sense to me, as there are so many different ways this information is embedded these days. I was wondering if it made sense to leverage some of the work at schema.org to standardize the JSON blob that comes back, based on the schemas already defined.
@danbri any thoughts, guidance?