Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parsing microdata when itemProps contains multiple space separated properties #26

Open
iaincollins opened this issue Jul 22, 2019 · 0 comments

Comments

@iaincollins
Copy link

Thanks for creating such a great project!

I ran into a bug parsing microdata content where itemprop contained multiple properties, like in these examples and thought I'd share what I ran into:

<meta data-rh="true" property="article:published" itemprop="datePublished dateCreated" content="2019-07-21T09:00:06.000Z"/>
<span itemProp="publisher copyrightHolder provider sourceOrganization" itemscope="" itemType="http://schema.org/NewsMediaOrganization" itemID="https://www.nytimes.com">
<figure itemprop="associatedMedia image" itemscope itemtype="http://schema.org/ImageObject" data-component="image" class="element element-image img--landscape  fig--narrow-caption fig--has-shares " data-media-id="f82028d62b1edd7417d7d3773c4abf0d4fa86174" id="img-3">
  <meta itemprop="url" content="https://i.guim.co.uk/img/media/f82028d62b1edd7417d7d3773c4abf0d4fa86174/0_272_6435_3861/master/6435.jpg?width=700&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=016df6a3f33eabe3cbca39eb389a60fb">
</figure>

Markup like this is parsed correctly in Google's Structured Data Testing Tool, but web-auto-extractor does not currently split input based on spaces.

I resolved this in a project which uses web-auto-extractor by doing this:

const __transformStructuredData = (structuredData) => {
   let result = structuredData
   Object.keys(result.microdata).forEach(schema => {
     result.microdata[schema].forEach(object => {
       Object.keys(object).forEach(key => {
         if (key.includes(' ')) {
           key.split(' ').forEach(newKey => {
             object[newKey] = object[key]
           })
           delete object[key]
         }
       })
     })
   })
   return result
 }

I'm aware there are some other PRs related to handling whitespace trimming open.

If an enhancement like this appeals I'd be happy to raise a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant