How to parse HTML in markdown as a single node instead of separate nodes for opening tag, content, and closing tag? #745

rr-codes · 2021-06-13T01:52:25Z

rr-codes
Jun 13, 2021

When parsing <sub>hi</sub>, for example, with the following configuration:

import unified from 'unified';
import markdown from 'remark-parse';
import type {Block} from '@notionhq/client/build/src/api-types';
import {parseRoot} from './internal';
import gfm from 'remark-gfm';

export function parseBody(body: string): Block[] {
  const tokens = unified().use(markdown).use(gfm).parse(body);
  return parseRoot(tokens);
}

the generated Markdown AST (mdast) is:

[
  {
    "type": "paragraph",
    "children": [
      {
        "type": "html",
        "value": "<sub>"
      },
      {
        "type": "text",
        "value": "hi"
      },
      {
        "type": "html",
        "value": "</sub>"
      }
    ]
  }
]

For my purposes, it would be more convenient if instead, the HTML tag and content were processed as a single node with children, like:

[
  {
    "type": "paragraph",
    "children": [
      {
        "type": "html",
        "value": "sub",
        "children": [
          {
            "type": "text",
            "value": "hi"
          }
        ]
      }
    ]
  }
]

Is this at all possible to do? Specifically, I just want to remove all HTML tags as well as their content when parsing, which would be very easy to do if it was a single node with children.
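
For the simple inline case, the flat sibling list can still be filtered directly. The following is only a minimal sketch under stated assumptions: `InlineNode` is a hand-written stand-in for the two node shapes shown in the AST above (not the real mdast types), `stripInlineHtml` is a hypothetical helper name, and the tag matching is a rough regular expression rather than a real HTML parser.

```typescript
// Stand-in for the node shapes shown above (assumption, not the real mdast types).
type InlineNode =
  | {type: 'html'; value: string}
  | {type: 'text'; value: string};

// Rough patterns for an opening tag such as <sub> and a closing tag such as </sub>.
const openTag = /^<([a-z][a-z0-9-]*)(\s[^>]*)?>$/i;
const closeTag = /^<\/([a-z][a-z0-9-]*)\s*>$/i;

// A few void elements that never get a closing tag (deliberately incomplete).
const voidTags = new Set(['br', 'hr', 'img', 'input', 'wbr']);

// Hypothetical helper: drop every html node, plus all siblings that sit
// between an opening tag and its matching closing tag.
function stripInlineHtml(children: InlineNode[]): InlineNode[] {
  const kept: InlineNode[] = [];
  const open: string[] = []; // stack of currently open tag names

  for (const node of children) {
    if (node.type === 'html') {
      const opened = openTag.exec(node.value);
      const closed = closeTag.exec(node.value);
      if (opened && !voidTags.has(opened[1].toLowerCase())) {
        open.push(opened[1].toLowerCase());
      } else if (closed && open[open.length - 1] === closed[1].toLowerCase()) {
        open.pop();
      }
      continue; // html nodes themselves are never kept
    }
    // Keep other nodes only while no tag is open.
    if (open.length === 0) kept.push(node);
  }
  return kept;
}
```

This only works while the opening and closing tags are siblings inside the same parent. It does not cover HTML that spans several block-level nodes.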

Answered by wooorm

wooorm · 2021-06-13T08:24:56Z

wooorm
Jun 13, 2021
Maintainer

> I just want to remove all HTML tags as well as their content

What about:

<main>

All
the
markdown
content

</main>

Short answer: no. But it sounds like an XY problem.

8 replies

wooorm Jun 13, 2021
Maintainer

Depending on your Notion goals, it might be better to parse HTML in markdown: go from remark, through remark-rehype, with rehype-raw, to get an HTML AST. That HTML AST can then be transformed to Notion, treating > a and <blockquote>a</blockquote> the same.

rr-codes Jun 13, 2021
Author

@wooorm do you mean like this?

import unified from 'unified';
import markdown from 'remark-parse';
import type {Block} from '@notionhq/client/build/src/api-types';
import {parseRoot} from './internal';
import raw from 'rehype-raw';
import remark2rehype from 'remark-rehype';

export function parseBody(body: string): Block[] {
  const tokens = unified()
    .use(markdown)
    .use(remark2rehype, {allowDangerousHtml: true})
    .use(raw)
    .parse(body);
  return parseRoot(tokens);
}

?

wooorm Jun 13, 2021
Maintainer

Yeah, and then transform the AST in ./internal!
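
Once the HTML has been parsed into an HTML AST (hast), a tag and its content form one element node with children, so removing it becomes a plain tree filter. The sketch below assumes simplified, hand-written stand-ins for the hast `root`, `element`, and `text` shapes (the real types live in `@types/hast` and carry more fields), and `removeElements` is a hypothetical helper name.

```typescript
// Simplified stand-ins for hast nodes (assumption: real hast nodes have more fields).
type Text = {type: 'text'; value: string};
type Element = {
  type: 'element';
  tagName: string;
  properties?: Record<string, unknown>;
  children: Array<Element | Text>;
};
type Root = {type: 'root'; children: Array<Element | Text>};

// Hypothetical helper: return a copy of the tree without the named elements
// and without anything nested inside them.
function removeElements<T extends Root | Element>(node: T, tagNames: Set<string>): T {
  const children = node.children
    // Drop matching elements together with their whole subtree.
    .filter((child) => !(child.type === 'element' && tagNames.has(child.tagName)))
    // Recurse into the elements that remain; text nodes pass through unchanged.
    .map((child) => (child.type === 'element' ? removeElements(child, tagNames) : child));
  return {...node, children};
}

// Usage: the tree that <sub>hi</sub> surrounded by text would roughly become.
const example: Root = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'p',
      children: [
        {type: 'text', value: 'a '},
        {type: 'element', tagName: 'sub', children: [{type: 'text', value: 'hi'}]},
        {type: 'text', value: ' b'},
      ],
    },
  ],
};
const cleaned = removeElements(example, new Set(['sub']));
```

The helper returns a new tree instead of mutating its input, which keeps the original available for other transforms.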

rr-codes Jun 13, 2021
Author

@wooorm Would this change how <sub>hi</sub>, for example, is parsed? It doesn't appear to parse any differently than it did before. I'm a bit confused about what exactly remark2rehype / rehype-raw do differently that can help me accomplish my goal.

wooorm Jun 13, 2021
Maintainer

You’re missing out on the run (runSync) step: https://github.com/unifiedjs/unified#description

const processor = unified().use(...)

await processor.run(processor.parse(doc))