Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Want to handle whitespace in pasted HTML according to HTML spec #2718

Closed
wants to merge 11 commits into from
Closed

Want to handle whitespace in pasted HTML according to HTML spec #2718

wants to merge 11 commits into from

Conversation

jkcs
Copy link

@jkcs jkcs commented Oct 28, 2023

Description

See changesets.

Close #2713
/claim #2713

@changeset-bot
Copy link

changeset-bot bot commented Oct 28, 2023

⚠️ No Changeset found

Latest commit: 1c3ca24

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel
Copy link

vercel bot commented Oct 28, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
plate ❌ Failed (Inspect) Nov 4, 2023 9:05am

@reviewpad
Copy link
Contributor

reviewpad bot commented Oct 28, 2023

Thank you @jkcs for this first contribution!

Copy link
Collaborator

@12joan 12joan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Like I said in the issue, I'll leave the bounty open until Thursday and award the bounty to the most complete PR. This PR is already looking pretty good, aside from the few comments I've left.

Comment on lines 6 to 37
function getStyleFromNode(node: HTMLElement | ChildNode): string {
if (!(node as HTMLElement).getAttribute) return '';

const styleAttr = (node as HTMLElement).getAttribute('style');

return styleAttr ? styleAttr.trim() : '';
}

// Can use the npm package style-to-object.
// https://www.npmjs.com/package/style-to-object
function styleToObject(styleText: string): Record<string, string> {
const styleObject = {};
const declarations = styleText.split(';');

const camelCase = (str: string) => {
return str.replaceAll(/-([a-z])/g, (match: string, letter: string) => {
return letter.toUpperCase();
});
};

declarations.forEach((declaration) => {
declaration = declaration.trim();
if (declaration) {
const [property, value] = declaration.split(':');
const propertyName = camelCase(property.trim());
// @ts-ignore
styleObject[propertyName] = value.trim();
}
});

return styleObject;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we can't just use node.parentNode.style, which returns a fully typed CSSProperties already?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style attribute only produces results when the DOM is actually inserted into the window.document. Before being inserted into the page, it will always be an empty string.

Copy link
Collaborator

@12joan 12joan Oct 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. That explains some weird results I saw while testing in the console. 👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I've just noticed is we're already using .style in other parts of the deserialization pipeline. For example, in createParagraphPlugin, we check if the paragraph element has a particular font-family using el.style.fontFamily. The element object comes from (in reverse order): pluginDeserializeHtml, pipeDeserializeHtmlElement, htmlElementToElement and deserializeHtmlNode, the last of which is the function that also calls htmlTextNodeToString. Any idea why style is working in createParagraphPlugin but not here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this further, and here's what I've found:

  • el.style always returns a CSSProperties object matching the literal style attribute on a HTML element
    • This works regardless of whether the element has been inserted into the DOM
    • The default value for missing styles is the empty string
    • The style property for a <pre> tag does not contain whiteSpace unless this is set explicitly
  • window.getComputedStyle(el) also returns a CSSProperties object
    • This only works if the element has been inserted into the DOM
    • The computed whiteSpace property for a <pre> element is 'pre'

Inserting the element into the DOM prior to checking its style isn't a viable solution (since this would evaluate any script tags). However, it looks like the normal style property works the same way as your styleToObject function, while also being more strongly typed.

Can you check to see if replacing styleToObject with just node.style works here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my mistake, I realize it is possible.


if (parentNode.nodeType === Node.ELEMENT_NODE && styles['whiteSpace']) {
// If parentNode.nodeType === 'PRE', do not perform any processing here.
return styles['whiteSpace'];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever solution! Do we need to handle the white-space: inherit case here?

Is there any special processing we need for actual pre elements, or is this handled elsewhere?

Copy link
Author

@jkcs jkcs Oct 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the <pre> tag, if it needs to be handled, it should be addressed elsewhere. My suggestion is not to process the content within the <pre> tag. The expected behavior of the <pre> tag is already satisfactory.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle it when it is set to "initial"?

Copy link
Collaborator

@12joan 12joan Oct 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing whitespace within the <pre> tag works correctly when Plate's code block plugin is enabled, but not otherwise. You can test this by pressing the "Customize" button on the playground and unchecking "Code block".

If users are pasting HTML content containing a <pre> tag into an editor that doesn't support code block, I think they would still expect the line breaks to be preserved.

Please could you add support for the <pre> tag to the core deserializeHtml? If you need to remove any redundant logic in deserializeHtmlCodeBlockPre.ts to prevent conflicts, that should be fine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle it when it is set to "initial"?

Yes, nice catch. 👍

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for missing the implementation of the <pre> tag. It does not conflict with a code block.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. You were right to question whether it was (or should be) within the spec. 🙂

@jkcs
Copy link
Author

jkcs commented Oct 29, 2023

@12joan I have made adjustments to the code and added some tests. If I have more time, I will add more test cases. Regarding the behavior of the <pre> tag, I believe it should not be handled in the html-deserializer plugin.

Copy link
Collaborator

@12joan 12joan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making those changes. I've resolved most of the previous batch of review comments, except for the one where we discuss the <pre> tag.

Ideally, the core deserializeHtml function should be able to parse whitespace in HTML self-sufficiently (without relying on other plugins); since <pre> tags get special treatment in the HTML spec, they need special treatment here too.

Copy link
Collaborator

@12joan 12joan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again! The behaviour looks correct to me now.

Just a couple of housekeeping things before we can consider this PR mergeable:

  • Make sure all the CI tests are passing correctly
    • yarn build
    • yarn lint:fix
    • yarn test
  • Don't forget to add a changeset describing what's changed
    • This feels to me like enough of a breaking change that it should be a major release, considering that we've changed how pasted HTML is parsed and also made breaking changes to several function interfaces. (@zbeyens Any comment?)

@jkcs
Copy link
Author

jkcs commented Oct 30, 2023

Thanks again! The behaviour looks correct to me now.

Just a couple of housekeeping things before we can consider this PR mergeable:

  • Make sure all the CI tests are passing correctly

    • yarn build
    • yarn lint:fix
    • yarn test
  • Don't forget to add a changeset describing what's changed

    • This feels to me like enough of a breaking change that it should be a major release, considering that we've changed how pasted HTML is parsed and also made breaking changes to several function interfaces. (@zbeyens Any comment?)

Please wait for other contributors' pull requests. I'm really curious to know if there are any other solutions.

@12joan
Copy link
Collaborator

12joan commented Nov 1, 2023

We've got a failure in the DOCX serializer, and I believe the failure is warranted (i.e. the code needs to be fixed, not the test).

https://github.com/udecode/plate/actions/runs/6687641341/job/18271730409?pr=2718#step:9:582

It looks like we're trimming whitespace from text nodes irrespective of their position in the enclosing element. I can reproduce the bug with the input HTML string <p><strong>Hello </strong>world</p>, in which the space is incorrectly removed.

My apologies; I had overlooked this case when compiling the issue spec. I'll tip an additional $25 if you can resolve this issue as well.

For a given list of DOM children, we want to:

  • Trim all spaces from the start of the text node only if the text node is the first child
  • Trim all spaces from the end of the text node only if the text node is the last child

Conversion of newlines into spaces and space collapsing should happen regardless of the text node's position, as it is now. Compare with standard browser behaviour if anything's unclear.

@jkcs jkcs closed this Nov 2, 2023
@jkcs jkcs reopened this Nov 2, 2023
@jkcs
Copy link
Author

jkcs commented Nov 2, 2023

okay, I will fix it and add more scenarios for testing.

@jkcs
Copy link
Author

jkcs commented Nov 2, 2023

This problem cannot be simply solved by patching. I have come up with many different scenarios. I need to reconsider the implementation approach, and the next time I submit, it will be a completely new implementation. Currently, I am considering adding a separate mergeWhitespace method within deserializeHtml instead of placing it inside htmlTextNodeToString. mergeWhitespace will traverse the entire node tree to achieve the desired effect.

If you have any suggestions, please feel free to let me know.

@12joan
Copy link
Collaborator

12joan commented Nov 2, 2023

I haven't thought about it in much detail, so I might be missing something, but couldn't we just pass a couple of options { isFirst, isLast } down the callstack to our htmlTextNodeToString? Setting the options in deserializeHtmlNodeChildren would be a good start, although there might be other places we need to consider.

Failing that, a dedicated mergeWhitespace function is a good idea.

@jkcs
Copy link
Author

jkcs commented Nov 3, 2023

I have discovered something interesting.

The browser behavior for <p><strong>Hello </strong>world</p> is consistent with <p><strong>Hello</strong> world</p>.
This causes some issues with the formatting of my new approach for <p><strong> Hello </strong> world </p>.

My current solution involves using a binary tree where a node on the left side determines whether to include preceding spaces based on its previous element, and a node on the right side determines whether to remove trailing spaces based on its next element.
This idea would format the document in this scenario as <p><strong>Hello </strong> world</p>.

Let me think of other possible solutions.

@12joan
Copy link
Collaborator

12joan commented Nov 3, 2023

The rules for whitespace parsing are much more complicated than I originally thought.

In the following HTML string, the space before "world" is included.

<div><span>Hello</span><span> world</span></div>
hello (space) world

But here, with the added <div></div>, the space before "world" is collapsed.

<div><span>Hello</span><div></div><span> world</span></div>
hello (no space) world, on separate lines due to the div

I think this is because the added <div></div> interrupts the inline formatting context, causing the inline nodes before and after the interruption to be treated as separate inline formatting contexts.

Additionally, I've just discovered that up to one count of trailing whitespace is allowed for all text nodes. I'm not sure how I've never noticed this before.

<!-- Exactly one space is kept -->
<div>Hello      </div>

<!-- Line break is converted to a space -->
<div>Hello
</div>

<!-- All leading whitespace is removed -->
<div>    Hello</div>

<!-- Space between block elements is removed -->
<div>Hello</div> <div>world</div>

<!-- Whitespace between inline elements is collapsed to a single space -->
<div style="display: inline">Hello</div>   <div style="display: inline">world</div>

@12joan
Copy link
Collaborator

12joan commented Nov 3, 2023

Another tricky case: Whitespace in one inline element can affect whitespace in a subsequent inline element.

<!-- A single space is rendered inside the strong -->
<strong>Hello </strong><em> world</em>

<!-- The space is rendered inside the em instead -->
<strong>Hello</strong><em> world</em>

@12joan
Copy link
Collaborator

12joan commented Nov 3, 2023

If, at any point, you want to stop working on this and accept the bounty, just let me know and I can take over from here. You've done all that was asked of you in the original issue and more. 😅

@jkcs
Copy link
Author

jkcs commented Nov 3, 2023

I have spent a lot of time on this today, and it is indeed much more complex than I initially imagined. I have been reading the source code of various other editors today and found that their solutions are also quite complex, and none of them address the issue with the white-space style.

Nevertheless, I have come up with a possible solution that I would like to try and complete the code for, although it may not cover all scenarios. Dealing with the removal and preservation of spaces is much more complicated than anticipated. However, I believe that completing about 50% of the code may solve 90% of the problem.

Thank you very much, and I suggest keeping the bounty. I will finish the code for this solution in the next few days. It may not be ready for merging, but it can serve as a valuable reference for future contributors.

@reviewpad reviewpad bot added large Pull request is large and removed medium Pull request is medium labels Nov 4, 2023
@jkcs jkcs closed this Nov 4, 2023
@jkcs
Copy link
Author

jkcs commented Nov 4, 2023

@12joan I have closed this PR. I apologize, this is my final code, but there are still some issues. These are some references I used for the handling methods, you can take a look when you have time.

@jkcs jkcs changed the title fix: fix TextNode space Want to handle whitespace in pasted HTML according to HTML spec Nov 4, 2023
@12joan
Copy link
Collaborator

12joan commented Nov 4, 2023

@jkcs Thanks so much for all your work on this. I'll definitely take a look when I get the chance.

@12joan
Copy link
Collaborator

12joan commented Nov 6, 2023

@jkcs Here's what I eventually came up with to solve this: #2729.

I'm going to go ahead and award you the bounty, since you've put more time into this than anyone else (including me), and I wouldn't have been able to solve this fully without your contributions.

@jkcs
Copy link
Author

jkcs commented Nov 7, 2023

@12joan Wow, that's amazing! Thank you for the compliment and the reward. I just saw your PR, and it's even more comprehensive and elegant than what I had considered. It's definitely something worth learning from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🙋 Bounty claim large Pull request is large
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Whitespace in pasted HTML is not handled according to the HTML spec
2 participants