-
-
Notifications
You must be signed in to change notification settings - Fork 755
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Want to handle whitespace in pasted HTML according to HTML spec #2718
Conversation
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
Thank you @jkcs for this first contribution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Like I said in the issue, I'll leave the bounty open until Thursday and award the bounty to the most complete PR. This PR is already looking pretty good, aside from the few comments I've left.
packages/core/src/plugins/html-deserializer/utils/htmlTextNodeToString.spec.ts
Outdated
Show resolved
Hide resolved
function getStyleFromNode(node: HTMLElement | ChildNode): string { | ||
if (!(node as HTMLElement).getAttribute) return ''; | ||
|
||
const styleAttr = (node as HTMLElement).getAttribute('style'); | ||
|
||
return styleAttr ? styleAttr.trim() : ''; | ||
} | ||
|
||
// Can use the npm package style-to-object. | ||
// https://www.npmjs.com/package/style-to-object | ||
function styleToObject(styleText: string): Record<string, string> { | ||
const styleObject = {}; | ||
const declarations = styleText.split(';'); | ||
|
||
const camelCase = (str: string) => { | ||
return str.replaceAll(/-([a-z])/g, (match: string, letter: string) => { | ||
return letter.toUpperCase(); | ||
}); | ||
}; | ||
|
||
declarations.forEach((declaration) => { | ||
declaration = declaration.trim(); | ||
if (declaration) { | ||
const [property, value] = declaration.split(':'); | ||
const propertyName = camelCase(property.trim()); | ||
// @ts-ignore | ||
styleObject[propertyName] = value.trim(); | ||
} | ||
}); | ||
|
||
return styleObject; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we can't just use node.parentNode.style, which returns a fully typed CSSProperties already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The style
attribute only produces results when the DOM is actually inserted into the window.document
. Before being inserted into the page, it will always be an empty string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. That explains some weird results I saw while testing in the console. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I've just noticed is we're already using .style
in other parts of the deserialization pipeline. For example, in createParagraphPlugin
, we check if the paragraph element has a particular font-family using el.style.fontFamily
. The element object comes from (in reverse order): pluginDeserializeHtml
, pipeDeserializeHtmlElement
, htmlElementToElement
and deserializeHtmlNode
, the last of which is the function that also calls htmlTextNodeToString
. Any idea why style
is working in createParagraphPlugin
but not here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this further, and here's what I've found:
el.style
always returns a CSSProperties object matching the literalstyle
attribute on a HTML element- This works regardless of whether the element has been inserted into the DOM
- The default value for missing styles is the empty string
- The
style
property for a<pre>
tag does not containwhiteSpace
unless this is set explicitly
window.getComputedStyle(el)
also returns a CSSProperties object- This only works if the element has been inserted into the DOM
- The computed
whiteSpace
property for a<pre>
element is'pre'
Inserting the element into the DOM prior to checking its style isn't a viable solution (since this would evaluate any script tags). However, it looks like the normal style
property works the same way as your styleToObject
function, while also being more strongly typed.
Can you check to see if replacing styleToObject
with just node.style
works here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my mistake, I realize it is possible.
|
||
if (parentNode.nodeType === Node.ELEMENT_NODE && styles['whiteSpace']) { | ||
// If parentNode.nodeType === 'PRE', do not perform any processing here. | ||
return styles['whiteSpace']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever solution! Do we need to handle the white-space: inherit
case here?
Is there any special processing we need for actual pre elements, or is this handled elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the <pre>
tag, if it needs to be handled, it should be addressed elsewhere. My suggestion is not to process the content within the <pre>
tag. The expected behavior of the <pre>
tag is already satisfactory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle it when it is set to "initial"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parsing whitespace within the <pre>
tag works correctly when Plate's code block plugin is enabled, but not otherwise. You can test this by pressing the "Customize" button on the playground and unchecking "Code block".
If users are pasting HTML content containing a <pre>
tag into an editor that doesn't support code block, I think they would still expect the line breaks to be preserved.
Please could you add support for the <pre>
tag to the core deserializeHtml
? If you need to remove any redundant logic in deserializeHtmlCodeBlockPre.ts
to prevent conflicts, that should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle it when it is set to "initial"?
Yes, nice catch. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize for missing the implementation of the <pre>
tag. It does not conflict with a code block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. You were right to question whether it was (or should be) within the spec. 🙂
packages/core/src/plugins/html-deserializer/utils/htmlTextNodeToString.ts
Outdated
Show resolved
Hide resolved
packages/core/src/plugins/html-deserializer/utils/htmlTextNodeToString.spec.ts
Outdated
Show resolved
Hide resolved
packages/core/src/plugins/html-deserializer/utils/htmlTextNodeToString.spec.ts
Show resolved
Hide resolved
@12joan I have made adjustments to the code and added some tests. If I have more time, I will add more test cases. Regarding the behavior of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making those changes. I've resolved most of the previous batch of review comments, except for the one where we discuss the <pre>
tag.
Ideally, the core deserializeHtml
function should be able to parse whitespace in HTML self-sufficiently (without relying on other plugins); since <pre>
tags get special treatment in the HTML spec, they need special treatment here too.
packages/core/src/plugins/html-deserializer/utils/stripWhitespaceConfig.ts
Outdated
Show resolved
Hide resolved
packages/core/src/plugins/html-deserializer/utils/deserializeHtml.ts
Outdated
Show resolved
Hide resolved
packages/core/src/plugins/html-deserializer/utils/deserializeHtml.spec.tsx
Outdated
Show resolved
Hide resolved
packages/core/src/plugins/html-deserializer/utils/deserializeHtml.spec.tsx
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again! The behaviour looks correct to me now.
Just a couple of housekeeping things before we can consider this PR mergeable:
- Make sure all the CI tests are passing correctly
yarn build
yarn lint:fix
yarn test
- Don't forget to add a changeset describing what's changed
- This feels to me like enough of a breaking change that it should be a major release, considering that we've changed how pasted HTML is parsed and also made breaking changes to several function interfaces. (@zbeyens Any comment?)
Please wait for other contributors' pull requests. I'm really curious to know if there are any other solutions. |
We've got a failure in the DOCX serializer, and I believe the failure is warranted (i.e. the code needs to be fixed, not the test). https://github.com/udecode/plate/actions/runs/6687641341/job/18271730409?pr=2718#step:9:582 It looks like we're trimming whitespace from text nodes irrespective of their position in the enclosing element. I can reproduce the bug with the input HTML string My apologies; I had overlooked this case when compiling the issue spec. I'll tip an additional $25 if you can resolve this issue as well. For a given list of DOM children, we want to:
Conversion of newlines into spaces and space collapsing should happen regardless of the text node's position, as it is now. Compare with standard browser behaviour if anything's unclear. |
okay, I will fix it and add more scenarios for testing. |
This problem cannot be simply solved by patching. I have come up with many different scenarios. I need to reconsider the implementation approach, and the next time I submit, it will be a completely new implementation. Currently, I am considering adding a separate If you have any suggestions, please feel free to let me know. |
I haven't thought about it in much detail, so I might be missing something, but couldn't we just pass a couple of options Failing that, a dedicated |
I have discovered something interesting. The browser behavior for My current solution involves using a binary tree where a node on the left side determines whether to include preceding spaces based on its previous element, and a node on the right side determines whether to remove trailing spaces based on its next element. Let me think of other possible solutions. |
The rules for whitespace parsing are much more complicated than I originally thought. In the following HTML string, the space before "world" is included. <div><span>Hello</span><span> world</span></div> But here, with the added <div><span>Hello</span><div></div><span> world</span></div> I think this is because the added Additionally, I've just discovered that up to one count of trailing whitespace is allowed for all text nodes. I'm not sure how I've never noticed this before. <!-- Exactly one space is kept -->
<div>Hello </div>
<!-- Line break is converted to a space -->
<div>Hello
</div>
<!-- All leading whitespace is removed -->
<div> Hello</div>
<!-- Space between block elements is removed -->
<div>Hello</div> <div>world</div>
<!-- Whitespace between inline elements is collapsed to a single space -->
<div style="display: inline">Hello</div> <div style="display: inline">world</div> |
Another tricky case: Whitespace in one inline element can affect whitespace in a subsequent inline element. <!-- A single space is rendered inside the strong -->
<strong>Hello </strong><em> world</em>
<!-- The space is rendered inside the em instead -->
<strong>Hello</strong><em> world</em> |
If, at any point, you want to stop working on this and accept the bounty, just let me know and I can take over from here. You've done all that was asked of you in the original issue and more. 😅 |
I have spent a lot of time on this today, and it is indeed much more complex than I initially imagined. I have been reading the source code of various other editors today and found that their solutions are also quite complex, and none of them address the issue with the Nevertheless, I have come up with a possible solution that I would like to try and complete the code for, although it may not cover all scenarios. Dealing with the removal and preservation of spaces is much more complicated than anticipated. However, I believe that completing about 50% of the code may solve 90% of the problem. Thank you very much, and I suggest keeping the bounty. I will finish the code for this solution in the next few days. It may not be ready for merging, but it can serve as a valuable reference for future contributors. |
@12joan I have closed this PR. I apologize, this is my final code, but there are still some issues. These are some references I used for the handling methods, you can take a look when you have time. |
@jkcs Thanks so much for all your work on this. I'll definitely take a look when I get the chance. |
@12joan Wow, that's amazing! Thank you for the compliment and the reward. I just saw your PR, and it's even more comprehensive and elegant than what I had considered. It's definitely something worth learning from. |
Description
See changesets.
Close #2713
/claim #2713