Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Whitespace in pasted HTML is not handled according to the HTML spec #2713

Closed
12joan opened this issue Oct 24, 2023 · 23 comments · Fixed by #2729
Closed

Whitespace in pasted HTML is not handled according to the HTML spec #2713

12joan opened this issue Oct 24, 2023 · 23 comments · Fixed by #2729
Labels
💎 Bounty bug Something isn't working plugin:deserialize-html Html deserializer 💰 Rewarded

Comments

@12joan
Copy link
Collaborator

12joan commented Oct 24, 2023

Note to anyone attempting the bounty: Please read the Expected Behaviour section carefully and write tests to check that all cases are implemented correctly. Discuss it with us via this issue or Discord if anything is unclear.

Deadline: Thursday 2 November 2023

Clarifications:

Description

When pasting HTML into Plate via the deserialize HTML plugin, the pasted HTML is parsed inside getFragment using parseHtmlDocument before being passed as a HTML element to deserializeHtml. As a result of the logic inside deserializeHtml, the stripWhitespace option is ignored when the given argument is a HTML element instead of a string, hence whitespace is not stripped from pasted HTML.

Note that this is the intended behaviour in some circumstances, such as <pre> tags, but not all. See Expected Behaviour.

Context

When copying HTML from Firefox, Firefox inserts additional line feed characters (\n) at regular intervals. For example, the same paragraph with no newlines becomes the following when opened in and copied from Firefox:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Risus 
pretium quam vulputate dignissim suspendisse in est. Commodo elit at 
imperdiet dui accumsan sit amet nulla facilisi. Bibendum at varius vel 
pharetra vel turpis. Urna nec tincidunt praesent semper. Libero volutpat
 sed cras ornare arcu. Phasellus vestibulum lorem sed risus ultricies 
tristique. Tempus iaculis urna id volutpat lacus laoreet non. Lacus 
vestibulum sed arcu non odio euismod lacinia at quis. Quis lectus nulla 
at volutpat. Auctor urna nunc id cursus metus aliquam. Diam volutpat 
commodo sed egestas egestas fringilla phasellus faucibus scelerisque. 
Odio morbi quis commodo odio aenean sed adipiscing diam. Enim tortor at 
auctor urna. Pulvinar sapien et ligula ullamcorper malesuada proin 
libero nunc consequat. Duis convallis convallis tellus id interdum 
velit. Amet dictum sit amet justo donec enim diam vulputate. Etiam erat 
velit scelerisque in dictum.</p>

This bug results in those same line feed characters appearing inside text nodes when pasting into Plate. While this browser quirk is specific to Firefox, the HTML pasting bug in Plate applies to all browsers.

Steps to Reproduce

  1. Place the above HTML in test.html
  2. Open the HTML file in Firefox
  3. Select the paragraph and copy
  4. Paste the HTML into Plate
Screenshot of HTML pasted into Plate

Expected Behavior

HTML should be parsed as per the HTML spec:

  1. By default, newlines should be ignored
  2. By default, sequences of two or more spaces should be collapsed into a single space

Additionally, this behaviour should be modifiable using the white-space CSS property, regardless of whether this property is included explicitly using a style prop or implicitly though default browser styles. (The <pre> element applies an implicit white-space: pre style.)

See MDN's docs for a complete description of each possible value

image

Finally, as per the HTML spec, 4.4.3 The pre element:"In the HTML syntax, a leading newline character immediately following the pre element start tag is stripped."

Environment

  • slate: N/A
  • slate-react: N/A
  • browser: all browsers

Bounty

Click here to add a bounty via Algora.

Funding

  • You can sponsor this specific effort via a Polar.sh pledge below
  • We receive the pledge once the issue is completed & verified
Fund with Polar
@12joan 12joan added bug Something isn't working plugin:deserialize-html Html deserializer labels Oct 24, 2023
@12joan
Copy link
Collaborator Author

12joan commented Oct 24, 2023

/bounty 150

@algora-pbc
Copy link

algora-pbc bot commented Oct 24, 2023

💎 $150 bounty created by 12joan
🙋 If you start working on this, comment /attempt #2713 to notify everyone
👉 To claim this bounty, submit a pull request that includes the text /claim #2713 somewhere in its body
📝 Before proceeding, please make sure you can receive payouts in your country
💵 Payment arrives in your account 2-5 days after the bounty is rewarded
💯 You keep 100% of the bounty award
🙏 Thank you for contributing to udecode/plate!














































Attempt Started (GMT+0) Solution
🔴 @mrkirthi-24 Oct 24, 2023, 11:56:43 AM WIP
🔴 @AayushMohan Oct 24, 2023, 9:36:38 PM WIP
🟢 @Dhoni77 Oct 25, 2023, 3:54:50 AM WIP
🟢 @RANJEETJ06 Oct 25, 2023, 10:13:17 AM WIP
🔴 @rishi-raj-jain Oct 25, 2023, 8:54:41 PM WIP
🟢 @jkcs Oct 27, 2023, 2:20:25 PM #2718
🔴 @XooTB Oct 29, 2023, 6:41:37 AM WIP

@mrkirthi-24
Copy link

mrkirthi-24 commented Oct 24, 2023

/attempt #2713

Options

1 similar comment
@Dhoni77
Copy link

Dhoni77 commented Oct 25, 2023

/attempt #2713

Options

@algora-pbc
Copy link

algora-pbc bot commented Oct 25, 2023

Note: The user @AayushMohan is already attempting to complete issue #2713 and claim the bounty. If you attempt to complete the same issue, there is a chance that @AayushMohan will complete the issue first, and be awarded the bounty. We recommend discussing with @AayushMohan and potentially collaborating on the same solution versus creating an alternate solution.

@12joan
Copy link
Collaborator Author

12joan commented Oct 25, 2023

To anyone wondering where in the code to implement the solution:

Proper HTML parsing as per the spec should be part of the deserializeHtml function or the chain of functions it depends on. One solution would be to add a pre-processor function that transforms a HTML tree according to the whitespace rules before passing the result to deserializeHtmlElement. This pre-processor could be disabled if the existing stripWhitespace option is set to false.

@12joan
Copy link
Collaborator Author

12joan commented Oct 25, 2023

I encourage anyone working on this to join us on Discord and share your thoughts in the #contributing channel. Better to collaborate on one good solution than everyone racing to complete one hastily-built solution, right? 🙂

@12joan
Copy link
Collaborator Author

12joan commented Oct 25, 2023

I found some more detailed inforamtion about how HTML parsers should handle whitespace, in case it's helpful to anyone:

https://www.w3.org/TR/CSS21/text.html#white-space-model

@RANJEETJ06
Copy link

RANJEETJ06 commented Oct 25, 2023

/attempt #2713

Options

@12joan
Copy link
Collaborator Author

12joan commented Oct 25, 2023

Discussion on Discord is here: https://discord.com/channels/1026227597115396188/1166709783865331723

@rishi-raj-jain
Copy link
Contributor

@12joan

How to run tests of the file: deserializeHtml.spec.tsx?

Screenshot 2023-10-26 at 2 24 06 AM

@12joan
Copy link
Collaborator Author

12joan commented Oct 25, 2023

@rishi-raj-jain You can run those tests like this:

yarn jest packages/core/src/plugins/html-deserializer/utils/deserializeHtml.spec.tsx

@12joan
Copy link
Collaborator Author

12joan commented Oct 25, 2023

Discussion on potential approaches: https://github.com/udecode/plate/pull/2715/files#r1372355759

@12joan
Copy link
Collaborator Author

12joan commented Oct 25, 2023

For testing purposes, here's an example of an input and its equivalent collapsed HTML. Both of these HTML strings should result in identical Slate values.

Original:

<p>
  Hello  world
  </p>

  <p>
  one     two   
  three
</p>

<pre>
hello     one two
three
four
</pre>

<div style="white-space: pre">
hello     one two
three
four
</div>

<div style="white-space: pre-line">
hello     one two
three
four
</div>

Collapsed:

<p>Hello world</p><p>one two three</p><pre>hello     one two
three
four
</pre><div style="white-space: pre">
hello     one two
three
four
</div><div style="white-space: pre-line">
hello one two
three
four
</div>

@12joan
Copy link
Collaborator Author

12joan commented Oct 26, 2023

Just to let everyone know, it looks like there are a lot of people working on this. I'm going to leave the bounty open for a few days after the first mergable PR is opened, and I'm going to award the bounty to whichever PR presents the highest quality solution (completeness, testing, general code quality), regardless of who submits first.

If you want to avoid wasting your time, I suggest collaborating on a single solution like the Algora comment recommends.

@jkcs
Copy link

jkcs commented Oct 27, 2023

/attempt #2713

Options

@12joan
Copy link
Collaborator Author

12joan commented Oct 28, 2023

As I said here, I'll leave this bounty open until Thursday to give anyone else a chance to submit a PR.

Please do not copy from other contributors' PRs without their permission. This will be treated as a copyright violation.

@XooTB
Copy link

XooTB commented Oct 29, 2023

/attempt #2713

Options

@12joan
Copy link
Collaborator Author

12joan commented Oct 29, 2023

@jkcs and I have discovered some more details about how browsers handle leading and trailing whitespace in <pre> tags and white-space: pre / pre-line.

#2718 (comment)

@12joan
Copy link
Collaborator Author

12joan commented Nov 2, 2023

Thanks everyone for your time. We'll no longer be accepting new PRs for this issue. The bounty will be awarded to @jkcs once the last few issues are resolved.

Copy link

algora-pbc bot commented Nov 4, 2023

💡 @jkcs submitted a pull request that claims the bounty. You can visit your org dashboard to reward.

Copy link

algora-pbc bot commented Nov 6, 2023

@jkcs: Your claim has been rewarded! We'll notify you once it is processed.

Copy link

algora-pbc bot commented Nov 6, 2023

🎉🎈 @jkcs has been awarded $150! 🎈🎊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💎 Bounty bug Something isn't working plugin:deserialize-html Html deserializer 💰 Rewarded
Projects
None yet
7 participants