Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle all Unicode whitespace for UnicodeBreakProperties.find_words #575

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

phy1729
Copy link

@phy1729 phy1729 commented Dec 20, 2024

Currently UnicodeBreakProperties.find_words("foo \nbar") results in two Words where the first has word: "foo \n", whitespace: "". I think it makes more sense when following the Unicode line break algorithm to include all whitespace not just spaces as UAX 14 breaks after a newline.

I left Word::from_unicode pub(crate) to not change the public API.

@mgeisler
Copy link
Owner

Hi @phy1729,

Thanks for the PR and happy new year!

Currently UnicodeBreakProperties.find_words("foo \nbar") results in two Words where the first has word: "foo \n", whitespace: "". I think it makes more sense when following the Unicode line break algorithm to include all whitespace not just spaces as UAX 14 breaks after a newline.

I see... I think the API I made is a bit clumsy or misleading. Textwrap is generally meant to wrap a single paragraph of text at a time. This is reflected in how wrap and fill will split on newlines and wrap each line independently.

However, there is of course nothing about this in the documentation for Word itself — the only hint is that UnicodeBreakProperties talk about line here and there.

In general, I was hoping that Word would be a trivial little struct that others could replace as needed. So you should not feel constrained by how it works. An example of this is in the Wasm demo:

pub struct CanvasWord<'a> {
word: &'a str,
width: f64,
whitespace: &'a str,
whitespace_width: f64,
penalty: &'a str,
penalty_width: f64,
}

However, I realize that this idea isn't how people want to use the current API and I would like to improve on this.

@mgeisler
Copy link
Owner

Looking at this some more, I think the output between the AsciiSpace and UnicodeBreakProperties enum variants ought to be aligned for ASCII input. This is not at all the case today, where this little program

use textwrap::core::Word;
use textwrap::WordSeparator::{AsciiSpace, UnicodeBreakProperties};

fn main() {
    let text = "foo   \nbar";
    dbg!(UnicodeBreakProperties.find_words(text).collect::<Vec<_>>());
    dbg!(AsciiSpace.find_words(text).collect::<Vec<_>>());

gives me

[examples/tmp.rs:7:5] UnicodeBreakProperties.find_words(text).collect::<Vec<_>>() = [
    Word {
        word: "foo   \n",
        whitespace: "",
        penalty: "",
        width: 6,
    },
    Word {
        word: "bar",
        whitespace: "",
        penalty: "",
        width: 3,
    },
]
[examples/tmp.rs:8:5] AsciiSpace.find_words(text).collect::<Vec<_>>() = [
    Word {
        word: "foo",
        whitespace: "   ",
        penalty: "",
        width: 3,
    },
    Word {
        word: "\nbar",
        whitespace: "",
        penalty: "",
        width: 3,
    },
]

@mgeisler
Copy link
Owner

Can I ask what your use-case is for constructing words yourself? Also, have you tried the Custom variant?

I recently realized that the constructor for Word is very limiting since it doesn't allow you to set the word, penalty and whitespace yourself — and you cannot touch the cached length. This ought to be reworked, I'm afraid.

@phy1729
Copy link
Author

phy1729 commented Jan 14, 2025

I'm using the result of find_words to find the widest word in each column of a table to set a minimum width. https://github.com/phy1729/zxcv/blob/8764f7ad0ce3b19e16fa922cb2f4163b454ef6a1/src/html/table.rs#L62-L66 . So I'm not constructing words myself, I was just surprised that there was trailing whitespace in the word member which caused some incorrect minimums.

I don't think the words for AsciiSpace and UnicodeBreakProperties can be aligned even for just ASCII inputs without an API break as UAX 14 mandates the break happen after the newline and AsciiSpace as its name implies only considers spaces. If it were AsciiWhitespace and char::is_ascii_whitespace used, I think the words would be the same, but I haven't done a careful reading of the standard to be sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants