Handle all Unicode whitespace for UnicodeBreakProperties.find_words #575

phy1729 · 2024-12-20T20:49:24Z

Currently UnicodeBreakProperties.find_words("foo \nbar") results in two Words where the first has word: "foo \n", whitespace: "". I think it makes more sense when following the Unicode line break algorithm to include all whitespace not just spaces as UAX 14 breaks after a newline.

I left Word::from_unicode pub(crate) to not change the public API.

mgeisler · 2025-01-12T18:31:36Z

Hi @phy1729,

Thanks for the PR and happy new year!

Currently UnicodeBreakProperties.find_words("foo \nbar") results in two Words where the first has word: "foo \n", whitespace: "". I think it makes more sense when following the Unicode line break algorithm to include all whitespace not just spaces as UAX 14 breaks after a newline.

I see... I think the API I made is a bit clumsy or misleading. Textwrap is generally meant to wrap a single paragraph of text at a time. This is reflected in how wrap and fill will split on newlines and wrap each line independently.

However, there is of course nothing about this in the documentation for Word itself — the only hint is that UnicodeBreakProperties talk about line here and there.

In general, I was hoping that Word would be a trivial little struct that others could replace as needed. So you should not feel constrained by how it works. An example of this is in the Wasm demo:

textwrap/examples/wasm/src/lib.rs

Lines 44 to 51 in c9bd8b0

    
           pub struct CanvasWord<'a> { 
        
               word: &'a str, 
        
               width: f64, 
        
               whitespace: &'a str, 
        
               whitespace_width: f64, 
        
               penalty: &'a str, 
        
               penalty_width: f64, 
        
           }

However, I realize that this idea isn't how people want to use the current API and I would like to improve on this.

mgeisler · 2025-01-12T18:36:33Z

Looking at this some more, I think the output between the AsciiSpace and UnicodeBreakProperties enum variants ought to be aligned for ASCII input. This is not at all the case today, where this little program

use textwrap::core::Word;
use textwrap::WordSeparator::{AsciiSpace, UnicodeBreakProperties};

fn main() {
    let text = "foo   \nbar";
    dbg!(UnicodeBreakProperties.find_words(text).collect::<Vec<_>>());
    dbg!(AsciiSpace.find_words(text).collect::<Vec<_>>());

gives me

[examples/tmp.rs:7:5] UnicodeBreakProperties.find_words(text).collect::<Vec<_>>() = [
    Word {
        word: "foo   \n",
        whitespace: "",
        penalty: "",
        width: 6,
    },
    Word {
        word: "bar",
        whitespace: "",
        penalty: "",
        width: 3,
    },
]
[examples/tmp.rs:8:5] AsciiSpace.find_words(text).collect::<Vec<_>>() = [
    Word {
        word: "foo",
        whitespace: "   ",
        penalty: "",
        width: 3,
    },
    Word {
        word: "\nbar",
        whitespace: "",
        penalty: "",
        width: 3,
    },
]

mgeisler · 2025-01-12T18:38:57Z

Can I ask what your use-case is for constructing words yourself? Also, have you tried the Custom variant?

I recently realized that the constructor for Word is very limiting since it doesn't allow you to set the word, penalty and whitespace yourself — and you cannot touch the cached length. This ought to be reworked, I'm afraid.

phy1729 · 2025-01-14T05:53:54Z

I'm using the result of find_words to find the widest word in each column of a table to set a minimum width. https://github.com/phy1729/zxcv/blob/8764f7ad0ce3b19e16fa922cb2f4163b454ef6a1/src/html/table.rs#L62-L66 . So I'm not constructing words myself, I was just surprised that there was trailing whitespace in the word member which caused some incorrect minimums.

I don't think the words for AsciiSpace and UnicodeBreakProperties can be aligned even for just ASCII inputs without an API break as UAX 14 mandates the break happen after the newline and AsciiSpace as its name implies only considers spaces. If it were AsciiWhitespace and char::is_ascii_whitespace used, I think the words would be the same, but I haven't done a careful reading of the standard to be sure.

Handle all Unicode whitespace for UnicodeBreakProperties.find_words

6ccf7d0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle all Unicode whitespace for UnicodeBreakProperties.find_words #575

Handle all Unicode whitespace for UnicodeBreakProperties.find_words #575

phy1729 commented Dec 20, 2024

mgeisler commented Jan 12, 2025

mgeisler commented Jan 12, 2025

mgeisler commented Jan 12, 2025

phy1729 commented Jan 14, 2025

Handle all Unicode whitespace for UnicodeBreakProperties.find_words #575

Are you sure you want to change the base?

Handle all Unicode whitespace for UnicodeBreakProperties.find_words #575

Conversation

phy1729 commented Dec 20, 2024

mgeisler commented Jan 12, 2025

mgeisler commented Jan 12, 2025

mgeisler commented Jan 12, 2025

phy1729 commented Jan 14, 2025