Skip to content

ACP: add str::chunks, str::chunks_exact, and str::windows #590

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
folkertdev opened this issue May 19, 2025 · 8 comments
Closed

ACP: add str::chunks, str::chunks_exact, and str::windows #590

folkertdev opened this issue May 19, 2025 · 8 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@folkertdev
Copy link

Proposal

Problem statement

The slice::chunks, slice::chunks_exact, and slice::windows functions do no exist on str. This is inconsistent with the ability to index a &str by a range. If that is allowed, then intuitively these iterators should also be.

Motivating examples or use cases

This recently came up in the rustweek python FFI workshop exercises:

fn count_ngrams(sequence: &str, k: usize) -> HashMap<&str, usize> {
    let mut ngrams = HashMap::default();

    for window in sequence.as_bytes().windows(k) {
        let key = std::str::from_utf8(window).unwrap();
        *ngrams.entry(key).or_insert(0) += 1;
    }

    ngrams
}

Solution sketch

The str type (and core::str module) should provide the windows, chunks and chunks_exact functions, so that we can instead write:

fn count_ngrams(sequence: &str, k: usize) -> HashMap<&str, usize> {
    let mut ngrams = HashMap::default();

    for key in sequence.windows(k) {
        *ngrams.entry(key).or_insert(0) += 1;
    }

    ngrams
}

The iterator will panic if it tries to yield a subslice that is not valid utf-8, roughly:

struct Windows<'a>(core::slice::Windows<'a, u8>);

impl<'a> Iterator for Windows<'a> {
    type Item = &'a str;

    #[track_caller]
    fn next(&mut self) -> Option<Self::Item> {
        match core::str::from_utf8(self.0.next()?) {
            Ok(subslice) => Some(subslice),
            Err(e) => panic!("{:?}", e),
        }
    }
}

So that

#[test]
fn foo() {
    let mut it = Windows("x🦀".as_bytes().windows(1));

    dbg!(it.next());
    dbg!(it.next());
}

Prints something like

[src/main.rs:19:5] it.next() = Some(
    "x",
)

thread 'foo' panicked at src/main.rs:20:13:
Utf8Error { valid_up_to: 0, error_len: None }

However, we can probably do a better job for the error, similar to &"x🦀"[1..2] showing:

thread 'foo' panicked at src/main.rs:20:16:
byte index 2 is not a char boundary; it is inside '🦀'  (bytes 1..5) of `x🦀`

Alternatives

Just have users do this manually.

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
@folkertdev folkertdev added T-libs-api api-change-proposal A proposal to add or alter unstable APIs in the standard libraries labels May 19, 2025
@shepmaster
Copy link
Member

This feels like it should be covered by the ascii::Char feature:

#![feature(ascii_char)]

fn main() {
    let text = "hello world";
    let ascii_text = text.as_ascii().unwrap();
    for w in ascii_text.windows(3) {
        let w = w.as_str();
        println!("{w}");
    }
}

Notably, this pushes the panic up to a specific point that explains why it could fail (the conversion to ASCII).

@shepmaster
Copy link
Member

Another alternative is to combine unicode-segmentation and itertools. This then works with non-ASCII text:

use itertools::Itertools; // 0.14.0
use unicode_segmentation::UnicodeSegmentation; // 1.12.0

fn main() {
    let text = "x🦀abcdef";
    
    for (w0, w1, w2) in text.graphemes(true).tuple_windows() {
        println!("{w0}{w1}{w2}");
    }
}

@hanna-kruppe
Copy link

hanna-kruppe commented May 19, 2025

This is inconsistent with the ability to index a &str by a range. If that is allowed, then intuitively these iterators should also be.

The difference is that slicing by byte positions can be used correctly with non-ASCII strings. Usually by only slicing at byte positions that you got from some source that ensures those positions are char boundaries in the string. Chunks or windows defined in terms of number of bytes are virtually impossible to use on non-ASCII strings without panicking (and the usages that don’t panic are useless).

However, it would be possible to define windows and chunks in terms of number of chars while still yielding sub-&strs. I’m not sure if how strong the motivation there is for this, and it would lose some nice properties of the slice equivalents, but at least it wouldn’t be restricted to ASCII (where [ascii::Char] suffices, as @shepmaster pointed out).

@traviscross
Copy link

traviscross commented May 20, 2025

This recently came up in the rustweek python FFI workshop exercises...

@folkertdev, did this come up in the context of trying to match some comparable Python code? If so, what did that code look like?

@folkertdev
Copy link
Author

it's here https://github.com/tweedegolf/rust-training/blob/4bd69eea6378eec2e73748ddbad0e82164a8694a/exercises/7-rust-for-data-science/1-rust-from-python/main.py#L18-L40

    kmers = defaultdict(int)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i: i + k]
        assert len(kmer) == k
        kmers[kmer] += 1

    return dict(kmers)

So it does just use a bunch of slicing into the string. Note the len(sequence) - k + 1, that manual indexing is something I'd like to avoid.

Btw, rejecting this is totally a valid outcome. It's just something we ran into having to explain (that rust has these nice iterator functions, but we can't quite use them here). This is a tradeoff between convenience and robustness.

@hanna-kruppe
Copy link

Since that's Python 3, assuming the sequence: str type hint is correct, that code looks at windows of k code points (Rust chars), not of k bytes as this ACP suggested.

@Amanieu
Copy link
Member

Amanieu commented May 20, 2025

Given the issues related to UTF-8 boundaries causing potential foot-guns and the fact that we're at least not worse than the python version, we have decided to reject this ACP.

@Amanieu Amanieu closed this as not planned Won't fix, can't repro, duplicate, stale May 20, 2025
@programmerjake
Copy link
Member

programmerjake commented May 20, 2025

it seems like a better solution would to add a generic iterator adaptor, allowing you to do the_str.chars().array_windows():

pub trait Iterator {
    ...
    fn array_windows<const N: usize>(self) -> ArrayWindows<Self, N>
    where
        Self: Sized,
        Self::Item: Clone,
    {
        ArrayWindows { iter: self, buf: MaybeUninit::uninit(), len: 0 }
    }
}

pub struct ArrayWindows<I: Iterator, const N: usize> {
    iter: I,
    buf: MaybeUninit<[I::Item; N]>,
    len: usize,
}

// impl Drop, Clone, etc.

impl<I: Iterator<Item: Clone>, const N: usize> Iterator for ArrayWindows<I, N> {
    type Item = [I::Item; N];
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let v = self.iter.next()?;
            if self.len < N {
                let buf: &mut [MaybeUninit<I::Item>; N] = unsafe { &mut *(&raw mut buf).cast::<[MaybeUninit<I::Item>; N]>() };
                buf[self.len].write(v);
                self.len += 1;
            } else {
                let buf = unsafe { self.buf.assume_init_mut() };
                buf[0] = v;
                buf.rotate_left(1);
                return Some(buf.clone());
            }
        }
    }
    ...
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

6 participants