ACP: add `str::chunks`, `str::chunks_exact`, and `str::windows` #590

folkertdev · 2025-05-19T12:33:07Z

Proposal

Problem statement

The slice::chunks, slice::chunks_exact, and slice::windows functions do no exist on str. This is inconsistent with the ability to index a &str by a range. If that is allowed, then intuitively these iterators should also be.

Motivating examples or use cases

This recently came up in the rustweek python FFI workshop exercises:

fn count_ngrams(sequence: &str, k: usize) -> HashMap<&str, usize> {
    let mut ngrams = HashMap::default();

    for window in sequence.as_bytes().windows(k) {
        let key = std::str::from_utf8(window).unwrap();
        *ngrams.entry(key).or_insert(0) += 1;
    }

    ngrams
}

Solution sketch

The str type (and core::str module) should provide the windows, chunks and chunks_exact functions, so that we can instead write:

fn count_ngrams(sequence: &str, k: usize) -> HashMap<&str, usize> {
    let mut ngrams = HashMap::default();

    for key in sequence.windows(k) {
        *ngrams.entry(key).or_insert(0) += 1;
    }

    ngrams
}

The iterator will panic if it tries to yield a subslice that is not valid utf-8, roughly:

struct Windows<'a>(core::slice::Windows<'a, u8>);

impl<'a> Iterator for Windows<'a> {
    type Item = &'a str;

    #[track_caller]
    fn next(&mut self) -> Option<Self::Item> {
        match core::str::from_utf8(self.0.next()?) {
            Ok(subslice) => Some(subslice),
            Err(e) => panic!("{:?}", e),
        }
    }
}

So that

#[test]
fn foo() {
    let mut it = Windows("x🦀".as_bytes().windows(1));

    dbg!(it.next());
    dbg!(it.next());
}

Prints something like

[src/main.rs:19:5] it.next() = Some(
    "x",
)

thread 'foo' panicked at src/main.rs:20:13:
Utf8Error { valid_up_to: 0, error_len: None }

However, we can probably do a better job for the error, similar to &"x🦀"[1..2] showing:

thread 'foo' panicked at src/main.rs:20:16:
byte index 2 is not a char boundary; it is inside '🦀'  (bytes 1..5) of `x🦀`

Alternatives

Just have users do this manually.

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

We think this problem seems worth solving, and the standard library might be the right place to solve it.
We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

The text was updated successfully, but these errors were encountered:

shepmaster · 2025-05-19T12:43:29Z

This feels like it should be covered by the ascii::Char feature:

#![feature(ascii_char)]

fn main() {
    let text = "hello world";
    let ascii_text = text.as_ascii().unwrap();
    for w in ascii_text.windows(3) {
        let w = w.as_str();
        println!("{w}");
    }
}

Notably, this pushes the panic up to a specific point that explains why it could fail (the conversion to ASCII).

shepmaster · 2025-05-19T12:47:51Z

Another alternative is to combine unicode-segmentation and itertools. This then works with non-ASCII text:

use itertools::Itertools; // 0.14.0
use unicode_segmentation::UnicodeSegmentation; // 1.12.0

fn main() {
    let text = "x🦀abcdef";
    
    for (w0, w1, w2) in text.graphemes(true).tuple_windows() {
        println!("{w0}{w1}{w2}");
    }
}

hanna-kruppe · 2025-05-19T17:22:03Z

This is inconsistent with the ability to index a &str by a range. If that is allowed, then intuitively these iterators should also be.

The difference is that slicing by byte positions can be used correctly with non-ASCII strings. Usually by only slicing at byte positions that you got from some source that ensures those positions are char boundaries in the string. Chunks or windows defined in terms of number of bytes are virtually impossible to use on non-ASCII strings without panicking (and the usages that don’t panic are useless).

However, it would be possible to define windows and chunks in terms of number of chars while still yielding sub-&strs. I’m not sure if how strong the motivation there is for this, and it would lose some nice properties of the slice equivalents, but at least it wouldn’t be restricted to ASCII (where [ascii::Char] suffices, as @shepmaster pointed out).

traviscross · 2025-05-20T17:32:13Z

This recently came up in the rustweek python FFI workshop exercises...

@folkertdev, did this come up in the context of trying to match some comparable Python code? If so, what did that code look like?

folkertdev · 2025-05-20T18:11:13Z

it's here https://github.com/tweedegolf/rust-training/blob/4bd69eea6378eec2e73748ddbad0e82164a8694a/exercises/7-rust-for-data-science/1-rust-from-python/main.py#L18-L40

    kmers = defaultdict(int)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i: i + k]
        assert len(kmer) == k
        kmers[kmer] += 1

    return dict(kmers)

So it does just use a bunch of slicing into the string. Note the len(sequence) - k + 1, that manual indexing is something I'd like to avoid.

Btw, rejecting this is totally a valid outcome. It's just something we ran into having to explain (that rust has these nice iterator functions, but we can't quite use them here). This is a tradeoff between convenience and robustness.

hanna-kruppe · 2025-05-20T18:19:27Z

Since that's Python 3, assuming the sequence: str type hint is correct, that code looks at windows of k code points (Rust chars), not of k bytes as this ACP suggested.

Amanieu · 2025-05-20T18:42:36Z

Given the issues related to UTF-8 boundaries causing potential foot-guns and the fact that we're at least not worse than the python version, we have decided to reject this ACP.

programmerjake · 2025-05-20T21:23:46Z

it seems like a better solution would to add a generic iterator adaptor, allowing you to do the_str.chars().array_windows():

pub trait Iterator {
    ...
    fn array_windows<const N: usize>(self) -> ArrayWindows<Self, N>
    where
        Self: Sized,
        Self::Item: Clone,
    {
        ArrayWindows { iter: self, buf: MaybeUninit::uninit(), len: 0 }
    }
}

pub struct ArrayWindows<I: Iterator, const N: usize> {
    iter: I,
    buf: MaybeUninit<[I::Item; N]>,
    len: usize,
}

// impl Drop, Clone, etc.

impl<I: Iterator<Item: Clone>, const N: usize> Iterator for ArrayWindows<I, N> {
    type Item = [I::Item; N];
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let v = self.iter.next()?;
            if self.len < N {
                let buf: &mut [MaybeUninit<I::Item>; N] = unsafe { &mut *(&raw mut buf).cast::<[MaybeUninit<I::Item>; N]>() };
                buf[self.len].write(v);
                self.len += 1;
            } else {
                let buf = unsafe { self.buf.assume_init_mut() };
                buf[0] = v;
                buf.rotate_left(1);
                return Some(buf.clone());
            }
        }
    }
    ...
}

folkertdev added T-libs-api api-change-proposal A proposal to add or alter unstable APIs in the standard libraries labels May 19, 2025

Amanieu closed this as not planned Won't fix, can't repro, duplicate, stale May 20, 2025

ptrca mentioned this issue May 24, 2025

RFC: New method for str: str::chunks(usize) rust-lang/rfcs#3818

Closed

tkr-sh mentioned this issue May 25, 2025

ACP: str::chunks with chunks being &str #592

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ACP: add `str::chunks`, `str::chunks_exact`, and `str::windows` #590

ACP: add `str::chunks`, `str::chunks_exact`, and `str::windows` #590

folkertdev commented May 19, 2025

shepmaster commented May 19, 2025

Uh oh!

shepmaster commented May 19, 2025

Uh oh!

hanna-kruppe commented May 19, 2025 •

edited

Loading

Uh oh!

traviscross commented May 20, 2025 •

edited

Loading

Uh oh!

folkertdev commented May 20, 2025

Uh oh!

hanna-kruppe commented May 20, 2025

Uh oh!

Amanieu commented May 20, 2025

Uh oh!

programmerjake commented May 20, 2025 •

edited

Loading

Uh oh!

ACP: add str::chunks, str::chunks_exact, and str::windows #590

ACP: add str::chunks, str::chunks_exact, and str::windows #590

Comments

folkertdev commented May 19, 2025

Proposal

Problem statement

Motivating examples or use cases

Solution sketch

Alternatives

Links and related work

What happens now?

Possible responses

shepmaster commented May 19, 2025

Uh oh!

shepmaster commented May 19, 2025

Uh oh!

hanna-kruppe commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

traviscross commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

folkertdev commented May 20, 2025

Uh oh!

hanna-kruppe commented May 20, 2025

Uh oh!

Amanieu commented May 20, 2025

Uh oh!

programmerjake commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ACP: add `str::chunks`, `str::chunks_exact`, and `str::windows` #590

ACP: add `str::chunks`, `str::chunks_exact`, and `str::windows` #590

hanna-kruppe commented May 19, 2025 •

edited

Loading

traviscross commented May 20, 2025 •

edited

Loading

programmerjake commented May 20, 2025 •

edited

Loading