ACP: str::chunks with chunks being &str #592

Open
tkr-sh opened this issue May 25, 2025 · 4 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@tkr-sh

tkr-sh commented May 25, 2025

Proposal

Problem statement

The std currently provides various methods for chunking slices (chunks, chunks_exact, rchunks, utf8_chunks, ...) and iterators (array_chunks).
However, there is no equivalent method for string slices, so the developer experience of chunking a &str leaves room for improvement.

Motivating examples or use cases

Chunking is often needed when working with data that can be seen as an iterator.
This is why there are methods for it on slices and iterators.
But there are none for &str, even though it would often be useful!
Here are some examples:

  • Converting binary or hexadecimal strings into an iterator of integers.
    Currently we would do
let hex = "0xABCDEF";
let values = hex[2..]
    .bytes()
    .array_chunks::<2>()  // unstable
    .map(|arr| u8::from_str_radix(str::from_utf8(&arr).unwrap(), 16));  // needs .unwrap()

// Instead of possibly doing

let values = hex[2..]
    .chunks(2)
    .map(|str| u8::from_str_radix(str, 16));
  • Processing some padded data like hello---only----8-------chars---
  • Wrapping some text safely
let user_text = "...";
user_text.chunks(width).intersperse("\n").collect::<String>()
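
For comparison, the hexadecimal case can be approximated on stable Rust today by chunking the underlying bytes. This is only a sketch: parse_hex_pairs is a hypothetical helper name, and it assumes the input is pure ASCII, which is exactly the restriction a char-aware str::chunks would lift.

```rust
/// Parses a hex string such as "0xABCDEF" into bytes, two digits at a time.
/// `parse_hex_pairs` is a hypothetical helper, not part of std or of this proposal.
/// Returns `None` for non-ASCII input or invalid hex digits.
fn parse_hex_pairs(hex: &str) -> Option<Vec<u8>> {
    // Drop an optional "0x" prefix.
    let digits = hex.strip_prefix("0x").unwrap_or(hex);
    // Byte chunking is only safe here because every ASCII char is one byte.
    if !digits.is_ascii() {
        return None;
    }
    digits
        .as_bytes()
        .chunks(2)
        .map(|pair| {
            // Cannot fail for ASCII input, but the conversion is still required.
            let pair = std::str::from_utf8(pair).ok()?;
            u8::from_str_radix(pair, 16).ok()
        })
        .collect()
}
```

The round trip through as_bytes() and str::from_utf8 is the boilerplate that the proposed method would remove.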

Overall, anything that involves handling data with a repetitive pattern, or wrapping or formatting text, would benefit from this function.

Another problem is that array_chunks doesn't behave like slice::chunks: the last chunk is discarded if it is shorter than chunk_size, which isn't always wanted.
But if you want the slice::chunks behaviour in the current context, you have to create an unnecessary vector:

let vec = "hello world".chars().collect::<Vec<_>>(); // Really inefficient
vec.as_slice().chunks(4) // ["hell", "o wo", "rld"]
// instead of just
"hello world".chunks(4) // ["hell", "o wo", "rld"]

It's

  1. more code
  2. less readable
  3. owning some unnecessary data
  4. losing the borrowing lifetime of the initial string slice
fn example_when_owning(s: &str) -> Vec<&str> {
    let vec = s.bytes().collect::<Vec<_>>();
    vec.as_slice()
        .chunks(4)
        .map(|bytes| str::from_utf8(bytes).unwrap())
        .collect() // Error! The returned slices borrow from `vec`, which is local to this function
}

fn example_when_borrowing(s: &str) -> Vec<&str> {
    s.chunks(4).collect() // works fine!
}

Also, str::chunks() is faster than Chars::array_chunks() (without even considering str::from_utf8().unwrap())

Solution sketch

  • Create a new str::Chunks in core/src/str/iter.rs and implement Iterator & DoubleEndedIterator on it
  • Create a new method on str:
pub fn chunks(&self, chunk_size: usize) -> str::Chunks<'_> {
    str::Chunks::new(self, chunk_size)
}

Implementation at https://github.com/tkr-sh/rust/tree/str-chunks
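
To make the intended semantics concrete, here is a minimal sketch of a forward-only iterator written against stable Rust. It is not the implementation linked above: StrChunks is a hypothetical stand-in name, it assumes chunks are counted in chars, and it panics on a chunk size of zero in the same way slice::chunks does.

```rust
/// Hypothetical stand-in for the proposed `str::Chunks` (forward iteration only).
pub struct StrChunks<'a> {
    rest: &'a str,
    chunk_size: usize,
}

impl<'a> StrChunks<'a> {
    /// Assumption: like `slice::chunks`, a chunk size of zero is rejected.
    pub fn new(s: &'a str, chunk_size: usize) -> Self {
        assert!(chunk_size != 0, "chunk size must be non-zero");
        Self { rest: s, chunk_size }
    }
}

impl<'a> Iterator for StrChunks<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        if self.rest.is_empty() {
            return None;
        }
        // Byte offset at which the next chunk starts, or the end of the
        // string when fewer than `chunk_size + 1` chars remain.
        let split = self
            .rest
            .char_indices()
            .nth(self.chunk_size)
            .map_or(self.rest.len(), |(idx, _)| idx);
        let (head, tail) = self.rest.split_at(split);
        self.rest = tail;
        Some(head)
    }
}
```

Because every item is a subslice of the original string, the chunks keep the input's lifetime and nothing is allocated. The final chunk may be shorter than chunk_size, matching slice::chunks rather than array_chunks.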

Drawbacks

.chunks() on &str isn't necessarily clear about whether it operates on u8 or char. However, if the chunks are &str, it makes sense that it operates on chars.
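
The two readings only differ for non-ASCII input. The sketch below uses two hypothetical helpers (byte_chunks and char_chunks, neither of which is proposed API) to show that splitting by byte count can land inside a multi-byte char, whereas splitting by char count always yields valid strings. Both helpers panic on a chunk size of zero, as slice::chunks does.

```rust
/// Splits by byte count. Returns `None` when a chunk boundary falls
/// inside a multi-byte char, since that chunk is not valid UTF-8.
fn byte_chunks(s: &str, n: usize) -> Option<Vec<&str>> {
    s.as_bytes()
        .chunks(n)
        .map(|bytes| std::str::from_utf8(bytes).ok())
        .collect()
}

/// Splits by char count. Always succeeds, but allocates; this is the
/// workaround described in the motivation, not the proposed method.
fn char_chunks(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    chars
        .chunks(n)
        .map(|chunk| chunk.iter().collect::<String>())
        .collect()
}
```

With these helpers, byte_chunks("héllo", 2) fails because é occupies two bytes and the first boundary cuts through it, while char_chunks("héllo", 2) produces "hé", "ll" and "o".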

Alternatives

  • .chars().collect() then vec.as_slice().chunks(), but it's significantly longer and owns data unnecessarily. See motivation.
  • .chars().array_chunks() but it's unstable, slower and doesn't behave in the same way. See motivation.

Links and related work


From rust-lang/rfcs#3818

@tkr-sh tkr-sh added T-libs-api api-change-proposal A proposal to add or alter unstable APIs in the standard libraries labels May 25, 2025
@tkr-sh tkr-sh changed the title str::chunks with chunks being &str ACP: str::chunks with chunks being &str May 25, 2025
@scottmcm
Member

Why is a consistent number of USVs a useful operation to do?

To me this seems like something for unicode-segmentation rather than std.

@clarfonthey

I agree that "number of characters" is generally not a desired operation, and that it's much better to defer to the various Unicode segmentation algorithms instead.

I would argue that most of the issues here are the absence of methods like from_str_radix being available on bytes, although that method in particular is tracked as from_ascii_radix and is currently unstable.

@bluebear94

If you wanted to chunk over chars, you could do str.chars().array_chunks::<N>() (on nightly), though this gives arrays of char instead of string slices. Also, a lot of uses for this seem to be focused on ASCII-based formats, in which case it makes more sense to take in a slice of (the still unstable) ascii::Char (or try to convert your &str to one).

@tkr-sh
Author

tkr-sh commented May 25, 2025

I'm happy to open a PR for unicode_segmentation if you think that's a better idea!

Though I think that wrapping some text to fit a specific format is also a common use case.


If you wanted to chunk over chars, you could do str.chars().array_chunks::<N>()

=>

Another problem is that array_chunks doesn't behave like slice::chunks: the last chunk is discarded if it is shorter than chunk_size, which isn't always wanted.

.chars().array_chunks() but it's unstable, slower and doesn't behave in the same way. See motivation.


Also, a lot of uses for this seem to be focused on ASCII-based formats

I think that only the first one is about ASCII


4 participants