ACP: add str::chunks, str::chunks_exact, and str::windows #590
Comments
This feels like it should be covered by the unstable ascii_char feature:
#![feature(ascii_char)]
fn main() {
    let text = "hello world";
    // Fails here, up front, if the text is not pure ASCII.
    let ascii_text = text.as_ascii().unwrap();
    for w in ascii_text.windows(3) {
        let w = w.as_str();
        println!("{w}");
    }
}
Notably, this pushes the panic up to a specific point that explains why it could fail (the conversion to ASCII).
Another alternative is to combine unicode-segmentation and itertools. This then works with non-ASCII text:
use itertools::Itertools; // 0.14.0
use unicode_segmentation::UnicodeSegmentation; // 1.12.0
fn main() {
    let text = "x🦀abcdef";
    for (w0, w1, w2) in text.graphemes(true).tuple_windows() {
        println!("{w0}{w1}{w2}");
    }
}
The difference is that slicing by byte positions can be used correctly with non-ASCII strings, usually by only slicing at byte positions obtained from a source that guarantees they are char boundaries in the string. Chunks or windows defined in terms of a number of bytes are virtually impossible to use on non-ASCII strings without panicking (and the usages that don't panic are useless). However, it would be possible to define windows and chunks in terms of number of
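To make the boundary point concrete, here is a minimal sketch using only stable standard-library calls (`str::char_indices`, `str::get`). The helper names `char_boundaries` and `try_slice` are hypothetical and exist only for this illustration.

```rust
/// Byte offsets at which `s` may be sliced safely:
/// the start of every char, plus `s.len()`.
/// (Hypothetical helper, not a standard-library function.)
fn char_boundaries(s: &str) -> Vec<usize> {
    s.char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()))
        .collect()
}

/// Slices `s` by byte range, returning `None` instead of panicking
/// when either end is out of range or not on a char boundary.
/// (Hypothetical helper wrapping `str::get`.)
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

With `"x🦀ab"`, the crab occupies bytes 1 to 5, so a fixed-width byte window such as `1..2` lands inside it and is rejected, while `1..5` is accepted because both ends come from the boundary list.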
@folkertdev, did this come up in the context of trying to match some comparable Python code? If so, what did that code look like?
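from collections import defaultdict

# The function signature is not shown in the thread; `count_kmers` is a hypothetical name.
def count_kmers(sequence, k):
    kmers = defaultdict(int)
    # Slide a window of length k over the sequence and count each k-mer.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i: i + k]
        assert len(kmer) == k
        kmers[kmer] += 1
    return dict(kmers)
So it does just use a bunch of slicing into the string. Note the
Btw, rejecting this is totally a valid outcome. It's just something we ran into having to explain (that Rust has these nice iterator functions, but we can't quite use them here). This is a tradeoff between convenience and robustness.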
Since that's Python 3, assuming the
Given the issues related to UTF-8 boundaries causing potential foot-guns, and the fact that we're at least not worse than the Python version, we have decided to reject this ACP.
It seems like a better solution would be to add a generic iterator adaptor, allowing you to do
pub trait Iterator {
    ...
    fn array_windows<const N: usize>(self) -> ArrayWindows<Self, N>
    where
        Self: Sized,
        Self::Item: Clone,
    {
        ArrayWindows { iter: self, buf: MaybeUninit::uninit(), len: 0 }
    }
}
pub struct ArrayWindows<I: Iterator, const N: usize> {
    iter: I,
    // Only the first `len` elements are initialized until the buffer is full.
    buf: MaybeUninit<[I::Item; N]>,
    len: usize,
}
// impl Drop, Clone, etc.
impl<I: Iterator<Item: Clone>, const N: usize> Iterator for ArrayWindows<I, N> {
    type Item = [I::Item; N];
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let v = self.iter.next()?;
            if self.len < N {
                // Still filling: view the buffer as an array of uninitialized slots.
                let buf: &mut [MaybeUninit<I::Item>; N] =
                    unsafe { &mut *(&raw mut self.buf).cast::<[MaybeUninit<I::Item>; N]>() };
                buf[self.len].write(v);
                self.len += 1;
                if self.len == N {
                    // The buffer just became full: yield the first window.
                    return Some(unsafe { self.buf.assume_init_ref() }.clone());
                }
            } else {
                // Overwrite the oldest element, then rotate it to the back.
                let buf = unsafe { self.buf.assume_init_mut() };
                buf[0] = v;
                buf.rotate_left(1);
                return Some(buf.clone());
            }
        }
    }
    ...
}
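For comparison, the same sliding-window idea can be sketched in safe, stable Rust by buffering items in a `VecDeque` and yielding `Vec`s instead of arrays. This is only an illustration under stated assumptions: the `Windows` type and the free function `windows` are hypothetical names, not an existing or proposed standard-library API, and the sketch trades the fixed-size array for a heap allocation per window.

```rust
use std::collections::VecDeque;

/// Hypothetical safe sliding-window adaptor over any iterator.
pub struct Windows<I: Iterator> {
    iter: I,
    buf: VecDeque<I::Item>,
    size: usize,
}

/// Wraps `iter` so that it yields overlapping windows of `size` items.
/// (Hypothetical constructor; panics if `size` is zero.)
pub fn windows<I>(iter: I, size: usize) -> Windows<I>
where
    I: Iterator,
    I::Item: Clone,
{
    assert!(size > 0, "window size must be non-zero");
    Windows { iter, buf: VecDeque::with_capacity(size), size }
}

impl<I> Iterator for Windows<I>
where
    I: Iterator,
    I::Item: Clone,
{
    type Item = Vec<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // After a full window has been yielded, drop its oldest element.
        if self.buf.len() == self.size {
            self.buf.pop_front();
        }
        // Refill until the window is full; `?` ends iteration
        // as soon as the underlying iterator runs out.
        while self.buf.len() < self.size {
            self.buf.push_back(self.iter.next()?);
        }
        Some(self.buf.iter().cloned().collect())
    }
}
```

Applied to `"x🦀ab".chars()` with a size of 3, this yields the char windows `x🦀a` and `🦀ab`, since it counts chars rather than bytes and therefore never splits a code point.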
Proposal
Problem statement
The slice::chunks, slice::chunks_exact, and slice::windows functions do not exist on str. This is inconsistent with the ability to index a &str by a range. If that is allowed, then intuitively these iterators should also be.
Motivating examples or use cases
This recently came up in the rustweek python FFI workshop exercises:
Solution sketch
The str type (and core::str module) should provide the windows, chunks and chunks_exact functions, so that we can instead write:
The iterator will panic if it tries to yield a subslice that is not valid UTF-8, roughly:
So that
Prints something like
However, we can probably do a better job for the error, similar to &"x🦀"[1..2] showing:
Alternatives
Just have users do this manually.
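As a rough sketch of what doing this manually can look like today, the k-mer counting from the discussion can be written in stable Rust in two ways. Both function names are hypothetical. The first assumes an ASCII-only sequence and slides over bytes; the second counts chars (code points), which is the unit Python 3 string slicing uses.

```rust
use std::collections::HashMap;

/// Counts k-mers in an ASCII sequence by sliding over its bytes.
/// (Hypothetical helper; panics if the input is not ASCII or `k` is zero.)
fn count_kmers_ascii(sequence: &str, k: usize) -> HashMap<&str, usize> {
    assert!(sequence.is_ascii(), "sequence must be ASCII");
    assert!(k > 0, "k must be non-zero");
    let mut kmers = HashMap::new();
    for window in sequence.as_bytes().windows(k) {
        // Every window of ASCII bytes is valid UTF-8, so this cannot fail here.
        let kmer = std::str::from_utf8(window).expect("ASCII is valid UTF-8");
        *kmers.entry(kmer).or_insert(0) += 1;
    }
    kmers
}

/// Counts k-mers measured in chars, so it also accepts non-ASCII text.
/// (Hypothetical helper; panics if `k` is zero.)
fn count_kmers_chars(sequence: &str, k: usize) -> HashMap<&str, usize> {
    assert!(k > 0, "k must be non-zero");
    // Byte offsets of every char start, plus the end of the string.
    let bounds: Vec<usize> = sequence
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(sequence.len()))
        .collect();
    let mut kmers = HashMap::new();
    // k + 1 consecutive boundaries delimit exactly k chars.
    for w in bounds.windows(k + 1) {
        *kmers.entry(&sequence[w[0]..w[k]]).or_insert(0) += 1;
    }
    kmers
}
```

The ASCII version fails once, up front, with a message naming the real precondition, while the char-based version cannot slice inside a code point because every slice end comes from `char_indices`.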
Links and related work
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution: