Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to use utf-16 positions instead of utf-8 #126

Open
nilskch opened this issue Dec 9, 2024 · 5 comments
Open

Add option to use utf-16 positions instead of utf-8 #126

nilskch opened this issue Dec 9, 2024 · 5 comments

Comments

@nilskch
Copy link

nilskch commented Dec 9, 2024

We use this parser for a custom linter that we ship through a VSCode extension (rustpython_parser::ast::Suite::parse_without_path). The VSCode extension api expects byte positions to the text encoded in utf-16, but the parser returns the byte positions to text encoded in utf-8. Can we add an option to use utf-16 encoded byte positions?

I'd be happy to implement this feature if you're comfortable with adding it to the parser. Do you have suggestions on how the API should look like?

@nilskch
Copy link
Author

nilskch commented Dec 10, 2024

@youknowone, what do you think?

@youknowone
Copy link
Member

I'd like to understand the problem better. Is utf-16 position calculation a performance bottleneck of the project?
If not, you can simply get the line number and transform utf-8 position to utf-16 position using the source code and library like https://docs.rs/utf16string/latest/utf16string/
Transforming source code to utf-16 is not avoidable for this case. Then either way transforming utf-8 to utf-16 is not avoidable and will cost cpu power. Then doing it outside of parser can be simple, as long as it fits your requirements.

@nilskch
Copy link
Author

nilskch commented Dec 11, 2024

The linter that we built parses the code on every key stroke, so performance is quite important to us. The utf-16 conversion is not the most critical performance bottleneck, but it would still be a nice win for us to get rid of it. We currently iterate over the python code once and create a lookup table that maps utf-8 positions to utf-16 positions (without libraries). It would be nice for us if the AST would provide the utf-16 positions directly. I thought we could maybe parameterize the text_len method to return utf-8 or utf-16 positions:

impl TextLen for char {
#[inline]
#[allow(clippy::cast_possible_truncation)]
fn text_len(self) -> TextSize {
(self.len_utf8() as u32).into()
}
}

That way we would not need to create that lookup table in the linter.

@nilskch
Copy link
Author

nilskch commented Dec 11, 2024

I just saw this comment #125 (comment).

I thought this is the parser that ruff uses? The README is misleading if this is not true. Would you recommend us to also move to the Ruff parser?

@youknowone
Copy link
Member

Rust ecosystem usually don't parametrize len() itself for utf16. Rather than that, adding text_len_utf16 method is making sense. Though still I have no idea how it can improve performance. Since the parser internal is utf8, there must transformation somewhere. If it can be cached inside of the parser, it still can be cached outside of the parser.

That would be better to be changed. I forgot the version number, but Ruff is how using its own parser. If performance is critical, that has better performance by their benchmark.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants