Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster PDB parsing #133

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from
Draft

Faster PDB parsing #133

wants to merge 7 commits into from

Conversation

maxall41
Copy link

@maxall41 maxall41 commented Dec 16, 2024

This is a draft.

Changes:

  • Use byte slices instead of Vec
  • Implemented custom mildly unsafe functions for string parsing: fast_parse_u64_from_string, and fast_trim

TODO:

  • Error handling
  • Cleanup code
  • Comprehensive final benchmarking
  • Safety tests

Copy link
Owner

@douweschulte douweschulte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, improving the performance of the parser is always nice. I added some comments inline. If any of these comments where already on your radar for changing I am sorry. Let me know if/when you can use another set of eyes on the changes.

use crate::error::*;
use crate::reference_tables;
use crate::ReadOptions;
use crate::StrictnessLevel;

use core::str;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to load from std (core is a reexport that only exports a subset of std for other uses).

errors.push(PDBError::new(
ErrorLevel::InvalidatingError,
"Atom charge is not correct",
"The charge is not properly signed, it is defined to be [0-9][+-], so two characters in total.",
Context::line(linenumber, line, 79, 1),
));
} else {
charge = isize::try_from(chars[78].to_digit(10).unwrap()).unwrap();
if chars[79] == '-' {
let c = chars[78] as char;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do 'charge = chars[78].checked_sub(b'0').filter(|n| n <= 9).unwrap() as isize`, here of course the panic needs to be dealt with, but that was part of the reason for the PR being in draft status still. This uses the fact that the numbers are listed very nicely in ascii/utf to make the parsing as fast as possible.

let database = parse(linenumber, line, 24..28, &mut errors);
let database_accession = parse(linenumber, line, 29..38, &mut errors);

let mut db_pos = None;
if !chars[39..48].iter().all(|c| *c == ' ') {
if !chars[39..48].iter().all(|c| *c as char == ' ') {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need for the char conversion here, so it could be *c == b' ' (as you did before).

let res_seq_1: isize = parse(linenumber, line, 17..21, &mut errors);
let icode_1 = if chars[21] == ' ' {
let icode_1 = if chars[21] as char == ' ' {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need for the char conversion here, so it could be chars[21] == b' ' (as you did before).

let res_seq_2 = parse(linenumber, line, 31..35, &mut errors);
let icode_2 = if chars[35] == ' ' {
let icode_2 = if chars[35] as char == ' ' {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need for the char conversion here, so it could be chars[35] == b' ' (as you did before).

i += 1;
}

result
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a check for trailing non-whitespace in the text?

}

// Parse digits
while i < len && bytes[i].is_ascii_digit() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be rewritten as an iterator? bytes.skip_while(|b| b.is_ascii_whitespace()).take_while(|b| b.is_ascii_digit()). Only if you think the code will improve from that change.

}

//TODO: Implement miri tests and fuzzing
pub(crate) fn fast_trim(s: &str) -> &str {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue that might pop up is that you slice halfway into a character that has the byte pattern of a whitespace character either leading or trailing (yes these character exists). So checking only full characters could be a better fit. So either using chars or char_indices could potentially be better, but there is also the function is_char_boundary that might be of help.

use std::cmp;
use std::convert::TryFrom;
use std::io::Read;
use std::mem::transmute;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem to be used anywhere.

@douweschulte douweschulte linked an issue Dec 17, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prefer Vec<u8>/[u8] over Vec<char>/[char]
2 participants