-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster PDB parsing #133
base: master
Are you sure you want to change the base?
Faster PDB parsing #133
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, improving the performance of the parser is always nice. I added some comments inline. If any of these comments where already on your radar for changing I am sorry. Let me know if/when you can use another set of eyes on the changes.
use crate::error::*; | ||
use crate::reference_tables; | ||
use crate::ReadOptions; | ||
use crate::StrictnessLevel; | ||
|
||
use core::str; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to load from std (core is a reexport that only exports a subset of std for other uses).
errors.push(PDBError::new( | ||
ErrorLevel::InvalidatingError, | ||
"Atom charge is not correct", | ||
"The charge is not properly signed, it is defined to be [0-9][+-], so two characters in total.", | ||
Context::line(linenumber, line, 79, 1), | ||
)); | ||
} else { | ||
charge = isize::try_from(chars[78].to_digit(10).unwrap()).unwrap(); | ||
if chars[79] == '-' { | ||
let c = chars[78] as char; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do 'charge = chars[78].checked_sub(b'0').filter(|n| n <= 9).unwrap() as isize`, here of course the panic needs to be dealt with, but that was part of the reason for the PR being in draft status still. This uses the fact that the numbers are listed very nicely in ascii/utf to make the parsing as fast as possible.
let database = parse(linenumber, line, 24..28, &mut errors); | ||
let database_accession = parse(linenumber, line, 29..38, &mut errors); | ||
|
||
let mut db_pos = None; | ||
if !chars[39..48].iter().all(|c| *c == ' ') { | ||
if !chars[39..48].iter().all(|c| *c as char == ' ') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need for the char conversion here, so it could be *c == b' '
(as you did before).
let res_seq_1: isize = parse(linenumber, line, 17..21, &mut errors); | ||
let icode_1 = if chars[21] == ' ' { | ||
let icode_1 = if chars[21] as char == ' ' { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need for the char conversion here, so it could be chars[21] == b' '
(as you did before).
let res_seq_2 = parse(linenumber, line, 31..35, &mut errors); | ||
let icode_2 = if chars[35] == ' ' { | ||
let icode_2 = if chars[35] as char == ' ' { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need for the char conversion here, so it could be chars[35] == b' '
(as you did before).
i += 1; | ||
} | ||
|
||
result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a check for trailing non-whitespace in the text?
} | ||
|
||
// Parse digits | ||
while i < len && bytes[i].is_ascii_digit() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be rewritten as an iterator? bytes.skip_while(|b| b.is_ascii_whitespace()).take_while(|b| b.is_ascii_digit())
. Only if you think the code will improve from that change.
} | ||
|
||
//TODO: Implement miri tests and fuzzing | ||
pub(crate) fn fast_trim(s: &str) -> &str { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue that might pop up is that you slice halfway into a character that has the byte pattern of a whitespace character either leading or trailing (yes these character exists). So checking only full characters could be a better fit. So either using chars
or char_indices
could potentially be better, but there is also the function is_char_boundary
that might be of help.
use std::cmp; | ||
use std::convert::TryFrom; | ||
use std::io::Read; | ||
use std::mem::transmute; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not seem to be used anywhere.
This is a draft.
Changes:
fast_parse_u64_from_string
, andfast_trim
TODO: