Faster PDB parsing #133

maxall41 · 2024-12-16T21:21:35Z

This is a draft.

Changes:

Use byte slices instead of Vec<char>
Implemented custom, mildly unsafe functions for string parsing: fast_parse_u64_from_string and fast_trim

TODO:

Error handling
Cleanup code
Comprehensive final benchmarking
Safety tests

douweschulte

Nice work; improving the performance of the parser is always welcome. I added some comments inline. Apologies if any of them were already on your radar. Let me know if/when you could use another set of eyes on the changes.

douweschulte · 2024-12-17T13:31:14Z

src/read/pdb/lexer.rs

 use crate::error::*;
 use crate::reference_tables;
 use crate::ReadOptions;
 use crate::StrictnessLevel;

+use core::str;


I prefer to import from std (core is the subset of std that needs no OS or allocator, and std re-exports it, so core is mainly meant for no_std code).

douweschulte · 2024-12-17T13:40:14Z

src/read/pdb/lexer.rs

            errors.push(PDBError::new(
                ErrorLevel::InvalidatingError,
                "Atom charge is not correct",
                "The charge is not properly signed, it is defined to be [0-9][+-], so two characters in total.",
                Context::line(linenumber, line, 79, 1),
            ));
        } else {
-            charge = isize::try_from(chars[78].to_digit(10).unwrap()).unwrap();
-            if chars[79] == '-' {
+            let c = chars[78] as char;


You could do `charge = chars[78].checked_sub(b'0').filter(|n| *n <= 9).unwrap() as isize`. The panic still needs to be dealt with, of course, but that was part of the reason the PR is still in draft status. This uses the fact that the digits are laid out contiguously in ASCII/UTF-8 to make the parsing as fast as possible.
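
A minimal sketch of how the suggestion above could look without the panic. The helper name `parse_charge` and its `Option` return type are assumptions for illustration, not part of the PR; the real code would push a `PDBError` instead of returning `None`.

```rust
/// Hypothetical helper: parse the two-byte charge field, defined as `[0-9][+-]`.
/// Returns `None` on malformed input instead of panicking.
fn parse_charge(digit: u8, sign: u8) -> Option<isize> {
    // ASCII digits are contiguous, so subtracting b'0' yields the numeric value.
    // `checked_sub` rejects bytes below b'0'; the filter rejects bytes above b'9'.
    let value = isize::from(digit.checked_sub(b'0').filter(|n| *n <= 9)?);
    match sign {
        b'+' => Some(value),
        b'-' => Some(-value),
        _ => None,
    }
}
```

Returning `Option` keeps the fast path branch-light while leaving the caller free to decide how to report the error.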

douweschulte · 2024-12-17T13:41:37Z

src/read/pdb/lexer.rs

    let database = parse(linenumber, line, 24..28, &mut errors);
    let database_accession = parse(linenumber, line, 29..38, &mut errors);

    let mut db_pos = None;
-    if !chars[39..48].iter().all(|c| *c == ' ') {
+    if !chars[39..48].iter().all(|c| *c as char == ' ') {


There is no need for the char conversion here, so it could be *c == b' ' (as you did before).
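
For illustration, the byte-literal comparison can be sketched as below. Both helper names are hypothetical and only show the pattern; the same idea applies to the single-index checks such as `chars[21]`.

```rust
/// Hypothetical helper: true when every byte of a fixed-width field is a space.
fn is_blank(field: &[u8]) -> bool {
    // Compare against the byte literal directly; no `as char` cast is needed.
    field.iter().all(|c| *c == b' ')
}

/// Hypothetical helper: true when the byte at `idx` is a space.
/// Uses `get` so a line that is too short yields `false` rather than panicking.
fn is_space_at(line: &[u8], idx: usize) -> bool {
    line.get(idx) == Some(&b' ')
}
```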

douweschulte · 2024-12-17T13:41:56Z

src/read/pdb/lexer.rs

    let res_seq_1: isize = parse(linenumber, line, 17..21, &mut errors);
-    let icode_1 = if chars[21] == ' ' {
+    let icode_1 = if chars[21] as char == ' ' {


There is no need for the char conversion here, so it could be chars[21] == b' ' (as you did before).

douweschulte · 2024-12-17T13:42:09Z

src/read/pdb/lexer.rs

    let res_seq_2 = parse(linenumber, line, 31..35, &mut errors);
-    let icode_2 = if chars[35] == ' ' {
+    let icode_2 = if chars[35] as char == ' ' {


There is no need for the char conversion here, so it could be chars[35] == b' ' (as you did before).

douweschulte · 2024-12-17T13:45:03Z

src/read/pdb/utils.rs

+        i += 1;
+    }
+
+    result


Should there be a check for trailing non-whitespace in the text?
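
One possible shape for such a check, as a rough sketch. `parse_u64_strict` is a hypothetical stand-in, since the full body of the PR's function is not shown here; it assumes the field may be padded with ASCII whitespace on both sides.

```rust
/// Hypothetical stand-in for the PR's integer parser.
/// Returns `None` when there are no digits, on overflow,
/// or when anything other than whitespace follows the digits.
fn parse_u64_strict(text: &str) -> Option<u64> {
    let bytes = text.as_bytes();
    let len = bytes.len();
    let mut i = 0;

    // Skip leading whitespace
    while i < len && bytes[i].is_ascii_whitespace() {
        i += 1;
    }

    // Parse digits, guarding against overflow
    let start = i;
    let mut result: u64 = 0;
    while i < len && bytes[i].is_ascii_digit() {
        result = result
            .checked_mul(10)?
            .checked_add(u64::from(bytes[i] - b'0'))?;
        i += 1;
    }
    if i == start {
        return None; // no digits at all
    }

    // Everything after the digits must be whitespace
    if bytes[i..].iter().all(u8::is_ascii_whitespace) {
        Some(result)
    } else {
        None
    }
}
```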

douweschulte · 2024-12-17T13:47:05Z

src/read/pdb/utils.rs

+    }
+
+    // Parse digits
+    while i < len && bytes[i].is_ascii_digit() {


Could this be rewritten as an iterator? bytes.iter().skip_while(|b| b.is_ascii_whitespace()).take_while(|b| b.is_ascii_digit()). Only if you think the code will improve from that change.
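
A sketch of what the iterator form could look like when carried through to the final value. The function name and the `Option` return are assumptions; `try_fold` is used here so that overflow yields `None` rather than wrapping or panicking.

```rust
/// Hypothetical iterator-based variant of the digit loop.
/// Note: an input without digits yields `Some(0)`, and trailing
/// non-digit bytes are silently ignored, as with the plain loop.
fn parse_u64_iter(bytes: &[u8]) -> Option<u64> {
    bytes
        .iter()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .try_fold(0u64, |acc, b| {
            // Shift the accumulator one decimal place and add the new digit.
            acc.checked_mul(10)?.checked_add(u64::from(b - b'0'))
        })
}
```

Whether this is faster or slower than the explicit loop would need to be measured in the benchmarks; the main gain is readability.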

douweschulte · 2024-12-17T13:50:41Z

src/read/pdb/utils.rs

+}
+
+//TODO: Implement miri tests and fuzzing
+pub(crate) fn fast_trim(s: &str) -> &str {


One issue that might pop up is that you slice halfway into a character that contains the byte pattern of a whitespace character, either leading or trailing (yes, these characters exist). So checking only full characters could be a better fit. Using chars or char_indices could potentially be better, but there is also the function is_char_boundary that might be of help.
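
A safe-code sketch of a byte-based trim, assuming only ASCII whitespace should be trimmed. `trim_ascii_ws` is a hypothetical name, not the PR's `fast_trim`. Bytes belonging to a multi-byte UTF-8 sequence are all 0x80 or above, so an ASCII-only test should not match inside a character; the `is_char_boundary` assertion and the checked slicing are kept as a safety net in case that assumption does not hold for the real implementation.

```rust
/// Hypothetical safe variant of a fast trim that only strips ASCII whitespace.
fn trim_ascii_ws(s: &str) -> &str {
    let bytes = s.as_bytes();
    // Index of the first byte that is not ASCII whitespace (or the end).
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    // One past the last byte that is not ASCII whitespace.
    let end = bytes
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |p| p + 1);
    // Both indices are expected to fall on character boundaries.
    debug_assert!(s.is_char_boundary(start) && s.is_char_boundary(end));
    // Checked slicing: panics instead of producing an invalid &str.
    &s[start..end]
}
```

Non-ASCII whitespace such as U+00A0 is left in place by this sketch; if those should be trimmed as well, iterating with `char_indices` would be the more natural fit.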

douweschulte · 2024-12-17T13:53:06Z

src/read/pdb/lexer.rs

 use std::cmp;
 use std::convert::TryFrom;
+use std::io::Read;
+use std::mem::transmute;


This does not seem to be used anywhere.

maxall41 and others added 7 commits September 30, 2024 16:44

Update parser.rs

6724d7b

Update parser.rs

47c514b

Update parser.rs

ffa0a37

Update parser.rs

fc181ba

Benchmark

a2b9749

Benchmark

4d1ed15

Faster!!!

c2826c3

douweschulte reviewed Dec 17, 2024

douweschulte linked an issue Dec 17, 2024 that may be closed by this pull request

Prefer Vec<u8>/[u8] over Vec<char>/[char] #15

Open
