WIP: Memory Usage Improvements (to fix #65) #197
base: master
Conversation
Reading an example 4.4MB file in […]
The tokenization code was the biggest culprit for memory allocation. While there weren't a lot of retained objects that would suggest a leak, the tokenizer allocated so much short-term garbage that the GC struggled to keep up. It also took up a lot of stack frames, because it was written to use recursion rather than imperative looping.

The build for this branch is currently broken. I've made some drastic (but incomplete) changes that should reduce String allocations, along with other changes to further reduce the memory footprint of various classes. Once I have the parts working separately and unit tests written, I'll go about fitting the new tokenizer into the rest of the library. That might result in user-facing API changes, but I'm aiming to minimize that and provide shims where possible.
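To make the recursion point concrete, here's a minimal hypothetical sketch (not the branch's actual tokenizer): the recursive style burns one stack frame per element and allocates a remainder String on every call, while the imperative loop allocates only the elements themselves.

```ruby
# Simplified, hypothetical example -- not the tokenizer in this branch.

# Recursive style: one stack frame per element, and String#split
# allocates a fresh String for the remainder of the input each call.
def read_elements_recursive(input, acc = [])
  return acc if input.empty?
  element, rest = input.split("*", 2)
  read_elements_recursive(rest.to_s, acc << element)
end

# Imperative style: a single stack frame, scanning by index, so the
# only Strings allocated are the elements themselves.
def read_elements_loop(input)
  acc    = []
  offset = 0
  while offset < input.length
    stop = input.index("*", offset) || input.length
    acc << input[offset, stop - offset]
    offset = stop + 1
  end
  acc
end
```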
Here are some preliminary results. The test script reads a file, hands it to the tokenizer, and iterates through each token.

In the output below, the "Finished" figure is ultimately the metric that matters to the user, but the tokenizer's contribution to the total memory use is closer to […]. Some of the post-init memory could be consumed by reading the input file into memory; that could explain why the larger file has a larger post-init size compared to the smaller file.

master: 450KB input
wip: 450KB input
The difference is more dramatic on larger files.

master: 4.4MB input
wip: 4.4MB input
When I have time, I'd like to plot the runtime and memory usage as file size increases. One important note: these gains might be diminished once the parser is changed to accommodate the new tokenizer. There are also further savings to be had by not adding the […]
- Reader::Tokenizer returns a fatal error when zero ISAs are found or when an error occurs inside an ISA..IEA envelope; other errors are not Reader::Result#fatal?
- Create Reader::Input to track the position of tokenized elements
- Fix subtle bugs in Reader::Pointer and/or Reader::StringPtr
- Parser::StateMachine#insert now destructively updates tokenizer state (separators and segment_dict)
- Parser::BuilderDsl uses the new Reader::StacktracePosition
- Parser::DslReader has been merged into Parser::BuilderDsl
- ElementVal::AN and ElementVal::ID now use a StringPtr where possible (sketched below), but ElementVal::TM, ::DT, ::Nn, and ::R all need to "parse" the string, so they require allocating the substring first (TODO)
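Roughly, the StringPtr idea works like this (a hedged sketch under my own naming assumptions; the branch's actual Reader::StringPtr surely differs): hold a reference to the parent string plus an offset and length, and only allocate a real substring when one is genuinely needed.

```ruby
# Hypothetical sketch of a zero-copy substring; not the branch's
# actual Reader::StringPtr.
class StringPtr
  attr_reader :offset, :length

  def initialize(storage, offset, length)
    @storage, @offset, @length = storage, offset, length
  end

  # Compare in place against a plain String, without allocating a
  # substring of @storage.
  def ==(other)
    other.length == @length &&
      @storage.index(other, @offset) == @offset
  end

  # Narrowing the window is just pointer arithmetic; no copying.
  def drop(n)
    StringPtr.new(@storage, @offset + n, @length - n)
  end

  # The only place a new String is actually allocated.
  def to_s
    @storage[@offset, @length]
  end
end

ptr = StringPtr.new("ISA*00", 0, 3)
ptr == "ISA"           #=> true, with no substring allocated
ptr.to_s               #=> "ISA" -- allocated only here

whole = StringPtr.new("ISA*00", 0, 6)
whole.drop(4) == "00"  #=> true, still zero-copy
```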
Force-pushed from 65d8e65 to 2424ce6
- Don't use def_delegators on methods called frequently; it allocates *args
- Build a native extension, String.strncmp, to compare two substrings, which results in a faster StringPtr#== and StringPtr#start_with?
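On the def_delegators point, here's what that looks like in practice (illustrative class names; not code from the branch). Forwardable's generated methods forward through *args, so each call allocates an Array even when the target method takes no arguments; writing the delegation by hand with explicit arity avoids that on hot paths.

```ruby
require "forwardable"

# Convenient, but the generated methods take *args, so every call
# allocates an Array just to forward the arguments.
class SegmentViaForwardable
  extend Forwardable
  def_delegators :@elements, :[], :each

  def initialize(elements)
    @elements = elements
  end
end

# Hand-written delegation with explicit arity skips the splat Array,
# at the cost of a little boilerplate.
class SegmentHandWritten
  def initialize(elements)
    @elements = elements
  end

  def [](index)
    @elements[index]
  end

  def each(&block)
    @elements.each(&block)
  end
end
```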
Implement zero-allocation lstrip, rstrip, is_control_character_at?(offset), and lstrip_control_characters_offset(offset)
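Roughly, the zero-allocation strip trick looks like this (hypothetical free-standing methods; the real ones live on the pointer classes): return an adjusted offset instead of a stripped copy, and read bytes with getbyte so not even one-character Strings are allocated while scanning.

```ruby
# Allocating version: builds a new String on every call.
def lstrip_copy(string, offset)
  string[offset..-1].lstrip
end

# Zero-allocation version: scan past ASCII whitespace and return the
# offset where content begins; callers adjust their pointer window.
def lstrip_offset(string, offset)
  while offset < string.bytesize
    byte = string.getbyte(offset)  # no one-char String, unlike string[offset]
    # 0x20 is " "; 0x09..0x0d are \t \n \v \f \r
    break unless byte == 0x20 || (0x09..0x0d).cover?(byte)
    offset += 1
  end
  offset
end

lstrip_offset("  ISA*00", 0)  #=> 2
```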
Force-pushed from f747e4b to e64b6d2
Force-pushed from 23db4b5 to 75756c6
[…] char-at-a-time accesses
This is still WIP. Hoping for only minimal API changes.