-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python 3 dict behaviors for PMap #178
Comments
Creating an umbrella issue may be good to keep things together although I think it would be fully reasonable to only implement parts of the characteristics described here. Keeping the pmap behaviour close to the common dict is probably a good idea even though it has never been part of any behavioural contract to do that. As observed the behaviour differs depending on version for Python and I've chosen to stick to a behaviour close to Python 2 so far. I would not have anything against PRs moving the behaviour closer to that of recent Python 3 versions given that performance does not take a too big hit though. Especially since Python 2 is soon EOL. I have recently done some shallow investigations into switching the underlying implementation of the pmap to make it more performant (see #160). I've tried using https://github.com/MagicStack/immutables (and hence bring in an additional dependency since I don't have the time required myself to write at HAMT or similar in C at the moment). The commit for this is on https://github.com/tobgu/pyrsistent/tree/test-different-map-implementation. Unfortunately the immutables map seg faults sporadically in some of the pmap tests. I have not dug into the root cause of these crashes (help would be much appreciated!). I could see these behavioural changes going into the same, general pmap renovation effort. |
I should have mentioned this issue in pull request #243. That pull request causes It also returns a view object for values and items that mimic the Maintaining key order is totally possible with a persistent dict type but would probably take some serious work—I don't imagine such a change would be efficient unless it involved rewriting the guts of the C implementation. @tobgu I have a lot of experience with C programming and with writing low-level HAMTs and similar data structures if you want to chat about the prospect of rewriting these data structures from scratch. |
Thanks @noahbenson! Yeah, looking into a speedier implementation of PMap would be nice. It's currently bolted onto the PVector to enjoy some of perf-benefits of that low level implementation while avoiding the maintenance burden of multiple implementations for other data structures than the PVector. It would most definitely be much quicker with a proper C-implementation though. See #160 for more on the same topic. I have not looked at https://github.com/MagicStack/immutables for quite a while. There seems to have been some activity since last so it might be worth revisiting. If it still isn't up to standards for use as underlying implementation for PMap it may be possible to contribute to that project to make it happen. In general I'm slightly worried that adding too much low level C-code (with required duplication in a pure Python version for PyPy, etc) will make the maintenance burden and barrier to entry for outside contributors higher. I personally don't have a ton of time available to spend on pyrsistent. As for the PVector I think the performance is decent but there are probably optimizations and structural changes that would improve it further. One thing that would be interesting to look into (as discussed in #127) would be to try to reduce the places were ref. counting is needed to improve memory reuse between parent and child processes after fork for example. |
I took a look at the MagicStack library earlier—IMHO it's 80% of the way there for maps? That said I didn't actually pull it and do tests or anything, so this is all based on just browsing their repo. (Also, full confession: I haven't looked closely at the One of the really useful things about the I spent a few hours writing out some of the base code for a C extension of this kind of persistent HAMT last night, and one of my conclusions is that the whole thing can be written using only the To be honest I don't have a ton of time to work on this kind of thing either right now, but writing a c implementation of a persistent HAMT is a fairly small lift for me, so I'd be interested in writing/testing then pull-requesting something like the data structure described above if you'd be interested in doing the bulk of the integration work. Though, now that I think about it, a more logical approach would probably be for me to just publish the c-implementation of the HAMT as a small independent library and let you incorporate it into Re: ref-counting. Eee. I'm not sure how you are going to be able to get away with that. Any given node in any persistent structure that you make might get reused in an arbitrary number of other structures, and higher nodes that point to them might get garbage collected at any time. Fundamentally, every node you allocate in C has to be garbage-collectible somehow. If you aren't going to use Python's garbage collection schema, then you need something that is compatible with it, and I don't see how to write a sweep-based garbage collector on top of Python's ref-counting. I thought about doing the refcounting internally (i.e., a separate more efficient refcounter for the HAMT nodes), but if you do this, then you break Python's ability to detect loops in the reference graph (for example: All that said, I could be mistaken about any number of things in the above paragraph (it would take someone who knows the guts of the Python refcounting/GC system better than I to evaluate this). Also, it's not clear to me from the original question why the refcounts for the internal HAMT nodes are getting updated on reads—so maybe this is the root of the problem? (I'll have to take a closer look at the code). |
I'd be very interested in seeing what such a structure would look like. If it turns out to be a feasible replacement for the current PVector (feature-wise and perf-wise) I may consider switching the underlying implementation given that I can find the time to do the integration work. If you have an initial implementation sketched out I'd be happy to have a look at it. :-) |
Well, short answer is that I have a woking prototype of it that has even been extensively tested already... it's just written in Julia 😅. This kind of low-level Julia is pretty similar to C in many ways (enough that I'm almost translating this library into C line-by-line), so if you want to take a look, all the relevant code is in this file. Once I check-in the C version I'm working on, I'll ping you also. The For simplicity, let's assume that the hash type is a 64-bit integer. A very simple version of a persistent hash array mapped trie (PHAMT) is one in which, in order to look up the leaf-node (value) for a hash integer (key), you start by dicing the hash into a sequence of integers represented by every five bits of the hash. Every 5 bits can be represented by an unsigned integer from 0 to 31 (the first 4 bits become an integer in [0,15]). If your hash integer gets diced into the values The version I'm currently translating from Julia to C is like this, but it makes 2 major optimizations:
So because of optimization 1 above, when using a PHAMT as the backend for a Just to be clear: I'm not claiming to have discovered either of these optimizations, and the MagicStack implementation includes a slightly more sophisticated version of optimization 2 (they have a more sophisticated version of the same idea, but they're both so low-level and similar that it'll be hard to guess which is faster). |
Thanks, keeping an eye on it! I guess some sort of hash collision handling must be implemented outside of the structure to make it usable for mappings (which is probably less efficient but cleaner from an implementation perspective)? I also think that some kind of transient ("evolver" in pyrsistent lingo) is required to make the structure efficient for bulk operations. I've not looked at the corresponding Julia code if there's something similar in place though. |
Yeah—the collision detection can pretty easily be handled by having one PHAMT ( Anyway, long story short—putting collision outside the PHAMT makes the PHAMT more performant for persistent vectors/lists, since they don't need that kind of collision detection. Re:transients—I just haven't gotten around to writing these yet, but since it's all in C, it can be written pretty trivially by sticking a bit in the PHAMT data-structure that flags whether the object is transient or not. Edits to transient objects can then easily reallocate the persistent nodes into transient ones as changes are made then just flip all those bits over to "persistent" when the object gets persisted. When I wrote the transient system in Julia, I was surprised to discover that building a very large dictionary (a million elements) with transients was only ~20% faster than doing it all with persistent objects. This is likely very different for arrays, where appending to a persistent structure will constantly reallocate the same node, but I think that with dictionaries, the sparsity of keys is so high that you don't save much time with the transient approach. Anyway, worth testing with this implementation as it comes along. |
Python 3 introduces several new behaviors for
dict
which have been mentioned in previous issues, but a unifying issue might be useful to the discussion.Both of these changes are semantically significant in practice, so it might be worth documenting the behavior until a Python 3-alike version of
PMap
is implemented to ensure nobody gets caught unaware?Setlike view objects
Python 3
dicts
have set-likekeys()
,values()
, anditems()
views. In Python 3, order does not matter when comparing these views:and they support
set
operations:Maintaining key order
Since 3.7, dict keys are guaranteed to iterate in the same order in which they were added.
The text was updated successfully, but these errors were encountered: