#13394 closed enhancement (fixed)
Write a WeakValueDictionary with safer key removal
Reported by: | nbruin | Owned by: | rlm |
---|---|---|---|
Priority: | major | Milestone: | sage-5.13 |
Component: | memleak | Keywords: | |
Cc: | SimonKing | Merged in: | sage-5.13.beta3 |
Authors: | Simon King, Nils Bruin | Reviewers: | Nils Bruin, Simon King |
Report Upstream: | None of the above - read trac for reasoning. | Work issues: | |
Branch: | | Commit: | |
Dependencies: | | Stopgaps: | |
Description
On ticket #12313 we found that the use of WeakValueDictionaries as caches can cause removal callbacks to run in rather harsh environments. Normal WeakValueDictionaries remove keys with dead values by looking up the key. This involves Python equality testing on the key, which can trigger just about any kind of operation in Sage. We need a dictionary from which we can delete entries without key comparisons. See below for possible strategies.
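To make the hazard concrete, here is a minimal, self-contained sketch (plain Python with hypothetical class names, not Sage code): a key whose `__eq__` runs arbitrary code is re-entered by the removal callback, which fires at whatever unpredictable moment the value happens to die.

```python
import weakref

CALLS = []  # records every __eq__ invocation on our keys

class Tracked(object):
    """A value that can be weakly referenced."""

class NoisyKey(object):
    """Hypothetical key whose equality test runs arbitrary code, as Sage keys may."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 0  # force hash collisions, so dict lookups must fall back to __eq__
    def __eq__(self, other):
        # In Sage, this comparison could trigger coercion, parent creation, ...
        CALLS.append(self.n)
        return isinstance(other, NoisyKey) and self.n == other.n

D = weakref.WeakValueDictionary()
v1, v2 = Tracked(), Tracked()
D[NoisyKey(1)] = v1
D[NoisyKey(2)] = v2
before = len(CALLS)
del v2   # the value dies; the removal callback deletes the key via a dict lookup
assert len(CALLS) > before   # key comparison ran inside the callback
assert len(D) == 1
```

The last assertion demonstrates the point of the ticket: the callback itself performs a key lookup, so arbitrary `__eq__` code runs at garbage-collection time.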
To the release manager:
Attachments (8)
Change History (120)
comment:1 Changed 7 years ago by
- Description modified (diff)
comment:2 Changed 7 years ago by
comment:3 Changed 7 years ago by
- Cc SimonKing added
comment:4 Changed 6 years ago by
- Milestone changed from sage-5.11 to sage-5.12
comment:5 Changed 5 years ago by
I have attached a proof of concept. I think the class WVD defined in the attachment is feature-complete. According to my benchmarks, it is faster than weakref.WeakValueDictionary, and it is safer.
Performance
sage: def test_dict(D,L):
....:     for k,v in L:
....:         D[k] = v
....:     for k,v in L:
....:         if k not in D:
....:             raise RuntimeError("containment")
....:     for k,v in L:
....:         D[k] = v
....:     for k in D.iterkeys():
....:         if k not in D:
....:             raise RuntimeError("second containment")
....:     for k,v in L:
....:         assert D.get(k)==v
....:         del D[k]
....:     for k,v in L:
....:         if k in D:
....:             raise RuntimeError("anti-containment")
....:
sage: L = [(p,GF(p)) for p in prime_range(2*10^4)]
sage: from weakref import WeakValueDictionary
sage: D = WeakValueDictionary()
sage: %timeit test_dict(D,L)
10 loops, best of 3: 43.2 ms per loop
sage: D = WVD()
sage: %timeit test_dict(D,L)
10 loops, best of 3: 29.4 ms per loop
Hence, the WVD is faster than the weakref.WeakValueDictionary.
Safety
sage: class Vals(object): pass
sage: class Keys:
....:     def __init__(self, val):
....:         self.val = weakref.ref(val)
....:     def __hash__(self):
....:         return hash(self.val())
....:     def __eq__(self, other):
....:         return self.val() == other.val()
....:
sage: ValList = [Vals() for _ in range(10)]
sage: D = WeakValueDictionary()
sage: for v in ValList:
....:     D[Keys(v)] = v
....:
sage: len(D)
10
sage: del ValList
Exception KeyError: (<__main__.Keys instance at 0xdcfc96c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfc90c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfcc2c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfc9ec>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfca4c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfc9cc>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfc98c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfccac>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfca0c>,) in <function remove at 0xdcb9df4> ignored
sage: len(D)
10
Hence, the weakref.WeakValueDictionary is unsafe.
sage: ValList = [Vals() for _ in range(10)]
sage: D = WVD()
sage: for v in ValList:
....:     D[Keys(v)] = v
....:
sage: len(D)
10
sage: del ValList
sage: len(D)
1
sage: del v
sage: len(D)
0
Hence, the WVD is safe (or at least: safer...).
I suppose I should add tests and documentation to the proof of concept, put it into sage.misc, and rename WVD to sage.misc.weak_dict.WeakValueDictionary. And then we should see if we can really smoothly replace weakref.WeakValueDictionary in Sage.
comment:6 follow-up: ↓ 7 Changed 5 years ago by
Excellent work! I like how you found a way to reuse python dictionaries. I think there is one fairly easy optimization that you can make that is fairly similar to how we ended up implementing the buckets for MonoDict and TripleDict: Presently, your bucket is a list of tuples. That provides an extra layer of indirection, meaning both slower access and more memory use and allocation.
If instead you "inline" the tuples by making the bucket a list with the layout [key1, weakref_to_value1, key2, weakref_to_value2, ...], you can save yourself some overhead. Normally these lists should only contain one key-value pair anyway.
It may be worth browsing through MonoDict anyway. There may be more little tricks that we used there that apply here and that I don't remember right now.
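As an illustration of the flattened layout (a sketch with hypothetical names, not code from the attachment): lookup walks the bucket two slots at a time, so each entry costs neither a tuple allocation nor an extra pointer dereference.

```python
import weakref

def bucket_lookup(bucket, key):
    """Search a flat bucket [key1, ref1, key2, ref2, ...] for `key`.

    Hypothetical helper: returns the value, or raises KeyError if the
    key is absent or its weakly referenced value has already died.
    """
    for i in range(0, len(bucket), 2):
        if bucket[i] is key or bucket[i] == key:
            value = bucket[i + 1]()      # dereference the stored weakref
            if value is None:
                raise KeyError(key)      # stale entry: the value died
            return value
    raise KeyError(key)

class V(object):
    pass

v = V()
bucket = ["a", weakref.ref(v)]           # one entry occupies two flat slots
assert bucket_lookup(bucket, "a") is v
```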
comment:7 in reply to: ↑ 6 Changed 5 years ago by
Replying to nbruin:
If instead you "inline" the tuples by making the bucket a list with the layout
[key1, weakref_to_value1, key2, weakref_to_value2,...]
you can save yourself some overhead.
Good point.
In addition, I think that the weakref_to_value_n.key used for the callback should be improved: storing hash(key_n) should be enough. Namely, it allows us to find the correct hash bucket, and then the hash bucket is searched for the weakref_to_value_n that is subject to the callback. The latter is simply a comparison of memory addresses.
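A toy version of this scheme (hypothetical names; the real implementation is the Cython attachment) might look as follows: the KeyedRef carries hash(key) instead of the key, and the callback removes the entry purely by hash lookup plus identity comparison.

```python
import weakref

class WVDSketch(object):
    """Toy weak value dictionary (illustrative; not the ticket's code).

    The KeyedRef carries hash(key) rather than the key itself, so the
    removal callback never compares keys for equality.
    """
    def __init__(self):
        self.buckets = {}   # hash value -> [key1, ref1, key2, ref2, ...]

    def __setitem__(self, key, value):
        h = hash(key)
        ref = weakref.KeyedRef(value, self._callback, h)   # ref.key == hash(key)
        self.buckets.setdefault(h, []).extend([key, ref])

    def _callback(self, ref):
        # Find the bucket by the stored hash, then remove the entry whose
        # weakref *is* the dead one -- a pure memory-address comparison.
        bucket = self.buckets.get(ref.key, [])
        for i in range(1, len(bucket), 2):
            if bucket[i] is ref:
                del bucket[i - 1:i + 1]
                return

    def __len__(self):
        return sum(len(b) for b in self.buckets.values()) // 2

# Usage: the entry vanishes without any key equality test.
class V(object):
    pass

d = WVDSketch()
v = V()
d["some key"] = v
assert len(d) == 1
del v   # on CPython the callback fires immediately
assert len(d) == 0
```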
comment:8 Changed 5 years ago by
- Branch set to u/SimonKing/ticket/13394
- Created changed from 08/24/12 01:56:58 to 08/24/12 01:56:58
- Modified changed from 10/28/13 12:11:38 to 10/28/13 12:11:38
comment:9 Changed 5 years ago by
- Branch u/SimonKing/ticket/13394 deleted
I have posted an initial version that still lacks documentation and tests (and with the next commit I will also fix some trailing whitespace).
I am afraid that the current version of a weak value dictionary is not faster than what I have described in the proof of concept, although I used the improved bucket layout and am using C-API functions rather thoroughly.
Anyway. I guess part of the documentation will be benchmarks for each separate task. Then we will see what should be further improved.
comment:10 Changed 5 years ago by
- Branch set to u/SimonKing/ticket/13394
- Commit set to f0ed60fbdf5f2b1c7cd15f6bab98099f4ff8b822
New commits:
[changeset:f0ed60f] | Initial version of a safer and faster WeakValueDictionary |
comment:11 follow-up: ↓ 12 Changed 5 years ago by
I wonder: Is it really a good idea to have a dictionary storing one list for each hash value? Granted, it does solve the problem of defunct keys in the callback of weak values. However, storing the hash buckets similarly to what we do for TripleDict might turn out to be faster.
Well, I guess one should first have a version that works and is faster and safer than weakref.WeakValueDictionary. Then we can further optimise.
comment:12 in reply to: ↑ 11 ; follow-up: ↓ 13 Changed 5 years ago by
Replying to SimonKing:
I wonder: Is it really a good idea to have a dictionary storing one list for each hash value?
Well, it's not entirely optimal, of course. Python's own dict already has a mechanism to deal with hash collisions, so storing the lists is, strictly speaking, an unnecessary indirection. However, breaking open Python's dict implementation to access the hash buckets there is going to be a very hard-to-maintain solution (you'd basically be patching Python to implement an extra "delete by hash and value id" routine. Entirely doable, but we'd be stuck with a patched Python for eternity).
There are two solutions:
- benefit from Python's insanely optimized dict and swallow the cost of an extra indirection
- abandon Python's dict entirely and implement your own hash table.
We followed the latter for MonoDict and TripleDict because the key semantics there are so different that borrowing python's dict wasn't really an option. For WeakValueDictionary it is an option to borrow from dict. I suspect you'd have to work pretty hard to come close to that performance.
Note that almost all of your lists are just going to be pairs. Python internally creates such things all the time: arguments tend to be packaged and copied as tuples.
I guess that brings us to a third possibility: Finish the proof of concept, show the python community what the problem is with WeakValueDictionary, and suggest the extra hook we need to make it safe. Then we might end up with a safe and fast WeakValueDictionary in python proper. That would only be in Python3+, though, so we'd have to backport to Python2.7.
There are already some issues reported: http://bugs.python.org/issue7105 http://bugs.python.org/issue17816. You might want to check if your implementation fixes those.
comment:13 in reply to: ↑ 12 Changed 5 years ago by
Replying to nbruin:
There are already some issues reported: http://bugs.python.org/issue7105
I did not address this one: I do not switch garbage collection off during iteration. I could, of course. Later perhaps.
Issue 17816, however, is fixed in my implementation. Here is a slight variation of the example proposed in that issue, showing that weakref.WeakValueDictionary is a mess:
sage: import weakref
sage: import sage.misc.weak_dict
sage: class A(object):
....:     def __init__(self,n):
....:         self.n=n
....:     def __repr__(self):
....:         return "A(%d)"%self.n
....:
sage: def mess(D, n):
....:     L=[A(i) for i in range(n)]
....:     for i in range(n-1):
....:         j=(i+10)%n
....:         D[L[i]]=L[j]
....:     return D
....:
sage: D = weakref.WeakValueDictionary()
sage: D = mess(D,10000)
sage: len(D)
sage: D.clear()
Exception KeyError: (A(6760),) in <function remove at 0xbe5110c> ignored
Exception KeyError: (A(6761),) in <function remove at 0xbe5110c> ignored
...
Exception KeyError: (A(968),) in <function remove at 0xbe5110c> ignored
Exception KeyError: (A(958),) in <function remove at 0xbe5110c> ignored
sage: len(D)   # in spite of the errors, the items *are* removed
0
Instead, sage.misc.weak_dict.WeakValueDictionary works just fine:
sage: D = sage.misc.weak_dict.WeakValueDictionary()
sage: D = mess(D,10000)
sage: len(D)
9000
sage: D.clear()
sage: len(D)
0
comment:14 Changed 5 years ago by
PS: In a yet-to-be-pushed commit, I am replacing all uses of weakref.WeakValueDictionary (e.g., in CachedRepresentation) by sage.misc.weak_dict.WeakValueDictionary. I am now running doctests. Keep your fingers crossed...
comment:15 Changed 5 years ago by
Here is an example concerning "garbage collection during iteration". Features of the example:
- Many hash collisions
- Reference cycles, so that Python's cyclic garbage collector comes into play
- The garbage collection is postponed until after the first iteration.
First, with weakref.WeakValueDictionary:
sage: class Cycle(object):
....:     def __init__(self, n):
....:         self.n = n
....:         self.ref = self
....:     def __cmp__(self, other):
....:         c = cmp(type(self),type(other))
....:         if c:
....:             return c
....:         return cmp(self.n, other.n)
....:     def __repr__(self):
....:         return "Cyc(%s)"%self.n
....:
sage: class Keys(object):
....:     def __init__(self, n):
....:         self.n = n
....:     def __hash__(self):
....:         return self.n%5
....:     def __cmp__(self, other):
....:         c = cmp(type(self),type(other))
....:         if c:
....:             return c
....:         return cmp(self.n, other.n)
....:     def __repr__(self):
....:         return "Key(%s)"%self.n
....:
sage: C = [Cycle(n) for n in range(100)]
sage: K = [Keys(n) for n in range(100)]
sage: import gc
sage: import sage.misc.weak_dict
sage: import weakref
sage: D = weakref.WeakValueDictionary()
sage: for k,c in zip(K,C):
....:     D[k] = c
....:
sage: del c
sage: gc.disable()
sage: del C
sage: for k,c in D.iteritems():
....:     print k,c
....:     gc.enable()
....:     _ = gc.collect()
....:
Key(0) Cyc(0)
Traceback (most recent call last):
...
RuntimeError: dictionary changed size during iteration
And with sage.misc.weak_dict.WeakValueDictionary:
sage: D = sage.misc.weak_dict.WeakValueDictionary(zip(K,C))
sage: len(D)
100
sage: gc.disable()
sage: del C
sage: for k,c in D.iteritems():
....:     print k,c
....:     gc.enable()
....:     _ = gc.collect()
....:
Key(0) Cyc(0)
Traceback (most recent call last):
...
IndexError: list index out of range
So, there is an error, too. Perhaps one could improve the error message.
comment:16 Changed 5 years ago by
- Commit changed from f0ed60fbdf5f2b1c7cd15f6bab98099f4ff8b822 to c3dba989f73c1eec0a66cf72ae203132fe33b5da
Branch pushed to git repo; I updated commit sha1. New commits:
[changeset:c3dba98] | Replace weakref.WeakValueDictionary by sage.misc.weak_dict.WeakValueDictionary |
[changeset:17b0236] | Documentation for WeakValueDictionary |
comment:17 Changed 5 years ago by
- Report Upstream changed from N/A to None of the above - read trac for reasoning.
- Status changed from new to needs_review
With the current commits, weakref.WeakValueDictionary is replaced by sage.misc.weak_dict.WeakValueDictionary everywhere in Sage. All doctests pass. The doctest coverage of the new module is 100%.
Hence, I set this to "needs review", and we will see whether we will report upstream.
Next, I'll try to construct finer-grained benchmarks, to see if there are aspects in which the implementation in weakref is still better.
New commits:
[changeset:c3dba98] | Replace weakref.WeakValueDictionary by sage.misc.weak_dict.WeakValueDictionary |
[changeset:17b0236] | Documentation for WeakValueDictionary |
comment:18 follow-up: ↓ 19 Changed 5 years ago by
Even though all tests pass, I wonder about one implementation detail. The callback for weakref.WeakValueDictionary looks like this:
def remove(wr, selfref=ref(self)):
    self = selfref()
    if self is not None:
        del self.data[wr.key]
Hence, the callback has a weak reference to the dictionary. In my implementation, it is just a method.
Why is a weak self-reference needed? Is this just to prevent problems that could arise when one of the internally used weak references creeps out of the dict?
comment:19 in reply to: ↑ 18 ; follow-up: ↓ 20 Changed 5 years ago by
Replying to SimonKing:
Why is a weak self-reference needed? Is this just to prevent problems that could arise when one of the internally used weak references creeps out of the dict?
See http://bugs.python.org/issue417795. I'm not convinced by its reasoning. Certainly, in Sage we're already depending on cyclic GC left and right, so I don't see the issue (we wouldn't dream of subclassing WeakValueDictionary with a __del__ method). It's also a very old fix, probably stemming from a time when people were still suspicious of the cyclic GC. If this stuff were to get merged into Python, it would probably need the weakref, but I don't think it needs it in Sage.
For the iterator stuff: in Python 3 this is fixed. See http://hg.python.org/cpython/file/default/Lib/weakref.py for starters (and http://hg.python.org/cpython/file/default/Lib/_weakrefset.py for _IterationGuard). You probably just want to backport that stuff wholesale into your own module.
comment:20 in reply to: ↑ 19 Changed 5 years ago by
Hi Nils,
Replying to nbruin:
See http://bugs.python.org/issue417795. I'm not convinced by its reasoning. Certainly, in Sage we're already depending on cyclic GC left and right, so I don't see the issue (we wouldn't dream of subclassing WeakValueDictionary with a __del__ method). It's also a very old fix, probably stemming from a time when people were still suspicious of the cyclic GC.
Yes, I expected it would be something like this.
If this stuff would get merged in Python, it would probably need the weakref, but I don't think it needs it in Sage.
OK.
For the iterator stuff: in Python 3 this is fixed. See http://hg.python.org/cpython/file/default/Lib/weakref.py for starters (and http://hg.python.org/cpython/file/default/Lib/_weakrefset.py for _IterationGuard). You probably just want to backport that stuff wholesale into your own module.
OK.
Since I believe that this shouldn't be exposed in the user interface, I think I'll try to base it on cdef attributes. Similarly, it might make sense to create the callback function during initialisation of the dictionary (with self provided via a closure, not by a weakref in a default argument) and store it in a cdef attribute of the dictionary. That way, it won't be exposed to the user. Ideally, dir(WeakValueDictionary) should return the same names as dir(dict).
comment:21 Changed 5 years ago by
What shall be done with pop()/popitem() in an iteration context? Should they return an item without removing it (resp. postpone removal)? Or should they raise a straight error? I'd prefer the latter.
If "del D[k]" is called in an iteration context, I suppose we simply want to postpone deletion until the iteration has finished.
comment:22 Changed 5 years ago by
I currently believe that D.pop(k) should return D[k] regardless of whether we are iterating or not. But if we are iterating, then the deletion of D[k] should be postponed. Likewise, D.popitem() should return an item and postpone its deletion if we are iterating.
Note that D[k] should include a test of whether the item is waiting to be deleted, and in that case raise a KeyError. Similarly, the iteration methods should only iterate over items that are not (yet) waiting to be deleted.
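The postponement semantics described above could be modelled like this (a plain-Python sketch with hypothetical names, not the ticket's Cython code): deletions are queued while any iterator is active and flushed when the last one finishes.

```python
class GuardedDict(object):
    """Toy model of postponing deletions while iterating (illustrative only)."""
    def __init__(self):
        self.data = {}
        self._iterating = 0      # number of active iterators
        self._pending = []       # keys scheduled for deletion

    def __delitem__(self, key):
        if self._iterating:
            if key not in self._pending:
                self._pending.append(key)   # postpone until iteration ends
        else:
            del self.data[key]

    def __getitem__(self, key):
        if key in self._pending:
            raise KeyError(key)             # already as good as deleted
        return self.data[key]

    def iterkeys(self):
        self._iterating += 1
        try:
            for k in self.data:             # safe: the dict cannot shrink now
                if k not in self._pending:  # skip items awaiting deletion
                    yield k
        finally:
            self._iterating -= 1
            if not self._iterating:
                while self._pending:        # flush postponed deletions
                    del self.data[self._pending.pop()]

# Usage: deleting during iteration no longer raises a RuntimeError.
d = GuardedDict()
d.data = {1: 'a', 2: 'b', 3: 'c'}
visited = []
for k in d.iterkeys():
    visited.append(k)
    del d[2]                  # queued, not executed
assert visited == [1, 3]      # 2 was skipped once scheduled for deletion
assert d.data == {1: 'a', 3: 'c'}
```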
comment:23 Changed 5 years ago by
- Commit changed from c3dba989f73c1eec0a66cf72ae203132fe33b5da to 70a7b8accb2d2156f315cbae2511985936977670
Branch pushed to git repo; I updated commit sha1. New commits:
[changeset:70a7b8a] | Guard WeakValueDictionary against deletions during iteration |
comment:24 Changed 5 years ago by
With the latest commit, there is a context manager that protects the dictionary from the negative effects of deletions. The doctests show what the context manager is capable of.
It is still "needs review". Perhaps one should test whether the whole thing is still faster than weakref.WeakValueDictionary. Since sage.misc.weak_dict.WeakValueDictionary now seems to be bulletproof with regard to garbage collection and explicit deletions, we should use it, unless it is significantly slower than Python's own implementation.
Challenge: Try to break it...
comment:25 follow-up: ↓ 26 Changed 5 years ago by
I couldn't resist trying to implement our extra primitive straight on top of Python's dict. It turns out to be fairly straightforward after reading lookdict_string and PyDict_DelItem in Python's Objects/dictobject.c. As an example:
sage: %attach deldict_exact.pyx
Compiling ./deldict_exact.pyx...
sage: D={}
sage: L=[]
sage: for n in prime_range(2,100):
....:     R=Integers(n)
....:     k=R(1)
....:     L.append(k)
....:     D[R(1)]=n
....:
sage: len(D)
25
sage: init_dummy()
sage: for l in L[3:10]:
....:     del_dictitem_exact(D,l,hash(l))
....:
sage: len(D)
18
sage: set(L[:3]+ L[10:]) == set(D.keys())
True
So the interface is del_dictitem_exact(D, key, hash), where hash is hash(key) (supplied separately for efficiency). This routine removes the entry D[key] if key identically occurs as a key in D. We could match on value too if we wanted, but these semantics make the most sense for a dictionary.
Problematic/hackish things:
- I have hardcoded PERTURB_SHIFT, since that constant isn't available outside dictobject.c.
- I had to write a programmatic routine to go looking for the dummy sentinel key value (which is used to mark deleted entries).
- I am relying on the internals of PyDictEntry and PyDictObject. However, I'm getting those from Python.h, so we shouldn't get silent failures on those.
With this extra primitive you could write a safer callback for WeakValueDictionary without needing the extra indirection. You'd call init_dummy upon instantiation of a WeakValueDictionary, to ensure that it's initialized.
EDIT: The value of PERTURB_SHIFT is of course critical, but it is a value that was found after extensive tuning. It is not going to change any time soon, so hardcoding it is not a big issue for maintainability. The extraction of dummy is of course clumsy, but fairly robust. Therefore, I think this approach is feasible, and it may be worth testing how it works. A lot of the routines in your WeakValueDictionary would become a lot simpler, avoiding the "double hashing" that happens now (access to the dict means computing the hash of what is already the hash of an actual key).
(I've tried your branch, by the way, but it crashes. Reverting your last two commits does give me something that seems to work)
comment:26 in reply to: ↑ 25 ; follow-up: ↓ 34 Changed 5 years ago by
Hi Nils,
do I understand correctly: What you tried means patching Python, and you want to propose this upstream, but do not necessarily want to include it in Sage yet?
Replying to nbruin:
(I've tried your branch, by the way, but it crashes. Reverting your last two commits does give me something that seems to work)
Strange. How does it crash? Since commit [changeset:c3dba98], the WeakValueDictionary is used in Sage. Are you saying it crashes at startup? What does it say?
comment:27 Changed 5 years ago by
- Commit changed from 70a7b8accb2d2156f315cbae2511985936977670 to e4adaebf2bd8a219f05b28746210cc0b0d0bba88
Branch pushed to git repo; I updated commit sha1. New commits:
[changeset:e4adaeb] | Implement copy and deepcopy for WeakValueDictionary |
comment:28 Changed 5 years ago by
I have just pushed a new commit, implementing copy and deepcopy.
New commits:
[changeset:e4adaeb] | Implement copy and deepcopy for WeakValueDictionary |
comment:29 Changed 5 years ago by
I use the following data for my benchmarks.
sage: import sage.misc.weak_dict
sage: import weakref
sage: L = [(p,GF(p)) for p in prime_range(10^5)]
sage: N = prime_range(10^6)
Note that this has no hash collisions. Below is an example that has lots of hash collisions, on purpose.
Creation of dictionaries
sage: %timeit D = weakref.WeakValueDictionary()
100000 loops, best of 3: 3.39 us per loop
sage: %timeit D = sage.misc.weak_dict.WeakValueDictionary()
1000000 loops, best of 3: 1.65 us per loop
=> Sage is faster
Initial assignments
sage: %timeit D = weakref.WeakValueDictionary(L)
10 loops, best of 3: 35.7 ms per loop
sage: %timeit D = sage.misc.weak_dict.WeakValueDictionary(L)
10 loops, best of 3: 38.4 ms per loop
=> Python is faster.
Since the creation of the empty dictionary is faster in Sage, it seems that there could be room for improvement.
Overriding assignments
sage: D = weakref.WeakValueDictionary(L)
sage: %timeit for k,v in L: D[k]=v
10 loops, best of 3: 40.6 ms per loop
sage: D = sage.misc.weak_dict.WeakValueDictionary(L)
sage: %timeit for k,v in L: D[k]=v
10 loops, best of 3: 37.1 ms per loop
=> Sage is faster
This may be surprising, as the initial assignment was slower!
Reading
sage: D = weakref.WeakValueDictionary(L)
sage: %timeit for k,v in L: w=D[k]
100 loops, best of 3: 12.3 ms per loop
sage: %timeit for k in N: w=D.get(k)
1 loops, best of 3: 326 ms per loop
sage: D = sage.misc.weak_dict.WeakValueDictionary(L)
sage: %timeit for k,v in L: w=D[k]
100 loops, best of 3: 5.54 ms per loop
sage: %timeit for k in N: w=D.get(k)
10 loops, best of 3: 31.3 ms per loop
=> Sage is much faster
Containment
sage: D = weakref.WeakValueDictionary(L)
sage: %timeit for k in N: k in D
1 loops, best of 3: 347 ms per loop
sage: D = sage.misc.weak_dict.WeakValueDictionary(L)
sage: %timeit for k in N: k in D
100 loops, best of 3: 18.8 ms per loop
=> Sage is very much faster
Item deletion
sage: def test():
....:     D = weakref.WeakValueDictionary(L)
....:     for k,v in L: del D[k]
....:
sage: %timeit test()
10 loops, best of 3: 44.9 ms per loop
sage: def test():
....:     D = sage.misc.weak_dict.WeakValueDictionary(L)
....:     for k,v in L: del D[k]
....:
sage: %timeit test()
10 loops, best of 3: 47.5 ms per loop
=> Python is a bit faster
Iteration
sage: D = weakref.WeakValueDictionary(L)
sage: %timeit _ = list(D)
1000 loops, best of 3: 312 us per loop
sage: %timeit _ = list(D.itervalues())
100 loops, best of 3: 2.88 ms per loop
sage: %timeit _ = list(D.iteritems())
100 loops, best of 3: 4.9 ms per loop
sage: D = sage.misc.weak_dict.WeakValueDictionary(L)
sage: %timeit _ = list(D)
100 loops, best of 3: 2.67 ms per loop
sage: %timeit _ = list(D.itervalues())
100 loops, best of 3: 2.22 ms per loop
sage: %timeit _ = list(D.iteritems())
100 loops, best of 3: 3.82 ms per loop
=> Iteration over the keys is much faster in Python; there seems to be room for improvement. But the other iterations are slightly faster in Sage.
Copying
Note that deepcopy for Python's weak value dictionaries is not exactly deep, as only the keys, but not the values, are copied:
sage: class C(object): pass
sage: v = C()
sage: w = C()
sage: D = weakref.WeakValueDictionary()
sage: D[1] = v
sage: D[w] = ZZ
sage: E = deepcopy(D)
sage: w in E   # keys are copied, which is good
False
sage: E[1] is D[1]   # values aren't copied
True
But this is actually correct, since copied values would be immediately garbage collected! So my implementation does likewise and only copies the keys, not the values.
Timings:
sage: D = weakref.WeakValueDictionary(L)
sage: %timeit E = copy(D)
10 loops, best of 3: 45.5 ms per loop
sage: %timeit E = deepcopy(D)
1 loops, best of 3: 453 ms per loop
sage: D = sage.misc.weak_dict.WeakValueDictionary(L)
sage: %timeit E = copy(D)
10 loops, best of 3: 46.1 ms per loop
sage: %timeit E = deepcopy(D)
1 loops, best of 3: 452 ms per loop
=> Python and Sage are comparable
With hash collisions
Here are the corresponding timings with hash collisions.
sage: class Keys(object):
....:     def __init__(self, n):
....:         self.n = n
....:     def __hash__(self):
....:         return self.n%5
....:     def __cmp__(self, other):
....:         c = cmp(type(self),type(other))
....:         if c:
....:             return c
....:         return cmp(self.n, other.n)
....:     def __repr__(self):
....:         return "Key(%s)"%self.n
....:
sage: L = [(Keys(n),Integers(n)) for n in range(4000)]
sage: N = [Keys(n) for n in range(8000)]

# initial assignments
sage: %timeit D = weakref.WeakValueDictionary(L)
1 loops, best of 3: 4.47 s per loop
sage: %timeit D = sage.misc.weak_dict.WeakValueDictionary(L)   # Sage wins
1 loops, best of 3: 2.22 s per loop

# overriding
sage: Dp = weakref.WeakValueDictionary(L)
sage: Ds = sage.misc.weak_dict.WeakValueDictionary(L)
sage: %timeit for k,v in L: Dp[k]=v
1 loops, best of 3: 2.28 s per loop
sage: %timeit for k,v in L: Ds[k]=v   # Sage slightly faster
1 loops, best of 3: 2.17 s per loop

# reading
sage: %timeit for k,v in L: w=Dp[k]
1 loops, best of 3: 2.33 s per loop
sage: %timeit for k,v in L: w=Ds[k]   # Sage slightly faster
1 loops, best of 3: 2.17 s per loop

# containment
sage: %timeit for k in N: k in Dp
1 loops, best of 3: 6.91 s per loop
sage: %timeit for k in N: k in Ds   # Sage slightly faster
1 loops, best of 3: 6.74 s per loop

# item deletion, using a different test
sage: def test(D,L):
....:     for k,v in L:
....:         del D[k]
....:         D[k] = v
....:
sage: %timeit test(Dp, L)
1 loops, best of 3: 6.88 s per loop
sage: %timeit test(Ds, L)   # Sage wins
1 loops, best of 3: 4.77 s per loop

# iteration
sage: %timeit _ = list(Dp)   # Python wins
10000 loops, best of 3: 130 us per loop
sage: %timeit _ = list(Ds)
1000 loops, best of 3: 508 us per loop
sage: %timeit _ = list(Dp.itervalues())
1000 loops, best of 3: 1.15 ms per loop
sage: %timeit _ = list(Ds.itervalues())   # Sage wins
1000 loops, best of 3: 523 us per loop
sage: %timeit _ = list(Dp.iteritems())
1000 loops, best of 3: 1.89 ms per loop
sage: %timeit _ = list(Ds.iteritems())   # Sage wins
1000 loops, best of 3: 885 us per loop

# copying
sage: %timeit E = copy(Dp)
1 loops, best of 3: 2.23 s per loop
sage: %timeit E = copy(Ds)   # Sage slightly faster
1 loops, best of 3: 2.09 s per loop
sage: %timeit E = deepcopy(Dp)
1 loops, best of 3: 2.41 s per loop
sage: %timeit E = deepcopy(Ds)   # Sage slightly faster
1 loops, best of 3: 2.36 s per loop
Sanity checks
Apparently we should expect the most severe problems in the case of hash collisions. Hence, we use the class Keys above. We verify that the results are as expected. Moreover, we test this during iteration over the keys of the dictionary.
sage: class Vals(object): pass
sage: L = [(Keys(n),Vals()) for n in range(4000)]
sage: D = sage.misc.weak_dict.WeakValueDictionary(L)
sage: for k in D.iterkeys():
....:     del L[0]
....:     assert len(L) == len(D)
....:
sage: len(D)
1310
It is of course not surprising that some items remain in the dict, since some are deleted before they are visited in the loop, and others are deleted after they are visited. The remaining items coincide with what is left in L:
sage: set(D.items()) == set(L)
True
Trying the same with Python's weak value dictionaries of course fails:
sage: L = [(Keys(n),Vals()) for n in range(4000)]
sage: D = weakref.WeakValueDictionary(L)
sage: for k in D.iterkeys():
....:     del L[0]
....:     assert len(L) == len(D)
....:
Traceback (most recent call last):
...
RuntimeError: dictionary changed size during iteration
At least the remaining items are correctly kept track of:
sage: set(D.items()) == set(L)
True
Conclusion
The timings show that with few exceptions the new implementation is faster than the generic implementation in Python's weakref module. This is not much of a surprise, since weakref is written in pure Python, while my implementation is in Cython.
According to the above timings, only the iteration over the keys of the dictionary is considerably faster in the old implementation. This is not much of a surprise, since for this task the old implementation can fully rely on iteration over a Python dict, whereas the new implementation needs to do more work. I will try to make this faster, though.
The new implementation behaves well in the case of hash collisions. This actually is a surprise for me, and in order to be on the safe side, it would make sense to add some consistency tests.
Other than that, the new implementation is not slower but much safer than the old implementation. We should use it---unless the crash you are getting is reproducible.
comment:30 Changed 5 years ago by
PS: I am now testing whether merging this into #15303 prevents the crashes I found there.
comment:31 Changed 5 years ago by
Interestingly, when merging this with #15303, I get one doctest failure, in sage.schemes.generic.divisor_group.DivisorGroup_curve._element_constructor_, namely:
File "src/sage/schemes/generic/divisor_group.py", line 272, in sage.schemes.generic.divisor_group.DivisorGroup_curve._element_constructor_
Failed example:
    DivZZ=C.divisor_group(ZZ)
Exception raised:
    Traceback (most recent call last):
      File "/home/king/Sage/git/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 480, in _run
        self.execute(example, compiled, test.globs)
      File "/home/king/Sage/git/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 839, in execute
        exec compiled in globs
      File "<doctest sage.schemes.generic.divisor_group.DivisorGroup_curve._element_constructor_[2]>", line 1, in <module>
        DivZZ=C.divisor_group(ZZ)
      File "/home/king/Sage/git/sage/local/lib/python2.7/site-packages/sage/schemes/plane_curves/curve.py", line 81, in divisor_group
        return DivisorGroup(self, base_ring)
      File "/home/king/Sage/git/sage/local/lib/python2.7/site-packages/sage/schemes/generic/divisor_group.py", line 52, in DivisorGroup
        DG = DivisorGroup_curve(scheme, base_ring)
      File "classcall_metaclass.pyx", line 330, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (sage/misc/classcall_metaclass.c:1224)
      File "/home/king/Sage/git/sage/local/lib/python2.7/site-packages/sage/schemes/generic/divisor_group.py", line 100, in __classcall__
        return UniqueRepresentation.__classcall__(cls, scheme, base_ring)
      File "cachefunc.pyx", line 1028, in sage.misc.cachefunc.WeakCachedFunction.__call__ (sage/misc/cachefunc.c:5628)
      File "weak_dict.pyx", line 655, in sage.misc.weak_dict.WeakValueDictionary.__getitem__ (sage/misc/weak_dict.c:4012)
    IndexError: list index out of range
So, it seems that I need to fix this.
comment:32 Changed 5 years ago by
PS: Again, it is a Heisenbug. Running the same test separately works fine.
comment:33 Changed 5 years ago by
Probably one should use the iteration context more often. That's why we see an index error, I presume: Garbage collection happens during __getitem__
, while we iterate over the hash bucket.
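The presumed failure mode can be sketched in plain Python (a hypothetical reconstruction, invented for illustration; this is not the actual weak_dict code): an index-based scan over a bucket stored as a flat list, with the bucket shrinking mid-scan the way a garbage-collection callback would shrink it.

```python
# Hypothetical reconstruction: a "hash bucket" as a flat list of
# alternating keys and values, scanned with a fixed-range loop.
bucket = [1, 'a', 2, 'b', 3, 'c']

def scan(bucket):
    found = []
    n = len(bucket)              # length captured once, like a fixed-range loop
    for i in range(0, n, 2):
        if i == 2:
            del bucket[0:2]      # simulates a callback removing an entry mid-scan
        found.append(bucket[i])  # index runs past the shrunk list -> IndexError
    return found

try:
    scan(bucket)
except IndexError as exc:
    print("IndexError:", exc)
```

This is consistent with the IndexError seen above: the loop bound was computed before the callback shortened the bucket.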
comment:34 in reply to: ↑ 26 ; follow-up: ↓ 35 Changed 5 years ago by
Replying to SimonKing:
Hi Nils,
do I understand correctly: What you tried means patching Python,
No, it doesn't mean patching python. The file I attached is straight cython, you can compile it on pretty much any sage, and it will give you the routine del_dictitem_exact
. You could use this to subclass dict
now to create a WeakValueDictionary and use that routine in the callback to clean up without triggering python equality testing. You'd end up with an implementation that stays very close to Python's original dict
. The only other thing to do is to ensure that values get wrapped and unwrapped upon entering/leaving with the appropriate KeyedRef (and backport the iteration protection if we care)
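For illustration, a minimal sketch of that idea in plain Python (my own invention; the class name SimpleWVD and its layout are not from the attached file): subclass dict and store weakref.KeyedRef wrappers as values. Note that this naive callback still removes the entry by key lookup, i.e. by Python equality testing on the key — exactly the step that the exact-value deletion routine discussed here would replace.

```python
import weakref

class SimpleWVD(dict):
    """Sketch of a weak-value dictionary built directly on dict."""
    def __setitem__(self, key, value):
        def callback(ref, selfref=weakref.ref(self)):
            d = selfref()
            if d is not None:
                # The unsafe step: removal by key lookup triggers Python
                # equality testing on the key.  del_dictitem_exact is
                # meant to replace precisely this.
                dict.pop(d, ref.key, None)
        # KeyedRef remembers the key, so the callback knows what to drop
        dict.__setitem__(self, key, weakref.KeyedRef(value, callback, key))

    def __getitem__(self, key):
        value = dict.__getitem__(self, key)()  # unwrap the weak reference
        if value is None:
            raise KeyError(key)
        return value
```

With CPython's reference counting, dropping the last strong reference to a stored value removes its key right away via the callback.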
Some arguments against doing so are:
- We rely on some of the fields of PyDictObject that are declared in Python.h but that are not part of the documented C-API of cpython
- We rely on the method python currently uses to "delete" elements from a dict
- We rely on the probing formula that python uses.
Arguments for:
- None of these things have changed in ages and they haven't changed in Python3 either, so from a practical point of view it's unlikely this will trigger incompatibilities
- If this changes, the implementation details of PyDictObject would be different and we'll notice this by changes in Python.h. So, even if this changes we won't get silent errors, we'll just get a compilation error, telling us exactly which code to adapt.
- I expect that this is more efficient and easier to write.
and you want to propose this upstream, but do not necessarily want to include it in Sage yet?
we could propose it, but it's not going to be in Python2 (I suspect it'll be considered a new feature, not a bug-fix). Even http://bugs.python.org/issue7105, which is recognised as a bug, hasn't been backported to 2.7.
Strange. How does it crash? By commit [changeset:c3dba98] the
WeakValueDictionary
is used in Sage. Are you saying it crashes at startup? What does it say?
I still have the problem:
┌────────────────────────────────────────────────────────────────────┐
│ Sage Version 5.13.beta0, Release Date: 2013-10-08                  │
│ Type "notebook()" for the browser-based notebook interface.        │
│ Type "help()" for help.                                            │
└────────────────────────────────────────────────────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Warning: this is a prerelease version, and it may be unstable.     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
-------------------------------------------------------------------------------
Resolving lazy import FacadeSets during startup
Calling stack:
  File "/usr/local/sage/sage-git/local/lib/python2.7/site-packages/IPython/core/application.py", line 175, in excepthook
    return self.crash_handler(etype, evalue, tb)
  File "/usr/local/sage/sage-git/local/lib/python2.7/site-packages/IPython/core/crashhandler.py", line 158, in __call__
    traceback = TBhandler.text(etype,evalue,etb,context=31)
  File "/usr/local/sage/sage-git/local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 412, in text
    tb_offset, context)
  File "/usr/local/sage/sage-git/local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 817, in structured_traceback
    locals,formatvalue=var_repr))
  File "/usr/local/sage/sage-git/local/lib/python/inspect.py", line 887, in formatargvalues
    specs.append(strseq(args[i], convert, join))
  File "/usr/local/sage/sage-git/local/lib/python/inspect.py", line 842, in strseq
    return convert(object)
  File "/usr/local/sage/sage-git/local/lib/python/inspect.py", line 884, in convert
    return formatarg(name) + formatvalue(locals[name])
  File "/usr/local/sage/sage-git/local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 725, in eqrepr
    def eqrepr(value, repr=text_repr): return '=%s' % repr(value)
  File "/usr/local/sage/sage-git/local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 702, in text_repr
    return pydoc.text.repr(value)
  File "/usr/local/sage/sage-git/local/lib/python/repr.py", line 24, in repr
    return self.repr1(x, self.maxlevel)
  File "/usr/local/sage/sage-git/local/lib/python/pydoc.py", line 970, in repr1
    return cram(stripid(repr(x)), self.maxother)
  File "/usr/local/sage/sage-git/local/lib/python2.7/site-packages/sage/categories/category.py", line 2293, in _repr_
    if len(categories) == 2 and Sets().Facades() in categories:
-------------------------------------------------------------------------------
**********************************************************************
Oops, Sage crashed. We do our best to make it stable, but...
I don't think the entire crash report is so insightful, but the error that causes the real problem is:
TypeError: Cannot create a consistent method resolution order (MRO) for bases VectorSpaces.element_class, CommutativeAdditiveGroups.element_class, CommutativeAdditiveGroups.element_class
comment:35 in reply to: ↑ 34 Changed 5 years ago by
Replying to nbruin:
I don't think the entire crash report is so insightful, but the error that causes the real problem is:
TypeError: Cannot create a consistent method resolution order (MRO) for bases VectorSpaces.element_class, CommutativeAdditiveGroups.element_class, CommutativeAdditiveGroups.element_class
Hmm. How can this be related with weak value dictionaries? Sure, categories are put in a weak dictionary (because of CachedRepresentation
), and join categories have an additional weak cache. Perhaps this is why?
The disadvantage of our git workflow is: putting experimental patches on trac wouldn't be a problem. But putting experimental commits on trac is a problem, since the experiments may show that the commit should be changed, which means to "rewrite history" (at least for some weird notion of history). I tend to ignore that this may constitute a problem (I believe it is a flaw in our new workflow) and will thus put an experimental commit in a few minutes.
comment:36 Changed 5 years ago by
- Commit changed from e4adaebf2bd8a219f05b28746210cc0b0d0bba88 to fab0ed4112b9f798e2690f4c885b57cd711ea698
Branch pushed to git repo; I updated commit sha1. New commits:
[changeset:fab0ed4] Use WeakValueDict's iteration guard more consequently
comment:37 Changed 5 years ago by
Commit's up. Changes: The iteration guard is now used on all internal iterations. So, it should now be impossible that garbage collection changes the length of a hash bucket while we iterate over the bucket.
I could not test yet whether the addition of iteration guards has a noticeable effect on the performance. Can you test whether the new commit fixes the crash for you?
comment:38 Changed 5 years ago by
Aha, I already found one doctest failure: In sage -t src/sage/schemes/elliptic_curves/ell_point.py
, apparently during the __classcall__
of a category, my implementation of weak value dictionaries raises an error.
Actually I think this error is for a good reason. __setitem__
says: RuntimeError: Can not add items while iterating over the dictionary
. Now I need to verify whether there really is an iteration, and either remove the undue attempt to assign an item during iteration, or the iteration guard is incorrectly assuming an iteration.
comment:39 Changed 5 years ago by
More precisely, it happens in the line return JoinCategory((_EuclideanDomains,CommutativeAlgebras(base_ring)))
. It could very well be that one is iterating in the join construction.
comment:40 Changed 5 years ago by
What exactly happens when a callback is called? I learnt that it can essentially happen any time. But is the callback then called in one go, or can it be that some lines of the callback are executed and then again some lines of the "mainline" code?
In the latter case, it would be entirely possible that D[k]=v
is attempted in the mainline code while the callback of another (garbage collected) dictionary value is working inside of an iteration context.
comment:41 follow-up: ↓ 42 Changed 5 years ago by
PS: If this is really the problem, then one could perhaps have self._pending_additions
, and not only self._pending_removals
.
comment:42 in reply to: ↑ 41 ; follow-up: ↓ 43 Changed 5 years ago by
Replying to SimonKing:
PS: If this is really the problem, then one could perhaps have self._pending_additions, and not only self._pending_removals.
Two notes:
- hash values are long. You're using int for them in your code in some places.
- I think you might be going overboard a bit with the iteration protection. Python dictionaries (and lists) are clearly documented as not safe to mutate while iterating over them. Weak dictionaries are a little different because you never know if there's a callback triggered by a GC, so iterating over a weak dict would never be safe. However, I think the only thing to be guarded against is callbacks during iteration (so just defer them and execute them when the guard goes down again). Any other mutation is just for the user to live with. Indeed, for debugging, you could raise an error if a mutation is attempted while an iterator is pending on the thing.
I'll try and make a weakvaluedict tonight, based on delete_exact
and see how it performs compared to your implementation.
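The deferral idea from the second note can be sketched like this (plain Python, names such as CallbackGuard invented here; the real implementation lives in the Cython file): keep a nesting counter while iterators are active, queue callbacks that arrive in the meantime, and flush the queue when the last guard is released.

```python
class CallbackGuard(object):
    """Defer weakref callbacks while iteration is in progress (sketch)."""
    def __init__(self):
        self.level = 0       # nesting depth of active iterations
        self.pending = []    # callbacks queued during iteration

    def run_or_defer(self, callback, ref):
        if self.level > 0:
            self.pending.append((callback, ref))  # dict is being iterated: queue
        else:
            callback(ref)                         # safe to mutate right away

    def __enter__(self):
        self.level += 1
        return self

    def __exit__(self, *exc_info):
        self.level -= 1
        # Flush only when the outermost iteration ends; re-check the level
        # each pass, since a flushed callback could start a new iteration.
        while self.level == 0 and self.pending:
            callback, ref = self.pending.pop()
            callback(ref)
        return False
```

This keeps the dictionary layout frozen for the duration of every iteration without raising errors at the user.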
comment:43 in reply to: ↑ 42 Changed 5 years ago by
Replying to nbruin:
Two notes:
- hash values are long. You're using int for them in your code in some places.
I thought so, too. However,
sage: type(hash(ZZ))
<type 'int'>
and that's why I chose int
in my code.
- I think you might be going overboard a bit with the iteration protection. Python dictionaries (and lists) are clearly documented as not safe to mutate while iterating over them.
Yes, but there should be an error and not a crash, in this case. Moreover, there should be no error if there is no iteration attempted by the user. And it seems to me that an iteration is detected where the user does not iterate.
Weak dictionaries are a little different because you never know if there's a callback triggered by a GC, so iterating over a weak dict would never be safe. However, I think the only thing to be guarded against is callbacks during iteration (so just defer them and execute them when the guard goes down again). Any other mutation is just for the user to live with.
Yes, but I think the situation here looks different. I really can not see any iteration in the examples that fail for me. Hence, I thought there might something else going on, which motivated my question on the details of the callback function.
With the current commit, the callback function protects its own internal iteration against other callbacks happening at the same time. Now, assume there is the callback of a value v_0
, and we want to assign some value D[k]=v
without iteration.
Question: Is it possible that some lines of D.__setitem__(k,v)
are executed, then some lines of the callback for v_0
are executed, and then further lines of D.__setitem__(k,v)
are executed, before finally the callback for v_0
finishes?
If the answer is "yes", then I know the problem:
- D.__setitem__(k,v) starts
- the callback for v_0 starts, enters the iteration context, but does not exit the context yet
- D.__setitem__(k,v) continues, and tests whether the dictionary is in an iteration context. Answer is "yes", and so an error is raised.
Can this happen?
comment:44 Changed 5 years ago by
Hm. A quick test has shown to me that at least in one of the failing examples the error does not occur during a callback.
comment:45 Changed 5 years ago by
Result: The error is raised in __setitem__, and the iteration guard is apparently invoked by __getitem__.
To be precise:
- The iteration context was entered by __getitem__ with the arguments ((<class 'sage.schemes.generic.divisor_group.DivisorGroup_curve'>, Affine Curve over Complex Field with 53 bits of precision defined by -x^9 + y^2 - x, Integer Ring), ()), which looks like a CachedRepresentation to me.
- This iteration context is not yet exited when __setitem__ is called with the arguments ((<class 'sage.categories.groupoid.Groupoid'>, Complex Field with 53 bits of precision), ())
comment:46 follow-up: ↓ 49 Changed 5 years ago by
A guess: Inside of the iteration context invoked by __getitem__, I of course have to do a comparison of the given key with the previously existing keys. Could it be that the comparison needs certain data to be computed, and this computation will call __setitem__?
But then I wonder: How does a straight Python dict prevent this from happening? I.e., if a hash bucket B is searched for equality with a key K, how does Python make sure that the comparison with K does not change the length of B?
comment:47 Changed 5 years ago by
Haha! Python's <dict> does in fact not prevent this from happening! I think I have just found a bug in Python!
Here is an example that exposes the problem:
sage: import sage.misc.weak_dict
sage: D = sage.misc.weak_dict.WeakValueDictionary()
sage: class Key(object):
....:     def __hash__(self):
....:         print "hash of",id(self)
....:         return 5
....:     def __cmp__(self, other):
....:         print "inserting 5"
....:         D[5] = ZZ
....:         return self is other
sage: D[Key()] = QQ
hash of 215823980
sage: D[Key()] = ZZ
hash of 216179948
inserting 5
Traceback (most recent call last):
...
RuntimeError: Can not add items while iterating over the dictionary
And this is what Python's dict does. There is no error, but it silently fails:
sage: D = {}
sage: class Key(object):
....:     def __hash__(self):
....:         print "hash of",id(self)
....:         return 5
....:     def __cmp__(self, other):
....:         print "inserting 5"
....:         D[5] = ZZ
....:         return self is other
sage: D[Key()] = QQ
hash of 215853068
sage: D[Key()] = ZZ
hash of 215853580
inserting 5
sage: len(D)
2
sage: D.keys()
[<__main__.Key at 0xcdda80c>, 5]
sage: id(_[0])
215853068
In other words, the second of the explicitly inserted items is missing! That's a bug, I believe.
The same of course also happens with Python's weak value dictionaries (of course, since they use a dict internally). Hence, Sage's CachedRepresentation
has been running into this bug all the time.
comment:48 follow-up: ↓ 51 Changed 5 years ago by
PS: We don't even need a hash collision to fool Python's dicts.
sage: D = {}
sage: class Key(object):
....:     def __hash__(self):
....:         print "hash of",id(self)
....:         return 6
....:     def __cmp__(self, other):
....:         print "inserting 5"
....:         D[5] = ZZ
....:         return self is other
sage: D[Key()] = QQ
hash of 215852940
sage: D[Key()] = ZZ
hash of 215852524
inserting 5
sage: len(D)
2
sage: D.keys()
[5, <__main__.Key at 0xcdda78c>]
sage: id(_[1])
215852940
comment:49 in reply to: ↑ 46 Changed 5 years ago by
Replying to SimonKing:
But then I wonder: How does a straight Python dict prevent this from happening? I.e., if a hash bucket B is searched for equality with a key K, how does Python make sure that the comparison with K does not change the length of B?
It doesn't (and also bucket lengths are irrelevant for Python dicts). It only has one contract: Entries in the dictionary ONLY change location when the allocated space of the dictionary is resized. And resizing ONLY happens when a (key,value) pair gets added to the dictionary that wasn't previously there. If the value of an already existing key gets updated, the dictionary is guaranteed to not be resized either.
That said, yes, when probing the dictionary, it's checked every iteration if the dictionary has been resized: dictobject.c line 1491. This bit used to have a warning that a devious advisary could make this not finish.
I need to look at your other example ...
comment:50 Changed 5 years ago by
Now one may wonder why this did not give rise to doctest failures before. Well, I think in this way one could construct an example of a CachedRepresentation
that fails to provide an identical instance on equal representations. But since there is no error raised and since non-unique parents do work, it went unnoticed.
comment:51 in reply to: ↑ 48 ; follow-up: ↓ 52 Changed 5 years ago by
Replying to SimonKing:
PS: We don't even need a hash collision with the implicitly inserted item to fool Python's dicts.
You haven't implemented equality in the way you think:
sage: k1=Key()
sage: k2=Key()
sage: k1 is k2, k1==k2
inserting 5
(False, True)
so python's dict acts consistently with the results of "==" as it should. (If you implement __cmp__, then 0 means equal, which is a falsey value. If you implement this with __eq__ instead, you see that dict works properly.)
comment:52 in reply to: ↑ 51 ; follow-up: ↓ 53 Changed 5 years ago by
Replying to nbruin:
You haven't implemented equality in the way you think:
Ouch, you are right.
In any case, I should think if there is a reasonable way to prevent this problem. Such as: Do not do
l = len(bucket)
for i from 0<=i<l:
    ...
but
while i<len(bucket):
    i += 1
    ...
That's slower, though.
comment:53 in reply to: ↑ 52 Changed 5 years ago by
Replying to SimonKing:
while i<len(bucket):
    i += 1
    ...

That's slower, though.
No, that's still not safe (apart from the fact you'd do i+=2). If i=2, you could have a del bucket[0:2]
) . If i=2, you could have a del bucket[0:2]
happen and if the bucket was long enough, you'll now be skipping a key-value pair. That could be the pair you were supposed to find! You might be able to get by with placing "not in use" sentinels in the bucket slots you're not using anymore, instead of properly deleting them. That's what python does (see below), but then cleaning/reusing that garbage comes natural in python's scheme.
If your dictionary changes, you have to start over. Note that this isn't an issue in TripleDict
etc., nor is it for exclusively string-keyed dictionaries in python, because then you can guarantee the dict won't change as a result of your equality tests.
To see the kind of care necessary (and taken in dict!):
D={}
class EvilKey(object):
    def __init__(self,nr):
        self.evil = False
        self.nr = nr
    def __hash__(self):
        return 0
    def __repr__(self):
        return "Evil key nr. %s"%self.nr
    def __eq__(self,other):
        if self.evil:
            del D[0]
        return self is other
k1=EvilKey(1)
k2=EvilKey(2)
D[0]='a'
D[k1]='b'
k1.evil=True
D[k2]='c'
k1.evil=False
D[k1]='d'
In this script, D
does not get reshaped, so python does not "start over" at any point. However, a side effect of D[k2]='c'
is that an earlier slot (previously taken by 0
) becomes available again. This slot is already passed, in the probing for finding a spot for D[k2]
, so its value does not end up there. This seems dangerous for the D[k1]='d'
assignment: a free slot would be encountered before k1
is found to already be a key. dict
doesn't get fooled, though:
the slot previously occupied for D[0]
is marked as "freed", not as "never used".
So, python continues probing the entire bucket (for as far as you can talk about buckets in an open-coded hash table) and indeed finds the key k1
is already in there and uses that entry. If it would have found that the key k1
didn't occur yet, it would have reused the freed slot it found earlier.
I'd say this is all a pretty good argument to try and stick with python's dict
. Coding a bullet-proof, high performance hash table is not an easy task.
comment:54 Changed 5 years ago by
I have attached a WeakValueDictionary based on your prototype, but directly implemented on dict
. It's a drop-in replacement, so you can use it right away (and you can probably more easily test its performance). In my preliminary checks it seems (much) faster than weakref.WeakValueDictionary
in all situations (not surprising since it's cythonized) and I think also always a bit faster than your prototype (it should be with one level of indirection less)
I have not made explicit modifications "safe" for dictionary iteration, although in most cases they should be fairly safe: python's dict only reshapes on adding entries, so mixing enumerating entries and removing them shouldn't lead to huge problems (except for non-determinism), but I think doing so is always a bug waiting to happen.
Notes I found when adapting your code:
- if we only use pending_removals for callbacks, there's no need to check that list for retrieval etc. You can recognize these entries from the dead weakref.
- in _IterationContext.__exit__ you must check the guard level every time. Key deletion could have any consequence, including creation of new iterators.
- for now I have left len as the length of the dict itself. We could make it len(self)-len(self.pending_removals), but length of a WeakValueDictionary doesn't say much anyway.
- there's a PyObject_Hash that's a little more efficient than calling hash (you get a long immediately).
Interesting detail: when I put this file in place in my sage-git, your branch worked! That indicates that the MRO error reported above is indeed triggered by some changed memory management or dict order.
Debugging by seeing if dicts that have an active iterator get mutated might be useful. But note that people can keep iterators around without intention to consume them any more.
comment:55 follow-up: ↓ 56 Changed 5 years ago by
Even python's dict
isn't quite properly guarded against mutating iteration: they only check that the size doesn't change from one yield to the next, but that isn't enough, of course:
D=dict( (i,i) for i in range(5))
for k in D:
    print "processing key",k
    M=max(D.keys())+1
    print "adding 10 starting from",M
    for i in range(M,M+1000,100):
        D[i]=1
    L=D.keys()[:10]
    print "deleting ",L
    for j in L:
        del D[j]
produces some interesting behaviour. In particular, it iterates 12 times and results in the keys [2107, 2108, 3709, 3710, 5311]
. Of course, I wouldn't know what the program should do.
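For reference, current CPython surfaces that size check as a RuntimeError between yields; the point stands that only the *size* is compared, so a balanced add/delete pair would still slip through. A minimal demonstration of the detected case:

```python
# CPython's only mutation guard for dict iteration is a size comparison
# between yields.  Growing the dict mid-iteration trips it:
D = {i: i for i in range(5)}
caught = False
try:
    for k in D:
        D[k + 100] = 1   # size changes -> next() raises RuntimeError
except RuntimeError:
    caught = True        # "dictionary changed size during iteration"
print(caught)
```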
comment:56 in reply to: ↑ 55 ; follow-up: ↓ 59 Changed 5 years ago by
Replying to nbruin:
Even python's dict isn't quite properly guarded against mutating iteration: they only check that the size doesn't change from one yield to the next, but that isn't enough, of course:
Aha. That's safer in my implementation (I think): During iteration, I protect against changing the length of a bucket, not just against changing the length of the whole dict. Anyway, I do believe that it is enough to have a weak value dictionary that is as robust as a plain dict---but we don't need to be better than <dict>.
Some remarks/questions about your code:
- I wanted to call PyWeakref_GetObject with borrowed references, but it somehow didn't work. Why is it working in your code?
- You still do cdef PyObject* Py_None = <PyObject*>None. Couldn't we import Py_None from somewhere? Unfortunately I couldn't find it, although Py_None is mentioned in the documentation of the C-API.
- In del_dictitem_by_exact_value, you ask #perhaps we should exit silently if no entry is found?. I agree. Namely, you use this function during a callback (that's the only place), and I think we really don't want an error being raised there. Could it be that the callback of a reference can not find the item that contains the reference? I think so, by a weird race condition! Namely:
  - Create an item D[k] = v, with weak reference r to v
  - delete v, but make sure that v does not become garbage collected yet
  - Do del D[k]
  - D.__delitem__(k) proceeds until (k,r) is removed from the dictionary, but it does not return yet.
  - Just before r is freed inside of __delitem__, a garbage collection happens on v. Hence, the callback of r is executed.
  - The callback finds that (k,r) is not in the dict and raises an error
- Why is del_dictitem_by_exact_value cpdef and not cdef? Does <void *>value cost a CPU cycle? If this is so, one shouldn't call it in a while loop, and instead have cdef del_dictitem_by_exact_value(dict D, PyObject *value_addr, long hash).
- In pop(), you say #we turn out into a new reference right away because.... However, wouldn't the following save a CPU cycle in case of an error:

      cdef PyObject *out_ref = PyWeakref_GetObject(wr)
      if out_ref==Py_None:
          raise KeyError(k)
      out = <object>out_ref
      del self[k]
      return out

- The last line of __contains__ should be deleted, as it will never be executed (in my code, I have a while loop, and thus I need to have return False after the while loop).
- Thanks for spotting the race condition in _IterationContext.__exit__.
I suggest that I'll merge your code into my branch, do the changes suggested above and create a new commit.
comment:57 Changed 5 years ago by
- Commit changed from fab0ed4112b9f798e2690f4c885b57cd711ea698 to 851cc9522dde332561101f1c84182a0a84b8eed4
Branch pushed to git repo; I updated commit sha1. New commits:
[changeset:851cc95] Avoid some pointer casts in WeakValueDict callbacks
[changeset:246518f] Use <dict>'s internals in WeakValueDictionary and do not reinvent the bucket.
comment:58 Changed 5 years ago by
- Reviewers set to Simon King
With the current commits, make ptest
says
----------------------------------------------------------------------
All tests passed!
----------------------------------------------------------------------
Total time for all tests: 4634.7 seconds
    cpu time: 7269.1 seconds
    cumulative wall time: 8916.6 seconds
For the record, I am happy with the changes introduced by Nils.
comment:59 in reply to: ↑ 56 ; follow-up: ↓ 60 Changed 5 years ago by
Replying to SimonKing:
Some remarks/questions about your code:
- I wanted to call PyWeakref_GetObject with borrowed references, but it somehow didn't work. Why is it working in your code?
Because in the cython pxd header files, they define all those routines with <object>
type parameters. That tends to work nicer, because it means the routine can "borrow" the reference itself. In the cython file I referenced there is a warning about reference counting and a remark that you can change that to <PyObject *>
if you know what you're doing, which I pretended here.
As you are aware, a type cast from <PyObject *> to <object> generates code: an incref.
- You still do cdef PyObject* Py_None = <PyObject*>None. Couldn't we import Py_None from somewhere? Unfortunately I couldn't find it, although Py_None is mentioned in the documentation of the C-API.
It's in object.h, line 839 or so:
PyAPI_DATA(PyObject) _Py_NoneStruct; /* Don't use this directly */
#define Py_None (&_Py_NoneStruct)
We can do "cdef extern PyObject * Py_None" without problem.
- In del_dictitem_by_exact_value, you ask #perhaps we should exit silently if no entry is found?. I agree.
We've considered things like this when working on TripleDict and I don't think we ever found a smoking gun. It seems that Python tries to organize things such that callbacks don't have to run. Obviously, when I was testing this routine I DID want to see the errors. I'm fine with silent; I would prefer to see a scenario actually play out where it is necessary.
- D.__delitem__(k) proceeds until (k,r) is removed from the dictionary, but it does not return yet.
- Just before r is freed inside of __delitem__, a garbage collection happens on v. Hence, the callback of r is executed.
The only shot you have for this is the dereffing of k. If the value is freed before the key, this never happens. The deletion code in delitem_..._exact is copied from Python's PyDict_DelItem, and there also the value is dereffed before the key is. So this scenario doesn't happen.
- Why is del_dictitem_by_exact_value cpdef and not cdef? Does <void *>value cost a CPU cycle? If this is so, one shouldn't call it in a while loop, and instead have cdef del_dictitem_by_exact_value(dict D, PyObject *value_addr, long hash).
No, it's a cast from one pointer type to another, so is free. It was def
before (the routine is in principle safe to call from python too, except that its semantics are a bit wonky), but I don't mind if it's something else.
- In pop(), you say #we turn out into a new reference right away because.... However, wouldn't the following save a CPU cycle in case of an error:

      cdef PyObject *out_ref = PyWeakref_GetObject(wr)
      if out_ref==Py_None:
          raise KeyError(k)
      out = <object>out_ref
      del self[k]
      return out
Yes, I figured the error raising and handling would swamp a "cycle" (whatever that is on a modern CPU) anyway, so I went for the shortest code and making sure that we don't waste an allocation in the most common code path (although a good compile should optimize that out).
- The last line of __contains__ should be deleted, as it will never be executed (in my code, I have a while loop, and thus I need to have return False after the while loop).
Ah right! I noted the "unreachable code" warning, but when you load
something into sage, the reported line numbers are off, so I thought it was complaining about unreachable comments ...
I suggest that I'll merge your code into my branch, do the changes suggested above and create a new commit.
cool.
comment:60 in reply to: ↑ 59 Changed 5 years ago by
Replying to nbruin:
I suggest that I'll merge your code into my branch, do the changes suggested above and create a new commit.
cool.
... which I already did, and I also gave a positive review to your changes.
comment:61 follow-ups: ↓ 64 ↓ 71 Changed 5 years ago by
OK, we don't have a doctest for del_dictitem_by_exact_value
. Given that we absolutely want to know if somehow python changes something about their dicts so that our implementation needs adapting, we should include a stress test (perhaps #long
?). Something along the lines of:
sage: from sage.misc.weak_dict import del_dictitem_by_exact_value
B=1000
L=list(range(B))
D1=dict()
D2=dict()
for i in range(100000):
    ki=L[floor(random()*B)]
    vi=L[floor(random()*B)]
    D1[ki]=vi
    D2[ki]=vi
    ko=L[floor(random()*B)]
    if ko in D1:
        vo=D1[ko]
        del D1[ko]
        del_dictitem_by_exact_value(D2,vo,hash(ko))
    assert D1 == D2
comment:62 follow-up: ↓ 63 Changed 5 years ago by
Aw, shoot. We do have a problem:
sage: from sage.misc.weak_dict import WeakValueDictionary
sage: D=WeakValueDictionary()
sage: V=set([2])
sage: W=set([3])
sage: D[1]=V
sage: i=D.iteritems(); d=i.next(); del d   #raise the guard
sage: del V    #queue a callback
sage: D[1]=W   #invalidate the callback (but it's queued!)
sage: len(D)
1
sage: del i    #lower guard; callback gets triggered.
Exception KeyError: KeyError('key not found',) in <generator object at 0x65b0140> ignored
sage: len(D)
1
so it seems indeed that if there are queued removals, any other destructive mutation (changing the value of an existing key or removing a key) should check if the corresponding value is queued and, if so, remove it.
Lemma: We can get away with silently abandoning the callback if we don't find the value. This is OK if we're sure the value we're looking for (the weakref) still exists in its original form and hasn't been reused. And we know that because we're holding an <object>
, not just an id of an object! So we've kept the weakref alive ourselves and prevented that id from being reused until the callback actually happens or is thrown away.
So: yes, we should remain silent on failure.
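The lemma's identity argument can be made concrete with a small sketch (plain Python, invented names): when flushing deferred removals, delete an entry only if the exact wrapper object we queued is still the stored value. Because we hold a strong reference to the queued wrapper, its id cannot have been recycled in the meantime.

```python
def flush_deferred(d, deferred):
    """Process queued removals, skipping entries that were re-assigned."""
    for key, wrapper in deferred:
        # Identity, not equality: the queued wrapper is kept alive by the
        # queue itself, so `is` reliably detects an invalidated callback.
        if d.get(key) is wrapper:
            del d[key]

w1, w2 = object(), object()   # stand-ins for dead weakref wrappers
d = {1: w1, 2: w2}
d[1] = object()               # key 1 re-assigned while its callback was queued
flush_deferred(d, [(1, w1), (2, w2)])
```

Key 1 survives (its stored wrapper is no longer the queued one), while key 2 is removed as intended.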
comment:63 in reply to: ↑ 62 Changed 5 years ago by
Replying to nbruin:
Aw, shoot. We do have a problem:
You mean: We would have a problem when the callback raised an error. With the current commits (that silently return without error!) I get
sage: from sage.misc.weak_dict import WeakValueDictionary
sage: D=WeakValueDictionary()
sage: V=set([2])
sage: W=set([3])
sage: D[1]=V
sage: i=D.iteritems(); d=i.next(); del d   #raise the guard
sage: del V    #queue a callback
sage: D[1]=W   #invalidate the callback (but it's queued!)
sage: len(D)
1
sage: del i
sage: len(D)
1
which I believe is correct.
So: yes, we should remain silent on failure.
We already do...
comment:64 in reply to: ↑ 61 Changed 5 years ago by
Replying to nbruin:
OK, we don't have a doctest for del_dictitem_by_exact_value. Given that we absolutely want to know if somehow python changes something about their dicts so that our implementation needs adapting, we should include a stress test (perhaps #long?). Something along the lines of:

sage: from sage.misc.weak_dict import del_dictitem_by_exact_value
With the current commits, del_dictitem_by_exact_value is cdef and can thus not be imported. We need an indirect doctest.
comment:65 Changed 5 years ago by
OK, I updated my local git repository but can't push it to the ticket here (I can't push it anywhere), so I've attached the changed file. git commit log:
commit dbeb58ba955ebfe9b37d4241871e0d8f41b20a1d
Author: Nils Bruin <nbruin@sfu.ca>
Date:   Thu Oct 31 12:16:11 2013 -0700

    documentation and clean-up of del_dictitem_by_exact_value.
    In particular:
     - add a hash check so that behaviour is a little better defined
     - include a stress test to verify that it behaves as documented
     - put a test on WeakValueDictionary to verify that deferred callbacks
       that become invalidated are dealt with appropriately.
comment:66 Changed 5 years ago by
I have tested (by inserting print statements into del_dictitem_by_exact_value) that the following is a valid test for the callback silently doing the right thing:
sage: from sage.misc.weak_dict import WeakValueDictionary
sage: V = [set(range(n)) for n in range(5)]
sage: D = WeakValueDictionary(enumerate(V))
sage: for k in D.iterkeys():
....:     V[k] = None
....:     D[k] = ZZ
So, this could be added as a test.
I just see that you have also tried to push something. I'll see what you suggest, will merge, and then I'll push the result.
comment:67 follow-ups: ↓ 68 ↓ 69 Changed 5 years ago by
Nils, could you please try to download the commits that I have pushed? Your file reverts changes that I already did (and I think should be done).
comment:68 in reply to: ↑ 67 Changed 5 years ago by
Replying to SimonKing:
Nils, could you please try to download the commits that I have pushed? Your file reverts changes that I already did
That said: Please wait with the download until I have pushed the merge of our latest versions.
comment:69 in reply to: ↑ 67 ; follow-up: ↓ 70 Changed 5 years ago by
Replying to SimonKing:
Nils, could you please try to download the commits that I have pushed? Your file reverts changes that I already did (and I think should be done).
I did. My changes were based on:
commit 851cc9522dde332561101f1c84182a0a84b8eed4
Author: Simon King <simon.king@uni-jena.de>
Date:   Thu Oct 31 14:37:21 2013 +0100

    Avoid some pointer casts in WeakValueDict callbacks
I reverted those things necessary to have del_dictitem_by_exact_value in cpdef form. If you prefer to have a cdef _del_dictitem_by_exact_value version with a def del_dictitem_by_exact_value, that's fine with me. For doctesting, I think we want to expose the routine on python level. It should be easy to prove that the routine works as advertised/show problems.
(collaborating with git really is an improvement, isn't it :-)
comment:70 in reply to: ↑ 69 ; follow-up: ↓ 72 Changed 5 years ago by
Replying to nbruin:
I reverted those things necessary to have del_dictitem_by_exact_value in cpdef form. If you prefer to have a cdef _del_dictitem_by_exact_value version with a def del_dictitem_by_exact_value, that's fine with me. For doctesting, I think we want to expose the routine on python level.
No, I think it is usual in sage to have indirect doctests for cdef functions. And we do have indirect tests in this case.
It should be easy to prove that the routine works as advertised/show problems.
Yes, indirectly.
(collaborating with git really is an improvement, isn't it :-)
:-?
comment:71 in reply to: ↑ 61 Changed 5 years ago by
Replying to nbruin:
OK, we don't have a doctest for del_dictitem_by_exact_value. Given that we absolutely want to know if somehow python changes something about their dicts so that our implementation needs adapting, we should include a stress test (perhaps #long?). Something along the lines of:

sage: from sage.misc.weak_dict import del_dictitem_by_exact_value
B=1000
L=list(range(B))
D1=dict()
D2=dict()
for i in range(100000):
    ki=L[floor(random()*B)]
    vi=L[floor(random()*B)]
    D1[ki]=vi
    D2[ki]=vi
    ko=L[floor(random()*B)]
    if ko in D1:
        vo=D1[ko]
        del D1[ko]
        del_dictitem_by_exact_value(D2,vo,hash(ko))
    assert D1 == D2
I think, for an indirect doctest, one should modify it so that del_dictitem_by_exact_value is executed because a callback happens. I'll try to create an example accordingly. Roughly: We have two equal dicts; we explicitly delete an item in the first dict, and implicitly delete the same item in the second dict, by allowing the value to become garbage collected.

For a weak value dictionary we can of course not use a list L of ints. What I don't understand: Is it essential in your example that keys and values come from the same list? Otherwise, I would simply have a list and two weak value dictionaries keeping track of the same key-value pairs, so that a callback is triggered if one deletes a value from the list.
comment:72 in reply to: ↑ 70 Changed 5 years ago by
Replying to SimonKing:
No, I think it is usual in sage to have indirect doctests for cdef functions. And we do have indirect tests in this case.
I don't think the indirect tests are thorough enough in this case. The implicit tests that we have presently probably don't exercise the probing function at all (we probably find what we're looking for, or NULL, on the first try).
I would like to see the test I wrote, or something similar, in the regular test suite (perhaps scaled down a bit or marked #long if it's too expensive in its present form). We are relying here on implementation details that python doesn't explicitly guarantee.
I am taking keys and values from the same list so that there's a chance to screw up the dictionary if they inadvertently get confused (i.e., if the layout of mp.ma_table changes).
Why waste effort on trying to come up with an indirect test if you can just test it directly? Call the routine _test_of_del_dict_dont_bother_calling_this_for_other_reasons if you like.
I think the semantics of the routine are well-defined, though, so we shouldn't go out of our way to hide it.
comment:73 Changed 5 years ago by
Bad news. When trying to turn your stress test into an indirect test, things didn't work for me:
sage: B=3
sage: L = [None]*B
sage: D1 = WeakValueDictionary()
sage: D2 = WeakValueDictionary()
sage: import gc
sage: for i in range(100):
....:     ki = floor(random()*B)
....:     vi = C(floor(random()*B))
....:     print "assign",ki,id(vi),"to D1"
....:     D1[ki] = vi
....:     print "assign",ki,id(vi),"to D2"
....:     D2[ki] = vi
....:     print "assign",ki,id(vi),"to L"
....:     L[ki] = vi
....:     ko = floor(random()*B)
....:     if ko in D1:
....:         print "delete",ko,id(L[ko]),"from D1"
....:         del D1[ko]
....:         print "delete",ko,id(L[ko]),"from L"
....:         L[ko] = None
....:     _ = gc.collect()
....:     assert D1 == D2
assign 2 220846028 to D1
assign 2 220846028 to D2
assign 2 220846028 to L
assign 0 220845964 to D1
assign 0 220845964 to D2
assign 0 220845964 to L
assign 1 220845932 to D1
assign 1 220845932 to D2
assign 1 220845932 to L
delete 0 220845964 from D1
delete 0 220845964 from L
bye-bye 220845964
assign 2 221294892 to D1
assign 2 221294892 to D2
assign 2 221294892 to L
bye-bye 220846028
delete 1 220845932 from D1
delete 1 220845932 from L
bye-bye 220845932
assign 0 221294764 to D1
assign 0 221294764 to D2
assign 0 221294764 to L
delete 2 221294892 from D1
delete 2 221294892 from L
bye-bye 221294892
assign 1 221294700 to D1
assign 1 221294700 to D2
assign 1 221294700 to L
delete 1 221294700 from D1
delete 1 221294700 from L
Traceback (most recent call last):
...
AssertionError:
Do you think this is from a design flaw (race condition) in my test? Or do you think it is a genuine problem of our code?
To my understanding, deletion of a value (and thus execution of the callback) should happen immediately after replacing the value on the list L: Cyclic garbage collection is not involved, and hence there should be no delay. However, just to be on the safe side, I explicitly added a garbage collection.
Or did I misunderstand how deallocation of Python objects without reference cycles is working? Can there be a delay?
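For the record: in CPython, deallocation of acyclic objects is immediate (refcounting), and only objects caught in reference cycles wait for the cyclic collector. A hedged sketch illustrating both cases (the class name `C` and the `events` log are illustrative; the timing guarantees are CPython-specific):

```python
import gc
import weakref

events = []

class C(object):
    pass

gc.disable()                 # rule out the cyclic collector firing on its own
v = C()
wr = weakref.ref(v, lambda r: events.append("plain"))
del v                        # refcount hits zero: callback fires immediately
assert events == ["plain"]

a = C(); b = C()
a.partner = b; b.partner = a               # reference cycle
wr2 = weakref.ref(a, lambda r: events.append("cyclic"))
del a, b                                   # unreachable, but not collected yet
assert events == ["plain"]
gc.collect()                               # only the cyclic collector frees it
assert events == ["plain", "cyclic"]
gc.enable()
```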
comment:74 Changed 5 years ago by
My observation, after repeating the tests a couple of times: The error happens as soon as ki==ko. I don't understand why, though. Anyway, for tonight, I should stop working on this.
comment:75 Changed 5 years ago by
Arrgh, I am so stupid!! Of course, if ki==ko then vi lurks around and prevents the item in D2 from "death-by-garbage-collection"!
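The ki==ko pitfall in miniature (plain Python, relying on CPython refcounting): as long as any strong reference lurks, the weakly referenced value cannot die, so no callback fires. The variable names below just mirror the loop above:

```python
import weakref

v = set([1])
wr = weakref.ref(v)
vi = v                    # the "lurking" extra reference, like vi in the loop
del v
assert wr() is not None   # still alive: vi keeps the set from being collected
del vi
assert wr() is None       # dead as soon as the last strong reference is gone
```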
This works:
sage: from sage.misc.weak_dict import WeakValueDictionary
sage: class C(object):
....:     def __init__(self, n):
....:         self.n = n
....:     def __hash__(self):
....:         return hash(self.n)
....:     def __cmp__(self, other):
....:         return cmp(type(self),type(other)) or cmp(self.n, other.n)
sage: B=10
sage: L = [None]*B
sage: D1 = WeakValueDictionary()
sage: D2 = WeakValueDictionary()
sage: import gc
sage: for i in range(10000):
....:     ki = floor(random()*B)
....:     vi = C(floor(random()*B))
....:     D1[ki] = vi
....:     D2[ki] = vi
....:     L[ki] = vi
....:     del vi
....:     ko = floor(random()*B)
....:     if ko in D1:
....:         del D1[ko]
....:         L[ko] = None
....:     assert D1 == D2
Do you agree that this test is an indirect version of your test, as del_dictitem_by_exact_value is triggered by the line L[ko] = None?
comment:76 Changed 5 years ago by
PS: You are right, many cdef functions in Sage have a python function that tests them. Perhaps we can have both, namely an explicit and an implicit test?
Another question: Do you think it is a good idea to make B so large (1000, in your original example)? The smaller B, the more collisions we have, don't we? On the other hand, for large B we should probably have more different buckets being used.
comment:77 Changed 5 years ago by
With B=500 and 50000 runs, the indirect test takes 23 CPU seconds on my laptop. Would this be good for a "normal" test, or would this need to be called "long"?
I guess your direct test is faster, since you don't need to create instances of C.
comment:78 Changed 5 years ago by
This version of your test (using a python wrapper for the cdef function) needs 12.5 CPU-seconds on my laptop. I think that's fine for a long test.
def test_del_dictitem_by_exact_value(D, value, h):
    """
    This function helps testing some cdef function used to delete dictionary items.

    INPUT:

    - ``D`` -- a Python ``<dict>``.
    - ``value`` -- an object that is a value in ``D``.
    - ``h`` -- the hash of the key under which to find ``value`` in ``D``.

    <Further doc to be added>

    TESTS:

    See :trac:`13394` for a discussion. ::

        sage: from sage.misc.weak_dict import test_del_dictitem_by_exact_value
        sage: B=1000
        sage: L=list(range(B))
        sage: D1=dict()
        sage: D2=dict()
        sage: for i in range(100000):  # optional: long
        ....:     ki=L[floor(random()*B)]
        ....:     vi=L[floor(random()*B)]
        ....:     D1[ki]=vi
        ....:     D2[ki]=vi
        ....:     ko=L[floor(random()*B)]
        ....:     if ko in D1:
        ....:         vo=D1[ko]
        ....:         del D1[ko]
        ....:         test_del_dictitem_by_exact_value(D2,vo,hash(ko))
        ....:     assert D1 == D2
    """
    return del_dictitem_by_exact_value(<PyDictObject *>D, <PyObject *>value, h)
The following down-sized indirect test takes about 2 seconds and is thus fine for a non-long test:
sage: from sage.misc.weak_dict import WeakValueDictionary
sage: class C(object):
....:     def __init__(self, n):
....:         self.n = n
....:     def __cmp__(self, other):
....:         return cmp(type(self),type(other)) or cmp(self.n, other.n)
sage: B = 100
sage: L = [None]*B
sage: D1 = WeakValueDictionary()
sage: D2 = WeakValueDictionary()
sage: import gc
sage: for i in range(10000):
....:     ki = floor(random()*B)
....:     vi = C(floor(random()*B))
....:     D1[ki] = vi
....:     D2[ki] = vi
....:     L[ki] = vi
....:     del vi
....:     ko = floor(random()*B)
....:     if ko in D1:
....:         del D1[ko]
....:         L[ko] = None
....:     assert D1 == D2
comment:79 Changed 5 years ago by
- Commit changed from 851cc9522dde332561101f1c84182a0a84b8eed4 to e60890eea7e9d108431d183c897b37482b12e4cf
Branch pushed to git repo; I updated commit sha1. New commits:
[changeset:e60890e] Add direct and indirect stresstests for the weak value callbacks
comment:80 Changed 5 years ago by
The latest commit adds a small indirect stresstest and a long (nearly) direct stresstest, using a python function that wraps the cdef function (and could in principle be used in python code).
For me, running sage -t --long --verbose src/sage/misc/weak_dict.pyx takes 21 seconds, without errors.
Hence, from my perspective it is good to go.
comment:81 follow-up: ↓ 82 Changed 5 years ago by
I've attached some changes (both as a patch and the file that should be the end result).
Some small type corrections and some major example changes for _IterationContext: the examples were testing behaviour that the guard doesn't influence in its current incarnation. I have replaced them with an example that should also illustrate that people should really only use a WeakValueDictionary if they really need it, because their behaviour can be very strange (especially when the same objects occur as keys and values).
I'd be OK with giving a positive review to "your part" and since you've given a positive review to "my part" (provided you're OK with the changes I've made) all code has a positive review. I think we've been pretty careful in developing this code.
Would it be possible to "fold" the commits on the branch here? It would be silly to document the little "back and forth" on some bits in the "official" sage history.
comment:82 in reply to: ↑ 81 Changed 5 years ago by
Replying to nbruin:
I've attached some changes (both as a patch and the file that should be the end result).
OK, I'll merge it into the branch.
Would it be possible to "fold" the commits on the branch here? It would be silly to document the little "back and forth" on some bits in the "official" sage history.
I was told that changing a commit that is published on trac but is not merged in Sage means "changing the history" (even though it will only be merged in future) and is bad. I really hate this aspect of git. So, even though I totally agree that documenting the back and forth is silly, I will not fold.
comment:83 Changed 5 years ago by
Note that your patch introduces a typo (namely, the indentation of the closing """ of some doc string is too small).
comment:84 Changed 5 years ago by
- Commit changed from e60890eea7e9d108431d183c897b37482b12e4cf to 1a12ce6982bbe1f050e4ae627972023d5ddfa5ef
Branch pushed to git repo; I updated commit sha1. New commits:
[changeset:1a12ce6] Fix some typos. Better tests for WeakValueDict iteration guard.
comment:85 Changed 5 years ago by
I merged your changes and fixed some further typos. Is it good to go now?
comment:86 Changed 5 years ago by
- Description modified (diff)
- Reviewers changed from Simon King to Nils Bruin, Simon King
- Status changed from needs_review to positive_review
Good to go for me! Incidentally, we're running into something I can only explain as an illegal optimization in python's print command:
sage: from sage.misc.weak_dict import WeakValueDictionary
sage: D=WeakValueDictionary()
sage: D[1]=QQ
sage: print D
{1: <weakref at 0x34128f0; to 'RationalField_with_category' at 0x1ffb4d0>}
sage: repr(D)
'<WeakValueDictionary at 0x6225ae0>'
sage: str(D)
'<WeakValueDictionary at 0x6225ae0>'
and it's an odd optimization too. The only explanation I can see is that print doesn't think dict can be subclassed and thus, based on a type test, decides to use PyDict routines on it. I'm afraid that our solution won't necessarily work in all places where a UserDict would get recognized as a type that needs method resolution.
EDIT: I think the problem is that the dict type has its tp_print field set, so we inherit that routine. Ideally we should override it, but I don't think cython provides support for that. Discussion on cython-users has some details. We could hack our way around it by doing
cdef extern from "Python.h":
    ctypedef struct PyTypeObject:
        void * tp_print

cdef class WeakValueDictionary(dict):
    def __init__(self):
        ....

# clear the tp_print field on the type after PyType_Ready has executed on it.
(<PyTypeObject *><void *>WeakValueDictionary).tp_print = NULL
It feels like a horrible hack but it does have the desired effect.
UPDATE: Cython is planning to do this for us in their next release.
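For reference, the tp_print slot is a Python 2 artifact; Python 3 stopped consulting it (and removed the slot altogether in 3.8), so there print on a dict subclass goes through normal method resolution. A quick Python 3 check (ChattyDict is a made-up name):

```python
class ChattyDict(dict):
    def __repr__(self):
        return "<ChattyDict with %d items>" % len(self)

d = ChattyDict(a=1)
assert repr(d) == "<ChattyDict with 1 items>"
assert str(d) == repr(d)   # str() falls back to __repr__
# print(d) therefore also shows "<ChattyDict with 1 items>" under Python 3
```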
comment:87 Changed 5 years ago by
Hi Nils,
I guess we don't really need to take care of print D.
Now, I plan to create a follow-up ticket. Namely, I notice that in different places in Sage we use something like a "weak value dictionary without callbacks", for example sage.modular.etaproducts._cache: This is a normal Python dict, but its values are weak references. Hence, it would make sense to use a proper sage.misc.weak_dict.WeakValueDictionary instead.
comment:88 Changed 5 years ago by
- Milestone changed from sage-5.13 to sage-6.0
comment:89 Changed 5 years ago by
- Milestone changed from sage-6.0 to sage-5.13
Why is the milestone bumped after it got a positive review?
comment:90 Changed 5 years ago by
- Milestone changed from sage-5.13 to sage-6.0
PS, to Jeroen: It was due to a bug (I guess) in Trac that my post changed the milestone back from 6.0 to 5.13 --- I did not do it explicitly, and now I am changing it forth explicitly.
Anyway, what is the policy? Does a "feature freeze" also apply to positive reviews given before the feature freeze?
comment:91 Changed 5 years ago by
- Milestone changed from sage-6.0 to sage-5.13
Mercurial patches go into sage-5.x
Git branches go into sage-6.x
It's that simple.
comment:92 Changed 5 years ago by
- Milestone changed from sage-5.13 to sage-6.0
comment:93 follow-up: ↓ 102 Changed 5 years ago by
Nils, from the discussion at cython-users, it seems that the reason for defining dict.tp_print is that for large dicts one would not like to create a giant string and return it, but would like to print it bit by bit.

And this may give rise to a suggestion: Python could allow __str__ resp. __repr__ to become iterators returning a list of strings, and then it could print them bit by bit. Such as (for a dictionary):
def __repr__(self):
    yield "{"
    i = self.iteritems()
    try:
        k,v = i.next()
        yield "%s: %s"%(k,v)
    except StopIteration:
        pass
    for k, v in i:
        yield ", %s: %s"%(k,v)
    yield "}"
If you think that this would be a reasonable feature, you could suggest it to upstream (I guess you know the Python community better than I).
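In Python 3 terms, the idea amounts to something like the following (chunked_repr is a hypothetical free function, not an actual proposal for the __repr__ protocol):

```python
def chunked_repr(d):
    # stream the rendering piece by piece instead of building one big string
    yield "{"
    it = iter(d.items())
    try:
        k, v = next(it)
        yield "%s: %s" % (k, v)
    except StopIteration:
        pass
    for k, v in it:
        yield ", %s: %s" % (k, v)
    yield "}"

d = {1: 2, 3: 4}
assert "".join(chunked_repr(d)) == "{1: 2, 3: 4}"
# a printer could write each chunk to the output stream as it is produced
```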
comment:94 follow-up: ↓ 95 Changed 5 years ago by
Jeroen, another question: Nils has asked whether we could fold all commits into one, because we had quite some back and forth when implementing it.
Would it be an acceptable option that I post a diff patch summarising all the changes, and then perhaps even change the milestone back to 5.13?
comment:95 in reply to: ↑ 94 ; follow-up: ↓ 96 Changed 5 years ago by
Replying to SimonKing:
Would it be an acceptable option that I post a diff patch summarising all the changes, and then perhaps even change the milestone back to 5.13?
Yes.
But please double-check that the Mercurial patch you post actually matches the result of the git branch.
comment:96 in reply to: ↑ 95 Changed 5 years ago by
Replying to jdemeyer:
Replying to SimonKing:
Would it be an acceptable option that I post a diff patch summarising all the changes, and then perhaps even change the milestone back to 5.13?
Yes.
But please double-check that the Mercurial patch you post actually matches the result of the git branch.
OK. My plan is:
- remove some trailing whitespace that I spotted and push the change (is this needed, or would you automatically remove it when you merge it? In this case, I'd skip that).
- git diff master > some_file.patch
- edit it, so that it matches the old folder layout
- apply some_file.patch to a non-git version of Sage, and run tests.
- post the patch here, and Nils double checks whether the patch looks good to him.
comment:97 follow-up: ↓ 99 Changed 5 years ago by
- Description modified (diff)
I have removed the trailing whitespace from the patch, but not from the branch yet (I can do so when requested).
This patch does apply to sage-5.12.beta5 (that's the latest non-git version on my machine). Note that there was a mismatch in sage/structure/factory.pyx, because someone appears to have removed trailing whitespace. It must have happened somehow between sage-5.12.beta5 and the master branch on trac, but git blame does not show how the whitespace has vanished.
Anyway. Nils, please verify that the attached patch applies to some non-git Sage, and that the patch does what we do in the branch.
Apply trac13394-weak_value_dictionary.patch
comment:98 Changed 5 years ago by
- Commit changed from 1a12ce6982bbe1f050e4ae627972023d5ddfa5ef to 11bd210662b7dfa8eea2a6ff258980663ae14209
- Status changed from positive_review to needs_review
Branch pushed to git repo; I updated commit sha1 and set ticket back to needs_review. New commits:
[changeset:11bd210] Remove some trailing whitespace
comment:99 in reply to: ↑ 97 Changed 5 years ago by
Replying to SimonKing:
I have removed the trailing whitespace from the patch, but not from the branch yet (I can do so when requested).
I simply did...
Anyway. Nils, please verify that the attached patch applies to some non-git Sage, and that the patch does what we do in the branch.
Now, because of the new commit (which only removes whitespace) we need a review anyway.
Apply trac13394-weak_value_dictionary.patch
comment:100 Changed 5 years ago by
- Milestone changed from sage-6.0 to sage-5.13
comment:101 Changed 5 years ago by
- Description modified (diff)
comment:102 in reply to: ↑ 93 Changed 5 years ago by
Replying to SimonKing:
Nils, from the discussion at cython-users, it seems that the reason for defining dict.tp_print is that for large dicts one would not like to create a giant string and return it, but would like to print it bit by bit.
I'm pretty sure that's why it was originally designed. It seems that subsequently tp_print wasn't found to be all that useful and was hence effectively removed. I haven't checked, but Stefan's comments suggest that it's not used in Python 3. I can see how that would happen: If one has a huge data structure, one doesn't write it to a file in one go. One dumps it more carefully.
And this may give rise to a suggestion: Python could allow __str__ resp. __repr__ to become iterators returning a list of strings, and then it could print them bit by bit.
It's an elegant idea and probably how tp_print would be implemented if iterators had been around earlier, but I'm pretty sure one wouldn't want to slow down __str__ etc. in general (internally they are C-speed for many types, because they are slotted!) to gain a use-case that has been shown to occur very rarely.
comment:103 follow-up: ↓ 104 Changed 5 years ago by
Please check that the documentation builds fine. I get:
docstring of sage.misc.weak_dict.WeakValueDictionary:45: WARNING: Literal block expected; none found.
(but I'm not entirely sure this was with the latest version of the patch)
comment:104 in reply to: ↑ 103 Changed 5 years ago by
- Status changed from needs_review to needs_work
- Work issues set to Docbuild
Replying to jdemeyer:
(but I'm not entirely sure this was with the latest version of the patch)
Well, there is only one version of the patch... But you are right, I get this error as well.
comment:105 Changed 5 years ago by
- Commit changed from 11bd210662b7dfa8eea2a6ff258980663ae14209 to 1aedffee52c61a67167de17afc57d76a91d1d679
Branch pushed to git repo; I updated commit sha1. New commits:
[changeset:1aedffe] Fix doc format
comment:106 Changed 5 years ago by
- Status changed from needs_work to needs_review
- Work issues Docbuild deleted
I have updated both the hg patch and the git branch (it was only needed to remove one ":"). To me, the documentation looks good.
Apply trac13394-weak_value_dictionary.patch
comment:107 Changed 5 years ago by
Thanks everybody for your work on this ticket. Since #10963 now depends on it, I hope it will get in soon!
comment:108 Changed 5 years ago by
- Status changed from needs_review to positive_review
Patch applies to 5.12 and results in a functional weak_dict.pyx, and diffs pretty close to something that was already reviewed, so back to positive.
comment:109 Changed 5 years ago by
- Branch u/SimonKing/ticket/13394 deleted
- Commit 1aedffee52c61a67167de17afc57d76a91d1d679 deleted
Removed git branch to reduce confusion.
comment:110 Changed 5 years ago by
- Merged in set to sage-5.13.beta3
- Resolution set to fixed
- Status changed from positive_review to closed
comment:111 follow-up: ↓ 112 Changed 5 years ago by
The WeakValueDictionary behaves differently from regular dicts for unhashable keys:

sage: {}[[]]
TypeError: unhashable type: 'list'
sage: sage.misc.weak_dict.WeakValueDictionary()[[]]
KeyError: []
This is because http://docs.python.org/2/c-api/dict.html#PyDict_GetItem does not throw a TypeError. Was this done on purpose, or should I open a ticket for it?
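The contrast can be reproduced in plain Python; pydict_getitem_like below is a hypothetical model of PyDict_GetItem's documented error-swallowing behaviour (it suppresses errors and returns NULL as if the key were absent), not the actual C code path:

```python
def pydict_getitem_like(d, key, default=None):
    # PyDict_GetItem-style lookup: errors raised while hashing or
    # comparing the key are swallowed; the key just looks missing
    try:
        return d[key]
    except Exception:
        return default

d = {}
try:
    d[[]]                  # a regular dict propagates the hashing error
    raised = None
except TypeError:
    raised = "TypeError"
assert raised == "TypeError"
assert pydict_getitem_like(d, []) is None   # merely looks absent
```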
comment:112 in reply to: ↑ 111 Changed 5 years ago by
This is now #15956.
Replying to saraedum:
The WeakValueDictionary behaves differently from regular dicts for unhashable keys:

sage: {}[[]]
TypeError: unhashable type: 'list'
sage: sage.misc.weak_dict.WeakValueDictionary()[[]]
KeyError: []

This is because http://docs.python.org/2/c-api/dict.html#PyDict_GetItem does not throw a TypeError. Was this done on purpose, or should I open a ticket for it?
The standard Python WeakValueDictionary is built on top of a normal dictionary. Let's say there is an underlying D = dict. If we do

D[k] = v

then it really executes something like

dict.__setitem__(D, k, KeyedRef(v, remove, k))

This is a weak reference to v, so v can still get collected. If it does, the callback can ensure that

del D[k]

gets executed. This locates the entry by computing the hash of k, finding the right bucket and finding an equal key by comparison. When we create the entry, we have better lookup information, though: We already have hash(k) and id(v). These two don't necessarily pin down the entry in the dictionary, but in our applications they do, and for a WeakValueDictionary identical v would be the same (dead) KeyedRef anyway, so it wouldn't hurt to remove them all. We need one more method on our dictionary:

D.del_dictitem_by_exact_value(value, h)

Then the callback can be something like

def remove(r):
    D.del_dictitem_by_exact_value(r, r.key)

and the KeyedRef that gets stored would be

KeyedRef(v, remove, hash(k))

(normal precautions to put the dict reference in a closure or a class should apply here in defining remove).

Or one could just immediately integrate the KeyedRefs in the design. See also #13387 for discussions about dictionary designs.

In either case, the main effect is that removal does not cause any hashing or comparison methods to be called on k anymore.
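The design above can be sketched in pure Python (the class name, bucket layout and method set here are my own; the real sage.misc.weak_dict works directly on the C-level dict table): entries are grouped by hash(key), and the callback removes its entry by identity of the weakref, so removal never hashes or compares keys:

```python
import weakref

class SafeWeakValueDict(object):
    """Hypothetical sketch: a weak-value dict whose removal callbacks
    never call __hash__ or __eq__ on the keys."""
    def __init__(self):
        self._buckets = {}  # hash(key) -> list of (key, weakref-to-value)

    def __setitem__(self, key, value):
        h = hash(key)
        def remove(wr, h=h):
            # delete by identity of the weakref: no key comparison here
            bucket = self._buckets.get(h)
            if bucket is None:
                return  # entry already overwritten/removed: stay silent
            bucket[:] = [(k, w) for (k, w) in bucket if w is not wr]
            if not bucket:
                del self._buckets[h]
        wr = weakref.ref(value, remove)
        bucket = self._buckets.setdefault(h, [])
        # comparing keys is fine here: we are in an explicit __setitem__,
        # not in a garbage-collection callback
        bucket[:] = [(k, w) for (k, w) in bucket if k != key]
        bucket.append((key, wr))

    def __getitem__(self, key):
        for k, w in self._buckets.get(hash(key), ()):
            if k == key:
                v = w()
                if v is not None:
                    return v
        raise KeyError(key)

    def __len__(self):
        return sum(len(b) for b in self._buckets.values())

D = SafeWeakValueDict()
v, w = set([1]), set([2])
D["a"] = v
D["a"] = w        # overwrite drops v's weakref from the bucket
del v             # v dies; its (already discarded) callback does nothing
assert D["a"] is w and len(D) == 1
del w             # w's callback removes the entry without touching the key
assert len(D) == 0
```

Note the closure over self in remove: it keeps the dictionary reachable from the weakref, which is the "normal precaution" mentioned above.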