Opened 10 years ago
Closed 10 years ago
#14254 closed defect (fixed)
OverflowErrors in TripleDictEraser
Reported by: | Jeroen Demeyer | Owned by: | Robert Bradshaw |
---|---|---|---|
Priority: | blocker | Milestone: | sage-5.8 |
Component: | coercion | Keywords: | |
Cc: | Simon King, Nils Bruin | Merged in: | sage-5.8.rc0 |
Authors: | Jeroen Demeyer | Reviewers: | Simon King |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | #13387 | Stopgaps: |
Description
With the new doctesting framework (#12415), I sometimes see errors like
sage -t --long devel/sage/sage/combinat/sf/sfa.py ********************************************************************** File "devel/sage/sage/combinat/sf/sfa.py", line 388, in sage.combinat.sf.sfa.is_SymmetricFunctionAlgebra Failed example: is_SymmetricFunctionAlgebra(SymmetricFunctions(FractionField(QQ['q','t'])).macdonald().P()) Expected: True Got: Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x9531b3c> ignored True **********************************************************************
on 32-bit systems.
These appear randomly non-reproducibly in doctests.
Attachments (1)
Change History (41)
comment:1 Changed 10 years ago by
Milestone: | sage-5.9 → sage-5.8 |
---|
comment:2 Changed 10 years ago by
comment:3 Changed 10 years ago by
PS: Note the discussion that we had on #715 on that matter.
Is there a reason why we use ssize_t and not size_t?
comment:4 follow-up: 6 Changed 10 years ago by
Any idea on how to debug this? The problem is that currently we have no idea where this exception is generated.
comment:5 Changed 10 years ago by
Google seems to tell me that one should use size_t and not ssize_t here anyway. Not sure though.
Nils, do you think it could help to consistently replace ssize_t and Py_ssize_t by size_t?
comment:6 Changed 10 years ago by
Replying to jdemeyer:
Any idea on how to debug this? The problem is that currently we have no idea where this exception is generated.
Sage's debug version?
comment:7 Changed 10 years ago by
Replying to SimonKing:
k1,k2,k3,offset = r.key
This line might be problematic if there is no guarantee that the components of r.key
fit in the respective types (Py_ssize_t
, Py_ssize_t
, Py_ssize_t
, int
).
comment:8 follow-up: 9 Changed 10 years ago by
Replying to SimonKing:
Sage's debug version?
Would it really give more information about where the exception is generated?
comment:9 Changed 10 years ago by
Replying to jdemeyer:
Replying to SimonKing:
Sage's debug version?
Would it really give more information about where the exception is generated?
I thought that this is the point of a debug version? Is there an environment variable similar to MALLOC_CHECK_ that makes the program segfault on an overflow error, so that one could then analyse the problem using gdb?
comment:10 follow-up: 11 Changed 10 years ago by
How many bytes is Py_ssize_t using? On 32-bit machines, is Py_ssize_t perhaps using more bytes than size_t? Then, the problem could be here:
cdef Py_ssize_t h1 = <Py_ssize_t><void *>k1 cdef Py_ssize_t h2 = <Py_ssize_t><void *>k2 cdef Py_ssize_t h3 = <Py_ssize_t><void *>k3 cdef Py_ssize_t h = (h1 + 13*h2 ^ 503*h3)
While h1, h2, and h3 should be fine (they are obtained from a pointer and are thus using 32 bit), I could imagine that Python tries to be clever and uses more byte (long versus int) for h, if 13*h2 or 503*h3 happen to be larger than the greatest number representable by 32 bit.
If this is the case, then the line
cdef list bucket = <object>PyList_GET_ITEM(all_buckets, (<size_t>h) % PyList_GET_SIZE(all_buckets))
(when the "long" Py_ssize_t h is converted down to 32 bit of size_t) could overflow, isn't it?
Caveat: I am not an expert for the subtleties of Py_ssize_t on 32 versus 64 bit...
comment:11 Changed 10 years ago by
Replying to SimonKing:
How many bytes is Py_ssize_t using? On 32-bit machines, is Py_ssize_t perhaps using more bytes than size_t? Then, the problem could be here:
cdef Py_ssize_t h1 = <Py_ssize_t><void *>k1 cdef Py_ssize_t h2 = <Py_ssize_t><void *>k2 cdef Py_ssize_t h3 = <Py_ssize_t><void *>k3 cdef Py_ssize_t h = (h1 + 13*h2 ^ 503*h3)While h1, h2, and h3 should be fine (they are obtained from a pointer and are thus using 32 bit), I could imagine that Python tries to be clever and uses more byte (long versus int) for h
No, that cannot be the problem, arithmetic on pure C types (like here) doesn't use Python at all.
comment:12 follow-up: 14 Changed 10 years ago by
I found the following on PEP 353:
Conceptually, Py_intptr_t and Py_ssize_t are different things: Py_intptr_t needs to be the same size as void*, and Py_ssize_t the same size as size_t. These could differ, e.g. on machines where pointers have segment and offset. On current flat-address space machines, there is no difference, so for all practical purposes, Py_intptr_t would have worked as well.
Note that we actually do want a conversion from and to <void*>. So, shouldn't we actually use Py_intptr_t, not Py_ssize_t?
comment:13 Changed 10 years ago by
Does Py_intptr_t even exist in Sage? I did a grep for it, to no avail.
comment:14 follow-up: 15 Changed 10 years ago by
Replying to SimonKing:
I found the following on PEP 353:
Conceptually, Py_intptr_t and Py_ssize_t are different things: Py_intptr_t needs to be the same size as void*, and Py_ssize_t the same size as size_t. These could differ, e.g. on machines where pointers have segment and offset. On current flat-address space machines, there is no difference, so for all practical purposes, Py_intptr_t would have worked as well.Note that we actually do want a conversion from and to <void*>. So, shouldn't we actually use Py_intptr_t, not Py_ssize_t?
That's all true but it wouldn't solve this error.
comment:15 follow-ups: 16 19 Changed 10 years ago by
Replying to jdemeyer:
Note that we actually do want a conversion from and to <void*>. So, shouldn't we actually use Py_intptr_t, not Py_ssize_t?
That's all true but it wouldn't solve this error.
Well, you didn't tell yet were exactly the error is located.
Lacking more information, my guess was that an overflow could occur on systems where sizeof(Py_ssize_t)==sizeof(size_t)
differs from sizeof(Py_intptr_t)==sizeof(void*)
. So, why are you sure that the problem is not related with conversion between size_t and pointers?
comment:16 Changed 10 years ago by
Replying to SimonKing:
Well, you didn't tell yet were exactly the error is located.
If I would know that, fixing the error would probably be trivial.
comment:17 Changed 10 years ago by
I see it also all over the place when building the docs:
[...] [rings ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings ] no targets are out of date. [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] no targets are out of date. [...]
comment:18 Changed 10 years ago by
Can you test sizeof(Py_ssize_t)==sizeof(void*)
on those machines that exposed the error? Namely, this equality would falsify my guess.
comment:19 Changed 10 years ago by
Replying to SimonKing:
Lacking more information, my guess was that an overflow could occur on systems where
sizeof(Py_ssize_t)==sizeof(size_t)
differs fromsizeof(Py_intptr_t)==sizeof(void*)
. So, why are you sure that the problem is not related with conversion between size_t and pointers?
This is a typical 32-bit system where the types int
, long
, size_t
, Py_ssize_t
, void*
, intptr_t
are all 4 bytes.
comment:20 Changed 10 years ago by
Besides, even if void*
would be larger than Py_ssize_t
, I still don't see how it could cause the error that we're seeing. The cast void*
-> Py_ssize_t
is pure C, so it couldn't overflow.
comment:21 Changed 10 years ago by
I am debugging building the docs, as I can reliably get the failure this way.
comment:22 follow-up: 23 Changed 10 years ago by
The error happens in the last line here:
cdef TripleDict D = <object>PyWeakref_GetObject(self.D) if D is None: return cdef list buckets = D.buckets if buckets is None: return # r is a (weak) reference (typically to a parent), and it knows the # stored key of the unique triple r() had been part of. # We remove that unique triple from self.D cdef Py_ssize_t k1,k2,k3 k1,k2,k3 = r.key
comment:23 follow-up: 26 Changed 10 years ago by
Replying to jdemeyer:
The error happens in the last line here:
cdef Py_ssize_t k1,k2,k3 k1,k2,k3 = r.key
Interesting. So, this would probably mean that the actual error occurs when storing the key. Let's see what is done there...
Just out of curiosity: Does the same problem also occur with #14159 and dependencies?
comment:24 Changed 10 years ago by
Aha!
We have
cdef size_t h1 = <size_t><void *>k1 cdef size_t h2 = <size_t><void *>k2 cdef size_t h3 = <size_t><void *>k3 try: ref1 = KeyedRef(k1,self.eraser,(h1, h2, h3)) except TypeError: ref1 = k1
Hence, the three items of the key originally were size_t. But they are assigned to Py_ssize_t.
comment:26 follow-up: 28 Changed 10 years ago by
Replying to SimonKing:
Interesting. So, this would probably mean that the actual error occurs when storing the key.
Indeed.
And r.key
does indeed take values like k1,k2,k3 = (3043219468L, 154417500, 150140108)
, the first of which would not fit in a Py_ssize_t
.
So the question is: where does this tuple come from?
comment:27 follow-up: 29 Changed 10 years ago by
In sage/categories/homset.py:
_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
id()
returns <type 'int'>
. I don't know if assigning int
to Py_ssize_t
can be a problem.
comment:28 Changed 10 years ago by
Replying to jdemeyer:
And
r.key
does indeed take values likek1,k2,k3 = (3043219468L, 154417500, 150140108)
, the first of which would not fit in aPy_ssize_t
.So the question is: where does this tuple come from?
There are (as much as I know) only the two places that I mentioned (by the way, #14159 would change it).
Is size_t-->Py_ssize_t
a potential problem? Or is id(...)-->Py_ssize_t
a potential problem?
comment:29 follow-up: 30 Changed 10 years ago by
Replying to SimonKing:
In sage/categories/homset.py:
_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
id()
returns<type 'int'>
. I don't know if assigningint
toPy_ssize_t
can be a problem.
Bingo! That's the problem indeed.
comment:30 follow-up: 38 Changed 10 years ago by
Replying to jdemeyer:
Replying to SimonKing:
In sage/categories/homset.py:
_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
id()
returns<type 'int'>
. I don't know if assigningint
toPy_ssize_t
can be a problem.Bingo! That's the problem indeed.
OK. Then it is solved in #14159, I suppose.
comment:31 follow-up: 32 Changed 10 years ago by
comment:32 follow-up: 33 Changed 10 years ago by
Replying to jdemeyer:
But I would really like a quick fix independently of #14159. Then this quick fix here could be merged in sage-5.8.rc0 while #14159 wouldn't. Would you mind if I create a patch?
No, I wouldn't mind (rebasing #14159 should be easy enough).
But how can one change the id() thingy in sage/categories/homset.py? After all, it is a Python file, hence, one can not just do <size_t><void*>X
.
comment:33 Changed 10 years ago by
Replying to SimonKing:
But how can one change the id() thingy in sage/categories/homset.py? After all, it is a Python file, hence, one can not just do
<size_t><void*>X
.
You can't, but you could write a small function to do just this in coerce_dict.pyx
for example.
Changed 10 years ago by
Attachment: | 14254_signed_id.patch added |
---|
comment:34 Changed 10 years ago by
Authors: | → Jeroen Demeyer |
---|---|
Status: | new → needs_review |
comment:35 Changed 10 years ago by
Dependencies: | → #13387 |
---|
comment:36 follow-up: 37 Changed 10 years ago by
The patch looks fine to me, and it applies cleanly after #13387. Now I wonder what the patchbots have to tell.
Problem: I have no 32-bit machine available. Will the machines on which you first noticed the problem report here, in the hopefully green) blob?
comment:37 Changed 10 years ago by
Replying to SimonKing:
Will the machines on which you first noticed the problem report here, in the (hopefully green) blob?
I don't think so.
In any case, the machine in question managed to build the documentation without errors, so that's a good sign. I'm running tests now, but it's a slow machine so I'll report back tomorrow.
Regardless, I think you can review the patch without access to a 32-bit system and assume that I have done my duty in testing the patch.
comment:38 Changed 10 years ago by
Replying to SimonKing:
_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))Bingo! That's the problem indeed.
OK. Then it is solved in #14159, I suppose.
Good work guys! That line made me feel uncomfortable when I was reviewing it, because it was reaching into implementation details of _cache.eraser
and indeed, it wasn't properly doing that, in a way I did not spot. This is even more confirmation that #14159 is the right solution and not just over-engineering.
comment:39 Changed 10 years ago by
Reviewers: | → SimonKing |
---|---|
Status: | needs_review → positive_review |
OK, I believe Jeroen when he says that building the documentation on his 32-bit systems works with the patch, while it reproducibly failed without the patch.
I think the solution makes sense: id(X) returns a positive number, and this may very well be beyond what a signed number type of the same size (such as Py_ssize_t) can take.
The patchbot gives a green blob, too. Hence, I hope it is safe to give this a positive review (but I think #14159 will be better, on the long run).
comment:40 Changed 10 years ago by
Merged in: | → sage-5.8.rc0 |
---|---|
Resolution: | → fixed |
Reviewers: | SimonKing → Simon King |
Status: | positive_review → closed |
There is a comment
Could it be that this is relevant here?
Is it really safe to do
?
Yes, we need to care about the sign. But we also have
(hence, the modulus is Py_ssize_t, and the result will be translated into Py_ssize_t, too)
I am afraid I have no 32 bit machine to test.