Opened 7 years ago
Closed 7 years ago
#14254 closed defect (fixed)
OverflowErrors in TripleDictEraser
Reported by: | jdemeyer | Owned by: | robertwb |
---|---|---|---|
Priority: | blocker | Milestone: | sage-5.8 |
Component: | coercion | Keywords: | |
Cc: | SimonKing, nbruin | Merged in: | sage-5.8.rc0 |
Authors: | Jeroen Demeyer | Reviewers: | Simon King |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | #13387 | Stopgaps: |
Description
With the new doctesting framework (#12415), I sometimes see errors like
sage -t --long devel/sage/sage/combinat/sf/sfa.py ********************************************************************** File "devel/sage/sage/combinat/sf/sfa.py", line 388, in sage.combinat.sf.sfa.is_SymmetricFunctionAlgebra Failed example: is_SymmetricFunctionAlgebra(SymmetricFunctions(FractionField(QQ['q','t'])).macdonald().P()) Expected: True Got: Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x9531b3c> ignored True **********************************************************************
on 32-bit systems.
These appear randomly non-reproducibly in doctests.
Attachments (1)
Change History (41)
comment:1 Changed 7 years ago by
- Milestone changed from sage-5.9 to sage-5.8
comment:2 Changed 7 years ago by
comment:3 Changed 7 years ago by
PS: Note the discussion that we had on #715 on that matter.
Is there a reason why we use ssize_t and not size_t?
comment:4 follow-up: ↓ 6 Changed 7 years ago by
Any idea on how to debug this? The problem is that currently we have no idea where this exception is generated.
comment:5 Changed 7 years ago by
Google seems to tell me that one should use size_t and not ssize_t here anyway. Not sure though.
Nils, do you think it could help to consistently replace ssize_t and Py_ssize_t by size_t?
comment:6 in reply to: ↑ 4 Changed 7 years ago by
Replying to jdemeyer:
Any idea on how to debug this? The problem is that currently we have no idea where this exception is generated.
Sage's debug version?
comment:7 Changed 7 years ago by
Replying to SimonKing:
k1,k2,k3,offset = r.key
This line might be problematic if there is no guarantee that the components of r.key
fit in the respective types (Py_ssize_t
, Py_ssize_t
, Py_ssize_t
, int
).
comment:8 follow-up: ↓ 9 Changed 7 years ago by
Replying to SimonKing:
Sage's debug version?
Would it really give more information about where the exception is generated?
comment:9 in reply to: ↑ 8 Changed 7 years ago by
Replying to jdemeyer:
Replying to SimonKing:
Sage's debug version?
Would it really give more information about where the exception is generated?
I thought that this is the point of a debug version? Is there an environment variable similar to MALLOC_CHECK_ that makes the program segfault on an overflow error, so that one could then analyse the problem using gdb?
comment:10 follow-up: ↓ 11 Changed 7 years ago by
How many bytes is Py_ssize_t using? On 32-bit machines, is Py_ssize_t perhaps using more bytes than size_t? Then, the problem could be here:
cdef Py_ssize_t h1 = <Py_ssize_t><void *>k1 cdef Py_ssize_t h2 = <Py_ssize_t><void *>k2 cdef Py_ssize_t h3 = <Py_ssize_t><void *>k3 cdef Py_ssize_t h = (h1 + 13*h2 ^ 503*h3)
While h1, h2, and h3 should be fine (they are obtained from a pointer and are thus using 32 bit), I could imagine that Python tries to be clever and uses more byte (long versus int) for h, if 13*h2 or 503*h3 happen to be larger than the greatest number representable by 32 bit.
If this is the case, then the line
cdef list bucket = <object>PyList_GET_ITEM(all_buckets, (<size_t>h) % PyList_GET_SIZE(all_buckets))
(when the "long" Py_ssize_t h is converted down to 32 bit of size_t) could overflow, isn't it?
Caveat: I am not an expert for the subtleties of Py_ssize_t on 32 versus 64 bit...
comment:11 in reply to: ↑ 10 Changed 7 years ago by
Replying to SimonKing:
How many bytes is Py_ssize_t using? On 32-bit machines, is Py_ssize_t perhaps using more bytes than size_t? Then, the problem could be here:
cdef Py_ssize_t h1 = <Py_ssize_t><void *>k1 cdef Py_ssize_t h2 = <Py_ssize_t><void *>k2 cdef Py_ssize_t h3 = <Py_ssize_t><void *>k3 cdef Py_ssize_t h = (h1 + 13*h2 ^ 503*h3)While h1, h2, and h3 should be fine (they are obtained from a pointer and are thus using 32 bit), I could imagine that Python tries to be clever and uses more byte (long versus int) for h
No, that cannot be the problem, arithmetic on pure C types (like here) doesn't use Python at all.
comment:12 follow-up: ↓ 14 Changed 7 years ago by
I found the following on PEP 353:
Conceptually, Py_intptr_t and Py_ssize_t are different things: Py_intptr_t needs to be the same size as void*, and Py_ssize_t the same size as size_t. These could differ, e.g. on machines where pointers have segment and offset. On current flat-address space machines, there is no difference, so for all practical purposes, Py_intptr_t would have worked as well.
Note that we actually do want a conversion from and to <void*>. So, shouldn't we actually use Py_intptr_t, not Py_ssize_t?
comment:13 Changed 7 years ago by
Does Py_intptr_t even exist in Sage? I did a grep for it, to no avail.
comment:14 in reply to: ↑ 12 ; follow-up: ↓ 15 Changed 7 years ago by
Replying to SimonKing:
I found the following on PEP 353:
Conceptually, Py_intptr_t and Py_ssize_t are different things: Py_intptr_t needs to be the same size as void*, and Py_ssize_t the same size as size_t. These could differ, e.g. on machines where pointers have segment and offset. On current flat-address space machines, there is no difference, so for all practical purposes, Py_intptr_t would have worked as well.Note that we actually do want a conversion from and to <void*>. So, shouldn't we actually use Py_intptr_t, not Py_ssize_t?
That's all true but it wouldn't solve this error.
comment:15 in reply to: ↑ 14 ; follow-ups: ↓ 16 ↓ 19 Changed 7 years ago by
Replying to jdemeyer:
Note that we actually do want a conversion from and to <void*>. So, shouldn't we actually use Py_intptr_t, not Py_ssize_t?
That's all true but it wouldn't solve this error.
Well, you didn't tell yet were exactly the error is located.
Lacking more information, my guess was that an overflow could occur on systems where sizeof(Py_ssize_t)==sizeof(size_t)
differs from sizeof(Py_intptr_t)==sizeof(void*)
. So, why are you sure that the problem is not related with conversion between size_t and pointers?
comment:16 in reply to: ↑ 15 Changed 7 years ago by
Replying to SimonKing:
Well, you didn't tell yet were exactly the error is located.
If I would know that, fixing the error would probably be trivial.
comment:17 Changed 7 years ago by
I see it also all over the place when building the docs:
[...] [rings ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings ] no targets are out of date. [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored [rings_num] no targets are out of date. [...]
comment:18 Changed 7 years ago by
Can you test sizeof(Py_ssize_t)==sizeof(void*)
on those machines that exposed the error? Namely, this equality would falsify my guess.
comment:19 in reply to: ↑ 15 Changed 7 years ago by
Replying to SimonKing:
Lacking more information, my guess was that an overflow could occur on systems where
sizeof(Py_ssize_t)==sizeof(size_t)
differs fromsizeof(Py_intptr_t)==sizeof(void*)
. So, why are you sure that the problem is not related with conversion between size_t and pointers?
This is a typical 32-bit system where the types int
, long
, size_t
, Py_ssize_t
, void*
, intptr_t
are all 4 bytes.
comment:20 Changed 7 years ago by
Besides, even if void*
would be larger than Py_ssize_t
, I still don't see how it could cause the error that we're seeing. The cast void*
-> Py_ssize_t
is pure C, so it couldn't overflow.
comment:21 Changed 7 years ago by
I am debugging building the docs, as I can reliably get the failure this way.
comment:22 follow-up: ↓ 23 Changed 7 years ago by
The error happens in the last line here:
cdef TripleDict D = <object>PyWeakref_GetObject(self.D) if D is None: return cdef list buckets = D.buckets if buckets is None: return # r is a (weak) reference (typically to a parent), and it knows the # stored key of the unique triple r() had been part of. # We remove that unique triple from self.D cdef Py_ssize_t k1,k2,k3 k1,k2,k3 = r.key
comment:23 in reply to: ↑ 22 ; follow-up: ↓ 26 Changed 7 years ago by
Replying to jdemeyer:
The error happens in the last line here:
cdef Py_ssize_t k1,k2,k3 k1,k2,k3 = r.key
Interesting. So, this would probably mean that the actual error occurs when storing the key. Let's see what is done there...
Just out of curiosity: Does the same problem also occur with #14159 and dependencies?
comment:24 Changed 7 years ago by
Aha!
We have
cdef size_t h1 = <size_t><void *>k1 cdef size_t h2 = <size_t><void *>k2 cdef size_t h3 = <size_t><void *>k3 try: ref1 = KeyedRef(k1,self.eraser,(h1, h2, h3)) except TypeError: ref1 = k1
Hence, the three items of the key originally were size_t. But they are assigned to Py_ssize_t.
comment:25 Changed 7 years ago by
I can't find this code, where is it?
comment:26 in reply to: ↑ 23 ; follow-up: ↓ 28 Changed 7 years ago by
Replying to SimonKing:
Interesting. So, this would probably mean that the actual error occurs when storing the key.
Indeed.
And r.key
does indeed take values like k1,k2,k3 = (3043219468L, 154417500, 150140108)
, the first of which would not fit in a Py_ssize_t
.
So the question is: where does this tuple come from?
comment:27 follow-up: ↓ 29 Changed 7 years ago by
In sage/categories/homset.py:
_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
id()
returns <type 'int'>
. I don't know if assigning int
to Py_ssize_t
can be a problem.
comment:28 in reply to: ↑ 26 Changed 7 years ago by
Replying to jdemeyer:
And
r.key
does indeed take values likek1,k2,k3 = (3043219468L, 154417500, 150140108)
, the first of which would not fit in aPy_ssize_t
.So the question is: where does this tuple come from?
There are (as much as I know) only the two places that I mentioned (by the way, #14159 would change it).
Is size_t-->Py_ssize_t
a potential problem? Or is id(...)-->Py_ssize_t
a potential problem?
comment:29 in reply to: ↑ 27 ; follow-up: ↓ 30 Changed 7 years ago by
Replying to SimonKing:
In sage/categories/homset.py:
_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
id()
returns<type 'int'>
. I don't know if assigningint
toPy_ssize_t
can be a problem.
Bingo! That's the problem indeed.
comment:30 in reply to: ↑ 29 ; follow-up: ↓ 38 Changed 7 years ago by
Replying to jdemeyer:
Replying to SimonKing:
In sage/categories/homset.py:
_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
id()
returns<type 'int'>
. I don't know if assigningint
toPy_ssize_t
can be a problem.Bingo! That's the problem indeed.
OK. Then it is solved in #14159, I suppose.
comment:31 follow-up: ↓ 32 Changed 7 years ago by
comment:32 in reply to: ↑ 31 ; follow-up: ↓ 33 Changed 7 years ago by
Replying to jdemeyer:
But I would really like a quick fix independently of #14159. Then this quick fix here could be merged in sage-5.8.rc0 while #14159 wouldn't. Would you mind if I create a patch?
No, I wouldn't mind (rebasing #14159 should be easy enough).
But how can one change the id() thingy in sage/categories/homset.py? After all, it is a Python file, hence, one can not just do <size_t><void*>X
.
comment:33 in reply to: ↑ 32 Changed 7 years ago by
Replying to SimonKing:
But how can one change the id() thingy in sage/categories/homset.py? After all, it is a Python file, hence, one can not just do
<size_t><void*>X
.
You can't, but you could write a small function to do just this in coerce_dict.pyx
for example.
Changed 7 years ago by
comment:34 Changed 7 years ago by
- Status changed from new to needs_review
comment:35 Changed 7 years ago by
- Dependencies set to #13387
comment:36 follow-up: ↓ 37 Changed 7 years ago by
The patch looks fine to me, and it applies cleanly after #13387. Now I wonder what the patchbots have to tell.
Problem: I have no 32-bit machine available. Will the machines on which you first noticed the problem report here, in the hopefully green) blob?
comment:37 in reply to: ↑ 36 Changed 7 years ago by
Replying to SimonKing:
Will the machines on which you first noticed the problem report here, in the (hopefully green) blob?
I don't think so.
In any case, the machine in question managed to build the documentation without errors, so that's a good sign. I'm running tests now, but it's a slow machine so I'll report back tomorrow.
Regardless, I think you can review the patch without access to a 32-bit system and assume that I have done my duty in testing the patch.
comment:38 in reply to: ↑ 30 Changed 7 years ago by
Replying to SimonKing:
_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))Bingo! That's the problem indeed.
OK. Then it is solved in #14159, I suppose.
Good work guys! That line made me feel uncomfortable when I was reviewing it, because it was reaching into implementation details of _cache.eraser
and indeed, it wasn't properly doing that, in a way I did not spot. This is even more confirmation that #14159 is the right solution and not just over-engineering.
comment:39 Changed 7 years ago by
- Reviewers set to SimonKing
- Status changed from needs_review to positive_review
OK, I believe Jeroen when he says that building the documentation on his 32-bit systems works with the patch, while it reproducibly failed without the patch.
I think the solution makes sense: id(X) returns a positive number, and this may very well be beyond what a signed number type of the same size (such as Py_ssize_t) can take.
The patchbot gives a green blob, too. Hence, I hope it is safe to give this a positive review (but I think #14159 will be better, on the long run).
comment:40 Changed 7 years ago by
- Merged in set to sage-5.8.rc0
- Resolution set to fixed
- Reviewers changed from SimonKing to Simon King
- Status changed from positive_review to closed
There is a comment
Could it be that this is relevant here?
Is it really safe to do
?
Yes, we need to care about the sign. But we also have
(hence, the modulus is Py_ssize_t, and the result will be translated into Py_ssize_t, too)
I am afraid I have no 32 bit machine to test.