Opened 7 years ago

Closed 7 years ago

#14254 closed defect (fixed)

OverflowErrors in TripleDictEraser

Reported by: jdemeyer
Owned by: robertwb
Priority: blocker
Milestone: sage-5.8
Component: coercion
Keywords:
Cc: SimonKing, nbruin
Merged in: sage-5.8.rc0
Authors: Jeroen Demeyer
Reviewers: Simon King
Report Upstream: N/A
Work issues:
Branch:
Commit:
Dependencies: #13387
Stopgaps:

Description

With the new doctesting framework (#12415), I sometimes see errors like

sage -t --long devel/sage/sage/combinat/sf/sfa.py
**********************************************************************
File "devel/sage/sage/combinat/sf/sfa.py", line 388, in sage.combinat.sf.sfa.is_SymmetricFunctionAlgebra
Failed example:
    is_SymmetricFunctionAlgebra(SymmetricFunctions(FractionField(QQ['q','t'])).macdonald().P())
Expected:
    True
Got:
    Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x9531b3c> ignored
    True
**********************************************************************

on 32-bit systems.

These errors appear randomly in doctests and are not reproducible.

Attachments (1)

14254_signed_id.patch (6.0 KB) - added by jdemeyer 7 years ago.


Change History (41)

comment:1 Changed 7 years ago by jdemeyer

  • Milestone changed from sage-5.9 to sage-5.8

comment:2 Changed 7 years ago by SimonKing

There is a comment

# A note about how to store "id" keys in Python structures:
#
# In Python, a pointer-sized integer (size_t) is normally encoded
# as a *signed* integer of type Py_ssize_t. This has the advantage that
# if the value gets encoded as a *Python integer*, it can do so in a
# sign-preserving way and still make use of all the bits that Python
# offers to store (small) integers.
#
# There is one place where we have to be careful about signs:
# Our hash values most naturally live in Py_ssize_t. We convert those into
# an index into our bucket list by taking the hash modulo the number of buckets.
# However, the modulo operator in C preserves the sign of the number we take
# the modulus of, which is not what we want.
# The solution is to always do
#     (<size_t>h) % modulus
# to ensure we're doing an unsigned modulus.

Could it be that this is relevant here?

Is it really safe to do

        cdef Py_ssize_t k1,k2,k3
        cdef int offset
        k1,k2,k3,offset = r.key
        cdef Py_ssize_t h = (k1 + 13*k2 ^ 503*k3)
        PyList_GET_ITEM(buckets, (<size_t>h) % PyList_GET_SIZE(buckets))

?

Yes, we need to care about the sign. But we also have

 Py_ssize_t PyList_GET_SIZE(object list)
 PyObject* PyList_GET_ITEM(object list, Py_ssize_t i)

(hence, the modulus is Py_ssize_t, and the result will be translated into Py_ssize_t, too)

I am afraid I have no 32-bit machine to test on.
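To make the quoted comment's point concrete, here is a small Python sketch (illustrative only, not part of the ticket; the hash value and bucket count are made up) of C's sign-preserving modulo versus the unsigned modulo obtained by casting through size_t:

```python
def c_mod(a, b):
    """C's % operator for positive b: the result carries the sign of a."""
    r = abs(a) % b
    return -r if a < 0 else r

def unsigned_mod(h, n, nbits=32):
    """The (<size_t>h) % n idiom: reinterpret h as unsigned before dividing."""
    return (h & ((1 << nbits) - 1)) % n

h = -7          # a negative hash value
n_buckets = 37  # arbitrary bucket count

assert c_mod(h, n_buckets) == -7                    # C: negative, unusable as a list index
assert h % n_buckets == 30                          # Python's % is already non-negative
assert 0 <= unsigned_mod(h, n_buckets) < n_buckets  # always a valid bucket index
```

Note that Python's own % already returns a non-negative result for a positive modulus; the cast is only needed on the C side.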

comment:3 Changed 7 years ago by SimonKing

PS: Note the discussion that we had on #715 on that matter.

Is there a reason why we use ssize_t and not size_t?

comment:4 follow-up: Changed 7 years ago by jdemeyer

Any idea on how to debug this? The problem is that currently we have no idea where this exception is generated.

comment:5 Changed 7 years ago by SimonKing

Google seems to tell me that one should use size_t and not ssize_t here anyway. Not sure though.

Nils, do you think it could help to consistently replace ssize_t and Py_ssize_t by size_t?

comment:6 in reply to: ↑ 4 Changed 7 years ago by SimonKing

Replying to jdemeyer:

Any idea on how to debug this? The problem is that currently we have no idea where this exception is generated.

Sage's debug version?

comment:7 Changed 7 years ago by jdemeyer

Replying to SimonKing:

        k1,k2,k3,offset = r.key

This line might be problematic if there is no guarantee that the components of r.key fit in the respective types (Py_ssize_t, Py_ssize_t, Py_ssize_t, int).

comment:8 follow-up: Changed 7 years ago by jdemeyer

Replying to SimonKing:

Sage's debug version?

Would it really give more information about where the exception is generated?

comment:9 in reply to: ↑ 8 Changed 7 years ago by SimonKing

Replying to jdemeyer:

Replying to SimonKing:

Sage's debug version?

Would it really give more information about where the exception is generated?

I thought that was the point of a debug version? Is there an environment variable similar to MALLOC_CHECK_ that makes the program segfault on an overflow error, so that one could then analyse the problem using gdb?

comment:10 follow-up: Changed 7 years ago by SimonKing

How many bytes does Py_ssize_t use? On 32-bit machines, does Py_ssize_t perhaps use more bytes than size_t? If so, the problem could be here:

        cdef Py_ssize_t h1 = <Py_ssize_t><void *>k1
        cdef Py_ssize_t h2 = <Py_ssize_t><void *>k2
        cdef Py_ssize_t h3 = <Py_ssize_t><void *>k3
        cdef Py_ssize_t h = (h1 + 13*h2 ^ 503*h3)

While h1, h2, and h3 should be fine (they are obtained from a pointer and thus fit in 32 bits), I could imagine that Python tries to be clever and uses more bytes (long versus int) for h if 13*h2 or 503*h3 happens to be larger than the greatest number representable in 32 bits.

If this is the case, then the line

cdef list bucket = <object>PyList_GET_ITEM(all_buckets, (<size_t>h) % PyList_GET_SIZE(all_buckets))

(when the "long" Py_ssize_t h is converted down to the 32 bits of size_t) could overflow, couldn't it?

Caveat: I am not an expert on the subtleties of Py_ssize_t on 32-bit versus 64-bit systems...

comment:11 in reply to: ↑ 10 Changed 7 years ago by jdemeyer

Replying to SimonKing:

How many bytes does Py_ssize_t use? On 32-bit machines, does Py_ssize_t perhaps use more bytes than size_t? If so, the problem could be here:

        cdef Py_ssize_t h1 = <Py_ssize_t><void *>k1
        cdef Py_ssize_t h2 = <Py_ssize_t><void *>k2
        cdef Py_ssize_t h3 = <Py_ssize_t><void *>k3
        cdef Py_ssize_t h = (h1 + 13*h2 ^ 503*h3)

While h1, h2, and h3 should be fine (they are obtained from a pointer and thus fit in 32 bits), I could imagine that Python tries to be clever and uses more bytes (long versus int) for h

No, that cannot be the problem, arithmetic on pure C types (like here) doesn't use Python at all.
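To sketch why: arithmetic on C fixed-width types wraps (in practice, two's complement) and never creates a Python object, so it cannot raise OverflowError; only a conversion between a Python int and a C type can. An illustrative Python emulation of the 32-bit case (the pointer values are made up):

```python
MASK32 = (1 << 32) - 1

def wrap_i32(x):
    """Emulate 32-bit two's-complement wraparound, as C arithmetic does in practice."""
    x &= MASK32
    return x - (1 << 32) if x >= (1 << 31) else x

# Made-up 32-bit pointer values standing in for h1, h2, h3.
h1, h2, h3 = 0x7FFFFFF0, 0x6ABCDEF0, 0x5F00BA40

# C parses h1 + 13*h2 ^ 503*h3 as (h1 + 13*h2) ^ (503*h3); every intermediate
# result silently wraps back into 32 bits, so no Python-level overflow can occur.
h = wrap_i32(wrap_i32(h1 + wrap_i32(13 * h2)) ^ wrap_i32(503 * h3))
assert -(1 << 31) <= h < (1 << 31)
```

(Strictly, signed overflow is undefined behavior in C; the unsigned-cast idiom quoted in comment:2 avoids that, too.)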

comment:12 follow-up: Changed 7 years ago by SimonKing

I found the following on PEP 353:

Conceptually, Py_intptr_t and Py_ssize_t are different things: Py_intptr_t needs
to be the same size as void*, and Py_ssize_t the same size as size_t. These could
differ, e.g. on machines where pointers have segment and offset. On current
flat-address space machines, there is no difference, so for all practical purposes,
Py_intptr_t would have worked as well.

Note that we actually do want a conversion from and to <void*>. So, shouldn't we actually use Py_intptr_t, not Py_ssize_t?

comment:13 Changed 7 years ago by SimonKing

Does Py_intptr_t even exist in Sage? I did a grep for it, to no avail.

comment:14 in reply to: ↑ 12 ; follow-up: Changed 7 years ago by jdemeyer

Replying to SimonKing:

I found the following on PEP 353:

Conceptually, Py_intptr_t and Py_ssize_t are different things: Py_intptr_t needs
to be the same size as void*, and Py_ssize_t the same size as size_t. These could
differ, e.g. on machines where pointers have segment and offset. On current
flat-address space machines, there is no difference, so for all practical purposes,
Py_intptr_t would have worked as well.

Note that we actually do want a conversion from and to <void*>. So, shouldn't we actually use Py_intptr_t, not Py_ssize_t?

That's all true but it wouldn't solve this error.

comment:15 in reply to: ↑ 14 ; follow-ups: Changed 7 years ago by SimonKing

Replying to jdemeyer:

Note that we actually do want a conversion from and to <void*>. So, shouldn't we actually use Py_intptr_t, not Py_ssize_t?

That's all true but it wouldn't solve this error.

Well, you didn't yet tell where exactly the error is located.

Lacking more information, my guess was that an overflow could occur on systems where sizeof(Py_ssize_t)==sizeof(size_t) differs from sizeof(Py_intptr_t)==sizeof(void*). So why are you sure that the problem is not related to conversion between size_t and pointers?

Last edited 7 years ago by SimonKing

comment:16 in reply to: ↑ 15 Changed 7 years ago by jdemeyer

Replying to SimonKing:

Well, you didn't yet tell where exactly the error is located.

If I knew that, fixing the error would probably be trivial.

comment:17 Changed 7 years ago by jdemeyer

I see it also all over the place when building the docs:

[...]
[rings    ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings    ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings    ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings    ] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings    ] no targets are out of date.
[rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings_num] Exception OverflowError: 'long int too large to convert to int' in <sage.structure.coerce_dict.TripleDictEraser object at 0x949611c> ignored
[rings_num] no targets are out of date.
[...]

comment:18 Changed 7 years ago by SimonKing

Can you test whether sizeof(Py_ssize_t)==sizeof(void*) holds on those machines that exposed the error? This equality would falsify my guess.

comment:19 in reply to: ↑ 15 Changed 7 years ago by jdemeyer

Replying to SimonKing:

Lacking more information, my guess was that an overflow could occur on systems where sizeof(Py_ssize_t)==sizeof(size_t) differs from sizeof(Py_intptr_t)==sizeof(void*). So why are you sure that the problem is not related to conversion between size_t and pointers?

This is a typical 32-bit system where the types int, long, size_t, Py_ssize_t, void*, intptr_t are all 4 bytes.

comment:20 Changed 7 years ago by jdemeyer

Besides, even if void* were larger than Py_ssize_t, I still don't see how it could cause the error we're seeing. The cast void* -> Py_ssize_t is pure C, so it cannot overflow at the Python level.

Last edited 7 years ago by jdemeyer

comment:21 Changed 7 years ago by jdemeyer

I am debugging building the docs, as I can reliably get the failure this way.

comment:22 follow-up: Changed 7 years ago by jdemeyer

The error happens in the last line here:

        cdef TripleDict D = <object>PyWeakref_GetObject(self.D)
        if D is None:
            return
        cdef list buckets = D.buckets
        if buckets is None:
            return
        # r is a (weak) reference (typically to a parent), and it knows the
        # stored key of the unique triple r() had been part of.
        # We remove that unique triple from self.D
        cdef Py_ssize_t k1,k2,k3
        k1,k2,k3 = r.key

comment:23 in reply to: ↑ 22 ; follow-up: Changed 7 years ago by SimonKing

Replying to jdemeyer:

The error happens in the last line here:

        cdef Py_ssize_t k1,k2,k3
        k1,k2,k3 = r.key

Interesting. So, this would probably mean that the actual error occurs when storing the key. Let's see what is done there...

Just out of curiosity: Does the same problem also occur with #14159 and dependencies?

comment:24 Changed 7 years ago by SimonKing

Aha!

We have

        cdef size_t h1 = <size_t><void *>k1
        cdef size_t h2 = <size_t><void *>k2
        cdef size_t h3 = <size_t><void *>k3
        try:
            ref1 = KeyedRef(k1,self.eraser,(h1, h2, h3))
        except TypeError:
            ref1 = k1

Hence, the three items of the key originally were size_t. But they are assigned to Py_ssize_t.

comment:25 Changed 7 years ago by jdemeyer

I can't find this code, where is it?

comment:26 in reply to: ↑ 23 ; follow-up: Changed 7 years ago by jdemeyer

Replying to SimonKing:

Interesting. So, this would probably mean that the actual error occurs when storing the key.

Indeed.

And r.key does indeed take values like k1,k2,k3 = (3043219468L, 154417500, 150140108), the first of which would not fit in a Py_ssize_t.

So the question is: where does this tuple come from?
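The failing conversion can be reproduced in plain Python with the struct module (illustrative, using the id value quoted above): packing 3043219468 into a signed 32-bit slot fails, just as assigning it to a Py_ssize_t does on a 32-bit build.

```python
import struct

big_id = 3043219468  # the first id() value above; it exceeds 2**31 - 1

# An unsigned 32-bit slot ('I') accepts it...
assert struct.unpack('I', struct.pack('I', big_id))[0] == big_id

# ...but a signed 32-bit slot ('i') rejects it, analogous to the
# Py_ssize_t assignment overflowing on a 32-bit system.
try:
    struct.pack('i', big_id)
    overflowed = False
except struct.error:
    overflowed = True
assert overflowed
```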

comment:27 follow-up: Changed 7 years ago by SimonKing

In sage/categories/homset.py:

    _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))

id() returns <type 'int'>. I don't know if assigning int to Py_ssize_t can be a problem.

comment:28 in reply to: ↑ 26 Changed 7 years ago by SimonKing

Replying to jdemeyer:

And r.key does indeed take values like k1,k2,k3 = (3043219468L, 154417500, 150140108), the first of which would not fit in a Py_ssize_t.

So the question is: where does this tuple come from?

There are (as far as I know) only the two places that I mentioned (by the way, #14159 would change this).

Is size_t-->Py_ssize_t a potential problem? Or is id(...)-->Py_ssize_t a potential problem?

comment:29 in reply to: ↑ 27 ; follow-up: Changed 7 years ago by jdemeyer

Replying to SimonKing:

In sage/categories/homset.py:

    _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))

id() returns <type 'int'>. I don't know if assigning int to Py_ssize_t can be a problem.

Bingo! That's the problem indeed.

comment:30 in reply to: ↑ 29 ; follow-up: Changed 7 years ago by SimonKing

Replying to jdemeyer:

Replying to SimonKing:

In sage/categories/homset.py:

    _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))

id() returns <type 'int'>. I don't know if assigning int to Py_ssize_t can be a problem.

Bingo! That's the problem indeed.

OK. Then it is solved in #14159, I suppose.

comment:31 follow-up: Changed 7 years ago by jdemeyer

But I would really like a quick fix independently of #14159. Then this quick fix here could be merged in sage-5.8.rc0 while #14159 wouldn't. Would you mind if I create a patch?

comment:32 in reply to: ↑ 31 ; follow-up: Changed 7 years ago by SimonKing

Replying to jdemeyer:

But I would really like a quick fix independently of #14159. Then this quick fix here could be merged in sage-5.8.rc0 while #14159 wouldn't. Would you mind if I create a patch?

No, I wouldn't mind (rebasing #14159 should be easy enough).

But how can one change the id() thingy in sage/categories/homset.py? After all, it is a Python file, so one cannot just do <size_t><void*>X.

comment:33 in reply to: ↑ 32 Changed 7 years ago by jdemeyer

Replying to SimonKing:

But how can one change the id() thingy in sage/categories/homset.py? After all, it is a Python file, so one cannot just do <size_t><void*>X.

You can't, but you could write a small function to do just this in coerce_dict.pyx for example.
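For illustration, such a helper could reinterpret the non-negative value returned by id() as a signed machine word (a pure-Python sketch; the helper name and the 32-bit word size are assumptions here, and the actual attachment, 14254_signed_id.patch, presumably does this with a Cython cast):

```python
def to_signed(value, nbits):
    """Reinterpret a non-negative id()-style integer as a signed nbits-wide word."""
    value &= (1 << nbits) - 1
    return value - (1 << nbits) if value >= (1 << (nbits - 1)) else value

# The oversized id from comment:26 now fits the signed 32-bit range...
assert to_signed(3043219468, 32) == -1251747828
# ...while ids that already fit are left unchanged.
assert to_signed(150140108, 32) == 150140108
```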

Changed 7 years ago by jdemeyer

comment:34 Changed 7 years ago by jdemeyer

  • Authors set to Jeroen Demeyer
  • Status changed from new to needs_review

comment:35 Changed 7 years ago by SimonKing

  • Dependencies set to #13387

comment:36 follow-up: Changed 7 years ago by SimonKing

The patch looks fine to me, and it applies cleanly after #13387. Now I wonder what the patchbots have to say.

Problem: I have no 32-bit machine available. Will the machines on which you first noticed the problem report here, in the (hopefully green) blob?

comment:37 in reply to: ↑ 36 Changed 7 years ago by jdemeyer

Replying to SimonKing:

Will the machines on which you first noticed the problem report here, in the (hopefully green) blob?

I don't think so.

In any case, the machine in question managed to build the documentation without errors, so that's a good sign. I'm running tests now, but it's a slow machine so I'll report back tomorrow.

Regardless, I think you can review the patch without access to a 32-bit system and assume that I have done my duty in testing the patch.

comment:38 in reply to: ↑ 30 Changed 7 years ago by nbruin

Replying to SimonKing:

_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))

Bingo! That's the problem indeed.

OK. Then it is solved in #14159, I suppose.

Good work, guys! That line made me feel uncomfortable when I was reviewing it, because it reaches into implementation details of _cache.eraser, and indeed it was doing so incorrectly, in a way I did not spot. This is even more confirmation that #14159 is the right solution and not just over-engineering.

comment:39 Changed 7 years ago by SimonKing

  • Reviewers set to SimonKing
  • Status changed from needs_review to positive_review

OK, I believe Jeroen when he says that building the documentation on his 32-bit systems works with the patch, while it reproducibly failed without the patch.

I think the solution makes sense: id(X) returns a positive number, which may well be beyond what a signed type of the same size (such as Py_ssize_t) can represent.

The patchbot gives a green blob, too. Hence, I hope it is safe to give this a positive review (though I think #14159 will be better in the long run).

comment:40 Changed 7 years ago by jdemeyer

  • Merged in set to sage-5.8.rc0
  • Resolution set to fixed
  • Reviewers changed from SimonKing to Simon King
  • Status changed from positive_review to closed