Opened 2 years ago
Closed 2 years ago
#28112 closed defect (fixed)
py3: Fix hash function of Integer
Reported by:  ghmwageringel  Owned by:  

Priority:  major  Milestone:  sage8.9 
Component:  python3  Keywords:  
Cc:  Merged in:  
Authors:  Markus Wageringel  Reviewers:  Jeroen Demeyer, John Palmieri, Frédéric Chapoton 
Report Upstream:  N/A  Work issues:  
Branch:  b13f7f8 (Commits, GitHub, GitLab)  Commit:  b13f7f812bd02ae02a2a2549c860cc20cbc1bcd6 
Dependencies:  Stopgaps: 
Description
This ticket fixes an issue in sage.libs.gmp.pylong.mpz_pythonhash
, the hash function for Integer, and resolves the Python 3 doctest failures in sage/rings/integer.pyx
.
More precisely, the implementation of mpz_pythonhash
is using the fact that y <= modulus
, as explained in the comment
# Safely compute h = (h + y) % modulus knowing that h < modulus # and y <= modulus
However, this assumption is not always correct, as y
is the sum of up to three numbers. For example, this fails in
sage: hash(2^64  1), hash(int(2^64  1)) (2305843009213693958, 7)
Change History (20)
comment:1 followup: ↓ 3 Changed 2 years ago by
 Branch set to u/ghmwageringel/28112
 Commit set to ba487ec15e63bff0e51f3d6afd5fb632f2b63d01
 Status changed from new to needs_review
 Summary changed from py3: Fix for Integer hash to py3: Fix hash function of Integer
comment:2 Changed 2 years ago by
 Reviewers set to Jeroen Demeyer
comment:3 in reply to: ↑ 1 ; followup: ↓ 6 Changed 2 years ago by
Replying to ghmwageringel:
I have not had the chance to test this with Python 3 on a 32bit machine.
Do we really need the exact values in the docstring? Wouldn't it be sufficient to mark those tests with # random
? The only thing really that matters is that hash(n) == hash(int(n))
.
comment:4 Changed 2 years ago by
 Status changed from needs_review to needs_work
Given that y
is less than 2 * modulus
, it is more efficient to do
if y > modulus: y = modulus
(Note that the condition y > modulus
instead of y >= modulus
is intentional as it generates better code).
comment:5 Changed 2 years ago by
 Commit changed from ba487ec15e63bff0e51f3d6afd5fb632f2b63d01 to 87702411d84e713195bff3d8828d26f01ec9aa38
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
8770241  28112: py3: fix Integer hash and doctests

comment:6 in reply to: ↑ 3 Changed 2 years ago by
 Status changed from needs_work to needs_review
Replying to jdemeyer:
Do we really need the exact values in the docstring? Wouldn't it be sufficient to mark those tests with
# random
? The only thing really that matters is thathash(n) == hash(int(n))
.
I agree. I updated the tests accordingly, keeping just the 64bit Python 3 output. At some point, this should still be tested with 32bit, though.
I also applied the other suggested change. Thanks for the feedback.
comment:7 followup: ↓ 8 Changed 2 years ago by
I would prefer not to use long
in doctests, since it is not supported in Python 3. The only reason it works in Sage now is because of a hack that converts long
to int
when using Python 3. Maybe instead separate doctests?
sage: n.__hash__() == hash(long(n)) # py2 True sage: n.__hash__() == hash(n) # py3 True
comment:8 in reply to: ↑ 7 Changed 2 years ago by
Replying to jhpalmieri:
I would prefer not to use
long
in doctests, since it is not supported in Python 3. The only reason it works in Sage now is because of a hack that convertslong
toint
when using Python 3. Maybe instead separate doctests?
Good point. Instead of having separate doctests, I think it is preferrable to replace long(n)
by int(n)
which should work in both Python 2 and 3. In Python 2, it will still return a long
if the number is too large.
comment:9 Changed 2 years ago by
 Commit changed from 87702411d84e713195bff3d8828d26f01ec9aa38 to b1efe822fc4df0770f451dd5c44c71025329f0bb
Branch pushed to git repo; I updated commit sha1. New commits:
b1efe82  28112: py3: replace long by int in Integer doctests

comment:10 followup: ↓ 11 Changed 2 years ago by
 Reviewers changed from Jeroen Demeyer to Jeroen Demeyer, John Palmieri
One final nitpick: this comment # This is safe, since y <= 2 * modulus < 2 ^ limb_bits.
is not clear, I don't know what you're trying to say with it. I would say something like
# At this point, y < 2 * modulus but we need y <= modulus # We use > instead of >= on the line below because it # generates more efficient code
comment:11 in reply to: ↑ 10 ; followup: ↓ 15 Changed 2 years ago by
Replying to jdemeyer:
One final nitpick: this comment
# This is safe, since y <= 2 * modulus < 2 ^ limb_bits.
is not clear, I don't know what you're trying to say with it.
What I was trying to emphasize is that y
did not overflow which is important for the comparison to work. I will rephrase it to make this clearer.
Before adding a comment about efficiency, I would like to understand better why this is more efficient and why it is important. Could you explain this please? Is it because it saves a subtraction occasionally, or are >
comparisons generally faster than >=
, or maybe would it help the compiler do branch elimination?
comment:12 Changed 2 years ago by
Is it easy to also address this? In Sage + Python 2:
sage: all(hash(2^n1)==hash(SR(2^n1)) for n in range(60, 100)) True
In Sage + Python 3:
sage: hash(2^601) == hash(SR(2^601)) True sage: hash(2^611) == hash(SR(2^611)) False
comment:13 followup: ↓ 16 Changed 2 years ago by
Let us keep the SR issue for elsewhere. The branch looks good to me.
Jeroen, do you approve ?
comment:14 Changed 2 years ago by
The SR
issue might be upstream in Pynac.
comment:15 in reply to: ↑ 11 ; followup: ↓ 18 Changed 2 years ago by
Replying to ghmwageringel:
I would like to understand better why this is more efficient and why it is important.
It's certainly not important. It's just that, out of two possible alternatives, the >
alternative is slightly faster.
Could you explain this please?
I don't think that there is a fundamental reason. It's just that the compiler generates more efficient instructions for the >
variant.
comment:16 in reply to: ↑ 13 Changed 2 years ago by
Replying to chapoton:
Jeroen, do you approve ?
Yes, I approve. I'm just waiting for the rewording of the comment.
comment:17 Changed 2 years ago by
 Commit changed from b1efe822fc4df0770f451dd5c44c71025329f0bb to b13f7f812bd02ae02a2a2549c860cc20cbc1bcd6
Branch pushed to git repo; I updated commit sha1. New commits:
b13f7f8  28112: reword a comment in hash function

comment:18 in reply to: ↑ 15 Changed 2 years ago by
Replying to jdemeyer:
Replying to ghmwageringel:
I would like to understand better why this is more efficient and why it is important.
It's certainly not important. It's just that, out of two possible alternatives, the
>
alternative is slightly faster.Could you explain this please?
I don't think that there is a fundamental reason. It's just that the compiler generates more efficient instructions for the
>
variant.
Thank you for explaining. I changed the comment. It should be fine now.
comment:19 Changed 2 years ago by
 Reviewers changed from Jeroen Demeyer, John Palmieri to Jeroen Demeyer, John Palmieri, Frédéric Chapoton
 Status changed from needs_review to positive_review
I am setting to positive. Jeroen, if you do not agree, then undo
comment:20 Changed 2 years ago by
 Branch changed from u/ghmwageringel/28112 to b13f7f812bd02ae02a2a2549c860cc20cbc1bcd6
 Resolution set to fixed
 Status changed from positive_review to closed
I have not had the chance to test this with Python 3 on a 32bit machine. Instead, I just computed the 32bit hashes in the doctests using a modulus of
2^31  1
. Could someone with access to a 32bit machine confirm or correct these, please?New commits:
28112: py3: fix Integer hash and doctests