Opened 6 months ago
Last modified 7 days ago
#31236 needs_work defect
Fix hashing of permutation group elements
Reported by:  vdelecroix  Owned by:  

Priority:  major  Milestone:  sage9.5 
Component:  group theory  Keywords:  
Cc:  Merged in:  
Authors:  Vincent Delecroix  Reviewers:  Dima Pasechnik, David Roe 
Report Upstream:  N/A  Work issues:  
Branch:  u/vdelecroix/31236 (Commits, GitHub, GitLab)  Commit:  e2e4d1d7fdcf87355af69f46e8d90e7e1e69a552 
Dependencies:  Stopgaps: 
Description (last modified by )
As noticed on this sagedevel thread the hash function of PermutationGroupElement
does not behave coherently with respect to the natural inclusions SymmetricGroup(n) c SymmetricGroup(n+1)
.
We design a hash function:
 whose value on the identity is 1
 which does only depend on the support (ie ignores fixed points so that the hash is consistent with permutation restriction)
 which is independent of the domain ordering
Change History (47)
comment:1 Changed 6 months ago by
 Description modified (diff)
comment:2 Changed 6 months ago by
comment:3 Changed 6 months ago by
 Component changed from PLEASE CHANGE to group theory
comment:4 Changed 6 months ago by
 Branch set to u/vdelecroix/31236
 Commit set to 462fdb1f2b903204041b6fc304ce2099aceae062
 Status changed from new to needs_review
New commits:
462fdb1  31236: fix hash function of permutations

comment:5 Changed 6 months ago by
What is 145623773L
doing there?
Are you replacing something which looks to me as a perfect hash function (i.e. no collisions) with something that might have collisions?
comment:6 Changed 6 months ago by
If you find a collision, I pay you a beer.
comment:7 Changed 6 months ago by
I asked a theory question.
comment:8 Changed 6 months ago by
145623773L
is an offset to ensure bit shuffling. See https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function (though I might not have used the best constants).
comment:9 Changed 6 months ago by
thanks. please add this info in the docs.
comment:10 Changed 6 months ago by
 Commit changed from 462fdb1f2b903204041b6fc304ce2099aceae062 to ba9d62d7cdd8cf81f3e06d1399af8b0cb9d57bc4
Branch pushed to git repo; I updated commit sha1. New commits:
ba9d62d  31326: reference for algorithm

comment:11 Changed 6 months ago by
done
comment:12 Changed 6 months ago by
 Reviewers set to Dima Pasechnik
 Status changed from needs_review to positive_review
OK, looks good, thanks.
comment:13 Changed 6 months ago by
 Status changed from positive_review to needs_work
sage: S1 = SymmetricGroup(FiniteEnumeratedSet([1,2,3])) sage: S2 = SymmetricGroup(FiniteEnumeratedSet([2,1,3])) sage: S1("(1,3,2)") == S2("(1,3,2)") True sage: hash(S1("(1,3,2)")) == hash(S2("(1,3,2)")) False
comment:14 Changed 6 months ago by
 Commit changed from ba9d62d7cdd8cf81f3e06d1399af8b0cb9d57bc4 to 3e2994829b5c1ce4ad74c1400015339498240c18
Branch pushed to git repo; I updated commit sha1. New commits:
3e29948  31326: change algorithm for consistency

comment:15 Changed 6 months ago by
 Status changed from needs_work to needs_review
Consistent hashing with more tests.
comment:16 Changed 6 months ago by
 Description modified (diff)
comment:17 Changed 6 months ago by
Can't one do hashing based on the position in the lexicographically ordered set of all permutations?
comment:18 followup: ↓ 19 Changed 6 months ago by
See comment:13 for why this will not work.
comment:19 in reply to: ↑ 18 ; followup: ↓ 20 Changed 6 months ago by
Replying to vdelecroix:
See comment:13 for why this will not work.
This seems to be an entirely different issue  one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and B is usually very small compared to the degree (it's even possible to quantify what "usually" means here  e.g. for primitive groups this is governed by O'NanScott theorem).
comment:20 in reply to: ↑ 19 ; followup: ↓ 21 Changed 6 months ago by
Replying to dimpase:
Replying to vdelecroix:
See comment:13 for why this will not work.
This seems to be an entirely different issue  one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise [pi, 'a', Partition([3,2,1])]
to [1, 2, 3]
?
Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and B is usually very small compared to the degree (it's even possible to quantify what "usually" means here  e.g. for primitive groups this is governed by O'NanScott theorem).
Why is this relevent for hashing?
comment:21 in reply to: ↑ 20 ; followup: ↓ 22 Changed 6 months ago by
Replying to vdelecroix:
Replying to dimpase:
Replying to vdelecroix:
See comment:13 for why this will not work.
This seems to be an entirely different issue  one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise
[pi, 'a', Partition([3,2,1])]
to[1, 2, 3]
?
By their positions in the list, i.e. as [1,2,3]
, why not?
Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and B is usually very small compared to the degree (it's even possible to quantify what "usually" means here  e.g. for primitive groups this is governed by O'NanScott theorem).
Why is this relevent for hashing?
I am inclined to say that one should not hash permutation group elements ignoring their parent group. A workaround, to hash Permutation(g)
rather than g
itself, should work, though  although it does not, due to the usual sloppiness of Permutation
.
comment:22 in reply to: ↑ 21 ; followup: ↓ 25 Changed 6 months ago by
Replying to dimpase:
Replying to vdelecroix:
Replying to dimpase:
Replying to vdelecroix:
See comment:13 for why this will not work.
This seems to be an entirely different issue  one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise
[pi, 'a', Partition([3,2,1])]
to[1, 2, 3]
?By their positions in the list, i.e. as
[1,2,3]
, why not?
Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])]
and then [Partition([3,2,1]), 'a', pi]
for the same permutation, let say the transposition ('a', pi)
, you will obtain two different permutations on [1,2,3]
with different hashes.
You could also have read comment:13: the order of the elements in the domain should not matter.
comment:23 Changed 6 months ago by
If you think you have a better way of implementing a hash function, please provide an alternative branch. As far as I tested my branch does solve the issue.
comment:24 Changed 6 months ago by
There is no issue IMHO. It's an overzealous wish to mimic Python numeric tower to the point where it does not make sense any more.
comment:25 in reply to: ↑ 22 ; followup: ↓ 26 Changed 6 months ago by
Replying to vdelecroix:
Replying to dimpase:
Replying to vdelecroix:
Replying to dimpase:
Replying to vdelecroix:
See comment:13 for why this will not work.
This seems to be an entirely different issue  one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise
[pi, 'a', Partition([3,2,1])]
to[1, 2, 3]
?By their positions in the list, i.e. as
[1,2,3]
, why not?Because it won't work. If you do that with
[pi, 'a', Partition([3,2,1])]
and then[Partition([3,2,1]), 'a', pi]
for the same permutation, let say the transposition('a', pi)
, you will obtain two different permutations on[1,2,3]
with different hashes.You could also have read comment:13: the order of the elements in the domain should not matter.
And this only reinforces my suspision that hashing permutations of different domains should return different answers.
As you just demonstrated, there is no canonical way to identify the transposition ('a',pi)
with (1,2)
, why would the hashes be the same?
comment:26 in reply to: ↑ 25 Changed 6 months ago by
Replying to dimpase:
Replying to vdelecroix:
Replying to dimpase:
Replying to vdelecroix:
Replying to dimpase:
Replying to vdelecroix:
See comment:13 for why this will not work.
This seems to be an entirely different issue  one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise
[pi, 'a', Partition([3,2,1])]
to[1, 2, 3]
?By their positions in the list, i.e. as
[1,2,3]
, why not?Because it won't work. If you do that with
[pi, 'a', Partition([3,2,1])]
and then[Partition([3,2,1]), 'a', pi]
for the same permutation, let say the transposition('a', pi)
, you will obtain two different permutations on[1,2,3]
with different hashes.You could also have read comment:13: the order of the elements in the domain should not matter.
And this only reinforces my suspision that hashing permutations of different domains should return different answers.
There is a canonical map from the domain {1,2}
to the domain {1,2,3}
. Hence the transposition (1,2)
on both sets should just be equal. It is trivial to implement a compatible hash function. This is what this ticket is about.
As you just demonstrated, there is no canonical way to identify the transposition
('a',pi)
with(1,2)
, why would the hashes be the same?
The hashes are different.
comment:27 followup: ↓ 28 Changed 6 months ago by
It seems like you have incompatible goals for a hash function.
 It should be as fast as possible to compute for elements of a fixed permutation group. Dima's suggestion about the base is an interesting idea for a way to compute hashes by only considering a few elements. But it's clearly not going to be consistent across subgroup relations (including
S_n c S_{n+1}
considered here), and it requires a precomputation for each permutation group (the base).  It should depend only on the permutation, not the group. Yes, there are other parts of Sage where hashing and equality testing are incompatible
sage: mod(1,3) == 2 True sage: hash(mod(1,3)) == hash(2) False
but this compatibility is still a desirable feature of a hash function.
Vincent's solution seems like a pretty good way to achieve the second goal. Dima, if you'd like to argue that we should drop compatibility of hash functions across different permutation groups in the interest of speed, I think that discussion should be taken back to the sagedevel thread.
I wonder if it's reasonable to have a "hash mode" option for parents where a user can indicate that they want hashes to be as fast as possible, dropping compatibility across parents. Maybe this could disable implicit coercions and require explicit conversions?
comment:28 in reply to: ↑ 27 ; followup: ↓ 30 Changed 6 months ago by
Replying to roed:
It seems like you have incompatible goals for a hash function.
<SNIP>
Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}
sage: S = SymmetricGroup(10) sage: %timeit r5 for s in S: h = hash(s) 1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version 1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
The time difference is plausibly due to the extra call P._has_natural_domain()
which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.
As expected the code is significatively slower in case the domain is non standard
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i']) sage: %timeit r5 for s in S: h = hash(s) 109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version 3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?
comment:29 followup: ↓ 31 Changed 6 months ago by
PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base  which gets computed as soon as almost any kind of nontrivial computation has to be done  to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).
comment:30 in reply to: ↑ 28 ; followups: ↓ 32 ↓ 34 Changed 6 months ago by
Replying to vdelecroix:
Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie
{1, 2, ..., n}
sage: S = SymmetricGroup(10) sage: %timeit r5 for s in S: h = hash(s) 1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version 1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new versionThe time difference is plausibly due to the extra call
P._has_natural_domain()
which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.
Since _has_natural_domain
is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set to True
? I would suspect that the inside of the for loop is just more complicated, yielding a 30% slowdown. Obviously we'd prefer not to have a slowdown, but I think 30% is acceptable in order to fix the bug.
As expected the code is significatively slower in case the domain is non standard
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i']) sage: %timeit r5 for s in S: h = hash(s) 109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version 3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new versionIf the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?
Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.
The speed comparison I was thinking about, though, wasn't either of these. Instead, I think Dima was suggesting that we can make a much faster hash function than the current one in the case where n
is large. I think speeding things up for large n
is worthwhile, but there are plenty of cases where Sage is pretty bad. I think the hash code is fine in comparison:
sage: S = SymmetricGroup(2000) sage: %timeit a = S.random_element() 639 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) sage: S = SymmetricGroup(10^8) sage: %time hash(a) CPU times: user 120 ms, sys: 350 µs, total: 120 ms Wall time: 120 ms
comment:31 in reply to: ↑ 29 Changed 6 months ago by
Replying to dimpase:
PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base  which gets computed as soon as almost any kind of nontrivial computation has to be done  to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).
I agree that they are different, but Sage currently allows equality testing between them:
sage: S = SymmetricGroup([1,2,3]) sage: T = SymmetricGroup([2,3,1]) sage: S('(2,1)') (1,2) sage: T('(2,1)') (2,1) sage: T('(2,1)') == S('(2,1)') True
I wouldn't object to prohibiting a coercion map between these parents (as you say, the domains are different since permutations are all about the ordering of domain, not just the labels), but I'd say that should also be discussed more broadly.
comment:32 in reply to: ↑ 30 Changed 6 months ago by
Replying to roed:
Replying to vdelecroix:
Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie
{1, 2, ..., n}
sage: S = SymmetricGroup(10) sage: %timeit r5 for s in S: h = hash(s) 1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version 1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new versionThe time difference is plausibly due to the extra call
P._has_natural_domain()
which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.Since
_has_natural_domain
is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set toTrue
?
sage: %timeit r5 for s in S: h = hash(s) # new code without _has_natural_domain call 1.09 s ± 18.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
comment:33 Changed 6 months ago by
I guess I was wrong. I also don't see an easy way to make this faster.
comment:34 in reply to: ↑ 30 Changed 6 months ago by
Replying to roed:
Replying to vdelecroix:
As expected the code is significatively slower in case the domain is non standard
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i']) sage: %timeit r5 for s in S: h = hash(s) 109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version 3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new versionIf the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?
Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.
There is the Python array object that can store whatever C data you want. We will still suffer from the attribute access to the Python parent. But it will not be a factor x30 as it is currently
comment:35 Changed 6 months ago by
 Commit changed from 3e2994829b5c1ce4ad74c1400015339498240c18 to e2e4d1d7fdcf87355af69f46e8d90e7e1e69a552
Branch pushed to git repo; I updated commit sha1. New commits:
e2e4d1d  31236: use a hash array for non standard domains

comment:36 Changed 6 months ago by
Better now
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i']) sage: %timeit r5 for s in S: h = hash(s) 159 ms ± 1.17 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
Computing a single hash of an element triggers the creation of an array in the parent
sage: S._domain_hash_array array('l', [5034579376873000338, 5819885834672024805, 1342306816048582247, 4154873981948319473, 9080196639829811986, 3487762170233258304, 4059262801740968994, 1410151844174446032, 7454342532774124322])
This solution is not ideal as the hash array could be made available on the domain itself and not on the symmetric group. It could end up with many duplications of this array (ie when constructing subgroups).
comment:37 Changed 6 months ago by
See #31269
comment:38 Changed 6 months ago by
ping
comment:39 Changed 6 months ago by
I'm happy giving this ticket positive review. Since Dima originally reviewed it, I'll give him a couple days to respond with an objection and suggestions for how he'd like to proceed.
comment:40 Changed 6 months ago by
 Reviewers changed from Dima Pasechnik to Dima Pasechnik, David Roe
comment:41 Changed 6 months ago by
My objection is only on the level of documentation. Where the hell these huge constants come from, and why? It's totally unclear.
comment:42 Changed 6 months ago by
As their names indicate mult1
, mult2
are multipliers (for bit shuffling) and mask1
, mask2
are masks so that the hasing of the pair (i, j)
is non commutative (ie different from the hash of (j, i)
). These numbers are not big, they are 64 bit integers.
comment:43 Changed 5 months ago by
ping
comment:44 Changed 5 months ago by
 Status changed from needs_review to positive_review
Sorry, I've been busy. This looks good to me.
comment:45 Changed 5 months ago by
 Status changed from positive_review to needs_work
I got this on Ubuntu 18.04 32 bit
********************************************************************** File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1485, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__ Failed example: for n in range(1, 10): G = SymmetricGroup(n) assert hash(G.one()) == 1 assert len(set(map(hash, G))) == factorial(n) Exception raised: Traceback (most recent call last): File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/sitepackages/sage/doctest/forker.py", line 714, in _run self.compile_and_execute(example, compiler, test.globs) File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/sitepackages/sage/doctest/forker.py", line 1133, in compile_and_execute exec(compiled, globs) File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[3]>", line 4, in <module> assert len(set(map(hash, G))) == factorial(n) AssertionError ********************************************************************** File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1493, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__ Failed example: assert len(set(map(hash, A))) == A.cardinality() Exception raised: Traceback (most recent call last): File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/sitepackages/sage/doctest/forker.py", line 714, in _run self.compile_and_execute(example, compiler, test.globs) File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/sitepackages/sage/doctest/forker.py", line 1133, in compile_and_execute exec(compiled, globs) File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[7]>", line 1, in <module> assert len(set(map(hash, A))) == A.cardinality() AssertionError ********************************************************************** 1 item had failures: 2 of 35 in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__ [426 tests, 2 failures, 3.64 s]
as a general bit of advice, consider using assert condition, message
so you learn whats wrong from tracebacks...
comment:46 Changed 4 months ago by
 Milestone changed from sage9.3 to sage9.4
Moving this ticket to 9.4, as it seems unlikely that it will be merged in 9.3, which is in the release candidate stage
comment:47 Changed 7 days ago by
 Milestone changed from sage9.4 to sage9.5
Setting a new milestone for this ticket based on a cursory review.
GAP has a function
LargestMovedPoint()
for permutations, so instead of usingself.n
, wheren
is the degree of the permutation group, one can use this value to iterate over while computing the hash.While this does not ignore all the fixed points, this is still OK in terms of the consistency.