Opened 9 months ago

Last modified 3 months ago

#31236 needs_work defect

Fix hashing of permutation group elements

Reported by: vdelecroix Owned by:
Priority: major Milestone: sage-9.5
Component: group theory Keywords:
Cc: Merged in:
Authors: Vincent Delecroix Reviewers: Dima Pasechnik, ​David Roe
Report Upstream: N/A Work issues:
Branch: u/vdelecroix/31236 (Commits, GitHub, GitLab) Commit: e2e4d1d7fdcf87355af69f46e8d90e7e1e69a552
Dependencies: Stopgaps:

Status badges

Description (last modified by vdelecroix)

As noticed on this sage-devel thread the hash function of PermutationGroupElement does not behave coherently with respect to the natural inclusions SymmetricGroup(n) c SymmetricGroup(n+1).

We design a hash function:

  • whose value on the identity is 1
  • which does only depend on the support (ie ignores fixed points so that the hash is consistent with permutation restriction)
  • which is independent of the domain ordering

Change History (47)

comment:1 Changed 9 months ago by vdelecroix

  • Description modified (diff)

comment:2 Changed 9 months ago by dimpase

GAP has a function LargestMovedPoint() for permutations, so instead of using self.n, where n is the degree of the permutation group, one can use this value to iterate over while computing the hash.

While this does not ignore all the fixed points, this is still OK in terms of the consistency.

comment:3 Changed 9 months ago by dimpase

  • Component changed from PLEASE CHANGE to group theory

comment:4 Changed 9 months ago by vdelecroix

  • Branch set to u/vdelecroix/31236
  • Commit set to 462fdb1f2b903204041b6fc304ce2099aceae062
  • Status changed from new to needs_review

New commits:

462fdb131236: fix hash function of permutations

comment:5 Changed 9 months ago by dimpase

What is 145623773L doing there?

Are you replacing something which looks to me as a perfect hash function (i.e. no collisions) with something that might have collisions?

comment:6 Changed 9 months ago by vdelecroix

If you find a collision, I pay you a beer.

comment:7 Changed 9 months ago by dimpase

I asked a theory question.

comment:8 Changed 9 months ago by vdelecroix

145623773L is an offset to ensure bit shuffling. See https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function (though I might not have used the best constants).

comment:9 Changed 9 months ago by dimpase

thanks. please add this info in the docs.

comment:10 Changed 9 months ago by git

  • Commit changed from 462fdb1f2b903204041b6fc304ce2099aceae062 to ba9d62d7cdd8cf81f3e06d1399af8b0cb9d57bc4

Branch pushed to git repo; I updated commit sha1. New commits:

ba9d62d31326: reference for algorithm

comment:11 Changed 9 months ago by vdelecroix

done

comment:12 Changed 9 months ago by dimpase

  • Reviewers set to Dima Pasechnik
  • Status changed from needs_review to positive_review

OK, looks good, thanks.

comment:13 Changed 9 months ago by vdelecroix

  • Status changed from positive_review to needs_work
sage: S1 = SymmetricGroup(FiniteEnumeratedSet([1,2,3]))
sage: S2 = SymmetricGroup(FiniteEnumeratedSet([2,1,3]))
sage: S1("(1,3,2)") == S2("(1,3,2)")
True
sage: hash(S1("(1,3,2)")) == hash(S2("(1,3,2)"))
False

comment:14 Changed 9 months ago by git

  • Commit changed from ba9d62d7cdd8cf81f3e06d1399af8b0cb9d57bc4 to 3e2994829b5c1ce4ad74c1400015339498240c18

Branch pushed to git repo; I updated commit sha1. New commits:

3e2994831326: change algorithm for consistency

comment:15 Changed 9 months ago by vdelecroix

  • Status changed from needs_work to needs_review

Consistent hashing with more tests.

comment:16 Changed 9 months ago by vdelecroix

  • Description modified (diff)

comment:17 Changed 9 months ago by dimpase

Can't one do hashing based on the position in the lexicographically ordered set of all permutations?

https://en.wikipedia.org/wiki/Factorial_number_system

Last edited 9 months ago by dimpase (previous) (diff)

comment:18 follow-up: Changed 9 months ago by vdelecroix

See comment:13 for why this will not work.

comment:19 in reply to: ↑ 18 ; follow-up: Changed 9 months ago by dimpase

Replying to vdelecroix:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.


Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).

comment:20 in reply to: ↑ 19 ; follow-up: Changed 9 months ago by vdelecroix

Replying to dimpase:

Replying to vdelecroix:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?


Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).

Why is this relevent for hashing?

Last edited 9 months ago by vdelecroix (previous) (diff)

comment:21 in reply to: ↑ 20 ; follow-up: Changed 9 months ago by dimpase

Replying to vdelecroix:

Replying to dimpase:

Replying to vdelecroix:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?


Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).

Why is this relevent for hashing?

I am inclined to say that one should not hash permutation group elements ignoring their parent group. A workaround, to hash Permutation(g) rather than g itself, should work, though - although it does not, due to the usual sloppiness of Permutation.

comment:22 in reply to: ↑ 21 ; follow-up: Changed 9 months ago by vdelecroix

Replying to dimpase:

Replying to vdelecroix:

Replying to dimpase:

Replying to vdelecroix:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])] and then [Partition([3,2,1]), 'a', pi] for the same permutation, let say the transposition ('a', pi), you will obtain two different permutations on [1,2,3] with different hashes.

You could also have read comment:13: the order of the elements in the domain should not matter.

Last edited 9 months ago by vdelecroix (previous) (diff)

comment:23 Changed 9 months ago by vdelecroix

If you think you have a better way of implementing a hash function, please provide an alternative branch. As far as I tested my branch does solve the issue.

comment:24 Changed 9 months ago by dimpase

There is no issue IMHO. It's an over-zealous wish to mimic Python numeric tower to the point where it does not make sense any more.

comment:25 in reply to: ↑ 22 ; follow-up: Changed 9 months ago by dimpase

Replying to vdelecroix:

Replying to dimpase:

Replying to vdelecroix:

Replying to dimpase:

Replying to vdelecroix:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])] and then [Partition([3,2,1]), 'a', pi] for the same permutation, let say the transposition ('a', pi), you will obtain two different permutations on [1,2,3] with different hashes.

You could also have read comment:13: the order of the elements in the domain should not matter.

And this only reinforces my suspision that hashing permutations of different domains should return different answers. As you just demonstrated, there is no canonical way to identify the transposition ('a',pi) with (1,2), why would the hashes be the same?

comment:26 in reply to: ↑ 25 Changed 9 months ago by vdelecroix

Replying to dimpase:

Replying to vdelecroix:

Replying to dimpase:

Replying to vdelecroix:

Replying to dimpase:

Replying to vdelecroix:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])] and then [Partition([3,2,1]), 'a', pi] for the same permutation, let say the transposition ('a', pi), you will obtain two different permutations on [1,2,3] with different hashes.

You could also have read comment:13: the order of the elements in the domain should not matter.

And this only reinforces my suspision that hashing permutations of different domains should return different answers.

There is a canonical map from the domain {1,2} to the domain {1,2,3}. Hence the transposition (1,2) on both sets should just be equal. It is trivial to implement a compatible hash function. This is what this ticket is about.

As you just demonstrated, there is no canonical way to identify the transposition ('a',pi) with (1,2), why would the hashes be the same?

The hashes are different.

comment:27 follow-up: Changed 9 months ago by roed

It seems like you have incompatible goals for a hash function.

  • It should be as fast as possible to compute for elements of a fixed permutation group. Dima's suggestion about the base is an interesting idea for a way to compute hashes by only considering a few elements. But it's clearly not going to be consistent across subgroup relations (including S_n c S_{n+1} considered here), and it requires a precomputation for each permutation group (the base).
  • It should depend only on the permutation, not the group. Yes, there are other parts of Sage where hashing and equality testing are incompatible
    sage: mod(1,3) == -2                                                                                                                                                                                                                         
    True
    sage: hash(mod(1,3)) == hash(-2)                                                                                                                                                                                                             
    False
    

but this compatibility is still a desirable feature of a hash function.

Vincent's solution seems like a pretty good way to achieve the second goal. Dima, if you'd like to argue that we should drop compatibility of hash functions across different permutation groups in the interest of speed, I think that discussion should be taken back to the sage-devel thread.

I wonder if it's reasonable to have a "hash mode" option for parents where a user can indicate that they want hashes to be as fast as possible, dropping compatibility across parents. Maybe this could disable implicit coercions and require explicit conversions?

comment:28 in reply to: ↑ 27 ; follow-up: Changed 9 months ago by vdelecroix

Replying to roed:

It seems like you have incompatible goals for a hash function.

<SNIP>

Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}

sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

The time difference is plausibly due to the extra call P._has_natural_domain() which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.

As expected the code is significatively slower in case the domain is non standard

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?

Last edited 9 months ago by vdelecroix (previous) (diff)

comment:29 follow-up: Changed 9 months ago by dimpase

PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base - which gets computed as soon as almost any kind of nontrivial computation has to be done - to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).

comment:30 in reply to: ↑ 28 ; follow-ups: Changed 9 months ago by roed

Replying to vdelecroix:

Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}

sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

The time difference is plausibly due to the extra call P._has_natural_domain() which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.

Since _has_natural_domain is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set to True? I would suspect that the inside of the for loop is just more complicated, yielding a 30% slowdown. Obviously we'd prefer not to have a slowdown, but I think 30% is acceptable in order to fix the bug.

As expected the code is significatively slower in case the domain is non standard

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?

Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.

The speed comparison I was thinking about, though, wasn't either of these. Instead, I think Dima was suggesting that we can make a much faster hash function than the current one in the case where n is large. I think speeding things up for large n is worthwhile, but there are plenty of cases where Sage is pretty bad. I think the hash code is fine in comparison:

sage: S = SymmetricGroup(2000)
sage: %timeit a = S.random_element()
639 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: S = SymmetricGroup(10^8)
sage: %time hash(a)                                                                                                                                                                                                                          
CPU times: user 120 ms, sys: 350 µs, total: 120 ms
Wall time: 120 ms

comment:31 in reply to: ↑ 29 Changed 9 months ago by roed

Replying to dimpase:

PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base - which gets computed as soon as almost any kind of nontrivial computation has to be done - to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).

I agree that they are different, but Sage currently allows equality testing between them:

sage: S = SymmetricGroup([1,2,3])                                                                                                                                                                                                            
sage: T = SymmetricGroup([2,3,1])                                                                                                                                                                                                            
sage: S('(2,1)')                                                                                                                                                                                                                             
(1,2)
sage: T('(2,1)')                                                                                                                                                                                                                             
(2,1)
sage: T('(2,1)') == S('(2,1)')                                                                                                                                                                                                               
True

I wouldn't object to prohibiting a coercion map between these parents (as you say, the domains are different since permutations are all about the ordering of domain, not just the labels), but I'd say that should also be discussed more broadly.

comment:32 in reply to: ↑ 30 Changed 9 months ago by vdelecroix

Replying to roed:

Replying to vdelecroix:

Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}

sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

The time difference is plausibly due to the extra call P._has_natural_domain() which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.

Since _has_natural_domain is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set to True?

sage: %timeit -r5 for s in S: h = hash(s) # new code without _has_natural_domain call
1.09 s ± 18.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

comment:33 Changed 9 months ago by roed

I guess I was wrong. I also don't see an easy way to make this faster.

comment:34 in reply to: ↑ 30 Changed 9 months ago by vdelecroix

Replying to roed:

Replying to vdelecroix:

As expected the code is significatively slower in case the domain is non standard

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?

Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.

There is the Python array object that can store whatever C data you want. We will still suffer from the attribute access to the Python parent. But it will not be a factor x30 as it is currently

comment:35 Changed 9 months ago by git

  • Commit changed from 3e2994829b5c1ce4ad74c1400015339498240c18 to e2e4d1d7fdcf87355af69f46e8d90e7e1e69a552

Branch pushed to git repo; I updated commit sha1. New commits:

e2e4d1d31236: use a hash array for non standard domains

comment:36 Changed 9 months ago by vdelecroix

Better now

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
159 ms ± 1.17 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

Computing a single hash of an element triggers the creation of an array in the parent

sage: S._domain_hash_array
array('l', [-5034579376873000338, 5819885834672024805, -1342306816048582247, 4154873981948319473, -9080196639829811986, 3487762170233258304, -4059262801740968994, -1410151844174446032, -7454342532774124322])

This solution is not ideal as the hash array could be made available on the domain itself and not on the symmetric group. It could end up with many duplications of this array (ie when constructing subgroups).

Last edited 9 months ago by vdelecroix (previous) (diff)

comment:37 Changed 9 months ago by vdelecroix

See #31269

comment:38 Changed 9 months ago by vdelecroix

ping

comment:39 Changed 9 months ago by roed

I'm happy giving this ticket positive review. Since Dima originally reviewed it, I'll give him a couple days to respond with an objection and suggestions for how he'd like to proceed.

comment:40 Changed 9 months ago by vdelecroix

  • Reviewers changed from Dima Pasechnik to Dima Pasechnik, ​David Roe

comment:41 Changed 9 months ago by dimpase

My objection is only on the level of documentation. Where the hell these huge constants come from, and why? It's totally unclear.

comment:42 Changed 9 months ago by vdelecroix

As their names indicate mult1, mult2 are multipliers (for bit shuffling) and mask1, mask2 are masks so that the hasing of the pair (i, j) is non commutative (ie different from the hash of (j, i)). These numbers are not big, they are 64 bit integers.

comment:43 Changed 8 months ago by vdelecroix

ping

comment:44 Changed 8 months ago by roed

  • Status changed from needs_review to positive_review

Sorry, I've been busy. This looks good to me.

comment:45 Changed 8 months ago by vbraun

  • Status changed from positive_review to needs_work

I got this on Ubuntu 18.04 32 bit

**********************************************************************
File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1485, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
Failed example:
    for n in range(1, 10):
       G = SymmetricGroup(n)
       assert hash(G.one()) == 1
       assert len(set(map(hash, G))) == factorial(n)
Exception raised:
    Traceback (most recent call last):
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 714, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 1133, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[3]>", line 4, in <module>
        assert len(set(map(hash, G))) == factorial(n)
    AssertionError
**********************************************************************
File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1493, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
Failed example:
    assert len(set(map(hash, A))) == A.cardinality()
Exception raised:
    Traceback (most recent call last):
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 714, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 1133, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[7]>", line 1, in <module>
        assert len(set(map(hash, A))) == A.cardinality()
    AssertionError
**********************************************************************
1 item had failures:
   2 of  35 in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
    [426 tests, 2 failures, 3.64 s]

as a general bit of advice, consider using assert condition, message so you learn whats wrong from tracebacks...

comment:46 Changed 7 months ago by mkoeppe

  • Milestone changed from sage-9.3 to sage-9.4

Moving this ticket to 9.4, as it seems unlikely that it will be merged in 9.3, which is in the release candidate stage

comment:47 Changed 3 months ago by mkoeppe

  • Milestone changed from sage-9.4 to sage-9.5

Setting a new milestone for this ticket based on a cursory review.

Note: See TracTickets for help on using tickets.