Opened 8 years ago

Closed 17 months ago

Last modified 17 months ago

#10950 closed defect (fixed)

The hash function for matrices suffers from many collisions

Reported by: nthiery Owned by: nthiery
Priority: major Milestone: sage-8.1
Component: linear algebra Keywords: Weyl groups, permutation matrices, hash
Cc: sage-combinat Merged in:
Authors: Jeroen Demeyer Reviewers: Travis Scrimshaw, Vincent Delecroix
Report Upstream: N/A Work issues:
Branch: da5c1ff (Commits) Commit:
Dependencies: #24090 Stopgaps:

Description (last modified by jdemeyer)

The current hash function for generic dense matrices suffers from hard collisions with permutation matrices. For example: all permutation matrices of size 4 have hash 0!

    sage: def mat(p): m = p.to_matrix(); m.set_immutable(); return m
    sage: [ hash(mat(p)) for p in Permutations(4) ]
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I stumbled on this when profiling some code using Weyl groups that heavily used caching (the hash of a weyl group element is the hash of the underlying matrix). I gained a speed factor of 10x by just tweaking the hash of matrices.

Cheers,

Nicolas

After applying this branch:

sage: def hashmat(p): m = p.to_matrix(); m.set_immutable(); return hash(m)
sage: len(set(hashmat(p) for p in Permutations(1)))
1
sage: len(set(hashmat(p) for p in Permutations(2)))
2
sage: len(set(hashmat(p) for p in Permutations(3)))
6
sage: len(set(hashmat(p) for p in Permutations(4)))
24
sage: len(set(hashmat(p) for p in Permutations(5)))
120
sage: len(set(hashmat(p) for p in Permutations(6)))
720
sage: len(set(hashmat(p) for p in Permutations(7)))
5040
sage: len(set(hashmat(p) for p in Permutations(8)))
40320

As extra bonus, this branch ensures that scalar matrices have the same hash as the scalar. This means one extra case where A == B implies hash(A) == hash(B).

Attachments (1)

trac_10950-hash_matrices-nt.patch (1.1 KB) - added by nthiery 8 years ago.

Download all attachments as: .zip

Change History (40)

Changed 8 years ago by nthiery

comment:1 Changed 8 years ago by nthiery

  • Description modified (diff)
  • Summary changed from The Hash function for matrices has many collisions with permutation matrices: to The Hash function for matrices suffers from many collisions with permutation matrices

comment:2 Changed 8 years ago by nthiery

  • Summary changed from The Hash function for matrices suffers from many collisions with permutation matrices to The hash function for matrices suffers from many collisions with permutation matrices

comment:3 follow-up: Changed 8 years ago by jhpalmieri

Note that in m = p.to_matrix(), the matrix m is sparse, not dense (as far as I can tell), so you'll need to patch both matrix_dense.pyx and matrix_sparse.pyx.

comment:4 in reply to: ↑ 3 Changed 8 years ago by nthiery

  • Keywords permutation matrices hash added

Replying to jhpalmieri:

Note that in m = p.to_matrix(), the matrix m is sparse, not dense (as far as I can tell), so you'll need to patch both matrix_dense.pyx and matrix_sparse.pyx.

Oops. I changed my examples at the last minute, thinking that permutation matrices would feel less exotic than Weyl groups, but I forgot to actually check that my patch fixed the problem for those. Thanks for pointing this out!

comment:5 Changed 8 years ago by nthiery

  • Owner changed from jason, was to nthiery

Oops again ... Please ignore the second attachment; it's a duplicate of the first. Anyone with power: feel free to delete

comment:6 Changed 3 years ago by nbruin

This is still true. I'd say it's just true of "dense" collections of matrices:

def imm(A):
    A.set_immutable()
    return A
sage: V={imm(matrix([a,b,c,d])) for a in [-1..1] for b in [-1..1] for c in [-1..1] for d in [-1..1]}
sage: H={hash(v) for v in V}
sage: Ht={hash(tuple(v.list())) for v in V}
sage: len(V); len (H); len(Ht)
81
14
69

so I think this problem might need a little bump in priority.

comment:7 Changed 17 months ago by vdelecroix

I opened #19050 while not knowing about this one! Funny numbers.

It would be easy to design a robust hash for dense matrices. But we do want to have the same hash for sparse and dense matrices. As a consequence the hash should just be computed from the set {(i,j,v)} of nonzero values v at position (i,j). We could do a reasonable hash by asking this list to be sorted lexicographically in (i,j) (that would cost a log(num of entries)).

For polynomials this is exactly the same setting with the list (exponent, coefficient). See #21284.

The question: for sparse matrices (respectively sparse or multivariate polynomials), would it be reasonable to sort the indices in the hash function?

comment:8 follow-up: Changed 17 months ago by jhpalmieri

This may be a silly question, but why does this happen?

sage: hash((0, 1, -1, -1)) == hash((0, 1, 0, 0))
True

comment:9 Changed 17 months ago by tscrim

How about just doing a frozenset instead of sorting the tuples?

comment:10 follow-up: Changed 17 months ago by vdelecroix

@tscrim: using frozenset, the hash function would be less robust and, more importantly, would be very hard to optimize in specialized classes. You don't want the construction of a frozenset each time __hash__ is called (eg for dense matrices or univariate polynomial the list of exponents naturally come in order).

comment:11 in reply to: ↑ 10 ; follow-up: Changed 17 months ago by tscrim

Replying to vdelecroix:

@tscrim: using frozenset, the hash function would be less robust

I don't see how, but maybe I don't understand what you think it will be less robust with respect to.

and, more importantly, would be very hard to optimize in specialized classes.

I don't see why you would need to optimize the hash function. The bottleneck would be iterating over the non-zero values, which is (would be?) a separate method.

You don't want the construction of a frozenset each time __hash__ is called (eg for dense matrices or univariate polynomial the list of exponents naturally come in order).

Do you really want to sort a list of triples every time for sparse matrices? Inserting n objects into a hash table is much faster than that: O(n) vs O(n log n) for sorting them. We might want to consider (continuing) to cache the __hash__ as well.

I also don't see why polynomials should come into an argument here.

However, here is another thought for a hash function:

temp = [0,0,0]
for i,j in self.nonzero_positions():
    temp[0] += i
    temp[1] += j
    temp[2] += self[i,j]
return hash(tuple(temp))

comment:12 follow-up: Changed 17 months ago by vdelecroix

First of all, the hash function is time critical and caching a hash is useless. You are almost never trying to test A = A (are you?). At least, upstream Python decided not to cache it for tuples and frozensets based on concrete benchmarking.

Let me consider a first concrete situation. I have 2 invertible matrices 3x3 matrices and I want to check whether there is a multiplicative relation among them. It is likely that I need to consider products of size 20, that is put in a set around 320 matrices. Building a frozenset for each of them would be a considerable waste and caching would not help.

A situation for which caching might be useful is when you have a "combinatorial" free module with a basis made of matrices and you encode your vectors using your matrices as keys. Though, that might be a problem of datastructure here (the keys are completely static).

Do you have more concrete situations where hashing is involved?

For sorting the indices, I believe that in most situations hashing will be used on dense matrices. For them, you can go through the (i,j) indices in any order you like.

Concerning collisions, I don't believe your hash function was a serious proposal. Let us consider n x n matrices with entries {0,1}. Let N = n2. You know that you have 2N such matrices and it is easy to see that you have a polynomial number of values produced by your hash function (around N2 N2 N = N5 if I am optimistic). Even on permutation matrices, your hash function would perform terribly.

The relevance of polynomials is that you have the same kind of troubles passing from univariate (coefficients in a list like structure) to multivariate (coefficients in a dictionary like structure). See #21284.

comment:13 in reply to: ↑ 8 Changed 17 months ago by jdemeyer

Replying to jhpalmieri:

This may be a silly question, but why does this happen?

sage: hash((0, 1, -1, -1)) == hash((0, 1, 0, 0))
True

Interesting. This is how Python implements hashing for tuples:

sage: def tuplehash(t):
....:     n = len(t)
....:     x = 3430008
....:     for i in range(n):
....:         h = hash(t[i])
....:         x = (x ^^ h) * (1000003 + (82520 + 2*n)*i)
....:     return x + 97531

The resulting value should be reduced mod 232 or 264 depending on the system.

The reason for the collision is that N ^^ -2 == -N for any odd N. The choice of a XOR operation there is strange, since most simple hash functions would use addition instead. See https://en.wikipedia.org/wiki/Universal_hashing#Constructions for example.

comment:14 in reply to: ↑ 11 Changed 17 months ago by jdemeyer

Replying to tscrim:

However, here is another thought for a hash function:

temp = [0,0,0]
for i,j in self.nonzero_positions():
    temp[0] += i
    temp[1] += j
    temp[2] += self[i,j]
return hash(tuple(temp))

I like the idea but not the concrete implementation. It does avoid sorting but there needs to be more mixing. Something like

h = 0
for i,j in self.nonzero_positions():
    h += H(i, j, self[i,j])
return h

where H is a simple hash function.

comment:15 Changed 17 months ago by jdemeyer

  • Authors set to Jeroen Demeyer
  • Milestone set to sage-8.1

Let me give this a shot...

comment:16 Changed 17 months ago by jdemeyer

  • Dependencies set to #24090

comment:17 in reply to: ↑ 12 Changed 17 months ago by tscrim

Replying to vdelecroix:

First of all, the hash function is time critical and caching a hash is useless. You are almost never trying to test A = A (are you?). At least, upstream Python decided not to cache it for tuples and frozensets based on concrete benchmarking.

Let me consider a first concrete situation. I have 2 invertible matrices 3x3 matrices and I want to check whether there is a multiplicative relation among them. It is likely that I need to consider products of size 20, that is put in a set around 320 matrices. Building a frozenset for each of them would be a considerable waste and caching would not help.

It depends on what you mean by "considerable waste." In terms of computational complexity, it just adds to the constant (or probably would not even be contributing asymptotically for dense matrices). However, I do agree that it effectively doubles the memory footprint for dense matrices.

A situation for which caching might be useful is when you have a "combinatorial" free module with a basis made of matrices and you encode your vectors using your matrices as keys. Though, that might be a problem of datastructure here (the keys are completely static).

Do you have more concrete situations where hashing is involved?

As you mentioned, when they are used for keys of a module implemented as a dict, such for CombinatorialFreeModule elements, or for defining objects such as FiniteDimensionalAlgebra. Granted, in these situations the matrices are relatively small.

For sorting the indices, I believe that in most situations hashing will be used on dense matrices. For them, you can go through the (i,j) indices in any order you like.

I have been doing a bunch of stuff with sparse matrices recently (computing representations of Lie algebras) and have encountered things that are optimized for dense matrices that are used by sparse matrices as well. So I'd appreciate it if we didn't make the hash worse for sparse matrices just because it is better of dense matrices.

Concerning collisions, I don't believe your hash function was a serious proposal.

It was just a thought on how to remove the dependence on the order and not fully fleshed out. I know I'm less experienced in this area, but I am legitimately trying to help.

Let us consider n x n matrices with entries {0,1}. Let N = n2. You know that you have 2N such matrices and it is easy to see that you have a polynomial number of values produced by your hash function (around N2 N2 N = N5 if I am optimistic). Even on permutation matrices, your hash function would perform terribly.

I agree that it would not perform well in this case. What do you think about Jeoren's modification. I think it works well for both dense and sparse matrices and should work better than my proposal.

The relevance of polynomials is that you have the same kind of troubles passing from univariate (coefficients in a list like structure) to multivariate (coefficients in a dictionary like structure). See #21284.

For the multivariate polynomials, it looks like what is used is Jeroen's proposal. Although we could have a completely different solution for polynomials because the data structure for dense univariate is 1D and sparse k-variate are kD, whereas matrices are strictly 2D.

comment:18 Changed 17 months ago by jdemeyer

  • Branch set to u/jdemeyer/the_hash_function_for_matrices_suffers_from_many_collisions_with_permutation_matrices

comment:19 Changed 17 months ago by git

  • Commit set to f67a8839d7dfb843a8ed8652a2734b644be579f1

Branch pushed to git repo; I updated commit sha1. New commits:

f67a883Fix doctests

comment:20 Changed 17 months ago by jdemeyer

  • Description modified (diff)

comment:21 Changed 17 months ago by jdemeyer

  • Description modified (diff)

comment:22 Changed 17 months ago by jdemeyer

Feel free to comment. This is essentially "needs review" except that I haven't run all doctests yet.

comment:23 Changed 17 months ago by jdemeyer

  • Summary changed from The hash function for matrices suffers from many collisions with permutation matrices to The hash function for matrices suffers from many collisions

comment:24 Changed 17 months ago by vdelecroix

Nice! Would be interesting to test in conjunction with #23706.

comment:25 follow-up: Changed 17 months ago by vdelecroix

Instead of hash(self.get_unsafe(i, j)) what about introducing self._entry_hash(i, j)?

comment:26 in reply to: ↑ 25 ; follow-up: Changed 17 months ago by jdemeyer

Replying to vdelecroix:

Instead of hash(self.get_unsafe(i, j)) what about introducing self._entry_hash(i, j)?

Please elaborate. Why should I do that?

comment:27 Changed 17 months ago by git

  • Commit changed from f67a8839d7dfb843a8ed8652a2734b644be579f1 to d159670f8f4f3196abd340a5ed3df2d16a291f8c

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

d159670Fix doctests

comment:28 Changed 17 months ago by jdemeyer

  • Status changed from new to needs_review

comment:29 Changed 17 months ago by git

  • Commit changed from d159670f8f4f3196abd340a5ed3df2d16a291f8c to b3abe34f8b1c8b9592ed661de2d40d2ece85a5d4

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

b98ce28Better hash for matrices
b3abe34Fix doctests

comment:30 Changed 17 months ago by jdemeyer

Reorganized the code a bit, added documentation in matrix0.pyx to explain why I did things the way I did.

comment:31 Changed 17 months ago by git

  • Commit changed from b3abe34f8b1c8b9592ed661de2d40d2ece85a5d4 to da5c1ff296edce8e6037d134e774ffdf328984f8

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

210febfBetter hash for matrices
da5c1ffFix doctests

comment:32 in reply to: ↑ 26 ; follow-up: Changed 17 months ago by vdelecroix

Replying to jdemeyer:

Replying to vdelecroix:

Instead of hash(self.get_unsafe(i, j)) what about introducing self._entry_hash(i, j)?

Please elaborate. Why should I do that?

To make hashing faster! That would be a method easy to implement for extension classes that store the data as a C attribute. In such situation you don't want to build a Python object each time you compute the hash of an entry. And you don't want to copy/paste of the _hash_ function either.

comment:33 in reply to: ↑ 32 ; follow-up: Changed 17 months ago by jdemeyer

Replying to vdelecroix:

To make hashing faster! That would be a method easy to implement for extension classes that store the data as a C attribute. In such situation you don't want to build a Python object each time you compute the hash of an entry. And you don't want to copy/paste of the _hash_ function either.

On the other hand, you force every Matrix class to implement yet another method...

I'm not really convinced... in any case, that can wait for a follow-up ticket.

comment:34 in reply to: ↑ 33 ; follow-up: Changed 17 months ago by vdelecroix

Replying to jdemeyer:

Replying to vdelecroix:

To make hashing faster! That would be a method easy to implement for extension classes that store the data as a C attribute. In such situation you don't want to build a Python object each time you compute the hash of an entry. And you don't want to copy/paste of the _hash_ function either.

On the other hand, you force every Matrix class to implement yet another method...

It would not be forced: return hash(self._get_unsafe(i,j)) is a reasonable default.

I'm not really convinced... in any case, that can wait for a follow-up ticket.

Agreed, it can wait.

comment:35 in reply to: ↑ 34 Changed 17 months ago by jdemeyer

Replying to vdelecroix:

It would not be forced: return hash(self._get_unsafe(i,j)) is a reasonable default.

But that would add an extra level of indirection in the general case, slowing that down.

comment:36 follow-up: Changed 17 months ago by tscrim

  • Reviewers set to Travis Scrimshaw, Vincent Delecroix
  • Status changed from needs_review to positive_review

I don't think there should be any Python object getting created by _get_unsafe, just a pointer to a Python object. What would be a use-case for implementing a custom _entry_hash?

Anyways, since you say it can wait for a followup and this is a good improvement, I am setting it to a positive review. If you still want to discuss things first, feel free to revert.

comment:37 in reply to: ↑ 36 Changed 17 months ago by jdemeyer

Replying to tscrim:

I don't think there should be any Python object getting created by _get_unsafe

...only if the entries of the matrix are stored as Python objects. For many specialized classes (matrices over ZZ or over finite fields), the entries are stored in some C array, so _get_unsafe() does need to create a Python object.

comment:38 Changed 17 months ago by vbraun

  • Branch changed from u/jdemeyer/the_hash_function_for_matrices_suffers_from_many_collisions_with_permutation_matrices to da5c1ff296edce8e6037d134e774ffdf328984f8
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:39 Changed 17 months ago by jdemeyer

  • Commit da5c1ff296edce8e6037d134e774ffdf328984f8 deleted

Some more comments:

This hash function is a lot better than the old one. However, it is far from perfect:

sage: M = diagonal_matrix([0, 1, -2, 1]); M.set_immutable(); hash(M)
0

Still, it is the best hash that I could come up with which is very efficient for sparse matrices: it takes time O(s) with s the number of non-zero entries of a matrix (independent of the size of the matrix). If you are willing to sort the indices (as suggested in 7), you get a better hash but with a runtime of O(s log(s)).

Note: See TracTickets for help on using tickets.