Sage: Ticket #10950: The Hash function for matrices suffers from many collisions with permutation matrices
https://trac.sagemath.org/ticket/10950
<p>
The current hash function for generic dense matrices suffers from hard collisions with permutation matrices. For example: all permutation matrices of size 4 have hash 0!
</p>
<pre class="wiki"> sage: def mat(p): m = p.to_matrix(); m.set_immutable(); return m
sage: [ hash(mat(p)) for p in Permutations(4) ]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
</pre><p>
In general, we would hope to get n! below:
</p>
<pre class="wiki"> sage: len(set([ hash(mat(p)) for p in Permutations(1) ]))
1
sage: len(set([ hash(mat(p)) for p in Permutations(2) ]))
1
sage: len(set([ hash(mat(p)) for p in Permutations(3) ]))
5
sage: len(set([ hash(mat(p)) for p in Permutations(4) ]))
1
sage: len(set([ hash(mat(p)) for p in Permutations(5) ]))
16
sage: len(set([ hash(mat(p)) for p in Permutations(6) ]))
16
sage: len(set([ hash(mat(p)) for p in Permutations(7) ]))
32
sage: len(set([ hash(mat(p)) for p in Permutations(8) ]))
1
</pre><p>
I stumbled on this when profiling some code using Weyl groups that
heavily used caching (the hash of a weyl group element is the hash of the underlying matrix). I gained a speed factor of 10x by just tweaking the hash of matrices as in the attached patch. Now, I have no idea if in general that would be a good hash for matrices, so please some expert write an appropriate patch.
</p>
<p>
Cheers,
</p>
<blockquote>
<p>
Nicolas
</p>
</blockquote>
en-usSagehttps://trac.sagemath.org/chrome/site/logo_sagemath_trac.png
https://trac.sagemath.org/ticket/10950
Trac 1.1.6nthieryWed, 16 Mar 2011 18:07:11 GMTattachment set
https://trac.sagemath.org/ticket/10950
https://trac.sagemath.org/ticket/10950
<ul>
<li><strong>attachment</strong>
set to <em>trac_10950-hash_matrices-nt.patch</em>
</li>
</ul>
TicketnthieryWed, 16 Mar 2011 18:08:08 GMTdescription, summary changed
https://trac.sagemath.org/ticket/10950#comment:1
https://trac.sagemath.org/ticket/10950#comment:1
<ul>
<li><strong>description</strong>
modified (<a href="/ticket/10950?action=diff&version=1">diff</a>)
</li>
<li><strong>summary</strong>
changed from <em>The Hash function for matrices has many collisions with permutation matrices:</em> to <em>The Hash function for matrices suffers from many collisions with permutation matrices</em>
</li>
</ul>
TicketnthieryWed, 16 Mar 2011 18:08:22 GMTsummary changed
https://trac.sagemath.org/ticket/10950#comment:2
https://trac.sagemath.org/ticket/10950#comment:2
<ul>
<li><strong>summary</strong>
changed from <em>The Hash function for matrices suffers from many collisions with permutation matrices</em> to <em>The hash function for matrices suffers from many collisions with permutation matrices</em>
</li>
</ul>
TicketjhpalmieriWed, 16 Mar 2011 20:20:35 GMT
https://trac.sagemath.org/ticket/10950#comment:3
https://trac.sagemath.org/ticket/10950#comment:3
<p>
Note that in <code>m = p.to_matrix()</code>, the matrix m is sparse, not dense (as far as I can tell), so you'll need to patch both matrix_dense.pyx and matrix_sparse.pyx.
</p>
TicketnthieryWed, 16 Mar 2011 22:02:10 GMTkeywords changed
https://trac.sagemath.org/ticket/10950#comment:4
https://trac.sagemath.org/ticket/10950#comment:4
<ul>
<li><strong>keywords</strong>
<em>permutation</em> <em>matrices</em> <em>hash</em> added
</li>
</ul>
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:3" title="Comment 3">jhpalmieri</a>:
</p>
<blockquote class="citation">
<p>
Note that in <code>m = p.to_matrix()</code>, the matrix m is sparse, not dense (as far as I can tell), so you'll need to patch both matrix_dense.pyx and matrix_sparse.pyx.
</p>
</blockquote>
<p>
Oops. I changed my examples at the last minute, thinking that permutation matrices would feel less exotic than Weyl groups, but I forgot to actually check that my patch fixed the problem for those. Thanks for pointing this out!
</p>
TicketnthieryThu, 17 Mar 2011 08:12:47 GMTowner changed
https://trac.sagemath.org/ticket/10950#comment:5
https://trac.sagemath.org/ticket/10950#comment:5
<ul>
<li><strong>owner</strong>
changed from <em>jason, was</em> to <em>nthiery</em>
</li>
</ul>
<p>
Oops again ... Please ignore the second attachment; it's a duplicate of the first. Anyone with power: feel free to delete
</p>
TicketnbruinMon, 04 Jul 2016 16:02:57 GMT
https://trac.sagemath.org/ticket/10950#comment:6
https://trac.sagemath.org/ticket/10950#comment:6
<p>
This is still true. I'd say it's just true of "dense" collections of matrices:
</p>
<pre class="wiki">def imm(A):
A.set_immutable()
return A
</pre><pre class="wiki">sage: V={imm(matrix([a,b,c,d])) for a in [-1..1] for b in [-1..1] for c in [-1..1] for d in [-1..1]}
sage: H={hash(v) for v in V}
sage: Ht={hash(tuple(v.list())) for v in V}
sage: len(V); len (H); len(Ht)
81
14
69
</pre><p>
so I think this problem might need a little bump in priority.
</p>
TicketvdelecroixSun, 22 Oct 2017 09:21:37 GMT
https://trac.sagemath.org/ticket/10950#comment:7
https://trac.sagemath.org/ticket/10950#comment:7
<p>
I opened <a class="closed ticket" href="https://trac.sagemath.org/ticket/19050" title="defect: hash values for matrices (closed: duplicate)">#19050</a> while not knowing about this one! Funny numbers.
</p>
<p>
It would be easy to design a robust hash for dense matrices. But we do want to have the same hash for sparse and dense matrices. As a consequence the hash should just be computed from the set {<code>(i,j,v)</code>} of nonzero values <code>v</code> at position <code>(i,j)</code>. We could do a reasonable hash by asking this list to be sorted lexicographically in <code>(i,j)</code> (that would cost a <code>log(num of entries)</code>).
</p>
<p>
For polynomials this is exactly the same setting with the list <code>(exponent, coefficient)</code>. See <a class="new ticket" href="https://trac.sagemath.org/ticket/21284" title="defect: hash of polynomials is very bad (new)">#21284</a>.
</p>
<p>
The question: for sparse matrices (respectively sparse or multivariate polynomials), would it be reasonable to sort the indices in the hash function?
</p>
TicketjhpalmieriSun, 22 Oct 2017 21:59:16 GMT
https://trac.sagemath.org/ticket/10950#comment:8
https://trac.sagemath.org/ticket/10950#comment:8
<p>
This may be a silly question, but why does this happen?
</p>
<pre class="wiki">sage: hash((0, 1, -1, -1)) == hash((0, 1, 0, 0))
True
</pre>
TickettscrimMon, 23 Oct 2017 03:44:31 GMT
https://trac.sagemath.org/ticket/10950#comment:9
https://trac.sagemath.org/ticket/10950#comment:9
<p>
How about just doing a <code>frozenset</code> instead of sorting the tuples?
</p>
TicketvdelecroixMon, 23 Oct 2017 10:16:12 GMT
https://trac.sagemath.org/ticket/10950#comment:10
https://trac.sagemath.org/ticket/10950#comment:10
<p>
@tscrim: using <code>frozenset</code>, the hash function would be less robust and, more importantly, would be very hard to optimize in specialized classes. You don't want the construction of a frozenset each time <code>__hash__</code> is called (eg for dense matrices or univariate polynomial the list of exponents naturally come in order).
</p>
TickettscrimMon, 23 Oct 2017 13:20:07 GMT
https://trac.sagemath.org/ticket/10950#comment:11
https://trac.sagemath.org/ticket/10950#comment:11
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:10" title="Comment 10">vdelecroix</a>:
</p>
<blockquote class="citation">
<p>
@tscrim: using <code>frozenset</code>, the hash function would be less robust
</p>
</blockquote>
<p>
I don't see how, but maybe I don't understand what you think it will be less robust with respect to.
</p>
<blockquote class="citation">
<p>
and, more importantly, would be very hard to optimize in specialized classes.
</p>
</blockquote>
<p>
I don't see why you would need to optimize the hash function. The bottleneck would be iterating over the non-zero values, which is (would be?) a separate method.
</p>
<blockquote class="citation">
<p>
You don't want the construction of a frozenset each time <code>__hash__</code> is called (eg for dense matrices or univariate polynomial the list of exponents naturally come in order).
</p>
</blockquote>
<p>
Do you really want to sort a list of triples every time for sparse matrices? Inserting n objects into a hash table is much faster than that: O(n) vs O(n log n) for sorting them. We might want to consider (continuing) to cache the <code>__hash__</code> as well.
</p>
<p>
I also don't see why polynomials should come into an argument here.
</p>
<p>
However, here is another thought for a hash function:
</p>
<div class="wiki-code"><div class="code"><pre>temp <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span>
<span class="k">for</span> i<span class="p">,</span>j <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span>nonzero_positions<span class="p">():</span>
temp<span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> i
temp<span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> j
temp<span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+=</span> <span class="bp">self</span><span class="p">[</span>i<span class="p">,</span>j<span class="p">]</span>
<span class="k">return</span> <span class="nb">hash</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span>temp<span class="p">))</span>
</pre></div></div>
TicketvdelecroixMon, 23 Oct 2017 14:12:52 GMT
https://trac.sagemath.org/ticket/10950#comment:12
https://trac.sagemath.org/ticket/10950#comment:12
<p>
First of all, the hash function is time critical and caching a hash is useless. You are almost never trying to test <code>A = A</code> (are you?). At least, upstream Python decided not to cache it for tuples and frozensets based on concrete benchmarking.
</p>
<p>
Let me consider a first concrete situation. I have 2 invertible matrices 3x3 matrices and I want to check whether there is a multiplicative relation among them. It is likely that I need to consider products of size 20, that is put in a set around 3<sup>20</sup> matrices. Building a frozenset for each of them would be a considerable waste and caching would not help.
</p>
<p>
A situation for which caching might be useful is when you have a "combinatorial" free module with a basis made of matrices and you encode your vectors using your matrices as keys. Though, that might be a problem of datastructure here (the keys are completely static).
</p>
<p>
Do you have more concrete situations where hashing is involved?
</p>
<p>
For sorting the indices, I believe that in most situations hashing will be used on dense matrices. For them, you can go through the <code>(i,j)</code> indices in any order you like.
</p>
<p>
Concerning collisions, I don't believe your hash function was a serious proposal. Let us consider <code>n x n</code> matrices with entries <code>{0,1}</code>. Let N = n<sup>2</sup>. You know that you have 2<sup>N</sup> such matrices and it is easy to see that you have a polynomial number of values produced by your hash function (around N<sup>2</sup> N<sup>2</sup> N = N<sup>5</sup> if I am optimistic). Even on permutation matrices, your hash function would perform terribly.
</p>
<p>
The relevance of polynomials is that you have the same kind of troubles passing from univariate (coefficients in a list like structure) to multivariate (coefficients in a dictionary like structure). See <a class="new ticket" href="https://trac.sagemath.org/ticket/21284" title="defect: hash of polynomials is very bad (new)">#21284</a>.
</p>
TicketjdemeyerMon, 23 Oct 2017 14:35:08 GMT
https://trac.sagemath.org/ticket/10950#comment:13
https://trac.sagemath.org/ticket/10950#comment:13
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:8" title="Comment 8">jhpalmieri</a>:
</p>
<blockquote class="citation">
<p>
This may be a silly question, but why does this happen?
</p>
<pre class="wiki">sage: hash((0, 1, -1, -1)) == hash((0, 1, 0, 0))
True
</pre></blockquote>
<p>
Interesting. This is how Python implements hashing for tuples:
</p>
<pre class="wiki">sage: def tuplehash(t):
....: n = len(t)
....: x = 3430008
....: for i in range(n):
....: h = hash(t[i])
....: x = (x ^^ h) * (1000003 + (82520 + 2*n)*i)
....: return x + 97531
</pre><p>
The resulting value should be reduced mod 2<sup>32</sup> or 2<sup>64</sup> depending on the system.
</p>
<p>
The reason for the collision is that <code>N ^^ -2 == -N</code> for any odd <code>N</code>. The choice of a XOR operation there is strange, since most simple hash functions would use addition instead. See <a class="ext-link" href="https://en.wikipedia.org/wiki/Universal_hashing#Constructions"><span class="icon"></span>https://en.wikipedia.org/wiki/Universal_hashing#Constructions</a> for example.
</p>
TicketjdemeyerMon, 23 Oct 2017 14:41:31 GMT
https://trac.sagemath.org/ticket/10950#comment:14
https://trac.sagemath.org/ticket/10950#comment:14
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:11" title="Comment 11">tscrim</a>:
</p>
<blockquote class="citation">
<p>
However, here is another thought for a hash function:
</p>
<div class="wiki-code"><div class="code"><pre>temp <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span>
<span class="k">for</span> i<span class="p">,</span>j <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span>nonzero_positions<span class="p">():</span>
temp<span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> i
temp<span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> j
temp<span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+=</span> <span class="bp">self</span><span class="p">[</span>i<span class="p">,</span>j<span class="p">]</span>
<span class="k">return</span> <span class="nb">hash</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span>temp<span class="p">))</span>
</pre></div></div></blockquote>
<p>
I like the idea but not the concrete implementation. It does avoid sorting but there needs to be more mixing. Something like
</p>
<div class="wiki-code"><div class="code"><pre>h <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> i<span class="p">,</span>j <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span>nonzero_positions<span class="p">():</span>
h <span class="o">+=</span> H<span class="p">(</span>i<span class="p">,</span> j<span class="p">,</span> <span class="bp">self</span><span class="p">[</span>i<span class="p">,</span>j<span class="p">])</span>
<span class="k">return</span> h
</pre></div></div><p>
where <code>H</code> is a simple hash function.
</p>
TicketjdemeyerMon, 23 Oct 2017 14:45:02 GMTauthor, milestone set
https://trac.sagemath.org/ticket/10950#comment:15
https://trac.sagemath.org/ticket/10950#comment:15
<ul>
<li><strong>author</strong>
set to <em>Jeroen Demeyer</em>
</li>
<li><strong>milestone</strong>
set to <em>sage-8.1</em>
</li>
</ul>
<p>
Let me give this a shot...
</p>
TicketjdemeyerMon, 23 Oct 2017 15:27:59 GMTdependencies set
https://trac.sagemath.org/ticket/10950#comment:16
https://trac.sagemath.org/ticket/10950#comment:16
<ul>
<li><strong>dependencies</strong>
set to <em>#24090</em>
</li>
</ul>
TickettscrimMon, 23 Oct 2017 23:57:45 GMT
https://trac.sagemath.org/ticket/10950#comment:17
https://trac.sagemath.org/ticket/10950#comment:17
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:12" title="Comment 12">vdelecroix</a>:
</p>
<blockquote class="citation">
<p>
First of all, the hash function is time critical and caching a hash is useless. You are almost never trying to test <code>A = A</code> (are you?). At least, upstream Python decided not to cache it for tuples and frozensets based on concrete benchmarking.
</p>
<p>
Let me consider a first concrete situation. I have 2 invertible matrices 3x3 matrices and I want to check whether there is a multiplicative relation among them. It is likely that I need to consider products of size 20, that is put in a set around 3<sup>20</sup> matrices. Building a frozenset for each of them would be a considerable waste and caching would not help.
</p>
</blockquote>
<p>
It depends on what you mean by "considerable waste." In terms of computational complexity, it just adds to the constant (or probably would not even be contributing asymptotically for dense matrices). However, I do agree that it effectively doubles the memory footprint for dense matrices.
</p>
<blockquote class="citation">
<p>
A situation for which caching might be useful is when you have a "combinatorial" free module with a basis made of matrices and you encode your vectors using your matrices as keys. Though, that might be a problem of datastructure here (the keys are completely static).
</p>
<p>
Do you have more concrete situations where hashing is involved?
</p>
</blockquote>
<p>
As you mentioned, when they are used for keys of a module implemented as a <code>dict</code>, such for <code>CombinatorialFreeModule</code> elements, or for defining objects such as <code>FiniteDimensionalAlgebra</code>. Granted, in these situations the matrices are relatively small.
</p>
<blockquote class="citation">
<p>
For sorting the indices, I believe that in most situations hashing will be used on dense matrices. For them, you can go through the <code>(i,j)</code> indices in any order you like.
</p>
</blockquote>
<p>
I have been doing a bunch of stuff with sparse matrices recently (computing representations of Lie algebras) and have encountered things that are optimized for dense matrices that are used by sparse matrices as well. So I'd appreciate it if we didn't make the hash worse for sparse matrices just because it is better of dense matrices.
</p>
<blockquote class="citation">
<p>
Concerning collisions, I don't believe your hash function was a serious proposal.
</p>
</blockquote>
<p>
It was just a thought on how to remove the dependence on the order and not fully fleshed out. I know I'm less experienced in this area, but I am legitimately trying to help.
</p>
<blockquote class="citation">
<p>
Let us consider <code>n x n</code> matrices with entries <code>{0,1}</code>. Let N = n<sup>2</sup>. You know that you have 2<sup>N</sup> such matrices and it is easy to see that you have a polynomial number of values produced by your hash function (around N<sup>2</sup> N<sup>2</sup> N = N<sup>5</sup> if I am optimistic). Even on permutation matrices, your hash function would perform terribly.
</p>
</blockquote>
<p>
I agree that it would not perform well in this case. What do you think about Jeoren's modification. I think it works well for both dense and sparse matrices and should work better than my proposal.
</p>
<blockquote class="citation">
<p>
The relevance of polynomials is that you have the same kind of troubles passing from univariate (coefficients in a list like structure) to multivariate (coefficients in a dictionary like structure). See <a class="new ticket" href="https://trac.sagemath.org/ticket/21284" title="defect: hash of polynomials is very bad (new)">#21284</a>.
</p>
</blockquote>
<p>
For the multivariate polynomials, it looks like what is used is Jeroen's proposal. Although we could have a completely different solution for polynomials because the data structure for dense univariate is 1D and sparse k-variate are kD, whereas matrices are strictly 2D.
</p>
TicketjdemeyerTue, 24 Oct 2017 13:41:54 GMTbranch set
https://trac.sagemath.org/ticket/10950#comment:18
https://trac.sagemath.org/ticket/10950#comment:18
<ul>
<li><strong>branch</strong>
set to <em>u/jdemeyer/the_hash_function_for_matrices_suffers_from_many_collisions_with_permutation_matrices</em>
</li>
</ul>
TicketgitTue, 24 Oct 2017 15:30:21 GMTcommit set
https://trac.sagemath.org/ticket/10950#comment:19
https://trac.sagemath.org/ticket/10950#comment:19
<ul>
<li><strong>commit</strong>
set to <em>f67a8839d7dfb843a8ed8652a2734b644be579f1</em>
</li>
</ul>
<p>
Branch pushed to git repo; I updated commit sha1. New commits:
</p>
<table class="wiki">
<tr><td><a class="ext-link" href="https://git.sagemath.org/sage.git/commit/?id=f67a8839d7dfb843a8ed8652a2734b644be579f1"><span class="icon"></span>f67a883</a></td><td><code>Fix doctests</code>
</td></tr></table>
TicketjdemeyerTue, 24 Oct 2017 15:34:40 GMTdescription changed
https://trac.sagemath.org/ticket/10950#comment:20
https://trac.sagemath.org/ticket/10950#comment:20
<ul>
<li><strong>description</strong>
modified (<a href="/ticket/10950?action=diff&version=20">diff</a>)
</li>
</ul>
TicketjdemeyerTue, 24 Oct 2017 15:36:54 GMTdescription changed
https://trac.sagemath.org/ticket/10950#comment:21
https://trac.sagemath.org/ticket/10950#comment:21
<ul>
<li><strong>description</strong>
modified (<a href="/ticket/10950?action=diff&version=21">diff</a>)
</li>
</ul>
TicketjdemeyerTue, 24 Oct 2017 15:38:00 GMT
https://trac.sagemath.org/ticket/10950#comment:22
https://trac.sagemath.org/ticket/10950#comment:22
<p>
Feel free to comment. This is essentially "needs review" except that I haven't run all doctests yet.
</p>
TicketjdemeyerTue, 24 Oct 2017 15:39:30 GMTsummary changed
https://trac.sagemath.org/ticket/10950#comment:23
https://trac.sagemath.org/ticket/10950#comment:23
<ul>
<li><strong>summary</strong>
changed from <em>The hash function for matrices suffers from many collisions with permutation matrices</em> to <em>The hash function for matrices suffers from many collisions</em>
</li>
</ul>
TicketvdelecroixTue, 24 Oct 2017 15:55:13 GMT
https://trac.sagemath.org/ticket/10950#comment:24
https://trac.sagemath.org/ticket/10950#comment:24
<p>
Nice! Would be interesting to test in conjunction with <a class="closed ticket" href="https://trac.sagemath.org/ticket/23706" title="enhancement: allow several implementations of matrices in MatrixSpace (closed: fixed)">#23706</a>.
</p>
TicketvdelecroixTue, 24 Oct 2017 15:56:41 GMT
https://trac.sagemath.org/ticket/10950#comment:25
https://trac.sagemath.org/ticket/10950#comment:25
<p>
Instead of <code>hash(self.get_unsafe(i, j))</code> what about introducing <code>self._entry_hash(i, j)</code>?
</p>
TicketjdemeyerTue, 24 Oct 2017 19:11:17 GMT
https://trac.sagemath.org/ticket/10950#comment:26
https://trac.sagemath.org/ticket/10950#comment:26
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:25" title="Comment 25">vdelecroix</a>:
</p>
<blockquote class="citation">
<p>
Instead of <code>hash(self.get_unsafe(i, j))</code> what about introducing <code>self._entry_hash(i, j)</code>?
</p>
</blockquote>
<p>
Please elaborate. Why should I do that?
</p>
TicketgitTue, 24 Oct 2017 19:14:11 GMTcommit changed
https://trac.sagemath.org/ticket/10950#comment:27
https://trac.sagemath.org/ticket/10950#comment:27
<ul>
<li><strong>commit</strong>
changed from <em>f67a8839d7dfb843a8ed8652a2734b644be579f1</em> to <em>d159670f8f4f3196abd340a5ed3df2d16a291f8c</em>
</li>
</ul>
<p>
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
</p>
<table class="wiki">
<tr><td><a class="ext-link" href="https://git.sagemath.org/sage.git/commit/?id=d159670f8f4f3196abd340a5ed3df2d16a291f8c"><span class="icon"></span>d159670</a></td><td><code>Fix doctests</code>
</td></tr></table>
TicketjdemeyerTue, 24 Oct 2017 19:14:32 GMTstatus changed
https://trac.sagemath.org/ticket/10950#comment:28
https://trac.sagemath.org/ticket/10950#comment:28
<ul>
<li><strong>status</strong>
changed from <em>new</em> to <em>needs_review</em>
</li>
</ul>
TicketgitWed, 25 Oct 2017 08:10:03 GMTcommit changed
https://trac.sagemath.org/ticket/10950#comment:29
https://trac.sagemath.org/ticket/10950#comment:29
<ul>
<li><strong>commit</strong>
changed from <em>d159670f8f4f3196abd340a5ed3df2d16a291f8c</em> to <em>b3abe34f8b1c8b9592ed661de2d40d2ece85a5d4</em>
</li>
</ul>
<p>
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
</p>
<table class="wiki">
<tr><td><a class="ext-link" href="https://git.sagemath.org/sage.git/commit/?id=b98ce2845f2fae050502dfa8ee855127395d8810"><span class="icon"></span>b98ce28</a></td><td><code>Better hash for matrices</code>
</td></tr><tr><td><a class="ext-link" href="https://git.sagemath.org/sage.git/commit/?id=b3abe34f8b1c8b9592ed661de2d40d2ece85a5d4"><span class="icon"></span>b3abe34</a></td><td><code>Fix doctests</code>
</td></tr></table>
TicketjdemeyerWed, 25 Oct 2017 08:17:08 GMT
https://trac.sagemath.org/ticket/10950#comment:30
https://trac.sagemath.org/ticket/10950#comment:30
<p>
Reorganized the code a bit, added documentation in <code>matrix0.pyx</code> to explain why I did things the way I did.
</p>
TicketgitWed, 25 Oct 2017 08:20:42 GMTcommit changed
https://trac.sagemath.org/ticket/10950#comment:31
https://trac.sagemath.org/ticket/10950#comment:31
<ul>
<li><strong>commit</strong>
changed from <em>b3abe34f8b1c8b9592ed661de2d40d2ece85a5d4</em> to <em>da5c1ff296edce8e6037d134e774ffdf328984f8</em>
</li>
</ul>
<p>
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
</p>
<table class="wiki">
<tr><td><a class="ext-link" href="https://git.sagemath.org/sage.git/commit/?id=210febfdf2555cd5ffc8ad20303131edb4931b54"><span class="icon"></span>210febf</a></td><td><code>Better hash for matrices</code>
</td></tr><tr><td><a class="ext-link" href="https://git.sagemath.org/sage.git/commit/?id=da5c1ff296edce8e6037d134e774ffdf328984f8"><span class="icon"></span>da5c1ff</a></td><td><code>Fix doctests</code>
</td></tr></table>
TicketvdelecroixWed, 25 Oct 2017 09:35:44 GMT
https://trac.sagemath.org/ticket/10950#comment:32
https://trac.sagemath.org/ticket/10950#comment:32
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:26" title="Comment 26">jdemeyer</a>:
</p>
<blockquote class="citation">
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:25" title="Comment 25">vdelecroix</a>:
</p>
<blockquote class="citation">
<p>
Instead of <code>hash(self.get_unsafe(i, j))</code> what about introducing <code>self._entry_hash(i, j)</code>?
</p>
</blockquote>
<p>
Please elaborate. Why should I do that?
</p>
</blockquote>
<p>
To make hashing faster! That would be a method easy to implement for extension classes that store the data as a C attribute. In such situation you don't want to build a Python object each time you compute the hash of an entry. And you don't want to copy/paste of the <code>_hash_</code> function either.
</p>
TicketjdemeyerWed, 25 Oct 2017 09:41:38 GMT
https://trac.sagemath.org/ticket/10950#comment:33
https://trac.sagemath.org/ticket/10950#comment:33
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:32" title="Comment 32">vdelecroix</a>:
</p>
<blockquote class="citation">
<p>
To make hashing faster! That would be a method easy to implement for extension classes that store the data as a C attribute. In such situation you don't want to build a Python object each time you compute the hash of an entry. And you don't want to copy/paste of the <code>_hash_</code> function either.
</p>
</blockquote>
<p>
On the other hand, you force every <code>Matrix</code> class to implement yet another method...
</p>
<p>
I'm not really convinced... in any case, that can wait for a follow-up ticket.
</p>
TicketvdelecroixWed, 25 Oct 2017 09:43:30 GMT
https://trac.sagemath.org/ticket/10950#comment:34
https://trac.sagemath.org/ticket/10950#comment:34
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:33" title="Comment 33">jdemeyer</a>:
</p>
<blockquote class="citation">
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:32" title="Comment 32">vdelecroix</a>:
</p>
<blockquote class="citation">
<p>
To make hashing faster! That would be a method easy to implement for extension classes that store the data as a C attribute. In such situation you don't want to build a Python object each time you compute the hash of an entry. And you don't want to copy/paste of the <code>_hash_</code> function either.
</p>
</blockquote>
<p>
On the other hand, you force every <code>Matrix</code> class to implement yet another method...
</p>
</blockquote>
<p>
It would not be forced: <code>return hash(self._get_unsafe(i,j))</code> is a reasonable default.
</p>
<blockquote class="citation">
<p>
I'm not really convinced... in any case, that can wait for a follow-up ticket.
</p>
</blockquote>
<p>
Agreed, it can wait.
</p>
TicketjdemeyerWed, 25 Oct 2017 10:19:35 GMT
https://trac.sagemath.org/ticket/10950#comment:35
https://trac.sagemath.org/ticket/10950#comment:35
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:34" title="Comment 34">vdelecroix</a>:
</p>
<blockquote class="citation">
<p>
It would not be forced: <code>return hash(self._get_unsafe(i,j))</code> is a reasonable default.
</p>
</blockquote>
<p>
But that would add an extra level of indirection in the general case, slowing that down.
</p>
TickettscrimWed, 25 Oct 2017 22:57:13 GMTstatus changed; reviewer set
https://trac.sagemath.org/ticket/10950#comment:36
https://trac.sagemath.org/ticket/10950#comment:36
<ul>
<li><strong>status</strong>
changed from <em>needs_review</em> to <em>positive_review</em>
</li>
<li><strong>reviewer</strong>
set to <em>Travis Scrimshaw, Vincent Delecroix</em>
</li>
</ul>
<p>
I don't think there should be any Python object getting created by <code>_get_unsafe</code>, just a pointer to a Python object. What would be a use-case for implementing a custom <code>_entry_hash</code>?
</p>
<p>
Anyways, since you say it can wait for a followup and this is a good improvement, I am setting it to a positive review. If you still want to discuss things first, feel free to revert.
</p>
TicketjdemeyerThu, 26 Oct 2017 06:43:25 GMT
https://trac.sagemath.org/ticket/10950#comment:37
https://trac.sagemath.org/ticket/10950#comment:37
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:36" title="Comment 36">tscrim</a>:
</p>
<blockquote class="citation">
<p>
I don't think there should be any Python object getting created by <code>_get_unsafe</code>
</p>
</blockquote>
<p>
...only if the entries of the matrix are stored as Python objects. For many specialized classes (matrices over ZZ or over finite fields), the entries are stored in some C array, so <code>_get_unsafe()</code> does need to create a Python object.
</p>
TicketvbraunMon, 30 Oct 2017 07:41:31 GMTstatus, branch changed; resolution set
https://trac.sagemath.org/ticket/10950#comment:38
https://trac.sagemath.org/ticket/10950#comment:38
<ul>
<li><strong>status</strong>
changed from <em>positive_review</em> to <em>closed</em>
</li>
<li><strong>resolution</strong>
set to <em>fixed</em>
</li>
<li><strong>branch</strong>
changed from <em>u/jdemeyer/the_hash_function_for_matrices_suffers_from_many_collisions_with_permutation_matrices</em> to <em>da5c1ff296edce8e6037d134e774ffdf328984f8</em>
</li>
</ul>
TicketjdemeyerMon, 30 Oct 2017 16:35:56 GMTcommit deleted
https://trac.sagemath.org/ticket/10950#comment:39
https://trac.sagemath.org/ticket/10950#comment:39
<ul>
<li><strong>commit</strong>
<em>da5c1ff296edce8e6037d134e774ffdf328984f8</em> deleted
</li>
</ul>
<p>
Some more comments:
</p>
<p>
This hash function is a lot better than the old one. However, it is far from perfect:
</p>
<pre class="wiki">sage: M = diagonal_matrix([0, 1, -2, 1]); M.set_immutable(); hash(M)
0
</pre><p>
Still, it is the best hash that I could come up with which is very efficient for sparse matrices: it takes time <code>O(s)</code> with <code>s</code> the number of non-zero entries of a matrix (independent of the size of the matrix). If you are willing to sort the indices (as suggested in <a class="ticket" href="https://trac.sagemath.org/ticket/10950#comment:7" title="Comment 7">7</a>), you get a better hash but with a runtime of <code>O(s log(s))</code>.
</p>
Ticket