Sage: Ticket #16116: Multiplication of dense cyclotomic matrices should be faster
https://trac.sagemath.org/ticket/16116
<p>
This ticket organizes various improvements in order to get faster matrices over cyclotomic fields. The aim is to implement and compare three different ways to perform such computation:
</p>
<ul><li>libgap wrappers
</li><li>generic matrix (<code>matrix_generic_dense.Matrix_generic_dense</code>)
</li><li>the specialized inadapted class that is used by default now in Sage (<code>matrix_cyclo_dense.Matrix_cyclo_dense</code>)
</li></ul><p>
Concrete tickets:
</p>
<ul><li><a class="closed ticket" href="https://trac.sagemath.org/ticket/23704" title="enhancement: getitem/setitem for libgap (closed: fixed)">#23704</a>: getitem/setitem for libgap elements
</li><li><a class="closed ticket" href="https://trac.sagemath.org/ticket/23706" title="enhancement: allow several implementations of matrices in MatrixSpace (closed: fixed)">#23706</a>: gap class for matrices + be able to change the implementation
</li></ul><hr />
<p>
Description from several years ago...
</p>
<p>
The multiplication of matrices with a (universal) cyclotomic fields as base ring could be optimized as the following profiling shows:
</p>
<pre class="wiki">sage: def make_matrix1(R,a,b):
....: return matrix(R, 3, [[-1, 1, 2*a],
....: [-4*a*b - 1, 4*a*b + 4*b^2, 4*a*b + 2*a],
....: [-2*a, 2*a + 2*b, 2*a]])
sage: PR.<x,y> = PolynomialRing(QQ)
sage: I = Ideal(x^2 - 1/2*x - 1/4, y^3 - 1/2*y^2 - 1/2*y + 1/8)
sage: Q = PR.quotient(I)
sage: elmt = make_matrix1(Q, x, y)
sage: %timeit elmt^2
1000 loops, best of 3: 1.17 ms per loop
sage: UCF.<E> = UniversalCyclotomicField()
sage: ae = (E(10)+~E(10))/2 #same value as a
sage: be = (E(14)+~E(14))/2 #same value as b
sage: m = make_matrix1(UCF, ae, be)
sage: %timeit m^2
100 loops, best of 3: 8.13 ms per loop
sage: CF.<F> = CyclotomicField(2*5*7)
sage: af = (F^7+~F^7)/2 #same value as a
sage: bf = (F^5+~F^5)/2 #same value as b
sage: m2 = make_matrix1(CF, af, bf)
sage: %timeit m2^2
100 loops, best of 3: 4.99 ms per loop
</pre><p>
The three matrices elmt, m and m2 are the same encoded into 3 different base rings. It would be natural to think that the cyclotomic field be the optimal field to do computations, but it does not seem to be the case in practice.
</p>
<p>
Here is a univariate example.
</p>
<pre class="wiki">sage: def make_matrix2(R, a):
....: return matrix(R, 3, [[-2*a, 1, 6*a+2],
....: [-2*a, 2*a, 4*a+1],
....: [0, 0, 1]])
sage: PR.<x> = PolynomialRing(QQ)
sage: I = Ideal(x^2 - 1/2*x - 1/4)
sage: Q = PR.quotient(I)
sage: elmt_uni = make_matrix2(Q, x)
sage: %timeit elmt_uni*elmt_uni
1000 loops, best of 3: 1.46 ms per loop
sage: CF.<F> = CyclotomicField(2*5)
sage: f5 = (F+~F)/2
sage: m = make_matrix2(CF, f5)
sage: type(m)
<type 'sage.matrix.matrix_cyclo_dense.Matrix_cyclo_dense'>
sage: m.parent()
Full MatrixSpace of 3 by 3 dense matrices over
Cyclotomic Field of order 10 and degree 4
sage: %timeit m*m
100 loops, best of 3: 1.98 ms per loop
</pre><p>
Then, I disactivated the verification on cyclotomic fields on line 962 of the file /src/sage/matrix/matrix_space.py to get a matrix_generic_dense instead of matrix_cyclo_dense.
</p>
<pre class="wiki">sage: CF.<F> = CyclotomicField(2*5)
sage: f5 = (F+~F)/2
sage: m = make_matrix2(CF, f5)
sage: m.parent()
Full MatrixSpace of 3 by 3 dense matrices over
Cyclotomic Field of order 10 and degree 4
sage: type(m)
<type 'sage.matrix.matrix_generic_dense.Matrix_generic_dense'>
sage: %timeit m*m
1000 loops, best of 3: 251 µs per loop
</pre><p>
The gain is significant. Is there a known use cases where the specialized implementation is faster than the generic one? If yes, should we make some threshold test to choose between the two implementations?
</p>
en-usSagehttps://trac.sagemath.org/chrome/site/logo_sagemath_trac.png
https://trac.sagemath.org/ticket/16116
Trac 1.1.6jipilabThu, 10 Apr 2014 09:20:00 GMTcc, description changed
https://trac.sagemath.org/ticket/16116#comment:1
https://trac.sagemath.org/ticket/16116#comment:1
<ul>
<li><strong>cc</strong>
<em>was</em> added
</li>
<li><strong>description</strong>
modified (<a href="/ticket/16116?action=diff&version=1">diff</a>)
</li>
</ul>
TicketjipilabThu, 10 Apr 2014 09:41:33 GMTdescription changed
https://trac.sagemath.org/ticket/16116#comment:2
https://trac.sagemath.org/ticket/16116#comment:2
<ul>
<li><strong>description</strong>
modified (<a href="/ticket/16116?action=diff&version=2">diff</a>)
</li>
</ul>
Ticketvbraun_spamTue, 06 May 2014 15:20:58 GMTmilestone changed
https://trac.sagemath.org/ticket/16116#comment:3
https://trac.sagemath.org/ticket/16116#comment:3
<ul>
<li><strong>milestone</strong>
changed from <em>sage-6.2</em> to <em>sage-6.3</em>
</li>
</ul>
Ticketvbraun_spamSun, 10 Aug 2014 16:51:03 GMTmilestone changed
https://trac.sagemath.org/ticket/16116#comment:4
https://trac.sagemath.org/ticket/16116#comment:4
<ul>
<li><strong>milestone</strong>
changed from <em>sage-6.3</em> to <em>sage-6.4</em>
</li>
</ul>
TicketvdelecroixSat, 11 Apr 2015 08:42:21 GMT
https://trac.sagemath.org/ticket/16116#comment:5
https://trac.sagemath.org/ticket/16116#comment:5
<p>
Hello,
</p>
<p>
With <a class="closed ticket" href="https://trac.sagemath.org/ticket/18152" title="enhancement: Universal Cyclotomic Field implementation using libgap (closed: fixed)">#18152</a>, I got a 10x speedup
</p>
<p>
old version:
</p>
<pre class="wiki">sage: %timeit m^2
100 loops, best of 3: 3.65 ms per loop
</pre><p>
new version:
</p>
<pre class="wiki">sage: %timeit m^2
The slowest run took 69.65 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 336 µs per loop
</pre><p>
Vincent
</p>
TicketvdelecroixSat, 11 Apr 2015 08:44:49 GMT
https://trac.sagemath.org/ticket/16116#comment:6
https://trac.sagemath.org/ticket/16116#comment:6
<p>
And using libgap directly is even faster
</p>
<pre class="wiki">sage: M = m._libgap_()
sage: %timeit M^2
The slowest run took 9.57 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 183 µs per loop
</pre><p>
So, as written in the bottom of the description in <a class="closed ticket" href="https://trac.sagemath.org/ticket/18152" title="enhancement: Universal Cyclotomic Field implementation using libgap (closed: fixed)">#18152</a>, we should wrap GAP matrices to deal with dense cyclotomics matrices in Sage.
</p>
<p>
Vincent
</p>
TicketvdelecroixSat, 11 Apr 2015 13:15:35 GMTdescription changed
https://trac.sagemath.org/ticket/16116#comment:7
https://trac.sagemath.org/ticket/16116#comment:7
<ul>
<li><strong>description</strong>
modified (<a href="/ticket/16116?action=diff&version=7">diff</a>)
</li>
</ul>
<p>
Hello,
</p>
<p>
I reformatted your example such that they fit in less lines (it can easily switched back to your original version if you do not like it).
</p>
<p>
I had a quick look at the code for dense cyclotomic matrices. The implementation is quite old and uses a lot of reduction mod p (even for multiplication). The code calls a lot of Python code like creating a finite field, creating a matrix space, etc which are relatively slow compared to a small matrix multiplication. Did you try multiplying larger matrices (i.e. 10x10 or 15x15)? On the other hand, I am pretty sure that some cleaning could be of great speed up. By cleaning I mean:
</p>
<ul><li>declare <code>cdef</code> variables wherever possible
</li><li>let as minimum as possible <code>import</code> inside the methods
</li><li>...
</li></ul><p>
You can also do some profiling on the code (using "%prun" and "%crun"), see <a class="closed ticket" href="https://trac.sagemath.org/ticket/17689" title="enhancement: A tutorial on profiling in Sage (closed: fixed)">#17689</a> which is not yet in the current development release.
</p>
<p>
Vincent
</p>
TicketwasSat, 11 Apr 2015 14:19:11 GMT
https://trac.sagemath.org/ticket/16116#comment:8
https://trac.sagemath.org/ticket/16116#comment:8
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/16116#comment:7" title="Comment 7">vdelecroix</a>:
</p>
<blockquote class="citation">
<p>
Hello,
</p>
<p>
I reformatted your example such that they fit in less lines (it can easily switched back to your original version if you do not like it).
</p>
<p>
I had a quick look at the code for dense cyclotomic matrices. The implementation is quite old and uses a lot of reduction mod p (even for multiplication). The code calls a lot of Python code like creating a finite field, creating a matrix space, etc which are relatively slow compared to a small matrix multiplication. Did you try multiplying larger matrices (i.e. 10x10 or 15x15)?
</p>
</blockquote>
<p>
I designed and implemented the algorithm for dense cyclotomic matrices. We were optimizing for larger matrices... which in the context of modular forms means at least 100 rows (and often much, much more). GAP/pari on the other hand optimize for relatively tiny matrices. The asymptotically fast algorithms for large matrices are totally different than for small...
</p>
TicketvdelecroixSat, 11 Apr 2015 14:26:24 GMT
https://trac.sagemath.org/ticket/16116#comment:9
https://trac.sagemath.org/ticket/16116#comment:9
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/16116#comment:8" title="Comment 8">was</a>:
</p>
<blockquote class="citation">
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/16116#comment:7" title="Comment 7">vdelecroix</a>:
</p>
<blockquote class="citation">
<p>
Hello,
</p>
<p>
I reformatted your example such that they fit in less lines (it can easily switched back to your original version if you do not like it).
</p>
<p>
I had a quick look at the code for dense cyclotomic matrices. The implementation is quite old and uses a lot of reduction mod p (even for multiplication). The code calls a lot of Python code like creating a finite field, creating a matrix space, etc which are relatively slow compared to a small matrix multiplication. Did you try multiplying larger matrices (i.e. 10x10 or 15x15)?
</p>
</blockquote>
<p>
I designed and implemented the algorithm for dense cyclotomic matrices. We were optimizing for larger matrices... which in the context of modular forms means at least 100 rows (and often much, much more). GAP/pari on the other hand optimize for relatively tiny matrices. The asymptotically fast algorithms for large matrices are totally different than for small...
</p>
</blockquote>
<p>
All right. I now understand better what I read! I see two possibilities.
</p>
<p>
1) [easy one] We add a test in the matrix space constructor:
</p>
<ul><li>if the size is small -> use the generic implementation of dense matrices
</li><li>if the size is large -> use your optimized version
</li></ul><p>
This requires a bit of benchmark.
</p>
<p>
2) [better one] Wrap pari matrices for small sizes or add a another multiplication on the current datatype that is fast on small matrices.
</p>
<p>
Vincent
</p>
TicketjipilabMon, 13 Apr 2015 13:19:29 GMT
https://trac.sagemath.org/ticket/16116#comment:10
https://trac.sagemath.org/ticket/16116#comment:10
<p>
Hi,
</p>
<p>
In the case I'm interested in, it is definitely for small sizes. Say up to 20-25 at the very very biggest. Most commonly it is going to be between 3 and 10. This is related to the tickets <a class="new ticket" href="https://trac.sagemath.org/ticket/15703" title="enhancement: Refactor Coxeter groups as matrix groups and non crystallographic root ... (new)">#15703</a>, <a class="new ticket" href="https://trac.sagemath.org/ticket/16087" title="enhancement: Limit roots in the geometric representation of Coxeter groups (new)">#16087</a>.
</p>
<p>
How to proceed to add another multiplication to the matrices that could be used for small matrices?
</p>
<p>
Are there examples around with such a thing? I could look at it...
</p>
TicketwasMon, 13 Apr 2015 13:21:51 GMT
https://trac.sagemath.org/ticket/16116#comment:11
https://trac.sagemath.org/ticket/16116#comment:11
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/16116#comment:10" title="Comment 10">jipilab</a>:
</p>
<blockquote class="citation">
<p>
Hi,
</p>
<p>
In the case I'm interested in, it is definitely for small sizes. Say up to 20-25 at the very very biggest. Most commonly it is going to be between 3 and 10. This is related to the tickets <a class="new ticket" href="https://trac.sagemath.org/ticket/15703" title="enhancement: Refactor Coxeter groups as matrix groups and non crystallographic root ... (new)">#15703</a>, <a class="new ticket" href="https://trac.sagemath.org/ticket/16087" title="enhancement: Limit roots in the geometric representation of Coxeter groups (new)">#16087</a>.
</p>
<p>
How to proceed to add another multiplication to the matrices that could be used for small matrices?
</p>
<p>
Are there examples around with such a thing? I could look at it...
</p>
</blockquote>
<p>
Matrices over ZZ used to have special code for small versus large. I think now that variation in algorithms is mainly hidden by calling FLINT. Look into it.
</p>
TicketvdelecroixMon, 13 Apr 2015 13:37:18 GMT
https://trac.sagemath.org/ticket/16116#comment:12
https://trac.sagemath.org/ticket/16116#comment:12
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/16116#comment:11" title="Comment 11">was</a>:
</p>
<blockquote class="citation">
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/16116#comment:10" title="Comment 10">jipilab</a>:
</p>
<blockquote class="citation">
<p>
Hi,
</p>
<p>
In the case I'm interested in, it is definitely for small sizes. Say up to 20-25 at the very very biggest. Most commonly it is going to be between 3 and 10. This is related to the tickets <a class="new ticket" href="https://trac.sagemath.org/ticket/15703" title="enhancement: Refactor Coxeter groups as matrix groups and non crystallographic root ... (new)">#15703</a>, <a class="new ticket" href="https://trac.sagemath.org/ticket/16087" title="enhancement: Limit roots in the geometric representation of Coxeter groups (new)">#16087</a>.
</p>
<p>
How to proceed to add another multiplication to the matrices that could be used for small matrices?
</p>
<p>
Are there examples around with such a thing? I could look at it...
</p>
</blockquote>
<p>
Matrices over ZZ used to have special code for small versus large. I think now that variation in algorithms is mainly hidden by calling FLINT. Look into it.
</p>
</blockquote>
<p>
William, are you sure that the representation you used in <code>MatrixDense_cyclotomic</code> is the thing we want for small sizes? If that so, I would rather implement something for any number fields. I do not see why it might be different. Did you have something in mind?
</p>
<p>
In the present case, I would rather modify <code>MatrixSpace._get_matrix_class</code> to choose the generic dense matrices for small sizes and see if something break.
</p>
TicketwasMon, 13 Apr 2015 16:37:37 GMT
https://trac.sagemath.org/ticket/16116#comment:13
https://trac.sagemath.org/ticket/16116#comment:13
<blockquote class="citation">
<p>
William, are you sure that the representation you used in MatrixDense_cyclotomic is the thing we want for small sizes?
</p>
</blockquote>
<p>
No. In fact, I'm pretty sure they are *not* what you would want for small sizes.
</p>
TickettscrimWed, 16 Sep 2015 23:09:26 GMT
https://trac.sagemath.org/ticket/16116#comment:14
https://trac.sagemath.org/ticket/16116#comment:14
<p>
We also have a related issue:
</p>
<pre class="wiki">sage: R = CyclotomicField(12)
sage: M = matrix.random(R, 40,40)
sage: N = matrix.random(R, 3, 3)
sage: %time K = M.tensor_product(N)
CPU times: user 5.75 s, sys: 28.4 ms, total: 5.78 s
Wall time: 5.73 s
sage: R.defining_polynomial()
x^4 - x^2 + 1
sage: type(M)
<type 'sage.matrix.matrix_cyclo_dense.Matrix_cyclo_dense'>
sage: R = NumberField(x^4 - x^2 + 1, 'a')
sage: M = matrix.random(R, 40,40)
sage: N = matrix.random(R, 3, 3)
sage: %time K = M.tensor_product(N)
CPU times: user 225 ms, sys: 16.4 ms, total: 241 ms
Wall time: 232 ms
sage: type(M)
<type 'sage.matrix.matrix_generic_dense.Matrix_generic_dense'>
</pre><p>
Where the issue is coming from having a scalar times a matrix. Here's some profiling info of doing it over the cyclotomic field:
</p>
<pre class="wiki"> 594202 1.806 0.000 3.620 0.000 number_field.py:9200(_element_constructor_)
594202 0.816 0.000 1.171 0.000 number_field.py:6628(_coerce_non_number_field_element_in)
4184691 0.569 0.000 0.569 0.000 {isinstance}
</pre><p>
This is nowhere to be found when doing it over the number field. (For very small matrices this isn't a problem per se, but it still is visible when profiling.)
</p>
<p>
So my conclusion is that we are doing something wrong with how we handle multiplication with cyclotomics in the matrix versus our generic dense cases.
</p>
TickettscrimWed, 16 Sep 2015 23:18:48 GMT
https://trac.sagemath.org/ticket/16116#comment:15
https://trac.sagemath.org/ticket/16116#comment:15
<p>
I should note that I get very different profiling when I reverse the orders of the tensor product, which from the naive implementation of the tensor product and thoughts about scalar multiplication surprises me:
</p>
<pre class="wiki">sage: R = CyclotomicField(2)
sage: M = matrix.random(R, 40,40)
sage: N = matrix.random(R, 3, 3)
sage: %time K = N.tensor_product(M)
CPU times: user 337 ms, sys: 20.6 ms, total: 358 ms
Wall time: 335 ms
sage: %time K = M.tensor_product(N)
CPU times: user 3.99 s, sys: 32.5 ms, total: 4.02 s
Wall time: 3.97 s
</pre><p>
There are quite a lot more function calls (~10x) to the <code>_element_constructor_</code> in one ordering:
</p>
<pre class="wiki">48023 for _element_constructor_
577312 function calls (577311 primitive calls) in 0.421 seconds
</pre><p>
versus
</p>
<pre class="wiki">594198 for _element_constructor_
7240514 function calls (7240513 primitive calls) in 4.992 seconds
</pre>
TickettscrimSun, 20 Sep 2015 16:46:05 GMTmilestone changed
https://trac.sagemath.org/ticket/16116#comment:16
https://trac.sagemath.org/ticket/16116#comment:16
<ul>
<li><strong>milestone</strong>
changed from <em>sage-6.4</em> to <em>sage-6.9</em>
</li>
</ul>
<p>
The part which handles speeding up the tensor product is now <a class="closed ticket" href="https://trac.sagemath.org/ticket/19258" title="defect: Speedup tensor products of cyclotomic matrices (closed: fixed)">#19258</a>.
</p>
TickettscrimSun, 03 Jan 2016 01:15:31 GMTmilestone changed
https://trac.sagemath.org/ticket/16116#comment:17
https://trac.sagemath.org/ticket/16116#comment:17
<ul>
<li><strong>milestone</strong>
changed from <em>sage-6.9</em> to <em>sage-7.0</em>
</li>
</ul>
<p>
It seems that matrix multiplication over the universal cyclotomic field is on the same order as the polynomial ring (probably because it uses the generic matrix class):
</p>
<pre class="wiki">sage: %timeit m * m
1000 loops, best of 3: 224 µs per loop
sage: %timeit elmt * elmt
1000 loops, best of 3: 207 µs per loop
</pre><p>
However for UCF matrices, I'm thinking we might benefit from either using (lib)GAP's matrix multiplication or internally storing the GAP element and only converting it to a Sage UCF element as necessary. See <a class="closed ticket" href="https://trac.sagemath.org/ticket/19821" title="enhancement: Increase speed for Coxeter groups, Weyl groups, and quantum Bruhat graph (closed: fixed)">#19821</a> for a use-case.
</p>
TicketvdelecroixSun, 03 Jan 2016 03:32:32 GMT
https://trac.sagemath.org/ticket/16116#comment:18
https://trac.sagemath.org/ticket/16116#comment:18
<p>
At least on sage-7.0.beta2, wrapping GAP matrices for the examples mentioned in the ticket description will not bring any magic
</p>
<pre class="wiki">sage: M = m._libgap_()
sage: %timeit A = M^2
1000 loops, best of 3: 181 µs per loop
sage: %timeit A = M^3
1000 loops, best of 3: 456 µs per loop
</pre><p>
versus
</p>
<pre class="wiki">sage: %timeit a = m^2
1000 loops, best of 3: 298 µs per loop
sage: %timeit a = m^3
1000 loops, best of 3: 690 µs per loop
</pre><p>
We are below x2 speedup. But in this example the matrix is small and coefficients relatively dense (~25 nonzero coefficients). Though, the gain is significant with 10x10 dense matrices with small coefficients
</p>
<pre class="wiki">sage: m1 = matrix(10, [E(randint(2,3)) for _ in range(100)])
sage: m2 = matrix(10, [E(randint(2,3)) for _ in range(100)])
sage: %timeit m1*m2
100 loops, best of 3: 4.51 ms per loop
sage: %timeit M1*M2
1000 loops, best of 3: 329 µs per loop
</pre><p>
We might update the ticket description accordingly. Two concrete propositions are:
</p>
<ul><li>below a certain threshold (to be determined) use generic matrices for cyclotomic fields
</li><li>wrap GAP matrices for UCF
</li></ul><p>
What do you think?
</p>
TickettscrimSun, 03 Jan 2016 03:57:51 GMT
https://trac.sagemath.org/ticket/16116#comment:19
https://trac.sagemath.org/ticket/16116#comment:19
<p>
A 30-40% speed reduction is nothing to scoff at either. So from that data, I think for dense matrices over the UCF we should always just wrap GAP.
</p>
TicketvdelecroixThu, 24 Aug 2017 21:09:46 GMTkeywords, milestone changed
https://trac.sagemath.org/ticket/16116#comment:20
https://trac.sagemath.org/ticket/16116#comment:20
<ul>
<li><strong>keywords</strong>
<em>days88</em> added
</li>
<li><strong>milestone</strong>
changed from <em>sage-7.0</em> to <em>sage-8.1</em>
</li>
</ul>
TicketvdelecroixThu, 24 Aug 2017 22:40:17 GMTcomponent, type, description changed
https://trac.sagemath.org/ticket/16116#comment:21
https://trac.sagemath.org/ticket/16116#comment:21
<ul>
<li><strong>component</strong>
changed from <em>number fields</em> to <em>linear algebra</em>
</li>
<li><strong>type</strong>
changed from <em>enhancement</em> to <em>task</em>
</li>
<li><strong>description</strong>
modified (<a href="/ticket/16116?action=diff&version=21">diff</a>)
</li>
</ul>
TicketvdelecroixFri, 25 Aug 2017 00:06:27 GMTdescription changed
https://trac.sagemath.org/ticket/16116#comment:22
https://trac.sagemath.org/ticket/16116#comment:22
<ul>
<li><strong>description</strong>
modified (<a href="/ticket/16116?action=diff&version=22">diff</a>)
</li>
</ul>
Ticket