# Ticket #10255(needs_review defect)

Opened 3 years ago

## Correct karatsuba multiplication of univariate polynomials for different degree polynomials

Reported by: Owned by: lftabera AlexGhitza major sage-5.10 basic arithmetic karatsuba, multiplication, polynomial N/A Luis Felipe Tabera Alonso

In the generic implementation of univariate polynomials over exact rings, plain karatsuba is used.

However, for different degree polynomials, the karatsuba code makes more products and additions than the classical multiplication code. Moreover, for equally sized polynomials, the degree in which karatsuba starts to be better than plain multiplication looks too high for me.

See attachments comparison_product_50_400.png and comparison_addition_50_400.png for the number of operations multiplying degree 50 and 400 polynomials, in yellow, the number of operations using _mul_generic, in red using current _mul_karatsuba as of 4.6.1 and in blue with the patch proposed.

sage: K=QQ[I][x]
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 6.03 s, sys: 0.08 s, total: 6.11 s
Wall time: 6.43 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 6.95 s, sys: 0.06 s, total: 7.01 s
Wall time: 7.76 s

See comment:13 for some benchmarks

Apply: trac_10255_karatsuba_improbement.patch

## Change History

### comment:1 Changed 3 years ago by lftabera

Same example with my patch. Pure karatsuba:

sage: K=QQ[I][x]
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 1.53 s, sys: 0.01 s, total: 1.54 s
Wall time: 1.73 s

TODO:

-Take a more careful look to the correctness of code. -Check that the changes do not introduce performance penalties on a wider set of rings, degrees and size of polynomials. -Introduce a user-definable threshold (with a sensible default for most common rings) for which karatsuba falls back to classic multiplication.

### comment:2 Changed 2 years ago by lftabera

the code is too naive for different sized polynomials. Even with my improvements I get this:

Two degree 849 polynomials, no threshold:
87729 products

One degree 849 polynomial and a 449 polynomial:
218649 products

Experimental data

sage: f=ZZ[x].random_element(849)
sage: g=ZZ[x].random_element(849)
sage: %timeit f._mul_karatsuba(g)
5 loops, best of 3: 95.9 ms per loop
sage: f=ZZ[x].random_element(849)
sage: g=ZZ[x].random_element(449)
sage: %timeit f._mul_karatsuba(g)
5 loops, best of 3: 218 ms per loop

The shorter product is more expensive!

I have a patch right now to solve this, have to clean it up though.

### comment:3 Changed 2 years ago by lftabera

Further improvements for polynomials of of different sizes:

if f is of degree m-1 and g is of degree n-1 with m<n, consider g as:

g0 + x^m*g1 + x^(2*m)g2 + ... + x^(q*m)*gq

Compute the products f*gi with karatsuba and reconstruct f*g

With this strategy, for polynomials of degree 849 and 449 I get:

39942 products

Far better from previous method. Anyway this is not optimal and it is hard to messure the recursion overhead. I have taken this strategy from old versions of gmp documentation.

Some timmings (Debian 64bits). All polyomials have been generated with random_element without further parameters. Pure karatsuba without threshold for better comparison with old code. With an appropriate threshold the timming whould slightly better.

Note that for rigns like QQ[x] and GF(2)[x] sage uses faster special libraries (FLINT, NTL and the like).

Degree f:1000,g:1000

QQ[x]
old code: 828 ms
new code: 498 ms
GF(2)[x]
old code: 284 ms
new code: 99.1 ms
QQ[sqrt(2)][x]
old code: 2.61 s
new code: 480 ms # 5 times faster
ZZ[x]['y']['z']['t']
old code: 86.95 s
new code: 71.10 s

degree f:849, g:449

QQ[x]
old code: 988 ms
new code: 343 ms
GF(2)[x]
old code: 148 ms
new code: 67.6 ms
QQ[sqrt(2)][x]
old code: 1.93 s
new code: 313 ms #6 times faster
ZZ[x]['y']['z']['t']
old code: 342.16 s
new code: 48.11 s # 7 times faster

degree 3, 5

QQ[x]
old code: 135 µs
new code: 83.6 µs
GF(2)[x]
old code: 137 µs
new code: 100 µs
QQ[sqrt(2)][x]
old code: 323 µs
new code: 86.4 µs
ZZ[x]['y']['z']['t']
old code: 19.5 ms
new code: 13.2 ms

TODO:

• Sensible default threshold.
• Documentation cleanup. (almost done)
• More testing for correctness rather than speed. This is a very basic feature, so we cannot allow correctness bugs here.

### comment:4 Changed 2 years ago by lftabera

All doctest passes and I am trying to make doctest independent of future changes of the default threshold. But I get the following difference that puzzles me.

sage -t -long -force_lib "devel/sage/sage/rings/qqbar.py"
**********************************************************************
File "/home/rosamari/sage/devel/sage/sage/rings/qqbar.py", line 373:
sage: sage_input(one, verify=True)
Expected:
# Verified
R.<x> = QQbar[]
v1 = AA(2)
v2 = QQbar(sqrt(v1))
v3 = QQbar(3)
v4 = sqrt(v3)
v5 = v2*v4
v6 = (1 - v2)*(1 - v4) - 1 - v5
v7 = QQbar(sqrt(v1))
v8 = sqrt(v3)
si1 = v7*v8
cp = AA.common_polynomial(x^2 + ((1 - v7)*(1 + v8) - 1 + si1)*x - si1)
v9 = QQbar.polynomial_root(cp, RIF(-RR(1.7320508075688774), -RR(1.7320508075688772)))
v10 = 1 - v9
v11 = v6 + (v10 - 1)
v12 = -1 - v4 - QQbar.polynomial_root(cp, RIF(-RR(1.7320508075688774), -RR(1.7320508075688772)))
v13 = 1 + v12
v14 = v10*(v6 + v5) - (v6 - v5*v9)
si2 = v5*v9
AA.polynomial_root(AA.common_polynomial(x^4 + (v11 + (v13 - 1))*x^3 + (v14 + (v13*v11 - v11))*x^2 + (v13*(v14 - si2) - (v14 - si2*v12))*x - si2*v12), RIF(RR(0.99999999999999989), RR(1.0000000000000002)))
Got:
# Verified
R.<x> = QQbar[]
v1 = QQbar(3)
v2 = sqrt(v1)
v3 = AA(2)
v4 = QQbar(sqrt(v3))
v5 = sqrt(v1)
si1 = v4*v5
cp = AA.common_polynomial(x^2 + ((1 - v4)*(1 + v5) - 1 + si1)*x - si1)
v6 = -1 - v2 - QQbar.polynomial_root(cp, RIF(-RR(1.7320508075688774), -RR(1.7320508075688772)))
v7 = QQbar.polynomial_root(cp, RIF(-RR(1.7320508075688774), -RR(1.7320508075688772)))
v8 = QQbar(sqrt(v3))
v9 = v8*v2
v10 = (1 - v8)*(1 - v2) - 1 - v9
v11 = -v7 + v10
v12 = v6*v11
si2 = v7*v9
v13 = (1 - v7)*(v10 + v9) - v10 + si2
si3 = v6*si2
AA.polynomial_root(AA.common_polynomial(x^4 + ((1 + v6)*(1 + v11) - 1 - v12)*x^3 + (v12 + v13)*x^2 + ((1 + v6)*(v13 - si2) - v13 + si3)*x - si3), RIF(RR(0.99999999999999989), RR(1.0000000000000002)))
**********************************************************************
1 of 152 in __main__.example_0
***Test Failed*** 1 failures.

### comment:5 Changed 2 years ago by lftabera

I have eliminated karatsuba_sumand karatsuba_diff so there are less function calls and less if branches in the coding (we know an can predict at each step the length of every list.

I have also eliminated most slicing of lists so less objects are creating. The recursion does not operates over slices but over the orginial list plus some pointers to the slice we are working on.

With this improvements the code is about 10% faster. I also think that improves readability.

For number fields of degree 2 and 3 the code is about three times slower than pari. For number fields of degree >=6 sage seems faster than pari.

For dense bivariate polynomials in QQ[x][y] the multiplication starts to be faster than in QQ[x,y] for degree 16, for dense polynomials of degree 100 QQ[x][y] is 45x faster than QQ[x,y]. Of course, for non-dense polynomials multivariate multiplication is much much faster.

Some doctests added to compare karatsuba with generic multiplication and doctest for the non-commutative ring case.

Still has the QQbar fail.

### comment:6 follow-up: ↓ 7 Changed 2 years ago by spancratz

Hello,

First of all, I want to say that it's great that you are looking at this part of the Sage library and are trying to improve it!

From reading your ticket progress, it seems that perhaps you might benefit from some help with this. Have you considered advertising this problem on the developers' list at sage-devel? Perhaps someone else is interested in this, too.

In any case, here are a couple of comments:

1) What is your main motivation for wanting to improve this code? If you want to achieve faster computations in polynomial rings over specific rings, QQ[i], you should most definitely either write some Cython code or even a separate C library. You should *not* be tinkering with the generic Python multiplication code. To keep with the example, if you are interested in QQ[i], then (just as a first idea) you should represent your polynomials as sums of two polynomials internally, one for the real part and one for the imaginary code. This approach also applies to other number fields of small degree.

2) Stating what you definitely already know: Karatsuba multiplication trades additions and multiplications in the base ring. Thus, the threshold highly depends on the relative speeds of addition and multiplication, and it will be very difficult for you to make a default choice that is useful in general. Even worse, what about base rings where multiplication is faster than addition?

3) I haven't looked at the details, but I believe you might simply not be able to apply this code to polynomial rings with inexact base rings.

All that said, while this patch looks very interesting, I think there are a lot of things that need to be sorted out. By the way, I am not quite sure whether any one else is currently reading this ticket, but at least during the next week I should have a lot of time to look at this.

Best wishes,

Sebastian

### comment:7 in reply to: ↑ 6 Changed 2 years ago by lftabera

Sebastian,

Thanks for looking at this. I feel that the ticket is ready for review, but there is a doctest fail and I wanted to understand why. My impression is that the doctest should be changed, but I did not have enough time to dig it.

1) What is your main motivation for wanting to improve this code? If you want to achieve faster computations in polynomial rings over specific rings, QQ[i], you should most definitely either write some Cython code or even a separate C library. You should *not* be tinkering with the generic Python multiplication code. To keep with the example, if you are interested in QQ[i], then (just as a first idea) you should represent your polynomials as sums of two polynomials internally, one for the real part and one for the imaginary code. This approach also applies to other number fields of small degree.

My primary interest was univariate polynomial rings over number fields of degree around 30-40. I was considering to use the specific class I have added in #8558, because, for number field base rings, the slowness comes really from adding number field elements by python 0. But then I realised that the karatsuba multiplication is not well enough implemented if both polynomials have different degree, so I changed the cython code improving here and there.

2) Stating what you definitely already know: Karatsuba multiplication trades additions and multiplications in the base ring. Thus, the threshold highly depends on the relative speeds of addition and multiplication, and it will be very difficult for you to make a default choice that is useful in general. Even worse, what about base rings where multiplication is faster than addition?

That is why the default threshold is zero. No classical multiplication at all. But I have added a ring-dependent parameter to change this value. Univariate polynomial rings have now a method set_karatsuba_threshold to deal with this issue. Rings in which addition is faster than multiplication also benefits from karatsuba, since the number of additions is also subquadratic in the degree of the polynomials.

3) I haven't looked at the details, but I believe you might simply not be able to apply this code to polynomial rings with inexact base rings.

By default, unless the user calls explicitelly _mul_karatsuba, mutiplication for inexact rings is the classical multiplication algorithm. I have not changed this. Also,if a class (ZZ[x], GF(5)[x], etc.) has its own multiplication code, my patch does not change this.

All that said, while this patch looks very interesting, I think there are a lot of things that need to be sorted out. By the way, I am not quite sure whether any one else is currently reading this ticket, but at least during the next week I should have a lot of time to look at this.

Thanks!, if you have any question on why I changed this or that or have further suggestions/improvements, do not hesitate to ask.

Luis

### comment:8 Changed 2 years ago by jdemeyer

One can actually a speedup of more than a factor 10 by using PARI:

The following function converts a polynomial over a number field to a PARI object:

def nfpol_to_pari(f):
return pari([c._pari_('a') for c in f.list()]).Polrev()

sage: K=QQ[I][x]
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 3.57 s, sys: 0.00 s, total: 3.57 s
Wall time: 3.59 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 5.06 s, sys: 0.05 s, total: 5.11 s
Wall time: 5.12 s
sage: fpari = nfpol_to_pari(f)
sage: gpari = nfpol_to_pari(g)
sage: %time _ = fpari*gpari
CPU times: user 0.26 s, sys: 0.00 s, total: 0.26 s
Wall time: 0.26 s

Since PARI is written completely in C, it's obvious why one could expect such a speedup. Obviously, writing specialized code to deal with number field polynomials would even be better.

### comment:9 Changed 2 years ago by lftabera

Obviously you did not try the computations with my patch, it is still much slower than PARI for small number fields (as I acknowledge in the ticket, PARI is around three times faster for degree two extensions):

sage: K=QQ[I][x]
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 5.94 s, sys: 0.02 s, total: 5.96 s
Wall time: 6.01 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 1.31 s, sys: 0.00 s, total: 1.32 s
Wall time: 1.33 s
sage: K.set_karatsuba_threshold(8)
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 0.97 s, sys: 0.00 s, total: 0.97 s
Wall time: 1.00 s
sage: %time fpari = nfpol_to_pari(f)
CPU times: user 0.24 s, sys: 0.00 s, total: 0.24 s
Wall time: 0.25 s
sage: %time gpari = nfpol_to_pari(g)
CPU times: user 0.24 s, sys: 0.00 s, total: 0.24 s
Wall time: 0.25 s
sage: %time hpari = fpari*gpari
CPU times: user 0.42 s, sys: 0.00 s, total: 0.42 s
Wall time: 0.44 s
sage: %time h = K(hpari)
CPU times: user 0.70 s, sys: 0.02 s, total: 0.72 s
Wall time: 0.73 s

If you take into account a naive transformation from sage data to pari and back, the total time is worse than the mul_karatsuba time. So if a specialiczed class would use PARI, this transformation has to be taken into account.

This pari advantage disapears for bigger number fields

sage: K=CyclotomicField(7)[x]
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 15.33 s, sys: 0.02 s, total: 15.35 s
Wall time: 15.44 s
sage: %time _ = f._mul_karatsuba(g,8)
CPU times: user 12.96 s, sys: 0.08 s, total: 13.03 s
Wall time: 13.10 s
sage: %time fpari = nfpol_to_pari(f)
CPU times: user 0.30 s, sys: 0.00 s, total: 0.30 s
Wall time: 0.32 s
sage: %time gpari = nfpol_to_pari(g)
CPU times: user 0.30 s, sys: 0.00 s, total: 0.31 s
Wall time: 0.32 s
sage: %time hpari = fpari*gpari
CPU times: user 24.26 s, sys: 0.02 s, total: 24.28 s
Wall time: 25.23 s
sage: %time h = K(hpari)
CPU times: user 1.48 s, sys: 0.03 s, total: 1.50 s
Wall time: 1.52 s

Note: for absolute number fields, I am used to set a karatsuba threshold in 8.

### comment:10 follow-up: ↓ 11 Changed 2 years ago by spancratz

Hi Luis,

I'm sorry for the delay in my reply. Also, thank you Jeroen for adding the explicit timings. Let me just go through a couple of things:

• "I feel that the ticket is ready for review"

I don't really agree. It is certainly in a rather useful state, but it requires more people looking at it, using it, and providing feedback.

• For one, as you say, this "is a very basic feature, so we cannot allow correctness bugs here." There should be a *lot* more tests. Rather than having a single test case for illustrative purposes (as is the case in many places in the Sage library), there should be more bulk tests here. You could have that in the Sage framework e.g. by creating a list with f._mul_karatsuba(g) - f*g for many random pairs (f,g), then remove duplicates from the list, and check that the list has length one and only contains 0.
• Your code is still noticeably slower than PARI for the base ring being a number field of small degree. I guess strongly that this will be a much more typical use case than large degree number fields and as such this is not acceptable.
• Another one of your examples is ZZ[x]y?z?t?. I think this is a very bad example. I don't think one should ever expect a user to work with a multivariate polynomial ring in this way just to get some obscure speed-up. Of course, dense and sparse multivariate polynomial rings offer massively different performance characteristic. Perhaps there should be a dense option to MPolynomialRing, but that's not for this ticket. In any case, this example should not be relevant here. Finally, if you really care about dense multivariate polynomials over the integers, I am sure this can be done much, much faster using C or Cython and arrays of mpz_t's for the coefficients. The speed-up of this will be much, much larger than anything you get from nesting a bunch univariate polynomial rings in Python.
• Finally, the statement

"If you take into account a naive transformation from sage data to pari and back, the total time is worse than the mul_karatsuba time. So if a specialiczed class would use PARI, this transformation has to be taken into account."

is not true. In short, the code for polynomials over number fields should then be changed so that the internal representation simply was the PARI object --- no converting back and forth in between the computations whatsoever!

I have looked at the currently only remaining failing test case in QQbar. Unfortunately, I am very puzzled by what could possibly cause this and don't have any help to offer right now.

By the way, I am sorry if this might sound a little harsh in places. Of course, as said, I am still very happy to keep looking at this etc.

Best wishes,

Sebastian

### comment:11 in reply to: ↑ 10 Changed 2 years ago by lftabera

Hi Sebastian,

Hi Luis,

I'm sorry for the delay in my reply. Also, thank you Jeroen for adding the explicit timings. Let me just go through a couple of things:

• "I feel that the ticket is ready for review"

I don't really agree. It is certainly in a rather useful state, but it requires more people looking at it, using it, and providing feedback.

• For one, as you say, this "is a very basic feature, so we cannot allow correctness bugs here." There should be a *lot* more tests. Rather than having a single test case for illustrative purposes (as is the case in many places in the Sage library), there should be more bulk tests here. You could have that in the Sage framework e.g. by creating a list with f._mul_karatsuba(g) - f*g for many random pairs (f,g), then remove duplicates from the list, and check that the list has length one and only contains 0.

This is my fault. Of course I have made much more testing (random polynomials over different rings and ranged degrees), but they are not reflected in the ticket. I will upload the scripts I used for testing.

• Your code is still noticeably slower than PARI for the base ring being a number field of small degree. I guess strongly that this will be a much more typical use case than large degree number fields and as such this is not acceptable.

Ok I agree, but this ticket is not about polynomials for number fields, but multiplication of Polynomial_generic_dense_field elements.

A new class for polynomials over number fields would be good, but that is a different ticket. Also, the claim that PARI is better for small degrees is not so clear to me, at least without some nontrivial work. K.random_element creates polynomials with very small coefficients and no denominator, these polynomials are not the most common user-base case in my opinion. For such polynomials, it seems to me that memory management is the problem, creating and destroying lots of number field elements. If you take not so big coefficients and, specially, if your coefficients do have denominators, then PARI is not better due to flint integers in sage.

Note: computations made on an old laptop.

sage: f=K.random_element(15,10**6) /  ZZ.random_element(10**6)
sage: g=K.random_element(15,10**6) / ZZ.random_element(10**6)
sage: fpari = nfpol_to_pari(f)
sage: gpari = nfpol_to_pari(g)
sage: %timeit _=f._mul_karatsuba(g,8)
125 loops, best of 3: 4.56 ms per loop
sage: %timeit fpari*gpari
25 loops, best of 3: 11 ms per loop
sage: f=K.random_element(1500,10**6)  /  ZZ.random_element(10**6)
sage: g=K.random_element(1500,10**6)  /  ZZ.random_element(10**6)
sage: fpari = nfpol_to_pari(f)
sage: gpari = nfpol_to_pari(g)
sage: %time _=f._mul_karatsuba(g,8)
CPU times: user 3.90 s, sys: 0.05 s, total: 3.95 s
Wall time: 9.30 s
sage: %time _=fpari*gpari
CPU times: user 7.17 s, sys: 0.08 s, total: 7.25 s
Wall time: 18.11 s
• Another one of your examples is ZZ[x]y?z?t?. I think this is a very bad example. I don't think one should ever expect a user to work with a multivariate polynomial ring in this way just to get some obscure speed-up. Of course, dense and sparse multivariate polynomial rings offer massively different performance characteristic. Perhaps there should be a dense option to MPolynomialRing, but that's not for this ticket. In any case, this example should not be relevant here. Finally, if you really care about dense multivariate polynomials over the integers, I am sure this can be done much, much faster using C or Cython and arrays of mpz_t's for the coefficients. The speed-up of this will be much, much larger than anything you get from nesting a bunch univariate polynomial rings in Python.

Ok, this is a bad example, I needed some rings that produced genuine Polynomial_generic_dense_field appart from number fields. I also wanted to compare the algorithm with singular multivariate multiplication which I think it is considered fairly good. Note that ZZ[x][y] are currently arrays of mpz_t elements.

• Finally, the statement

"If you take into account a naive transformation from sage data to pari and back, the total time is worse than the mul_karatsuba time. So if a specialiczed class would use PARI, this transformation has to be taken into account."

is not true. In short, the code for polynomials over number fields should then be changed so that the internal representation simply was the PARI object --- no converting back and forth in between the computations whatsoever!

I have looked at the currently only remaining failing test case in QQbar. Unfortunately, I am very puzzled by what could possibly cause this and don't have any help to offer right now.

QQbar is an exact ring, so it will use karatsuba multiplication. But computations are done inexactly if they are safe. Since the karatsuba code for different degree polynomials have changed, then the inexact results for inexact rings have also changed. I think that this is the reason for the change of the doctest here.

By the way, I am sorry if this might sound a little harsh in places. Of course, as said, I am still very happy to keep looking at this etc.

Luis

### comment:12 Changed 2 years ago by spancratz

Hi Luis,

Ok, this is a bad example, I needed some rings that produced genuine Polynomial_generic_dense_field appart from number fields.

Ok I agree, but this ticket is not about polynomials for number fields, but multiplication of Polynomial_generic_dense_field elements.

Well, which base ring is it that you are interested in?

I don't really think there is all that much point in squeezing a small improvement out of the generic case when often much more is available, even more so when it is not clear that your patch improves the run-time across all base rings.

By the way, perhaps other developers might disagree with me. Perhaps it would be a good idea for you to ask about this on sage-devel?

Of course, these are just words... Here's an example:

sage: R.<x> = QQ[I][]
sage: f = R.random_element(1500)
sage: g = R.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 4.39 s, sys: 0.01 s, total: 4.40 s
Wall time: 4.42 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 5.63 s, sys: 0.04 s, total: 5.67 s
Wall time: 5.73 s
sage: S.<y> = QQ[]
sage: f0 = S([f[i][0] for i in range(f.degree() + 1)])
sage: f1 = S([f[i][1] for i in range(f.degree() + 1)])
sage: g0 = S([g[i][0] for i in range(g.degree() + 1)])
sage: g1 = S([g[i][1] for i in range(g.degree() + 1)])
sage: timeit('f0 * g0 - f1 * g1')
125 loops, best of 3: 2.76 ms per loop
sage: timeit('f1 * g0 + f0 * g1')
125 loops, best of 3: 2.99 ms per loop

Here, the current code takes about 5s, whereas one can easily complete the computation in 6ms. This improvement is much better than the factor of 7 that your patch provides.

By the way, I believe a similar trick works for all polynomials over number fields in the case when the degree of the polynomial is typically larger than the degree of the number field.

In the opposite, when the degree of the number field is typically larger than the degree of the polynomials, I guess one would want to take a different approach.

I also wanted to compare the algorithm with singular multivariate multiplication which I think it is considered fairly good.

Hopefully, my code snippet above convinces you that the current implementation of QQ[I]x? isn't all that good.

Note that ZZ[x][y] are currently arrays of mpz_t elements.

I do not think this is true. ZZx?y? is different from ZZ['x','y']. The latter is implemented as a sparse multivariate polynomial ring. But the former is *not* implemented as a single contiguous array of mpz_t's but as a list of coefficients, where each coefficient (i.e. polynomials in ZZx?) is again implemented (in FLINT, this time) as an array of mpz_t's. That is to say, here you have nested lists, which is not was I meant: I meant a single contiguous array of mpz_t's. I'm not saying one should do that in the generic case, but if one really cares about dense two-variate polynomials over ZZ, this approach should be favourable.

Anyway, all that said, let's be more positive again and look at your actual code:

• In _mul_generic, I think you should move some of the isolated and non-edge cases from TESTS to EXAMPLES. TESTS should then contains more bulk tests as suggested earlier on this ticket.
• In _mul_generic, your calls to list() will likely create new lists of objects. If they are guaranteed to be shallow copies of the elements, that is ok. This needs to be verified. If they are deep copies, however, this should be changed.
• The code in do_karatsuba is not very readable. I think this can easily be improved by removing all the comments from the function body --- there are many more comment lines then actual code lines! --- into a brief descriptive documentation in the function header.

Best wishes,

Sebastian

### Changed 2 years ago by lftabera

tests over some rings

### comment:13 follow-up: ↓ 15 Changed 2 years ago by lftabera

Hi Luis,

Ok, this is a bad example, I needed some rings that produced genuine Polynomial_generic_dense_field appart from number fields.

Ok I agree, but this ticket is not about polynomials for number fields, but multiplication of Polynomial_generic_dense_field elements.

Well, which base ring is it that you are interested in?

The fact that I am interested in polynomials over number fields does not invalidate the fact that the current generic karatsuba method must be changed because it is too inefficient. Do you think that mul_karatsuba is ok when it is slower for degree 1500 polynomials?

I don't really think there is all that much point in squeezing a small improvement out of the generic case when often much more is available, even more so when it is not clear that your patch improves the run-time across all base rings.

It does improve across all rings (or at least is not worse) because:

• It has less function calls, less slicing of lists and less if branches (low impact).
• It does not perform plenty of unnecessary additions of 0 + ModuleElement? (high impact over some base rings, like number fields or orders of number fields).
• User has control over the karatsuba threshold, this can speed up things in some cases. By default the threshold is the same as the previous implementation. So this is not worse than the previous method (quite ring dependent).
• The case of different sized polynomials is inefficient in the current implementation (high impact over every ring).

This last reason is very important. For instance, take a pair of polynomials over ZZ. One of degree twice the other. _mul_karatsuba is slower than _mul_generic independently of the degree. This is a bug in my opinion.

sage: K=ZZ[x]
sage: f=K.random_element(4000)
sage: g=K.random_element(2000)
sage: h=K.random_element(1500)
sage: %time _= f._mul_generic(g)
CPU times: user 0.90 s, sys: 0.00 s, total: 0.90 s
Wall time: 0.90 s
sage: %time _= f._mul_karatsuba(g)
CPU times: user 2.65 s, sys: 0.00 s, total: 2.66 s
Wall time: 2.66 s
sage: %time _= f._mul_karatsuba(h)
CPU times: user 3.41 s, sys: 0.00 s, total: 3.41 s
Wall time: 3.42 s

With patch:

sage: K=ZZ[x]
sage: f=K.random_element(4000)
sage: g=K.random_element(2000)
sage: h=K.random_element(1500)
sage: %time _=f._mul_generic(g)
CPU times: user 0.90 s, sys: 0.00 s, total: 0.90 s
Wall time: 0.91 s
sage: %time _=f._mul_karatsuba(g)
CPU times: user 0.24 s, sys: 0.00 s, total: 0.24 s
Wall time: 0.24 s
sage: %time _=f._mul_karatsuba(h)
CPU times: user 0.25 s, sys: 0.00 s, total: 0.25 s
Wall time: 0.25 s

I do not care if FLINT is lightning fast here. This example is just a symptom that _mul_karatsuba is not doing its job compared with _mul_generic. Moreover, multiplying a polynomial of degree 4000 and other of degree 1500 is much slower than multiplying a polynomial of degree 4000 and other of degree 2000. With the patch they cost around the same time, not something I am confortable with.

More examples for more rings, taking two polynomials. One of degree 27 and the other 31. The first line is the time taking with my patch, second line with SAGE 4.6.

ZZ[x]
625 loops, best of 3: 195 µs per loop
625 loops, best of 3: 435 µs per loop

ZZ[I][x]
625 loops, best of 3: 1.05 ms per loop
5 loops, best of 3: 219 ms per loop

ZZ[I,sqrt(2)][x]
25 loops, best of 3: 10.3 ms per loop
5 loops, best of 3: 317 ms per loop

QQ[x]
625 loops, best of 3: 1 ms per loop
125 loops, best of 3: 1.93 ms per loop

Integers(6)[x]
625 loops, best of 3: 263 µs per loop
625 loops, best of 3: 890 µs per loop

ZZ[x]['y']
625 loops, best of 3: 263 µs per loop
125 loops, best of 3: 5.06 ms per loop

QuaternionAlgebra(1,1)[x]
125 loops, best of 3: 2.36 ms per loop
125 loops, best of 3: 4.43 ms per loop

GF(2)[x].fraction_field()[y]
5 loops, best of 3: 44 ms per loop
5 loops, best of 3: 70.6 ms per loop

QQ[x].fraction_field()['y']
5 loops, best of 3: 229 ms per loop
5 loops, best of 3: 290 ms per loop

GF(5)[x].quotient_ring(x^5-x)[x]
5 loops, best of 3: 53.5 ms per loop
5 loops, best of 3: 80.6 ms per loop

SR[x]
625 loops, best of 3: 1.26 ms per loop
125 loops, best of 3: 3.95 ms per loop

QQ['x,y,z']['t']
25 loops, best of 3: 25.7 ms per loop
25 loops, best of 3: 34.1 ms per loop

GF(7^2,'a')[x] #Here the time is virtually the same due to the cost of creating the lists
5 loops, best of 3: 68.8 ms per loop
5 loops, best of 3: 71.3 ms per loop
• x200 faster for orders of number fields
• x30 for relative orders
• x2 faster for quaternion algebras
• x2 for the symbolic ring

I do not think that this is a small improvement.

By the way, perhaps other developers might disagree with me. Perhaps it would be a good idea for you to ask about this on sage-devel?

I'll do

Of course, these are just words... Here's an example:

sage: R.<x> = QQ[I][]
sage: f = R.random_element(1500)
sage: g = R.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 4.39 s, sys: 0.01 s, total: 4.40 s
Wall time: 4.42 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 5.63 s, sys: 0.04 s, total: 5.67 s
Wall time: 5.73 s
sage: S.<y> = QQ[]
sage: f0 = S([f[i][0] for i in range(f.degree() + 1)])
sage: f1 = S([f[i][1] for i in range(f.degree() + 1)])
sage: g0 = S([g[i][0] for i in range(g.degree() + 1)])
sage: g1 = S([g[i][1] for i in range(g.degree() + 1)])
sage: timeit('f0 * g0 - f1 * g1')
125 loops, best of 3: 2.76 ms per loop
sage: timeit('f1 * g0 + f0 * g1')
125 loops, best of 3: 2.99 ms per loop

Here, the current code takes about 5s, whereas one can easily complete the computation in 6ms. This improvement is much better than the factor of 7 that your patch provides.

Ok, but you are disregarding all other cases.

Anyway, all that said, let's be more positive again and look at your actual code:

• In _mul_generic, I think you should move some of the isolated and non-edge cases from TESTS to EXAMPLES. TESTS should then contains more bulk tests as suggested earlier on this ticket.
• In _mul_generic, your calls to list() will likely create new lists of objects. If they are guaranteed to be shallow copies of the elements, that is ok. This needs to be verified. If they are deep copies, however, this should be changed.

I have not really changed _mul_generic, only moved the code to avoid duplication of code. Calling list is already in the original code. list is also used for addition by the way. But I realize something, I have to take care of inline operations in the code. If base ring elements are inmutable (as I think they are expected to) this will just call non-inlined operations. If they are mutable, multiplication of polynomials will change the input polynomials, which is wrong. I will change this.

• The code in do_karatsuba is not very readable. I think this can easily be improved by removing all the comments from the function body --- there are many more comment lines then actual code lines! --- into a brief descriptive documentation in the function header.

I will take a look,

Best regards,

Luis

### comment:14 Changed 2 years ago by lftabera

For the number field specific case, see #10591

### comment:15 in reply to: ↑ 13 Changed 2 years ago by spancratz

Hi Luis,

I am terribly sorry for not writing on this thread again any earlier; after returning to the UK following the Sage bug days in Seattle, I was busy catching up on some other things. Anyway...

Ok I agree, but this ticket is not about polynomials for number fields, but multiplication of Polynomial_generic_dense_field elements.

Well, which base ring is it that you are interested in?

The fact that I am interested in polynomials over number fields does not invalidate the fact that the current generic karatsuba method must be changed because it is too inefficient. Do you think that mul_karatsuba is ok when it is slower for degree 1500 polynomials?

Of course, it is great that (via this patch!) there might soon be some code which speeds up the generic implementation. But to be honest, personally I don't think the generic implementation is all that important; after all, it is only a fall-back option used if no-one has brought up enough energy to implement something more specific.

I do not care if FLINT is lightning fast here.

I feel this is a bit of a shame. Like rjf mentioned on sage-devel for many base rings you will be able to come up with something much, much faster than this small (by comparison!) improvement.

This example is just a symptom that _mul_karatsuba is not doing its job compared with _mul_generic. Moreover, multiplying a polynomial of degree 4000 and other of degree 1500 is much slower than multiplying a polynomial of degree 4000 and other of degree 2000.

Of course, you are right, this is ridiculous. Implicit zero-padding (or even explicit, if you assume only references are stored) should be nearly for free, so a 4000 by 1500 product shouldn't cost any noticeably more than a 4000 by 2000 product. (Of course, ideally it should cost less!)

• x200 faster for orders of number fields
• x30 for relative orders
• x2 faster for quaternion algebras
• x2 for the symbolic ring

I do not think that this is a small improvement.

Obviously, we can agree to disagree here. In absolute terms, all of the above improvements are fantastic! In relative terms, looking at what would be possible just by writing base ring specific code, however, I am confident you could do better.

Of course, these are just words... Here's an example:

sage: R.<x> = QQ[I][]
sage: f = R.random_element(1500)
sage: g = R.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 4.39 s, sys: 0.01 s, total: 4.40 s
Wall time: 4.42 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 5.63 s, sys: 0.04 s, total: 5.67 s
Wall time: 5.73 s
sage: S.<y> = QQ[]
sage: f0 = S([f[i][0] for i in range(f.degree() + 1)])
sage: f1 = S([f[i][1] for i in range(f.degree() + 1)])
sage: g0 = S([g[i][0] for i in range(g.degree() + 1)])
sage: g1 = S([g[i][1] for i in range(g.degree() + 1)])
sage: timeit('f0 * g0 - f1 * g1')
125 loops, best of 3: 2.76 ms per loop
sage: timeit('f1 * g0 + f0 * g1')
125 loops, best of 3: 2.99 ms per loop

Ok, but you are disregarding all other cases.

Obviously, and deliberately so. But please let me know which ring you actually, mathematically care about. I would not be surprised if we could come up with something noticeably faster!

By the way, I am not sure whether anyone else is looking at this bit of code. While I can't promise how much spare time I will have in the next couple of weeks, I am happy to keep looking at this, run tests, give comments etc.

Best wishes,

Sebastian

### comment:16 Changed 2 years ago by lftabera

Hi Sebastian,

I think that all is said, I understand your point of view and appreciate the interest you are showing in this ticket. But I also think that this ticket is necessary and that current behavior for different sized polynomials is a bug.

I am not currently working a lot on Sage, since I am overloaded with pre-exams and administrative duties, but I will try to finish with the cleaning of the code in the next two weeks.

Bests

### comment:17 Changed 2 years ago by lftabera

just for the record, I attach two images showing some numbers.

comparison_product_50_400.png Shows the number of products made to multiply a fixed polynomial of degree 49 and another polynomial of degree from 0 to 399.

• In yellow using algorithm _mul_generic
• In red using current _mul_karatsuba as of 4.6.1
• in blue with this patch

We can appreciate that for a big difference in the degrees of the polynomials, current _mul_karatsuba needs more additions and more products than _mul_generic.

### comment:18 Changed 2 years ago by lftabera

• Status changed from new to needs_review
• Authors set to Luis Felipe Tabera Alonso

Apply: trac_karatsuba_improvements.2.patch

### comment:19 Changed 21 months ago by lftabera

• Description modified (diff)

rebase against 4.7.1

### comment:20 Changed 3 months ago by lftabera

• Type changed from enhancement to defect
• Description modified (diff)
• Summary changed from improve karatsuba multiplication of univariate polynomials to Correct karatsuba multiplication of univariate polynomials for different degree polynomials

Reflect better the nature of the problem in the description.

rebase 5.8.beta2

### comment:21 Changed 3 months ago by lftabera

• Description modified (diff)

Apply: trac_10255_karatsuba_improbement.patch

Note: See TracTickets for help on using tickets.