Opened 11 years ago

Closed 7 years ago

#10255 closed defect (fixed)

Correct karatsuba multiplication of univariate polynomials for different degree polynomials

Reported by: lftabera Owned by: AlexGhitza
Priority: major Milestone: sage-6.1
Component: basic arithmetic Keywords: karatsuba, multiplication, polynomial
Cc: Merged in:
Authors: Luis Felipe Tabera Alonso, Marc Mezzarobba Reviewers: Marc Mezzarobba, Luis Felipe Tabera Alonso
Report Upstream: N/A Work issues:
Branch: u/mmezzarobba/10255-karatsuba Commit: 597747007214bdb661d112296f54a77989aa2a01
Dependencies: Stopgaps:

Status badges

Description (last modified by lftabera)

In the generic implementation of univariate polynomials over exact rings, plain karatsuba is used.

However, for different degree polynomials, the karatsuba code makes more products and additions than the classical multiplication code. Moreover, for equally sized polynomials, the degree in which karatsuba starts to be better than plain multiplication looks too high for me.

See attachments comparison_product_50_400.png and comparison_addition_50_400.png for the number of operations multiplying degree 50 and 400 polynomials, in yellow, the number of operations using _mul_generic, in red using current _mul_karatsuba as of 4.6.1 and in blue with the patch proposed.

sage: K=QQ[I][x]   
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 6.03 s, sys: 0.08 s, total: 6.11 s
Wall time: 6.43 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 6.95 s, sys: 0.06 s, total: 7.01 s
Wall time: 7.76 s

See comment:13 for some benchmarks

Apply: trac_10255_karatsuba_improbement.patch

Attachments (6)

test_karatsuba.py (1.9 KB) - added by lftabera 10 years ago.
tests over some rings
comparison_product_50_400.png (20.0 KB) - added by lftabera 10 years ago.
comparison_addition_50_400.png (22.1 KB) - added by lftabera 10 years ago.
trac_karatsuba_improvements.patch (24.4 KB) - added by lftabera 10 years ago.
trac_karatsuba_improvements.2.patch (25.9 KB) - added by lftabera 10 years ago.
trac_10255_karatsuba_improbement.patch (30.4 KB) - added by lftabera 7 years ago.
rebase 5.13

Download all attachments as: .zip

Change History (36)

comment:1 Changed 11 years ago by lftabera

Same example with my patch. Pure karatsuba:

sage: K=QQ[I][x]   
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 1.53 s, sys: 0.01 s, total: 1.54 s
Wall time: 1.73 s

TODO:

-Take a more careful look to the correctness of code. -Check that the changes do not introduce performance penalties on a wider set of rings, degrees and size of polynomials. -Introduce a user-definable threshold (with a sensible default for most common rings) for which karatsuba falls back to classic multiplication.

comment:2 Changed 10 years ago by lftabera

the code is too naive for different sized polynomials. Even with my improvements I get this:

Two degree 849 polynomials, no threshold:
274750 additions
87729 products

One degree 849 polynomial and a 449 polynomial:
472812 additions
218649 products

Experimental data

sage: f=ZZ[x].random_element(849)
sage: g=ZZ[x].random_element(849)
sage: %timeit f._mul_karatsuba(g)
5 loops, best of 3: 95.9 ms per loop
sage: f=ZZ[x].random_element(849)
sage: g=ZZ[x].random_element(449)
sage: %timeit f._mul_karatsuba(g)
5 loops, best of 3: 218 ms per loop

The shorter product is more expensive!

I have a patch right now to solve this, have to clean it up though.

comment:3 Changed 10 years ago by lftabera

Further improvements for polynomials of of different sizes:

if f is of degree m-1 and g is of degree n-1 with m<n, consider g as:

g0 + x^m*g1 + x^(2*m)g2 + ... + x^(q*m)*gq

Compute the products f*gi with karatsuba and reconstruct f*g

With this strategy, for polynomials of degree 849 and 449 I get:

214507 additions
39942 products

Far better from previous method. Anyway this is not optimal and it is hard to messure the recursion overhead. I have taken this strategy from old versions of gmp documentation.

Some timmings (Debian 64bits). All polyomials have been generated with random_element without further parameters. Pure karatsuba without threshold for better comparison with old code. With an appropriate threshold the timming whould slightly better.

Note that for rigns like QQ[x] and GF(2)[x] sage uses faster special libraries (FLINT, NTL and the like).

Degree f:1000,g:1000

QQ[x] 
old code: 828 ms
new code: 498 ms
GF(2)[x] 
old code: 284 ms
new code: 99.1 ms
QQ[sqrt(2)][x]
old code: 2.61 s
new code: 480 ms # 5 times faster
ZZ[x]['y']['z']['t']
old code: 86.95 s
new code: 71.10 s

degree f:849, g:449

QQ[x] 
old code: 988 ms
new code: 343 ms
GF(2)[x] 
old code: 148 ms
new code: 67.6 ms
QQ[sqrt(2)][x]
old code: 1.93 s
new code: 313 ms #6 times faster
ZZ[x]['y']['z']['t']
old code: 342.16 s
new code: 48.11 s # 7 times faster

degree 3, 5

QQ[x] 
old code: 135 µs
new code: 83.6 µs
GF(2)[x] 
old code: 137 µs
new code: 100 µs
QQ[sqrt(2)][x]
old code: 323 µs
new code: 86.4 µs
ZZ[x]['y']['z']['t']
old code: 19.5 ms
new code: 13.2 ms 

TODO:

  • Sensible default threshold.
  • Documentation cleanup. (almost done)
  • More testing for correctness rather than speed. This is a very basic feature, so we cannot allow correctness bugs here.

comment:4 Changed 10 years ago by lftabera

All doctest passes and I am trying to make doctest independent of future changes of the default threshold. But I get the following difference that puzzles me.

sage -t -long -force_lib "devel/sage/sage/rings/qqbar.py"   
**********************************************************************
File "/home/rosamari/sage/devel/sage/sage/rings/qqbar.py", line 373:
    sage: sage_input(one, verify=True)
Expected:
    # Verified
    R.<x> = QQbar[]
    v1 = AA(2)
    v2 = QQbar(sqrt(v1))
    v3 = QQbar(3)
    v4 = sqrt(v3)
    v5 = v2*v4
    v6 = (1 - v2)*(1 - v4) - 1 - v5
    v7 = QQbar(sqrt(v1))
    v8 = sqrt(v3)
    si1 = v7*v8
    cp = AA.common_polynomial(x^2 + ((1 - v7)*(1 + v8) - 1 + si1)*x - si1)
    v9 = QQbar.polynomial_root(cp, RIF(-RR(1.7320508075688774), -RR(1.7320508075688772)))
    v10 = 1 - v9
    v11 = v6 + (v10 - 1)
    v12 = -1 - v4 - QQbar.polynomial_root(cp, RIF(-RR(1.7320508075688774), -RR(1.7320508075688772)))
    v13 = 1 + v12
    v14 = v10*(v6 + v5) - (v6 - v5*v9)
    si2 = v5*v9
    AA.polynomial_root(AA.common_polynomial(x^4 + (v11 + (v13 - 1))*x^3 + (v14 + (v13*v11 - v11))*x^2 + (v13*(v14 - si2) - (v14 - si2*v12))*x - si2*v12), RIF(RR(0.99999999999999989), RR(1.0000000000000002)))
Got:
    # Verified
    R.<x> = QQbar[]
    v1 = QQbar(3)
    v2 = sqrt(v1)
    v3 = AA(2)
    v4 = QQbar(sqrt(v3))
    v5 = sqrt(v1)
    si1 = v4*v5
    cp = AA.common_polynomial(x^2 + ((1 - v4)*(1 + v5) - 1 + si1)*x - si1)
    v6 = -1 - v2 - QQbar.polynomial_root(cp, RIF(-RR(1.7320508075688774), -RR(1.7320508075688772)))
    v7 = QQbar.polynomial_root(cp, RIF(-RR(1.7320508075688774), -RR(1.7320508075688772)))
    v8 = QQbar(sqrt(v3))
    v9 = v8*v2
    v10 = (1 - v8)*(1 - v2) - 1 - v9
    v11 = -v7 + v10
    v12 = v6*v11
    si2 = v7*v9
    v13 = (1 - v7)*(v10 + v9) - v10 + si2
    si3 = v6*si2
    AA.polynomial_root(AA.common_polynomial(x^4 + ((1 + v6)*(1 + v11) - 1 - v12)*x^3 + (v12 + v13)*x^2 + ((1 + v6)*(v13 - si2) - v13 + si3)*x - si3), RIF(RR(0.99999999999999989), RR(1.0000000000000002)))
**********************************************************************
1 items had failures:
   1 of 152 in __main__.example_0
***Test Failed*** 1 failures.

comment:5 Changed 10 years ago by lftabera

I have eliminated karatsuba_sumand karatsuba_diff so there are less function calls and less if branches in the coding (we know an can predict at each step the length of every list.

I have also eliminated most slicing of lists so less objects are creating. The recursion does not operates over slices but over the orginial list plus some pointers to the slice we are working on.

With this improvements the code is about 10% faster. I also think that improves readability.

For number fields of degree 2 and 3 the code is about three times slower than pari. For number fields of degree >=6 sage seems faster than pari.

For dense bivariate polynomials in QQ[x][y] the multiplication starts to be faster than in QQ[x,y] for degree 16, for dense polynomials of degree 100 QQ[x][y] is 45x faster than QQ[x,y]. Of course, for non-dense polynomials multivariate multiplication is much much faster.

Some doctests added to compare karatsuba with generic multiplication and doctest for the non-commutative ring case.

Still has the QQbar fail.

comment:6 follow-up: Changed 10 years ago by spancratz

Hello,

First of all, I want to say that it's great that you are looking at this part of the Sage library and are trying to improve it!

From reading your ticket progress, it seems that perhaps you might benefit from some help with this. Have you considered advertising this problem on the developers' list at sage-devel? Perhaps someone else is interested in this, too.

In any case, here are a couple of comments:

1) What is your main motivation for wanting to improve this code? If you want to achieve faster computations in polynomial rings over specific rings, QQ[i], you should most definitely either write some Cython code or even a separate C library. You should *not* be tinkering with the generic Python multiplication code. To keep with the example, if you are interested in QQ[i], then (just as a first idea) you should represent your polynomials as sums of two polynomials internally, one for the real part and one for the imaginary code. This approach also applies to other number fields of small degree.

2) Stating what you definitely already know: Karatsuba multiplication trades additions and multiplications in the base ring. Thus, the threshold highly depends on the relative speeds of addition and multiplication, and it will be very difficult for you to make a default choice that is useful in general. Even worse, what about base rings where multiplication is faster than addition?

3) I haven't looked at the details, but I believe you might simply not be able to apply this code to polynomial rings with inexact base rings.

All that said, while this patch looks very interesting, I think there are a lot of things that need to be sorted out. By the way, I am not quite sure whether any one else is currently reading this ticket, but at least during the next week I should have a lot of time to look at this.

Best wishes,

Sebastian

comment:7 in reply to: ↑ 6 Changed 10 years ago by lftabera

Sebastian,

Thanks for looking at this. I feel that the ticket is ready for review, but there is a doctest fail and I wanted to understand why. My impression is that the doctest should be changed, but I did not have enough time to dig it.

Replying to spancratz:

1) What is your main motivation for wanting to improve this code? If you want to achieve faster computations in polynomial rings over specific rings, QQ[i], you should most definitely either write some Cython code or even a separate C library. You should *not* be tinkering with the generic Python multiplication code. To keep with the example, if you are interested in QQ[i], then (just as a first idea) you should represent your polynomials as sums of two polynomials internally, one for the real part and one for the imaginary code. This approach also applies to other number fields of small degree.

My primary interest was univariate polynomial rings over number fields of degree around 30-40. I was considering to use the specific class I have added in #8558, because, for number field base rings, the slowness comes really from adding number field elements by python 0. But then I realised that the karatsuba multiplication is not well enough implemented if both polynomials have different degree, so I changed the cython code improving here and there.

2) Stating what you definitely already know: Karatsuba multiplication trades additions and multiplications in the base ring. Thus, the threshold highly depends on the relative speeds of addition and multiplication, and it will be very difficult for you to make a default choice that is useful in general. Even worse, what about base rings where multiplication is faster than addition?

That is why the default threshold is zero. No classical multiplication at all. But I have added a ring-dependent parameter to change this value. Univariate polynomial rings have now a method set_karatsuba_threshold to deal with this issue. Rings in which addition is faster than multiplication also benefits from karatsuba, since the number of additions is also subquadratic in the degree of the polynomials.

3) I haven't looked at the details, but I believe you might simply not be able to apply this code to polynomial rings with inexact base rings.

By default, unless the user calls explicitelly _mul_karatsuba, mutiplication for inexact rings is the classical multiplication algorithm. I have not changed this. Also,if a class (ZZ[x], GF(5)[x], etc.) has its own multiplication code, my patch does not change this.

All that said, while this patch looks very interesting, I think there are a lot of things that need to be sorted out. By the way, I am not quite sure whether any one else is currently reading this ticket, but at least during the next week I should have a lot of time to look at this.

Thanks!, if you have any question on why I changed this or that or have further suggestions/improvements, do not hesitate to ask.

Luis

comment:8 Changed 10 years ago by jdemeyer

One can actually a speedup of more than a factor 10 by using PARI:

The following function converts a polynomial over a number field to a PARI object:

def nfpol_to_pari(f):
    return pari([c._pari_('a') for c in f.list()]).Polrev()

Now let's try your example:

sage: K=QQ[I][x]   
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 3.57 s, sys: 0.00 s, total: 3.57 s
Wall time: 3.59 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 5.06 s, sys: 0.05 s, total: 5.11 s
Wall time: 5.12 s
sage: fpari = nfpol_to_pari(f)
sage: gpari = nfpol_to_pari(g)
sage: %time _ = fpari*gpari
CPU times: user 0.26 s, sys: 0.00 s, total: 0.26 s
Wall time: 0.26 s

Since PARI is written completely in C, it's obvious why one could expect such a speedup. Obviously, writing specialized code to deal with number field polynomials would even be better.

comment:9 Changed 10 years ago by lftabera

Obviously you did not try the computations with my patch, it is still much slower than PARI for small number fields (as I acknowledge in the ticket, PARI is around three times faster for degree two extensions):

sage: K=QQ[I][x]
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 5.94 s, sys: 0.02 s, total: 5.96 s
Wall time: 6.01 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 1.31 s, sys: 0.00 s, total: 1.32 s
Wall time: 1.33 s
sage: K.set_karatsuba_threshold(8)
sage: %time _ = f._mul_karatsuba(g)  
CPU times: user 0.97 s, sys: 0.00 s, total: 0.97 s
Wall time: 1.00 s
sage: %time fpari = nfpol_to_pari(f)
CPU times: user 0.24 s, sys: 0.00 s, total: 0.24 s
Wall time: 0.25 s
sage: %time gpari = nfpol_to_pari(g)
CPU times: user 0.24 s, sys: 0.00 s, total: 0.24 s
Wall time: 0.25 s
sage: %time hpari = fpari*gpari
CPU times: user 0.42 s, sys: 0.00 s, total: 0.42 s
Wall time: 0.44 s
sage: %time h = K(hpari)       
CPU times: user 0.70 s, sys: 0.02 s, total: 0.72 s
Wall time: 0.73 s

If you take into account a naive transformation from sage data to pari and back, the total time is worse than the mul_karatsuba time. So if a specialiczed class would use PARI, this transformation has to be taken into account.

This pari advantage disapears for bigger number fields

sage: K=CyclotomicField(7)[x]
sage: f=K.random_element(1500)
sage: g=K.random_element(1500)
sage: %time _ = f._mul_karatsuba(g) 
CPU times: user 15.33 s, sys: 0.02 s, total: 15.35 s
Wall time: 15.44 s
sage: %time _ = f._mul_karatsuba(g,8)
CPU times: user 12.96 s, sys: 0.08 s, total: 13.03 s
Wall time: 13.10 s
sage: %time fpari = nfpol_to_pari(f)
CPU times: user 0.30 s, sys: 0.00 s, total: 0.30 s
Wall time: 0.32 s
sage: %time gpari = nfpol_to_pari(g) 
CPU times: user 0.30 s, sys: 0.00 s, total: 0.31 s
Wall time: 0.32 s
sage: %time hpari = fpari*gpari      
CPU times: user 24.26 s, sys: 0.02 s, total: 24.28 s
Wall time: 25.23 s
sage: %time h = K(hpari)             
CPU times: user 1.48 s, sys: 0.03 s, total: 1.50 s
Wall time: 1.52 s

Note: for absolute number fields, I am used to set a karatsuba threshold in 8.

comment:10 follow-up: Changed 10 years ago by spancratz

Hi Luis,

I'm sorry for the delay in my reply. Also, thank you Jeroen for adding the explicit timings. Let me just go through a couple of things:

  • "I feel that the ticket is ready for review"

I don't really agree. It is certainly in a rather useful state, but it requires more people looking at it, using it, and providing feedback.

  • For one, as you say, this "is a very basic feature, so we cannot allow correctness bugs here." There should be a *lot* more tests. Rather than having a single test case for illustrative purposes (as is the case in many places in the Sage library), there should be more bulk tests here. You could have that in the Sage framework e.g. by creating a list with f._mul_karatsuba(g) - f*g for many random pairs (f,g), then remove duplicates from the list, and check that the list has length one and only contains 0.
  • Your code is still noticeably slower than PARI for the base ring being a number field of small degree. I guess strongly that this will be a much more typical use case than large degree number fields and as such this is not acceptable.
  • Another one of your examples is ZZ[x]y?z?t?. I think this is a very bad example. I don't think one should ever expect a user to work with a multivariate polynomial ring in this way just to get some obscure speed-up. Of course, dense and sparse multivariate polynomial rings offer massively different performance characteristic. Perhaps there should be a dense option to MPolynomialRing, but that's not for this ticket. In any case, this example should not be relevant here. Finally, if you really care about dense multivariate polynomials over the integers, I am sure this can be done much, much faster using C or Cython and arrays of mpz_t's for the coefficients. The speed-up of this will be much, much larger than anything you get from nesting a bunch univariate polynomial rings in Python.
  • Finally, the statement

"If you take into account a naive transformation from sage data to pari and back, the total time is worse than the mul_karatsuba time. So if a specialiczed class would use PARI, this transformation has to be taken into account."

is not true. In short, the code for polynomials over number fields should then be changed so that the internal representation simply was the PARI object --- no converting back and forth in between the computations whatsoever!

I have looked at the currently only remaining failing test case in QQbar. Unfortunately, I am very puzzled by what could possibly cause this and don't have any help to offer right now.

By the way, I am sorry if this might sound a little harsh in places. Of course, as said, I am still very happy to keep looking at this etc.

Best wishes,

Sebastian

comment:11 in reply to: ↑ 10 Changed 10 years ago by lftabera

Hi Sebastian,

Replying to spancratz:

Hi Luis,

I'm sorry for the delay in my reply. Also, thank you Jeroen for adding the explicit timings. Let me just go through a couple of things:

  • "I feel that the ticket is ready for review"

I don't really agree. It is certainly in a rather useful state, but it requires more people looking at it, using it, and providing feedback.

  • For one, as you say, this "is a very basic feature, so we cannot allow correctness bugs here." There should be a *lot* more tests. Rather than having a single test case for illustrative purposes (as is the case in many places in the Sage library), there should be more bulk tests here. You could have that in the Sage framework e.g. by creating a list with f._mul_karatsuba(g) - f*g for many random pairs (f,g), then remove duplicates from the list, and check that the list has length one and only contains 0.

This is my fault. Of course I have made much more testing (random polynomials over different rings and ranged degrees), but they are not reflected in the ticket. I will upload the scripts I used for testing.

  • Your code is still noticeably slower than PARI for the base ring being a number field of small degree. I guess strongly that this will be a much more typical use case than large degree number fields and as such this is not acceptable.

Ok I agree, but this ticket is not about polynomials for number fields, but multiplication of Polynomial_generic_dense_field elements.

A new class for polynomials over number fields would be good, but that is a different ticket. Also, the claim that PARI is better for small degrees is not so clear to me, at least without some nontrivial work. K.random_element creates polynomials with very small coefficients and no denominator, these polynomials are not the most common user-base case in my opinion. For such polynomials, it seems to me that memory management is the problem, creating and destroying lots of number field elements. If you take not so big coefficients and, specially, if your coefficients do have denominators, then PARI is not better due to flint integers in sage.

Note: computations made on an old laptop.

sage: f=K.random_element(15,10**6) /  ZZ.random_element(10**6)  
sage: g=K.random_element(15,10**6) / ZZ.random_element(10**6)
sage: fpari = nfpol_to_pari(f)                              
sage: gpari = nfpol_to_pari(g)                                
sage: %timeit _=f._mul_karatsuba(g,8)
125 loops, best of 3: 4.56 ms per loop
sage: %timeit fpari*gpari            
25 loops, best of 3: 11 ms per loop
sage: f=K.random_element(1500,10**6)  /  ZZ.random_element(10**6)      
sage: g=K.random_element(1500,10**6)  /  ZZ.random_element(10**6)
sage: fpari = nfpol_to_pari(f)
sage: gpari = nfpol_to_pari(g)
sage: %time _=f._mul_karatsuba(g,8)  
CPU times: user 3.90 s, sys: 0.05 s, total: 3.95 s
Wall time: 9.30 s
sage: %time _=fpari*gpari            
CPU times: user 7.17 s, sys: 0.08 s, total: 7.25 s
Wall time: 18.11 s
  • Another one of your examples is ZZ[x]y?z?t?. I think this is a very bad example. I don't think one should ever expect a user to work with a multivariate polynomial ring in this way just to get some obscure speed-up. Of course, dense and sparse multivariate polynomial rings offer massively different performance characteristic. Perhaps there should be a dense option to MPolynomialRing, but that's not for this ticket. In any case, this example should not be relevant here. Finally, if you really care about dense multivariate polynomials over the integers, I am sure this can be done much, much faster using C or Cython and arrays of mpz_t's for the coefficients. The speed-up of this will be much, much larger than anything you get from nesting a bunch univariate polynomial rings in Python.

Ok, this is a bad example, I needed some rings that produced genuine Polynomial_generic_dense_field appart from number fields. I also wanted to compare the algorithm with singular multivariate multiplication which I think it is considered fairly good. Note that ZZ[x][y] are currently arrays of mpz_t elements.

  • Finally, the statement

"If you take into account a naive transformation from sage data to pari and back, the total time is worse than the mul_karatsuba time. So if a specialiczed class would use PARI, this transformation has to be taken into account."

is not true. In short, the code for polynomials over number fields should then be changed so that the internal representation simply was the PARI object --- no converting back and forth in between the computations whatsoever!

Answered above.

I have looked at the currently only remaining failing test case in QQbar. Unfortunately, I am very puzzled by what could possibly cause this and don't have any help to offer right now.

QQbar is an exact ring, so it will use karatsuba multiplication. But computations are done inexactly if they are safe. Since the karatsuba code for different degree polynomials have changed, then the inexact results for inexact rings have also changed. I think that this is the reason for the change of the doctest here.

By the way, I am sorry if this might sound a little harsh in places. Of course, as said, I am still very happy to keep looking at this etc.

Do not worry, certainly your comments and Jeroen's are really welcome.

Luis

comment:12 Changed 10 years ago by spancratz

Hi Luis,

Ok, this is a bad example, I needed some rings that produced genuine Polynomial_generic_dense_field appart from number fields.

Ok I agree, but this ticket is not about polynomials for number fields, but multiplication of Polynomial_generic_dense_field elements.

Well, which base ring is it that you are interested in?

I don't really think there is all that much point in squeezing a small improvement out of the generic case when often much more is available, even more so when it is not clear that your patch improves the run-time across all base rings.

By the way, perhaps other developers might disagree with me. Perhaps it would be a good idea for you to ask about this on sage-devel?

Of course, these are just words... Here's an example:

sage: R.<x> = QQ[I][]
sage: f = R.random_element(1500)
sage: g = R.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 4.39 s, sys: 0.01 s, total: 4.40 s
Wall time: 4.42 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 5.63 s, sys: 0.04 s, total: 5.67 s
Wall time: 5.73 s
sage: S.<y> = QQ[]
sage: f0 = S([f[i][0] for i in range(f.degree() + 1)])
sage: f1 = S([f[i][1] for i in range(f.degree() + 1)])
sage: g0 = S([g[i][0] for i in range(g.degree() + 1)])
sage: g1 = S([g[i][1] for i in range(g.degree() + 1)])
sage: timeit('f0 * g0 - f1 * g1')
125 loops, best of 3: 2.76 ms per loop
sage: timeit('f1 * g0 + f0 * g1')
125 loops, best of 3: 2.99 ms per loop

Here, the current code takes about 5s, whereas one can easily complete the computation in 6ms. This improvement is much better than the factor of 7 that your patch provides.

By the way, I believe a similar trick works for all polynomials over number fields in the case when the degree of the polynomial is typically larger than the degree of the number field.

In the opposite, when the degree of the number field is typically larger than the degree of the polynomials, I guess one would want to take a different approach.

I also wanted to compare the algorithm with singular multivariate multiplication which I think it is considered fairly good.

Hopefully, my code snippet above convinces you that the current implementation of QQ[I]x? isn't all that good.

Note that ZZ[x][y] are currently arrays of mpz_t elements.

I do not think this is true. ZZx?y? is different from ZZ['x','y']. The latter is implemented as a sparse multivariate polynomial ring. But the former is *not* implemented as a single contiguous array of mpz_t's but as a list of coefficients, where each coefficient (i.e. polynomials in ZZx?) is again implemented (in FLINT, this time) as an array of mpz_t's. That is to say, here you have nested lists, which is not was I meant: I meant a single contiguous array of mpz_t's. I'm not saying one should do that in the generic case, but if one really cares about dense two-variate polynomials over ZZ, this approach should be favourable.

Anyway, all that said, let's be more positive again and look at your actual code:

  • In _mul_generic, I think you should move some of the isolated and non-edge cases from TESTS to EXAMPLES. TESTS should then contains more bulk tests as suggested earlier on this ticket.
  • In _mul_generic, your calls to list() will likely create new lists of objects. If they are guaranteed to be shallow copies of the elements, that is ok. This needs to be verified. If they are deep copies, however, this should be changed.
  • The code in do_karatsuba is not very readable. I think this can easily be improved by removing all the comments from the function body --- there are many more comment lines then actual code lines! --- into a brief descriptive documentation in the function header.

Best wishes,

Sebastian

Changed 10 years ago by lftabera

tests over some rings

comment:13 follow-up: Changed 10 years ago by lftabera

Replying to spancratz:

Hi Luis,

Ok, this is a bad example, I needed some rings that produced genuine Polynomial_generic_dense_field appart from number fields.

Ok I agree, but this ticket is not about polynomials for number fields, but multiplication of Polynomial_generic_dense_field elements.

Well, which base ring is it that you are interested in?

The fact that I am interested in polynomials over number fields does not invalidate the fact that the current generic karatsuba method must be changed because it is too inefficient. Do you think that mul_karatsuba is ok when it is slower for degree 1500 polynomials?

I don't really think there is all that much point in squeezing a small improvement out of the generic case when often much more is available, even more so when it is not clear that your patch improves the run-time across all base rings.

It does improve across all rings (or at least is not worse) because:

  • It has less function calls, less slicing of lists and less if branches (low impact).
  • It does not perform plenty of unnecessary additions of 0 + ModuleElement? (high impact over some base rings, like number fields or orders of number fields).
  • User has control over the karatsuba threshold, this can speed up things in some cases. By default the threshold is the same as the previous implementation. So this is not worse than the previous method (quite ring dependent).
  • The case of different sized polynomials is inefficient in the current implementation (high impact over every ring).

This last reason is very important. For instance, take a pair of polynomials over ZZ. One of degree twice the other. _mul_karatsuba is slower than _mul_generic independently of the degree. This is a bug in my opinion.

sage: K=ZZ[x]
sage: f=K.random_element(4000)
sage: g=K.random_element(2000)
sage: h=K.random_element(1500)
sage: %time _= f._mul_generic(g)
CPU times: user 0.90 s, sys: 0.00 s, total: 0.90 s
Wall time: 0.90 s
sage: %time _= f._mul_karatsuba(g)
CPU times: user 2.65 s, sys: 0.00 s, total: 2.66 s
Wall time: 2.66 s
sage: %time _= f._mul_karatsuba(h)
CPU times: user 3.41 s, sys: 0.00 s, total: 3.41 s
Wall time: 3.42 s

With patch:

sage: K=ZZ[x]
sage: f=K.random_element(4000)
sage: g=K.random_element(2000)
sage: h=K.random_element(1500)
sage: %time _=f._mul_generic(g)
CPU times: user 0.90 s, sys: 0.00 s, total: 0.90 s
Wall time: 0.91 s
sage: %time _=f._mul_karatsuba(g)
CPU times: user 0.24 s, sys: 0.00 s, total: 0.24 s
Wall time: 0.24 s
sage: %time _=f._mul_karatsuba(h)
CPU times: user 0.25 s, sys: 0.00 s, total: 0.25 s
Wall time: 0.25 s

I do not care if FLINT is lightning fast here. This example is just a symptom that _mul_karatsuba is not doing its job compared with _mul_generic. Moreover, multiplying a polynomial of degree 4000 and other of degree 1500 is much slower than multiplying a polynomial of degree 4000 and other of degree 2000. With the patch they cost around the same time, not something I am confortable with.

More examples for more rings, taking two polynomials. One of degree 27 and the other 31. The first line is the time taking with my patch, second line with SAGE 4.6.

ZZ[x]
625 loops, best of 3: 195 µs per loop
625 loops, best of 3: 435 µs per loop

ZZ[I][x]
625 loops, best of 3: 1.05 ms per loop
5 loops, best of 3: 219 ms per loop

ZZ[I,sqrt(2)][x]
25 loops, best of 3: 10.3 ms per loop
5 loops, best of 3: 317 ms per loop

QQ[x]
625 loops, best of 3: 1 ms per loop
125 loops, best of 3: 1.93 ms per loop

Integers(6)[x]
625 loops, best of 3: 263 µs per loop
625 loops, best of 3: 890 µs per loop

ZZ[x]['y']
625 loops, best of 3: 263 µs per loop
125 loops, best of 3: 5.06 ms per loop

QuaternionAlgebra(1,1)[x]
125 loops, best of 3: 2.36 ms per loop
125 loops, best of 3: 4.43 ms per loop

GF(2)[x].fraction_field()[y]
5 loops, best of 3: 44 ms per loop
5 loops, best of 3: 70.6 ms per loop

QQ[x].fraction_field()['y']
5 loops, best of 3: 229 ms per loop
5 loops, best of 3: 290 ms per loop

GF(5)[x].quotient_ring(x^5-x)[x]
5 loops, best of 3: 53.5 ms per loop
5 loops, best of 3: 80.6 ms per loop

SR[x]
625 loops, best of 3: 1.26 ms per loop
125 loops, best of 3: 3.95 ms per loop

QQ['x,y,z']['t']
25 loops, best of 3: 25.7 ms per loop
25 loops, best of 3: 34.1 ms per loop

GF(7^2,'a')[x] #Here the time is virtually the same due to the cost of creating the lists
5 loops, best of 3: 68.8 ms per loop
5 loops, best of 3: 71.3 ms per loop
  • x200 faster for orders of number fields
  • x30 for relative orders
  • x2 faster for quaternion algebras
  • x2 for the symbolic ring

I do not think that this is a small improvement.

By the way, perhaps other developers might disagree with me. Perhaps it would be a good idea for you to ask about this on sage-devel?

I'll do

Of course, these are just words... Here's an example:

sage: R.<x> = QQ[I][]
sage: f = R.random_element(1500)
sage: g = R.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 4.39 s, sys: 0.01 s, total: 4.40 s
Wall time: 4.42 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 5.63 s, sys: 0.04 s, total: 5.67 s
Wall time: 5.73 s
sage: S.<y> = QQ[]
sage: f0 = S([f[i][0] for i in range(f.degree() + 1)])
sage: f1 = S([f[i][1] for i in range(f.degree() + 1)])
sage: g0 = S([g[i][0] for i in range(g.degree() + 1)])
sage: g1 = S([g[i][1] for i in range(g.degree() + 1)])
sage: timeit('f0 * g0 - f1 * g1')
125 loops, best of 3: 2.76 ms per loop
sage: timeit('f1 * g0 + f0 * g1')
125 loops, best of 3: 2.99 ms per loop

Here, the current code takes about 5s, whereas one can easily complete the computation in 6ms. This improvement is much better than the factor of 7 that your patch provides.

Ok, but you are disregarding all other cases.

Anyway, all that said, let's be more positive again and look at your actual code:

  • In _mul_generic, I think you should move some of the isolated and non-edge cases from TESTS to EXAMPLES. TESTS should then contains more bulk tests as suggested earlier on this ticket.
  • In _mul_generic, your calls to list() will likely create new lists of objects. If they are guaranteed to be shallow copies of the elements, that is ok. This needs to be verified. If they are deep copies, however, this should be changed.

I have not really changed _mul_generic, only moved the code to avoid duplication of code. Calling list is already in the original code. list is also used for addition by the way. But I realize something, I have to take care of inline operations in the code. If base ring elements are inmutable (as I think they are expected to) this will just call non-inlined operations. If they are mutable, multiplication of polynomials will change the input polynomials, which is wrong. I will change this.

  • The code in do_karatsuba is not very readable. I think this can easily be improved by removing all the comments from the function body --- there are many more comment lines then actual code lines! --- into a brief descriptive documentation in the function header.

I will take a look,

Best regards,

Luis

comment:14 Changed 10 years ago by lftabera

For the number field specific case, see #10591

comment:15 in reply to: ↑ 13 Changed 10 years ago by spancratz

Hi Luis,

I am terribly sorry for not writing on this thread again any earlier; after returning to the UK following the Sage bug days in Seattle, I was busy catching up on some other things. Anyway...

Ok I agree, but this ticket is not about polynomials for number fields, but multiplication of Polynomial_generic_dense_field elements.

Well, which base ring is it that you are interested in?

The fact that I am interested in polynomials over number fields does not invalidate the fact that the current generic karatsuba method must be changed because it is too inefficient. Do you think that mul_karatsuba is ok when it is slower for degree 1500 polynomials?

Of course, it is great that (via this patch!) there might soon be some code which speeds up the generic implementation. But to be honest, personally I don't think the generic implementation is all that important; after all, it is only a fall-back option used if no-one has brought up enough energy to implement something more specific.

I do not care if FLINT is lightning fast here.

I feel this is a bit of a shame. Like rjf mentioned on sage-devel for many base rings you will be able to come up with something much, much faster than this small (by comparison!) improvement.

This example is just a symptom that _mul_karatsuba is not doing its job compared with _mul_generic. Moreover, multiplying a polynomial of degree 4000 and other of degree 1500 is much slower than multiplying a polynomial of degree 4000 and other of degree 2000.

Of course, you are right, this is ridiculous. Implicit zero-padding (or even explicit, if you assume only references are stored) should be nearly for free, so a 4000 by 1500 product shouldn't cost any noticeably more than a 4000 by 2000 product. (Of course, ideally it should cost less!)

  • x200 faster for orders of number fields
  • x30 for relative orders
  • x2 faster for quaternion algebras
  • x2 for the symbolic ring

I do not think that this is a small improvement.

Obviously, we can agree to disagree here. In absolute terms, all of the above improvements are fantastic! In relative terms, looking at what would be possible just by writing base ring specific code, however, I am confident you could do better.

Of course, these are just words... Here's an example:

sage: R.<x> = QQ[I][]
sage: f = R.random_element(1500)
sage: g = R.random_element(1500)
sage: %time _ = f._mul_generic(g)
CPU times: user 4.39 s, sys: 0.01 s, total: 4.40 s
Wall time: 4.42 s
sage: %time _ = f._mul_karatsuba(g)
CPU times: user 5.63 s, sys: 0.04 s, total: 5.67 s
Wall time: 5.73 s
sage: S.<y> = QQ[]
sage: f0 = S([f[i][0] for i in range(f.degree() + 1)])
sage: f1 = S([f[i][1] for i in range(f.degree() + 1)])
sage: g0 = S([g[i][0] for i in range(g.degree() + 1)])
sage: g1 = S([g[i][1] for i in range(g.degree() + 1)])
sage: timeit('f0 * g0 - f1 * g1')
125 loops, best of 3: 2.76 ms per loop
sage: timeit('f1 * g0 + f0 * g1')
125 loops, best of 3: 2.99 ms per loop

Ok, but you are disregarding all other cases.

Obviously, and deliberately so. But please let me know which ring you actually, mathematically care about. I would not be surprised if we could come up with something noticeably faster!

By the way, I am not sure whether anyone else is looking at this bit of code. While I can't promise how much spare time I will have in the next couple of weeks, I am happy to keep looking at this, run tests, give comments etc.

Best wishes,

Sebastian

comment:16 Changed 10 years ago by lftabera

Hi Sebastian,

I think that all is said, I understand your point of view and appreciate the interest you are showing in this ticket. But I also think that this ticket is necessary and that current behavior for different sized polynomials is a bug.

I am not currently working a lot on Sage, since I am overloaded with pre-exams and administrative duties, but I will try to finish with the cleaning of the code in the next two weeks.

Bests

Changed 10 years ago by lftabera

Changed 10 years ago by lftabera

comment:17 Changed 10 years ago by lftabera

just for the record, I attach two images showing some numbers.

comparison_product_50_400.png Shows the number of products made to multiply a fixed polynomial of degree 49 and another polynomial of degree from 0 to 399.

comparison_addition_50_400.png shows the number of additions performed by the same polynomials.

  • In yellow using algorithm _mul_generic
  • In red using current _mul_karatsuba as of 4.6.1
  • in blue with this patch

We can appreciate that for a big difference in the degrees of the polynomials, current _mul_karatsuba needs more additions and more products than _mul_generic.

Changed 10 years ago by lftabera

comment:18 Changed 10 years ago by lftabera

  • Authors set to Luis Felipe Tabera Alonso
  • Status changed from new to needs_review

Apply: trac_karatsuba_improvements.2.patch

comment:19 Changed 10 years ago by lftabera

  • Description modified (diff)

rebase against 4.7.1

Changed 10 years ago by lftabera

comment:20 Changed 8 years ago by lftabera

  • Description modified (diff)
  • Summary changed from improve karatsuba multiplication of univariate polynomials to Correct karatsuba multiplication of univariate polynomials for different degree polynomials
  • Type changed from enhancement to defect

Reflect better the nature of the problem in the description.

comment:21 Changed 8 years ago by lftabera

  • Description modified (diff)

Apply: trac_10255_karatsuba_improbement.patch

comment:22 Changed 8 years ago by jdemeyer

  • Milestone changed from sage-5.11 to sage-5.12

Changed 7 years ago by lftabera

rebase 5.13

comment:23 Changed 7 years ago by lftabera

Apply: trac_10255_karatsuba_improbement.patch

comment:24 Changed 7 years ago by lftabera

  • Branch set to u/lftabera/ticket/10255

comment:25 Changed 7 years ago by git

  • Commit set to 4d08ddff868a930de116c31163af910ea33b3d7d

Branch pushed to git repo; I updated commit sha1. New commits:

4d08ddf#10255 Improve karatsuba multiplication.

comment:26 Changed 7 years ago by mmezzarobba

  • Branch changed from u/lftabera/ticket/10255 to u/mmezzarobba/10255-karatsuba
  • Commit changed from 4d08ddff868a930de116c31163af910ea33b3d7d to f6cd17edfaaa9ef285174579417855ebae41ae73
  • Reviewers set to Marc Mezzarobba

I for one believe that having a reasonable generic multiplication is important, no matter how much faster specialized implementations can be.

The implementation in this ticket is a significant improvement over the existing one. The code looks fine to me and works well on all examples I tried.

(There are still, e.g., a few list copies that would be nice to avoid, but this does not look easy to do without changes to the not-so-uniform interface of polynomials. In any case, this is not a reason to further delay this ticket IMO. Same goes for the other improvements I can think of.)

A few minor points though, all related to documentation and tests:

  • Shouldn't the test in the docstring of do_karatsuba_different_size read
     sage: K = ZZ[x]
     sage: f = K.random_element(21) # instead of 28
     sage: g = K.random_element(34)
     sage: f*g - f._mul_karatsuba(g,0)
     0
    
    in accordance with the comment about Fibonacci numbers?
  • The tests that call random_element() should probably use sage.misc.random_testing.
  • I pushed a commit with minor changes to some docstrings and comments (no code change). Could you please have a look and tell me if you agree with my changes?

New commits:

f6cd17eImproved Karatsuba: cosmetic changes to docstrings and comments

comment:27 Changed 7 years ago by lftabera

  • Branch changed from u/mmezzarobba/10255-karatsuba to u/lftabera/ticket/10255
  • Commit changed from f6cd17edfaaa9ef285174579417855ebae41ae73 to 15dabc332b766acd7adffad8cfc1371653037831

Marc,

Thanks a lot for taking a look into this and working on improving the documentation. I agree with your changes. I have also changed the bogus non-fibnacci degree.

Concerning the random tests, I have moved the comparison of the result with external libraries to rings/tests.py, but I have left the rest of the tests in place. Do you agree with this changes?

comment:28 Changed 7 years ago by mmezzarobba

  • Branch changed from u/lftabera/ticket/10255 to u/mmezzarobba/10255-karatsuba
  • Commit changed from 15dabc332b766acd7adffad8cfc1371653037831 to 597747007214bdb661d112296f54a77989aa2a01

Hi,

I mostly agree with your changes, but I think randomized tests over other rings than ZZ would not be a luxury. So I modified test_karatsuba_multiplication to take the base ring as input. Can you have a look at my patch?

Most importantly, your test used to compare g*f to f._mul_karatsuba(g, threshold). I changed g*f to (essentially) f*g in order to handle noncommutative base rings. But perhaps you wrote g*f deliberately? I also made the degree bound passed to R.random_element random. And I changed ZZ['x'] to PolynomialRing(base_ring, 'x') because some rings currently don't support the former syntax.

If you are ok with my changes, please go on and set the ticket to positive_review.


New commits:

5977470Improved Karatsuba: more thorough tests

comment:29 Changed 7 years ago by lftabera

  • Authors changed from Luis Felipe Tabera Alonso to Luis Felipe Tabera Alonso, Marc Mezzarobba
  • Reviewers changed from Marc Mezzarobba to Marc Mezzarobba, Luis Felipe Tabera Alonso
  • Status changed from needs_review to positive_review

The code looks good to me. I wrote f._mul_karatsuba thinking on testing commutativity, but now I realize that assuming that the other library I am testing against is correct, it does not make much sense.

comment:30 Changed 7 years ago by vbraun

  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.