Opened 9 years ago

Last modified 8 years ago

#10532 needs_info enhancement

Faster multiplication for multivariate power series

Reported by: niles Owned by: malb
Priority: major Milestone:
Component: commutative algebra Keywords: multivariate power series multiplication Karatsuba
Cc: mario, pernici, lftabera, ncohen Merged in:
Authors: pernici Reviewers: Niles Johnson
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Description

Multivariate power series are implemented in #1956. Ticket #10480 has a couple of different algorithms for improving PowerSeries_poly multiplication. MPowerSeries._mul_ should be modified to take advantage of the optimal algorithm.
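
For orientation, here is a rough sketch of the representation this relies on (send_to_bg is a hypothetical stand-in for the _send_to_bg conversion of #1956; the real implementation differs in its details): a multivariate series truncated at total degree prec lives in the background univariate ring PowerSeriesRing(poly_ring, 'T'), with the total-degree-k part tagged by T^k, so that _mul_ reduces to a truncated univariate multiplication there.

sage: P.<x,y> = PolynomialRing(QQ)
sage: B.<T> = PowerSeriesRing(P)
sage: def send_to_bg(p, prec):
....:     # tag each monomial of total degree k with T^k, then truncate
....:     s = sum(c*m*T^m.degree() for c, m in zip(p.coefficients(), p.monomials()))
....:     return B(s).add_bigoh(prec)
....:
sage: send_to_bg(1 + x + y, 3) * send_to_bg(1 - x, 3)  # one truncated mul in B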

Attachments (7)

trac_10532_faster_MPowerSeries_mul.patch (2.9 KB) - added by niles 9 years ago.
depends on #1956 and trac_10480_fast_PowerSeries_poly_multiplication2.patch
mu2.sage (1.6 KB) - added by pernici 9 years ago.
benchmark
bench1.sage (1.1 KB) - added by pernici 9 years ago.
benchmark
trac_1956_faster_MPowerSeries_mul2.patch (4.4 KB) - added by pernici 8 years ago.
trac_10532_invert.patch (2.4 KB) - added by pernici 8 years ago.
trac_10532_square_trunc.patch (2.2 KB) - added by pernici 8 years ago.
trac_10532_send_to_bg.patch (1.7 KB) - added by pernici 8 years ago.


Change History (21)

Changed 9 years ago by niles

depends on #1956 and trac_10480_fast_PowerSeries_poly_multiplication2.patch

comment:1 Changed 9 years ago by niles

  • Authors set to pernici
  • Reviewers set to Niles Johnson
  • Status changed from new to needs_info

A patch posted by pernici at #1956: first apply #1956, then trac_10480_fast_PowerSeries_poly_multiplication2.patch from #10480, then trac_10532_faster_MPowerSeries_mul.patch. Timings before and after the patch are listed in #1956, comment 59, showing a substantial improvement from this patch.

Comparisons with the Karatsuba algorithm of #10480 still need to be done.

comment:2 Changed 9 years ago by ncohen

  • Cc ncohen added

(cc me)

Changed 9 years ago by pernici

benchmark

Changed 9 years ago by pernici

benchmark

comment:3 Changed 9 years ago by pernici

Attachments: bench1.sage, mu2.sage

Here are some benchmarks comparing:

(1) with do_mul_trunc_generic from trac_10480_fast_PowerSeries_poly_multiplication2.patch, trac_1956_faster_MPowerSeries_mul.patch

(2,K) with do_trunc_karatsuba, where K is K_threshold from trac_karatsuba_improvements.patch, trac_10480_fast_PowerSeries_poly_multiplication-alternative.patch

Either (1) or (2) is applied on top of trac_1956_multi_power_series_new_4.patch, trac_1956_uni_multi_ps_2.patch and trac_1956_multi_ps_cleanup.patch.

These tests are run with QQ and GF(7) as examples of fields implemented in libsingular, and with RR as a field not implemented there.

Various values of K are tried. For QQ and GF(7), (1) corresponds to the K=infinity case (it is slightly slower only because do_mul_trunc_generic has a slightly slower implementation than the corresponding do_trunc_classical by lftabera); for RR (and, I guess, for other fields not handled by libsingular) this does not hold, and (1) is slower.

The conclusion: I think that (2) with K=infinity (or a large integer, to avoid using infinity) is the best choice.

bench1.sage contains the benchmarks posted by Niles for the comparison with Magma, with the precision multiplied by N.

Times are in seconds; processor: 4-core Intel Core i7 CPU 860 @ 2.80GHz on x86_64 GNU/Linux.

prec = prec1*N
bench1.sage with GF(7)
test no. 1     2     3     4     5     6     7     8
prec1    16    15    50    30    51    30    30    30
N=3
(1)      0.45  0.03                    0.49  0.49  0.45
(2,8)    2.16  0.06                    1.21  1.25  1.10
(2,64)   0.45  0.03                    0.55  0.56  0.54
(2,128)  0.43  0.03                    0.44  0.45  0.44
N=4
(1)      1.80  0.06                    1.44  1.34  1.22
(2,8)    3.95  0.12                    3.89  3.79  3.25
(2,64)   1.88  0.06                    1.69  1.60  1.50
(2,128)  1.78  0.06                    1.45  1.34  1.22
N=5
(1)      8.96  0.21                    5.01  4.65  4.27
(2,8)    23.2  0.52                    14.8  14.3  12.3
(2,64)   11.2  0.23                    6.55  6.01  5.53
(2,128)  8.74  0.19                    5.50  5.11  4.69
N=6
(1)      27.0  0.67                    9.99  9.35  9.17
(2,8)    87.2  1.56                    30.4  31.2  26.4
(2,64)   34.5  0.74                    13.6  12.8  12.4
(2,128)  26.1  0.67                    10.9  10.1  9.79
N=7
(1)      40.8  1.46                    18.6  17.3  16.1
(2,8)    132   3.67                    57.6  60.4  51.4
(2,64)   51.1  1.55                    25.0  23.3  21.1
(2,128)  39.3  1.51                    20.4  19.2  17.6
(2,256)  38.3  1.32                    18.3  16.9  15.9


bench1.sage with RR
test no. 1     2     3     4     5     6     7     8
prec1    16    15    50    30    51    30    30    30
N=1
(1)      1.18  0.06  0.88  0.70  0.23  0.67  0.69  0.70
(2,8)    0.34  0.04  0.61  4.92  0.18  0.47  0.57  0.89
(2,64)   0.18  0.04  0.74  0.27  0.20  0.24  0.26  0.26
N=2
(1)      57    0.96  1.6   8.4   0.67  8.2   8.5   8.4
(2,8)    19.5  0.29  1.4   >540  0.44  9.9   32    183
(2,64)   6.06  0.16  1.2   2.65  0.43  2.55  2.63  2.68
(2,128)  6.07  0.16  1.3   2.64  0.48  2.57  2.63  2.63
N=3
(2,64)   58.1  0.93  1.39  79.4  0.72  18.5  28.8  48.8
(2,128)  58.9  0.94  1.59  11.8  0.78  11.6  12.0  11.7

mu2.sage is a modification of the benchmark mu.sage posted in ticket #1956, in which the random polynomials are evaluated at t_i + 1; the various series are stored in dat/ so that times can be compared on the same series with different algorithms. This is a workaround for the lack of something like random.seed for random_element. Timings are only indicative.

mu2.sage  with QQ
        N=1   2     3     4     5     6     7     8
n=2
prec=20 bound=20
(1)     0.007 0.01  0.02  0.04  0.09  0.21  0.58  1.79
(2,8)   0.015 0.03  0.06  0.10  0.22  0.58  1.88  5.92
(2,32)  0.007 0.01  0.02  0.04  0.09  0.20  0.57  1.74
prec=20 bound=100
(1)     0.007 0.01  0.03  0.05  0.11  0.28  0.79  2.4
(2,8)   0.019 0.04  0.07  0.13  0.30  0.85  2.64  8.2
(2,32)  0.007 0.01  0.03  0.05  0.11  0.27  0.78  2.4
prec=100 bound=100
(1)     2.96  4.46  7.14  13.4  31.8
(2,8)   18.0  32.6  67.1  162   494
(2,32)  6.82  10.7  18.8  38.7  104
(2,128) 2.91  4.40  7.23  13.5  32.1

mu2.sage  with GF(7)
n=2
prec=100 bound=100
(1)     0.08  0.08  0.08  0.08  0.09  0.09  0.09  0.09
(2,8)   0.41  0.45  0.45  0.44  0.45  0.45  0.46  0.44
(2,128) 0.07  0.08  0.08  0.08  0.08  0.08  0.08  0.08

prec=200 bound=200
(1)     1.05  1.21  1.22  1.23  1.22
(2,8)   11.0  12.5  12.4  12.4  12.5
(2,128) 1.41  1.59  1.58  1.59  1.58
(2,256) 1.00  1.14  1.12  1.14  1.15

mu2.sage with RR
n=2
prec=20 bound=20
(1)     0.03  0.06  0.06  0.06  0.06  0.06  0.06  0.06
(2,8)   0.03  0.04  0.15  0.41  1.15  3.66  12.8  45.7
(2,128) 0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01

prec=20 bound=100
(1)     0.05  0.05  0.06  0.06  0.06  0.06  0.06  0.06
(2,8)   0.03  0.07  0.23  0.42  1.08  2.71  5.94  10.2
(2,128) 0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01

prec=100 bound=100
(1)     1.87  1.87  1.91  1.87  1.88
(2,8)   1.06  8.95  69.0  332   2050
(2,128) 0.31  0.31  0.31  0.31  0.31

n=4
prec=20 bound=20
(1)     4.05  68.8  69.5  71.1  76.1
(2,8)   3.91  66.0  615   >2000
(2,128) 1.16  2.00  2.02  2.05  2.03

comment:4 Changed 9 years ago by lftabera

If K=infinity is needed to obtain good performance, then either we are doing something wrong or truncated Karatsuba should be eliminated in favor of the classical truncated algorithm. I will try to look at this, but probably not this week.

comment:5 Changed 9 years ago by pernici

In the message above I claimed that patch (1) uses do_mul_trunc_generic; this is not true for RR, where it falls back to _mul_generic, the classical multiplication. This is why, for RR, (1) is much slower than the K=infinity case, which calls do_trunc_classical. Since the truncated classical multiplication do_trunc_classical computes the same terms as the classical multiplication, apart from those which are then truncated away, it is fine for non-exact fields.

The conclusion of the previous message stands: I think that (2) with K=infinity (or a large integer to avoid using infinity) is the best choice.

lftabera wrote:

If K=infinity is needed to obtain good performance, then either we are doing something wrong or truncated Karatsuba should be eliminated in favor of the classical truncated algorithm.

Right, it could call the classical truncated algorithm directly, but the K=infinity case reaches the classical truncated algorithm almost immediately, so I think the overhead of going through Karatsuba is negligible anyway.
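
For reference, here is a minimal model of the classical truncated algorithm on plain coefficient lists (mul_trunc is a hypothetical name, not the actual do_trunc_classical code); with K_threshold = infinity the Karatsuba entry point dispatches to this right away:

def mul_trunc(a, b, prec):
    # schoolbook multiplication that skips every product contributing only
    # to terms of degree >= prec; the computed terms are exactly the
    # classical ones, so nothing extra happens on inexact rings
    c = [0] * prec
    for i, ai in enumerate(a[:prec]):
        if ai:
            for j, bj in enumerate(b[:prec - i]):
                c[i + j] += ai * bj
    return c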

Changed 8 years ago by pernici

comment:6 Changed 8 years ago by pernici

The attached patch trac_1956_faster_MPowerSeries_mul2.patch is applied after trac_1956_multi_power_series_new_4.patch, trac_1956_uni_multi_ps_2.patch and trac_1956_multi_ps_cleanup.patch.

For multivariate series, Karatsuba multiplication is slower than generic multiplication even in the prec=infinity case:

sage: R.<x,y,z,w> = QQ[[]]
sage: p = 1 + x/2 + y/3 + z/4
sage: %timeit p^20

Times are in ms; K = K_threshold
                  p^5     p^10    p^20       p^40
generic           0.31    1.62    28.5       1070
Karatsuba K=8     0.56    5.7     172        1900
Karatsuba K=64    0.56    5.7     167        1080

The attached patch uses generic multiplication for any precision; it has been made independent of ticket #10480, which deals mainly with truncated Karatsuba multiplication, a technique that is fast only for univariate series.

_mul_trunc_generic, derived from do_trunc_classical by lftabera, has a lot of code in common with _mul_generic and should perhaps be merged with it; it would also be nice to apply the _square_generic optimization for finite prec.

Changed 8 years ago by pernici

comment:7 Changed 8 years ago by pernici

The attached trac_10532_invert.patch is applied after the #1956 patches and trac_1956_faster_MPowerSeries_mul2.patch.

In this patch __invert__ is implemented using _mul_trunc_generic; inversion of a generic multivariate series becomes much faster.

In the benchmarks below, (0) is without this patch and (1) is with it; times are in seconds.

sage: N = 4
sage: R = PowerSeriesRing(QQ,N,'x')
sage: x = R.gens()
sage: sx = sum(x)
sage: prec = 15
sage: p = (1 + sx + R.O(prec))^-1 + (1 + 2*sx + R.O(prec))^-1
sage: %time p1 = p^-1
Wall time: 0.40 s

N  prec     (0)    (1)
2  30       1.18   0.08
2  50       30.3   0.87
4  15       42.0   0.40
4  20       662    3.2
sage: N = 4
sage: R = PowerSeriesRing(QQ,N,'x')
sage: x = R.gens()
sage: prec = 30
sage: p = sum([(i+1)*x[i]^(i+1) for i in range(N)]) + R.O(prec)
sage: p = 1 + sum([p^i/i for i in range(1,prec)])
sage: %time p1 = p^-1
Wall time: 0.28 s

N  prec     (0)    (1)
2  30       0.37   0.03
2  50       5.7    0.28
4  30       32.6   0.28
8  30       1566   1.0

A special case of series inversion is a series of degree much lower than prec, like

p = 1 + 2*x[0] + 3*x[1] + 4*x[2] + 5*x[3] + R.O(30)

Here inversion is much faster because, in the Newton iteration, the expensive multiplication at order next_prec becomes trivial; what remains is essentially the square of a polynomial of degree roughly next_prec/2. In this case this patch does not change the speed; this ticket already contains many benchmarks for this case, see bench1.sage.
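
To make the mechanism concrete, here is a minimal model of the Newton iteration on plain coefficient lists (hypothetical helper names; the patch itself goes through _mul_trunc_generic on the background representation):

def mul_trunc(a, b, prec):
    # classical truncated multiplication, as sketched in comment:5
    c = [0] * prec
    for i, ai in enumerate(a[:prec]):
        if ai:
            for j, bj in enumerate(b[:prec - i]):
                c[i + j] += ai * bj
    return c

def invert_series(a, prec):
    # Newton iteration x -> x*(2 - a*x); the working precision doubles
    # at each step, so the last multiplications dominate the cost
    x = [1 / a[0]]           # a[0] must be invertible
    p = 1
    while p < prec:
        p = min(2 * p, prec)
        ax = mul_trunc(a, x, p)
        correction = [-t for t in ax]
        correction[0] += 2   # correction = 2 - a*x modulo degree p
        x = mul_trunc(x, correction, p)
    return x

For example, invert_series([1, 1], 4) recovers the coefficients of 1/(1+t) = 1 - t + t^2 - t^3.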

Finally, I point out that in ticket #10720 I wrote a similar but slower version of __invert__ for series over polynomial rings; I will improve that version, but note that the speedup of __invert__ for a generic series over polynomial rings is much smaller than here.

Changed 8 years ago by pernici

comment:8 follow-up: Changed 8 years ago by pernici

The attached trac_10532_square_trunc.patch is applied after the #1956 patches, trac_1956_faster_MPowerSeries_mul2.patch and trac_10532_invert.patch.

With this patch truncated multiplication is optimized for squares.
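
The idea, again in a minimal coefficient-list model (square_trunc is a hypothetical name, not the patch code):

def square_trunc(a, prec):
    # truncated square: each cross term a[i]*a[j] with i < j is computed
    # once and doubled, roughly halving the coefficient multiplications
    # compared with mul_trunc(a, a, prec)
    c = [0] * prec
    n = min(len(a), prec)
    for i in range(n):
        if 2 * i < prec:
            c[2 * i] += a[i] * a[i]
        for j in range(i + 1, min(n, prec - i)):
            c[i + j] += 2 * a[i] * a[j]
    return c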

Benchmark comparison with PARI/GP:

sage: N = 4
sage: R = PowerSeriesRing(QQ,N,'x')
sage: x = R.gens()
sage: sx = sum(x)
sage: prec = 20
sage: n = 1
sage: %time p = ((1 + sx + R.O(prec))^-1 + (1 + 2*sx + R.O(prec))^-1)^-n
Wall time: 4.06 s

To run this computation with gp, allocate more memory than it is assigned by default; here the computation is done directly in the background power series ring:

sage: try:
....:     gp.allocatemem()
....: except:
....:     pass
....:
sage: N = 4
sage: prec = 20
sage: n = 1
sage: sx = '+'.join(['x%d' % i for i in range(N)])
sage: fmt = '((1 + t*(%s) + O(t^%s))^-1 + (1 + 2*t*(%s) + O(t^%s))^-1)^-%s'
sage: %time p1 = gp(fmt %(sx,prec,sx,prec,n))
Wall time: 1.56 s

(0) version of ticket #1956
(1) with trac_10532_invert.patch
(2) with trac_10532_square_trunc.patch
(3) with gp (PARI/GP)
N  prec  n   (0)    (1)    (2)    (3)
2  30    1   1.2    0.10   0.10   0.06
2  30    20  8.6    0.22   0.18   0.12
2  50    1   21     1.2    1.2    0.55
2  50    20  213    2.1    1.8    1.2
4  15    1   42     0.46   0.46   0.22
4  15    20  323    1.2    0.96   0.43
4  20    1   673    4.1    4.1    1.6
4  20    20  >7000  10     8.1    3.2

comment:9 in reply to: ↑ 8 Changed 8 years ago by niles

Replying to pernici:

(0) version of ticket #1956
(1) with trac_10532_invert.patch
(2) with trac_10532_square_trunc.patch
(3) with gp (PARI/GP)

This looks *really good* -- but I'm a little confused by the comparison with gp . . . If that's always faster, should we just always do the multiplication there? There must be more subtleties involved, but I don't know *anything* about them :(

comment:10 follow-up: Changed 8 years ago by pernici

niles wrote:

This looks *really good* -- but I'm a little confused by the comparison with gp . . . If that's always faster, should we just always do the multiplication there? There must be more subtleties involved, but I don't know *anything* about them :(

Well, I don't know either; I just looked at it today. I think that if one wants to use PARI, one should avoid gp as much as possible and call the PARI C library functions directly; in fact gp has a big overhead:

sage: g = gp('1 + (x+y)*t + O(t^6)')
sage: %timeit g^2
25 loops, best of 3: 16 ms per loop
sage: R.<x,y> = QQ[[]]
sage: p = 1 + x + y + R.O(6)
sage: %timeit p^2
625 loops, best of 3: 56.5 µs per loop

so gp only starts to be convenient when the computation is long enough.

For this reason, in the benchmark in the previous comment, gp is slower at small precision:

N  prec  n   (0)    (2)    (3)
2  10    1   0.02   0.01   0.02
2  20    1   0.23   0.03   0.03

In the above benchmarks gp is at most 2.5x faster; in those cases the overhead is negligible, so I guess that using the PARI library directly the speedup would be about the same.
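
For what it's worth, Sage's pari object (the C library binding) can already be used instead of the gp interface to avoid the subprocess round-trip; a sketch (timings omitted, they depend on the machine):

sage: g = pari('1 + (x+y)*t + O(t^6)')   # stays in the library, no subprocess
sage: %timeit g^2
sage: h = gp('1 + (x+y)*t + O(t^6)')     # each operation round-trips to gp
sage: %timeit h^2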

comment:11 in reply to: ↑ 10 ; follow-up: Changed 8 years ago by niles

Replying to pernici:

In the above benchmarks gp is at most 2.5x faster; in those cases the overhead is negligible, so I guess that using the PARI library directly the speedup would be about the same.

I see -- maybe -- can this patch be written to use PARI directly by default then? In the cases where it isn't as fast, the computation time should be small anyway, right?

comment:12 in reply to: ↑ 11 ; follow-up: Changed 8 years ago by pernici

Replying to niles:

Replying to pernici:

In the above benchmarks gp is at most 2.5x faster; in those cases the overhead is negligible, so I guess that using the PARI library directly the speedup would be about the same.

I see -- maybe -- can this patch be written to use PARI directly by default then? In the cases where it isn't as fast, the computation time should be small anyway, right?

It depends on the use case. One user may typically do many computations with low-precision series, where it is important that they are fast, and in the rare case of a high-precision computation may not care that it takes 3x longer. Another user may care only about long computations.

Another observation: I suspect that total-degree multivariate power series implemented with PARI would require changes to the implementation in ticket #1956, not only to this patch; the background power series ring would not be PowerSeriesRing(self.__poly_ring, T) but a PARI power series. For certain rings one probably cannot use PARI at all (e.g. symbolic rings?), in which case the current implementation should be used instead.

Maybe it would be better to make a specialized multivariate power series class using PARI; that would be the subject of another ticket, if someone is interested in working on it. Ticket #1956 and this patch would deal with the generic implementation.

comment:13 in reply to: ↑ 12 Changed 8 years ago by niles

Replying to pernici:

Another observation: I suspect that total-degree multivariate power series implemented with PARI would require changes to the implementation in ticket #1956, not only to this patch; the background power series ring would not be PowerSeriesRing(self.__poly_ring, T) but a PARI power series. For certain rings one probably cannot use PARI at all (e.g. symbolic rings?), in which case the current implementation should be used instead.

Maybe it would be better to make a specialized multivariate power series class using PARI; that would be the subject of another ticket, if someone is interested in working on it. Ticket #1956 and this patch would deal with the generic implementation.

This sounds like a good idea -- thanks for clarifying. The improvements you've made are certainly worth finishing and including in Sage.

Changed 8 years ago by pernici

comment:14 Changed 8 years ago by pernici

The attached trac_10532_send_to_bg.patch fixes _send_to_bg in the case of multivariate series in one variable; see comments 64-70 in ticket #1956:

sage: R = PowerSeriesRing(QQ,1,'x')
sage: f = R._poly_ring().gens()[0]
sage: R._send_to_bg(f + f^2)
x*T + x^2*T^2
sage: R.random_element(4)
-3/25*x^3 - 3*x^2 + x + O(x)^4