Opened 9 years ago
Last modified 8 years ago
#10532 needs_info enhancement
Faster multiplication for multivariate power series
Reported by: | niles | Owned by: | malb |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | commutative algebra | Keywords: | multivariate power series multiplication Karatsuba |
Cc: | mario, pernici, lftabera, ncohen | Merged in: | |
Authors: | pernici | Reviewers: | Niles Johnson |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Attachments (7)
Change History (21)
Changed 9 years ago by
comment:1 Changed 9 years ago by
- Reviewers set to Niles Johnson
- Status changed from new to needs_info
A patch posted by pernici at #1956: First apply #1956, then trac_10480_fast_PowerSeries_poly_multiplication2.patch
from #10480, then trac_10532_faster_MPowerSeries_mul.patch. Timings before and after the patch are listed in #1956, comment 59, showing much improvement with this patch.
Comparisons with the Karatsuba algorithm of #10480 still need to be done.
comment:3 Changed 9 years ago by
Attachments: bench1.sage, mu2.sage
Here are some benchmarks comparing:
(1) with do_mul_trunc_generic from trac_10480_fast_PowerSeries_poly_multiplication2.patch, trac_1956_faster_MPowerSeries_mul.patch
(2,K) with do_trunc_karatsuba, where K is K_threshold from trac_karatsuba_improvements.patch, trac_10480_fast_PowerSeries_poly_multiplication-alternative.patch
(1) or (2) are applied after applying trac_1956_multi_power_series_new_4.patch, trac_1956_uni_multi_ps_2.patch, trac_1956_multi_ps_cleanup.patch.
These tests are run with QQ and GF(7) as examples of fields implemented in libsingular, RR as field not implemented there.
Various values of K are tried; (1) corresponds to the K=infinity case for QQ and GF(7) (slightly slower because do_mul_trunc_generic has a slightly slower implementation than the corresponding do_trunc_classical by lftabera), but not for RR (and I guess for other fields not used in libsingular), where (1) is slower.
The conclusion is that I think that (2) with K=infinity (or a large integer to avoid using infinity) is the best choice.
bench1.sage are the benchmarks posted by Niles to compare with Magma, with precision multiplied by N.
times are in seconds; processor:4-core: Intel Core i7 CPU 860 @ 2.80GHz on x86_64 GNU/Linux
prec = prec1*N bench1.sage with GF(7) test no. 1 2 3 4 5 6 7 8 prec1 16 15 50 30 51 30 30 30 N=3 (1) 0.45 0.03 0.49 0.49 0.45 (2,8) 2.16 0.06 1.21 1.25 1.10 (2,64) 0.45 0.03 0.55 0.56 0.54 (2,128) 0.43 0.03 0.44 0.45 0.44 N=4 (1) 1.80 0.06 1.44 1.34 1.22 (2,8) 3.95 0.12 3.89 3.79 3.25 (2,64) 1.88 0.06 1.69 1.60 1.50 (2,128) 1.78 0.06 1.45 1.34 1.22 N=5 (1) 8.96 0.21 5.01 4.65 4.27 (2,8) 23.2 0.52 14.8 14.3 12.3 (2,64) 11.2 0.23 6.55 6.01 5.53 (2,128) 8.74 0.19 5.50 5.11 4.69 N=6 (1) 27.0 0.67 9.99 9.35 9.17 (2,8) 87.2 1.56 30.4 31.2 26.4 (2,64) 34.5 0.74 13.6 12.8 12.4 (2,128) 26.1 0.67 10.9 10.1 9.79 N=7 (1) 40.8 1.46 18.6 17.3 16.1 (2,8) 132 3.67 57.6 60.4 51.4 (2,64) 51.1 1.55 25.0 23.3 21.1 (2,128) 39.3 1.51 20.4 19.2 17.6 (2,256) 38.3 1.32 18.3 16.9 15.9 bench1.sage with RR test no. 1 2 3 4 5 6 7 8 prec1 16 15 50 30 51 30 30 30 N=1 (1) 1.18 0.06 0.88 0.70 0.23 0.67 0.69 0.70 (2,8) 0.34 0.04 0.61 4.92 0.18 0.47 0.57 0.89 (2,64) 0.18 0.04 0.74 0.27 0.20 0.24 0.26 0.26 N=2 (1) 57 0.96 1.6 8.4 0.67 8.2 8.5 8.4 (2,8) 19.5 0.29 1.4 >540 0.44 9.9 32 183 (2,64) 6.06 0.16 1.2 2.65 0.43 2.55 2.63 2.68 (2,128) 6.07 0.16 1.3 2.64 0.48 2.57 2.63 2.63 N=3 (2,64) 58.1 0.93 1.39 79.4 0.72 18.5 28.8 48.8 (2,128) 58.9 0.94 1.59 11.8 0.78 11.6 12.0 11.7
mu2.sage is a modification of the benchmark mu.sage posted in ticket #1956 in which the random polynomials are evaluated in t_i + 1; the various series are stored in dat/ to compare times with the same series and different algorithms; this is a workabout for the lack of something like random.seed for random_element. Timings are only indicative.
mu2.sage with QQ N=1 2 3 4 5 6 7 8 n=2 prec=20 bound=20 (1) 0.007 0.01 0.02 0.04 0.09 0.21 0.58 1.79 (2,8) 0.015 0.03 0.06 0.10 0.22 0.58 1.88 5.92 (2,32) 0.007 0.01 0.02 0.04 0.09 0.20 0.57 1.74 prec=20 bound=100 (1) 0.007 0.01 0.03 0.05 0.11 0.28 0.79 2.4 (2,8) 0.019 0.04 0.07 0.13 0.30 0.85 2.64 8.2 (2,32) 0.007 0.01 0.03 0.05 0.11 0.27 0.78 2.4 prec=100 bound=100 (1) 2.96 4.46 7.14 13.4 31.8 (2,8) 18.0 32.6 67.1 162 494 (2,32) 6.82 10.7 18.8 38.7 104 (2,128) 2.91 4.40 7.23 13.5 32.1 mu2.sage with GF(7) n=2 prec=100 bound=100 (1) 0.08 0.08 0.08 0.08 0.09 0.09 0.09 0.09 (2,8) 0.41 0.45 0.45 0.44 0.45 0.45 0.46 0.44 (2,128) 0.07 0.08 0.08 0.08 0.08 0.08 0.08 0.08 prec=200 bound=200 (1) 1.05 1.21 1.22 1.23 1.22 (2,8) 11.0 12.5 12.4 12.4 12.5 (2,128) 1.41 1.59 1.58 1.59 1.58 (2,256) 1.00 1.14 1.12 1.14 1.15 mu2.sage with RR n=2 prec=20 bound=20 (1) 0.03 0.06 0.06 0.06 0.06 0.06 0.06 0.06 (2,8) 0.03 0.04 0.15 0.41 1.15 3.66 12.8 45.7 (2,128) 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 prec=20 bound=100 (1) 0.05 0.05 0.06 0.06 0.06 0.06 0.06 0.06 (2,8) 0.03 0.07 0.23 0.42 1.08 2.71 5.94 10.2 (2,128) 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 prec=100 bound=100 (1) 1.87 1.87 1.91 1.87 1.88 (2,8) 1.06 8.95 69.0 332 2050 (2,128) 0.31 0.31 0.31 0.31 0.31 n=4 prec=20 bound=20 (1) 4.05 68.8 69.5 71.1 76.1 (2,8) 3.91 66.0 615 >2000 (2,128) 1.16 2.00 2.02 2.05 2.03
comment:4 Changed 9 years ago by
If K=infinity is needed to obtain good performance then either we are doing something wrong or truncated karatsuba should be eliminated in favor of the classical truncated algorithm. I will try to look at this, but probably not this week.
comment:5 Changed 9 years ago by
In the message above I claimed that patch (1) uses do_mul_trunc_generic; this is not true for RR, in which case it falls back to _mul_generic, the classical multiplication; this is the reason for which (1) is much slower than the K=infinity case for RR, calling do_trunc_classical. Since the truncated classical multiplication do_trunc_classical computes the same terms as the classical multiplication, apart from those which are then truncated away, it is fine for non-exact fields.
The conclusion of the previous message stands: I think that (2) with K=infinity (or a large integer to avoid using infinity) is the best choice.
lftabera wrote:
If K=infinity is needed to obtain good performance then either we are doing something wrong or truncated karatsuba should be eliminated in favor of the classical truncated algorithm.
Right, it can call directly the classical truncated algorithm, but the K=infinity case calls the classical truncated algorithm almost immediately, so I think that the overhead of calling it is negligible anyway.
Changed 8 years ago by
comment:6 Changed 8 years ago by
The attached patch trac_1956_faster_MPowerSeries_mul2.patch is applied after trac_1956_multi_power_series_new_4.patch, trac_1956_uni_multi_ps_2.patch, trac_1956_multi_ps_cleanup.patch
For multivariate series Karatsuba multiplication is slower than generic
multiplication also in the case of prec=infinity
sage: R.<x,y,z,w> = QQ[[]] sage: p = 1 + x/2 + y/3 + z/4 sage: %timeit p^20 times in ms; K=K_threshold p^5 p^10 p^20 p^40 generic 0.31 1.62 28.5 1070 Karatsuba K=8 0.56 5.7 172 1900 Karatsuba K=64 0.56 5.7 167 1080
The attached patch uses generic multiplication for any precision; it has been made independent from ticket #10480, which deals mainly with the truncated Karatsuba multiplication, which is fast only for univariate series.
_mul_trunc_generic
, coming from do_trunc_classical by lftabera,
has a lot of code in common with _mul_generic
and maybe it should be merged
with it; also, it would be nice to do the _square_generic
optimization
for finite prec
.
Changed 8 years ago by
comment:7 Changed 8 years ago by
The attached trac_10532_invert.patch is applied after #1956 patches and trac_1956_faster_MPowerSeries_mul2.patch
In this patch __invert__
is implemented using _mul_trunc_generic
;
inversion of a generic multivariate series becomes much faster.
In the benchmarks (0) is without this patch, (1) with it
sage: N = 4 sage: R = PowerSeriesRing(QQ,N,'x') sage: x = R.gens() sage: sx = sum(x) sage: prec = 15 sage: p = (1 + sx + R.O(prec))^-1 + (1 + 2*sx + R.O(prec))^-1 sage: %time p1 = p^-1 Wall time: 0.40 s N prec (0) (1) 2 30 1.18 0.08 2 50 30.3 0.87 4 15 42.0 0.40 4 20 662 3.2
sage: N = 4 sage: R = PowerSeriesRing(QQ,N,'x') sage: x = R.gens() sage: prec = 30 sage: p = sum([(i+1)*x[i]^(i+1) for i in range(N)]) + R.O(prec) sage: p = 1 + sum([p^i/i for i in range(1,prec)]) sage: %time p1 = p^-1 Wall time: 0.28 s N prec (0) (1) 2 30 0.37 0.03 2 50 5.7 0.28 4 30 32.6 0.28 8 30 1566 1.0
A special kind of series inversion are series of degree much lower than prec, like
p = 1 + 2*x[0] + 3*x[1] + 4*x[2] + 5*x[3] + R.O(30)
inversion is much faster because in the iteration in the Newton method the expensive multiplication at order next_prec becomes trivial. It remains a square of a polynomial of degree roughly next_prec/2 . In this case this patch does not change speed; in this ticket there is a lot of benchmarks for this case, see bench1.sage
Finally I point out that in ticket #10720 I wrote a similar but slower
version of __invert__
for series on rings of polynomials; I will improve
that version, but notice that the speedup of __invert__
in a generic
series on rings of polynomials is much smaller than here.
Changed 8 years ago by
comment:8 follow-up: ↓ 9 Changed 8 years ago by
The attached trac_10532_square_trunc.patch is applied after #1956 patches, trac_1956_faster_MPowerSeries_mul2.patch and trac_10532_invert.patch
With this patch truncated multiplication is optimized for squares.
Benchmark comparison with PARI/GP:
sage: N = 4 sage: R = PowerSeriesRing(QQ,N,'x') sage: x = R.gens() sage: sx = sum(x) sage: prec = 20 sage: n = 1 sage: %time p = ((1 + sx + R.O(prec))^-1 + (1 + 2*sx + R.O(prec))^-1)^-n Wall time: 4.06 s
to do the computation with gp
allocate more memory than the one assigned
by default; here the computation
is done directly with the background field
sage: try: ....: gp.allocatemem() ....: except: ....: pass ....: sage: N = 4 sage: prec = 20 sage: n = 1 sage: sx = '+'.join(['x%d' % i for i in range(N)]) sage: fmt = '((1 + t*(%s) + O(t^%s))^-1 + (1 + 2*t*(%s) + O(t^%s))^-1)^-%s' sage: %time p1 = gp(fmt %(sx,prec,sx,prec,n)) Wall time: 1.56 s (0) version of ticket #1956 (1) with trac_10532_invert.patch (2) with trac_10532_square_trunc.patch (3) with gp (PARI/GP) N prec n (0) (1) (2) (3) 2 30 1 1.2 0.10 0.10 0.06 2 30 20 8.6 0.22 0.18 0.12 2 50 1 21 1.2 1.2 0.55 2 50 20 213 2.1 1.8 1.2 4 15 1 42 0.46 0.46 0.22 4 15 20 323 1.2 0.96 0.43 4 20 1 673 4.1 4.1 1.6 4 20 20 >7000 10 8.1 3.2
comment:9 in reply to: ↑ 8 Changed 8 years ago by
Replying to pernici:
(0) version of ticket #1956 (1) with trac_10532_invert.patch (2) with trac_10532_square_trunc.patch (3) with gp (PARI/GP)
This looks *really good* -- but I'm a little confused by the comparison with gp
. . . If that's always faster, should we just always do the multiplication there? There must be more subtleties involved, but I don't know *anything* about them :(
comment:10 follow-up: ↓ 11 Changed 8 years ago by
niles wrote:
This looks *really good* -- but I'm a little confused by the comparison
with
gp
. . . If that's always faster, should we just always do the multiplication there? There must be more subtleties involved, but I don't know *anything* about them :(
Well, I don't know either, I just looked at it today.
I think that if one would like to use PARI, one should avoid gp
as much as possible and call directly the PARI C library functions;
in fact gp
has a big overhead
sage: g = gp('1 + (x+y)*t + O(t^6)') sage: %timeit g^2 25 loops, best of 3: 16 ms per loop sage: R.<x,y> = QQ[[]] sage: p = 1 + x + y + R.O(6) sage: %timeit p^2 625 loops, best of 3: 56.5 µs per loop
so that it starts to be convenient to use it when the computation is long enough.
For this reason
in the benchmark in the previous mail gp
is slower for small precision
N prec n (0) (2) (3) 2 10 1 0.02 0.01 0.02 2 20 1 0.23 0.03 0.03
In the above benchmarks gp
is at most 2.5x faster;
in those cases the overhead is negligible, so I guess that using directly
the PARI library the speedup would be the same.
comment:11 in reply to: ↑ 10 ; follow-up: ↓ 12 Changed 8 years ago by
Replying to pernici:
In the above benchmarks
gp
is at most 2.5x faster; in those cases the overhead is negligible, so I guess that using directly the PARI library the speedup would be the same.
I see -- maybe -- can this patch be written to use PARI directly by default then? In the cases where it isn't as fast, the computation time should be small anyway, right?
comment:12 in reply to: ↑ 11 ; follow-up: ↓ 13 Changed 8 years ago by
Replying to niles:
Replying to pernici:
In the above benchmarks
gp
is at most 2.5x faster; in those cases the overhead is negligible, so I guess that using directly the PARI library the speedup would be the same.I see -- maybe -- can this patch be written to use PARI directly by default then? In the cases where it isn't as fast, the computation time should be small anyway, right?
It depends on the use cases; a user may need to do usually a lot of computations with low precision series; then it is important that they are done fast; in the rare case in which he needs a high precision computation, maybe he does not care that it takes 3x longer. Another user may care only for long computations.
Another observation: I suspect that total degree multivariate power series implemented with PARI would require changes in the implementation in ticket #1956, not only in this patch; the background power series ring would not be PowerSeriesRing(self.__poly_ring, T)
, but a PARI power series. For certain rings probably one cannot use PARI (e.g. symbolic rings?), then the current implementation should be used instead.
Maybe it would be better to make a specialized multivariate power series class using PARI; this would be the argument of another ticket, if someone is interested in working on it; ticket #1956 and this patch would deal with the generic implementation.
comment:13 in reply to: ↑ 12 Changed 8 years ago by
Replying to pernici:
Another observation: I suspect that total degree multivariate power series implemented with PARI would require changes in the implementation in ticket #1956, not only in this patch; the background power series ring would not be
PowerSeriesRing(self.__poly_ring, T)
, but a PARI power series. For certain rings probably one cannot use PARI (e.g. symbolic rings?), then the current implementation should be used instead.
Maybe it would be better to make a specialized multivariate power series class using PARI; this would be the argument of another ticket, if someone is interested in working on it; ticket #1956 and this patch would deal with the generic implementation.
This sounds like a good idea -- thanks for clarifying. The improvements you've made are certainly worth finishing and including in sage.
Changed 8 years ago by
comment:14 Changed 8 years ago by
The attached trac_10532_send_to_bg.patch fixes _send_to_bg in the case of multivariate series in one variable, see comments 64-70 in ticket #1956
sage: R = PowerSeriesRing(QQ,1,'x') sage: f = R._poly_ring().gens()[0] sage: R._send_to_bg(f + f^2) x*T + x^2*T^2 sage: R.random_element(4) -3/25*x^3 - 3*x^2 + x + O(x)^4
depends on #1956 and
trac_10480_fast_PowerSeries_poly_multiplication2.patch