Opened 22 months ago

Last modified 4 weeks ago

#27492 new defect

Bug in parallelized computations involving symbolic functions

Reported by: egourgoulhon
Owned by:
Priority: major
Milestone:
Component: symbolics
Keywords: parallelization, symbolic functions
Cc: mmancini, rws, mkoeppe, tscrim
Merged in:
Authors:
Reviewers:
Report Upstream: N/A
Work issues:
Branch:
Commit:
Dependencies:
Stopgaps:

Description (last modified by egourgoulhon)

Sage fails in parallelized computations on tensor fields involving symbolic functions:

sage: Parallelism().set(nproc=2)
sage: M = Manifold(2, 'M')
sage: X.<x,y> = M.chart()
sage: a = function('f')(x)
sage: t = M.tensor_field(0, 2)
sage: t[0,0], t[0,1], t[1,1] = a, x+y, x-y
sage: v = M.vector_field()
sage: v[0], v[1] = 1 + x*y, -x^2
sage: s = t.contract(v)  # parallelized computation occurs here
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/eric/sage/9.2.develop/local/lib/python3.7/site-packages/sage/interfaces/interface.py", line 718, in __init__
...
  File "sage/libs/ecl.pyx", line 352, in sage.libs.ecl.ecl_safe_funcall (build/cythonized/sage/libs/ecl.c:5707)
    raise RuntimeError("ECL says: {}".format(
RuntimeError: ECL says: THROW: The catch MACSYMA-QUIT is undefined.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eric/sage/9.2.develop/local/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "sage/misc/fpickle.pyx", line 100, in sage.misc.fpickle.call_pickled_function (build/cythonized/sage/misc/fpickle.c:2313)
    res = eval("f(*args, **kwds)",sage.all.__dict__, {'args':args, 'kwds':kwds, 'f':f})
  File "<string>", line 1, in <module>
  File "/home/eric/sage/9.2.develop/local/lib/python3.7/site-packages/sage/tensor/modules/comp.py", line 2329, in make_Contraction
    sm += this[[ind_s]] * other[[ind_o]]
...
  File "/home/eric/sage/9.2.develop/local/lib/python3.7/site-packages/sage/interfaces/interface.py", line 720, in __init__
    raise TypeError(x)
TypeError: ECL says: THROW: The catch MACSYMA-QUIT is undefined.
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
...
/home/eric/sage/9.2.develop/local/lib/python3.7/multiprocessing/pool.py in next(self, timeout)
    746         if success:
    747             return value
--> 748         raise value
    749 
    750     __next__ = next                    # XXX

TypeError: ECL says: THROW: The catch MACSYMA-QUIT is undefined.

The full error message is attached.

This issue has previously been reported in several places. With the old Python 2 version of Sage, it resulted in a silent error (see the sage-devel posts linked from this ticket).

Attachments (2)

error_message_py3.txt (6.8 KB) - added by egourgoulhon 22 months ago.
error_message_Sage_9.2.beta6.txt (7.2 KB) - added by egourgoulhon 6 months ago.


Change History (23)

Changed 22 months ago by egourgoulhon

comment:1 Changed 22 months ago by egourgoulhon

  • Cc mmancini added

comment:2 Changed 22 months ago by embray

  • Milestone changed from sage-8.7 to sage-8.8

Ticket retargeted after milestone closed (if you don't believe this ticket is appropriate for the Sage 8.8 release please retarget manually)

comment:3 Changed 19 months ago by embray

  • Milestone sage-8.8 deleted

As the Sage-8.8 release milestone is pending, we should delete the sage-8.8 milestone for tickets that are not actively being worked on or that still require significant work to move forward. If you feel that this ticket should be included in the next Sage release at the soonest please set its milestone to the next release milestone (sage-8.9).

comment:4 Changed 15 months ago by nbruin

To help you along with this: the problem is probably that the existence of the symbolic function 'f' isn't properly coordinated across inter-process communication, so that a new symbolic function, also called 'f', is created at some point:

sage: a_new=s0.operands()[2]
sage: a
f(x)
sage: a_new
f(x)
sage: a - a
0
sage: a - a_new #this indicates something fishy
-f(x) + f(x)
sage: bool( a==a_new)
True
sage: s0.subs({a_new: a})
-x^3 - (x^2 - x*f(x))*y + f(x)
sage: s0.subs({a_new: a}).coefficient(a,1)
x*y + 1

It may be pickling that's (part of) the problem:

sage: function('f')(x)-function('f')(x)
0
sage: function('f')(x)-SR('f(x)')
0
sage: loads(dumps(SR('f(x)')))-SR('f(x)')
0
sage: function('f')(x)-SR('f(x)')
f(x) - f(x)
sage: loads(dumps(function('f')(x)))-function('f')(x)
-f(x) + f(x)
sage: loads(dumps(function('f')(x)))-SR('f(x)')
0

As you can see, the different ways of creating a symbolic function start out giving the same result. However, as soon as pickling has been involved, something changes and the results are no longer compatible. From that point onwards, SR('f(x)') seems to produce results consistent with pickling, but function('f')(x) is reliably not compatible (in fact, further investigation shows that function('f')(x) returns non-identical results across different invocations of pickle). So there's some nasty cache wiping/corruption going on somewhere.
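As a rough illustration of the failure mode described above, here is a toy model in plain Python. It is not Sage or Pynac code: the `ToyFunction` class and its serial counter are hypothetical stand-ins. The sketch assumes that function identity is tracked by a process-local serial number rather than by name, so that unpickling mints a second function that prints identically but does not compare equal:

```python
import itertools
import pickle

_serials = itertools.count()   # process-local counter (hypothetical)


class ToyFunction:
    """Toy stand-in for a symbolic function; identity is a serial, not the name."""

    def __init__(self, name):
        self.name = name
        self.serial = next(_serials)   # fresh serial on every construction

    def __reduce__(self):
        # Only the name is pickled; unpickling calls ToyFunction(name) again
        return (ToyFunction, (self.name,))

    def __eq__(self, other):
        return isinstance(other, ToyFunction) and self.serial == other.serial

    def __hash__(self):
        return hash(self.serial)

    def __repr__(self):
        return self.name


f = ToyFunction("f")
f_copy = pickle.loads(pickle.dumps(f))

print(repr(f), repr(f_copy))   # both print as: f
print(f == f)                  # True
print(f == f_copy)             # False: same name, different serial
```

In this toy model, a result computed in a worker process and sent back would contain `f_copy` rather than `f`. That is analogous to, though not necessarily the actual cause of, the unsimplified `-f(x) + f(x)` shown above.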

You can see part of it here:

sage: explain_pickle(dumps(function('f')(x)))
pg_Expression = unpickle_global('sage.symbolic.expression', 'Expression')
si = unpickle_newobj(pg_Expression, ())
unpickle_build(si, (0r, ['x'], 'GARC\x03\tfunction\x00class\x00symbol\x00x\x00name\x00seq\x00python\x00f\x00sage_ex\x00\x01\x08\x01\x02\x02\n\x02"\x03\x04\n\x00+\x001\x00"\x07'))
si

As you can see, there's a rather opaque string involved. The unpickle_build call is part of the getstate/setstate protocol, and reading sage.symbolic.expression.Expression.__setstate__ and sage.symbolic.expression.Expression.__getstate__ shows that a Pynac archive is involved. It's not hard to imagine the coordination problems that can arise when using such low-level tools. This needs a Pynac expert. See https://github.com/pynac/pynac/issues/349 (if GitHub is preferred for tracking Pynac issues).
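For readers unfamiliar with the protocol mentioned above, the following plain-Python sketch shows how `__getstate__`/`__setstate__` can hide an object's content inside an opaque byte string. The `Archived` class and its byte format are hypothetical and unrelated to the real Pynac archive format; `pickletools.dis` from the standard library is used here as a rough counterpart to `explain_pickle`:

```python
import io
import pickle
import pickletools


class Archived:
    """Toy object whose pickled state is an opaque byte archive (hypothetical format)."""

    def __init__(self, name="", args=()):
        self.name = name
        self.args = tuple(args)

    def __getstate__(self):
        # Pack everything into NUL-separated bytes, as a low-level library might
        return b"TOY\x00" + "\x00".join((self.name,) + self.args).encode()

    def __setstate__(self, state):
        # The receiving side has to re-interpret the bytes on its own
        fields = state[len(b"TOY\x00"):].decode().split("\x00")
        self.name, self.args = fields[0], tuple(fields[1:])


data = pickle.dumps(Archived("f", ["x"]))
restored = pickle.loads(data)
print(restored.name, restored.args)

# Disassemble the pickle: the state appears as one bytes constant,
# which the BUILD opcode hands to __setstate__
listing = io.StringIO()
pickletools.dis(data, out=listing)
print(listing.getvalue())
```

Because pickle only sees the byte string, any bookkeeping that the archive's producer keeps outside of it (such as a table of known function names) is not transferred, and the consumer has to rebuild it consistently by itself.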

Last edited 15 months ago by nbruin (previous) (diff)

comment:5 Changed 15 months ago by nbruin

  • Cc rws added

comment:6 Changed 15 months ago by egourgoulhon

  • Description modified (diff)

comment:7 follow-up: Changed 14 months ago by gh-DeRhamSource

Since false results are produced, it seems to be a severe bug. On this, I'd suggest either to solve it for milestone 9 or at least to disable the feature or give it a warning in the documentation until a solution is found.

comment:8 in reply to: ↑ 7 Changed 14 months ago by egourgoulhon

Replying to gh-DeRhamSource:

> Since false results are produced, it seems to be a severe bug. On this, I'd suggest either to solve it for milestone 9 or at least to disable the feature or give it a warning in the documentation until a solution is found.

Fortunately, with Python 3 (and hence Sage 9.0), there is no longer any silent error: a TypeError exception is raised instead. So there is no need to disable a feature (parallelization) that is heavily used in tensor calculus on manifolds (see e.g. many of these examples).

Last edited 14 months ago by egourgoulhon (previous) (diff)

comment:9 follow-up: Changed 14 months ago by gh-DeRhamSource

Ah I see. And in conclusion, python2 will not even be supported anymore?

Sorry for interfering then. :)

comment:10 in reply to: ↑ 9 Changed 14 months ago by egourgoulhon

Replying to gh-DeRhamSource:

> Ah I see. And in conclusion, python2 will not even be supported anymore?

Python 2 is almost dead: https://pythonclock.org/

comment:11 Changed 6 months ago by gh-mjungmath

  • Cc mkoeppe egourgoulhon tscrim added

comment:12 Changed 6 months ago by gh-mjungmath

  • Cc egourgoulhon removed

comment:13 Changed 6 months ago by mkoeppe

What about this ticket?

comment:14 Changed 6 months ago by gh-mjungmath

This bug is really annoying in terms of applications. You need parallelization to speed things up, especially in this case. I thought maybe you have some ideas to contribute. If not, my apologies.

comment:15 follow-up: Changed 6 months ago by mkoeppe

Could you try to update & clarify the ticket description please? It's OK to delete everything related to python2. We no longer support it.

comment:16 Changed 6 months ago by egourgoulhon

  • Description modified (diff)

Changed 6 months ago by egourgoulhon

comment:17 follow-up: Changed 5 months ago by mkoeppe

Well, there are lots of libraries involved in symbolics, so lots of questions.

I guess the first question to ask is whether the parallelization that you use involves several threads in one process, or several processes.

The reported error comes from Maxima, which runs in ECL. Is ECL prepared and configured for threadsafe operation? Are the parts of Maxima that the symbolics code uses threadsafe? (The assumptions code that I touched in #30074 is definitely not threadsafe.)

comment:18 Changed 4 weeks ago by egourgoulhon

See #31047 for a related issue.

comment:19 Changed 4 weeks ago by egourgoulhon

  • Description modified (diff)

comment:20 in reply to: ↑ 15 Changed 4 weeks ago by egourgoulhon

Replying to mkoeppe:

> Could you try to update & clarify the ticket description please? It's OK to delete everything related to python2. We no longer support it.

Done.

comment:21 in reply to: ↑ 17 Changed 4 weeks ago by egourgoulhon

Replying to mkoeppe:

> Well, there are lots of libraries involved in symbolics, so lots of questions.
>
> I guess the first question to ask is whether the parallelization that you use involves several threads in one process, or several processes.

The parallelization is based on the Python module multiprocessing, so it involves several processes.

> The reported error comes from Maxima, which runs in ECL. Is ECL prepared and configured for threadsafe operation? Are the parts of Maxima that the symbolics code uses threadsafe? (The assumptions code that I touched in #30074 is definitely not threadsafe.)

There is no error when the symbolic expressions involved do not contain any symbolic function, so I would say that this is not an issue with Maxima and parallelization, but rather an issue with Sage's symbolic functions (cf. comment:4 and #31047).
