Add Unicode support to the doctesting framework
Description
I have fixed one detail, but Unicode support is still not working, in several distinct ways.
This seems to be broken on lines having unicode characters in comments such as:
sage: 0 + 0 # c'est trop bête
Some doctests come from the notebook; I made a pull request there.
A failing doctest in src/sage/misc/rest_index_of_methods.py is related to the presence of "Lov\xc3\xa1sz theta"
sage: md=hashlib.md5()
sage: md.update(u'été')
UnicodeEncodeError Traceback (most recent call last)
<ipythoninput96a9959b36b47> in <module>()
> 1 md.update(u'été')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
solution: encode ?
sage: md.update(u'été'.encode('utf8'))
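For reference, a minimal sketch of the encode-before-hashing idea. The helper name `md5_of_text` and the default of UTF-8 are assumptions made for illustration, not something taken from the Sage sources:

```python
# -*- coding: utf-8 -*-
import hashlib


def md5_of_text(text, encoding="utf-8"):
    """Return the hex MD5 digest of a text string (hypothetical helper)."""
    md = hashlib.md5()
    # hashlib works on bytes, so encode explicitly instead of relying
    # on the implicit ASCII codec that raised UnicodeEncodeError above.
    md.update(text.encode(encoding))
    return md.hexdigest()
```

The same explicit encoding step should apply anywhere a unicode string is fed to a digest object.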
 another problem in src/sage/misc/cachefunc.pyx, with a µ (mu) character:
398: 625 loops, best of 3: 1.3 µs per loop
and same in src/sage/combinat/designs/evenly_distributed_sets.pyx and in src/sage/structure/list_clone_timings.py
 in src/sage_setup/autogen/pari/doc.py, some unicode strings with no u prefix:
113: doc = doc.replace("@[pm]", "±")
115: doc = doc.replace("@[agrave]", "à")
116: doc = doc.replace("@[aacute]", "á")
117: doc = doc.replace("@[eacute]", "é")
118: doc = doc.replace("@[ouml]", "ö")
119: doc = doc.replace("@[uuml]", "ü")
120: doc = doc.replace("\\'{a}", "á")
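One possible fix for the doc.py case, sketched on a shortened version of the replacements above: mark both the pattern and the replacement as unicode literals. The function name `replace_accents` is made up for illustration, and the sketch assumes `doc` is already a unicode string:

```python
# -*- coding: utf-8 -*-


def replace_accents(doc):
    """Replace a few PARI markup escapes by unicode characters.

    Hypothetical helper; ``doc`` is assumed to be a unicode string.
    """
    # The u prefix makes these unicode literals on Python 2;
    # Python 3.3+ accepts the prefix and treats it as a no-op.
    doc = doc.replace(u"@[pm]", u"±")
    doc = doc.replace(u"@[eacute]", u"é")
    doc = doc.replace(u"@[ouml]", u"ö")
    return doc
```

An alternative with the same effect on Python 2 would be `from __future__ import unicode_literals` at the top of the module.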
Looks fine to me, though I did not test it. I'm amazed there isn't anything in the stdlib to check the PEP 263 coding line, but if there is I couldn't find it.
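For what it's worth, on Python 3 the stdlib does have this: `tokenize.detect_encoding` reads the BOM and the PEP 263 coding line. A small sketch, where the wrapper name `source_encoding` is only for illustration:

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding declared by a PEP 263 coding line (or BOM).

    Hypothetical wrapper around tokenize.detect_encoding (Python 3).
    """
    # detect_encoding wants a readline callable that yields bytes.
    readline = io.BytesIO(source_bytes).readline
    encoding, _first_lines = tokenize.detect_encoding(readline)
    return encoding
```

Without a coding line or BOM, the function reports the Python 3 default, "utf-8".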
This will be nice to have. It reminds me of my thought the other day to propose adding a few unicode constants (and the ability to tab-complete them easily) to Sage (e.g. for pi). Has that been proposed before?
The patchbot is not happy at all !
One example of failure:
File "src/doc/ja/a_tour_of_sage/index.rst", line 39, in doc.ja.a_tour_of_sage.index
Failed example:
    x = var('x') # 記号変数を定義
Exception raised:
    Traceback (most recent call last):
      File "/home/chapoton/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 520, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/chapoton/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 896, in compile_and_execute
        self.update_digests(example)
      File "/home/chapoton/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 783, in update_digests
        self.running_global_digest.update(s)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 20-26: ordinal not in range(128)
Another kind of example:
File "src/sage/ext/fast_callable.pyx", line 34, in sage.ext.fast_callable
Failed example:
    timeit('wilk.subs(x=30)') # random, long time
Exception raised:
    Traceback (most recent call last):
      File "/home/chapoton/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 520, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/chapoton/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 883, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.ext.fast_callable[5]>", line 1, in <module>
        timeit('wilk.subs(x=30)') # random, long time
      File "sage/misc/sage_timeit_class.pyx", line 118, in sage.misc.sage_timeit_class.SageTimeit.__call__ (build/cythonized/sage/misc/sage_timeit_class.c:1411)
        print(self.eval(code, globals, preparse=preparse, **kwds))
      File "/home/chapoton/sage/local/lib/python2.7/codecs.py", line 369, in write
        data, consumed = self.encode(object, self.errors)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 26: ordinal not in range(128)
Somewhat reminiscent of https://trac.sagemath.org/ticket/22756#comment:14 although the Japanese text is definitely unicode and should be recognized as such by Python.
Fewer failing doctests, but still many. Is there an expert who could help?
Regarding this commit, and others like it: is this not going to be a problem on Python 3, where the strings won't be repr'd as u'...'?
In other projects I've modified the doctest checker to ignore the u marker in strings by default. If a test needs to explicitly check whether or not a string is unicode there are other ways. But most of the time that isn't relevant.
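To make the idea concrete, here is a rough sketch of such a checker. It is not the code from those other projects; the class name and the regular expression are assumptions for illustration, and the regex is deliberately simple:

```python
import doctest
import re

# A u/U directly before a quote, not preceded by a word character or a
# quote (so the u in 'menu' or the string 'u' itself is left alone).
_U_PREFIX = re.compile(r"(?<![\w'\"])[uU](?=['\"])")


class IgnoreUnicodePrefixChecker(doctest.OutputChecker):
    """Compare doctest output after dropping u'' string prefixes."""

    def check_output(self, want, got, optionflags):
        want = _U_PREFIX.sub("", want)
        got = _U_PREFIX.sub("", got)
        return doctest.OutputChecker.check_output(
            self, want, got, optionflags)
```

An instance can be handed to `doctest.DocTestRunner(checker=...)`, so that the same expected output passes whether strings are repr'd with or without the prefix.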
Green Bot ! Please review !
I am not sure that this is the definitive fix, but at least good progress.
This needs to be tested also on Mac and Cygwin, and this I cannot do.
I don't think there's anything about this that specifically needs testing on Cygwin. I've never encountered any issues on Cygwin particularly pertaining to unicode in Python.
comment:50 Changed 5 years ago by
This is maybe a minor point, but here:

src/sage/doctest/forker.py
diff --git a/src/sage/doctest/forker.py b/src/sage/doctest/forker.py
index 9a5d982..705b718 100644
@@ class SageDocTestRunner(doctest.DocTestRunner):
             finally:
                 if self.debugger is not None:
                     self.debugger.set_continue() # ==== Example Finished ====
-            got = self._fakeout.getvalue()  # the actual output
+            got = self._fakeout.getvalue().decode('utf8')
+            # the actual output
it might be good if this were wrapped in a try/except, and fall back on latin1 if utf8 decoding fails, like:
    got = self._fakeout.getvalue()
    try:
        got = got.decode('utf8')
    except UnicodeDecodeError:
        got = got.decode('latin1')
Of course this shouldn't happen, but in the off chance some test produces garbage output, this will at least prevent the doctest runner from crashing with a UnicodeDecodeError. Instead you'll get some text, likely wrong, that can still be compared to the expected result.
comment:51 Changed 5 years ago by
Likewise here:

src/sage/interfaces/r.py
diff --git a/src/sage/interfaces/r.py b/src/sage/interfaces/r.py
index 9db603f..e9096df 100644
@@ class R(ExtraTabCompletion, Expect):
             ...
             ImportError: ...
         """
-        ret = self.eval('require("%s")'%library_name)
+        ret = self.eval('require("%s")'%library_name).decode('utf8')
         # try hard to parse the message string in a locale-independent way
         if ' library(' in ret:       # locale-independent keyword
             raise ImportError("%s"%ret)
In this case I'm not really sure what the best thing is to do if there's an encoding error, but it should probably be handled. Will have to give that more thought. Basically any time you're decoding bytes coming from an external source it's important not to assume they'll always have the expected encoding :(
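One option, offered only as a sketch and not as a decision for the R interface: decode with an error handler, so that unexpected bytes become replacement characters instead of raising. The helper name `decode_external` is hypothetical:

```python
def decode_external(raw, encoding="utf-8"):
    """Decode bytes from an external program without ever raising.

    Hypothetical helper: bytes that are invalid in ``encoding`` are
    turned into U+FFFD replacement characters.
    """
    return raw.decode(encoding, errors="replace")
```

This trades a crash for possibly mangled text, which may or may not be acceptable when the result is parsed afterwards.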
comment:52 Changed 5 years ago by
The rest makes sense to me. I haven't done a thorough investigation into all the places where Sage needs better unicode handling, and I'm sure this isn't exhaustive either, but I know you've done a lot of testing. The next issue is just what the patchbots will say....
comment:53 Changed 5 years ago by
patchbots said green already..
EDIT oh, no.. I missed the last report.
comment:55 followup: 57 Changed 5 years ago by
Damn, some of the patchbots see some problems with the gap interface, that I do not see locally..
comment:57 Changed 5 years ago by
Replying to chapoton:
Damn, some of the patchbots see some problems with the gap interface, that I do not see locally..
It probably has something to do with their locale settings. This is why I suggested not just assuming .decode('utf8') will work :) Now the question is, how exactly does GAP determine what encoding to use...
comment:58 Changed 5 years ago by
Ah, I forgot to point that out here:

src/sage/interfaces/gap.py
diff --git a/src/sage/interfaces/gap.py b/src/sage/interfaces/gap.py
index dd7b27f..7e765a1 100644
@@ from sage.interfaces.tab_completion import ExtraTabCompletion
 from sage.structure.element import ModuleElement
 import re
 import os
+import io
 import pexpect
 import time
 import platform
@@ class Gap(Gap_generic):
             (sline,) = match.groups()
             if self.is_remote():
                 self._get_tmpfile()
-            F = open(self._local_tmpfile(),"r")
+            F = io.open(self._local_tmpfile(), "r", encoding='utf8')
             help = F.read()
             if pager:
                 from IPython.core.page import page
Let's see if we can determine what encoding GAP is actually using...
comment:59 Changed 5 years ago by
Maybe we should force the value of "GAPInfo.TermEncoding"?
see https://www.gapsystem.org/Manuals/pkg/GAPDoc1.5.1/doc/chap5.html
Maybe using GAPInfo.TermEncoding := "UTF8";
I know nothing of GAP, some input from a GAP expert could be useful here..
comment:60 Changed 5 years ago by
I can reproduce the failing doctests locally by adding one line:
  self.eval('SetGAPDocTextTheme("none")')
+ self.eval('GAPInfo.TermEncoding := "latin1";')
  self.eval(r'\$SAGE.tempfile := "%s";'%tmp_to_use)
comment:61 Changed 5 years ago by
Interesting. It's not clear to me where TermEncoding is used in the process of displaying some help docs. But if you were able to reproduce by setting that then it must be involved somewhere, somehow...
comment:64 Changed 5 years ago by
See the section 5.32 GAPDoc2Text in the link https://www.gapsystem.org/Manuals/pkg/GAPDoc1.5.1/doc/chap5.html
I think this explains what GAP does when its documentation is requested as text.
comment:65 Changed 5 years ago by
Great, that should explain it then.
Hmm, maybe rather than forcing TermEncoding to a specific value we should read TermEncoding and use that to determine how to decode text from GAP.
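A rough sketch of what that could look like on the Python side. It assumes the value of GAPInfo.TermEncoding has already been read from GAP as a Python string (how to do that is not shown here), and that GAP's encoding names are ones Python's codec registry recognizes, which would need checking. The function name is hypothetical:

```python
import codecs


def decode_gap_text(raw, term_encoding, fallback="utf-8"):
    """Decode bytes from GAP using the encoding GAP reports.

    Hypothetical helper: ``term_encoding`` is assumed to be the string
    value of GAPInfo.TermEncoding, obtained elsewhere.
    """
    try:
        # Normalize the name and check that Python knows this codec.
        codec = codecs.lookup(term_encoding).name
    except LookupError:
        codec = fallback
    # Never raise on stray bytes; mark them instead.
    return raw.decode(codec, errors="replace")
```

With this approach the patchbots' locale settings would no longer matter, as long as GAP reports its encoding truthfully.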
comment:66 Changed 5 years ago by
Can you replace the unconditional except: with something more sensible? Probably just name = str(name) instead of the entire try/except + assertion. Rest lgtm
comment:72 Changed 5 years ago by
Thanks Volker for having a look.
I have made the except more precise. I guess one could also have removed the try/except.
Looks good. Sorry that this review took so long!
The R error was fixed by unsetting R_HOME before running the test. Then I applied the patch and here is the result