Opened 10 years ago
Closed 5 years ago
#14153 closed defect (fixed)
Add Unicode support to the doctesting framework
Reported by:  Jeroen Demeyer  Owned by:  David Roe 

Priority:  major  Milestone:  sage8.1 
Component:  doctest framework  Keywords:  python3, unicode 
Cc:  Leif Leonhardy, Erik Bray, Jeroen Demeyer, Florent Hivert  Merged in:  
Authors:  Frédéric Chapoton  Reviewers:  Erik Bray, Volker Braun, David Roe 
Report Upstream:  N/A  Work issues:  
Branch:  ed1e3b5 (Commits, GitHub, GitLab)  Commit:  ed1e3b5bf17e6cc923e28b11862496dd0e1b9b04 
Dependencies:  Stopgaps: 
Description (last modified by )
Attachments (1)
Change History (79)
comment:1 Changed 10 years ago by
Description:  modified (diff) 

Changed 10 years ago by
Attachment:  doctest_unicode.patch added 

comment:2 Changed 10 years ago by
Cc:  Leif Leonhardy added 

comment:3 Changed 10 years ago by
comment:4 Changed 10 years ago by
Component:  doctest → doctest framework 

Owner:  changed from Minh Van Nguyen to David Roe 
comment:5 Changed 9 years ago by
Milestone:  sage5.11 → sage5.12 

comment:6 Changed 9 years ago by
Milestone:  sage6.1 → sage6.2 

comment:7 Changed 9 years ago by
Milestone:  sage6.2 → sage6.3 

comment:8 Changed 8 years ago by
Milestone:  sage6.3 → sage6.4 

comment:9 Changed 6 years ago by
Keywords:  python3 added 

comment:10 Changed 6 years ago by
Milestone:  sage6.4 → sage7.4 

comment:11 Changed 6 years ago by
Branch:  → public/14153 

Commit:  → cffdb9077dc8074a0cde0d0cd21d19046b2a95d5 
just made a branch from the old patch, not tried to test it
New commits:
cffdb90  trac 14153 unicode in documentation

comment:12 Changed 6 years ago by
Commit:  cffdb9077dc8074a0cde0d0cd21d19046b2a95d5 → 1712c87b8afdb2543e593b6e44165aefc52fdef5 

comment:13 Changed 6 years ago by
I have repaired a detail, but this is not working, in several distinct ways.
This seems to be broken on lines having unicode characters in comments such as:
sage: 0 + 0 # c'est trop bête
Some doctests come from the notebook, I made a pull request.
A failing doctest in src/sage/misc/rest_index_of_methods.py is related to the presence of "Lov\xc3\xa1sz theta"
comment:14 Changed 6 years ago by
seems that something like that is hurting us (in forker):
sage: md=hashlib.md5() sage: md.update(u'été')  UnicodeEncodeError Traceback (most recent call last) <ipythoninput96a9959b36b47> in <module>() > 1 md.update(u'été') UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
solution: encode ?
sage: md.update(u'été'.encode('utf8'))
comment:15 Changed 6 years ago by
 another problem in src/sage/misc/cachefunc.pyx with a mu:
398: 625 loops, best of 3: 1.3 µs per loop
and same in src/sage/combinat/designs/evenly_distributed_sets.pyx and in src/sage/structure/list_clone_timings.py
 in src/sage_setup/autogen/pari/doc.py, some unicode strings with no u:
113: doc = doc.replace("@[pm]", "±") 115: doc = doc.replace("@[agrave]", "à") 116: doc = doc.replace("@[aacute]", "á") 117: doc = doc.replace("@[eacute]", "é") 118: doc = doc.replace("@[ouml]", "ö") 119: doc = doc.replace("@[uuml]", "ü") 120: doc = doc.replace("\\'{a}", "á")
comment:16 Changed 6 years ago by
Commit:  1712c87b8afdb2543e593b6e44165aefc52fdef5 → 8cd268138984060c14b6b43e9ab2604317a9b20a 

Branch pushed to git repo; I updated commit sha1. New commits:
8cd2681  trac 14153 adding uft8 declaration in 2 files

comment:18 Changed 6 years ago by
Cc:  Erik Bray added 

comment:19 Changed 6 years ago by
Commit:  8cd268138984060c14b6b43e9ab2604317a9b20a → 489d13f0cd3640a28c05b42077b307173f0c495a 

Branch pushed to git repo; I updated commit sha1. New commits:
489d13f  Merge branch 'public/14153' in 8.0.b4

comment:20 Changed 6 years ago by
Description:  modified (diff) 

comment:21 Changed 6 years ago by
Keywords:  unicode added 

comment:22 Changed 6 years ago by
Reviewers:  → Erik Bray 

Looks fine to me, though I did not test it. I'm amazed there isn't anything in the stdlib to check the PEP 263 coding line, but if there is I couldn't find it.
This will be nice to have. It reminds me of my thought the other day to propose adding a few unicode constants (and the ability to tabcomplete them easily) to Sage (e.g. for pi). Has that been proposed before?
comment:23 Changed 6 years ago by
I have not tested either, so far. May launch my patchbot on the branch later.
comment:24 Changed 6 years ago by
The patchbot is not happy at all !
One example of failure:
File "src/doc/ja/a_tour_of_sage/index.rst", line 39, in doc.ja.a_tour_of_sage.index Failed example: x = var('x') # 記号変数を定義 Exception raised: Traceback (most recent call last): File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 520, in _run self.compile_and_execute(example, compiler, test.globs) File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 896, in compile_and_execute self.update_digests(example) File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 783, in update_digests self.running_global_digest.update(s) UnicodeEncodeError: 'ascii' codec can't encode characters in position 2026: ordinal not in range(128)
Another kind of example:
File "src/sage/ext/fast_callable.pyx", line 34, in sage.ext.fast_callable Failed example: timeit('wilk.subs(x=30)') # random, long time Exception raised: Traceback (most recent call last): File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 520, in _run self.compile_and_execute(example, compiler, test.globs) File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 883, in compile_and_execute exec(compiled, globs) File "<doctest sage.ext.fast_callable[5]>", line 1, in <module> timeit('wilk.subs(x=30)') # random, long time File "sage/misc/sage_timeit_class.pyx", line 118, in sage.misc.sage_timeit_class.SageTimeit.__call__ (build/cythonized/sage/misc/sage_timeit_class.c:1411) print(self.eval(code, globals, preparse=preparse, **kwds)) File "/home/chapoton/sage/local/lib/python2.7/codecs.py", line 369, in write data, consumed = self.encode(object, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 26: ordinal not in range(128)
comment:25 Changed 6 years ago by
To be clear, is that a problem with the patchbot or the doctest runner itself? It seems more like the latter, but the question is does this happen when running the tests outside the patchbot?
comment:26 Changed 6 years ago by
Somewhat reminiscent of https://trac.sagemath.org/ticket/22756#comment:14 although the japanese stuff is definitely unicode and should be recognized as such by python.
comment:28 Changed 6 years ago by
Commit:  489d13f0cd3640a28c05b42077b307173f0c495a → b9295dca7f1e4c7caa4f6173457a7560f500de40 

Branch pushed to git repo; I updated commit sha1. New commits:
b9295dc  trac 14153 encoding required

comment:29 followup: 32 Changed 6 years ago by
Commit:  b9295dca7f1e4c7caa4f6173457a7560f500de40 → be11f06e98fc93888967ae991f4ea91ce33c50e0 

Branch pushed to git repo; I updated commit sha1. New commits:
be11f06  trac 14153 fixing some doctests

comment:30 Changed 6 years ago by
Cc:  Jeroen Demeyer Florent Hivert added 

Less failing doctests, but still many. Is there an expert that could help ?
comment:31 Changed 6 years ago by
Depends on exactly what expertise is needed, but I'll have a look.
comment:32 Changed 6 years ago by
Replying to git:
Branch pushed to git repo; I updated commit sha1. New commits:
be11f06 trac 14153 fixing some doctests
Regarding this commit, and others like itis this not going to be a problem on Python 3, where the strings won't be repr'd as u'...'
?
In other projects I've modified the doctest checker to ignore the u
marker in strings by default. If a test needs to explicitly check whether or not a string is unicode there are other ways. But most of the time that isn't relevant.
comment:33 Changed 5 years ago by
Commit:  be11f06e98fc93888967ae991f4ea91ce33c50e0 → 0c611c5ad8aad31d5403155c40d87f2fb9e087ef 

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
0c611c5  trac 14153 unicode in documentation

comment:34 Changed 5 years ago by
Branch:  public/14153 → u/chapoton/14153 

Commit:  0c611c5ad8aad31d5403155c40d87f2fb9e087ef → 080916e00f751500b75fba8c992a7c699ca61ed2 
comment:35 Changed 5 years ago by
Commit:  080916e00f751500b75fba8c992a7c699ca61ed2 → f3d375ed3aaa3b2c72c9fa2b30b89ac78d0d85f8 

comment:36 Changed 5 years ago by
Commit:  f3d375ed3aaa3b2c72c9fa2b30b89ac78d0d85f8 → 7906a504011eb32f7890c0750d3d987568576655 

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
7906a50  trac 14153 unicode in documentation

comment:37 Changed 5 years ago by
Commit:  7906a504011eb32f7890c0750d3d987568576655 → 1813a63c6e37170300f76df3a8e818b6289b6a63 

Branch pushed to git repo; I updated commit sha1. New commits:
1813a63  trac 14153 some details

comment:38 Changed 5 years ago by
Branch:  u/chapoton/14153 → u/chapoton/14153v2 

Commit:  1813a63c6e37170300f76df3a8e818b6289b6a63 → e3497e851f155914d54174ad18bd28123f6b37a0 
comment:39 Changed 5 years ago by
Commit:  e3497e851f155914d54174ad18bd28123f6b37a0 → cc03d22738a327999db7162c095daac7d23f9c31 

Branch pushed to git repo; I updated commit sha1. New commits:
cc03d22  motorhead detail

comment:40 Changed 5 years ago by
Commit:  cc03d22738a327999db7162c095daac7d23f9c31 → b091e6a60311ab5756d2fb744e60adabae120a5a 

Branch pushed to git repo; I updated commit sha1. New commits:
b091e6a  trac 14153 fixing another doctest

comment:42 Changed 5 years ago by
Commit:  b091e6a60311ab5756d2fb744e60adabae120a5a → 8394952acfe919aba99695f84248edbc212a3c9c 

Branch pushed to git repo; I updated commit sha1. New commits:
8394952  trac 14153 fixing another doctest

comment:43 Changed 5 years ago by
Green Bot ! Please review !
I am not sure that this is the definitive fix, but at least good progress.
This needs to be tested also on Mac and Cygwin, and this I cannot do.
comment:44 Changed 5 years ago by
Authors:  → Frédéric Chapoton 

comment:45 Changed 5 years ago by
Milestone:  sage8.0 → sage8.1 

Priority:  minor → major 
Status:  new → needs_review 
comment:48 Changed 5 years ago by
I don't think there's anything about this that specifically needs testing on Cygwin. I've never encountered any issues on Cygwin particularly pertaining to unicode in Python.
comment:50 Changed 5 years ago by
This is maybe a minor point, but here:

src/sage/doctest/forker.py
diff git a/src/sage/doctest/forker.py b/src/sage/doctest/forker.py index 9a5d982..705b718 100644
a b class SageDocTestRunner(doctest.DocTestRunner): 514 514 finally: 515 515 if self.debugger is not None: 516 516 self.debugger.set_continue() # ==== Example Finished ==== 517 got = self._fakeout.getvalue() # the actual output 517 got = self._fakeout.getvalue().decode('utf8') 518 # the actual output 519
it might be good if this were wrapped in a try/except, and fall back on latin1 if utf8 decoding fails, like:
got = self._fakeout.getvalue() try: got = got.decode('utf8') except UnicodeDecodeError: got = got.decode('latin1')
Of course this shouldn't happen, but in the off chance some test produces garbage output, this will at least prevent the doctest runner from crashing with a UnicodeDecodeError
. Instead you'll get some text, likely wrong, that can still be compared to the expected result.
comment:51 Changed 5 years ago by
Likewise here:

src/sage/interfaces/r.py
diff git a/src/sage/interfaces/r.py b/src/sage/interfaces/r.py index 9db603f..e9096df 100644
a b class R(ExtraTabCompletion, Expect): 668 668 ... 669 669 ImportError: ... 670 670 """ 671 ret = self.eval('require("%s")'%library_name) 671 ret = self.eval('require("%s")'%library_name).decode('utf8') 672 672 # try hard to parse the message string in a localeindependent way 673 673 if ' library(' in ret: # localeindependent keyword 674 674 raise ImportError("%s"%ret)
In this case I'm not really sure what the best thing is to do if there's an encoding error, but it should probably be handled. Will have to give that more thought. Basically any time you're decoding bytes coming from an external source it's important not to assume they'll always have the expected encoding :(
comment:52 Changed 5 years ago by
The rest makes sense to me. I haven't done a thorough investigation into all the places where Sage needs better unicode handling, and I'm sure this isn't exhaustive either, but I know you've done a lot of testing. The next issue is just what the patchbots will say....
comment:53 Changed 5 years ago by
patchbots said green already..
EDIT oh, no.. I missed the last report.
comment:54 Changed 5 years ago by
Commit:  8394952acfe919aba99695f84248edbc212a3c9c → 4dc084e452d83fe8bf80b105d25b404a21092be7 

comment:55 followup: 57 Changed 5 years ago by
Status:  needs_review → needs_work 

Damn, some of the patchbots see some problems with the gap interface, that I do not see locally..
comment:57 Changed 5 years ago by
Replying to chapoton:
Damn, some of the patchbots see some problems with the gap interface, that I do not see locally..
It probably has something to do with their locale settings. This is why I suggested not just assuming .decode('utf8')
will work :) Now the question is, how exactly does GAP determine what encoding to use...
comment:58 Changed 5 years ago by
Ah, I forgot to point that out here:

src/sage/interfaces/gap.py
diff git a/src/sage/interfaces/gap.py b/src/sage/interfaces/gap.py index dd7b27f..7e765a1 100644
a b from sage.interfaces.tab_completion import ExtraTabCompletion 188 188 from sage.structure.element import ModuleElement 189 189 import re 190 190 import os 191 import io 191 192 import pexpect 192 193 import time 193 194 import platform … … class Gap(Gap_generic): 1339 1340 (sline,) = match.groups() 1340 1341 if self.is_remote(): 1341 1342 self._get_tmpfile() 1342 F = open(self._local_tmpfile(),"r")1343 F = io.open(self._local_tmpfile(), "r", encoding='utf8') 1343 1344 help = F.read() 1344 1345 if pager: 1345 1346 from IPython.core.page import page
Let's see if we can determine what encoding GAP is actually using...
comment:59 Changed 5 years ago by
Maybe we should force the value of "GAPInfo.TermEncoding"
?
see https://www.gapsystem.org/Manuals/pkg/GAPDoc1.5.1/doc/chap5.html
Maybe using GAPInfo.TermEncoding := "UTF8";
I know nothing of GAP, some input from a GAP expert could be useful here..
comment:60 Changed 5 years ago by
I can reproduce the failing doctests locally by adding one line
self.eval('SetGAPDocTextTheme("none")') + self.eval('GAPInfo.TermEncoding := "latin1";') self.eval(r'\$SAGE.tempfile := "%s";'%tmp_to_use)
comment:61 Changed 5 years ago by
Commit:  4dc084e452d83fe8bf80b105d25b404a21092be7 → 1c0dd43920d5c81b164992fbaf9ecd9f81a3154d 

Branch pushed to git repo; I updated commit sha1. New commits:
1c0dd43  trac 14153 trying to fix the gap doc doctests

comment:63 Changed 5 years ago by
Interesting. It's not clear to me where TermEncoding
is used in the process of displaying some help docs. But if you were able to reproduce by setting that then it must be involved somewhere, somehow...
comment:64 Changed 5 years ago by
see the section 5.32 GAPDoc2Text
in the link https://www.gapsystem.org/Manuals/pkg/GAPDoc1.5.1/doc/chap5.html
I think this explain what GAP does when its doc is required as a text.
comment:65 Changed 5 years ago by
Great, that should explain it then.
Hmmmaybe rather than forcing TermEncoding
to a specific value we should read TermEncoding
and use that to determine how to decode text from GAP.
comment:66 Changed 5 years ago by
Commit:  1c0dd43920d5c81b164992fbaf9ecd9f81a3154d → 0faa9436561f00cf1ed5b9c8a65ca4667032f2dd 

Branch pushed to git repo; I updated commit sha1. New commits:
0faa943  trac 14153 detect gap encoding

comment:68 Changed 5 years ago by
Status:  needs_work → needs_review 

comment:71 Changed 5 years ago by
Can you replace the unconditional except: with something more sensible? Probably just name = str(name) instead of the entire try/except + assertion. Rest lgtm
comment:72 Changed 5 years ago by
Commit:  0faa9436561f00cf1ed5b9c8a65ca4667032f2dd → ed1e3b5bf17e6cc923e28b11862496dd0e1b9b04 

comment:73 Changed 5 years ago by
Thanks Volker for having a look.
I have made a more precise except. I guess one could also have removed the tryexcept.
comment:75 Changed 5 years ago by
Reviewers:  Erik Bray → Erik Bray, Volker Bruan, David Roe 

Status:  needs_review → positive_review 
Looks good. Sorry that this review took so long!
comment:76 Changed 5 years ago by
Reviewers:  Erik Bray, Volker Bruan, David Roe → Erik Bray, Volker Braun, David Roe 

comment:78 Changed 5 years ago by
Branch:  u/chapoton/14153v2 → ed1e3b5bf17e6cc923e28b11862496dd0e1b9b04 

Resolution:  → fixed 
Status:  positive_review → closed 
I just tried this patch on the suspicion it could help with the test weirdness I reported in 5.9.beta0 on sagerelease
The R error was fixed by unsetting R_HOME before running the test. Then I applied the patch and here is the result