Opened 6 years ago
Closed 2 years ago
#14153 closed defect (fixed)
Add Unicode support to the doctesting framework
Reported by:  jdemeyer  Owned by:  roed 

Priority:  major  Milestone:  sage8.1 
Component:  doctest framework  Keywords:  python3, unicode 
Cc:  leif, embray, jdemeyer, hivert  Merged in:  
Authors:  Frédéric Chapoton  Reviewers:  Erik Bray, Volker Braun, David Roe 
Report Upstream:  N/A  Work issues:  
Branch:  ed1e3b5 (Commits)  Commit:  ed1e3b5bf17e6cc923e28b11862496dd0e1b9b04 
Dependencies:  Stopgaps: 
Description (last modified by )
Attachments (1)
Change History (79)
comment:1 Changed 6 years ago by
 Description modified (diff)
Changed 6 years ago by
comment:2 Changed 6 years ago by
 Cc leif added
comment:3 Changed 6 years ago by
comment:4 Changed 6 years ago by
 Component changed from doctest to doctest framework
 Owner changed from mvngu to roed
comment:5 Changed 6 years ago by
 Milestone changed from sage5.11 to sage5.12
comment:6 Changed 5 years ago by
 Milestone changed from sage6.1 to sage6.2
comment:7 Changed 5 years ago by
 Milestone changed from sage6.2 to sage6.3
comment:8 Changed 5 years ago by
 Milestone changed from sage6.3 to sage6.4
comment:9 Changed 3 years ago by
 Keywords python3 added
comment:10 Changed 3 years ago by
 Milestone changed from sage6.4 to sage7.4
comment:11 Changed 3 years ago by
 Branch set to public/14153
 Commit set to cffdb9077dc8074a0cde0d0cd21d19046b2a95d5
just made a branch from the old patch, not tried to test it
New commits:
cffdb90  trac 14153 unicode in documentation

comment:12 Changed 2 years ago by
 Commit changed from cffdb9077dc8074a0cde0d0cd21d19046b2a95d5 to 1712c87b8afdb2543e593b6e44165aefc52fdef5
comment:13 Changed 2 years ago by
I have repaired a detail, but this is not working, in several distinct ways.
This seems to be broken on lines having unicode characters in comments such as:
sage: 0 + 0 # c'est trop bête
Some doctests come from the notebook, I made a pull request.
A failing doctest in src/sage/misc/rest_index_of_methods.py is related to the presence of "Lov\xc3\xa1sz theta"
comment:14 Changed 2 years ago by
seems that something like that is hurting us (in forker):
sage: md=hashlib.md5() sage: md.update(u'été')  UnicodeEncodeError Traceback (most recent call last) <ipythoninput96a9959b36b47> in <module>() > 1 md.update(u'été') UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
solution: encode ?
sage: md.update(u'été'.encode('utf8'))
comment:15 Changed 2 years ago by
 another problem in src/sage/misc/cachefunc.pyx with a mu:
398: 625 loops, best of 3: 1.3 µs per loop
and same in src/sage/combinat/designs/evenly_distributed_sets.pyx and in src/sage/structure/list_clone_timings.py
 in src/sage_setup/autogen/pari/doc.py, some unicode strings with no u:
113: doc = doc.replace("@[pm]", "±") 115: doc = doc.replace("@[agrave]", "à") 116: doc = doc.replace("@[aacute]", "á") 117: doc = doc.replace("@[eacute]", "é") 118: doc = doc.replace("@[ouml]", "ö") 119: doc = doc.replace("@[uuml]", "ü") 120: doc = doc.replace("\\'{a}", "á")
comment:16 Changed 2 years ago by
 Commit changed from 1712c87b8afdb2543e593b6e44165aefc52fdef5 to 8cd268138984060c14b6b43e9ab2604317a9b20a
Branch pushed to git repo; I updated commit sha1. New commits:
8cd2681  trac 14153 adding uft8 declaration in 2 files

comment:17 Changed 2 years ago by
 Milestone changed from sage7.4 to sage8.0
This should be taken care of.
comment:18 Changed 2 years ago by
 Cc embray added
comment:19 Changed 2 years ago by
 Commit changed from 8cd268138984060c14b6b43e9ab2604317a9b20a to 489d13f0cd3640a28c05b42077b307173f0c495a
Branch pushed to git repo; I updated commit sha1. New commits:
489d13f  Merge branch 'public/14153' in 8.0.b4

comment:20 Changed 2 years ago by
 Description modified (diff)
comment:21 Changed 2 years ago by
 Keywords unicode added
comment:22 Changed 2 years ago by
 Reviewers set to Erik Bray
Looks fine to me, though I did not test it. I'm amazed there isn't anything in the stdlib to check the PEP 263 coding line, but if there is I couldn't find it.
This will be nice to have. It reminds me of my thought the other day to propose adding a few unicode constants (and the ability to tabcomplete them easily) to Sage (e.g. for pi). Has that been proposed before?
comment:23 Changed 2 years ago by
I have not tested either, so far. May launch my patchbot on the branch later.
comment:24 Changed 2 years ago by
The patchbot is not happy at all !
One example of failure:
File "src/doc/ja/a_tour_of_sage/index.rst", line 39, in doc.ja.a_tour_of_sage.index Failed example: x = var('x') # 記号変数を定義 Exception raised: Traceback (most recent call last): File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 520, in _run self.compile_and_execute(example, compiler, test.globs) File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 896, in compile_and_execute self.update_digests(example) File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 783, in update_digests self.running_global_digest.update(s) UnicodeEncodeError: 'ascii' codec can't encode characters in position 2026: ordinal not in range(128)
Another kind of example:
File "src/sage/ext/fast_callable.pyx", line 34, in sage.ext.fast_callable Failed example: timeit('wilk.subs(x=30)') # random, long time Exception raised: Traceback (most recent call last): File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 520, in _run self.compile_and_execute(example, compiler, test.globs) File "/home/chapoton/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 883, in compile_and_execute exec(compiled, globs) File "<doctest sage.ext.fast_callable[5]>", line 1, in <module> timeit('wilk.subs(x=30)') # random, long time File "sage/misc/sage_timeit_class.pyx", line 118, in sage.misc.sage_timeit_class.SageTimeit.__call__ (build/cythonized/sage/misc/sage_timeit_class.c:1411) print(self.eval(code, globals, preparse=preparse, **kwds)) File "/home/chapoton/sage/local/lib/python2.7/codecs.py", line 369, in write data, consumed = self.encode(object, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 26: ordinal not in range(128)
comment:25 Changed 2 years ago by
To be clear, is that a problem with the patchbot or the doctest runner itself? It seems more like the latter, but the question is does this happen when running the tests outside the patchbot?
comment:26 Changed 2 years ago by
Somewhat reminiscent of https://trac.sagemath.org/ticket/22756#comment:14 although the japanese stuff is definitely unicode and should be recognized as such by python.
comment:27 Changed 2 years ago by
just fails in the same way outside of the patchbot
comment:28 Changed 2 years ago by
 Commit changed from 489d13f0cd3640a28c05b42077b307173f0c495a to b9295dca7f1e4c7caa4f6173457a7560f500de40
Branch pushed to git repo; I updated commit sha1. New commits:
b9295dc  trac 14153 encoding required

comment:29 followup: ↓ 32 Changed 2 years ago by
 Commit changed from b9295dca7f1e4c7caa4f6173457a7560f500de40 to be11f06e98fc93888967ae991f4ea91ce33c50e0
Branch pushed to git repo; I updated commit sha1. New commits:
be11f06  trac 14153 fixing some doctests

comment:30 Changed 2 years ago by
 Cc jdemeyer hivert added
Less failing doctests, but still many. Is there an expert that could help ?
comment:31 Changed 2 years ago by
Depends on exactly what expertise is needed, but I'll have a look.
comment:32 in reply to: ↑ 29 Changed 2 years ago by
Replying to git:
Branch pushed to git repo; I updated commit sha1. New commits:
be11f06 trac 14153 fixing some doctests
Regarding this commit, and others like itis this not going to be a problem on Python 3, where the strings won't be repr'd as u'...'
?
In other projects I've modified the doctest checker to ignore the u
marker in strings by default. If a test needs to explicitly check whether or not a string is unicode there are other ways. But most of the time that isn't relevant.
comment:33 Changed 2 years ago by
 Commit changed from be11f06e98fc93888967ae991f4ea91ce33c50e0 to 0c611c5ad8aad31d5403155c40d87f2fb9e087ef
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
0c611c5  trac 14153 unicode in documentation

comment:34 Changed 2 years ago by
 Branch changed from public/14153 to u/chapoton/14153
 Commit changed from 0c611c5ad8aad31d5403155c40d87f2fb9e087ef to 080916e00f751500b75fba8c992a7c699ca61ed2
comment:35 Changed 2 years ago by
 Commit changed from 080916e00f751500b75fba8c992a7c699ca61ed2 to f3d375ed3aaa3b2c72c9fa2b30b89ac78d0d85f8
comment:36 Changed 2 years ago by
 Commit changed from f3d375ed3aaa3b2c72c9fa2b30b89ac78d0d85f8 to 7906a504011eb32f7890c0750d3d987568576655
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
7906a50  trac 14153 unicode in documentation

comment:37 Changed 2 years ago by
 Commit changed from 7906a504011eb32f7890c0750d3d987568576655 to 1813a63c6e37170300f76df3a8e818b6289b6a63
Branch pushed to git repo; I updated commit sha1. New commits:
1813a63  trac 14153 some details

comment:38 Changed 2 years ago by
 Branch changed from u/chapoton/14153 to u/chapoton/14153v2
 Commit changed from 1813a63c6e37170300f76df3a8e818b6289b6a63 to e3497e851f155914d54174ad18bd28123f6b37a0
comment:39 Changed 2 years ago by
 Commit changed from e3497e851f155914d54174ad18bd28123f6b37a0 to cc03d22738a327999db7162c095daac7d23f9c31
Branch pushed to git repo; I updated commit sha1. New commits:
cc03d22  motorhead detail

comment:40 Changed 2 years ago by
 Commit changed from cc03d22738a327999db7162c095daac7d23f9c31 to b091e6a60311ab5756d2fb744e60adabae120a5a
Branch pushed to git repo; I updated commit sha1. New commits:
b091e6a  trac 14153 fixing another doctest

comment:41 Changed 2 years ago by
There remains only one failing doctest in the r interface.
comment:42 Changed 2 years ago by
 Commit changed from b091e6a60311ab5756d2fb744e60adabae120a5a to 8394952acfe919aba99695f84248edbc212a3c9c
Branch pushed to git repo; I updated commit sha1. New commits:
8394952  trac 14153 fixing another doctest

comment:43 Changed 2 years ago by
Green Bot ! Please review !
I am not sure that this is the definitive fix, but at least good progress.
This needs to be tested also on Mac and Cygwin, and this I cannot do.
comment:44 Changed 2 years ago by
comment:45 Changed 2 years ago by
 Milestone changed from sage8.0 to sage8.1
 Priority changed from minor to major
 Status changed from new to needs_review
comment:46 Changed 2 years ago by
ping, anybody ?
comment:47 Changed 2 years ago by
Testing on OS X. But it overall looks good.
comment:48 Changed 2 years ago by
I don't think there's anything about this that specifically needs testing on Cygwin. I've never encountered any issues on Cygwin particularly pertaining to unicode in Python.
comment:49 Changed 2 years ago by
ok, well. Then please review.
comment:50 Changed 2 years ago by
This is maybe a minor point, but here:

src/sage/doctest/forker.py
diff git a/src/sage/doctest/forker.py b/src/sage/doctest/forker.py index 9a5d982..705b718 100644
a b class SageDocTestRunner(doctest.DocTestRunner): 514 514 finally: 515 515 if self.debugger is not None: 516 516 self.debugger.set_continue() # ==== Example Finished ==== 517 got = self._fakeout.getvalue() # the actual output 517 got = self._fakeout.getvalue().decode('utf8') 518 # the actual output 519
it might be good if this were wrapped in a try/except, and fall back on latin1 if utf8 decoding fails, like:
got = self._fakeout.getvalue() try: got = got.decode('utf8') except UnicodeDecodeError: got = got.decode('latin1')
Of course this shouldn't happen, but in the off chance some test produces garbage output, this will at least prevent the doctest runner from crashing with a UnicodeDecodeError
. Instead you'll get some text, likely wrong, that can still be compared to the expected result.
comment:51 Changed 2 years ago by
Likewise here:

src/sage/interfaces/r.py
diff git a/src/sage/interfaces/r.py b/src/sage/interfaces/r.py index 9db603f..e9096df 100644
a b class R(ExtraTabCompletion, Expect): 668 668 ... 669 669 ImportError: ... 670 670 """ 671 ret = self.eval('require("%s")'%library_name) 671 ret = self.eval('require("%s")'%library_name).decode('utf8') 672 672 # try hard to parse the message string in a localeindependent way 673 673 if ' library(' in ret: # localeindependent keyword 674 674 raise ImportError("%s"%ret)
In this case I'm not really sure what the best thing is to do if there's an encoding error, but it should probably be handled. Will have to give that more thought. Basically any time you're decoding bytes coming from an external source it's important not to assume they'll always have the expected encoding :(
comment:52 Changed 2 years ago by
The rest makes sense to me. I haven't done a thorough investigation into all the places where Sage needs better unicode handling, and I'm sure this isn't exhaustive either, but I know you've done a lot of testing. The next issue is just what the patchbots will say....
comment:53 Changed 2 years ago by
patchbots said green already..
EDIT oh, no.. I missed the last report.
comment:54 Changed 2 years ago by
 Commit changed from 8394952acfe919aba99695f84248edbc212a3c9c to 4dc084e452d83fe8bf80b105d25b404a21092be7
comment:55 followup: ↓ 57 Changed 2 years ago by
 Status changed from needs_review to needs_work
Damn, some of the patchbots see some problems with the gap interface, that I do not see locally..
comment:56 Changed 2 years ago by
All clear on OS X as far as I can tell.
comment:57 in reply to: ↑ 55 Changed 2 years ago by
Replying to chapoton:
Damn, some of the patchbots see some problems with the gap interface, that I do not see locally..
It probably has something to do with their locale settings. This is why I suggested not just assuming .decode('utf8')
will work :) Now the question is, how exactly does GAP determine what encoding to use...
comment:58 Changed 2 years ago by
Ah, I forgot to point that out here:

src/sage/interfaces/gap.py
diff git a/src/sage/interfaces/gap.py b/src/sage/interfaces/gap.py index dd7b27f..7e765a1 100644
a b from sage.interfaces.tab_completion import ExtraTabCompletion 188 188 from sage.structure.element import ModuleElement 189 189 import re 190 190 import os 191 import io 191 192 import pexpect 192 193 import time 193 194 import platform … … class Gap(Gap_generic): 1339 1340 (sline,) = match.groups() 1340 1341 if self.is_remote(): 1341 1342 self._get_tmpfile() 1342 F = open(self._local_tmpfile(),"r")1343 F = io.open(self._local_tmpfile(), "r", encoding='utf8') 1343 1344 help = F.read() 1344 1345 if pager: 1345 1346 from IPython.core.page import page
Let's see if we can determine what encoding GAP is actually using...
comment:59 Changed 2 years ago by
Maybe we should force the value of "GAPInfo.TermEncoding"
?
see https://www.gapsystem.org/Manuals/pkg/GAPDoc1.5.1/doc/chap5.html
Maybe using GAPInfo.TermEncoding := "UTF8";
I know nothing of GAP, some input from a GAP expert could be useful here..
comment:60 Changed 2 years ago by
I can reproduce the failing doctests locally by adding one line
self.eval('SetGAPDocTextTheme("none")') + self.eval('GAPInfo.TermEncoding := "latin1";') self.eval(r'\$SAGE.tempfile := "%s";'%tmp_to_use)
comment:61 Changed 2 years ago by
 Commit changed from 4dc084e452d83fe8bf80b105d25b404a21092be7 to 1c0dd43920d5c81b164992fbaf9ecd9f81a3154d
Branch pushed to git repo; I updated commit sha1. New commits:
1c0dd43  trac 14153 trying to fix the gap doc doctests

comment:62 Changed 2 years ago by
not sure at all that this is a good solution..
comment:63 Changed 2 years ago by
Interesting. It's not clear to me where TermEncoding
is used in the process of displaying some help docs. But if you were able to reproduce by setting that then it must be involved somewhere, somehow...
comment:64 Changed 2 years ago by
see the section 5.32 GAPDoc2Text
in the link https://www.gapsystem.org/Manuals/pkg/GAPDoc1.5.1/doc/chap5.html
I think this explain what GAP does when its doc is required as a text.
comment:65 Changed 2 years ago by
Great, that should explain it then.
Hmmmaybe rather than forcing TermEncoding
to a specific value we should read TermEncoding
and use that to determine how to decode text from GAP.
comment:66 Changed 2 years ago by
 Commit changed from 1c0dd43920d5c81b164992fbaf9ecd9f81a3154d to 0faa9436561f00cf1ed5b9c8a65ca4667032f2dd
Branch pushed to git repo; I updated commit sha1. New commits:
0faa943  trac 14153 detect gap encoding

comment:67 Changed 2 years ago by
indeed, here is a try..
comment:68 Changed 2 years ago by
 Status changed from needs_work to needs_review
comment:69 Changed 2 years ago by
bot seems to be stablegreen, please review
comment:70 Changed 2 years ago by
ping ? Jeroen ? Erik ?
comment:71 Changed 2 years ago by
Can you replace the unconditional except: with something more sensible? Probably just name = str(name) instead of the entire try/except + assertion. Rest lgtm
comment:72 Changed 2 years ago by
 Commit changed from 0faa9436561f00cf1ed5b9c8a65ca4667032f2dd to ed1e3b5bf17e6cc923e28b11862496dd0e1b9b04
comment:73 Changed 2 years ago by
Thanks Volker for having a look.
I have made a more precise except. I guess one could also have removed the tryexcept.
comment:74 Changed 2 years ago by
bot is morally green
comment:75 Changed 2 years ago by
 Reviewers changed from Erik Bray to Erik Bray, Volker Bruan, David Roe
 Status changed from needs_review to positive_review
Looks good. Sorry that this review took so long!
comment:76 Changed 2 years ago by
 Reviewers changed from Erik Bray, Volker Bruan, David Roe to Erik Bray, Volker Braun, David Roe
comment:77 Changed 2 years ago by
Sorry for the typo Volker!
comment:78 Changed 2 years ago by
 Branch changed from u/chapoton/14153v2 to ed1e3b5bf17e6cc923e28b11862496dd0e1b9b04
 Resolution set to fixed
 Status changed from positive_review to closed
I just tried this patch on the suspicion it could help with the test weirdness I reported in 5.9.beta0 on sagerelease
The R error was fixed by unsetting R_HOME before running the test. Then I applied the patch and here is the result