Opened 8 years ago

Closed 8 years ago

#14375 closed enhancement (fixed)

Indicate escape codes in doctest output

Reported by: jdemeyer Owned by: roed
Priority: minor Milestone: sage-5.10
Component: doctest framework Keywords:
Cc: vbraun Merged in: sage-5.10.beta3
Authors: Volker Braun Reviewers: Jeroen Demeyer, Volker Braun
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Attachments (4)

trac_14375_ansi_escapes_indication.patch (2.6 KB) - added by roed 8 years ago.
trac_14375_ansi_escapes_indication.2.patch (3.0 KB) - added by vbraun 8 years ago.
Version with the wrong regex
14375_inputenc.patch (2.9 KB) - added by vbraun 8 years ago.
Updated patch
14375_review.patch (2.9 KB) - added by jdemeyer 8 years ago.

Download all attachments as: .zip

Change History (37)

Changed 8 years ago by roed

comment:1 Changed 8 years ago by roed

Apparently CSI stands for Control Sequence Introducer (see wikipedia).

comment:2 follow-up: Changed 8 years ago by roed

As for \x9b, I don't have a strong opinion. I guess I would lean toward escaping it, since it is a valid CSI.

comment:3 in reply to: ↑ 2 Changed 8 years ago by jdemeyer

Replying to roed:

As for \x9b, I don't have a strong opinion. I guess I would lean toward escaping it, since it is a valid CSI.

And how are you going to distinguish between \x9b appearing as part of a UTF-8 code?

comment:4 follow-up: Changed 8 years ago by vbraun

Don't shoot the messenger about the \x9b CSI, I just implemented the specs ;-)

As far as I know the whole Python 2.x doctesting just operates on byte strings without any unicode awareness (see also #14359). How do you specify the encoding of the doctest output, for starters? Since when is it UTF-8?

comment:5 Changed 8 years ago by roed

  • Component changed from doctest to doctest framework
  • Owner changed from mvngu to roed

comment:6 in reply to: ↑ 4 Changed 8 years ago by jdemeyer

Replying to vbraun:

How do you specify the encoding of the doctest output, for starters? Since when is it UTF-8?

It's essentially undefined, it could be anything. But it could be UTF-8 with a \x9b byte appearing not from a CSI.

Unrelated: I really like Python's .*? for regular expressions, this is a nice use of that.

comment:7 follow-up: Changed 8 years ago by vbraun

We don't have a doctest that spews wide UTF-8 characters containing \x1b or \x9b. The proper solution is to make the doctest framework use unicode instead of byte strings. We will have to revisit this when we switch to python 3. But at this point I don't think there is anything that I can do on this ticket.

comment:8 in reply to: ↑ 7 Changed 8 years ago by jdemeyer

Replying to vbraun:

We don't have a doctest that spews wide UTF-8 characters containing \x1b or \x9b.

....yet

The proper solution is to make the doctest framework use unicode instead of byte strings.

Indeed, see #14153.

But at this point I don't think there is anything that I can do on this ticket.

Since those \x9b codes are rarely/never actually used, simply leave them alone. I think that would be a good compromise.

comment:9 follow-up: Changed 8 years ago by vbraun

Can we actually finish #14153 before switching to python 3? Without checking, I would guess that there are tons of things that will go wrong when feeding the python 2.7 doctest module with unicode strings.

If/when we switch to a unicode framework we just replace the byte string regex by its unicode version. There is still a single code point U+009B for CSI in addition to the two code point version U+001B[. There is no compromise needed since the byte string version of this ticket should never meet the unicode version of the doctest framework.

comment:10 in reply to: ↑ 9 Changed 8 years ago by jdemeyer

  • Authors set to Volker Braun

Replying to vbraun:

Can we actually finish #14153 before switching to python 3? Without checking, I would guess that there are tons of things that will go wrong when feeding the python 2.7 doctest module with unicode strings.

We're actually using quite little of the Python doctest module. Besides, with Python's duck-typing, there is not that much difference between strings and unicode strings.

There is no compromise needed since the byte string version of this ticket should never meet the unicode version of the doctest framework.

What I'm trying to say is that the current string version of the doctesting framework might see byte strings encoded as UTF-8, even when not using the unicode type. Remember that Sage uses UTF-8 encoded source files, so any strings defined in doctests will be encoded as UTF-8.

comment:11 Changed 8 years ago by vbraun

Mix & matching byte string and unicode mostly works if you are willing to ignore the UnicodeWarning ;-)

I agree that the doctest code can see arbitrary binary data (including UTF-8) right now as anything might be printed by the code being tested. Its possible (but unlikely) that some of the non-ascii garbage will be translated into <ESC-?> by this patch. So what? The end goal should be unicode + 100% coverage of ansi control sequences, and this patch is closer to that goal than leaving out some random control sequences because they aren't used much.

comment:12 Changed 8 years ago by jdemeyer

Put it a different way: imagine we see a \x9b in doctest output, which is more likely:

  1. It is part of a UTF-8 sequence.
  2. It is a CSI code in a different encoding.

I would say the first is more likely.

comment:13 Changed 8 years ago by vbraun

  • Status changed from new to needs_review

Since all doctests pass with this ticket I'm pretty sure we are talking about the empty set. Both are currently equally likely, with probability zero.

comment:14 follow-up: Changed 8 years ago by roed

Any conclusion on \x9b?

comment:15 in reply to: ↑ 14 Changed 8 years ago by jdemeyer

Replying to roed:

Any conclusion on \x9b?

Yes, unfortunately, two different conclusions depending on who you ask :-)

Changed 8 years ago by vbraun

Version with the wrong regex

comment:16 Changed 8 years ago by vbraun

Since incomplete coverage of escape codes is still better than doing nothing at all, here is a version of the patch with the wrong regex.

comment:17 Changed 8 years ago by jdemeyer

The "right" regex would obviously be the UTF-8 encoded version of \x9b, not the byte \x9b itself.

comment:18 follow-up: Changed 8 years ago by vbraun

The UTF-8 encoded \x1b / x9b won't match escape sequences unless we encode the output of the doctest in UTF-8. Right now its byte strings. Once we switch to unicode we have to match the U+001B / U+009B code points. At no point is it a good idea to match for UTF-8 encoded escape sequences imho.

comment:19 Changed 8 years ago by roed

If we ever have a \x9b byte coming out of a doctest, I'd like it to fail doctesting so that we can figure out what it actually was. Perhaps we should look for that byte and raise an error message referencing this ticket?

comment:20 follow-up: Changed 8 years ago by vbraun

Or you take the first version of the patch and it'll be printed as CSI.

comment:21 in reply to: ↑ 20 Changed 8 years ago by roed

Replying to vbraun:

Or you take the first version of the patch and it'll be printed as CSI.

Yeah. I'm fine with that, but apparently Jeroen isn't.

comment:22 in reply to: ↑ 18 ; follow-up: Changed 8 years ago by jdemeyer

Replying to vbraun:

Right now its byte strings.

Which simply means that the encoding is "undefined". It's not the opposite of UTF-8. It's maybe UTF-8, maybe something else.

Once we switch to unicode

"We" already partially switched to UTF-8: Sage source files are UTF-8 and some terminals (more and more these days) use UTF-8 encoding by default.

At no point is it a good idea to match for UTF-8 encoded escape sequences imho.

Fair enough, but I think matching literal \x9b bytes is even worse.

comment:23 in reply to: ↑ 22 Changed 8 years ago by vbraun

Replying to jdemeyer:

"We" already partially switched to UTF-8: Sage source files are UTF-8 and some terminals (more and more these days) use UTF-8 encoding by default.

The terminal has no say in doctesting. The "want" is UTF-8 (or at least comes from the UTF-8 encoded sage source), but the "got" is whatever byte string was printed by the worker.

comment:24 Changed 8 years ago by jdemeyer

  • Description modified (diff)
  • Reviewers set to Jeroen Demeyer

Please review my reviewer patch.

comment:25 Changed 8 years ago by jdemeyer

* ping *

14375_review.patch needs review.

comment:26 Changed 8 years ago by vbraun

  • Reviewers changed from Jeroen Demeyer to Jeroen Demeyer, Volker Braun
  • Status changed from needs_review to positive_review

I don't agree with it but its still an improvement over the current state, so positive review.

comment:27 Changed 8 years ago by jdemeyer

  • Status changed from positive_review to needs_work
! Package inputenc Error: Keyboard character used is undefined
(inputenc)                in inputencoding `utf8x'.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.3239 ...is is ^^[[1mbold^^[[0m text}\PYG{l+s}{'}
                                                  
? 
! Emergency stop.
 ...                                              
                                                  
l.3239 ...is is ^^[[1mbold^^[[0m text}\PYG{l+s}{'}
                                                  
!  ==> Fatal error occurred, no output PDF file produced!
Transcript written on doctest.log.

comment:28 Changed 8 years ago by vbraun

Oh we are still building pdf docs, that uses the deprecated utf8x that doesn't even ship with TeXlive-2012 any more. Here is first version of a patch that fixes that. Still needs some work...

Last edited 8 years ago by vbraun (previous) (diff)

Changed 8 years ago by vbraun

Updated patch

comment:29 Changed 8 years ago by vbraun

  • Status changed from needs_work to needs_review

Everything works now for me...

Changed 8 years ago by jdemeyer

comment:30 Changed 8 years ago by jdemeyer

Volker, I confirm the utf8 problem, but that's for a different ticket (I don't know enough about TeX to review that): #14571.

I fixed the problem on this ticket by simply using a r""" string instead of """. Needs review.

Last edited 8 years ago by jdemeyer (previous) (diff)

comment:31 Changed 8 years ago by vbraun

Well I can't test it because TeXlive? doesn't ship utf8x any more.

comment:32 Changed 8 years ago by vbraun

  • Status changed from needs_review to positive_review

comment:33 Changed 8 years ago by jdemeyer

  • Merged in set to sage-5.10.beta3
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.