Opened 8 years ago

Closed 8 years ago

#13296 closed defect (worksforme)

unicode default encoding is not utf-8 in command line

Reported by: slabbe Owned by: jason, was
Priority: major Milestone: sage-duplicate/invalid/wontfix
Component: graphics Keywords: unicode, matplotlib
Cc: kcrisman, ddrake Merged in:
Authors: Reviewers: John Palmieri
Report Upstream: Fixed upstream, in a later stable release. Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Description (last modified by slabbe)

utf-8 seems to be the default encoding in Python 2.7 for unicode strings:

Python 2.7.2 (default, May 30 2012, 14:00:43) 
[GCC 4.6.3] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 'é'
'\xc3\xa9'
>>> u'é'
u'\xe9'
>>> unicode('é', encoding='utf-8')
u'\xe9'

But, in sage-5.2.rc0, latin1 seems to be the default :

sage: 'é'
'\xc3\xa9'
sage: u'é'
u'\xc3\xa9'
sage: unicode('é', encoding='latin1')
u'\xc3\xa9'
sage: unicode('é', encoding='utf-8')
u'\xe9'
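
The outputs above are consistent with the same two bytes being decoded in two different ways. A minimal sketch of that difference, written in Python 3 syntax (an assumption, since the sessions in this ticket are Python 2); the variable names are illustrative only:

```python
# A UTF-8 terminal sends 'é' as the two bytes C3 A9.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one code point, U+00E9
as_latin1 = raw.decode("latin-1")  # two code points, U+00C3 and U+00A9

print(ascii(as_utf8))    # '\xe9'
print(ascii(as_latin1))  # '\xc3\xa9'
```

The second result matches what the Sage prompt shows for u'é', which is why latin1 looks like the default here.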

As reported by John Palmieri, the problem seems to be from ipython:

$ sage --ipython
Python 2.7.2 (default, May 30 2012, 14:00:43) 
Type "copyright", "credits" or "license" for more information.

IPython 0.10.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: 'é'
Out[1]: '\xc3\xa9'

In [2]: u'é'
Out[2]: u'\xc3\xa9'

In [3]: unicode('é', encoding='utf-8')
Out[3]: u'\xe9'

In [4]: unicode('é', encoding='latin1')
Out[4]: u'\xc3\xa9'

This bug was first reported as a problem in matplotlib used from the command line, where the unicode letter é gets replaced by "Ã©" in a matplotlib plot:

sage: text(u'an accent : é', (1,1), color='red')

With #13161 the same problem appears for axes labels :

sage: t = text(u'an accent : é', (1,1), color='red')
sage: t.axes_labels([u'an accent : é', 'Y'])       # broken without #13161
sage: t

But, as mentioned in ticket #13161, if the same code is written in a file (even without a declared encoding), then unicode gets printed perfectly :

# this is file.sage
t = text(u'an accent : é', (1,1), color='red')
t.axes_labels([u'an accent : é', 'Y'])       # broken without #13161

The result is perfect :

sage: attach file.sage
sage: t

What makes it work in a file but not for the command line?

Attachments (1)

trac_13296-unicode-text.patch (986 bytes) - added by jhpalmieri 8 years ago.

Change History (25)

comment:1 Changed 8 years ago by slabbe

  • Description modified (diff)

comment:2 Changed 8 years ago by slabbe

  • Dependencies #13161 deleted

comment:3 Changed 8 years ago by slabbe

  • Description modified (diff)

comment:4 Changed 8 years ago by kcrisman

  • Cc kcrisman ddrake added

comment:5 follow-up: Changed 8 years ago by ppurka

I'm not surprised that attach file.sage works. According to a note in sage/misc/interpreter.py, Sage defaults to utf-8 if no encoding is detected.

    - We default to UTF-8 encoding even though PEP 263 says that
      Python files should default to ASCII.

But sage file.sage doesn't work. Probably the functions in interpreter.py don't get called?

comment:6 in reply to: ↑ 5 ; follow-up: Changed 8 years ago by slabbe

Replying to ppurka:

According to a note in sage/misc/interpreter.py, Sage defaults to utf-8 if no encoding is detected.

Using sage-5.2.rc0, I don't see this note. It disappeared? Or appeared later?

But sage file.sage doesn't work. Probably the functions in interpreter.py don't get called?

But if the encoding is declared in file.sage :

# -*- coding: utf-8 -*-
# this is file.sage
t = text(u'an accent : é', (1,1), color='red')
t.axes_labels([u'an accent : é', 'Y'])       # broken without #13161
t.show()

then sage file.sage works perfectly : the accent "é" gets properly written. So the problem seems to be from the command line...
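
The effect of the declaration can be sketched outside Sage. The snippet below uses Python 3 (an assumption; this ticket is about Python 2) and a hypothetical helper `run_source`. It compiles the same bytes under two declared encodings. latin-1 is used for contrast only and is not claimed to be what Sage assumes for a file without a declaration.

```python
def run_source(source_bytes):
    """Compile and run source given as bytes; return the value bound to `s`."""
    namespace = {}
    exec(compile(source_bytes, "<sketch>", "exec"), namespace)
    return namespace["s"]

# Both sources contain the same UTF-8 bytes for 'é' inside the literal;
# only the coding declaration on the first line differs.
declared_utf8 = b"# -*- coding: utf-8 -*-\ns = u'\xc3\xa9'\n"
declared_latin1 = b"# -*- coding: latin-1 -*-\ns = u'\xc3\xa9'\n"

print(ascii(run_source(declared_utf8)))    # '\xe9'
print(ascii(run_source(declared_latin1)))  # '\xc3\xa9'
```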

comment:7 in reply to: ↑ 6 ; follow-up: Changed 8 years ago by ppurka

Replying to slabbe:

Replying to ppurka:

According to a note in sage/misc/interpreter.py, Sage defaults to utf-8 if no encoding is detected.

Using sage-5.2.rc0, I don't see this note. It disappeared? Or appeared later?

It's there in line 336 of $SAGE_ROOT/devel/sage/sage/misc/interpreter.py.

But sage file.sage doesn't work. Probably the functions in interpreter.py don't get called?

But if the encoding is declared in file.sage :

Of course, it works then. All that the functions in interpreter.py do is put the string -*- coding: utf-8 -*- in the beginning of the python file they create. After that it "just works." Which is why I said that perhaps the necessary functions in the interpreter.py are not called when Sage is run in command line.
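
A rough sketch of that idea, for illustration only: `ensure_coding_header` is a hypothetical helper, not the actual code in interpreter.py. It prepends a declaration only when the first two lines do not already contain one, using the pattern given in PEP 263.

```python
import re

# Pattern from PEP 263 for recognising an encoding declaration.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def ensure_coding_header(source, default="utf-8"):
    """Return `source` with a coding line prepended if none is declared.

    Hypothetical helper; PEP 263 only looks at the first two lines.
    """
    first_two = source.splitlines()[:2]
    if any(CODING_RE.match(line) for line in first_two):
        return source
    return "# -*- coding: %s -*-\n%s" % (default, source)

print(ensure_coding_header("t = 1\n"))
```

A source that starts with a shebang line would need the declaration on the second line instead; this sketch does not handle that case.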

comment:8 in reply to: ↑ 7 ; follow-up: Changed 8 years ago by slabbe

Using sage-5.2.rc0, I don't see this note. It disappeared? Or appeared later?

It's there in line 336 of $SAGE_ROOT/devel/sage/sage/misc/interpreter.py.

My fault, I am using sage-5.2.rc0 with #12719 which changed this file.

But sage file.sage doesn't work. Probably the functions in interpreter.py don't get called?

But if the encoding is declared in file.sage :

Of course, it works then. All that the functions in interpreter.py do is put the string -*- coding: utf-8 -*- in the beginning of the python file they create. After that it "just works." Which is why I said that perhaps the necessary functions in the interpreter.py are not called when Sage is run in command line.

Sorry, I misunderstood the "doesn't work". Thanks for the clarifications.

So now, we need to understand how to put the string -*- coding: utf-8 -*- somewhere for the command line?

Sébastien

comment:9 in reply to: ↑ 8 Changed 8 years ago by ppurka

Replying to slabbe:

So now, we need to understand how to put the string -*- coding: utf-8 -*- somewhere for the command line?

Essentially, yes. I suppose (I haven't looked at it completely) that sage creates a python file and then runs it under python. If so, all that needs to be done is to make sure that the created file always has a header declaring the encoding as utf-8.

comment:10 Changed 8 years ago by jhpalmieri

So now, we need to understand how to put the string -*- coding: utf-8 -*- somewhere for the command line?

I'm not sure this is a good idea: we shouldn't change Python's default behavior without a really good reason. If you want to use accents in a Python string, then I think you need to specify the encoding explicitly (see http://docs.python.org/howto/unicode.html, in particular this section). So you should do

sage: text(unicode('an accent : é', encoding='utf-8'), (1,1), color='red')

This command works for me. Indeed,

sage: s = unicode('an accent : é', encoding='utf-8')
sage: ss = u'an accent : é'
sage: s == ss
False

and I think that the s version is the right way to do it. Alternatively, you can use unicode escape sequences, as in

text(u'an accent : \xe9', (1,1), color='red')

I'm not at all a unicode expert, though.
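
The two approaches can be compared in a short sketch. It is written in Python 3 syntax (an assumption, since the commands above are Python 2), where `bytes.decode` plays the role of `unicode(..., encoding='utf-8')`; the variable names are illustrative.

```python
# What a UTF-8 terminal sends when 'an accent : é' is typed.
typed_bytes = b"an accent : \xc3\xa9"

# Explicit decoding, the counterpart of unicode(..., encoding='utf-8').
decoded = typed_bytes.decode("utf-8")

# Escape sequence: pure ASCII source, so no dependence on the input encoding.
escaped = u"an accent : \xe9"

print(decoded == escaped)  # True
```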

Anyway, I suggest instead adding some documentation. See the attached patch.

comment:11 Changed 8 years ago by jhpalmieri

  • Authors set to John Palmieri
  • Status changed from new to needs_review

Changed 8 years ago by jhpalmieri

comment:12 Changed 8 years ago by slabbe

Great, thanks for the documentation link. I just read all of it. I now see the difference between unicode and utf-8, which I was seeing vaguely as synonyms.

Surprisingly, whereas Sage returns False, in Python 2.6 and 2.7, ss == s returns True:

>>> s = unicode('an accent : é', encoding='utf-8')
>>> ss = u'an accent : é'
>>> s == ss
True

That means utf-8 is the default encoding in Python 2.7 for unicode strings. In more detail:

Python 2.7.2 (default, May 30 2012, 14:00:43) 
[GCC 4.6.3] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 'é'
'\xc3\xa9'
>>> u'é'
u'\xe9'
>>> unicode('é', encoding='utf-8')
u'\xe9'

But, in sage-5.0, latin1 seems to be the default encoding for unicode strings :

sage: 'é'
'\xc3\xa9'
sage: u'é'
u'\xc3\xa9'
sage: unicode('é', encoding='latin1')
u'\xc3\xa9'
sage: unicode('é', encoding='utf-8')
u'\xe9'

I did not look at the patch yet.

comment:13 Changed 8 years ago by slabbe

Also, I just realized that in sage-5.2.rc0 with the new ipython at #12719, the default encoding seems to be utf-8, so this ticket would be fixed by #12719 in its current state :

sage: u'é'
u'\xe9'

comment:14 Changed 8 years ago by slabbe

I looked at the patch. All tests passed on sage/plot/text.py. If the behavior

sage: u'é'
u'\xc3\xa9'

is OK, I think the new documentation is a good fix and this ticket deserves a positive review. But, if we prefer the Python default way :

>>> u'é'
u'\xe9'

then I believe we need one more patch... What do you think?

comment:15 follow-up: Changed 8 years ago by jhpalmieri

It looks like it's IPython:

$ sage --ipython
Python 2.7.3 (default, Jul 26 2012, 17:22:39) 
Type "copyright", "credits" or "license" for more information.

IPython 0.10.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: s = unicode('an accent : é', encoding='utf-8')

In [2]: ss = u'an accent : é'

In [3]: s == ss
Out[3]: False

In [4]: s
Out[4]: u'an accent : \xe9'

In [5]: ss
Out[5]: u'an accent : \xc3\xa9'

comment:16 in reply to: ↑ 15 Changed 8 years ago by jhpalmieri

Replying to jhpalmieri:

It looks like it's IPython

and the problem is no longer there with the IPython spkg from #12719. In fact, the text commands in the ticket description work fine for me, with #12719 in place. So maybe this ticket should be marked as invalid, and we'll just wait until #12719?

comment:17 Changed 8 years ago by ppurka

I agree. If ipython fixes this magically, then we should just close this ticket as invalid and get #12719 merged.

comment:18 Changed 8 years ago by slabbe

  • Description modified (diff)
  • Summary changed from matplotlib do not handle unicode properly from command line to unicode default encoding is not utf-8 in command line

I updated the title of the ticket and the description for what we learned so far, i.e. it is not a problem in matplotlib.

comment:19 Changed 8 years ago by slabbe

  • Report Upstream changed from N/A to Fixed upstream, in a later stable release.

The problem is fixed in IPython 0.13 (see ticket #12719) :

Python 2.7.3 (default, Jul 19 2012, 21:24:57) 
Type "copyright", "credits" or "license" for more information.

IPython 0.13 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: 'é'
Out[1]: '\xc3\xa9'

In [2]: u'é'
Out[2]: u'\xe9'

and was also fixed in IPython 0.12.1 :

Python 2.7.2 (default, May  9 2012, 21:52:56) 
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: 'é'
Out[1]: '\xc3\xa9'

In [2]: u'é'
Out[2]: u'\xe9'

For now, we don't know how long #12719 will take (yesterday, William Stein found a serious bug with it and discouraged its use). So if it takes one or two Sage versions before #12719 is merged, maybe in the meantime we can find a way to change the default encoding of ipython from 'latin1' to 'utf-8'?

comment:20 Changed 8 years ago by jhpalmieri

I don't think the default encoding is latin1; instead, I think that IPython does some preprocessing on strings, and that preprocessing converts u'an accent : é' into something different – I think it replaces it by u'an accent : é'.encode('utf-8'). In any case, I think fixing this the right way will involve patching IPython.
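
Whatever the exact preprocessing step is, the observed value is what results when each UTF-8 byte ends up as its own code point. A hedged sketch in Python 3 syntax (an assumption; IPython 0.10.2 runs on Python 2), including a hypothetical repair for strings that were already garbled this way:

```python
intended = u"an accent : \xe9"

# Map each UTF-8 byte to the code point with the same number,
# which is what the broken u'...' literal appears to contain.
garbled = u"".join(chr(b) for b in intended.encode("utf-8"))
print(ascii(garbled))  # 'an accent : \xc3\xa9'

# Hypothetical repair: go back to bytes via latin-1, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == intended)  # True
```

This byte-to-code-point mapping gives the same result as decoding with latin1, which would explain why latin1 looked like the default encoding.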

comment:21 Changed 8 years ago by jdemeyer

  • Milestone changed from sage-5.8 to sage-duplicate/invalid/wontfix

Should this be closed now that IPython has been updated?

comment:22 Changed 8 years ago by jhpalmieri

Yes, the problems in the ticket description seem to be solved now.

comment:23 Changed 8 years ago by jdemeyer

  • Authors John Palmieri deleted
  • Reviewers set to John Palmieri
  • Status changed from needs_review to positive_review

comment:24 Changed 8 years ago by jdemeyer

  • Resolution set to worksforme
  • Status changed from positive_review to closed