Opened 4 years ago

Closed 4 years ago

#24475 closed enhancement (fixed)

str_to_bytes() should accept unicode

Reported by: jdemeyer Owned by:
Priority: major Milestone: sage-8.2
Component: python3 Keywords:
Cc: embray Merged in:
Authors: Erik Bray Reviewers: Jeroen Demeyer, Erik Bray
Report Upstream: N/A Work issues:
Branch: 6bdb52a (Commits, GitHub, GitLab) Commit: 6bdb52abba96a24a25685a56260f67f0d4a0b2b3
Dependencies: Stopgaps:

Status badges

Description (last modified by jdemeyer)

Instead of requiring that the input of str_to_bytes() is always str, it should accept both bytes and unicode (*). So, instead of a condition depending on the Python version, we should have a condition depending on the input type.

(*) in the Cython sense, so this means str on Python 3.

Change History (17)

comment:1 Changed 4 years ago by jdemeyer

  • Description modified (diff)

comment:2 Changed 4 years ago by embray

We might still have both. The implementation details depend on the Python version, and checking for unicode vs str is only useful on Python 2.

comment:3 Changed 4 years ago by jdemeyer

  • Description modified (diff)

comment:4 follow-up: Changed 4 years ago by embray

From https://trac.sagemath.org/ticket/24223#comment:20:

Instead of PyUnicode_EncodeLocale, you can just pass NULL as encoding to PyUnicode_AsEncodedString

That's not true. The latter uses UTF-8 as the default encoding.

comment:5 in reply to: ↑ 4 Changed 4 years ago by jdemeyer

Replying to embray:

From https://trac.sagemath.org/ticket/24223#comment:20:

Instead of PyUnicode_EncodeLocale, you can just pass NULL as encoding to PyUnicode_AsEncodedString

That's not true. The latter uses UTF-8 as the default encoding.

You are right as usual :-) Actually, the default is ASCII on Python 2.

But you have to admit that it is confusing: there is a notion of a "default encoding" (ASCII on py2 and UTF-8 on py3) which is different from the "preferred encoding" determined by the locale.

comment:6 follow-up: Changed 4 years ago by embray

I agree, it's confusing. The "default encoding" is a different thing, which is the default encoding used, for example, in str.encode and bytes.decode. On Python 2 it's "ascii" which is why you get so many encoding errors when, for example, you try to append a unicode string with a non-unicode string.

I'm starting to lean more towards maybe these functions should just use UTF-8 as a sane, predictable default (basically the Python 3 value for sys.getdefaultencoding()). As I've written before this may be a problem when interacting with locale-aware interfaces and libraries in situations where the user is in a non-UTF8 locale (which is more likely than one might think). But as I've found so far in my Python 3 work on Sage that seems to be a less common case than in general. In those cases we can take care to pass in the correct encoding for the current locale where necessary.

My intuition was that using the locale encoding would be a safer default to work with for now--and it probably is--but maybe explicit would be better than implicit in this case too.

comment:7 Changed 4 years ago by embray

  • Authors changed from Jeroen Demeyer to Erik Bray
  • Branch set to u/embray/python3/ticket-24475
  • Commit set to 9cf556b0a9ad8bdf708ab083f160e0fe674c4569
  • Status changed from new to needs_review

Here's my current attempt, including switch to UTF-8 by default.


New commits:

9cf556bRework str_to_bytes so that it accepts unicode on Python 2 and treats it same as str on Python 3.

comment:8 in reply to: ↑ 6 Changed 4 years ago by jdemeyer

Replying to embray:

I'm starting to lean more towards maybe these functions should just use UTF-8 as a sane, predictable default

After all the effort you put into convincing me that locale.getpreferredencoding() is the right default, why change it to UTF-8?

Personally, I still think that the best solution is still "Explicit is better than implicit", so always require choosing an encoding. As second best option, I would use locale.getpreferredencoding() as default since that is what Python mostly uses as default (open() is an important example)

comment:9 follow-up: Changed 4 years ago by jdemeyer

About the implementation: I would write the code to be as much as possible independent of the Python version. For example, for the type checking you could write

if isinstance(s, bytes):
    return <bytes>s
if not isinstance(s, unicode):
    raise TypeError(f"expected str or unicode, {type(s).__name__} found")

This will work as expected on both Python 2 and Python 3. As a bonus, the behaviour is also more consistent because it allows bytes on Python 2 and 3.

comment:10 Changed 4 years ago by embray

On the encoding: I thought I explained my change of thinking above, but there are a few reasons (and I'm torn either way):

1) On Python 2 encoding with the locale is going to be annoying, as there's no equivalent of PyUnicode_EncodeLocale. Instead, one has to import locale (which has to be done in the function, otherwise it won't work when Cython inlines it), and call locale.getpreferredencoding(). This in turn calls setlocale which isn't thread-safe (though maybe that doesn't matter for Sage's purposes, but I worried it would limit where this could be used).

2) You can avoid the setlocale call by calling locale.getpreferredencoding(False). However, this has unfortunate ramifications on Python 2. All programs start out in the "C" locale (i.e. ASCII) by default. In Python 2 the interpreter only temporarily enables the user's locale in order to determine what to use for "filesystem encoding", but then switches back to the default. That is, if you want Python to use anything other than the "C" locale (or at least any non-ASCII encoding from the locale) you have to at least manually call setlocale(LC_CTYPE, "") at least once. Python 3, however, does exactly this during interpreter startup, which makes a major difference.

3) While it's true that Python uses locale.getpreferredencoding() in lots of places as the default encoding (most notably open()) I've found so far with these utility functions that they don't have a lot of impact w.r.t. file I/O. In the past, the most challenging Python 3 ports I've worked on have had a lot to do with file I/O which is why this was heavily on my mind, but so far in porting Sage this has been less of any issue.

4) It might actually be saner and more predictable to defaulting to UTF-8, especially when a user explicitly passes some non-ASCII unicode text to a function that is supposed to support unicode, if the user's locale is non-unicode they may get encoding errors.

Relatedly, having a default does not necessarily violate "explicit is better than implicit" as long as that default is predictable and well-documented. With the locale-based encoding it isn't that, even though I personally would prefer it in most cases.

And yes, I realize I've previously argued for the opposite--I think with good reasons--but now I've convinced myself the other way around. Take it with a grain of salt though; if you'd prefer not to change things again I can undo that. It just makes the code a little more complicated unfortunately.

comment:11 in reply to: ↑ 9 Changed 4 years ago by embray

Replying to jdemeyer:

About the implementation: I would write the code to be as much as possible independent of the Python version. For example, for the type checking you could write

While I would also prefer to write the code as independent as possible I'm not sure I follow your suggestion here.

if isinstance(s, bytes):
    return <bytes>s

As a bonus, the behaviour is also more consistent because it allows bytes on Python 2 and 3.

I'm not sure what you're getting at here. str_to_bytes really shouldn't accept bytes on Python 3.

if not isinstance(s, unicode):
    raise TypeError(f"expected str or unicode, {type(s).__name__} found")

This will work as expected on both Python 2 and Python 3.

The error message would be wrong on Python 3 since there is no "unicode" (and this error message is intended for users who would normally be writing plain Python, not Cython). That's why I have slightly different duplicates of this code.

comment:12 Changed 4 years ago by jdemeyer

  • Branch changed from u/embray/python3/ticket-24475 to u/jdemeyer/python3/ticket-24475

comment:13 follow-up: Changed 4 years ago by jdemeyer

  • Commit changed from 9cf556b0a9ad8bdf708ab083f160e0fe674c4569 to 6bdb52abba96a24a25685a56260f67f0d4a0b2b3

I'll give you the benefit of the doubt for the encoding discussion. Do you agree with this review commit?


New commits:

6bdb52aVarious minor fixes

comment:14 in reply to: ↑ 13 ; follow-up: Changed 4 years ago by embray

Replying to jdemeyer:

I'll give you the benefit of the doubt for the encoding discussion. Do you agree with this review commit?

-IF PY_MAJOR_VERSION >= 3:
-    cdef extern from "Python.h":
-        # Missing from cpython.unicode in Cython 0.27.3
-        char* PyUnicode_AsUTF8(object s)
+
+cdef extern from "Python.h":
+    # Missing from cpython.unicode in Cython 0.27.3
+    char* PyUnicode_AsUTF8(object s)

PyUnicode_AsUTF8 doesn't exist on Python 2 so I'm not sure why you removed this check. I'm not sure if that actually matters from Cython's end though.

Looks good to me otherwise in principle; I just want to test it on my Python 3 branch.

comment:15 in reply to: ↑ 14 Changed 4 years ago by jdemeyer

Replying to embray:

PyUnicode_AsUTF8 doesn't exist on Python 2 so I'm not sure why you removed this check.

I removed it because I want to upstream that declaration and Cython doesn't differentiate between Python 2 and Python 3.

I'm not sure if that actually matters from Cython's end though.

No, it doesn't.

comment:16 Changed 4 years ago by embray

  • Reviewers set to Jeroen Demeyer, Erik Bray
  • Status changed from needs_review to positive_review

comment:17 Changed 4 years ago by vbraun

  • Branch changed from u/jdemeyer/python3/ticket-24475 to 6bdb52abba96a24a25685a56260f67f0d4a0b2b3
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.