#22044 closed defect (fixed)
fix incompatibility with py3 in autogen/pari
Reported by: | Frédéric Chapoton | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | sage-7.5 |
Component: | python3 | Keywords: | |
Cc: | Erik Bray, Matthias Köppe, Jeroen Demeyer | Merged in: | |
Authors: | Frédéric Chapoton | Reviewers: | Jeroen Demeyer |
Report Upstream: | N/A | Work issues: | |
Branch: | 02c5061 (Commits, GitHub, GitLab) | Commit: | |
Dependencies: | Stopgaps: |
Description (last modified by )
caused by #21613
Change History (25)
comment:1 Changed 6 years ago by
Branch: | → u/chapoton/22044 |
---|---|
Cc: | Erik Bray Matthias Köppe Jeroen Demeyer added |
Commit: | → 02c50619cbb13a5c8e32a604e2e55642f5fbad4e |
Status: | new → needs_review |
comment:2 Changed 6 years ago by
Reviewers: | → Jeroen Demeyer |
---|---|
Status: | needs_review → positive_review |
comment:3 follow-up: 6 Changed 6 years ago by
Description: | modified (diff) |
---|---|
Status: | positive_review → needs_work |
This does't look right to me. Instead you should change pari_share
to return a string instead of bytes (by decoding the output with the system's filesystem encoding): https://git.sagemath.org/sage.git/tree/src/sage_setup/autogen/pari/parser.py?id=02c50619cbb13a5c8e32a604e2e55642f5fbad4e#n41
comment:4 Changed 6 years ago by
Description: | modified (diff) |
---|
comment:6 Changed 6 years ago by
Replying to embray:
Instead you should change
pari_share
to return a string instead of bytes
Why introduce an extra decoding/encoding step? The native format for filenames is bytes
since that is what the system calls accept. If Python has to convert the filename to bytes
anyway, why not just feed it bytes
?
comment:7 Changed 6 years ago by
Status: | needs_work → needs_review |
---|
comment:8 follow-up: 10 Changed 6 years ago by
Well for one, the patch as it stands is creating an inhomogeneous list containing bytes
and str
object. Of course, you can do that, but it's a little surprising and confusing. Either treat filenames as byte strings everywhere, or nowhere (but I suggest nowhere because it is user- and developer-hostile to have to look at bytes
objects containing non-ASCII character data :) In fact it's more common to deal with filenames as str
objects in Python for exactly this reason, except for right at the system boundary, or when dealing directly with binary data.
In fact there is no "native format for filenames" in Python in part because it's more complicated than that especially when you consider Windows--there is no one right answer to that question. Which is why all the stdlib functions which take a filesystem path accept both bytes and unicode.
In any case, the general pattern in Python 3 with functions at the system boundary is to immediately convert from bytes to text, use text, and the convert back to bytes when going back out to the system (which in many cases, like opening files, happens transparently in Python). The only exception is when those bytes will never, ever be used in another context, like working with file and network protocols directly. The extra encoding/decoding steps are trivial otherwise.
comment:9 Changed 6 years ago by
FWIW an easy enough workaround is to pass universal_newlines=True
to the subprocess.Popen
. On Python 2 this doesn't make an enormous difference, especially in Sage since it's not likely going to be calling any programs that output CRLFs. But on Python 3 it does make a difference since it wraps the streams in TextIOWrapper
s, which decode bytes to text, by default using locale.getpreferredencoding()
which is generally appropriate.
comment:10 Changed 6 years ago by
Replying to embray:
Either treat filenames as byte strings everywhere, or nowhere (but I suggest nowhere because it is user- and developer-hostile to have to look at
bytes
objects containing non-ASCII character data :) In fact it's more common to deal with filenames asstr
objects in Python for exactly this reason, except for right at the system boundary, or when dealing directly with binary data.
I don't get what the problem is with treating filenames as bytes
everywhere. It seems to me that things are least error-prone if you avoid conversions bytes
<-> str
.
FWIW an easy enough workaround[...]
When you use bytes
, you don't need a workaround at all.
comment:11 follow-ups: 13 14 Changed 6 years ago by
It's not really a "workaround"--poor choice of words. What I mean is it's an easy way to go from binary to text in that context.
It seems to me that things are least error-prone if you avoid conversions bytes <-> str.
That's where you're wrong. The further away bytes get from their original source, the harder it is to know where they came from or how they should be interpreted. This compounded when you have bytes coming from multiple sources, in possibly different encodings, and you try to combine them while ignoring where they came from. Many examples of where this can go wrong are discussed here: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#why-not-just-assume-utf-8-and-avoid-having-to-decode-at-system-boundaries and even in the rather recent PEP 529: https://www.python.org/dev/peps/pep-0529/
You're not in poor company having doubts about this. Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ I don't agree with him (and for some reasons he doesn't give in his list of possible objections), but he's not without a point, and in fact has since pushed for many useful updates to Python 3 to make dealing with bytes and strings a bit less of a hassle.
comment:12 Changed 6 years ago by
Or if you want just a more practical argument, going back to my point that this is mixing bytes and strings in a single list, this easily results in something like:
>>> paths = [b'/usr/lib', '/usr/local/lib'] >>> filename = [os.path.join(p, 'libfoo.so') for p in paths] Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<stdin>", line 1, in <listcomp> File "/usr/lib/python3.4/posixpath.py", line 89, in join "components") from None TypeError: Can't mix strings and bytes in path components
Although that may or may not apply in this specific case, it's just bad practice in general for reasons like this.
comment:13 Changed 6 years ago by
Replying to embray:
The further away bytes get from their original source, the harder it is to know where they came from or how they should be interpreted.
Easy: as a char *
in C. I don't see how one could get this wrong. With str
of course, it is possible to get the encoding wrong.
This compounded when you have bytes coming from multiple sources, in possibly different encodings
This is not the case here.
Many examples of where this can go wrong are discussed here: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#why-not-just-assume-utf-8-and-avoid-having-to-decode-at-system-boundaries
That's a general discussion about using unicode in Python. I am specifically talking about filenames
and even in the rather recent PEP 529: https://www.python.org/dev/peps/pep-0529/
I see. If I understand things correctly, using bytes
for filenames is currently problematic in Windows, but that would be fixed by that PEP. So it's still not a problem in the end.
Replying to embray:
Or if you want just a more practical argument, going back to my point that this is mixing bytes and strings in a single list, this easily results in something like:
>>> paths = [b'/usr/lib', '/usr/local/lib'] >>> filename = [os.path.join(p, 'libfoo.so') for p in paths] Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<stdin>", line 1, in <listcomp> File "/usr/lib/python3.4/posixpath.py", line 89, in join "components") from None TypeError: Can't mix strings and bytes in path components
Right. But I would argue to just use bytes
everywhere.
comment:14 follow-ups: 19 20 Changed 6 years ago by
Replying to embray:
Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
I just read this and I agree almost with everything he says... Python 3 insists too much on unicode.
One particular pet peeve of mine not mentioned in that essay is os.environ
. At the system level, that's just char *
but Python 3 insists on interpreting these byte strings as unicode somehow.
comment:15 Changed 6 years ago by
Branch: | u/chapoton/22044 → 02c50619cbb13a5c8e32a604e2e55642f5fbad4e |
---|---|
Resolution: | → fixed |
Status: | needs_review → closed |
comment:16 follow-up: 25 Changed 6 years ago by
Commit: | 02c50619cbb13a5c8e32a604e2e55642f5fbad4e |
---|
Filenames are byte strings on Linux. I think there is a Windows issue somewhere here but hopefully thats ok if the path is plain ascii. Python is moving towards byte string filenames on Windows, too, to facilitate posix compatibility.
comment:17 follow-up: 23 Changed 6 years ago by
Resolution: | fixed |
---|---|
Status: | closed → new |
Sorry, but no, this is absolutely wrong. Why are you mixing bytes and str objects in a single list in application code?
comment:18 Changed 6 years ago by
It is absolutely, 100% against the philosophy and design of Python 3 to be passing around bytes
objects for filenames, or much anything else that isn't meant to be interacted with as binary data. As soon as a user tries to join a (unicode) string to it all hell breaks loose. In this specific example it is mostly irrelevant, but you're basically arguing that all filenames should be passed around in Sage as bytes
objects which:
a) You haven't implemented yet--either be 100% consistent or go home. To be clear, this also means that you have to fight a constant uphill battle with every Python API that returns
str
, notbytes
for filenames, such asos.listdir
,os.getcwd
,os.curdir
,os.sep
, and many others. Some of these are functions that when passed bytes return bytes (which means if you want bytes you have to make sure they are always passed bytes regardless what some users pastes into their terminal). In other cases they are not.
b) Is, as previously stated, not the right thing to do and is user- and developer-hostile.
You're confusing what filenames are as internally represented by the OS, and what they actually represent, semantically. If I'm a human being, and my home directory is "C:\Users\риго́рий Перельма́н", then my home directory is "C:\Users\риго́рий Перельма́н", not b'C:\\Users\\\xd1\x80\xd0\xb8\xd0\xb3\xd0\xbe\xcc\x81\xd1\x80\xd0\xb8\xd0\xb9 \xd0\x9f\xd0\xb5\xd1\x80\xd0\xb5\xd0\xbb\xd1\x8c\xd0\xbc\xd0\xb0\xcc\x81\xd0\xbd\x00'
, or worse b'\xff\xfeC\x00:\x00\\\x00U\x00s\x00e\x00r\x00s\x00\\\x00@\x048\x043\x04>\x04\x01\x03@\x048\x049\x04 \x00\x1f\x045\x04@\x045\x04;\x04L\x04<\x040\x04\x01\x03=\x04'
. That's how the computer (maybe--depending on the system I'm on or how network filesystems are mounted, etc.) sees it, but that's not what it *is*. Decoding from bytes to unicode codepoints and operating solely on those until it has to go back to the OS the only consistent and safe way to treat such data.
I speak from hard-fought experience here having helped lead the Python 3 porting effort of a large, 10-15 year old codebase, which interacts with legacy ASCII-based applications and file formats, and is used by international users across platforms and had to have backwards compatibility with users' assumptions that strings (i.e., when they type "abc"
) are character strings, not opaque blobs of bytes.
comment:19 Changed 6 years ago by
Replying to jdemeyer:
Replying to embray:
Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
I just read this and I agree almost with everything he says... Python 3 insists too much on unicode.
One particular pet peeve of mine not mentioned in that essay is
os.environ
. At the system level, that's justchar *
but Python 3 insists on interpreting these byte strings as unicode somehow.
You can agree with Armin, and you wouldn't be altogether wrong. But where you are wrong is trying to defy the design and intent of Python 3. Armin brought up these issues in part to influence how Python 3 moved forward in treating these issue more sanely to people like him who frequently work on boundary code. However, nowhere does he argue that Python 3 should just be used incorrectly if you don't have to.
comment:20 follow-ups: 21 22 Changed 6 years ago by
Replying to jdemeyer:
Replying to embray:
Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
I just read this and I agree almost with everything he says... Python 3 insists too much on unicode.
One particular pet peeve of mine not mentioned in that essay is
os.environ
. At the system level, that's justchar *
but Python 3 insists on interpreting these byte strings as unicode somehow.
It's not "somehow"--it's using the sys.getfilesystemencoding()
with surrogateescape handling, which most of the time is right. os.environb
was introduced to read the environment directly as bytes.
comment:22 Changed 6 years ago by
Replying to embray:
It's not "somehow"--it's using the
sys.getfilesystemencoding()
with surrogateescape handling, which most of the time is right.
The "most of the time" is going to bite you someday. bytes
are never wrong.
comment:23 Changed 6 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Whether you or me are right or wrong, you should never re-open a ticket that the release manager has closed. Feel free to continue the discussion on a new ticket or to complain directly to the release manager to revert this.
comment:25 Changed 6 years ago by
Replying to vbraun:
Python is moving towards byte string filenames on Windows, too, to facilitate posix compatibility.
This doesn't make any sense. In Python filenames are neither "bytes" or "str". Python does not have a "filename" type (though it has been argued for in the past, and in fact there's a relevant PEP being drafted somewhere, though I think instead they went with yet another magic method of some kind :)
If you're writing code that is solely interfacing with POSIX interfaces then yes, filenames are just collections of bytes and can stay in bytes form. But outside that narrow context you have to think about filenames in the abstract, which more-often-than-not a text string of human-readable glyphs (I think this is one thing Windows got right, though the POSIX approach has its advantages as well). Filenames are a user-interface. That's why they're not even stored in inodes, but rather in dirents as a convenient way for humans (and programs written and used by humans) to locate files. While it's true in POSIX a filename can be any arbitrary sequence of bytes, and the kernel is agnostic to such sordid details as character encodings, that's an implementation choice.
In Python, although there are wrappers around POSIX interfaces, you're writing Python not POSIX, and there text is treated as text (especially in Python 3) and in most cases filenames too are text. In order wedge the POSIX notion of "filenames are char
arrays" the surrogateescape decoder allows round-tripping arbitrary bytes that can't be decoded to unicode codepoints. The need for this in real filenames is rare--it's more commonly applicable in other contexts where the assumed encoding is not correct.
New commits:
py3 fixing src/sage_setup/autogen/pari/__init__.py