Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#22044 closed defect (fixed)

fix incompatibility with py3 in autogen/pari

Reported by: chapoton Owned by:
Priority: major Milestone: sage-7.5
Component: python3 Keywords:
Cc: embray, mkoeppe, jdemeyer Merged in:
Authors: Frédéric Chapoton Reviewers: Jeroen Demeyer
Report Upstream: N/A Work issues:
Branch: 02c5061 (Commits) Commit:
Dependencies: Stopgaps:

Description (last modified by embray)

caused by #21613

Change History (25)

comment:1 Changed 3 years ago by chapoton

  • Branch set to u/chapoton/22044
  • Cc embray mkoeppe jdemeyer added
  • Commit set to 02c50619cbb13a5c8e32a604e2e55642f5fbad4e
  • Status changed from new to needs_review

New commits:

02c5061py3 fixing src/sage_setup/autogen/pari/__init__.py

comment:2 Changed 3 years ago by jdemeyer

  • Reviewers set to Jeroen Demeyer
  • Status changed from needs_review to positive_review

comment:3 follow-up: Changed 3 years ago by embray

  • Description modified (diff)
  • Status changed from positive_review to needs_work

This does't look right to me. Instead you should change pari_share to return a string instead of bytes (by decoding the output with the system's filesystem encoding): https://git.sagemath.org/sage.git/tree/src/sage_setup/autogen/pari/parser.py?id=02c50619cbb13a5c8e32a604e2e55642f5fbad4e#n41

Last edited 3 years ago by embray (previous) (diff)

comment:4 Changed 3 years ago by embray

  • Description modified (diff)

comment:5 Changed 3 years ago by chapoton

Please take care of that in whatever way you want.

comment:6 in reply to: ↑ 3 Changed 3 years ago by jdemeyer

Replying to embray:

Instead you should change pari_share to return a string instead of bytes

Why introduce an extra decoding/encoding step? The native format for filenames is bytes since that is what the system calls accept. If Python has to convert the filename to bytes anyway, why not just feed it bytes?

comment:7 Changed 3 years ago by jdemeyer

  • Status changed from needs_work to needs_review

comment:8 follow-up: Changed 3 years ago by embray

Well for one, the patch as it stands is creating an inhomogeneous list containing bytes and str object. Of course, you can do that, but it's a little surprising and confusing. Either treat filenames as byte strings everywhere, or nowhere (but I suggest nowhere because it is user- and developer-hostile to have to look at bytes objects containing non-ASCII character data :) In fact it's more common to deal with filenames as str objects in Python for exactly this reason, except for right at the system boundary, or when dealing directly with binary data.

In fact there is no "native format for filenames" in Python in part because it's more complicated than that especially when you consider Windows--there is no one right answer to that question. Which is why all the stdlib functions which take a filesystem path accept both bytes and unicode.

In any case, the general pattern in Python 3 with functions at the system boundary is to immediately convert from bytes to text, use text, and the convert back to bytes when going back out to the system (which in many cases, like opening files, happens transparently in Python). The only exception is when those bytes will never, ever be used in another context, like working with file and network protocols directly. The extra encoding/decoding steps are trivial otherwise.

comment:9 Changed 3 years ago by embray

FWIW an easy enough workaround is to pass universal_newlines=True to the subprocess.Popen. On Python 2 this doesn't make an enormous difference, especially in Sage since it's not likely going to be calling any programs that output CRLFs. But on Python 3 it does make a difference since it wraps the streams in TextIOWrappers, which decode bytes to text, by default using locale.getpreferredencoding() which is generally appropriate.

comment:10 in reply to: ↑ 8 Changed 3 years ago by jdemeyer

Replying to embray:

Either treat filenames as byte strings everywhere, or nowhere (but I suggest nowhere because it is user- and developer-hostile to have to look at bytes objects containing non-ASCII character data :) In fact it's more common to deal with filenames as str objects in Python for exactly this reason, except for right at the system boundary, or when dealing directly with binary data.

I don't get what the problem is with treating filenames as bytes everywhere. It seems to me that things are least error-prone if you avoid conversions bytes <-> str.

FWIW an easy enough workaround[...]

When you use bytes, you don't need a workaround at all.

comment:11 follow-ups: Changed 3 years ago by embray

It's not really a "workaround"--poor choice of words. What I mean is it's an easy way to go from binary to text in that context.

It seems to me that things are least error-prone if you avoid conversions bytes <-> str.

That's where you're wrong. The further away bytes get from their original source, the harder it is to know where they came from or how they should be interpreted. This compounded when you have bytes coming from multiple sources, in possibly different encodings, and you try to combine them while ignoring where they came from. Many examples of where this can go wrong are discussed here: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#why-not-just-assume-utf-8-and-avoid-having-to-decode-at-system-boundaries and even in the rather recent PEP 529: https://www.python.org/dev/peps/pep-0529/

You're not in poor company having doubts about this. Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ I don't agree with him (and for some reasons he doesn't give in his list of possible objections), but he's not without a point, and in fact has since pushed for many useful updates to Python 3 to make dealing with bytes and strings a bit less of a hassle.

comment:12 Changed 3 years ago by embray

Or if you want just a more practical argument, going back to my point that this is mixing bytes and strings in a single list, this easily results in something like:

>>> paths = [b'/usr/lib', '/usr/local/lib']
>>> filename = [os.path.join(p, 'libfoo.so') for p in paths]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.4/posixpath.py", line 89, in join
    "components") from None
TypeError: Can't mix strings and bytes in path components

Although that may or may not apply in this specific case, it's just bad practice in general for reasons like this.

Version 0, edited 3 years ago by embray (next)

comment:13 in reply to: ↑ 11 Changed 3 years ago by jdemeyer

Replying to embray:

The further away bytes get from their original source, the harder it is to know where they came from or how they should be interpreted.

Easy: as a char * in C. I don't see how one could get this wrong. With str of course, it is possible to get the encoding wrong.

This compounded when you have bytes coming from multiple sources, in possibly different encodings

This is not the case here.

Many examples of where this can go wrong are discussed here: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#why-not-just-assume-utf-8-and-avoid-having-to-decode-at-system-boundaries

That's a general discussion about using unicode in Python. I am specifically talking about filenames

and even in the rather recent PEP 529: https://www.python.org/dev/peps/pep-0529/

I see. If I understand things correctly, using bytes for filenames is currently problematic in Windows, but that would be fixed by that PEP. So it's still not a problem in the end.

Replying to embray:

Or if you want just a more practical argument, going back to my point that this is mixing bytes and strings in a single list, this easily results in something like:

>>> paths = [b'/usr/lib', '/usr/local/lib']
>>> filename = [os.path.join(p, 'libfoo.so') for p in paths]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.4/posixpath.py", line 89, in join
    "components") from None
TypeError: Can't mix strings and bytes in path components

Right. But I would argue to just use bytes everywhere.

comment:14 in reply to: ↑ 11 ; follow-ups: Changed 3 years ago by jdemeyer

Replying to embray:

Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

I just read this and I agree almost with everything he says... Python 3 insists too much on unicode.

One particular pet peeve of mine not mentioned in that essay is os.environ. At the system level, that's just char * but Python 3 insists on interpreting these byte strings as unicode somehow.

comment:15 Changed 3 years ago by vbraun

  • Branch changed from u/chapoton/22044 to 02c50619cbb13a5c8e32a604e2e55642f5fbad4e
  • Resolution set to fixed
  • Status changed from needs_review to closed

comment:16 follow-up: Changed 3 years ago by vbraun

  • Commit 02c50619cbb13a5c8e32a604e2e55642f5fbad4e deleted

Filenames are byte strings on Linux. I think there is a Windows issue somewhere here but hopefully thats ok if the path is plain ascii. Python is moving towards byte string filenames on Windows, too, to facilitate posix compatibility.

comment:17 follow-up: Changed 3 years ago by embray

  • Resolution fixed deleted
  • Status changed from closed to new

Sorry, but no, this is absolutely wrong. Why are you mixing bytes and str objects in a single list in application code?

comment:18 Changed 3 years ago by embray

It is absolutely, 100% against the philosophy and design of Python 3 to be passing around bytes objects for filenames, or much anything else that isn't meant to be interacted with as binary data. As soon as a user tries to join a (unicode) string to it all hell breaks loose. In this specific example it is mostly irrelevant, but you're basically arguing that all filenames should be passed around in Sage as bytes objects which:

a) You haven't implemented yet--either be 100% consistent or go home. To be clear, this also means that you have to fight a constant uphill battle with every Python API that returns str, not bytes for filenames, such as os.listdir, os.getcwd, os.curdir, os.sep, and many others. Some of these are functions that when passed bytes return bytes (which means if you want bytes you have to make sure they are always passed bytes regardless what some users pastes into their terminal). In other cases they are not.

b) Is, as previously stated, not the right thing to do and is user- and developer-hostile.

You're confusing what filenames are as internally represented by the OS, and what they actually represent, semantically. If I'm a human being, and my home directory is "C:\Users\риго́рий Перельма́н", then my home directory is "C:\Users\риго́рий Перельма́н", not b'C:\\Users\\\xd1\x80\xd0\xb8\xd0\xb3\xd0\xbe\xcc\x81\xd1\x80\xd0\xb8\xd0\xb9 \xd0\x9f\xd0\xb5\xd1\x80\xd0\xb5\xd0\xbb\xd1\x8c\xd0\xbc\xd0\xb0\xcc\x81\xd0\xbd\x00', or worse b'\xff\xfeC\x00:\x00\\\x00U\x00s\x00e\x00r\x00s\x00\\\x00@\x048\x043\x04>\x04\x01\x03@\x048\x049\x04 \x00\x1f\x045\x04@\x045\x04;\x04L\x04<\x040\x04\x01\x03=\x04'. That's how the computer (maybe--depending on the system I'm on or how network filesystems are mounted, etc.) sees it, but that's not what it *is*. Decoding from bytes to unicode codepoints and operating solely on those until it has to go back to the OS the only consistent and safe way to treat such data.

I speak from hard-fought experience here having helped lead the Python 3 porting effort of a large, 10-15 year old codebase, which interacts with legacy ASCII-based applications and file formats, and is used by international users across platforms and had to have backwards compatibility with users' assumptions that strings (i.e., when they type "abc") are character strings, not opaque blobs of bytes.

Last edited 3 years ago by embray (previous) (diff)

comment:19 in reply to: ↑ 14 Changed 3 years ago by embray

Replying to jdemeyer:

Replying to embray:

Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

I just read this and I agree almost with everything he says... Python 3 insists too much on unicode.

One particular pet peeve of mine not mentioned in that essay is os.environ. At the system level, that's just char * but Python 3 insists on interpreting these byte strings as unicode somehow.

You can agree with Armin, and you wouldn't be altogether wrong. But where you are wrong is trying to defy the design and intent of Python 3. Armin brought up these issues in part to influence how Python 3 moved forward in treating these issue more sanely to people like him who frequently work on boundary code. However, nowhere does he argue that Python 3 should just be used incorrectly if you don't have to.

comment:20 in reply to: ↑ 14 ; follow-ups: Changed 3 years ago by embray

Replying to jdemeyer:

One particular pet peeve of mine not mentioned in that essay is os.environ. At the system level, that's just char * but Python 3 insists on interpreting these byte strings as unicode somehow.

It's not "somehow"--it's using the sys.getfilesystemencoding() with surrogateescape handling, which most of the time is right. os.environb was introduced to read the environment directly as bytes.

Last edited 3 years ago by embray (previous) (diff)

comment:21 in reply to: ↑ 20 Changed 3 years ago by embray

Last edited 3 years ago by embray (previous) (diff)

comment:22 in reply to: ↑ 20 Changed 3 years ago by jdemeyer

Replying to embray:

It's not "somehow"--it's using the sys.getfilesystemencoding() with surrogateescape handling, which most of the time is right.

The "most of the time" is going to bite you someday. bytes are never wrong.

comment:23 in reply to: ↑ 17 Changed 3 years ago by jdemeyer

  • Resolution set to fixed
  • Status changed from new to closed

Whether you or me are right or wrong, you should never re-open a ticket that the release manager has closed. Feel free to continue the discussion on a new ticket or to complain directly to the release manager to revert this.

comment:24 Changed 3 years ago by embray

That's fair.

comment:25 in reply to: ↑ 16 Changed 3 years ago by embray

Replying to vbraun:

Python is moving towards byte string filenames on Windows, too, to facilitate posix compatibility.

This doesn't make any sense. In Python filenames are neither "bytes" or "str". Python does not have a "filename" type (though it has been argued for in the past, and in fact there's a relevant PEP being drafted somewhere, though I think instead they went with yet another magic method of some kind :)

If you're writing code that is solely interfacing with POSIX interfaces then yes, filenames are just collections of bytes and can stay in bytes form. But outside that narrow context you have to think about filenames in the abstract, which more-often-than-not a text string of human-readable glyphs (I think this is one thing Windows got right, though the POSIX approach has its advantages as well). Filenames are a user-interface. That's why they're not even stored in inodes, but rather in dirents as a convenient way for humans (and programs written and used by humans) to locate files. While it's true in POSIX a filename can be any arbitrary sequence of bytes, and the kernel is agnostic to such sordid details as character encodings, that's an implementation choice.

In Python, although there are wrappers around POSIX interfaces, you're writing Python not POSIX, and there text is treated as text (especially in Python 3) and in most cases filenames too are text. In order wedge the POSIX notion of "filenames are char arrays" the surrogateescape decoder allows round-tripping arbitrary bytes that can't be decoded to unicode codepoints. The need for this in real filenames is rare--it's more commonly applicable in other contexts where the assumed encoding is not correct.

Note: See TracTickets for help on using tickets.