Opened 3 years ago

Closed 3 years ago

#22108 closed defect (invalid)

Don't mix bytes and strings for pathnames

Reported by: embray Owned by:
Priority: major Milestone: sage-duplicate/invalid/wontfix
Component: python3 Keywords:
Cc: Merged in:
Authors: Reviewers: Erik Bray
Report Upstream: N/A Work issues:
Branch: u/embray/ticket-22108 (Commits) Commit: c108395e7dc6742b1850b8fbf7ce66a253b9fe53
Dependencies: Stopgaps:

Description

This is a followup to #22044 to fix that fix. bytes and str objects should not be mixed together arbitrarily in Python 3, and having them both in the same list is generally a mistake except in some corner cases of code explicitly designed to consume both. We should not set a bad precedent.

Change History (22)

comment:1 Changed 3 years ago by embray

  • Component changed from PLEASE CHANGE to python3

comment:2 Changed 3 years ago by embray

  • Authors set to Erik Bray
  • Branch set to u/embray/ticket-22108
  • Commit set to c108395e7dc6742b1850b8fbf7ce66a253b9fe53
  • Status changed from new to needs_review

New commits:

c108395Don't mix bytes and strings sloppily.

comment:3 Changed 3 years ago by chapoton

There is another one, maybe:

git grep "b'pari" src
src/sage_setup/autogen/pari/__init__.py:
    src_files = [join(pari_share(), b'pari.desc')] + \
src/sage_setup/autogen/pari/parser.py:
    with open(os.path.join(pari_share(), b'pari.desc')) as f:

comment:4 Changed 3 years ago by embray

Ah, thanks for pointing it out. That one didn't come from #22044 so maybe that's what inspired it?

The latter example makes a little more sense since it stands alone, though it's better not to leave things as bytes when you don't have to in the first place.

comment:5 Changed 3 years ago by chapoton

yes, indeed, #22044 was inspired by that previous similar change (that also restore a broken py3 building process).

so do you want a positive review, or what ?

comment:6 Changed 3 years ago by embray

  • Status changed from needs_review to needs_work

Well, I'll need to fix the other example now too because it's probably broken.

comment:7 Changed 3 years ago by chapoton

  • Milestone changed from sage-7.5 to sage-7.6

comment:8 Changed 3 years ago by jdemeyer

Some thoughts... the module src/sage/libs/cypari2 and the corresponding src/sage_setup/autogen/pari is going to be split from Sage as an independent Python package. This work is now essentially done (in the sense that we could probably finish it tomorrow if we make a push for it). In that case, this ticket would not be relevant to Sage anymore (but it could be moved to a pull request of the new cypari2 package).

Note also that I am personally against the change that this ticket makes (see #22044 for some discussion). I think that bytes are the right thing to use for filenames and that Python3 simply insists too much on str where bytes are more appropriate. Perhaps mixing str and bytes is bad but I would rather use bytes everywhere.

comment:9 follow-up: Changed 3 years ago by embray

I would rather use bytes everywhere

Again, this is both user and developer hostile--you are free to disagree with the design of Python 3 (though I'll point out this argument is something of a dead horse by now--Python 3.0 was released in 2008 and this argument has been had ad infinitum since then), but trying to fight against it is going to result in a big mess.

comment:10 in reply to: ↑ 9 Changed 3 years ago by jdemeyer

Replying to embray:

you are free to disagree with the design of Python 3

I am certainly taking that freedom :-)

I looked a bit more into this and I disagree even more with you now (nothing personal though!). According to this and that, Python 3 uses sys.getfilesystemencoding() for filenames while it uses locale.getpreferredencoding() for Popen(universal_newlines=True). This means that your code might break in the case that these two encodings differ.

The existing code using bytes doesn't care about encodings at all, so it doesn't have this problem.

comment:11 Changed 3 years ago by jdemeyer

Also: the fact that we need to worry about encodings in the first place with Python 3 (regardless of whether the code is correct or not) is already sufficient to make me avoid using Unicode for filenames.

comment:12 follow-ups: Changed 3 years ago by embray

I'm afraid you don't have that freedom--the rest of the world does. I'd be surprised if I had to say that though.

This means that your code might break in the case that these two encodings differ.

This is actually a perfect reason for why you should decode to text. In this particular example you're calling a problem that is *outputting* the name of a file. I actually have no idea how gp handles encodings, if it even does at all. But *in principle* it can be outputting filenames as text in a different encoding than it was stored in the filesystem (in gp's case it's probably just trying the filenames are raw byte strings though and not dealing with encodings at all, this is true). On Windows, the locale encoding and the filesystem encoding can definitely be different (this is increasingly less the case but will still turn up, especially in Asian countries).

In fact, on *nix it doesn't generally matter because sys.getfilesystemencoding() is the same as the locale encoding at the time the Python interpreter is started as documented [here](https://docs.python.org/3.7/library/sys.html#sys.getfilesystemencoding) and you can see also [here](https://hg.python.org/cpython/file/tip/Python/pylifecycle.c#l984). This is because there is no reliable way to make assumptions about filesystem encoding on *nix, and of course files on Unix are just bytes anyways and that's more or less how Python treats them. If your locale happens to use the encoding that filenames were saved as (which it generally will if your system is properly configured, or you're not SSHing to strange machines) then you're in luck--you can actually read your filenames. If they don't match up you might get gibberish but at least the surrogateescape handler protects you from losing information. In the vast majority of cases you just get the right text though.

The idiom you're expected to use in Python 3 is referred to as the "unicode sandwich". Only use bytes objects immediately at the boundary of the system and the application. Use str everywhere else for processing, and then bytes again only when going back out to the system. Most users never even have to use bytes directly because library authors who have thought about this a lot more than the user has have already handled this (such as in much of Python's stdlib, or libraries like Paramiko or Werkzeug). To be clear--it's perfectly fine in some cases to take binary output from a subprocess call and manipulate it as binary if you're dealing with the data at a binary level and extracting information from it rather than using the data as a string. For example, if you ran gcc -v and parsed the version string you could do that all as bytes, and then parse the version into a tuple of ints. The data never gets treated as text.

If you don't see it yet then please just trust me on this--I have a lot of experience dealing with this. Unless you're directly manipulating binary protocols or file formats you don't want your code littered with bytes.

comment:13 in reply to: ↑ 12 ; follow-up: Changed 3 years ago by jdemeyer

First of all, on OS X, it seems that the encodings can be different: https://hg.python.org/cpython/file/tip/Python/bltinmodule.c#l25

Second (see also 11), why should I even care about such implementation details like whether or not Python thinks that the filesystem encoding and locale encodings are the same.

But it seems that you insisting on this, so I suggest the following solution: instead of using universal_newlines, explicitly decode with sys.getfilesystemencoding().

comment:14 in reply to: ↑ 12 ; follow-up: Changed 3 years ago by jdemeyer

Replying to embray:

The idiom you're expected to use in Python 3 is referred to as the "unicode sandwich". Only use bytes objects immediately at the boundary of the system and the application. Use str everywhere else for processing, and then bytes again only when going back out to the system.

I think this is a good principle, but it involves too much guessing. Python doesn't really know the filesystem encoding, it guesses that it's the same as the locale encoding. It doesn't know the encoding of sys.stdout, so it guesses something. Most of the time it will guess correctly and often there are no bad consequences for guessing wrongly. But it's not robust. You cannot blame Python for not knowing the right encodings, but then it shouldn't guess either.

comment:15 in reply to: ↑ 13 ; follow-up: Changed 3 years ago by embray

Replying to jdemeyer:

Second (see also 11), why should I even care about such implementation details like whether or not Python thinks that the filesystem encoding and locale encodings are the same.

Well I don't know if you mean in general or why I bring it up. In general you'd want to know so that you can understand and debug any encoding issues that might arise and it's helpful to know how Python is determining things on different platforms.

In particular I bring it up because you write:

According to ​this and ​that, Python 3 uses sys.getfilesystemencoding() for filenames while it uses locale.getpreferredencoding() for Popen(universal_newlines=True). This means that your code might break in the case that these two encodings differ.

And I'm pointing out to you that on Linux at least these are both the same thing. The only reason the different functions are being used in the different contexts is because on some systems (particularly Windows) they are not the same.

But it seems that you insisting on this, so I suggest the following solution: instead of using universal_newlines, explicitly decode with sys.getfilesystemencoding().

You could do that and in many cases it would be fine (on Linux it will be fine because it's the same as what it's already doing--but then why introduce the extra step unnecessarily?). But in the general case that is not necessarily the correct thing to do.

comment:16 in reply to: ↑ 14 Changed 3 years ago by embray

Replying to jdemeyer:

Replying to embray:

The idiom you're expected to use in Python 3 is referred to as the "unicode sandwich". Only use bytes objects immediately at the boundary of the system and the application. Use str everywhere else for processing, and then bytes again only when going back out to the system.

I think this is a good principle, but it involves too much guessing. Python doesn't really know the filesystem encoding, it guesses that it's the same as the locale encoding. It doesn't know the encoding of sys.stdout, so it guesses something. Most of the time it will guess correctly and often there are no bad consequences for guessing wrongly. But it's not robust. You cannot blame Python for not knowing the right encodings, but then it shouldn't guess either.

No, if anyone deserves "blame" here it would be the OS designers for either not caring (Unix) or caring but in a convoluted way (Windows). Literally decades of debate and discussion have gone into the current design, and it's still far from perfect due to the inherent difficulties you alluded to. That said, I wouldn't say it's not "robust". What does "robust" mean to you in this case? That it always gets the right encoding? We've established that that's presently not possible (especially on Unix). But you could also have a measure of "robustness" that is "does the right thing in the majority of cases that affect real users, and still round-trips data correctly regardless" which is what the goal the current defaults are designed toward.

comment:17 in reply to: ↑ 15 Changed 3 years ago by jdemeyer

Replying to embray:

You could do that and in many cases it would be fine (on Linux it will be fine because it's the same as what it's already doing--but then why introduce the extra step unnecessarily?). But in the general case that is not necessarily the correct thing to do.

So then what would be the right thing to do?

comment:18 Changed 3 years ago by embray

Let me show you a concrete example. This is in Windows, and isn't directly applicable to this ticket, though it is possible to construct such a scenario on Linux as well.

Let's say I make a filename with some unicode characters in it:

>>> import os
>>> os.mkdir('temp')
>>> os.chdir('temp')
>>> os.listdir('.')
[]
>>> filename = 'résumé.txt'
>>> open(filename, 'w').close()
>>> os.listdir('.')
['résumé.txt']

As you can see, os.listdir('.') correctly decodes the filename, in this case using sys.getfilesystemencoding() which happens to be:

>>> os.listdir(b'.')
[b'r\xe9sum\xe9.txt']
>>> sys.getfilesystemencoding()
'mbcs'
>>> os.listdir(b'.')[0].decode(sys.getfilesystemencoding())
'résumé.txt'

Here 'mbcs' is not a specific encoding--rather, it means to decode with the system default multi-byte codepage. Frequently this is the same as what locale.getpreferredencoding() is but it doesn't have to be.

But let's try a particularly ugly example of where you can go wrong with bad assumptions (and in this case Python's default assumption is wrong too--I'm not arguing that it will always be right):

>>> import subprocess as sp
>>> p = sp.Popen(['cmd.exe', '/c', 'dir', '/B'], stdout=sp.PIPE)
>>> out, _ = p.communicate()
>>> out
b'r\x82sum\x82.txt\r\n'

Well, what it is printing are filenames, but does that mean the encoding has anything to do with sys.getfilesystemencoding()? No!

>>> out.decode(sys.getfilesystemencoding())
'r\u201asum\u201a.txt\r\n'

It happens to decode without errors, but...

>>> import unicodedata
>>> unicodedata.name('\u201a')
'SINGLE LOW-9 QUOTATION MARK'

Whatever that is. So if I just assumed "well, this program prints filenames, and filenames are just bytes, so that must be the filename and I'll pass it around the application as bytes", once I try to pass it back to a system call:

#!
>>> out_filename = out.rstrip()
>>> open(out_filename).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: b'r\x82sum\x82.txt'

Part of what makes this a particularly nasty example is that here universal_newlines=True isn't necessarily right either. This is particular to cmd.exe on Windows, which is allowed to default to any codepage, not just the one associated with the system's current locale. This is for the sake of supporting legacy DOS-based applications that are not aware of or are not compatible with the OS's locale codepage. In fact my cmd.exe is currently set to 'cp850', so:

>>> out_filename.decode('cp850')
'résumé.txt'

So in the particular case of running 'cmd.exe' (to which the subprocess module is agnostic) I do have to take output as bytes and worry about the correct encoding to decode with. With most "modern" software--including modern command line applications built on Windows--this is less likely to be a problem. Regardless, sys.getfilesystemencoding() was the wrong assumption. (Annoyingly, universal_newlines=True defaults to locale.getpreferredencoding() which may, in cases like this, not be correct either. Somehow an explicit encoding argument was not added to subprocess.Popen until Python 3.6.)

You can concoct a similar example on Linux by setting your current locale to some encoding and writing out some files with non-ASCII filenames, then changing your locale to use a different encoding and run Popen(['ls']) or something like that. That said, this is not the average case, so it's not worth throwing the baby out with the bathwater.

Version 0, edited 3 years ago by embray (next)

comment:19 Changed 3 years ago by jdemeyer

Just a link to a similar discussion: https://github.com/jupyter/nbformat/issues/76

comment:20 Changed 3 years ago by embray

I don't think it's really that similar. Having APIs that write to files accept bytes objects for filenames makes sense (generally) within the unicode sandwich framework--especially if those filenames are just being passed to lower-level system APIs.

comment:21 Changed 3 years ago by chapoton

  • Milestone changed from sage-7.6 to sage-8.0

comment:22 Changed 3 years ago by jdemeyer

  • Authors Erik Bray deleted
  • Milestone changed from sage-8.0 to sage-duplicate/invalid/wontfix
  • Resolution set to invalid
  • Reviewers set to Erik Bray
  • Status changed from needs_work to closed

Superseded by #20238.

Note: See TracTickets for help on using tickets.