Opened 5 years ago

Closed 5 years ago

#24186 closed enhancement (duplicate)

py3 : introducing string conversion tools

Reported by: Frédéric Chapoton
Owned by:
Priority: major
Milestone: sage-duplicate/invalid/wontfix
Component: python3
Keywords: unicode
Cc: Erik Bray, Jeroen Demeyer, Travis Scrimshaw, François Bissey
Merged in:
Authors:
Reviewers: Frédéric Chapoton
Report Upstream: N/A
Work issues:
Branch: u/chapoton/24186
Commit: 0e6a9e555ea5ba9593e3eddefae82de42da997ec
Dependencies:
Stopgaps:

Description


Change History (22)

comment:1 Changed 5 years ago by Frédéric Chapoton

Branch: u/chapoton/24186
Cc: Erik Bray, Jeroen Demeyer, Travis Scrimshaw, François Bissey added
Commit: 0e6a9e555ea5ba9593e3eddefae82de42da997ec
Keywords: unicode added
Status: new → needs_review

New commits:

0e6a9e5 py3: introducing string conversion tools

comment:2 Changed 5 years ago by Erik Bray

This is definitely the right direction. I'm not 100% sure how I feel about string_to_str. On Python 2, if you already have a unicode object, it's usually not desirable to convert it to str. This function behaves very differently on Python 2 and Python 3: on Python 2 it takes a unicode string and returns a bytes string, whereas on Python 3 it takes a unicode string and returns the same unicode string. That is very confusing, and it's not clear where that behavior is useful.

Rather than "string_to_..." I think it's more useful to have a set of four functions:

bytes_to_str, str_to_bytes, unicode_to_str, str_to_unicode.

On Python 2 bytes_to_str and str_to_bytes are no-ops because bytes is str. Keeping cases that were already str on Python 2 is usually preferable to converting to unicode everywhere, and it is more backwards compatible. You don't want to start converting str to unicode in places that weren't unicode before if you don't have to.

Similarly, on Python 3 unicode_to_str and str_to_unicode are no-ops. This pair is usually less useful, but I suppose it can be sometimes.
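
For illustration only, the four proposed functions could be sketched in plain Python as below. This is not the code from the ticket's branch. The `ENCODING` constant and the choice of UTF-8 are assumptions, since the discussion does not fix an encoding, and the version check happens once at import time.

```python
import sys

# Assumption: UTF-8. The thread does not say which encoding to use.
ENCODING = "utf-8"

if sys.version_info[0] == 2:
    # Python 2: str *is* bytes, so the bytes <-> str pair does nothing.
    def bytes_to_str(b):
        return b

    def str_to_bytes(s):
        return s

    # The unicode <-> str pair does the real work on Python 2.
    def unicode_to_str(u):
        return u.encode(ENCODING)

    def str_to_unicode(s):
        return s.decode(ENCODING)
else:
    # Python 3: str is text, so the bytes <-> str pair encodes/decodes.
    def bytes_to_str(b):
        return b.decode(ENCODING)

    def str_to_bytes(s):
        return s.encode(ENCODING)

    # str *is* unicode on Python 3, so this pair does nothing.
    def unicode_to_str(u):
        return u

    def str_to_unicode(s):
        return s
```

On either version, every function whose name ends in `_to_str` returns the native `str` type, which is the property the proposal relies on.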

comment:3 Changed 5 years ago by Erik Bray

comment:4 Changed 5 years ago by Erik Bray

That said these functions might be useful too, though I'm curious to see some examples of what you have in mind as to where they'd be useful.

comment:5 Changed 5 years ago by Erik Bray

If it helps clarify--the reason for my logic here is that the places where it is most common to want to convert between bytes and unicode are at the boundary of system functions. On Python 2 the tradition here is "str in str out". This can cause problems of course, which is why the unicode model was changed so drastically on Python 3. But the fact remains: unless there was a specific reason to convert from str at the "out" end (for example, we know we're receiving non-ASCII encoded text), "str in str out" is usually the default on Python 2.

However on Python 3 things are very different--we still want, in most cases "str in str out". This is because users and developers are simply more used to dealing with str (that is, the object created with "" or '' literals without any prefixes). But on Python 3 this means we have to explicitly handle the "unicode" sandwich "str -> bytes -> <system> -> bytes -> str".

If we want to write the same code on Python 2 and Python 3, we can write to_str(system_api(to_bytes(my_str))), where to_str and to_bytes are defined as in my comment linked above. This properly preserves the "str in str out" idiom on both Python 2 and 3.

Of course there are exceptions to this case but they are less common.
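
A minimal sketch of that sandwich, under some stated assumptions: `to_bytes` and `to_str` below are illustrative stand-ins rather than the definitions from the linked comment, UTF-8 is an assumed encoding, and `base64.b64encode` merely plays the role of a bytes-only system interface.

```python
import base64

# Assumption: UTF-8. The thread does not say which encoding to use.
ENCODING = "utf-8"


def to_bytes(s):
    # On Python 2, str is bytes, so this returns its input unchanged.
    return s if isinstance(s, bytes) else s.encode(ENCODING)


def to_str(b):
    # On Python 2, bytes is str, so this returns its input unchanged.
    return b if isinstance(b, str) else b.decode(ENCODING)


def system_api(data):
    # Stand-in for an interface that accepts and returns bytes only.
    return base64.b64encode(data)


def encode_text(my_str):
    # str -> bytes -> <system> -> bytes -> str
    return to_str(system_api(to_bytes(my_str)))
```

Callers of `encode_text` only ever see the native `str` type; the byte strings exist solely between the two conversion calls.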

comment:6 Changed 5 years ago by Jeroen Demeyer

And you really want to implement these in Cython, not plain Python.

comment:7 Changed 5 years ago by Erik Bray

Indeed--for example, on Python 2 str_to_bytes() is a no-op. It just immediately returns its input. In C this could just be a macro, but unfortunately Cython doesn't make it possible to define things as macros. That said, if this is a cpdef function, between Cython and the C compiler it should be able to optimize it away entirely. I'm going to have a look if that's actually the case though...

comment:8 in reply to:  7 Changed 5 years ago by Jeroen Demeyer

Replying to embray:

between Cython and the C compiler it should be able to optimize it away entirely.

...if it's a cdef inline function.

Something like

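cdef extern from *:
    int PY_MAJOR_VERSION  # C macro from the Python headers, constant at compile time

cdef inline bytes_to_str(x):
    if PY_MAJOR_VERSION <= 2:
        # Python 2: bytes is str, so only a cast is needed
        return <str>x
    # Python 3: decode; `encoding` is left undefined in this sketch
    return x.decode(encoding)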

Ideally, that would be optimized away entirely in Python 2. This is not entirely true, because Cython still changes some refcounts. But the effect of that should be negligible.

comment:9 Changed 5 years ago by Erik Bray

Indeed since PY_MAJOR_VERSION is a constant that branch gets optimized away completely.

You could make it a cpdef so that the function can be used from pure Python as well.

comment:10 Changed 5 years ago by Frédéric Chapoton

Are you going to propose a branch?

comment:11 Changed 5 years ago by Erik Bray

Sure if you don't mind.

comment:12 Changed 5 years ago by Frédéric Chapoton

Of course I don't mind. I have been asking for that for a long time.

comment:13 Changed 5 years ago by Jeroen Demeyer

Erik, will you do that or should I?

One suggestion: implement the conversion char* -> str first and then build the conversion bytes -> str on top of it. In many cases, bytes in Cython code come from char* in C, and the direct conversion char* -> str will be faster than char* -> bytes -> str. And char* -> str is easy using PyString_FromString (Python 2) or PyUnicode_DecodeFSDefault (Python 3).
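
To show what PyUnicode_DecodeFSDefault does with a C string, here is a small Python 3 sketch that calls it through ctypes. It assumes CPython, where `ctypes.pythonapi` exposes the C API. The helper name `c_string_to_str` is hypothetical, and a real implementation would call the function directly from Cython rather than going through ctypes.

```python
import ctypes
import os

# Bind the C API function: it takes a NUL-terminated char* and returns
# a new str decoded with the filesystem encoding and error handler.
_decode_fs_default = ctypes.pythonapi.PyUnicode_DecodeFSDefault
_decode_fs_default.argtypes = [ctypes.c_char_p]
_decode_fs_default.restype = ctypes.py_object


def c_string_to_str(raw):
    # `raw` is passed as a char*, so decoding stops at the first NUL byte.
    return _decode_fs_default(raw)


# For data without embedded NUL bytes, os.fsdecode is the Python-level
# counterpart and should give the same result.
assert c_string_to_str(b"hello") == os.fsdecode(b"hello")
```

Note that this decodes with the filesystem encoding, which may differ from a fixed encoding such as UTF-8 depending on the platform and locale.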

comment:14 Changed 5 years ago by Jeroen Demeyer

Status: needs_review → needs_work

comment:15 Changed 5 years ago by Erik Bray

Do you mean as separate functions (I'm not sure how else one would do that)?

comment:16 in reply to:  15 Changed 5 years ago by Jeroen Demeyer

Replying to embray:

Do you mean as separate functions (I'm not sure how else one would do that)?

I was thinking something like

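from cpython.bytes cimport PyBytes_AsString

cdef str array_to_str(char* x):
    ...

cpdef str bytes_to_str(b):
    # Delegate to the char* conversion so both paths share one implementation
    return array_to_str(PyBytes_AsString(b))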

comment:17 Changed 5 years ago by Erik Bray

Okay, I have an implementation of this that I'll post to a new ticket in a bit. I just want to try it out in practice a bit more first.

comment:18 Changed 5 years ago by Samuel Lelièvre

Erik's implementation of bytes_to_str and str_to_bytes is at #24222 and needs review.

comment:19 Changed 5 years ago by Jeroen Demeyer

What is the point of this ticket in the light of #24222? Can we consider this a duplicate?

comment:20 in reply to:  19 Changed 5 years ago by Erik Bray

Replying to jdemeyer:

What is the point of this ticket in the light of #24222? Can we consider this a duplicate?

I mean in fairness this came before #24222, which was my offer of an alternative. I think we can do without this for now. One thing I do like about it is that it supports conversion of unicode objects on Python 2 to "bytes". #24222 eschews any attempt at general unicode support on Python 2, but it might be worth adding support for passing unicode through my str_to_bytes() function at some point.

Other than that though I don't think this ticket itself is needed.

comment:21 Changed 5 years ago by Frédéric Chapoton

Milestone: sage-8.1 → sage-duplicate/invalid/wontfix
Status: needs_work → needs_review

Then let us close this one as a duplicate.

comment:22 Changed 5 years ago by Jeroen Demeyer

Authors: Frédéric Chapoton
Resolution: duplicate
Reviewers: Frédéric Chapoton
Status: needs_review → closed