#24186 closed enhancement (duplicate)

py3 : introducing string conversion tools

Reported by: chapoton Owned by:
Priority: major Milestone: sage-duplicate/invalid/wontfix
Component: python3 Keywords: unicode
Cc: embray, jdemeyer, tscrim, fbissey Merged in:
Authors: Reviewers: Frédéric Chapoton
Report Upstream: N/A Work issues:
Branch: u/chapoton/24186 Commit: 0e6a9e555ea5ba9593e3eddefae82de42da997ec
Dependencies: Stopgaps:

Description


Change History (22)

comment:1 Changed 22 months ago by chapoton

  • Branch set to u/chapoton/24186
  • Cc embray jdemeyer tscrim fbissey added
  • Commit set to 0e6a9e555ea5ba9593e3eddefae82de42da997ec
  • Keywords unicode added
  • Status changed from new to needs_review

New commits:

0e6a9e5py3: introducing string conversion tools

comment:2 Changed 22 months ago by embray

This is definitely the right direction. I'm not 100% sure how I feel about string_to_str. Usually it's not desirable on Python 2, if you already have a unicode object, to convert it to str. This function has very orthogonal behavior between Python 2 and Python 3 because on Python 2 it takes a unicode string and returns a bytes string, whereas on Python 3 it takes a unicode string and returns the same unicode string--very confusing and not clear where that behavior is useful.

Rather than "string_to_..." I think it's more useful to have a set of four functions:

bytes_to_str, str_to_bytes, unicode_to_str, str_to_unicode.

On Python 2 bytes_to_str and str_to_bytes are no-ops because bytes is str. Keeping cases that were already str on Python 2 is usually preferable than converting to unicode everywhere. This is more backwards compatible. You don't want to start converting str to unicode everywhere that wasn't unicode before if you don't have to.

Similarly on Python 3 unicode_to_str and str_to_unicode are no-ops. This pair is also not usually as useful but it can be sometimes I suppose.

comment:3 Changed 22 months ago by embray

comment:4 Changed 22 months ago by embray

That said these functions might be useful too, though I'm curious to see some examples of what you have in mind as to where they'd be useful.

comment:5 Changed 22 months ago by embray

If it helps clarify--the reason for my logic here is that the areas where it is most common to want to convert between bytes/unicode is at the boundary of system functions. On Python 2 the tradition here is "str in str out". This can cause problems of course, which is why the unicode model was changed so drastically on Python 3. But the fact remains, unless there was a specific reason to convert from str to unicode at the "out" end (for example, we know we're receiving non-ASCII encoded text), usually "str in str out" is the default on Python 2.

However on Python 3 things are very different--we still want, in most cases "str in str out". This is because users and developers are simply more used to dealing with str (that is, the object created with "" or '' literals without any prefixes). But on Python 3 this means we have to explicitly handle the "unicode" sandwich "str -> bytes -> <system> -> bytes -> str".

If we want to write the same code on Python 2 and Python 3 we can write to_str(system_api(to_bytes(my_str))). Where to_str and to_bytes are defined as in my comment linked above. This properly preserves the "str in str out" idiom on both Python 2 and 3.

Of course there are exceptions to this case but they are less common.

Last edited 22 months ago by embray (previous) (diff)

comment:6 Changed 22 months ago by jdemeyer

And you really want to implement these in Cython, not plain Python.

comment:7 follow-up: Changed 22 months ago by embray

Indeed--for example for the Python to str_to_bytes() this is a no-op. It just immediately returns its input. In C this could just be a macro but unfortunately Cython doesn't make it possible to define things as macros. That said, if this is a cpdef function, between Cython and the C compiler it should be able to optimize it away entirely. I'm going to have a look if that's actually the case though...

comment:8 in reply to: ↑ 7 Changed 22 months ago by jdemeyer

Replying to embray:

between Cython and the C compiler it should be able to optimize it away entirely.

...if it's a cdef inline function.

Something like

cdef extern from *:
    int PY_MAJOR_VERSION

cdef inline bytes_to_str(x):
    if PY_MAJOR_VERSION <= 2:
        return <str>x
    return x.decode(encoding)

Ideally, that would be optimized away entirely in Python 2. This is not entirely true, because Cython still changes some refcounts. But the effect of that should be negligible.

comment:9 Changed 22 months ago by embray

Indeed since PY_MAJOR_VERSION is a constant that branch gets optimized away completely.

You could make it a cpdef so that the function can be used from pure Python as well.

comment:10 Changed 22 months ago by chapoton

Are you going to propose a branch ?

comment:11 Changed 22 months ago by embray

Sure if you don't mind.

comment:12 Changed 21 months ago by chapoton

Of course I don't mind. I am asking for that since long.

comment:13 Changed 21 months ago by jdemeyer

Erik, will you do that or should I?

One suggestion: I suggest to implement conversion char* -> str and then do conversion bytes -> str using that. This is because in many cases bytes in Cython code comes from char* in C and the direct conversion char* -> str will be faster than char* -> bytes -> str. And char* -> str is easy using PyString_FromString (Python 2) or PyUnicode_DecodeFSDefault (Python 3).

comment:14 Changed 21 months ago by jdemeyer

  • Status changed from needs_review to needs_work

comment:15 follow-up: Changed 21 months ago by embray

Do you mean as separate functions (I'm not sure how else one would do that)?

comment:16 in reply to: ↑ 15 Changed 21 months ago by jdemeyer

Replying to embray:

Do you mean as separate functions (I'm not sure how else one would do that)?

I was thinking something like

cdef str array_to_str(char* x):
    ...

cpdef str bytes_to_str(b):
    return array_to_str(PyBytes_AsString(b))

comment:17 Changed 21 months ago by embray

Okay, I have an implementation of this I'll post to a new ticket in a bit. I just want to try it out a bit in practice more.

comment:18 Changed 21 months ago by slelievre

Erik's implementation of bytes_to_str and str_to_bytes is at #24222 and needs review.

comment:19 follow-up: Changed 21 months ago by jdemeyer

What is the point of this ticket in the light of #24222? Can we consider this a duplicate?

comment:20 in reply to: ↑ 19 Changed 21 months ago by embray

Replying to jdemeyer:

What is the point of this ticket in the light of #24222? Can we consider this a duplicate?

I mean in fairness this came before #24222, which was my offer of an alternative. I think we can do without this for now. One thing I do like about it is that it supports conversion of unicode objects on Python 2 to "bytes". #24222 eschews any attempt at general unicode support on Python 2, but it might be worth adding support for passing unicode through my str_to_bytes() function at some point.

Other than that though I don't think this ticket itself is needed.

comment:21 Changed 21 months ago by chapoton

  • Milestone changed from sage-8.1 to sage-duplicate/invalid/wontfix
  • Status changed from needs_work to needs_review

then let us close this one as duplicate

comment:22 Changed 21 months ago by jdemeyer

  • Authors Frédéric Chapoton deleted
  • Resolution set to duplicate
  • Reviewers set to Frédéric Chapoton
  • Status changed from needs_review to closed
Note: See TracTickets for help on using tickets.