#24222 closed defect (fixed)

py3: simplified string conversion utilities

Reported by: embray Owned by:
Priority: major Milestone: sage-8.2
Component: python3 Keywords:
Cc: chapoton, jdemeyer Merged in:
Authors: Erik Bray, Jeroen Demeyer Reviewers: Jeroen Demeyer, Erik Bray
Report Upstream: N/A Work issues:
Branch: dec9f3a (Commits) Commit: dec9f3ac79a992d3152366ae266f1065c70c7867
Dependencies: Stopgaps:

Description

A possible alternative to #24186, implementing simple conversion from C char arrays or bytes objects to str objects, and of str objects to bytes objects. Here "str" and "bytes" are to be read exactly for either Python 2 or Python 3, so on Python 2 this means no conversion is performed since str is bytes == True.

One thing this does not do is implement any kind of conversion from Python 2 unicode objects to bytes. This functionality might be worth adding, in some form, to str_to_bytes. But this would add a new feature on Python 2, whereas for now I'm only trying to preserve the existing functionality on Python 2 exactly, while transparently supporting Python 3 strs everywhere that Python 2 strs are supported.

Change History (96)

comment:1 Changed 22 months ago by git

  • Commit set to 4c787436df8bad6ebf1edac65ba97e080295e3bc

Branch pushed to git repo; I updated commit sha1. New commits:

4c78743Add new string conversion utilities.

comment:2 Changed 22 months ago by embray

  • Status changed from new to needs_review

comment:3 Changed 22 months ago by jdemeyer

  • Status changed from needs_review to needs_info

OK, so now we have to discuss encodings... first of all, I really don't like the encoding argument in the functions, that will just slow everything down. Better pick one (or a few) good encoding(s) and use those.

I don't see why you would pick locale.getpreferredencoding() as default encoding. Relevant encodings should be:

  1. ascii. In many cases we know that output from a C library is an ASCII string (for example, the string representation of a number). Luckily, ASCII is compatible with most other encodings, so this isn't an issue.
  1. filesystemencoding. This is the encoding that Python internally uses for most conversions char* <-> str. For this reason, I think that this is the most important use case and it should be the default when in doubt.
  1. utf-8. This is good if you just want to store unicode as char* (say, an internal user-specified name for something) and you don't care about the exact encoding but you do care that round-trip conversion works properly.
  1. locale.getpreferredencoding(). I believe that this is mainly relevant for subprocesses, provided that those subprocesses also look at the encoding.

For efficiency, I would rather have several sets of functions, each with a hard-coded encoding.

I would also suggest to implement these functions (or at least the functions where you care about performance) as cdef inline functions in the .pxd file. That way, they will actually be inlined when calling them from an external Cython module.

comment:4 follow-up: Changed 22 months ago by jdemeyer

You don't need six.PY2. You can just replace

if six.PY2:
    from cpython.string cimport PyString_FromString
else:
    from cpython.bytes cimport PyBytes_FromString as PyString_FromString

by

from cpython.bytes cimport PyBytes_FromString as PyString_FromString

You could also just use PyBytes_FromString in the code.

comment:5 in reply to: ↑ 4 Changed 22 months ago by embray

Replying to jdemeyer:

You don't need six.PY2. You can just replace

That's what I thought originally too, but I was having some problems trying to do it as a one-liner. I'll try your suggestion though and double-check whether it works.

comment:6 follow-ups: Changed 22 months ago by embray

I don't think the "encoding" argument slows anything down by any significant amount, especially in cases where it isn't used. I'd be amenable to encoding-specific functions for some cases, as Python has those as well in its API. But right now I'm really trying to avoid breadth of API surface. It is good to have generic versions of these functions that accept any encoding, and this is the bare minimum needed to get Python 3 support off the ground. I think that later we can encoding-specific functions where specific use cases for them can be demonstrated. I don't want to get into the weeds with this right now.

As for the default encoding, locale.getpreferredencoding() makes more sense. sys.getfilesystemencoding() is specifically about guessing (more or less) what encoding the filesystem specifically is using for encoding e.g. of filenames, and has nothing to do with encodings of file contents or of arbitrary strings output by software. (Well, that's not exactly true since on *nix the two are essentially the same, although the former can change if the process's locale is changed, I think).

So on *NIX platforms like we care about the most it's actually mostly a moot point. But a lot of the software Sage interfaces with is locale-aware, so it's better to take that into account as the default than not.

comment:7 follow-up: Changed 22 months ago by embray

I would also suggest to implement these functions (or at least the functions where you care about performance) as cdef inline functions in the .pxd file.

I was wondering about that. But in that case the .pyx file will be empty (does it even need to exist at all)? What about the default encoding variables?

comment:8 in reply to: ↑ 6 ; follow-up: Changed 22 months ago by jdemeyer

Maybe the best solution would be to not have a default encoding and require users to think what encoding they want?

comment:9 in reply to: ↑ 6 Changed 22 months ago by jdemeyer

Replying to embray:

I don't think the "encoding" argument slows anything down by any significant amount

The Python 3 C API has specific functions to encode/decode for certainly particular encodings:

I haven't profiled, but I would hope that these are faster than the generic API.

comment:10 in reply to: ↑ 7 Changed 22 months ago by jdemeyer

Replying to embray:

But in that case the .pyx file will be empty (does it even need to exist at all)?

It doesn't need to exist.

What about the default encoding variables?

I propose to remove those variables anyway. If you do need them for some reason, they must be in the .pyx file.

comment:11 follow-up: Changed 22 months ago by jdemeyer

Two more things:

  1. it is safer to use the Cython-compile-time constant PY_VERSION_HEX instead of the C-compile-time constant PY_MAJOR_VERSION. It is dubious C code to write something like
    if (0)
        nonexisting_function()
    

when nonexisting_function() doesn't actually exist like PyUnicode_AsUTF8 on Python 2. The only reason that the above code compiles is compiler optimization.

See also #24215 for Cython-compile-time constants in general.

  1. Could you move this to sage/cpython/string.pxd? I recently created the sage/cpython directory to deal with low-level Python internals and I think that these functions would fit there.

comment:12 follow-up: Changed 22 months ago by chapoton

fails to build, see patchbot report:

from string import Template

comment:13 Changed 22 months ago by chapoton

oh, I see. Name conflict between the global "string" module and the local "string" module..

comment:14 in reply to: ↑ 8 ; follow-ups: Changed 22 months ago by embray

Replying to jdemeyer:

Maybe the best solution would be to not have a default encoding and require users to think what encoding they want?

That's far too onerous and anxiety-provoking for the user :) Python itself uses default encodings all over the place when in doubt (hence e.g. PyUnicode_DecodeFSDefault). It's an unfortunate fact that there's isn't always one "right answer" here; the best we can do is provide sensible defaults and the ability for user-specified encoding where applicable; that is, where we know we want a specific encoding. Moving forward we can also do more, for example, to ensure that any locale-aware code run by Sage is handled well.

I could definitely agree to adding more encoding-specific helper functions, especially for ASCII and UTF-8. But as a first pass, for the sake of getting Python 3 support off the ground, I'd prefer to leave this as is and then make adjustments as specific use cases arise. It will be difficult to even find those specific use cases until and unless we get further along on getting Python 3 working in general (with these functions, plus a few other fixes I'll be posting soon, I've gotten the Sage doctest runner working, so that will help expose a lot of interesting cases quickly).

I'll look at the rest of your suggestions; they seem reasonable.

comment:15 in reply to: ↑ 11 ; follow-up: Changed 22 months ago by embray

Replying to jdemeyer:

Two more things:

  1. it is safer to use the Cython-compile-time constant PY_VERSION_HEX instead of the C-compile-time constant PY_MAJOR_VERSION. It is dubious C code to write something like

Ah, I was actually really looking for something like this but I couldn't find it anywhere in the Cython documentation. Am I just blind?

comment:16 in reply to: ↑ 15 ; follow-up: Changed 22 months ago by embray

Replying to embray:

Replying to jdemeyer:

Two more things:

  1. it is safer to use the Cython-compile-time constant PY_VERSION_HEX instead of the C-compile-time constant PY_MAJOR_VERSION. It is dubious C code to write something like

Ah, I was actually really looking for something like this but I couldn't find it anywhere in the Cython documentation. Am I just blind?

Nevermind; I see now that we explicitly pass that in to cythonize in the setup.py. For that matter, any reason not to add simply PY2 and PY3 compile-time constants? Comparing against PY_VERSION_HEX is a little annoying.

comment:17 in reply to: ↑ 14 ; follow-up: Changed 22 months ago by chapoton

It will be difficult to even find those specific use cases until and unless we get >further along on getting Python 3 working in general (with these functions, plus a >few other fixes I'll be posting soon, I've gotten the Sage doctest runner working, so >that will help expose a lot of interesting cases quickly).

This sounds great ! I was hoping that the doctest framework could be made to work at some point, but was not expecting it soon.

Last edited 22 months ago by chapoton (previous) (diff)

comment:18 in reply to: ↑ 17 Changed 22 months ago by embray

Replying to chapoton:

It will be difficult to even find those specific use cases until and unless we get >further along on getting Python 3 working in general (with these functions, plus a >few other fixes I'll be posting soon, I've gotten the Sage doctest runner working, so >that will help expose a lot of interesting cases quickly).

This sounds great ! I was hoping that the doctest framework could be made to work at some point, but was not expecting it soon.

I had to get it working, in part, so that I could run the doctests for this module :)

I think it will help things go much faster.

comment:19 Changed 22 months ago by embray

I'm seeing now how wanting to have module-level global variables in conjunction with inline c(p)def functions in Cython gets problematic. It just compiles them as __Pyx_GetModuleGlobalName(name). I feel like this should be a bug (I think not too difficult to fix?) in Cython, since it's really counter to how Python should work in this case.

For functions inlined from another module--at least if those functions access global variables from their original module--it should import that module during module initialization and use the correct module dict for globals lookups (just as normal Python functions do, basically).

That's an issue beyond this one though, so I'll rework things for now to get rid use of the global variables by these functions (I still want to have default encodings though).

comment:20 Changed 22 months ago by embray

So it turns out PyUnicode_AsEncodedString and PyUnicode_Decode do allow the encoding argument to be NULL, in which case they default to UTF-8. So if it's a good enough default for Python maybe it's a good enough default for us, for now.

Perhaps I'll just stick with that functionality, and take care to use things like locale.getpreferredencoding() when necessary, such as interfacing with external software. To that end it would be good to come up with some list of exactly what software Sage interfaces with is locale aware. Certainly ECL and GAP are strong candidates; probably others. It will be good to add some tests to that effect but we can worry about it when we come to it.

comment:21 in reply to: ↑ 14 Changed 22 months ago by jdemeyer

Replying to embray:

Python itself uses default encodings all over the place when in doubt

Right, but there are several defaults (each of UTF-8, sys.getfilesystemencoding() and locale.getpreferredencoding() could be considered as defaults). And I don't think that any of those 3 has the large majority. So I still feel that forcing the user to pick one is the best solution.

comment:22 follow-up: Changed 22 months ago by jdemeyer

Also, I feel that error handling should be different for the different cases: if you are communicating with locale-aware software using locale.getpreferredencoding() you probably want decoding errors to be actual errors. With the file system encoding on the other hand, surrogateescape is typically a better default.

comment:23 in reply to: ↑ 16 ; follow-up: Changed 22 months ago by jdemeyer

Replying to embray:

For that matter, any reason not to add simply PY2 and PY3 compile-time constants? Comparing against PY_VERSION_HEX is a little annoying.

No problem for me, although I would prefer PY_MAJOR_VERSION then. If you want to do that, please make a new ticket and make it depend on #24215 please.

comment:24 Changed 22 months ago by git

  • Commit changed from 4c787436df8bad6ebf1edac65ba97e080295e3bc to bb073e885eec1eeae8eac4c4776a81d84c2c61c1

Branch pushed to git repo; I updated commit sha1. New commits:

bb073e8Move sage.misc.string to sage.cpython.string and rework how default encoding is handled

comment:25 in reply to: ↑ 23 Changed 22 months ago by embray

Replying to jdemeyer:

Replying to embray:

For that matter, any reason not to add simply PY2 and PY3 compile-time constants? Comparing against PY_VERSION_HEX is a little annoying.

No problem for me, although I would prefer PY_MAJOR_VERSION then. If you want to do that, please make a new ticket and make it depend on #24215 please.

Okay hold on...

comment:26 in reply to: ↑ 22 Changed 22 months ago by embray

Replying to jdemeyer:

Also, I feel that error handling should be different for the different cases: if you are communicating with locale-aware software using locale.getpreferredencoding() you probably want decoding errors to be actual errors. With the file system encoding on the other hand, surrogateescape is typically a better default.

Yeah, I'm inclined to agree here--I think I'll change that. But in that case I'd also want an optional argument for error handling.

comment:27 in reply to: ↑ 12 Changed 22 months ago by jdemeyer

Replying to chapoton:

fails to build, see patchbot report:

from string import Template

Fixed in #24242.

comment:28 Changed 22 months ago by git

  • Commit changed from bb073e885eec1eeae8eac4c4776a81d84c2c61c1 to a5d521980e6ce42afc2b127e8f7d892f32488ee4

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

34ef6dcAdd HAVE_GMPY2 compile-time constant
158d984Force absolute import in have_module()
7f06e71Fix documentation
3ac146cAdd a Cython compile-time constant for Python major version.
f71fd41Add new string conversion utilities.
a5d5219Add support for optional error handling.

comment:29 follow-up: Changed 22 months ago by embray

  • Dependencies set to #24246
  • Status changed from needs_info to needs_review

I'm pretty happy with this now as-is. I'm up for minor tweaking but would prefer not to make any other major changes to how this works, for now, since it's proving pretty useful in getting other code working.

I'd be up for adding additional shortcut functions for certain special cases if they come up often enough--we'll see.

(Also, it turns out if you put cpdef function definitions into a .pxd file it won't generate a module for them unless you also have a corresponding .pyx file, even if it's empty--annoying. That said, it's useful to have a shortcut for sys.getfilesystemencoding() which is annoying to write :)

comment:30 Changed 22 months ago by chapoton

I have just noticed that we use a similar kind of function "from sagenb.misc.misc import encoded_str" in src/sage/misc/latex.py..

comment:31 follow-up: Changed 22 months ago by jdemeyer

If you want to pass encoding and errors arguments, can you pass them as the C type const char*? That avoids needless slow manipulation of Python objects.

comment:32 in reply to: ↑ 29 ; follow-ups: Changed 22 months ago by jdemeyer

Replying to embray:

since it's proving pretty useful in getting other code working.

If you have such use cases, I'd love to see them.

I'm not yet convinced that we should use locale.getpreferredencoding() as default. Since this is just a preliminary interface, why can't we require an encoding argument until we have a clearer idea that most Unicode conversions are indeed using locale.getpreferredencoding?

comment:33 follow-up: Changed 22 months ago by jdemeyer

The code is a bit inconsistent in how it checks that the input has the correct type, for example passing non-bytes to bytes_to_str(). In any case, I would drop the typing on the argument (bytes b -> b). If you want checks, make them explicit by using checked casts (<str?> instead of <str>).

comment:34 in reply to: ↑ 32 Changed 22 months ago by jdemeyer

Replying to jdemeyer:

Replying to embray:

since it's proving pretty useful in getting other code working.

If you have such use cases, I'd love to see them.

Never mind, I just noticed #24223.

comment:35 in reply to: ↑ 31 ; follow-up: Changed 22 months ago by embray

Replying to jdemeyer:

If you want to pass encoding and errors arguments, can you pass them as the C type const char*? That avoids needless slow manipulation of Python objects.

Doesn't that mean that from Python you'd have to pass those arguments in as bytes then?

Let's please not get too hung up about micro-optimization here. Compared to actual string decoding (except in some trivial cases) this kind of point is pretty trivial.

comment:36 in reply to: ↑ 32 ; follow-up: Changed 22 months ago by embray

Replying to jdemeyer:

I'm not yet convinced that we should use locale.getpreferredencoding() as default. Since this is just a preliminary interface, why can't we require an encoding argument until we have a clearer idea that most Unicode conversions are indeed using locale.getpreferredencoding?

No, for exactly the reason that in all cases it's not clear what the best choice is. Anyways the end result is going to be everyone just writing "utf-8" everywhere which I'd prefer to avoid. I'd like to see how far we get with a sensible default for now. There are some cases where it's definitely good to specify an encoding--for example all files written by sage can be encoded utf-8 in general. In general that's not the case though.

comment:37 in reply to: ↑ 33 ; follow-up: Changed 22 months ago by embray

Replying to jdemeyer:

The code is a bit inconsistent in how it checks that the input has the correct type, for example passing non-bytes to bytes_to_str(). In any case, I would drop the typing on the argument (bytes b -> b). If you want checks, make them explicit by using checked casts (<str?> instead of <str>).

"make them explicit" I don't know what isn't explicit about specifying the argument type?

comment:38 in reply to: ↑ 37 ; follow-up: Changed 22 months ago by jdemeyer

Replying to embray:

I don't know what isn't explicit about specifying the argument type?

It's explicit but not what you want. Cython always allows None to be passed as any type. If you really want bytes and not None, you need to check for it. And if you do an explicit check, there is no reason for the typing of the argument.

comment:39 in reply to: ↑ 36 Changed 22 months ago by jdemeyer

Replying to embray:

Anyways the end result is going to be everyone just writing "utf-8" everywhere which I'd prefer to avoid.

And now we're in a situation where everybody uses locale.getpreferredencoding() which is also not what you want.

comment:40 follow-up: Changed 22 months ago by jdemeyer

I owe you an apology... I read some Python docs and sources and it turns out that Python uses locale.getpreferredencoding() for more things than I thought. So with that in mind, locale.getpreferredencoding() might actually be a reasonable default after all.

comment:41 in reply to: ↑ 40 Changed 22 months ago by embray

Replying to jdemeyer:

I owe you an apology... I read some Python docs and sources and it turns out that Python uses locale.getpreferredencoding() for more things than I thought. So with that in mind, locale.getpreferredencoding() might actually be a reasonable default after all.

That's okay, I'm just going off experience here. But it's a hazy topic so I don't mind having my assumptions challenged.

What's nice about having a sensible default is that, especially starting out, one needs to use a lot of these and it's not always immediately obvious what to do. It's easier to not worry about it until you know you need to (e.g. something produces some text in an encoding you weren't expecting). The locale encoding is a good bet though since it will typically match what the terminal is using.

comment:42 in reply to: ↑ 35 Changed 22 months ago by jdemeyer

Replying to embray:

Let's please not get too hung up about micro-optimization here. Compared to actual string decoding (except in some trivial cases) this kind of point is pretty trivial.

Another good point: PyUnicode_AsUTF8 caches its own representation as UTF-8 string. So the PyUnicode_AsUTF8() calls for encoding and errors should be fast.

comment:43 Changed 22 months ago by jdemeyer

  • Reviewers set to Jeroen Demeyer

comment:44 follow-up: Changed 22 months ago by embray

Yes, especially since they'll likely be interned anyways.

I'm still going to look into your point about argument types.

comment:45 in reply to: ↑ 44 Changed 22 months ago by jdemeyer

Replying to embray:

I'm still going to look into your point about argument types.

No. I'm preparing a small reviewer patch.

comment:46 Changed 22 months ago by jdemeyer

  • Branch changed from u/embray/python3/string-conversions to u/jdemeyer/python3/string-conversions

comment:47 Changed 22 months ago by jdemeyer

  • Commit changed from a5d521980e6ce42afc2b127e8f7d892f32488ee4 to 4a209df305a721e9c30bc1aa384d128e265c419c

I made some minor changes, please review.


New commits:

4a209dfMinor fixes to cpython.string

comment:48 Changed 22 months ago by jdemeyer

Note that my commit does not affect the API of these functions, so this shouldn't require any changes for tickets depending on this one.

comment:49 in reply to: ↑ 38 Changed 22 months ago by embray

  • Status changed from needs_review to needs_work

Replying to jdemeyer:

Replying to embray:

I don't know what isn't explicit about specifying the argument type?

It's explicit but not what you want. Cython always allows None to be passed as any type. If you really want bytes and not None, you need to check for it. And if you do an explicit check, there is no reason for the typing of the argument.

I'm still not following you here. Dropping the type specification seems to make things worse, not better. I do see what you're saying about None which is weird. I don't like that. But specifying the argument of the type provides some compile-time static checking that's already been helpful a few times. It's not foolproof of course because in many cases Cython can't be sure what type a Python object will be. But in cases where it can (e.g. something "obvious" like foo = b'abc'; str_to_bytes(foo)) you get:

Error compiling Cython file:
------------------------------------------------------------
...
    foo = b'abc'
    bar = str_to_bytes(foo)
                      ^
------------------------------------------------------------

Cannot convert 'bytes' object to str implicitly. This is not portable to Py3.

While this seems to be specific to bytes -> str, it is very useful in this case.

Also, without this type check you get much less useful errors if you pass in the wrong type. With the type specification:

>>> str_to_bytes(b'a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Argument 's' has incorrect type (expected str, got bytes)

Without it

>>> str_to_bytes(b'a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sage/cpython/string.pxd", line 70, in sage.cpython.string.str_to_bytes (build/cythonized/sage/cpython/string.c:1660)
  File "sage/cpython/string.pxd", line 99, in sage.cpython.string.str_to_bytes (build/cythonized/sage/cpython/string.c:1523)
TypeError: bad argument type for built-in operation

This error happens to be raising from PyUnicode_EncodeLocale, but one wouldn't guess that just from looking at it. It seems sloppy.

One thing I did recently run into, however, is that str_to_bytes(str s, ...) does not allow subclasses of str which is inconvenient. Your version does seem allow str subclasses (and I guess likewise bytes subclasses for bytes_to_str). It would be nice to have a middle ground between accepting subclasses, but also having better type checking.

comment:50 follow-up: Changed 22 months ago by embray

As you suggested earlier, using explicit type-checking casts like PyUnicode_EncodeLocale(<str?>s, err) gets part of the way there--it at least provides better exceptions when the wrong type is passed in.

Unfortunately in the case of str Cython wants to use PyString_CheckExact, instead of some more generic check, and this disallows str subclasses (even though for user-defined types it will use a check that allows subclasses). I feel like this a bug in Cython--why use PyString_CheckExact instead of PyString_Check? The latter still does the equivalent of the former first as a special case. This is also especially nicely optimized on Python 3 where they added a Py_TPFLAGS_UNICODE_SUBCLASS flag as a special case :)

comment:51 in reply to: ↑ 50 ; follow-up: Changed 22 months ago by jdemeyer

Replying to embray:

I feel like this a bug in Cython--why use PyString_CheckExact instead of PyString_Check?

Feature, not a bug :-)

Cython uses this typing as an optimization tool, not as a checking tool. It happens to do some checking (like assigning cdef str x = [] will fail) but that is just to avoid segfaults. For certain builtin types, Cython accesses the internals of those types. But accessing those internals is only valid for the exact type, it might not be valid for subtypes.

comment:52 in reply to: ↑ 51 Changed 22 months ago by embray

Replying to jdemeyer:

Replying to embray:

I feel like this a bug in Cython--why use PyString_CheckExact instead of PyString_Check?

Feature, not a bug :-)

Yep, I just confirmed this myself--although it doesn't really make this explicit the way the code is written definitely implies that it's intentional.

Cython uses this typing as an optimization tool, not as a checking tool. It happens to do some checking (like assigning cdef str x = [] will fail) but that is just to avoid segfaults. For certain builtin types, Cython accesses the internals of those types. But accessing those internals is only valid for the exact type, it might not be valid for subtypes.

Right. It depends. I feel like one should be able to ask for allowing subtypes explicitly though, if nothing else. If Cython needs to muck around with the internals it can still make an "exact" check...

comment:53 follow-ups: Changed 22 months ago by embray

Also I'd say Cython does use types for checking, not just optimization, for example it obviously won't let you do things like

cdef int i
i = "abc"

I guess it depends on what you mean by "typing"--is this referring to Python types or C types?

So in the mean time what do we do? Write the explicit checks manually? Or just not allow str subclasses and require a manual str() when passing in objects that might be str subclasses (ew, very un-Pythonic).

comment:54 in reply to: ↑ 53 ; follow-up: Changed 22 months ago by jdemeyer

Replying to embray:

So in the mean time what do we do?

I would argue that the current branch works and that nothing needs to be changed. Even if the error message from PyUnicode_Decode is obscure, the traceback would easily point to the right place.

comment:55 in reply to: ↑ 53 Changed 22 months ago by jdemeyer

Replying to embray:

I guess it depends on what you mean by "typing"--is this referring to Python types or C types?

I was specifically talking about builtin Python types (not C types nor cdef classes).

comment:56 in reply to: ↑ 54 ; follow-up: Changed 22 months ago by embray

Replying to jdemeyer:

Replying to embray:

So in the mean time what do we do?

I would argue that the current branch works and that nothing needs to be changed. Even if the error message from PyUnicode_Decode is obscure, the traceback would easily point to the right place.

Okay--I'm really not happy about that--given that the <str?> syntax is the programmer explicitly asking for a type check it should allow some flexibility in how that's performed. But that's a Cython issue. I can live with this in the meantime. However, one thing that still needs to be fixed then is to remove the type checks in the Python 2 cases. I'm not sure what good they're doing, and they are restricting the use of subtypes on Python 2...

comment:57 in reply to: ↑ 56 Changed 22 months ago by jdemeyer

Replying to embray:

given that the <str?> syntax is the programmer explicitly asking for a type check it should allow some flexibility in how that's performed. But that's a Cython issue.

Again, feature, not a bug. It is meant as "optimize this code assuming that this is a str but check it just in case". Of course, it is too tempting to use this to replace isinstance() checks as I have done. I'll fix it right now.

comment:58 Changed 22 months ago by git

  • Commit changed from 4a209df305a721e9c30bc1aa384d128e265c419c to 64567aa772d737b6ff052f99a67cc6e8ad72c21c

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

64567aaMake type checks explicit

comment:59 Changed 22 months ago by jdemeyer

  • Status changed from needs_work to needs_review

comment:60 follow-up: Changed 22 months ago by embray

Okay, I take your word for it, but in that case I think the the Cython documentation is a little deceptive about how this works / is meant to be used. Not deliberately of course, it's just unclear. This is like with importlib.import_module--the documentation is unhelpful here, but I'm sure you're right...

comment:61 in reply to: ↑ 60 Changed 22 months ago by jdemeyer

Replying to embray:

Okay, I take your word for it, but in that case I think the the Cython documentation is a little deceptive about how this works / is meant to be used.

https://github.com/cython/cython/pull/2021

comment:62 Changed 22 months ago by embray

  • Authors changed from Erik Bray to Erik Bray, Jeroen Demeyer
  • Reviewers changed from Jeroen Demeyer to Jeroen Demeyer, Erik Bray
  • Status changed from needs_review to positive_review

Alright, I'm good with this.

comment:63 Changed 22 months ago by embray

  • Status changed from positive_review to needs_work

I think it would actually be best if char_to_str took a const char * instead of just char *. I was lazy about that at first, but for the sake of correctness I think this should be updated.

comment:64 Changed 22 months ago by jdemeyer

Right. But don't rewrite history, just add a new commit.

comment:65 Changed 22 months ago by embray

Sure--I can do it if you want. I'll just set it back to positive_review then since we're in agreement.

comment:66 Changed 22 months ago by embray

  • Branch changed from u/jdemeyer/python3/string-conversions to u/embray/python3/string-conversions
  • Commit changed from 64567aa772d737b6ff052f99a67cc6e8ad72c21c to a358aeaec8612efe19678482fb510751a1b7b33a
  • Status changed from needs_work to positive_review

New commits:

a358aeaRequire a const char*--this is more consistent for use in cases where

comment:67 Changed 22 months ago by vbraun

  • Status changed from positive_review to needs_work

PDF docs don't build

comment:68 Changed 21 months ago by embray

It would be helpful if you linked to a build result page demonstrating the issue.

comment:69 Changed 21 months ago by vbraun

That was before putting it on the buildbot...

comment:70 Changed 21 months ago by embray

  • Milestone changed from sage-8.1 to sage-8.2

comment:71 Changed 21 months ago by embray

Well I have unrelated problems building the PDF docs (is there a list somewhere of the dependencies? I have a pretty big texlive install and it still doesn't seem to have all the necessary fonts or something, even for some characters not related to this ticket).

Still if I had to guess it seems likely the problem is my use of the Snowman character (my favorite go-to unicode character). I'll just change it to something a little more likely to be already supported...

comment:72 Changed 21 months ago by git

  • Commit changed from a358aeaec8612efe19678482fb510751a1b7b33a to 64a1c2d58713636a51c60e3be508204c88513658

Branch pushed to git repo; I updated commit sha1. New commits:

64a1c2dUse a less unusual character in the tests

comment:73 Changed 21 months ago by embray

  • Status changed from needs_work to needs_review

comment:74 Changed 21 months ago by chapoton

maybe you should add r"""} when the doctests contains something that can perturb the pdf doc ?

comment:75 Changed 21 months ago by embray

Ah, there are some slashes in there too; I should make it an r-string just in case.

comment:76 Changed 21 months ago by git

  • Commit changed from 64a1c2d58713636a51c60e3be508204c88513658 to 1bf6709cb5b256537bab4058ba3fa578996f521b

Branch pushed to git repo; I updated commit sha1. New commits:

1bf6709Use r-strings for the docstrings since they contain slashes

comment:77 Changed 21 months ago by vbraun

Have a look at build/pkgs/texlive/package-list.txt

You can add custom codepoints in src/doc/common/conf.py

comment:78 Changed 21 months ago by chapoton

  • Status changed from needs_review to positive_review

pdf doc seems to be ok now

comment:79 Changed 21 months ago by fbissey

Breaks the python 3.6 build for me in sage-on-gentoo

[477/483] Cythonizing sage/symbolic/function.pyx

Error compiling Cython file:
------------------------------------------------------------
...
            err = NULL  # implies "strict"
        else:
            err = PyUnicode_AsUTF8(errors)

        if encoding is None:
            return PyUnicode_DecodeLocale(c, err)
                                        ^
------------------------------------------------------------

sage/cpython/string.pxd:37:41: Cannot convert Unicode string to 'str' implicitly. This is not portable and requires explicit encoding.
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/Cython/Build/Dependencies.py", line 1179, in cythonize_one_helper
    return cythonize_one(*m)
  File "/usr/lib64/python3.6/site-packages/Cython/Build/Dependencies.py", line 1161, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: sage/cpython/string.pyx

cython 0.27.3 and python-3.6.3. Since this is from python3 specific section it won't show up on Volker's build bots.

comment:80 Changed 21 months ago by embray

  • Status changed from positive_review to needs_work

Hrm. It seems I never actually merged/tested Jeroen's changes on my Python 3 branch. I assumed he tested it.

comment:81 Changed 21 months ago by git

  • Commit changed from 1bf6709cb5b256537bab4058ba3fa578996f521b to 2bb839655c50259c8beec273cab5773eb5158265

Branch pushed to git repo; I updated commit sha1. New commits:

2bb8396Clean up these declarations:

comment:82 Changed 21 months ago by embray

  • Status changed from needs_work to needs_review

comment:83 Changed 21 months ago by jdemeyer

  • Status changed from needs_review to needs_work

Instead of removing the return type specifications, it would be better to fix them. Replacing the unicode return value by str should work.

comment:84 Changed 21 months ago by embray

Fine but it really doesn't make much difference.

comment:85 Changed 21 months ago by git

  • Commit changed from 2bb839655c50259c8beec273cab5773eb5158265 to 6d73a416b18f9d0edec280009495ae77f7b0a49a

Branch pushed to git repo; I updated commit sha1. New commits:

6d73a41Use more specific type declarations here.

comment:86 Changed 21 months ago by chapoton

  • Status changed from needs_work to needs_review

comment:87 Changed 21 months ago by jdemeyer

Good for me if it passes testing.

comment:88 Changed 21 months ago by chapoton

  • Status changed from needs_review to positive_review

one green bot. I am setting to positive this important ticket.

comment:89 Changed 21 months ago by vbraun

  • Status changed from positive_review to needs_work

Merge conflict

comment:90 Changed 21 months ago by chapoton

Volker, do you by chance have any idea about what was the conflicting ticket ?

comment:91 follow-up: Changed 21 months ago by embray

Merge conflict with what? If there were a normal "master" branch into which tickets were merged regularly against which I could compare that would be one thing, but you can't just secretly merge a bunch of tickets all at once, claim "merge conflict", and expect me to guess what it conflicts with.

comment:92 Changed 21 months ago by git

  • Commit changed from 6d73a416b18f9d0edec280009495ae77f7b0a49a to dec9f3ac79a992d3152366ae266f1065c70c7867

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

a285af4Add new string conversion utilities.
5b3719aAdd support for optional error handling.
bf5f893Make type checks explicit
1bd27fcRequire a const char*--this is more consistent for use in cases where
bd2d931Use a less unusual character in the tests
b32f000Use r-strings for the docstrings since they contain slashes
ccf7bcdClean up these declarations:
dec9f3aUse more specific type declarations here.

comment:93 Changed 21 months ago by embray

Rebased on current develop branch if it helps, but there was no merge conflict.


New commits:

a285af4Add new string conversion utilities.
5b3719aAdd support for optional error handling.
bf5f893Make type checks explicit
1bd27fcRequire a const char*--this is more consistent for use in cases where
bd2d931Use a less unusual character in the tests
b32f000Use r-strings for the docstrings since they contain slashes
ccf7bcdClean up these declarations:
dec9f3aUse more specific type declarations here.
Last edited 21 months ago by embray (previous) (diff)

comment:94 in reply to: ↑ 91 Changed 21 months ago by jdemeyer

Replying to embray:

Merge conflict with what? If there were a normal "master" branch into which tickets were merged regularly against which I could compare that would be one thing, but you can't just secretly merge a bunch of tickets all at once, claim "merge conflict", and expect me to guess what it conflicts with.

It's not secret: https://github.com/vbraun/sage/tree/develop

comment:95 Changed 21 months ago by jdemeyer

  • Dependencies #24246 deleted
  • Status changed from needs_work to positive_review

I don't see any conflict...

comment:96 Changed 21 months ago by vbraun

  • Branch changed from u/embray/python3/string-conversions to dec9f3ac79a992d3152366ae266f1065c70c7867
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.