Opened 2 years ago

Closed 2 years ago

#21600 closed enhancement (fixed)

Use custom build_ext to compile Cython code

Reported by: jdemeyer Owned by:
Priority: major Milestone: sage-7.4
Component: build Keywords:
Cc: mkoeppe, embray Merged in:
Authors: Jeroen Demeyer Reviewers: Matthias Koeppe, Erik Bray
Report Upstream: N/A Work issues:
Branch: dfcd817 (Commits) Commit: dfcd8177447c7eab491569f08777f585deda93ee
Dependencies: #21604 Stopgaps:

Description (last modified by jdemeyer)

In src/setup.py, we should not run cythonize() manually. It would be better to use a custom build_ext command for that.

This is inspired by the build_ext from Cython 0.25: https://github.com/cython/cython/blob/master/Cython/Distutils/build_ext.py Unfortunately, we cannot really use that since it doesn't allow to pass options to cythonize().

Due to a Cython bug, this warning gets displayed even though the cache option is valid: UserWarning: got unknown compilation option, please remove: cache.

Change History (47)

comment:1 Changed 2 years ago by jdemeyer

  • Description modified (diff)

comment:2 Changed 2 years ago by jdemeyer

  • Summary changed from Use Cython's build_ext to Use custom build_ext to compile Cython code

comment:3 Changed 2 years ago by jdemeyer

  • Description modified (diff)

comment:4 Changed 2 years ago by jdemeyer

  • Branch set to u/jdemeyer/ticket/21600

comment:5 Changed 2 years ago by jdemeyer

  • Commit set to 6d6d3490d1aa434acfdb9ebed4730df38aae6a37
  • Dependencies changed from #20596 to #21604

New commits:

5a55862Fix build of pyzmq with Cython 0.25
85634c2Upgrade Cython to version 0.25
5474247Move old_style_globals to modules; other directives to cythonize() call
6d6d349Run cythonize() inside build_ext command

comment:6 Changed 2 years ago by git

  • Commit changed from 6d6d3490d1aa434acfdb9ebed4730df38aae6a37 to 35ecf4b0fca0c13dad3e38826fc33211bbd66c81

Branch pushed to git repo; I updated commit sha1. This was a forced push. Last 10 new commits:

3a8cc0eFix typo in comment
0dd9c50Respect environment variable MAKE
17f90d8beautification
e5f9065More comments
7791cd9Remove --buildbase code
74169e7Pass SAGE_SRC to generate_py_source.mk
0394333Add new file to MANIFEST.in
fdedb02Install SAGE_LOCAL/bin/sage instead of SAGE_ROOT/sage as Jupyter kernel
5ba95edClean up stale installed files in install command
35ecf4bRun cythonize() inside build_ext command

comment:7 Changed 2 years ago by jdemeyer

  • Cc embray added

comment:8 follow-up: Changed 2 years ago by embray

If I'm to understand correctly, this is mostly just taking the existing Cython build code and moving into the build_ext command, right? Or was anything substantive changed in the build code too (beyond what was already from #21604)?

comment:9 in reply to: ↑ 8 Changed 2 years ago by jdemeyer

Replying to embray:

If I'm to understand correctly, this is mostly just taking the existing Cython build code and moving into the build_ext command, right?

Exactly. But it interacts in non-trivial ways with other parts of setup.py (in particular, finding the data files), so this isn't ready yet.

comment:10 Changed 2 years ago by embray

Could you be more specific? I can probably help.

comment:11 Changed 2 years ago by jdemeyer

Well, I haven't figured out precisely what to do with this part from src/setup.py:

from sage_setup.find import find_python_sources, find_extra_files
python_packages, python_modules = find_python_sources(
    SAGE_SRC, ['sage', 'sage_setup'])
python_data_files = find_extra_files(python_packages,
    ".", SAGE_CYTHONIZED, SAGE_LIB, ["ntlwrap.cpp"])

This code determines some inputs to the setup() call, so logically it should come before calling setup().

On the other hand, the last line assumes that Cython has run because it looks for files in SAGE_CYTHONIZED. So, if we run Cython in the build_ext command of setup() (what this ticket already implements), then we can no longer determine the python_data_files before calling setup().

comment:12 Changed 2 years ago by embray

One secret about distutils, which is both a bug and a feature--and the fact that it's often exploited is why it's so hard to change anything about distutils--is that in many cases it doesn't really matter what you pass to setup(). It's mostly important for static metadata, but most other details about building the package can be tweaked in various places along the line (albeit sometimes requiring care).

So for example, the data_files passed to setup() is stored on the Distribution object in the .data_files attribute. So this can be accessed from any command via self.distribution.data_files. If a command, like build_py or build_ext is also responsible for generating files, some of which might result in installable resource files (for Cython modules this is mainly header files), it can also append those generated files, as appropriate, to self.distribution.data_files for later use.

Where this requires care is it's sometimes important to make sure you're using those attributes in ways that are compatible with how other commands use them, for which the best documentation is the source code (another misfeature). data_files is an easy one because it's only used by a couple other commands, mainly install/install_data, and sdist. This is easy because the build command is normally run automatically before install, so as long as the build_ext command updates the data_files appropriately they will be handled correctly.

sdist is a little trickier--it does not normally run build first, though if you invoke setup.py with multiple commands it will work correctly (like ./setup.py build sdist). I normally subclass sdist to have it automatically run build first if I have Cython modules, because I also want to include the C sources in my source dist. But I'm not too worried about that right now since ./setup.py sdist doesn't work right for Sage for other reasons too. That will be a separate issue.

comment:13 Changed 2 years ago by jdemeyer

Thanks a lot! I think I have enough information now to work on this.

comment:14 Changed 2 years ago by git

  • Commit changed from 35ecf4b0fca0c13dad3e38826fc33211bbd66c81 to 2b5fa98244c5d9f3481c7d15061e3e0f4c4b0ee6

Branch pushed to git repo; I updated commit sha1. New commits:

2b5fa98Copy extra files to build directory instead of using data_files

comment:15 follow-up: Changed 2 years ago by embray

I'm not quite sure why you're not not using data_files at all, though for files installed with the Python modules themselves it's more customary to use package_data but that's neither here nor there. I suspect the problem might have to do in part with oddity in how find_extra_files is expected to work.

I'm not sure why the dist.extra_files = either. There's no reason now to go assigning arbitrary new attributes to the Distribution instance.

comment:16 in reply to: ↑ 15 ; follow-up: Changed 2 years ago by jdemeyer

Replying to embray:

I'm not quite sure why you're not not using data_files at all

I could have done that, but the usage of data_files in Sage has always been a bit fishy, see #20108. So given that I'm changing the data file handling anyway, I decided to stop using data_files.

though for files installed with the Python modules themselves it's more customary to use package_data

That doesn't work either: package_data assumes that the package data files are in the same directory as the Python sources. This is not the case for the Cython-generated files, so I cannot use package_data. But I am processing the data files similarly as what distutils does with package_data: copying them to the build directory. Since the data files appear in the build directory, they are automatically installed in site-packages.

I suspect the problem might have to do in part with oddity in how find_extra_files is expected to work.

No, we can always change find_extra_files to do whatever we want it to do.

I'm not sure why the dist.extra_files = either. There's no reason now to go assigning arbitrary new attributes to the Distribution instance.

These extra_files are also needed for cleaning in the sage_install class. So I need to pass them somehow and assigning an attribute to the Distribution instance seemed like the most natural way.

comment:17 Changed 2 years ago by git

  • Commit changed from 2b5fa98244c5d9f3481c7d15061e3e0f4c4b0ee6 to 7e3a492edaa54fc9c724d183534d27d171c8cf60

Branch pushed to git repo; I updated commit sha1. New commits:

7e3a492Rename extra_files -> sage_extra_files

comment:18 Changed 2 years ago by jdemeyer

  • Status changed from new to needs_review

comment:19 in reply to: ↑ 16 ; follow-up: Changed 2 years ago by embray

Replying to jdemeyer:

Replying to embray:

though for files installed with the Python modules themselves it's more customary to use package_data

That doesn't work either: package_data assumes that the package data files are in the same directory as the Python sources. This is not the case for the Cython-generated files, so I cannot use package_data. But I am processing the data files similarly as what distutils does with package_data: copying them to the build directory. Since the data files appear in the build directory, they are automatically installed in site-packages.

Since the purpose of package_data is specifically to list data files that should be installed in the Python packages, I think it makes sense to use anyways. Normally this wouldn't be a problem even for Cython generated files, and the only reason it is is because we're handling those in a "weird" (to distutils) and unexpected way. That's fine--I mean, I don't agree with it but it's not abhorrent or anything--we just need to make some tweaks to how package_data is handled so that it searches additional file hierarchies.

But for coherency I would rather use the existing option that is specifically for this purpose, and use our command subclasses to enhance its functionality to work for our purposes, rather than make up something entirely new.

comment:20 Changed 2 years ago by jdemeyer

  • Description modified (diff)

comment:21 in reply to: ↑ 19 Changed 2 years ago by jdemeyer

Replying to embray:

we just need to make some tweaks to how package_data is handled so that it searches additional file hierarchies.

There is another issue: package_data is handled in the build_py command, which happens before build_ext, which runs Cython. So even if you manage to handle the Cython-generated files using package_data, you either need to reorder the build subcommands or you need to run Cython in build_py.

But for coherency I would rather use the existing option that is specifically for this purpose

You mean: going back to use data_files?

comment:22 follow-ups: Changed 2 years ago by embray

I see, that's a bit annoying. I'm surprised I haven't run into that before. If I have I don't know what solution I used--maybe just manually copying the generated Cython files to the appropriate build path. It would still not be a bad idea to add relevant entries to package_data though since that's also used by the sdist command to ensure that all those files are included in the source tarball (though this can also be achieved simply with the correct MANIFEST.in rule).

Looking at the source again, I think the simplest thing to do would be after Cythonization to copy the appropriate generated files to the appropriate build/ directory (basically emulating what build_py.build_package_data is doing, which you basically already have. But then also add entries for either of:

  1. `self.get_finalized_command('build_py').data_files or
  2. `self.distribution.data_files

The former is the same as if those files were included in package_data in the first place.

An alternative I kind of like, but that's more complicated, is to make cythonize and entirely separate command (build_cython or something) and insert that into the front of the build command's sub-commands. But that's more complicated and probably not worth it for now.

My only real problem at this point is with the new attribute you're tacking on. It seems easily avoidable.

comment:23 in reply to: ↑ 22 ; follow-up: Changed 2 years ago by jdemeyer

Replying to embray:

My only real problem at this point is with the new attribute you're tacking on.

Seriously, why is that a problem? I even explicitly added a sage_ prefix.

comment:24 in reply to: ↑ 22 Changed 2 years ago by jdemeyer

Replying to embray:

I see, that's a bit annoying. I'm surprised I haven't run into that before. If I have I don't know what solution I used--maybe just manually copying the generated Cython files to the appropriate build path. It would still not be a bad idea to add relevant entries to package_data though since that's also used by the sdist command to ensure that all those files are included in the source tarball (though this can also be achieved simply with the correct MANIFEST.in rule).

Looking at the source again, I think the simplest thing to do would be after Cythonization to copy the appropriate generated files to the appropriate build/ directory (basically emulating what build_py.build_package_data is doing, which you basically already have. But then also add entries for either of:

  1. `self.get_finalized_command('build_py').data_files or
  2. `self.distribution.data_files

The former is the same as if those files were included in package_data in the first place.

Do you want me to make those changes here? Or do you accept the current code on this branch and should we leave that for another ticket?

comment:25 in reply to: ↑ 23 ; follow-ups: Changed 2 years ago by embray

Replying to jdemeyer:

Replying to embray:

My only real problem at this point is with the new attribute you're tacking on.

Seriously, why is that a problem? I even explicitly added a sage_ prefix.

1) Because it's not at all clear why it's even necessary, and it only confuses things.

2) Just because distutils is "meant to be" tinkered with in a flexible manner, monkeypatching the Distribution option is definitely an anti-pattern. More commonly, if some command has some data it would like to share with other commands, the pattern is to add an attribute to that command instead (and usually at least initialize that attribute with a default value in the finalize_command method). Then, if other commands need that data, they would first ensure that the command it comes from is finalized, and may also run that command if necessary (using self.distribution.run_command, which doesn't run the command a second time if it has already been run). That way if command B uses A.data, it ensures that A.data has been generated first.

I'm just looking out for distutils being used the way it's intended. This is unfortunately very undocumented, but if you spend enough time in it you'll see that there are various conventions that are best to follow. That said, I don't see why you need to add this in the first place.

Last edited 2 years ago by embray (previous) (diff)

comment:26 in reply to: ↑ 25 Changed 2 years ago by jdemeyer

Replying to embray:

Because it's not at all clear why it's even necessary, and it only confuses things.

It's necessary because I need that list of files in more than one place.

comment:27 follow-up: Changed 2 years ago by embray

Hmmm...I think I can appreciate what the clean_stale_files stuff is for. In most projects I would just blow away the build directory, or just individual files in it if applicable. But sage is so large this is a real pain to do--that's something which itself could use a better approach but that's well beyond this ticket.

I don't think that should just happen when you "install" though, but really prior to any build.

comment:28 in reply to: ↑ 27 Changed 2 years ago by jdemeyer

Replying to embray:

Hmmm...I think I can appreciate what the clean_stale_files stuff is for.

This isn't part of this ticket, but #21604.

comment:29 Changed 2 years ago by mkoeppe

  • Reviewers set to Matthias Koeppe

I've tested it (on top of 7.4.beta6), and it seems to work for me. Erik, objections to setting this ticket to "positive review"?

comment:30 in reply to: ↑ 25 Changed 2 years ago by jdemeyer

Replying to embray:

2) Just because distutils is "meant to be" tinkered with in a flexible manner, monkeypatching the Distribution option is definitely an anti-pattern.

Adding attributes to a class is a completely normal thing when subclassing. In this case, I could make a trivial subclass

class SageDistribution(Distribution):
    pass

and define that SageDistribution supports a sage_extra_files attribute. Because of the way Python works though, there is no need for such a subclass.

So, I still don't get why it's perfectly fine to have dist.data_files (what distutils itself uses) but not dist.sage_extra_files. They serve very analogous purposes, so why not implement them analogously?

comment:31 follow-up: Changed 2 years ago by embray

Please--you don't have to explain to me how Python works.

The point I'm trying to make is that the way you're doing this is antithetical to the common programming patterns used specifically in the context of writing distutils commands. I'm still not convinced that you need this extra attribute at all. But for now I'd be satisfied if it were implemented as I described above. If that's not clear I'll just do it.

comment:32 in reply to: ↑ 31 Changed 2 years ago by jdemeyer

Replying to embray:

Please--you don't have to explain to me how Python works.

Of course I don't need to explain Python. I wanted to explain the philosophy behind what I was doing.

I'm still not convinced that you need this extra attribute at all.

And I'm still not convinced that what I'm doing is somehow wrong.

But for now I'd be satisfied if it were implemented as I described above.

I'll try it.

comment:33 Changed 2 years ago by embray

BTW I should add--if what I suggested is not clear that's of course my fault for explaining poorly. My point is that it's not good practice in general to use classes as dumping grounds for arbitrary attributes. Python allows it, sure, but it's almost always confusing and error-prone. For example it's possible to run the "install" command without running "build_py" first, in which case you'll get an attribute error. There are myriad little corner cases like that in distutils.

I think instead of "sage_extra_files" it might be fine to call this "cythonized_files" or something like that, and have it be an attribute of the custom "build_ext". That way it's clearer exactly what this is about (and that the only reason it needs to exist at all is due to how we're putting the cythonized files in a separate, parallel directory structure).

comment:34 Changed 2 years ago by git

  • Commit changed from 7e3a492edaa54fc9c724d183534d27d171c8cf60 to dfcd8177447c7eab491569f08777f585deda93ee

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

dfcd817extra_files -> sage_build_ext.cythonized_files

comment:35 Changed 2 years ago by jdemeyer

Needs review.

comment:36 follow-up: Changed 2 years ago by embray

This looks along the right lines. It occurs to me there's something kind of problematic here, which is that it runs run_cython in finalize_command. finalize_command is basically supposed to be like an __init__--it should just set default values for things and not actually have much in the way of side-effects.

run_cython() should happen at the beginning of the run(). finalize_command should just set cythonized_files to a default value (such as an empty list or None). I think I see why you did things the way you did--you wanted to make sure one way or another that find_extra_files was run. I might instead move the find_extra_files call into a method on build_ext (such as find_cythonized_files() or something). When called after run_cython() that could update the cythonized_files attribute, and then be used to copy the files. But from the install command you would just called find_cythonized_files() if cythonized_files isn't already set.

comment:37 Changed 2 years ago by embray

Something like

class sage_build_ext(build_ext):
    def finalize_options(self):
        ...
        self.cythonized_files = None

    def run(self):
        self.run_cython()

        # The following two lines, or the equivalent thereof, could be called
        # directly from run_cython() too
        self.find_cythonized_files()
        self.copy_cythonized_files()
        build_ext.run(self)

    def find_cythonized_files(self):
        if self.cythonized_files is None:
            ...
            self.cythonized_files = ...


class sage_install(install):
    def clean_stale_files(self):
        ...
        cmd_build_ext = self.get_finalized_command('build_ext')
        cmd.find_cythonized_files()

        # Do stuff with cmd.find_cythonized_files()

And the rest pretty much exactly as you have it. Would that work?

It's not pretty but with distutils you have to write code like you're writing for Python 2.1 or something :)

Last edited 2 years ago by embray (previous) (diff)

comment:38 in reply to: ↑ 36 Changed 2 years ago by jdemeyer

Replying to embray:

This looks along the right lines. It occurs to me there's something kind of problematic here, which is that it runs run_cython in finalize_command.

That's exactly what upstream Cython 0.25 does (which was the inspiration for this ticket), so I'm not inclined to change this.

comment:39 Changed 2 years ago by mkoeppe

Erik, on this ticket (and elsewhere) you have mentioned various subtle issues of setup.py commands and their interaction. I've opened ticket #21678 (Testsuite for src/setup.py) to formalize this, and to make sure that everything that is implemented correctly on the present ticket, according to your comments, will not be lost by regressions in the future.

comment:40 follow-up: Changed 2 years ago by embray

Then I'll have to bother upstream Cython about this. There may be a reason they're doing it that way, but it doesn't make a lot of sense on its face.

The point of finalize_command is to initialize any options / set up defaults, not actually do anything. "But they do it" is not a good reason if we don't know why they're doing it.

comment:41 Changed 2 years ago by embray

If I had to guess, the reason they're doing it that way is because of how the "sdist" command uses the "build_ext" command to get a list of source files to include in the sdist. I think the right way to do that is not really to run cythonize() in the finalize_command() method, but rather to modify the sdist command so that it runs cythonize() as a sub-command.

comment:42 in reply to: ↑ 40 ; follow-up: Changed 2 years ago by jdemeyer

Replying to embray:

Then I'll have to bother upstream Cython about this.

Please do that.

comment:43 in reply to: ↑ 42 Changed 2 years ago by embray

Replying to jdemeyer:

Replying to embray:

Then I'll have to bother upstream Cython about this.

Please do that.

I will try to work with them more closely on that.

In the interest of harmony and in making progress, if this is working for you as is, then I give it a positive review. I'll open a separate ticket with my ideas for how to build on this to make it a bit cleaner, but I don't think it's bad as is as long as it works. I just know we can do better.

comment:44 Changed 2 years ago by jdemeyer

  • Reviewers changed from Matthias Koeppe to Matthias Koeppe, Erik Bray

I'll let the other reviewer (Matthias Koeppe) decide on giving positive review.

comment:45 Changed 2 years ago by embray

See #21682

comment:46 Changed 2 years ago by mkoeppe

  • Status changed from needs_review to positive_review

Tested on top of 7.4.rc0.

comment:47 Changed 2 years ago by vbraun

  • Branch changed from u/jdemeyer/ticket/21600 to dfcd8177447c7eab491569f08777f585deda93ee
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.