#13826 closed defect (fixed)
Race condition in star_imports cache
Reported by: | vbraun | Owned by: | GeorgSWeber |
---|---|---|---|
Priority: | critical | Milestone: | sage-5.9 |
Component: | build | Keywords: | |
Cc: | nbruin, robertwb | Merged in: | sage-5.9.rc0 |
Authors: | Volker Braun, Jeroen Demeyer | Reviewers: | Volker Braun |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | #14292 | Stopgaps: |
Description (last modified by )
See https://groups.google.com/d/topic/sage-devel/SN88f9qEIV8/discussion
The patchbot there sporadically fails on various tests with errors of the type
Traceback (most recent call last): File "/mnt/storage2TB/patchbot/.sage/tmp/ volker_desktop.stp.dias.ie-14095/multireplace_29004.py", line 6, in <module> from sage.all_cmdline import *; File "/mnt/storage2TB/patchbot/Sage/sage-5.5.rc0/local/lib/python/ site-packages/sage/all_cmdline.py", line 14, in <module> from sage.all import * File "/mnt/storage2TB/patchbot/Sage/sage-5.5.rc0/local/lib/python/ site-packages/sage/all.py", line 72, in <module> from sage.rings.all import * File "/mnt/storage2TB/patchbot/Sage/sage-5.5.rc0/local/lib/python/ site-packages/sage/rings/all.py", line 169, in <module> lazy_import("sage.rings.universal_cyclotomic_field.all","*") File "lazy_import.pyx", line 850, in sage.misc.lazy_import.lazy_import (sage/misc/lazy_import.c:5168) names[ix:ix+1] = get_star_imports(module) File "lazy_import.pyx", line 900, in sage.misc.lazy_import.get_star_imports (sage/misc/lazy_import.c:5924) star_imports = pickle.load(open(cache_file)) EOFError
The patchbot is on a separate harddisk, mounted under /mnt/storage2TB. But my temp directory is tmpfs. So the pickle is moved across block devices, which is of course not atomic. Hence the patchbot sometimes dies here while opening a half-written file. The temporary file should be created in the target directory, only then can we be sure that the move is atomic.
Attachments (2)
Change History (29)
comment:1 Changed 10 years ago by
Authors: | → Volker Braun |
---|---|
Priority: | blocker → critical |
Status: | new → needs_review |
comment:2 Changed 10 years ago by
Cc: | nbruin added |
---|
comment:3 Changed 10 years ago by
comment:4 Changed 10 years ago by
I agree that one should always open()
and then ask for forgiveness. Fixed, too.
comment:5 Changed 10 years ago by
I think you're guarding a little too much in the try: ... except
now. Wouldn't we want to know if unpickling went wrong? We have a good reason to catch errors from open
: That's how we catch the file being legitimately non-existent. I think it's a real error we want reported if the file exists but doesn't contain a valid pickle. With your proposed code one could end up with a corrupted cache and consistently, silently, lose the efficiency of having the cache.
To some extent it's a matter of taste: Do you want to continue on a best-effort basis as much as possible or do you want the user to know that something didn't go as expected? I think the latter leads to an easier to debug program and therefore is more suitable for sage.
comment:6 Changed 10 years ago by
I've added a warning if the pickle is corrupted and converted sage.sandpiles
to lazily import so we actually have something to test. Though neither of these changes is anywhere near as important as fixing the race condition.
comment:7 follow-up: 8 Changed 10 years ago by
Reviewers: | → Nils Bruin |
---|---|
Status: | needs_review → positive_review |
Looks good to me (mostly). Few small issues:
- whitespace plugin complains about introduced whitespace
- the code writing the cache file first deletes the cachefile, writes the new one in a temporary file, and then moves it in place. Why not leave the old cache file in place while the new one gets written?
I'm setting review to positive anyway.
comment:8 follow-up: 9 Changed 10 years ago by
Replying to nbruin:
- whitespace plugin complains about introduced whitespace
So? Go ahead and design a workflow that takes care of it automatically if it bothers you.
- the code writing the cache file first deletes the cachefile, writes the new one in a temporary file, and then moves it in place.
Because os.rename() will fail if the destination exists on Windows (thanks for the well-thought out filesystem semantics, Bill G.!)
comment:9 follow-up: 10 Changed 10 years ago by
Status: | positive_review → needs_work |
---|
OK, it seems the issues I pointed out are a little less straightforward than I thought, so they should probably be cleared up before this ticket gets merged.
So? Go ahead and design a workflow that takes care of it automatically if it bothers you.
It's a space after a colon on a line you introduce, not some whitespace on an empty line left by an auto-indenting editor.
Because os.rename() will fail if the destination exists on Windows (thanks for the well-thought out filesystem semantics, Bill G.!)
That's not what I meant. You can first write the temp file and then unlink the old one and move the new one in place. Presently the old cache file is already gone while you're writing the new one. An old cache file is better than none at all, it seems in this setting?
comment:10 Changed 10 years ago by
Replying to nbruin:
It's a space after a colon on a line you introduce, not some whitespace on an empty line left by an auto-indenting editor.
So?
That's not what I meant. You can first write the temp file and then unlink the old one and move the new one in place. Presently the old cache file is already gone while you're writing the new one. An old cache file is better than none at all, it seems in this setting?
Or one could argue that a potentially stale cache is worse than a non-existent one. Doesn't really matter either way, serializing a dict takes no time.
comment:11 Changed 10 years ago by
Cc: | robertwb added |
---|---|
Reviewers: | Nils Bruin |
Status: | needs_work → needs_review |
OK, I don't know what we're supposed to do with whitespace anymore. I hope someone else can sign off on that.
Concerning the caching strategy: I'm not so sure the current code is particularly safe in the face of now-current development practices: If a cache file is present and contains an entry for a module then a lazy star import of that module will use that entry to initialize the namespace. I don't think the cache ever gets updated should it be discovered the data is out-of-sync.
The cache-file is only specific to the "branch" of sage. Now that people mostly use mercurial queues, the branch name hardly identifies the particular sage version. In particular, if I apply a patch that changes a lazily imported module, the cache will not be marked as stale and I will continue working with that. The cache will be resaved upon startup (see all.py
), but only the entries of modules that weren't in the cache before will be updated, because the cache is blindly trusted for the other ones.
I don't quite know what the best solution is. Perhaps sage -b
should just always mark the cache as invalid (i.e., delete it)? In that case, sage -b
should perhaps create the cache as well. In that case, there is no need to rewrite the cache every time and there's no race condition either.
It means that the cache would only reflect the sage "system library" and would need to be located somewhere in the tree rather than in a user's .sage
. The problem is that in general the only way to know if the cache is stale is by looking at the modification dates of the module files and once we're statting all of those, we've lost what we were trying to gain. For the sage library we can reasonably assume that only sage -b
changes that, so there we can cache state without having to stat.
I realize that this is escalating the scope of this ticket a bit, but if lazy_import
lies about objects available in the to-be-imported namespace (due to using a stale cache), we have a rather critical problem. I'm including Robert on the CC here in the hope he can explain why the current approach is safe.
Or perhaps we should simply not do lazy star imports, only specific ones. Then the whole cache is not necessary.
comment:12 Changed 10 years ago by
sage -b
deletes the lazy import cache already. Are you saying that it doesn't work on your system?
[vbraun@volker-desktop ~]$ ll .sage/cache/ total 4 -rw-------. 1 vbraun vbraun 464 Dec 13 18:11 _home_vbraun_opt_sage-5.5.rc0_devel_sage-main-lazy_import_cache.pickle [vbraun@volker-desktop ~]$ sage -b >& /dev/null [vbraun@volker-desktop ~]$ ll .sage/cache/ total 0 [vbraun@volker-desktop ~]$
Its better to be specific about what you import if you know what you need. On the other hand, sage.all
of course star-imports other .all
files.
comment:13 Changed 10 years ago by
Yes, sage -b
deletes the cache file, so the cache should never be stale. Making the replacement atomic should be fixed though, and this patch does that.
The code looks good, except that I think the explicit unlink of the cache file before writing a new one is unnecessary (and could cause issues with multiple stage startups at the same time). We've decided to ignore whitespace, so modulo that one issue, positive review.
comment:14 Changed 10 years ago by
As I said before, os.rename()
will fail on Windows if the target exists. We could count on os.rename()
fail on Windows and succeed on Unix, but thats also a bit wonky. Explicitly deleting the cache ensures that save_cache_file()
does what you would think it does.
comment:15 Changed 10 years ago by
OK, code is probably less ambiguous than trying to explain in words. Why do you do
+ try: + os.unlink(cache_file) + except OSError: + pass # cache_file does not exist, fine + # important: create temp dir in cache_dir (trac #13826) to move atomically + _, tmp_file = tempfile.mkstemp(dir=cache_dir) + with open(tmp_file, "w") as tmp: + pickle.dump(star_imports, tmp) + try: + os.rename(tmp_file, cache_file) + except OSError: + pass # Windows can end up here, ignore
instead of
+ # important: create temp dir in cache_dir (trac #13826) to move atomically + _, tmp_file = tempfile.mkstemp(dir=cache_dir) + with open(tmp_file, "w") as tmp: + pickle.dump(star_imports, tmp) + try: + os.unlink(cache_file) + except OSError: + pass # cache_file does not exist, fine + try: + os.rename(tmp_file, cache_file) + except OSError: + remove tmp_file if rename failed.
Replying to vbraun:
sage -b
deletes the lazy import cache already. Are you saying that it doesn't work on your system?
Ah, ok, it does. That alleviates the problem to some degree. I haven't been able to locate the code that does it, but it will only delete the cache in the .sage/cache
of the UID running sage -b
. So other UIDs are still out of luck. It seems to me that if sage -b
knows when the cache needs to be refreshed then it should do so and the cache should probably be somewhere in sage_root rather than in a user directory.
Perhaps rationalizing where the cache is held is something for a follow-up ticket.
comment:16 Changed 10 years ago by
The cache is deleted in setup.py
, so yes this will cause problem if you run a Sage version that somebody else built.
I don't care about where to put the unlink, I put it right after the mkdir since it is part of preparing the directory. Of course other people like their bike shed pink. Or are you saying that you want to clean up after a potential race on Windows? Feel free to implement that if you want. Erasing can again fail on Windows for existing files. Doesn't strike me as the most urgent problem either.
comment:17 Changed 10 years ago by
I think we should plug this obvious race with the patch I posted. So please review if you care about patchbot results :-P
comment:18 Changed 10 years ago by
See #14292 for a general approach to solving the problem of race conditions when writing to a file, which should probably be used here.
comment:19 Changed 10 years ago by
Authors: | Volker Braun → Volker Braun, Jeroen Demeyer |
---|---|
Dependencies: | → #14292 |
Description: | modified (diff) |
comment:20 Changed 10 years ago by
Reviewers: | → Volker Braun |
---|---|
Status: | needs_review → positive_review |
comment:21 Changed 10 years ago by
Milestone: | sage-5.9 → sage-5.10 |
---|
comment:22 Changed 10 years ago by
Added
-
sage/misc/lazy_import.pyx
diff --git a/sage/misc/lazy_import.pyx b/sage/misc/lazy_import.pyx
a b 909 909 """ 910 910 global star_imports 911 911 if star_imports is None: 912 star_imports = {} 912 913 try: 913 914 with open(get_cache_file()) as cache_file: 914 915 star_imports = pickle.load(cache_file) … … 917 918 except Exception: # unpickling failed 918 919 import warnings 919 920 warnings.warn('star_imports cache is corrupted') 920 star_imports = {}921 921 try: 922 922 return star_imports[module_name] 923 923 except KeyError:
Changed 10 years ago by
Attachment: | 13826_star_imports_race_v2.patch added |
---|
comment:23 Changed 10 years ago by
Status: | positive_review → needs_work |
---|
comment:24 Changed 10 years ago by
Status: | needs_work → needs_review |
---|
comment:25 Changed 10 years ago by
Status: | needs_review → positive_review |
---|
comment:26 Changed 10 years ago by
Merged in: | → sage-5.10.beta0 |
---|---|
Resolution: | → fixed |
Status: | positive_review → closed |
comment:27 Changed 10 years ago by
Merged in: | sage-5.10.beta0 → sage-5.9.rc0 |
---|---|
Milestone: | sage-5.10 → sage-5.9 |
While we're nitpicking on possible race conditions anyway, isn't this unsafe?
The file could get deleted between the existence check and the open. Shouldn't one do
This is of course assuming that either the cachefile is valid or does not exist. A "mv" on the filename shouldn't affect an already opened file.