Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#13826 closed defect (fixed)

Race condition in star_imports cache

Reported by: vbraun Owned by: GeorgSWeber
Priority: critical Milestone: sage-5.9
Component: build Keywords:
Cc: nbruin, robertwb Merged in: sage-5.9.rc0
Authors: Volker Braun, Jeroen Demeyer Reviewers: Volker Braun
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: #14292 Stopgaps:

Description (last modified by jdemeyer)

See https://groups.google.com/d/topic/sage-devel/SN88f9qEIV8/discussion

The patchbot there sporadically fails on various tests with errors of the type

Traceback (most recent call last): 
  File "/mnt/storage2TB/patchbot/.sage/tmp/ 
volker_desktop.stp.dias.ie-14095/multireplace_29004.py", line 6, in 
<module> 
    from sage.all_cmdline import *; 
  File "/mnt/storage2TB/patchbot/Sage/sage-5.5.rc0/local/lib/python/ 
site-packages/sage/all_cmdline.py", line 14, in <module> 
    from sage.all import * 
  File "/mnt/storage2TB/patchbot/Sage/sage-5.5.rc0/local/lib/python/ 
site-packages/sage/all.py", line 72, in <module> 
    from sage.rings.all      import * 
  File "/mnt/storage2TB/patchbot/Sage/sage-5.5.rc0/local/lib/python/ 
site-packages/sage/rings/all.py", line 169, in <module> 
    lazy_import("sage.rings.universal_cyclotomic_field.all","*") 
  File "lazy_import.pyx", line 850, in 
sage.misc.lazy_import.lazy_import (sage/misc/lazy_import.c:5168) 
    names[ix:ix+1] = get_star_imports(module) 
  File "lazy_import.pyx", line 900, in 
sage.misc.lazy_import.get_star_imports (sage/misc/lazy_import.c:5924) 
    star_imports = pickle.load(open(cache_file)) 
EOFError 

The patchbot is on a separate harddisk, mounted under /mnt/storage2TB. But my temp directory is tmpfs. So the pickle is moved across block devices, which is of course not atomic. Hence the patchbot sometimes dies here while opening a half-written file. The temporary file should be created in the target directory, only then can we be sure that the move is atomic.

Apply 13826_star_imports_race_v2.patch

Attachments (2)

trac_13826_star_imports_race.patch (3.7 KB) - added by vbraun 8 years ago.
Updated patch
13826_star_imports_race_v2.patch (3.5 KB) - added by jdemeyer 8 years ago.

Download all attachments as: .zip

Change History (29)

comment:1 Changed 8 years ago by vbraun

  • Authors set to Volker Braun
  • Priority changed from blocker to critical
  • Status changed from new to needs_review

comment:2 Changed 8 years ago by vbraun

  • Cc nbruin added

comment:3 Changed 8 years ago by nbruin

While we're nitpicking on possible race conditions anyway, isn't this unsafe?

    if star_imports is None:
        cache_file = get_cache_file()
        if os.path.exists(cache_file):
            star_imports = pickle.load(open(cache_file))
        else:
            star_imports = {}

The file could get deleted between the existence check and the open. Shouldn't one do

    if star_imports is None:
        cache_file = get_cache_file()
        try:
            opened_cache_file=open(cache_file)
        except IOError, <whatever other errors should be caught>:
            opened_cache_file=None
        if opened_cache_file:
            star_imports = pickle.load(opened_cache_file)
            opened_cache_file.close()
        else:
            star_imports = {}

This is of course assuming that either the cachefile is valid or does not exist. A "mv" on the filename shouldn't affect an already opened file.

comment:4 Changed 8 years ago by vbraun

I agree that one should always open() and then ask for forgiveness. Fixed, too.

comment:5 Changed 8 years ago by nbruin

I think you're guarding a little too much in the try: ... except now. Wouldn't we want to know if unpickling went wrong? We have a good reason to catch errors from open: That's how we catch the file being legitimately non-existent. I think it's a real error we want reported if the file exists but doesn't contain a valid pickle. With your proposed code one could end up with a corrupted cache and consistently, silently, lose the efficiency of having the cache.

To some extent it's a matter of taste: Do you want to continue on a best-effort basis as much as possible or do you want the user to know that something didn't go as expected? I think the latter leads to an easier to debug program and therefore is more suitable for sage.

Changed 8 years ago by vbraun

Updated patch

comment:6 Changed 8 years ago by vbraun

I've added a warning if the pickle is corrupted and converted sage.sandpiles to lazily import so we actually have something to test. Though neither of these changes is anywhere near as important as fixing the race condition.

comment:7 follow-up: Changed 8 years ago by nbruin

  • Reviewers set to Nils Bruin
  • Status changed from needs_review to positive_review

Looks good to me (mostly). Few small issues:

  • whitespace plugin complains about introduced whitespace
  • the code writing the cache file first deletes the cachefile, writes the new one in a temporary file, and then moves it in place. Why not leave the old cache file in place while the new one gets written?

I'm setting review to positive anyway.

comment:8 in reply to: ↑ 7 ; follow-up: Changed 8 years ago by vbraun

Replying to nbruin:

  • whitespace plugin complains about introduced whitespace

So? Go ahead and design a workflow that takes care of it automatically if it bothers you.

  • the code writing the cache file first deletes the cachefile, writes the new one in a temporary file, and then moves it in place.

Because os.rename() will fail if the destination exists on Windows (thanks for the well-thought out filesystem semantics, Bill G.!)

comment:9 in reply to: ↑ 8 ; follow-up: Changed 8 years ago by nbruin

  • Status changed from positive_review to needs_work

OK, it seems the issues I pointed out are a little less straightforward than I thought, so they should probably be cleared up before this ticket gets merged.

So? Go ahead and design a workflow that takes care of it automatically if it bothers you.

It's a space after a colon on a line you introduce, not some whitespace on an empty line left by an auto-indenting editor.

Because os.rename() will fail if the destination exists on Windows (thanks for the well-thought out filesystem semantics, Bill G.!)

That's not what I meant. You can first write the temp file and then unlink the old one and move the new one in place. Presently the old cache file is already gone while you're writing the new one. An old cache file is better than none at all, it seems in this setting?

comment:10 in reply to: ↑ 9 Changed 8 years ago by vbraun

Replying to nbruin:

It's a space after a colon on a line you introduce, not some whitespace on an empty line left by an auto-indenting editor.

So?

That's not what I meant. You can first write the temp file and then unlink the old one and move the new one in place. Presently the old cache file is already gone while you're writing the new one. An old cache file is better than none at all, it seems in this setting?

Or one could argue that a potentially stale cache is worse than a non-existent one. Doesn't really matter either way, serializing a dict takes no time.

comment:11 Changed 8 years ago by nbruin

  • Cc robertwb added
  • Reviewers Nils Bruin deleted
  • Status changed from needs_work to needs_review

OK, I don't know what we're supposed to do with whitespace anymore. I hope someone else can sign off on that.

Concerning the caching strategy: I'm not so sure the current code is particularly safe in the face of now-current development practices: If a cache file is present and contains an entry for a module then a lazy star import of that module will use that entry to initialize the namespace. I don't think the cache ever gets updated should it be discovered the data is out-of-sync.

The cache-file is only specific to the "branch" of sage. Now that people mostly use mercurial queues, the branch name hardly identifies the particular sage version. In particular, if I apply a patch that changes a lazily imported module, the cache will not be marked as stale and I will continue working with that. The cache will be resaved upon startup (see all.py), but only the entries of modules that weren't in the cache before will be updated, because the cache is blindly trusted for the other ones.

I don't quite know what the best solution is. Perhaps sage -b should just always mark the cache as invalid (i.e., delete it)? In that case, sage -b should perhaps create the cache as well. In that case, there is no need to rewrite the cache every time and there's no race condition either.

It means that the cache would only reflect the sage "system library" and would need to be located somewhere in the tree rather than in a user's .sage. The problem is that in general the only way to know if the cache is stale is by looking at the modification dates of the module files and once we're statting all of those, we've lost what we were trying to gain. For the sage library we can reasonably assume that only sage -b changes that, so there we can cache state without having to stat.

I realize that this is escalating the scope of this ticket a bit, but if lazy_import lies about objects available in the to-be-imported namespace (due to using a stale cache), we have a rather critical problem. I'm including Robert on the CC here in the hope he can explain why the current approach is safe.

Or perhaps we should simply not do lazy star imports, only specific ones. Then the whole cache is not necessary.

Last edited 8 years ago by nbruin (previous) (diff)

comment:12 Changed 8 years ago by vbraun

sage -b deletes the lazy import cache already. Are you saying that it doesn't work on your system?

[vbraun@volker-desktop ~]$ ll .sage/cache/
total 4
-rw-------. 1 vbraun vbraun 464 Dec 13 18:11 _home_vbraun_opt_sage-5.5.rc0_devel_sage-main-lazy_import_cache.pickle
[vbraun@volker-desktop ~]$ sage -b >& /dev/null
[vbraun@volker-desktop ~]$ ll .sage/cache/
total 0
[vbraun@volker-desktop ~]$ 

Its better to be specific about what you import if you know what you need. On the other hand, sage.all of course star-imports other .all files.

comment:13 Changed 8 years ago by robertwb

Yes, sage -b deletes the cache file, so the cache should never be stale. Making the replacement atomic should be fixed though, and this patch does that.

The code looks good, except that I think the explicit unlink of the cache file before writing a new one is unnecessary (and could cause issues with multiple stage startups at the same time). We've decided to ignore whitespace, so modulo that one issue, positive review.

comment:14 Changed 8 years ago by vbraun

As I said before, os.rename() will fail on Windows if the target exists. We could count on os.rename() fail on Windows and succeed on Unix, but thats also a bit wonky. Explicitly deleting the cache ensures that save_cache_file() does what you would think it does.

comment:15 Changed 8 years ago by nbruin

OK, code is probably less ambiguous than trying to explain in words. Why do you do

+    try:
+        os.unlink(cache_file)
+    except OSError:
+        pass   # cache_file does not exist, fine
+    # important: create temp dir in cache_dir (trac #13826) to move atomically
+    _, tmp_file = tempfile.mkstemp(dir=cache_dir)
+    with open(tmp_file, "w") as tmp:
+        pickle.dump(star_imports, tmp)
+    try:
+        os.rename(tmp_file, cache_file)
+    except OSError:
+        pass   # Windows can end up here, ignore

instead of

+    # important: create temp dir in cache_dir (trac #13826) to move atomically
+    _, tmp_file = tempfile.mkstemp(dir=cache_dir)
+    with open(tmp_file, "w") as tmp:
+        pickle.dump(star_imports, tmp)
+    try:
+        os.unlink(cache_file)
+    except OSError:
+        pass   # cache_file does not exist, fine
+    try:
+        os.rename(tmp_file, cache_file)
+    except OSError:
+        remove tmp_file if rename failed.

Replying to vbraun:

sage -b deletes the lazy import cache already. Are you saying that it doesn't work on your system?

Ah, ok, it does. That alleviates the problem to some degree. I haven't been able to locate the code that does it, but it will only delete the cache in the .sage/cache of the UID running sage -b. So other UIDs are still out of luck. It seems to me that if sage -b knows when the cache needs to be refreshed then it should do so and the cache should probably be somewhere in sage_root rather than in a user directory.

Perhaps rationalizing where the cache is held is something for a follow-up ticket.

comment:16 Changed 8 years ago by vbraun

The cache is deleted in setup.py, so yes this will cause problem if you run a Sage version that somebody else built.

I don't care about where to put the unlink, I put it right after the mkdir since it is part of preparing the directory. Of course other people like their bike shed pink. Or are you saying that you want to clean up after a potential race on Windows? Feel free to implement that if you want. Erasing can again fail on Windows for existing files. Doesn't strike me as the most urgent problem either.

comment:17 Changed 8 years ago by vbraun

I think we should plug this obvious race with the patch I posted. So please review if you care about patchbot results :-P

comment:18 Changed 8 years ago by jdemeyer

See #14292 for a general approach to solving the problem of race conditions when writing to a file, which should probably be used here.

comment:19 Changed 8 years ago by jdemeyer

  • Authors changed from Volker Braun to Volker Braun, Jeroen Demeyer
  • Dependencies set to #14292
  • Description modified (diff)

comment:20 Changed 8 years ago by vbraun

  • Reviewers set to Volker Braun
  • Status changed from needs_review to positive_review

comment:21 Changed 8 years ago by jdemeyer

  • Milestone changed from sage-5.9 to sage-5.10

comment:22 Changed 8 years ago by jdemeyer

Added

  • sage/misc/lazy_import.pyx

    diff --git a/sage/misc/lazy_import.pyx b/sage/misc/lazy_import.pyx
    a b  
    909909    """
    910910    global star_imports
    911911    if star_imports is None:
     912        star_imports = {}
    912913        try:
    913914            with open(get_cache_file()) as cache_file:
    914915                star_imports = pickle.load(cache_file)
     
    917918        except Exception:  # unpickling failed
    918919            import warnings
    919920            warnings.warn('star_imports cache is corrupted')
    920         star_imports = {}
    921921    try:
    922922        return star_imports[module_name]
    923923    except KeyError:

Changed 8 years ago by jdemeyer

comment:23 Changed 8 years ago by jdemeyer

  • Status changed from positive_review to needs_work

comment:24 Changed 8 years ago by jdemeyer

  • Status changed from needs_work to needs_review

comment:25 Changed 8 years ago by vbraun

  • Status changed from needs_review to positive_review

comment:26 Changed 8 years ago by jdemeyer

  • Merged in set to sage-5.10.beta0
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:27 Changed 8 years ago by jdemeyer

  • Merged in changed from sage-5.10.beta0 to sage-5.9.rc0
  • Milestone changed from sage-5.10 to sage-5.9
Note: See TracTickets for help on using tickets.