Opened 8 years ago

Closed 8 years ago

#15457 closed defect (fixed)

sage-cleaner does not quit

Reported by: vbraun Owned by:
Priority: major Milestone: sage-6.2
Component: scripts Keywords:
Cc: Merged in:
Authors: Volker Braun Reviewers: Jeroen Demeyer
Report Upstream: N/A Work issues:
Branch: 15c3964 (Commits, GitHub, GitLab) Commit: 15c3964e23fe55ef3f96a7ac7b6b5ae8292a9006
Dependencies: Stopgaps:

Status badges

Description

If there are files in DOT_SAGE/temp/HOSTNAME then the sage cleaner process never stops. Note that Sage temp files must(!) be under DOT_SAGE/temp/HOSTNAME/<pid>.

Change History (24)

comment:1 Changed 8 years ago by vbraun

  • Branch set to u/vbraun/cleaner_cleaner

comment:2 Changed 8 years ago by vbraun

  • Branch u/vbraun/cleaner_cleaner deleted

Also, clean up the cleaner sources a bit. And move the pid file to SAGE_TMP_ROOT/cleaner.pid. And log the output to SAGE_TMP_ROOT/cleaner.log.

comment:3 Changed 8 years ago by vbraun

  • Branch set to u/vbraun/cleaner_cleaner

comment:4 Changed 8 years ago by vbraun

  • Commit set to cab931b669b1f7dff0b9a99f5e7394726384e6fe
  • Status changed from new to needs_review

New commits:

cab931bFix bug where sage-cleaner does not quit under certain circumstances

comment:5 Changed 8 years ago by jdemeyer

Who or what is writing to DOT_SAGE/temp/HOSTNAME? I haven't seen this problem appear before.

comment:6 follow-up: Changed 8 years ago by vbraun

The files were fairly old afair. Of course it is hard to say where they came from, could even have been a ticket that was not merged that way. I guess you just haven't noticed since you didn't try to delete DOT_SAGE (which will fail on NFS if files are still open). In any case, the sage cleaner shoudn't break just because there is a temp file in the wrong location. By outright deleting the illegal file we hopefully hurt whoever put it there so they notice the error in their ways.

For the record, I configured the new buildbot so that full builds start with a fresh DOT_SAGE and incremental builds keep using the previous one.

comment:7 in reply to: ↑ 6 Changed 8 years ago by jdemeyer

Replying to vbraun:

The files were fairly old afair. Of course it is hard to say where they came from, could even have been a ticket that was not merged that way.

Thanks, that's a good explanation.

comment:8 Changed 8 years ago by vbraun

Can you review this ticket?

comment:9 Changed 8 years ago by jdemeyer

  • Reviewers set to Jeroen Demeyer

comment:10 Changed 8 years ago by jdemeyer

What's the motivation for always re-opening the file in LogFile?

comment:12 Changed 8 years ago by jdemeyer

And why not replace

    if not os.path.isdir(SAGE_TMP_ROOT):
        mkdir_p(SAGE_TMP_ROOT)

by

    mkdir_p(SAGE_TMP_ROOT)

(or at least, move the isdir check inside mkdir_p)

comment:13 Changed 8 years ago by jdemeyer

Working on review patch, hang on...

comment:14 Changed 8 years ago by jdemeyer

And what was wrong with

open(pidfile,'w').write(str(os.getpid()))

???

comment:15 Changed 8 years ago by jdemeyer

  • Branch changed from u/vbraun/cleaner_cleaner to u/jdemeyer/ticket/15457
  • Created changed from 11/26/13 22:52:06 to 11/26/13 22:52:06
  • Modified changed from 01/01/14 16:09:48 to 01/01/14 16:09:48

comment:16 Changed 8 years ago by jdemeyer

  • Commit changed from cab931b669b1f7dff0b9a99f5e7394726384e6fe to 7a16247bf02e94e1522e8728894082cee8c825b5

Additional commit needs review.


New commits:

7a16247sage-cleaner review patch
abfca17Merge branch 'develop' into ticket/15457

comment:17 follow-up: Changed 8 years ago by vbraun

IMHO it is bad to keep log files open for extended amounts of time. This is just going to cause problems (depending on the file system) if multiple processes try do do that. And before you say that there is only one cleaner process, the whole point of the logging is to have a log to prove that if things go south again. The Python logging module does not support multiple processes logging to the same file, for the record. But at least it flushes the output.

The open(pidfile,'w').write(str(os.getpid())) construct works but is IMHO bad style. For starters, it does an implicit close (explicit is better than implicit). The point in the program flow where the close occurs is also not specified (CPython implementation detail closes it immediately, I think, but the specs don't specify). And until the close there is (probably, again implementation/fs detail) nothing written to disk.

comment:18 in reply to: ↑ 17 Changed 8 years ago by jdemeyer

Replying to vbraun:

IMHO it is bad to keep log files open for extended amounts of time. This is just going to cause problems (depending on the file system) if multiple processes try do do that. And before you say that there is only one cleaner process, the whole point of the logging is to have a log to prove that if things go south again. The Python logging module does not support multiple processes logging to the same file, for the record. But at least it flushes the output.

Having multiple processes append to the same file is indeed subject to race conditions on NFS (but almost everything with NFS has race conditions). However, this is independent of how many processes have the file open, I see no evidence that simply having a file open makes things worse.

comment:19 Changed 8 years ago by git

  • Commit changed from 7a16247bf02e94e1522e8728894082cee8c825b5 to 15c3964e23fe55ef3f96a7ac7b6b5ae8292a9006

Branch pushed to git repo; I updated commit sha1. New commits:

15c3964Use "with" to write pidfile

comment:20 Changed 8 years ago by vbraun

NFS isn't even close to being the worst case here. For example on the PanFS network file system opening a file also locks it, so all other processes get EACCESS. Its not about a race, its that other processes can't even log that something is wrong.

comment:21 Changed 8 years ago by jdemeyer

Should we really jump through hoops to support broken file systems?

comment:22 Changed 8 years ago by vbraun

Afaik there is nothing in the spec that disallows automatic mandatory file locking. Its not done in the usual local filesystems, but that doesn't mean that Panasas is broken. But even if it goes agains POSIX then its still a widely-used filesystem in the HPC world, so we would only hurt ourselves if we were to intentionally break Sage on it.

comment:23 Changed 8 years ago by vbraun_spam

  • Milestone changed from sage-6.1 to sage-6.2

comment:24 Changed 8 years ago by vbraun

  • Branch changed from u/jdemeyer/ticket/15457 to 15c3964e23fe55ef3f96a7ac7b6b5ae8292a9006
  • Resolution set to fixed
  • Status changed from needs_review to closed
Note: See TracTickets for help on using tickets.