Opened 11 years ago

Closed 11 years ago

#10098 closed defect (fixed)

Flaky doctest in sage/interfaces/expect.py

Reported by: mpatel Owned by: mvngu
Priority: blocker Milestone: sage-4.6
Component: doctest coverage Keywords:
Cc: boothby, drkirkby, malb, mariah, SimonKing, vbraun, was Merged in: sage-4.6.rc0
Authors: Volker Braun Reviewers: Mitesh Patel
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by mpatel)

With #10004 merged into 4.6.alpha3, I get the doctest failures

sage -t -long  -force_lib devel/sage/sage/interfaces/expect.py
**********************************************************************
File "/home/buildbot/build/sage/cleo-1/cleo_full/build/sage-4.6.alpha3/devel/sage-main/sage/interfaces/expect.py", line 596:
    sage: L = [t[1] for t in f(range(5))]
Expected nothing
Got:
    [Errno 16] Device or resource busy: '/home/buildbot/build/sage/cleo-1/cleo_full/build/sage-4.6.alpha3/.sage/temp/cleo/32313/dir_0/.nfs00000000007d541a0000abe9'
**********************************************************************
File "/home/buildbot/build/sage/cleo-1/cleo_full/build/sage-4.6.alpha3/devel/sage-main/sage/interfaces/expect.py", line 603:
    sage: L = [t[1] for t in f(range(5))]
Expected nothing
Got:
    [Errno 16] Device or resource busy: '/home/buildbot/build/sage/cleo-1/cleo_full/build/sage-4.6.alpha3/.sage/temp/cleo/32313/dir_1/.nfs00000000007d541a0000abf1'

on the Skynet machines cleo, eno, fulvia, sextus, and taurus.

The @fork decorator tickets #9501, #9616, and #9631 are related.

Attachments (1)

trac_10098-use_fork_verbose.patch (668 bytes) - added by mpatel 11 years ago.
Use verbose=False by default for forking @parallel.

Download all attachments as: .zip

Change History (24)

comment:1 follow-up: Changed 11 years ago by mpatel

  • Summary changed from Flaky doctest in to Flaky doctest in sage/interfaces/expect.py

This is reminiscent of #9501 / #9616. Since the fix at #10004 works (right?), maybe we should tag these tests as "not tested" instead of reverting the patch? Or is there another way?

comment:2 in reply to: ↑ 1 Changed 11 years ago by SimonKing

Replying to mpatel:

This is reminiscent of #9501 / #9616. Since the fix at #10004 works (right?),

I hope so...

maybe we should tag these tests as "not tested" instead of reverting the patch? Or is there another way?

What doees "device busy" exactly mean? Can one trac this down? Actually the purpose of #10004 was to avoid conflicts (race conditions) on the hard disk.

comment:3 follow-up: Changed 11 years ago by mpatel

  • Cc drkirkby mariah vbraun added
  • Description modified (diff)

I think this is a filesystem quirk (a problem?), but I'm not sure. David, Mariah, and Volker, any thoughts?

comment:4 follow-up: Changed 11 years ago by vbraun

Did that appear at the same time as the "sync" was removed from NFS? Or did we upgrade the expect python module?

I think something is tripping over the following fact: In UNIX, you can open a file, then delete it, then delete the containing directory. The file is still open for business until you close it. But NFS doesn't have inodes, so if you delete an open file the NFS volume will contain a fake directory entry .nfs00000000007d541a0000abf1 that only disappears when you close the file. I guess that the "Device or resource busy" error is from trying to delete the non-empty directory. Either because the open file descriptor was leaked or because of some concurrency issue with NFS.

comment:5 in reply to: ↑ 3 Changed 11 years ago by drkirkby

Replying to mpatel:

I think this is a filesystem quirk (a problem?), but I'm not sure. David, Mariah, and Volker, any thoughts?

I tend to agree in this case - it looks like file systems issues.

However, I've seen problems with the pexpect interface on systems which use local file systems. I'm not convinced it is a very reliable tool myself.

I made a point on sage-devel half an hour or so ago, but I'll make it briefly here. The SQlite developers take testing very seriously

http://www.sqlite.org/testing.html

Their test code is more than 600 times as big as code in the SQlite database!

They advise against the use of NFS as fcntl() file locking is broken on many NFS implementations.

http://www.sqlite.org/faq.html#q5

My own money would be that if a Sun running Solaris was serving another Sun running Solaris, it would work! But perhaps I'm biased - I tend to trust Solaris more than random collections of Linux. (Of course, when William disabled the ZIL log on a ZFS file system on disk.math to improve performance, all bets were off. That is well documented to be a bad policy, and should never be used except for test purposes).

Dave

comment:6 in reply to: ↑ 4 ; follow-up: Changed 11 years ago by mpatel

  • Description modified (diff)

Replying to vbraun:

Did that appear at the same time as the "sync" was removed from NFS? Or did we upgrade the expect python module?

As far as I'm aware, the Skynet cluster is administered independently (by Mariah Lenox) from the Sage cluster. I don't know whether Mariah has made any recent changes to Skynet's NFS configuration.

We made some changes to sage.interfaces.expect at #10004.

I think something is tripping over the following fact: In UNIX, you can open a file, then delete it, then delete the containing directory. The file is still open for business until you close it. But NFS doesn't have inodes, so if you delete an open file the NFS volume will contain a fake directory entry .nfs00000000007d541a0000abf1 that only disappears when you close the file. I guess that the "Device or resource busy" error is from trying to delete the non-empty directory. Either because the open file descriptor was leaked or because of some concurrency issue with NFS.

Thanks! I've tried to make an example:

sage: import os
sage: os.mkdir('dir')
sage: f = open('dir/foo', 'w')
sage: os.listdir('dir')
['foo']
sage: # Now 'rm dir/foo' in a separate window.
sage: sage: os.listdir('dir')
['.nfs000000000143f92500013636']
sage:  os.unlink('dir/.nfs000000000143f92500013636')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)

/home/mpatel/<ipython console> in <module>()

OSError: [Errno 16] Device or resource busy: 'dir/.nfs000000000143f92500013636'
sage: os.rmdir('dir')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)

/home/mpatel/<ipython console> in <module>()

OSError: [Errno 39] Directory not empty: 'dir'
sage: f.close()
sage: os.rmdir('dir')
sage: 

For the problem in the description, I suppose Sage tries to empty a temporary directory. What if we wrap that in a try-except block?

comment:7 in reply to: ↑ 6 Changed 11 years ago by drkirkby

Replying to mpatel:

As far as I'm aware, the Skynet cluster is administered independently (by Mariah Lenox) from the Sage cluster. I don't know whether Mariah has made any recent changes to Skynet's NFS configuration.

Thinking about it, Skynet was made unavailable a few days ago for scheduled maintenance, so some changes have been made there.

comment:8 Changed 11 years ago by drkirkby

I think this shows the value of a suggestion I've made before on other ticket(s).

If we knew the exact date/time of the tests (pass or fail), we could look at the logs on the machine and see if there was a problem. Problems with NFS are often logged - I've seen them logged on t2.math, when William had disabled the ZIL. We can even tie them up with user name - here's a few errors logged for both Simon and John.

Sep 12 09:05:40 t2 nfs: [ID 814820 kern.info] NOTICE: [NFS4][Server: disk][Mntpt: /home]Lost OP_CLOSE request for fs /home, file ./palmieri/.sage-t2/temp/t2/23760/qsieve_0/lprels.1005.23843 (rn
ode_pt: 0x30009d028d8), dir <null string> (0x0) for server disk
Sep 12 09:05:40 t2 nfs: [ID 581112 kern.info] NOTICE: [NFS4][Server: disk][Mntpt: /home]NFS Starting recovery for mount /home (mi 0x60040c42000 mi_recovflags [0x40]) on server disk, rnode_pt1 .
/palmieri/.sage-t2/temp/t2/23760/qsieve_0/lprels.1005.23843 (0x30009d028d8), rnode_pt2 <null string> (0x0)
Sep 12 09:05:40 t2 nfs: [ID 273629 kern.info] NOTICE: [NFS4][Server: disk][Mntpt: /home]NFS Recovery done for mount /home (mi 0x60040c42000) on server disk, rnode_pt1 ./palmieri/.sage-t2/temp/t
2/23760/qsieve_0/lprels.1005.23843 (0x30009d028d8), rnode_pt2 <null string> (0x0)
Sep 12 09:05:40 t2 nfs: [ID 220977 kern.info] NOTICE: [NFS4][Server: disk][Mntpt: /home]NFS op OP_CLOSE got error 4 causing recovery action NR_LOST_STATE_RQST.
Sep 12 09:05:40 t2 nfs: [ID 814820 kern.info] NOTICE: [NFS4][Server: disk][Mntpt: /home]Lost OP_CLOSE request for fs /home, file ./palmieri/.sage-t2/ipython/history-sage (rnode_pt: 0x300375c68f
0), dir <null string> (0x0) for server disk
Sep 12 09:05:40 t2 nfs: [ID 581112 kern.info] NOTICE: [NFS4][Server: disk][Mntpt: /home]NFS Starting recovery for mount /home (mi 0x60040c42000 mi_recovflags [0x40]) on server disk, rnode_pt1 .
/palmieri/.sage-t2/ipython/history-sage (0x300375c68f0), rnode_pt2 <null string> (0x0)
Sep 12 09:05:40 t2 nfs: [ID 273629 kern.info] NOTICE: [NFS4][Server: disk][Mntpt: /home]NFS Recovery done for mount /home (mi 0x60040c42000) on server disk, rnode_pt1 ./palmieri/.sage-t2/ipytho
n/history-sage (0x300375c68f0), rnode_pt2 <null string> (0x0)
Sep 12 09:05:40 t2 nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: disk][Mntpt: /home]NFS op OP_CLOSE got error NFS4ERR_STALE causing recovery action NR_STALE.
Sep 12 09:05:40 t2 nfs: [ID 286389 kern.info] NOTICE: [NFS4][Server: disk][Mntpt: /home]File ./SimonKing/.sage/temp/t2/13276/dir_2/.nfsD44308 (rnode_pt: 3000da37bd8) was closed due to NFS recov
ery error on server disk(failed to recover from NFS4ERR_STALE NFS4ERR_STALE)

If the doc tests had the exact times, it would go some way to helping determine if the system detected a problem.

comment:9 Changed 11 years ago by mpatel

Replying to mpatel:

For the problem in the description, I suppose Sage tries to empty a temporary directory. What if we wrap that in a try-except block?

It's already wrapped in a such block. From near the end of sage.parallel.use_fork.p_iter_fork.__call__:

       finally:

            # Clean up all temporary files.                                     
            try:
                for X in os.listdir(dir):
                    os.unlink(os.path.join(dir, X))
                os.rmdir(dir)
            except OSError, msg:
                if self.verbose:
                    print msg

Changing the doctest to

@parallel(verbose=False)
def f(n):
    return gap._local_tmpfile()

L = [t[1] for t in f(range(5))]
len(set(L))

should "fix" the problem. This example is not intended to test use_fork, which has it's own tests, so we could justify doing this here.

comment:10 Changed 11 years ago by mpatel

Or, perhaps, we could optionally re-raise in __call__ (see the previous comment) and wrap L = [t[1] for t in f(range(5))] in a try-except block. We'd just pass if the exception is of the form in the description. This might also help with the very similar problem at #9501, #9616, and #9631.

comment:11 follow-up: Changed 11 years ago by vbraun

The only use of verbose is to print something in case of a timeout and if the temp files could not be cleaned up. The former is pretty much intentional (you have to set a timeout), the latter is a fact of life on some OS/filesystem combinations. I think we should just change the default to verbose=False.

comment:12 in reply to: ↑ 11 Changed 11 years ago by mpatel

  • Cc was added

Replying to vbraun:

The only use of verbose is to print something in case of a timeout and if the temp files could not be cleaned up. The former is pretty much intentional (you have to set a timeout), the latter is a fact of life on some OS/filesystem combinations. I think we should just change the default to verbose=False.

Users of forking @parallel could activate the potentially useful timeout messages with verbose=True. If they didn't do this, they'd still learn about timeouts from the results, although not until all subprocesses had ended or been killed. Is this OK? I don't use @parallel very often.

Changed 11 years ago by mpatel

Use verbose=False by default for forking @parallel.

comment:13 Changed 11 years ago by mpatel

  • Authors set to Volker Braun
  • Status changed from new to needs_review

I've added a patch using Volker's suggestion.

comment:14 Changed 11 years ago by mpatel

  • Cc boothby added

comment:15 Changed 11 years ago by mpatel

  • Cc malb added

comment:16 Changed 11 years ago by mpatel

  • Description modified (diff)

comment:17 follow-up: Changed 11 years ago by malb

Hi, did you CC me because you need a reviewer? Or is there a particular question I should answer?

comment:18 in reply to: ↑ 17 ; follow-ups: Changed 11 years ago by mpatel

Replying to malb:

Hi, did you CC me because you need a reviewer? Or is there a particular question I should answer?

Oops. I should have added a comment like this:

Martin, Simon, Tom, and William, do you have any objections to changing the verbose parameter of the forking @parallel decorator to default to False? The patch above does this, but I'm not sure if it's OK to some of the potential heavy users of the decorator (see this comment). With the patch, we'll suppress the NFS-related doctest messages in the description.

Alternatively, we could use @parallel(verbose=False) in individual doctests.

comment:19 Changed 11 years ago by mpatel

We can use a similar workaround at #9501/#9616/#9631, if we add a verbose parameter to fork and propagate its value to Parallel.

comment:20 in reply to: ↑ 18 Changed 11 years ago by SimonKing

Replying to mpatel:

Martin, Simon, Tom, and William, do you have any objections to changing the verbose parameter of the forking @parallel decorator to default to False? The patch above does this, but I'm not sure if it's OK to some of the potential heavy users of the decorator (see this comment). With the patch, we'll suppress the NFS-related doctest messages in the description.

Alternatively, we could use @parallel(verbose=False) in individual doctests.

I think something like "verbose" should not be default. A warning about a device being busy is not necessary from my point of view if the program is designed to deal with that problem (so that the "problem" is in fact no problem).

comment:21 in reply to: ↑ 18 Changed 11 years ago by malb

Replying to mpatel:

Martin, Simon, Tom, and William, do you have any objections to changing the verbose parameter of the forking @parallel decorator to default to False?

That is fine with me (I'm actually not a heavy user of the decorator :))

comment:22 Changed 11 years ago by mpatel

  • Reviewers set to Mitesh Patel
  • Status changed from needs_review to positive_review

comment:23 Changed 11 years ago by mpatel

  • Merged in set to sage-4.6.rc0
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.