Opened 4 years ago

# Non-deterministic test failure in cluster_algebra_quiver

• Reported by: embray
• Priority: major
• Milestone: sage-8.0
• Component: combinatorics
• Keywords: cluster
• Cc: tscrim, stumpc5, etn40ff, gmoose05
• Author: Frédéric Chapoton
• Branch/Commit: N/A

When running `sage -t -a -p 0` in the develop Docker container (off a current version of the develop branch) I'm getting:

```
sage -t /opt/sage/src/sage/combinat/cluster_algebra_quiver/mutation_class.py
    [54 tests, 0.34 s]
sage -t /opt/sage/src/sage/combinat/cluster_algebra_quiver/mutation_type.py
    [71 tests, 0.88 s]
sage -t /opt/sage/src/sage/combinat/cluster_algebra_quiver/quiver.py
**********************************************************************
File "/opt/sage/src/sage/combinat/cluster_algebra_quiver/quiver.py", line 880, in sage.combinat.cluster_algebra_quiver.quiver.ClusterQuiver.mutation_type
Failed example:
    Q.mutation_type()
Expected:
    ['G', 2]
Got:
    'undetermined finite mutation type'
**********************************************************************
1 of  37 in sage.combinat.cluster_algebra_quiver.quiver.ClusterQuiver.mutation_type
    [272 tests, 1 failure, 5.57 s]
```


but only sometimes. It seems like it may be a test-ordering issue.

### comment:1 Changed 4 years ago by embray

• Description modified (diff)

### comment:2 Changed 4 years ago by embray

• Summary changed from Non-deterministic test failure in cluster_algebray_quiver to Non-deterministic test failure in cluster_algebra_quiver

Freudian slip?

### comment:3 Changed 4 years ago by embray

I noticed that these mutation_class files get written to the user's actual DOT_SAGE directory while running the tests. Do the tests not use a temp dir for DOT_SAGE?
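
If the test runner does not sandbox DOT_SAGE, one low-tech workaround (a sketch only, assuming the doctest framework honors the DOT_SAGE environment variable) is to point it at a throwaway directory for the test run; the actual sage invocation is left commented out here:

```python
import os
import tempfile

# Hypothetical workaround: run the doctests with DOT_SAGE pointing at a
# fresh temporary directory, so the mutation-class cache files never
# touch the user's real ~/.sage directory.
tmp = tempfile.mkdtemp()
env = dict(os.environ, DOT_SAGE=tmp)
# import subprocess
# subprocess.run(["sage", "-t", "-a", "-p", "0"], env=env, check=True)
```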

### comment:4 follow-up: ↓ 5 Changed 4 years ago by embray

It seems this probably happens if sage/combinat/cluster_algebra_quiver/quiver_mutation_type.py is not finished before sage/combinat/cluster_algebra_quiver/quiver.py starts. I'm having a hard time getting it to do that reliably, but I am working on a patch nonetheless.

### comment:5 in reply to: ↑ 4 Changed 4 years ago by embray

> It seems this probably happens if sage/combinat/cluster_algebra_quiver/quiver_mutation_type.py is not finished before sage/combinat/cluster_algebra_quiver/quiver.py starts. I'm having a hard time getting it to do that reliably, but I am working on a patch nonetheless.

No, that's not quite it either, because otherwise testing just sage/combinat/cluster_algebra_quiver/quiver.py with a clean .sage would fail reliably.

### comment:6 Changed 4 years ago by chapoton

For what it's worth, from knowledge of the mathematical context and code:

I would say that a priori this result cannot happen unless Q is something other than the value given in the previous line. How this can happen is unknown to me.

The result is cached in Q._mutation_type, so the problem may come from cache confusion (objects sharing their hash?).
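
The "cache confusion" guess can be illustrated with a toy (this is not Sage's actual caching code; ToyQuiver, cached_type_bad and cached_type_good are made-up names): a cache keyed on hash(obj) conflates distinct objects whose hashes collide, whereas a dict keyed on the objects themselves still tells them apart via __eq__:

```python
class ToyQuiver:
    """Stand-in object whose __hash__ always collides (illustration only)."""
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 0  # force a hash collision between all instances

def compute_type(q):
    return "type-" + q.name  # pretend this is the expensive computation

bad_cache = {}
def cached_type_bad(q):
    key = hash(q)  # BUG: a hash is not a unique identifier
    if key not in bad_cache:
        bad_cache[key] = compute_type(q)
    return bad_cache[key]

good_cache = {}
def cached_type_good(q):
    # dict lookup uses hash *and* __eq__, so colliding but distinct
    # objects still get separate cache entries
    if q not in good_cache:
        good_cache[q] = compute_type(q)
    return good_cache[q]

a, b = ToyQuiver("A2"), ToyQuiver("G2")
```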

### comment:7 Changed 4 years ago by chapoton

• Cc tscrim stumpc5 etn40ff gmoose05 added

### comment:8 Changed 4 years ago by gmoose05

I think I know the issue. To avoid infinite loops when trying to figure out what the mutation type is, the first step in the command Q.mutation_type() for quivers is to run "Q.is_mutation_finite()" (unless the type of Q is already known from the start).

However, as is documented, this command "Uses a non-deterministic method by random mutations in various directions. Can result in a wrong answer."

I'm surprised that so many wrong answers are arising from this non-deterministic process, but it seems like the pseudo-random seeds are being consistently poorly chosen for this example.

Anyway, I agree that this can lead to wrong answers, but unless we change the "is_mutation_finite" command this is not an entirely unexpected bug; it is part of how the code was written, for computational efficiency's sake.
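
The mechanism described above can be sketched in plain Python (a toy model, not Sage's implementation; mutate and looks_mutation_finite are illustrative names). It relies on the known fact that a connected quiver on n >= 3 vertices is mutation-finite iff no quiver in its mutation class has an edge weight above 2, and walks randomly through the class looking for a witness:

```python
import random

def mutate(B, k):
    # Fomin-Zelevinsky matrix mutation of a skew-symmetric integer
    # matrix B at index k; applying it twice at the same k is the identity.
    n = len(B)
    Bp = [row[:] for row in B]
    for i in range(n):
        for j in range(n):
            if i == k or j == k:
                Bp[i][j] = -B[i][j]
            else:
                Bp[i][j] = B[i][j] + (abs(B[i][k]) * B[k][j]
                                      + B[i][k] * abs(B[k][j])) // 2
    return Bp

def looks_mutation_finite(B, trials=1000, seed=None):
    # Random walk in the mutation class.  Finding any entry with
    # |b_ij| > 2 proves mutation-infiniteness (connected, n >= 3);
    # surviving `trials` steps is only probabilistic evidence of
    # finiteness -- hence the non-deterministic doctest behaviour.
    rng = random.Random(seed)
    n = len(B)
    for _ in range(trials):
        if any(abs(x) > 2 for row in B for x in row):
            return False  # definite witness: not mutation-finite
        B = mutate(B, rng.randrange(n))
    return True  # probably finite (can be a wrong answer)
```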

### comment:9 Changed 4 years ago by etn40ff

I agree with Gregg: this is, in general, an artifact of the non-deterministic nature of mutation_type. Probably this method, and anything else calling is_mutation_finite, should mention explicitly how it works in its documentation and have # random doctests.

On the other hand, the specific error in the ticket description is yet another instance of #22381. The quiver has only two non-frozen vertices, so it should be recognized immediately. Coefficients mess things up.

### comment:10 Changed 4 years ago by chapoton

• Authors set to Frédéric Chapoton
• Branch set to u/chapoton/22482
• Commit set to 0f0b3c22651611dd0aa0cbb4ee6075e724128a53

I have made some micro-enhancements to the is_mutation_finite code.

This might help prevent the sporadic doctest failure.

New commits:

 0f0b3c2 trac 22482 small enhancements in is_mutation_finite

### comment:11 Changed 4 years ago by embray

Thanks for looking into this! For the tests, would it make sense to just run them with a hard-coded random seed? There should even be a context manager for that somewhere...
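
Sage does provide set_random_seed() for exactly this. As a plain-Python illustration of the idea (noisy_answer and the seed value are made up for this sketch), seeding the RNG makes a randomized check reproducible across doctest runs:

```python
import random

def noisy_answer(rng, n_steps=100):
    # Toy stand-in for a randomized computation whose result depends on
    # the RNG state (compare is_mutation_finite's random mutations).
    return sum(rng.randrange(2) for _ in range(n_steps))

# With a hard-coded seed the doctest sees the same RNG state every run,
# so the answer is deterministic even though the algorithm is randomized.
run1 = noisy_answer(random.Random(22482))
run2 = noisy_answer(random.Random(22482))
```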

### comment:12 Changed 4 years ago by chapoton

Review, somebody?

### comment:13 Changed 4 years ago by chapoton

• Milestone changed from sage-7.6 to sage-8.0
• Status changed from new to needs_review

### comment:14 Changed 4 years ago by embray

For some reason the merge preview seems to be broken: trying to view it returns an internal server error.

Maybe try rebasing it, just for the heck of it?

### comment:15 follow-up: ↓ 24 Changed 4 years ago by stumpc5

Same here, but you can navigate by hand to https://git.sagemath.org/sage.git/commit/?h=u/chapoton/22482&id=0f0b3c22651611dd0aa0cbb4ee6075e724128a53 .

@Frederic: I don't see how your new choice of the new direction is at all different from before...

Also, I ran millions of automated tests when I wrote the code some years ago, and this non-deterministic method really *always* produced the correct answer. I cannot believe this can be the problem! The reason it should work is very simple: a finite type is usually very finite (in the sense that one usually does not consider finite types with >> 10^6 clusters), so after 1000 mutations in random directions, either you stay in a very small set of clusters (then the type is finite), or you have hit a witness that the type is not finite, with probability 100% - \epsilon.

### comment:16 follow-up: ↓ 17 Changed 4 years ago by embray

"Always" is a strong word, though. To be clear, this was a very hard issue to reproduce, and when I run the tests for this code under normal circumstances they always pass. It just happens that when the tests are run under certain conditions, there must be some state of the RNG under which they end up failing.

I could try to add some further diagnostics and try to reproduce the failure again, if you point me in the right direction as to what to add.

### comment:17 in reply to: ↑ 16 Changed 4 years ago by stumpc5

Okay, let's go through the problem step by step (and if the below is correct, then this has nothing to do with the non-deterministic test, which I still maintain works correctly):

1. You call Q.mutation_type().
2. You call is_mutation_finite, which correctly determines finiteness, so you follow the else clause in l912.
3. Your quiver is connected, so you call _mutation_type_from_data with the optional argument compute_if_necessary=False on its unique component.
4. (I expect that you have not built the data, so this does not result in any hit.) This results in mut_type_part == 'unknown'.
5. You call _connected_mutation_type, which can determine all finite mutation types from the infinite families (but not the exceptional types, including G2).
6. --> I BET THIS CAUSES THE ISSUE <-- You again call _mutation_type_from_data, this time with the optional argument compute_if_necessary=True. This brute-force computes all clusters of exceptional types (of the correct rank) and writes the output to external files. These external files (relative_filename = 'cluster_algebra_quiver/mutation_classes_%s.dig6'%n) are then finally checked.
7. As the final result, we obtain the mutation type as in your output above: mut_type_part = 'undetermined finite mutation type'.

I do not know the full context of your issue. But if I must place a bet: your doctest environment has a problem writing these external files in (6) correctly, and this causes the buggy behaviour.

Cheers, Christian

Last edited 4 years ago by stumpc5 (previous) (diff)
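
The compute-once / read-thereafter pattern in step (6) can be sketched as follows. Everything here is illustrative, not Sage's actual code (only the cache filename mirrors the one quoted above); the point is that the write is not atomic, so if two parallel test processes hit it simultaneously, one can read a partially written file:

```python
import os

def cached_mutation_classes(n, cache_dir, compute):
    # Toy sketch: look up a cached file of mutation classes for rank n;
    # if absent, brute-force compute it and write it out for next time.
    path = os.path.join(cache_dir, "mutation_classes_%s.dig6" % n)
    if os.path.exists(path):
        with open(path) as f:
            return f.read().splitlines()
    data = compute(n)  # expensive brute-force enumeration
    with open(path, "w") as f:  # NOT atomic: a parallel reader may see
        f.write("\n".join(data))  # a half-written file here
    return data
```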

### comment:18 follow-up: ↓ 21 Changed 4 years ago by embray

What do you mean by "has a problem writing these external files correctly" (or, for that matter, by "external")? I can see that the files are being written. One thing I noticed is that they are written to my actual DOT_SAGE, when it seems to me the test suite should not be writing there.

Other than that there's nothing "special" about the environment or with Docker with regard to writing files.

### comment:19 follow-up: ↓ 20 Changed 4 years ago by stumpc5

One possible problem could be that multiple tests run .mutation_type in parallel: both find that the file does not exist at the beginning, then write to and read from the same file, which produces bad results.

Is it correct that you don't have a way to trigger the bug? How often did it happen? Once the file is there, the correct answer should come directly out of (3) above. Have you seen the bug after the file was correctly computed and placed?

### comment:20 in reply to: ↑ 19 Changed 4 years ago by jdemeyer

> One possible problem could be that multiple tests do .mutation_type in parallel. Both find the file to not exist at the beginning, and start writing and reading from the same file which results in bad results.

If I'm reading the code correctly, the files are written using atomic_write which should be free of race conditions while writing the file (fixed in #15534).
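
The pattern behind atomic_write (Sage's own helper; the sketch below is generic Python, not Sage's implementation, and atomic_write_text is a made-up name) is to write to a temporary file in the same directory and then rename it over the destination, so a concurrent reader sees either the old complete file or the new complete file, never a partial one:

```python
import os
import tempfile

def atomic_write_text(path, data):
    # Write to a temp file in the target directory, then rename it over
    # the destination.  os.replace() is atomic on POSIX filesystems, so
    # a parallel reader can never observe a half-written file.
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on failure
        raise
```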

To debug further, it would be very interesting to know if the problem occurs when testing just a single file or only when testing in parallel.

### comment:21 in reply to: ↑ 18 Changed 4 years ago by jdemeyer

> Other than that there's nothing "special" about the environment or with Docker with regard to writing files.

Assuming of course that there are no bugs in Docker or the OS.

### comment:22 Changed 4 years ago by jdemeyer

Note: there has been a bug in Sage involving the pexpect interface (don't ask me the number; I think it was reported by Erik) which was only reproducible on Docker. The underlying bug was a genuine race condition involving a fork(). However, for some reason the timing outside of Docker was always such that the bug wouldn't be triggered; in Docker, things ran in a different order, causing trouble.

So: if anything in this code path involves multiple processes or threads, that's certainly a place to look for bugs.

### comment:23 Changed 4 years ago by embray

I don't think there's a bug here in Docker or the OS, beyond the fact that it might cause the tests to run in a slightly different order than usual. As I stated in the description, this happens when running the tests in parallel, and only sometimes. It can't be reproduced normally.

### comment:24 in reply to: ↑ 15 Changed 4 years ago by chapoton

> @Frederic: I don't see how your new choice of the new direction is at all different from before...

My choice of direction never makes a useless choice: instead of picking a direction and then rejecting backward moves, it avoids them from the start.

### comment:25 Changed 4 years ago by chapoton

Note that my branch also changes the behaviour for rank n=2, where we know that every quiver is of finite mutation type. So G2 should be handled more gracefully.

EDIT: Maybe my branch should go in another, independent ticket, as it does not claim to completely solve the issue raised here.

Last edited 4 years ago by chapoton (previous) (diff)
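
The rank n=2 claim is easy to verify directly: for a 2x2 exchange matrix every entry lies in the mutated row or column, so mutation at either vertex simply negates the matrix, the mutation class has at most two elements, and finiteness is automatic. A toy check (not Sage code; mutate2 is an illustrative name):

```python
def mutate2(B, k):
    # For a 2x2 skew-symmetrizable exchange matrix, every entry has
    # i == k or j == k, so the general mutation rule degenerates to
    # negation, independently of which vertex k is mutated.
    return [[-x for x in row] for row in B]

G2 = [[0, 1], [-3, 0]]  # G2-type exchange matrix (skew-symmetrizable)
```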

### comment:26 Changed 4 years ago by chapoton

I have moved my branch to ticket #23082, which does not claim to fully solve the present issue of the randomly failing doctest. Please review #23082; it should be easy!

### comment:27 Changed 4 years ago by chapoton

• Branch u/chapoton/22482 deleted
• Commit 0f0b3c22651611dd0aa0cbb4ee6075e724128a53 deleted
• Status changed from needs_review to needs_info