Sage: Ticket #8547: implement hidden markov models in Cython from scratch (so can remove the GHMM standard package from Sage)
https://trac.sagemath.org/ticket/8547
<p>
NOTE: After applying this patch(es), the following standard spkg can be completely deleted:
</p>
<blockquote>
<p>
ghmm-20080813.p0.spkg
</p>
</blockquote>
<p>
This patch provides a complete reimplementation of Hidden Markov Models from scratch in Sage. It is now doctested and the tests should pass on all standard Sage platforms. Note that that HMM code in sage-4.3.4 and earlier is *NOT* tested at all, can easily segfault in all kinds of cases, leaks huge amounts of memory, etc. The new code shouldn't have those defects. Plus we get to delete a standard spkg. Moreover, this new code is much easier to modify and experiment with. Performance is generally similar, sometimes better, and sometimes worse, than the old code.
</p>
en-usSagehttps://trac.sagemath.org/chrome/site/logo_sagemath_trac.png
https://trac.sagemath.org/ticket/8547
Trac 1.1.6wasSat, 20 Mar 2010 12:45:21 GMTdescription changed
https://trac.sagemath.org/ticket/8547#comment:1
https://trac.sagemath.org/ticket/8547#comment:1
<ul>
<li><strong>description</strong>
modified (<a href="/ticket/8547?action=diff&version=1">diff</a>)
</li>
</ul>
TicketwasSat, 20 Mar 2010 12:49:05 GMTattachment set
https://trac.sagemath.org/ticket/8547
https://trac.sagemath.org/ticket/8547
<ul>
<li><strong>attachment</strong>
set to <em>trac_8547.patch</em>
</li>
</ul>
<p>
apply this. Then look at devel/sage/sage/stats/intlist.pyx, devel/sage/sage/stats/hmm (which is entirely new code), and look at the few minor bug fixes to finance/time_series.pyx
</p>
TicketwasSat, 20 Mar 2010 12:49:50 GMTstatus changed
https://trac.sagemath.org/ticket/8547#comment:2
https://trac.sagemath.org/ticket/8547#comment:2
<ul>
<li><strong>status</strong>
changed from <em>new</em> to <em>needs_review</em>
</li>
</ul>
TicketwasSat, 20 Mar 2010 12:53:10 GMT
https://trac.sagemath.org/ticket/8547#comment:3
https://trac.sagemath.org/ticket/8547#comment:3
<p>
This is now ready for review. Some notes:
</p>
<ol><li>Don't bother looking at the diff, except for time_series.pyx and module_list.py. Just look at at the files in stats/hmm, which are all new, and stats/intlist.pyx.
</li></ol><ol start="2"><li>There are a number of very small API changes (for the better). This technically violates our <a class="missing wiki">DeprecationPolicy?</a>. However, frankly, the current code was -- upon inspection -- so bad, that there is no way anybody was actually using it for anything serious. (As was evidenced by the response on sage-devel.) And the API is only a tiny bit changed now.
</li></ol><ol start="3"><li>I do have a long list of possible followup projects. However, I wanted to make a first solid release that at least has enough to replace the existing HMM functionality in Sage, before adding lots of new stuff.
</li></ol>
TicketwasSat, 20 Mar 2010 12:55:43 GMTcc set
https://trac.sagemath.org/ticket/8547#comment:4
https://trac.sagemath.org/ticket/8547#comment:4
<ul>
<li><strong>cc</strong>
<em>jason</em> <em>mhamptom</em> added
</li>
</ul>
TicketmhamptonSun, 21 Mar 2010 17:04:23 GMTowner, cc changed
https://trac.sagemath.org/ticket/8547#comment:5
https://trac.sagemath.org/ticket/8547#comment:5
<ul>
<li><strong>owner</strong>
changed from <em>amhou</em> to <em>mhampton</em>
</li>
<li><strong>cc</strong>
<em>mhampton</em> added; <em>mhamptom</em> removed
</li>
</ul>
TicketmhamptonSun, 21 Mar 2010 17:05:04 GMTowner changed
https://trac.sagemath.org/ticket/8547#comment:6
https://trac.sagemath.org/ticket/8547#comment:6
<ul>
<li><strong>owner</strong>
changed from <em>mhampton</em> to <em>amhou</em>
</li>
</ul>
TicketboothbySun, 28 Mar 2010 20:25:49 GMT
https://trac.sagemath.org/ticket/8547#comment:7
https://trac.sagemath.org/ticket/8547#comment:7
<p>
This patch touches matrix2.pyx which results in failures of a number of doctests. Example:
</p>
<pre class="wiki">sage -t devel/sage/sage/matrix/matrix2.pyx
**********************************************************************
File "/mnt/usb1/scratch/boothby/sage-4.3.4/devel/sage-main/sage/matrix/matrix2.pyx", line 3142:
sage: A.restrict(W2, check=True)
Expected:
Traceback (most recent call last):
...
ArithmeticError: subspace is not invariant under matrix
Got:
Traceback (most recent call last):
File "/mnt/usb1/scratch/boothby/sage-4.3.4/local/bin/ncadoctest.py", line 1231, in run_one_test
self.run_one_example(test, example, filename, compileflags)
File "/mnt/usb1/scratch/boothby/sage-4.3.4/local/bin/sagedoctest.py", line 38, in run_one_example
OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
File "/mnt/usb1/scratch/boothby/sage-4.3.4/local/bin/ncadoctest.py", line 1172, in run_one_example
compileflags, 1) in test.globs
File "<doctest __main__.example_48[10]>", line 1, in <module>
A.restrict(W2, check=True)###line 3142:
sage: A.restrict(W2, check=True)
File "matrix2.pyx", line 3167, in sage.matrix.matrix2.Matrix.restrict (sage/matrix/matrix2.c:19470)
raise ArithmeticError, "subspace is not invariant under matrix (%s)"%msg
ArithmeticError: subspace is not invariant under matrix (vector is not in free module)
**********************************************************************
</pre><p>
That should be easy to fix.
</p>
TicketwasSun, 28 Mar 2010 21:47:19 GMTattachment set
https://trac.sagemath.org/ticket/8547
https://trac.sagemath.org/ticket/8547
<ul>
<li><strong>attachment</strong>
set to <em>trac-8547-take2.patch</em>
</li>
</ul>
<p>
this completely replaces the previous patch
</p>
TicketboothbyTue, 30 Mar 2010 23:47:50 GMT
https://trac.sagemath.org/ticket/8547#comment:8
https://trac.sagemath.org/ticket/8547#comment:8
<p>
Ok, doctests run clean now. I've tested the code out and read it over, and it works for me. I'm a little cautious giving this a positive review because I know very little about this area of math. However, given that it's replacing something that barely worked and used memory like toilet paper, I'm willing to throw caution to the wind if mhampton or jason are willing to sign off on the idea.
</p>
TicketmhamptonWed, 31 Mar 2010 03:06:55 GMT
https://trac.sagemath.org/ticket/8547#comment:9
https://trac.sagemath.org/ticket/8547#comment:9
<p>
I've been wanting to review this but I just haven't had the time. I might this weekend, but its unclear.
</p>
TicketjasonWed, 31 Mar 2010 03:52:17 GMT
https://trac.sagemath.org/ticket/8547#comment:10
https://trac.sagemath.org/ticket/8547#comment:10
<p>
Some notes from reading through the code looking at stylistic issues. I really wish we had line-by-line commenting like rietveld for this sort of thing.
</p>
<ol><li>Lots of cdef'd functions do not have doctests. I thought the policy from discussions on sage-devel was that *every* function (including cdef functions) should have doctests (but, of course, these doctests would have to either indirectly test the cdef'd function or would have to test a wrapper of the cdef function). Some functions (_baum_welch_gamma, for example) don't even have docstrings.
</li><li>I'm curious why <a class="missing wiki">IntList?</a>.__getitem__ does not use the sage.misc.misc_c.normalize_index function to deal with slices. How much of a performance penalty is there? Can this "python semantics" part be extracted out to be a more general cdef'd counterpart to normalize_index, so that matrices, vectors, and other types can use it better to implement __getitem__? Also, as a future enhancement, it doesn't seem that much harder for __setitem__ to also support slices. At the very least, the doctests for normalize_index probably ought to be run on the __getitem__ function, as they exercise a number of corner cases for the python semantics.
</li><li><a class="missing wiki">IntList?</a>.sum() does not have a doctest for the overflow case
</li></ol>
TicketwasSat, 03 Apr 2010 08:49:53 GMT
https://trac.sagemath.org/ticket/8547#comment:11
https://trac.sagemath.org/ticket/8547#comment:11
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/8547#comment:10" title="Comment 10">jason</a>:
</p>
<blockquote class="citation">
<p>
Some notes from reading through the code looking at stylistic issues. I really wish we had
</p>
<blockquote>
<p>
line-by-line commenting like rietveld for this sort of thing.
</p>
</blockquote>
<ol><li>Lots of cdef'd functions do not have doctests. I thought the policy from discussions on sage-devel
</li></ol><p>
was that *every* function (including cdef functions) should have doctests (but, of course, these doctests
</p>
<blockquote>
<p>
would have to either indirectly test the cdef'd function or would have to test a wrapper of the cdef function).
</p>
</blockquote>
</blockquote>
<p>
We definitely do not require doctests for cdef'd methods. The requirement is that "sage -coverage" returns a score of 100%. This is clearly stated in the patch reviewers guide. I did seriously consider them in this case, but literally every single one of these doctests would just be a literal *exact* copy of a doctest from elsewhere (!) but with # indirect doctest next to it. There's just no value in that. I could also make the methods cpdef, but that does incur a performance penalty -- in this case, it would be huge (which is totally unacceptable).
</p>
<blockquote class="citation">
<p>
Some functions (_baum_welch_gamma, for example) don't even have docstrings.
</p>
</blockquote>
<p>
I do think that all cdef'd methods should have docstrings, and will add docstrings to any that don't have them (in a part 2 patch).
</p>
<blockquote class="citation">
<ol><li>I'm curious why <a class="missing wiki">IntList?</a>.__getitem__ does not use the sage.misc.misc_c.normalize_index function to deal with slices.
</li></ol><p>
How much of a performance penalty is there?
</p>
</blockquote>
<p>
I'll switch to using normalize_index, since I'm not concerned with performance for list indexing, since everywhere that performance matters, I'm directly accessing the memory buffer (which is the only way to really compete with tightly coded C libraries). So my updated patch will change to use normalize_index, unless there is a serious problem with doing so.
</p>
<blockquote class="citation">
<blockquote>
<p>
Can this "python semantics" part be extracted out to be a more general
</p>
</blockquote>
<p>
cdef'd counterpart to normalize_index, so that matrices, vectors, and other types can use it better to implement
__getitem__?
</p>
</blockquote>
<p>
Perhaps. I'm certainly not doing so for this patch.
</p>
<blockquote class="citation">
<p>
Also, as a future enhancement, it doesn't seem that much harder for __setitem__ to also
support slices. At the very least, the doctests for normalize_index probably ought to be
run on the __getitem__ function, as they exercise a number of corner cases for the python semantics.
</p>
</blockquote>
<p>
Fortunately this won't be necessary since I'm switching to it.
</p>
<blockquote class="citation">
<ol><li><a class="missing wiki">IntList?</a>.sum() does not have a doctest for the overflow case
</li></ol></blockquote>
<p>
I can add that. I like testing all corner cases.
</p>
<p>
Thanks for your helpful review of style...
</p>
<ul><li>William
</li></ul>
TicketmhamptonSat, 03 Apr 2010 14:37:05 GMT
https://trac.sagemath.org/ticket/8547#comment:12
https://trac.sagemath.org/ticket/8547#comment:12
<p>
I got one numerical noise doctest failure on OS X 10.5.8:
</p>
<p>
sage -t devel/sage-t2/sage/stats/hmm/util.pyx
<strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong>
File "/Users/mh/sagestuff/sage-4.3.5/devel/sage-t2/sage/stats/hmm/util.pyx", line 55:
</strong></p>
<blockquote>
<p>
sage: T.sum()
</p>
</blockquote>
<p>
Expected:
</p>
<blockquote>
<p>
1.0
</p>
</blockquote>
<p>
Got:
</p>
<blockquote>
<p>
0.99999999999999978
</p>
</blockquote>
<p>
I used this code in class yesterday (see <a class="ext-link" href="http://wiki.sagemath.org/interact/stats#HiddenMarkovModel.3ATheOccasionallyDishonestCasino"><span class="icon"></span>http://wiki.sagemath.org/interact/stats#HiddenMarkovModel.3ATheOccasionallyDishonestCasino</a>) and for that limited purpose it seemed fine.
</p>
<p>
It would be nice to have the forward and backward algorithms more exposed, but I don't think that needs to go into this ticket unless you feel like it.
</p>
<p>
Apart from the noise issue I think can give this a positive review.
</p>
TicketwasThu, 08 Apr 2010 21:47:38 GMT
https://trac.sagemath.org/ticket/8547#comment:13
https://trac.sagemath.org/ticket/8547#comment:13
<ul><li>I've attached a part 2 patch. I made sure all cdef's methods have docstrings and also that doctests are properly sphinx formated (some weren't since they were copied from the old code).
</li></ul><ul><li>Jason said above that <a class="missing wiki">IntList?</a>.sum doesn't have a doctest for the overflow case... but it does, so I don't know what he meant.
</li></ul><p>
</p>
<ul><li>I read the source code for <code>sage.misc.misc_c.normalize_index</code> and cannot bring myself to use that in this situation. That function actually returns a Python *list* of Python ints for every single index into the list that is being sliced! That would easily lead to factor of 50-100 slowdowns on realistic operations:
<pre class="wiki">sage: timeit('z = sage.misc.misc_c.normalize_index(slice(1,10^5),10^5)') # slow because constructions a big python list
125 loops, best of 3: 2.17 ms per loop
sage: a = stats.IntList([1..10^5])
sage: timeit('a[1:10^5]') # slice is just a memcpy
625 loops, best of 3: 48.4 µs per loop
sage: 2.17/0.0484
44.8347107438017
</pre></li></ul><p>
Here's an example with a step:
</p>
<pre class="wiki">sage: a = stats.IntList([1..10^5])
sage: timeit('a[1:10^5:2]')
625 loops, best of 3: 92.2 µs per loop
sage: timeit('z = sage.misc.misc_c.normalize_index(slice(1,10^5,2),10^5)')
625 loops, best of 3: 772 µs per loop
</pre><p>
and that 772 microseconds is *before* we do the actual iteration through the returned list of python ints, convert them to c ints, copy stuff around in memory, etc.
</p>
<p>
This stats code I'm writing is really meant to be industrial strength -- the sort of code maybe somebody would use in "realtime" processing of large datastreams. I don't want slow functions anywhere in there.
</p>
<blockquote>
<p>
-- William
</p>
</blockquote>
TicketwasThu, 08 Apr 2010 21:55:51 GMTattachment set
https://trac.sagemath.org/ticket/8547
https://trac.sagemath.org/ticket/8547
<ul>
<li><strong>attachment</strong>
set to <em>trac_8547-take2-part2.patch</em>
</li>
</ul>
<p>
part 2; apply this and the previous patch
</p>
TicketjasonThu, 08 Apr 2010 22:29:13 GMT
https://trac.sagemath.org/ticket/8547#comment:14
https://trac.sagemath.org/ticket/8547#comment:14
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/8547#comment:13" title="Comment 13">was</a>:
</p>
<blockquote class="citation">
<ul><li>Jason said above that <a class="missing wiki">IntList?</a>.sum doesn't have a doctest for the overflow case... but it does, so I don't know what he meant.
</li></ul></blockquote>
<p>
I meant that the doctest looks like this:
</p>
<pre class="wiki">Note that there can be overflow, since the entries are C ints::
sage: a = stats.IntList([2^30,2^30]); a
[1073741824, 1073741824]
</pre><p>
That's it. There's no test there; you're just creating a list, not summing anything.
</p>
<blockquote class="citation">
<p>
</p>
<ul><li>I read the source code for <code>sage.misc.misc_c.normalize_index</code> and cannot bring myself to use that in this situation. That function actually returns a Python *list* of Python ints for every single index into the list that is being sliced! That would easily lead to factor of 50-100 slowdowns on realistic operations:
</li></ul></blockquote>
<blockquote class="citation">
<p>
This stats code I'm writing is really meant to be industrial strength -- the sort of code maybe somebody would use in "realtime" processing of large datastreams. I don't want slow functions anywhere in there.
</p>
</blockquote>
<p>
I agree.
</p>
TicketwasSat, 10 Apr 2010 19:18:23 GMT
https://trac.sagemath.org/ticket/8547#comment:15
https://trac.sagemath.org/ticket/8547#comment:15
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/8547#comment:14" title="Comment 14">jason</a>:
</p>
<blockquote class="citation">
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/8547#comment:13" title="Comment 13">was</a>:
</p>
<blockquote class="citation">
<ul><li>Jason said above that <a class="missing wiki">IntList?</a>.sum doesn't have a doctest for the overflow case... but it does, so I don't know what he meant.
</li></ul></blockquote>
<p>
I meant that the doctest looks like this:
</p>
<pre class="wiki">Note that there can be overflow, since the entries are C ints::
sage: a = stats.IntList([2^30,2^30]); a
[1073741824, 1073741824]
</pre><p>
That's it. There's no test there; you're just creating a list, not summing anything.
</p>
</blockquote>
<p>
Thanks for the clarification -- I was being dense; I've put up a part 3 that addresses this.
</p>
<p>
So, can I get a positive review now?
</p>
TicketwasSat, 10 Apr 2010 19:19:10 GMTattachment set
https://trac.sagemath.org/ticket/8547
https://trac.sagemath.org/ticket/8547
<ul>
<li><strong>attachment</strong>
set to <em>trac_8547-take2-part3.patch</em>
</li>
</ul>
TicketmhamptonSun, 11 Apr 2010 21:32:11 GMTowner changed
https://trac.sagemath.org/ticket/8547#comment:16
https://trac.sagemath.org/ticket/8547#comment:16
<ul>
<li><strong>owner</strong>
changed from <em>amhou</em> to <em>mhampton</em>
</li>
</ul>
<p>
OK, positive review. I think this and basic_stats should be added to the reference manual, but that can be another ticket I guess.
</p>
TicketmhamptonSun, 11 Apr 2010 21:32:22 GMTstatus changed
https://trac.sagemath.org/ticket/8547#comment:17
https://trac.sagemath.org/ticket/8547#comment:17
<ul>
<li><strong>status</strong>
changed from <em>needs_review</em> to <em>positive_review</em>
</li>
</ul>
TicketjhpalmieriFri, 16 Apr 2010 18:43:45 GMTstatus changed; reviewer, resolution, merged, author set
https://trac.sagemath.org/ticket/8547#comment:18
https://trac.sagemath.org/ticket/8547#comment:18
<ul>
<li><strong>status</strong>
changed from <em>positive_review</em> to <em>closed</em>
</li>
<li><strong>reviewer</strong>
set to <em>Marshall Hampton, Tom Boothby, Jason Grout</em>
</li>
<li><strong>resolution</strong>
set to <em>fixed</em>
</li>
<li><strong>merged</strong>
set to <em>sage-4.4.alpha0</em>
</li>
<li><strong>author</strong>
set to <em>William Stein</em>
</li>
</ul>
<p>
Merged in 4.4.alpha0:
</p>
<ul><li>trac-8547-take2.patch
</li><li>trac_8547-take2-part2.patch
</li><li>trac_8547-take2-part3.patch
</li></ul>
Ticket