Opened 12 years ago
Closed 12 years ago
#8547 closed enhancement (fixed)
implement hidden markov models in Cython from scratch (so can remove the GHMM standard package from Sage)
Reported by: | was | Owned by: | mhampton |
---|---|---|---|
Priority: | major | Milestone: | sage-4.4 |
Component: | statistics | Keywords: | |
Cc: | jason, mhampton | Merged in: | sage-4.4.alpha0 |
Authors: | William Stein | Reviewers: | Marshall Hampton, Tom Boothby, Jason Grout |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Description (last modified by )
NOTE: After applying this patch(es), the following standard spkg can be completely deleted:
ghmm-20080813.p0.spkg
This patch provides a complete reimplementation of Hidden Markov Models from scratch in Sage. It is now doctested and the tests should pass on all standard Sage platforms. Note that that HMM code in sage-4.3.4 and earlier is *NOT* tested at all, can easily segfault in all kinds of cases, leaks huge amounts of memory, etc. The new code shouldn't have those defects. Plus we get to delete a standard spkg. Moreover, this new code is much easier to modify and experiment with. Performance is generally similar, sometimes better, and sometimes worse, than the old code.
Attachments (4)
Change History (22)
comment:1 Changed 12 years ago by
- Description modified (diff)
Changed 12 years ago by
comment:2 Changed 12 years ago by
- Status changed from new to needs_review
comment:3 Changed 12 years ago by
This is now ready for review. Some notes:
- Don't bother looking at the diff, except for time_series.pyx and module_list.py. Just look at at the files in stats/hmm, which are all new, and stats/intlist.pyx.
- There are a number of very small API changes (for the better). This technically violates our DeprecationPolicy?. However, frankly, the current code was -- upon inspection -- so bad, that there is no way anybody was actually using it for anything serious. (As was evidenced by the response on sage-devel.) And the API is only a tiny bit changed now.
- I do have a long list of possible followup projects. However, I wanted to make a first solid release that at least has enough to replace the existing HMM functionality in Sage, before adding lots of new stuff.
comment:4 Changed 12 years ago by
- Cc jason mhamptom added
comment:5 Changed 12 years ago by
- Cc mhampton added; mhamptom removed
- Owner changed from amhou to mhampton
comment:6 Changed 12 years ago by
- Owner changed from mhampton to amhou
comment:7 Changed 12 years ago by
This patch touches matrix2.pyx which results in failures of a number of doctests. Example:
sage -t devel/sage/sage/matrix/matrix2.pyx ********************************************************************** File "/mnt/usb1/scratch/boothby/sage-4.3.4/devel/sage-main/sage/matrix/matrix2.pyx", line 3142: sage: A.restrict(W2, check=True) Expected: Traceback (most recent call last): ... ArithmeticError: subspace is not invariant under matrix Got: Traceback (most recent call last): File "/mnt/usb1/scratch/boothby/sage-4.3.4/local/bin/ncadoctest.py", line 1231, in run_one_test self.run_one_example(test, example, filename, compileflags) File "/mnt/usb1/scratch/boothby/sage-4.3.4/local/bin/sagedoctest.py", line 38, in run_one_example OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags) File "/mnt/usb1/scratch/boothby/sage-4.3.4/local/bin/ncadoctest.py", line 1172, in run_one_example compileflags, 1) in test.globs File "<doctest __main__.example_48[10]>", line 1, in <module> A.restrict(W2, check=True)###line 3142: sage: A.restrict(W2, check=True) File "matrix2.pyx", line 3167, in sage.matrix.matrix2.Matrix.restrict (sage/matrix/matrix2.c:19470) raise ArithmeticError, "subspace is not invariant under matrix (%s)"%msg ArithmeticError: subspace is not invariant under matrix (vector is not in free module) **********************************************************************
That should be easy to fix.
comment:8 Changed 12 years ago by
Ok, doctests run clean now. I've tested the code out and read it over, and it works for me. I'm a little cautious giving this a positive review because I know very little about this area of math. However, given that it's replacing something that barely worked and used memory like toilet paper, I'm willing to throw caution to the wind if mhampton or jason are willing to sign off on the idea.
comment:9 Changed 12 years ago by
I've been wanting to review this but I just haven't had the time. I might this weekend, but its unclear.
comment:10 follow-up: ↓ 11 Changed 12 years ago by
Some notes from reading through the code looking at stylistic issues. I really wish we had line-by-line commenting like rietveld for this sort of thing.
- Lots of cdef'd functions do not have doctests. I thought the policy from discussions on sage-devel was that *every* function (including cdef functions) should have doctests (but, of course, these doctests would have to either indirectly test the cdef'd function or would have to test a wrapper of the cdef function). Some functions (_baum_welch_gamma, for example) don't even have docstrings.
- I'm curious why IntList?.__getitem__ does not use the sage.misc.misc_c.normalize_index function to deal with slices. How much of a performance penalty is there? Can this "python semantics" part be extracted out to be a more general cdef'd counterpart to normalize_index, so that matrices, vectors, and other types can use it better to implement __getitem__? Also, as a future enhancement, it doesn't seem that much harder for __setitem__ to also support slices. At the very least, the doctests for normalize_index probably ought to be run on the __getitem__ function, as they exercise a number of corner cases for the python semantics.
- IntList?.sum() does not have a doctest for the overflow case
comment:11 in reply to: ↑ 10 Changed 12 years ago by
Replying to jason:
Some notes from reading through the code looking at stylistic issues. I really wish we had
line-by-line commenting like rietveld for this sort of thing.
- Lots of cdef'd functions do not have doctests. I thought the policy from discussions on sage-devel
was that *every* function (including cdef functions) should have doctests (but, of course, these doctests
would have to either indirectly test the cdef'd function or would have to test a wrapper of the cdef function).
We definitely do not require doctests for cdef'd methods. The requirement is that "sage -coverage" returns a score of 100%. This is clearly stated in the patch reviewers guide. I did seriously consider them in this case, but literally every single one of these doctests would just be a literal *exact* copy of a doctest from elsewhere (!) but with # indirect doctest next to it. There's just no value in that. I could also make the methods cpdef, but that does incur a performance penalty -- in this case, it would be huge (which is totally unacceptable).
Some functions (_baum_welch_gamma, for example) don't even have docstrings.
I do think that all cdef'd methods should have docstrings, and will add docstrings to any that don't have them (in a part 2 patch).
- I'm curious why IntList?.__getitem__ does not use the sage.misc.misc_c.normalize_index function to deal with slices.
How much of a performance penalty is there?
I'll switch to using normalize_index, since I'm not concerned with performance for list indexing, since everywhere that performance matters, I'm directly accessing the memory buffer (which is the only way to really compete with tightly coded C libraries). So my updated patch will change to use normalize_index, unless there is a serious problem with doing so.
Can this "python semantics" part be extracted out to be a more general
cdef'd counterpart to normalize_index, so that matrices, vectors, and other types can use it better to implement __getitem__?
Perhaps. I'm certainly not doing so for this patch.
Also, as a future enhancement, it doesn't seem that much harder for __setitem__ to also support slices. At the very least, the doctests for normalize_index probably ought to be run on the __getitem__ function, as they exercise a number of corner cases for the python semantics.
Fortunately this won't be necessary since I'm switching to it.
- IntList?.sum() does not have a doctest for the overflow case
I can add that. I like testing all corner cases.
Thanks for your helpful review of style...
- William
comment:12 Changed 12 years ago by
I got one numerical noise doctest failure on OS X 10.5.8:
sage -t devel/sage-t2/sage/stats/hmm/util.pyx File "/Users/mh/sagestuff/sage-4.3.5/devel/sage-t2/sage/stats/hmm/util.pyx", line 55:
sage: T.sum()
Expected:
1.0
Got:
0.99999999999999978
I used this code in class yesterday (see http://wiki.sagemath.org/interact/stats#HiddenMarkovModel.3ATheOccasionallyDishonestCasino) and for that limited purpose it seemed fine.
It would be nice to have the forward and backward algorithms more exposed, but I don't think that needs to go into this ticket unless you feel like it.
Apart from the noise issue I think can give this a positive review.
comment:13 follow-up: ↓ 14 Changed 12 years ago by
- I've attached a part 2 patch. I made sure all cdef's methods have docstrings and also that doctests are properly sphinx formated (some weren't since they were copied from the old code).
- Jason said above that IntList?.sum doesn't have a doctest for the overflow case... but it does, so I don't know what he meant.
- I read the source code for
sage.misc.misc_c.normalize_index
and cannot bring myself to use that in this situation. That function actually returns a Python *list* of Python ints for every single index into the list that is being sliced! That would easily lead to factor of 50-100 slowdowns on realistic operations:sage: timeit('z = sage.misc.misc_c.normalize_index(slice(1,10^5),10^5)') # slow because constructions a big python list 125 loops, best of 3: 2.17 ms per loop sage: a = stats.IntList([1..10^5]) sage: timeit('a[1:10^5]') # slice is just a memcpy 625 loops, best of 3: 48.4 µs per loop sage: 2.17/0.0484 44.8347107438017
Here's an example with a step:
sage: a = stats.IntList([1..10^5]) sage: timeit('a[1:10^5:2]') 625 loops, best of 3: 92.2 µs per loop sage: timeit('z = sage.misc.misc_c.normalize_index(slice(1,10^5,2),10^5)') 625 loops, best of 3: 772 µs per loop
and that 772 microseconds is *before* we do the actual iteration through the returned list of python ints, convert them to c ints, copy stuff around in memory, etc.
This stats code I'm writing is really meant to be industrial strength -- the sort of code maybe somebody would use in "realtime" processing of large datastreams. I don't want slow functions anywhere in there.
-- William
comment:14 in reply to: ↑ 13 ; follow-up: ↓ 15 Changed 12 years ago by
Replying to was:
- Jason said above that IntList?.sum doesn't have a doctest for the overflow case... but it does, so I don't know what he meant.
I meant that the doctest looks like this:
Note that there can be overflow, since the entries are C ints:: sage: a = stats.IntList([2^30,2^30]); a [1073741824, 1073741824]
That's it. There's no test there; you're just creating a list, not summing anything.
- I read the source code for
sage.misc.misc_c.normalize_index
and cannot bring myself to use that in this situation. That function actually returns a Python *list* of Python ints for every single index into the list that is being sliced! That would easily lead to factor of 50-100 slowdowns on realistic operations:
This stats code I'm writing is really meant to be industrial strength -- the sort of code maybe somebody would use in "realtime" processing of large datastreams. I don't want slow functions anywhere in there.
I agree.
comment:15 in reply to: ↑ 14 Changed 12 years ago by
Replying to jason:
Replying to was:
- Jason said above that IntList?.sum doesn't have a doctest for the overflow case... but it does, so I don't know what he meant.
I meant that the doctest looks like this:
Note that there can be overflow, since the entries are C ints:: sage: a = stats.IntList([2^30,2^30]); a [1073741824, 1073741824]That's it. There's no test there; you're just creating a list, not summing anything.
Thanks for the clarification -- I was being dense; I've put up a part 3 that addresses this.
So, can I get a positive review now?
Changed 12 years ago by
comment:16 Changed 12 years ago by
- Owner changed from amhou to mhampton
OK, positive review. I think this and basic_stats should be added to the reference manual, but that can be another ticket I guess.
comment:17 Changed 12 years ago by
- Status changed from needs_review to positive_review
comment:18 Changed 12 years ago by
- Merged in set to sage-4.4.alpha0
- Resolution set to fixed
- Reviewers set to Marshall Hampton, Tom Boothby, Jason Grout
- Status changed from positive_review to closed
Merged in 4.4.alpha0:
- trac-8547-take2.patch
- trac_8547-take2-part2.patch
- trac_8547-take2-part3.patch
apply this. Then look at devel/sage/sage/stats/intlist.pyx, devel/sage/sage/stats/hmm (which is entirely new code), and look at the few minor bug fixes to finance/time_series.pyx