#24784 closed defect (fixed)
search_doc broken with jupyter notebook
Reported by: | jhpalmieri | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | sage-8.4 |
Component: | notebook | Keywords: | jupyter |
Cc: | | Merged in: | |
Authors: | Erik Bray | Reviewers: | John Palmieri |
Report Upstream: | N/A | Work issues: | |
Branch: | 603d5a0 | Commit: | |
Dependencies: | | Stopgaps: | |
Description
First reported at https://ask.sagemath.org/question/41170/search_doc-readable-format/. Running a command like `search_doc("curl", whole_word=True)` opens a window containing garbage: on my machine, a long string that starts
aHRtbC9lbi9yZWZlcmVuY2UvZ2VuaW5kZXgtYWxsLmh0bWw6MTM5MTA6ICAgICAgPGxpPjxhIGhyZWY
The issue is that `search_src` frequently includes the non-ASCII character "¶" in its results, and this breaks the Jupyter notebook pager. This character appears in lines without useful matches, so those matches can simply be discarded.
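A minimal sketch of the workaround suggested here, discarding result lines whose only "match" is the Sphinx permalink pilcrow (the function name and sample data below are hypothetical illustrations, not the actual patch):

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch: drop search-result lines containing the pilcrow
# character, since "Permalink to this definition: ¶" is never a useful hit.

def filter_pilcrow_lines(lines):
    """Discard lines that contain the non-ASCII pilcrow character."""
    return [line for line in lines if u'\N{PILCROW SIGN}' not in line]

results = [
    u'html/en/reference/foo.html:10: curl() (in module sage.manifolds)',
    u'html/en/reference/bar.html:42: Permalink to this definition: \N{PILCROW SIGN}',
]
print(filter_pilcrow_lines(results))
```

Only the first, genuinely useful line survives the filter.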
Change History (43)
comment:1 Changed 4 years ago by
comment:2 Changed 4 years ago by
- Branch set to u/jhpalmieri/search_doc_non_ascii
comment:3 Changed 4 years ago by
- Commit set to 5fef0b3239f63aa523773af96d876aa4bace2220
- Description modified (diff)
- Status changed from new to needs_review
New commits:
5fef0b3 | trac 24784: if the results of search_doc contain non-ascii characters, |
comment:4 follow-up: ↓ 8 Changed 4 years ago by
Thanks much, John, for investigating! I confirm that the search on "curl" above now displays properly rather than garbage.
However, it still breaks on e.g.:
search_doc("combinat")
Also, the rendering is still not as nice as in the old notebook where you get an itemized list of clickable links rather than the plain text:
    html/en/reference/genindex-all.html:14283: <li><a href="manifolds/sage/manifolds/operators.html#sage.manifolds.operators.curl">curl() (in module sage.manifolds.operators)</a>
    html/en/reference/genindex-all.html:14286: <li><a href="manifolds/sage/manifolds/differentiable/vectorfield.html#sage.manifolds.differentiable.vectorfield.VectorField.curl">(sage.manifolds.differentiable.vectorfield.VectorField method)</a>
This could be for a later ticket though (to be mentioned in #25837).
Some minor additional suggestions:
- use `"XXX" not in line` rather than `find`
- add a doctest
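For illustration, the difference between the two idioms suggested above (the sample strings are invented):

```python
# -*- coding: utf-8 -*-
line = u'Permalink to this definition: ¶'
plain = u'rings/ideal_monoid.py:12:def IdealMonoid(R):'

# str.find() returns an index, or -1 when absent; the -1 sentinel is
# easy to misread inside a condition:
assert line.find(u'¶') != -1
assert plain.find(u'¶') == -1

# The `in` operator states the same intent directly and reads better:
assert u'¶' in line
assert u'¶' not in plain
```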
comment:5 Changed 4 years ago by
- Status changed from needs_review to needs_work
comment:6 Changed 4 years ago by
When you say "IPython pager" I'm confused, because I thought we were talking about the notebook. Why would anything in the notebook not properly handle non-ASCII characters? And if so, I think that's something we should patch upstream rather than make our output worse. Maybe I'm missing something though.
comment:7 Changed 4 years ago by
Nevermind; I reproduced the issue and I see it now. Nevertheless, it's a bit surprising that this works for me fine at the command line and that it's broken in the notebook. What's weirder is I don't know why containing non-ASCII characters should have anything to do with it.
The "garbage" that is displayed in the notebook is the base64-encoding of the expected output. If I copy/paste the whole thing and wrap it in `print(base64.b64decode())`, the output from that displays just fine in the notebook, including the paragraph symbol.
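That round trip is easy to reproduce outside the notebook; the sample string below is invented for illustration, not the actual blob from the description:

```python
# -*- coding: utf-8 -*-
import base64

# A short stand-in for the "garbage" blob shown in the ticket description.
expected = u'html/en/reference/genindex-all.html:13910: Permalink ¶'

# Encoding the intended pager text as UTF-8 bytes, then base64, yields
# exactly the kind of opaque ASCII blob the notebook was displaying.
blob = base64.b64encode(expected.encode('utf-8'))

# Decoding the blob recovers the intended output, pilcrow included.
decoded = base64.b64decode(blob).decode('utf-8')
print(decoded)
```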
I think something a little deeper to do with IPython output formatting is going on here. I don't think we want a hack workaround; especially since there can be non-ASCII characters in other search results, and sometimes we'll want those too.
(Incidentally, the output of `search_doc` is pretty atrocious to begin with. Why does it output all this HTML instead of searching and returning results from the plain-text docs? What use is having search results from the auto-generated index, for that matter? Anyways, that's a separate issue; it just surprises me every time I see it...)
comment:8 in reply to: ↑ 4 Changed 4 years ago by
Replying to nthiery:
Also, the rendering is still not as nice as in the old notebook where you get an itemized list of clickable links rather than the plain text:
It seems that `search_doc` checks a global variable called `EMBEDDED_MODE`, which I guess is set somehow when running in the Sage notebook. If `EMBEDDED_MODE` is set, then it HTML-formats the results; otherwise it just calls `IPython.core.page.page`.
I think I might understand the problem now: although the output of `search_doc` contains unicode characters, it's just returning the raw UTF-8 byte strings. Your terminal knows how to display these, but Jupyter treats the result as a binary blob and displays it in the notebook as base64. We need to pass an actual unicode string instead (I'm guessing). Further, I need to look into whether it's possible to pass HTML to the pager in supported contexts; I wonder if it supports multi-part MIME data like other Jupyter output contexts...
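If that guess is right, the fix reduces to decoding the byte string before handing it to the pager. A rough sketch of that guard (the helper name is hypothetical; the actual branch checks against six's `text_type` for Python 2 compatibility):

```python
# -*- coding: utf-8 -*-
# Sketch of a defensive decode applied before paging search results:
# a UTF-8 byte string is turned into real text first, so the notebook
# displays it as text rather than as a base64-encoded binary blob.

def ensure_text(results):
    if isinstance(results, bytes):
        # 'replace' avoids raising on any stray undecodable bytes.
        return results.decode('utf-8', 'replace')
    return results

raw = u'Permalink to this definition: ¶'.encode('utf-8')
print(ensure_text(raw))
```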
comment:9 Changed 4 years ago by
I've almost got it working--the HTML-formatted results are currently only formatted in a way that makes sense for the legacy notebook, so it needs two formatting modes.
comment:10 Changed 4 years ago by
Yikes. A `<font>` tag.
comment:11 Changed 4 years ago by
This is turning out to be something of a rabbit hole, as the old-style HTML-formatted results are...not great. And as I pointed out above, broken in Jupyter, since the URLs to the results are only meaningful inside the legacy notebook.
Should I attach just an interim patch for the immediate issue, and push off updating the HTML results for another ticket, or just do it all at once?
comment:12 Changed 4 years ago by
- Branch changed from u/jhpalmieri/search_doc_non_ascii to u/embray/ticket-24784
- Commit changed from 5fef0b3239f63aa523773af96d876aa4bace2220 to 5c0d0a398cafd54072a05d534da52d4fbb9f1ab8
- Status changed from needs_work to needs_review
How's this look for a start? Aside from a few other minor code updates this doesn't change any functionality, but just improves support for unicode text in the existing behavior. I will create a follow-up ticket for improving the HTML-formatted search results and making them work in the Jupyter Notebook.
Thanks again, John, for pointing out in the first place that the issue had to do with unicode characters; that's what led me in the right direction.
New commits:
5c0d0a3 | Partially fixes #24784 by ensuring that text passed to the IPython pager that contains non-ASCII characters is an actual unicode text object |
comment:13 Changed 4 years ago by
I created a follow-up #26104 for improving the HTML search results.
comment:14 Changed 4 years ago by
- Milestone changed from sage-8.2 to sage-8.4
comment:15 Changed 4 years ago by
    sage: search_src('Steenrod')
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    ...
    TypeError: format_search_as_html() got an unexpected keyword argument 'whole_word'
Maybe it should be
src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index 2576cf9090..7b3c2630ed 100644
        if not interact:
            return format_text_results()

    -   html_results = format_search_as_html(title, results, [string] + extras,
    -                                        whole_word=whole_word)
    +   html_results = format_search_as_html(title, results, string)

        from sage.server.support import EMBEDDED_MODE
        if EMBEDDED_MODE:
Edit: no, looking at the old code, you're right that we need to use `[string] + extras`. But I think `whole_word` does not belong there.
comment:16 follow-up: ↓ 22 Changed 4 years ago by
I don't quite see the point behind `format_text_results`. This looks a little simpler to me:
src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index 2576cf9090..36c5928efd 100644
                if not re.search(extra, match_list):
                    match_list = None
            if match_list:
    -           append(filename[strip:].lstrip("/"))
    +           append(filename[strip:].lstrip("/") + "\n")
            else:
                match_list = [(lineno, line) for lineno, line in
                              enumerate(open(filename).read().splitlines(True))
    ...
                append(':'.join([filename[strip:].lstrip("/"),
                                 str(num + 1), line]))

    -   def format_text_results():
    -       if multiline:
    -           return '\n'.join(results)
    -       else:
    -           return ''.join(results)
    -
    +   text_results = ''.join(results)
        if not interact:
    -       return format_text_results()
    +       return text_results

        html_results = format_search_as_html(title, results, [string] + extras,
                                             whole_word=whole_word)
    ...
        else:
            # Pass through the IPython pager in a mime bundle
            from IPython.core.page import page
    -       text_results = format_text_results()
            if not isinstance(text_results, text_type):
                text_results = text_results.decode('utf-8', 'replace')
(N.B. This diff doesn't include the previous one for `format_search_as_html`.)
comment:17 Changed 4 years ago by
See also #26102 for a speed-up for `search_doc`: mainly, don't search the `_static` directories.
comment:18 follow-up: ↓ 23 Changed 4 years ago by
I personally find that the change `append = results.append` makes the code harder to read without much benefit, but I won't argue about it.
comment:19 Changed 4 years ago by
With the legacy Sage notebook, searches with `multiline=True` don't print anything with this branch. (To get it to work at all, I removed the `whole_word=whole_word` from the call to `format_results_as_html`, but I made no other changes.) It also prints the output in a slightly funny way: when I evaluate `search_src('Steenrod', multiline=True)`, the old header said `Search Source Code: "Steenrod"`. The new header says `Search Source Code: "Steenrod", "", "", "", "", ""` – it's printing the `extra` terms separately. I can upload screenshots if you want. This is not the highest priority since we're working on deprecating the legacy notebook, but I suppose we shouldn't actively make things worse for it.
comment:20 Changed 4 years ago by
- Status changed from needs_review to needs_work
comment:21 Changed 4 years ago by
Oops, I left in a change that I meant to omit (I had started down a path toward #26104 but decided to separate off those changes). That's where all of those reported problems came from.
comment:22 in reply to: ↑ 16 Changed 4 years ago by
Replying to jhpalmieri:
I don't quite see the point behind `format_text_results`. This looks a little simpler to me:
I think that's probably fine, yeah. I think when I first added that I didn't fully understand what was going on with `if multiline`. This code is still a bit confusing in general and should probably be rewritten, but that's for another time.
comment:23 in reply to: ↑ 18 Changed 4 years ago by
Replying to jhpalmieri:
I personally find that the change `append = results.append` makes the code harder to read without much benefit, but I won't argue about it.
This can be a useful optimization in a loop, but in retrospect it will only be called in this case where there is a search match, which is relatively infrequent, so I agree there's not much benefit in this case.
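The pattern under discussion, for reference: hoisting the bound method out of the loop saves one attribute lookup per iteration. A micro-benchmark sketch (exact timings vary by machine; both variants produce identical results):

```python
import timeit

def plain():
    # One results.append attribute lookup per iteration.
    results = []
    for i in range(1000):
        results.append(i)
    return results

def hoisted():
    # The bound method is looked up once, outside the loop.
    results = []
    append = results.append
    for i in range(1000):
        append(i)
    return results

# The win is real but small, and only matters when the loop body is
# tiny and runs millions of times.
print(timeit.timeit(plain, number=1000))
print(timeit.timeit(hoisted, number=1000))
```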
comment:24 Changed 4 years ago by
- Commit changed from 5c0d0a398cafd54072a05d534da52d4fbb9f1ab8 to f7c4a65a210ee254f1ee8724145954ef245aef8c
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
f7c4a65 | Partially fixes #24784 by ensuring that text passed to the IPython pager that contains non-ASCII characters is an actual unicode text object |
comment:25 Changed 4 years ago by
- Status changed from needs_work to needs_review
comment:26 follow-ups: ↓ 27 ↓ 30 Changed 4 years ago by
- Status changed from needs_review to needs_work
    sage: search_src('IdealMonoid')
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    ...
    NameError: global name 'format_text_results' is not defined
You forgot to delete a line:
src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index c77f76f2a5..64f4b8afaa 100644
        else:
            # Pass through the IPython pager in a mime bundle
            from IPython.core.page import page
    -       text_results = format_text_results()
            if not isinstance(text_results, text_type):
                text_results = text_results.decode('utf-8', 'replace')
I would encourage you to actually try a few simple cases before marking it as `needs_review` ;)
I wonder if we should put an example with `interact=True` in the doctests to catch this sort of thing; for example:
src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index c77f76f2a5..ff8021ffff 100644
            matrix/matrix0.pyx:924: Set the 2 x 2 submatrix of M, starting at row index and column
            matrix/matrix0.pyx:933: Set the 2 x 3 submatrix of M starting at row index and column

    +   One example with ``interact=True``::
    +
    +       sage: search_src('IdealMonoid')  # random
    +       misc/sagedoc.py:1145: sage: search_src('IdealMonoid')
    +       algebras/letterplace/free_algebra_letterplace.pyx:110:from sage.rings.noncommutative_ideals import IdealMonoid_nc
    +       algebras/letterplace/free_algebra_letterplace.pyx:556:        self.__monoid = IdealMonoid_nc(self)
    +       rings/ideal_monoid.py:12:def IdealMonoid(R):
    +       ...
        """
        return _search_src_or_doc('src', string, extra1=extra1, extra2=extra2,
                                  extra3=extra3, extra4=extra4, extra5=extra5,
Also, with `multiline=True`, it still doesn't work with the legacy notebook. The colons which used to be inserted seem to be necessary. You could keep the old code and defer any changes involving `html_results` to #26104.
comment:27 in reply to: ↑ 26 ; follow-up: ↓ 29 Changed 4 years ago by
Replying to jhpalmieri:
I would encourage you to actually try a few simple cases before marking it as `needs_review` ;)

I'm actually now going to have to review the doctests for this code, because I did run the doctests and somehow it passed. That obviously can't be good.
comment:28 follow-up: ↓ 31 Changed 4 years ago by
A related, but actually different, problem is that many of the tests in this module are (unsurprisingly) marked `# long time`, and I wasn't testing with `--long`.
comment:29 in reply to: ↑ 27 Changed 4 years ago by
Replying to embray:
Replying to jhpalmieri:
I would encourage you to actually try a few simple cases before marking it as `needs_review` ;)

I'm actually now going to have to review the doctests for this code, because I did run the doctests and somehow it passed. That obviously can't be good.
The issue is that (I think) the doctests all use `interact=False`, so they bypass parts of the code. Hence my suggestion for a new doctest.
comment:30 in reply to: ↑ 26 Changed 4 years ago by
Replying to jhpalmieri:
Also, with `multiline=True`, it still doesn't work with the legacy notebook. The colons which used to be inserted seem to be necessary. You could keep the old code and defer any changes involving `html_results` to #26104.
I'll fix that. Just another symptom of the bad interface for `format_search_as_html`.
comment:31 in reply to: ↑ 28 Changed 4 years ago by
Replying to embray:
A related, but actually different, problem is that many of the tests in this module are (unsurprisingly) marked `# long time`, and I wasn't testing with `--long`.
Another issue is that some doctests are marked `# optional - dochtml`, and they are not run by default, although I was led to believe in #25345 that they would be. To verify this, I added

    sage: 1+1==5   # optional - dochtml
    True

to the file, and doctests pass unless I explicitly use the `--optional dochtml` flag. I'll open a new ticket for that: #26110.
comment:32 Changed 4 years ago by
- Commit changed from f7c4a65a210ee254f1ee8724145954ef245aef8c to 603d5a025ace47b19186007da8d48a5142f2848a
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
c2a2aa0 | Partially fixes #24784 by ensuring that text passed to the IPython pager that contains non-ASCII characters is an actual unicode text object |
691a224 | cleaned up argument handling in this function |
603d5a0 | fixed a bug when passing a list of multiline=True results to format_search_as_html |
comment:33 follow-up: ↓ 36 Changed 4 years ago by
Should be fixed.
Given more time, I would like to redo the tests for this module. While maybe it's OK to have one or two tests that go over all of the sage sources, most of the # long time
tests in this module should be done away with. The tests can instead run the search over just a small subset of docs/source to speed them up.
comment:34 Changed 4 years ago by
- Status changed from needs_work to needs_review
comment:35 follow-up: ↓ 39 Changed 4 years ago by
Not a big deal, but in the code

    for L in results:
        filename = L.strip().split(':', 1)[0]
        if filename:
            files.add(filename)
as long as `L` is nonempty and doesn't start with a colon, `filename` will be nonempty as well. Do you expect `L` to start with a colon? I tried a few cases and didn't see that come up.
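The claim is easy to check in isolation (the helper name below is invented for illustration):

```python
# Behavior of the filename-extraction idiom discussed above.
def extract(L):
    # Take everything before the first colon of the stripped line.
    return L.strip().split(':', 1)[0]

# A normal search-result line yields the filename:
print(extract('misc/sagedoc.py:12: some match'))

# Only a leading colon or an all-whitespace line yields the empty string:
print(repr(extract(':12: pathological')))
print(repr(extract('   ')))
```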
comment:36 in reply to: ↑ 33 Changed 4 years ago by
Replying to embray:
Should be fixed.
Given more time, I would like to redo the tests for this module. While maybe it's OK to have one or two tests that go over all of the sage sources, most of the `# long time` tests in this module should be done away with. The tests can instead run the search over just a small subset of docs/source to speed them up.
This can wait for another ticket, but if you are thinking about changes like this, then that sounds like a good idea:
src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index 94fe79bd6c..539f7420fd 100644
        EXAMPLES::

            sage: from sage.misc.sagedoc import _search_src_or_doc
    -       sage: print(_search_src_or_doc('src', r'matrix\(',  # long time random
    +       sage: print(_search_src_or_doc('src', r'matrix\(',  # random
            ....:       'incidence_structures', 'self',
    -       ....:       '^combinat', interact=False))
    -       misc/sagedoc.py: sage: _search_src_or_doc('src', 'matrix(', 'incidence_structures', 'self', '^combinat', interact=False)
    +       ....:       '^combinat', module='sage.combinat', interact=False))
            combinat/designs/incidence_structures.py:        M1 = self.incidence_matrix()
            combinat/designs/incidence_structures.py:        A = self.incidence_matrix()
            combinat/designs/incidence_structures.py:        M = transpose(self.incidence_matrix())
    ...
            sage: from sage.misc.sagedoc import _search_src_or_doc
            sage: _search_src_or_doc('src', r'def _search_src_or_doc\(',
    -       ....:                    interact=True)  # long time
    +       ....:                    module='sage.misc', interact=True)
            misc/sagedoc.py:...: def _search_src_or_doc(what, string, extra1='', extra2='', extra3='',
        """
    ...
        The following produces an error because the string 'fetch(' is a
        malformed regular expression::

    -       sage: print(search_src(" fetch(", "def", interact=False))
    +       sage: print(search_src(" fetch(", "def", module="sage.matrix", interact=False))
            Traceback (most recent call last):
            ...
            error: unbalanced parenthesis

        To fix this, *escape* the parenthesis with a backslash::

    -       sage: print(search_src(" fetch\(", "def", interact=False))  # random # long time
    +       sage: print(search_src(" fetch\(", "def", module="sage.matrix", interact=False))  # random
            matrix/matrix0.pyx:    cdef fetch(self, key):
            matrix/matrix0.pxd:    cdef fetch(self, key)

    -       sage: print(search_src(" fetch\(", "def", "pyx", interact=False))  # random # long time
    +       sage: print(search_src(" fetch\(", "def", "pyx", module="sage.matrix", interact=False))  # random
            matrix/matrix0.pyx:    cdef fetch(self, key):

        As noted above, the search is case-insensitive, but you can make it
comment:37 Changed 4 years ago by
I get one doctest failure when run with `sage -t --long --optional=sage,dochtml`:

    **********************************************************************
    File "src/sage/misc/sagedoc.py", line 1179, in sage.misc.sagedoc.?
    Failed example:
        len(search_doc('tree', whole_word=True, interact=False).splitlines()) < 2000  # optional - dochtml, long time
    Expected:
        True
    Got:
        False
    **********************************************************************
Is this fixed by #26017? If so, should that ticket be a dependency of this one?
comment:38 Changed 4 years ago by
That's fixed by #26017. Also, I don't get that failure, but as I wrote there, it can also happen if you just have "junk" in your doc build, and it goes away after a doc-clean. I don't think it's related to this ticket.
comment:39 in reply to: ↑ 35 Changed 4 years ago by
Replying to jhpalmieri:
Not a big deal, but in the code

    for L in results:
        filename = L.strip().split(':', 1)[0]
        if filename:
            files.add(filename)

as long as `L` is nonempty and doesn't start with a colon, `filename` will be nonempty as well. Do you expect `L` to start with a colon? I tried a few cases and didn't see that come up.
That's a good point. The original code also would have broken if a filename started with a colon. This should never be the case though so I'm not worried about it.
comment:40 Changed 4 years ago by
- Reviewers set to John Palmieri
- Status changed from needs_review to positive_review
Okay, let's close this one. Cleaning up the doctests can happen on another ticket.
comment:41 Changed 4 years ago by
Agreed--thank you for filling the gaps in my shoddy self-review.
comment:42 Changed 4 years ago by
- Branch changed from u/embray/ticket-24784 to 603d5a025ace47b19186007da8d48a5142f2848a
- Resolution set to fixed
- Status changed from positive_review to closed
comment:43 Changed 3 years ago by
- Commit 603d5a025ace47b19186007da8d48a5142f2848a deleted
Note: similar issue with "table" at #27656.
The problem is that the output from `search_doc` regularly contains non-ASCII characters, like "¶", and the IPython pager does not handle these well. We can catch the error and then remove the offending character, or we can try to skip those search matches altogether ("Permalink to this definition: ¶" is probably not a useful search result).