Opened 20 months ago

Closed 14 months ago

Last modified 6 months ago

#24784 closed defect (fixed)

search_doc broken with jupyter notebook

Reported by: jhpalmieri Owned by:
Priority: major Milestone: sage-8.4
Component: notebook Keywords: jupyter
Cc: Merged in:
Authors: Erik Bray Reviewers: John Palmieri
Report Upstream: N/A Work issues:
Branch: 603d5a0 (Commits) Commit:
Dependencies: Stopgaps:

Description (last modified by jhpalmieri)

First reported at https://ask.sagemath.org/question/41170/search_doc-readable-format/. Running a command like search_doc("curl", whole_word=True) opens a window containing garbage. On my machine, it is a long string that starts:

aHRtbC9lbi9yZWZlcmVuY2UvZ2VuaW5kZXgtYWxsLmh0bWw6MTM5MTA6ICAgICAgPGxpPjxhIGhyZWY

The issue is that the search results frequently include the non-ASCII character "¶", and this breaks the Jupyter notebook pager. This character appears in lines without useful matches, so those matches can be discarded.

Change History (43)

comment:1 Changed 14 months ago by jhpalmieri

The problem is that the output from search_doc regularly contains non-ascii characters, like "¶", and the IPython pager does not handle these well. We can catch the error and then remove the offending character, or we can try to skip those search matches altogether ("Permalink to this definition: ¶" is probably not a useful search result).
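The second option above might look something like this (a minimal illustration; `filter_matches` and the sample lines are hypothetical, not Sage's actual code):

```python
# Hedged sketch: discard search matches whose only content is Sphinx's
# "Permalink to this definition" pilcrow anchor, since those lines are
# never useful results.
def filter_matches(lines):
    # Lines that match only because of the permalink anchor get dropped.
    return [line for line in lines if u"Permalink to this" not in line]

matches = [u"rings/ideal.py:12: def curl(self):",
           u"reference/genindex.html: Permalink to this definition \u00b6"]
print(filter_matches(matches))
```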

comment:2 Changed 14 months ago by jhpalmieri

  • Branch set to u/jhpalmieri/search_doc_non_ascii

comment:3 Changed 14 months ago by jhpalmieri

  • Authors set to John Palmieri
  • Commit set to 5fef0b3239f63aa523773af96d876aa4bace2220
  • Description modified (diff)
  • Status changed from new to needs_review

New commits:

5fef0b3  trac 24784: if the results of search_doc contain non-ascii characters,

comment:4 follow-up: Changed 14 months ago by nthiery

Thanks much, John, for investigating! I confirm that the above search on "curl" no longer displays garbage.

However, it still breaks on e.g.:

    search_doc("combinat")

Also, the rendering is still not as nice as in the old notebook where you get an itemized list of clickable links rather than the plain text:

html/en/reference/genindex-all.html:14283:      <li><a href="manifolds/sage/manifolds/operators.html#sage.manifolds.operators.curl">curl() (in module sage.manifolds.operators)</a>
html/en/reference/genindex-all.html:14286:        <li><a href="manifolds/sage/manifolds/differentiable/vectorfield.html#sage.manifolds.differentiable.vectorfield.VectorField.curl">(sage.manifolds.differentiable.vectorfield.VectorField method)</a>

This could be for a later ticket though (to be mentioned in #25837).

Some minor additional suggestions:

  • use "XXX" not in line rather than find
  • add a doctest

comment:5 Changed 14 months ago by nthiery

  • Status changed from needs_review to needs_work

comment:6 Changed 14 months ago by embray

When you say "IPython pager" I'm confused, because I thought we were talking about the notebook. Why would anything in the notebook not properly handle non-ASCII characters? And if so, I think that's something we should patch upstream rather than make our output worse. Maybe I'm missing something though.

comment:7 Changed 14 months ago by embray

Never mind; I reproduced the issue and I see it now. Nevertheless, it's a bit surprising that this works fine for me at the command line but is broken in the notebook. What's weirder is that I don't know why containing non-ASCII characters should have anything to do with it.

The "garbage" that is displayed in the notebook is the base64-encoding of the expected output. If I copy/paste the whole thing and wrap it in print(base64.b64decode()), the output from that displays just fine in the notebook, including the paragraph symbol.
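That round trip can be reproduced in isolation (a minimal sketch; the sample string is made up):

```python
# The "garbage" shown in the notebook is the intended output,
# base64-encoded. Decoding it recovers the text, pilcrow and all.
import base64

expected = u"curl() \u00b6 (in module sage.manifolds.operators)"
shown_in_notebook = base64.b64encode(expected.encode("utf-8"))
recovered = base64.b64decode(shown_in_notebook).decode("utf-8")
print(recovered)
```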

I think something a little deeper to do with IPython output formatting is going on here. I don't think we want a hack workaround; especially since there can be non-ASCII characters in other search results, and sometimes we'll want those too.

(Incidentally, the output of search_doc is pretty atrocious to begin with. Why does it output all this HTML instead of searching and returning results from the plain-text docs? What use is having search results from the auto-generated index, for that matter? Anyways, that's a separate issue; it just surprises me every time I see it...)

Last edited 14 months ago by embray (previous) (diff)

comment:8 in reply to: ↑ 4 Changed 14 months ago by embray

Replying to nthiery:

Also, the rendering is still not as nice as in the old notebook where you get an itemized list of clickable links rather than the plain text:

It seems that search_doc checks a global variable called EMBEDDED_MODE, which I guess is set somehow when running in the legacy Sage notebook. If EMBEDDED_MODE is set, it HTML-formats the results; otherwise it just calls IPython.core.page.page.

I think I might understand the problem now: although the output of search_doc contains unicode characters, it's just returning the raw UTF-8 byte strings. Your terminal knows how to display these, but Jupyter is treating this as a binary blob and displaying it in the notebook as base64. We need to pass an actual unicode string instead (I'm guessing). Further, I need to look into whether it's possible to pass HTML to the pager in supported contexts; I wonder if it supports multi-part MIME data like other Jupyter output contexts...
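A minimal sketch of that hypothesis (variable names and sample bytes are illustrative, not the eventual patch):

```python
# Hypothesized fix: decode the raw UTF-8 bytes into a real unicode text
# object before handing them to the IPython pager.
text_results = b"operators.html: curl() \xc2\xb6"  # raw bytes from the search

if not isinstance(text_results, str):
    # 'replace' keeps the pager from choking on malformed byte sequences.
    text_results = text_results.decode("utf-8", "replace")

# Sage would then call IPython.core.page.page(text_results); printing
# stands in for that here.
print(text_results)
```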

comment:9 Changed 14 months ago by embray

I've almost got it working--the HTML-formatted results are currently only formatted in a way that makes sense for the legacy notebook, so it needs two formatting modes.

comment:10 Changed 14 months ago by embray

Yikes. A <font> tag.

comment:11 Changed 14 months ago by embray

This is turning out to be something of a rabbit hole, as the old-style HTML-formatted results are...not great. And as I pointed out above, broken in Jupyter, since the URLs to the results are only meaningful inside the legacy notebook.

Should I attach just an interim patch for the immediate issue, and push off updating the HTML results for another ticket, or just do it all at once?

comment:12 Changed 14 months ago by embray

  • Authors changed from John Palmieri to John Palmieri, Erik Bray
  • Branch changed from u/jhpalmieri/search_doc_non_ascii to u/embray/ticket-24784
  • Commit changed from 5fef0b3239f63aa523773af96d876aa4bace2220 to 5c0d0a398cafd54072a05d534da52d4fbb9f1ab8
  • Status changed from needs_work to needs_review

How's this look for a start? Aside from a few other minor code updates this doesn't change any functionality, but just improves support for unicode text in the existing behavior. I will create a follow-up ticket for improving the HTML-formatted search results and making them work in the Jupyter Notebook.

Thanks again, John, for pointing out in the first place that the issue had to do with unicode characters, which is what led me in the right direction.


New commits:

5c0d0a3  Partially fixes #24784 by ensuring that text passed to the IPython pager that contains non-ASCII characters is an actual unicode text object

comment:13 Changed 14 months ago by embray

I created a follow-up #26104 for improving the HTML search results.

comment:14 Changed 14 months ago by embray

  • Milestone changed from sage-8.2 to sage-8.4

comment:15 Changed 14 months ago by jhpalmieri

sage: search_src('Steenrod')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

...

TypeError: format_search_as_html() got an unexpected keyword argument 'whole_word'

Maybe it should be

  • src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index 2576cf9090..7b3c2630ed 100644
    @@ You can build this with 'sage -docbuild {} html'.""".format(s)) @@
         if not interact:
             return format_text_results()

    -    html_results = format_search_as_html(title, results, [string] + extras,
    -                                         whole_word=whole_word)
    +    html_results = format_search_as_html(title, results, string)

         from sage.server.support import EMBEDDED_MODE
         if EMBEDDED_MODE:

Edit: no, looking at the old code, you're right that we need to use [string] + extras. But I think whole_word does not belong there.

Last edited 14 months ago by jhpalmieri (previous) (diff)

comment:16 follow-up: Changed 14 months ago by jhpalmieri

I don't quite see the point behind format_text_results. This looks a little simpler to me:

  • src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index 2576cf9090..36c5928efd 100644
    @@ You can build this with 'sage -docbuild {} html'.""".format(s)) @@
                                     if not re.search(extra, match_list):
                                         match_list = None
                             if match_list:
    -                            append(filename[strip:].lstrip("/"))
    +                            append(filename[strip:].lstrip("/") + "\n")
                         else:
                             match_list = [(lineno, line) for lineno, line in
                                           enumerate(open(filename).read().splitlines(True))
    @@ You can build this with 'sage -docbuild {} html'.""".format(s)) @@
                                 append(':'.join([filename[strip:].lstrip("/"),
                                                  str(num + 1), line]))

    -    def format_text_results():
    -        if multiline:
    -            return '\n'.join(results)
    -        else:
    -            return ''.join(results)
    -
    +    text_results = ''.join(results)
         if not interact:
    -        return format_text_results()
    +        return text_results

         html_results = format_search_as_html(title, results, [string] + extras,
                                              whole_word=whole_word)
    @@ You can build this with 'sage -docbuild {} html'.""".format(s)) @@
         else:
             # Pass through the IPython pager in a mime bundle
             from IPython.core.page import page
    -        text_results = format_text_results()
             if not isinstance(text_results, text_type):
                 text_results = text_results.decode('utf-8', 'replace')


(N.B. This diff doesn't include the previous one for format_search_as_html.)

Last edited 14 months ago by jhpalmieri (previous) (diff)

comment:17 Changed 14 months ago by jhpalmieri

See also #26102 for a speed-up for search_doc: mainly, don't search the _static directories.

comment:18 follow-up: Changed 14 months ago by jhpalmieri

I personally find that the change append = results.append makes the code harder to read without much benefit, but I won't argue about it.

comment:19 Changed 14 months ago by jhpalmieri

With the legacy Sage notebook, searches with multiline=True don't print anything with this branch. (To get it to work at all, I removed the whole_word=whole_word from the call to format_results_as_html, but I made no other changes.) It also prints the output in a slightly funny way: when I evaluate search_src('Steenrod', multiline=True), the old header said Search Source Code: "Steenrod". The new header says Search Source Code: "Steenrod", "", "", "", "", "" – it's printing the extra terms separately. I can upload screenshots if you want. This is not the highest priority since we're working on deprecating the legacy notebook, but I suppose we shouldn't actively make things worse for it.

comment:20 Changed 14 months ago by jhpalmieri

  • Status changed from needs_review to needs_work

comment:21 Changed 14 months ago by embray

Oops, I left in a change that I meant to omit (I had started down a path toward #26104 but decided to split off those changes). That's where all of those reported problems came from.

Last edited 14 months ago by embray (previous) (diff)

comment:22 in reply to: ↑ 16 Changed 14 months ago by embray

Replying to jhpalmieri:

I don't quite see the point behind format_text_results. This looks a little simpler to me:

I think that's probably fine, yeah. I think when I first added that, I didn't fully understand what was going on with if multiline. This code is still a bit confusing in general and should probably be rewritten, but that's for another time.

comment:23 in reply to: ↑ 18 Changed 14 months ago by embray

Replying to jhpalmieri:

I personally find that the change append = results.append makes the code harder to read without much benefit, but I won't argue about it.

This can be a useful optimization in a loop, but in retrospect it will only be called in this case where there is a search match, which is relatively infrequent, so I agree there's not much benefit in this case.
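For reference, the pattern being discussed looks like this (a generic illustration, not Sage code):

```python
# Micro-optimization: binding the bound method once avoids a repeated
# attribute lookup inside a loop. It only pays off when the call actually
# runs many times.
results = []
append = results.append  # look up .append once, not once per iteration
for i in range(5):
    append(i * i)
print(results)  # → [0, 1, 4, 9, 16]
```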

comment:24 Changed 14 months ago by git

  • Commit changed from 5c0d0a398cafd54072a05d534da52d4fbb9f1ab8 to f7c4a65a210ee254f1ee8724145954ef245aef8c

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

f7c4a65  Partially fixes #24784 by ensuring that text passed to the IPython pager that contains non-ASCII characters is an actual unicode text object

comment:25 Changed 14 months ago by embray

  • Status changed from needs_work to needs_review

comment:26 follow-ups: Changed 14 months ago by jhpalmieri

  • Status changed from needs_review to needs_work

sage: search_src('IdealMonoid')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)

  ...

NameError: global name 'format_text_results' is not defined

You forgot to delete a line:

  • src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index c77f76f2a5..64f4b8afaa 100644
    @@ You can build this with 'sage -docbuild {} html'.""".format(s)) @@
         else:
             # Pass through the IPython pager in a mime bundle
             from IPython.core.page import page
    -        text_results = format_text_results()
             if not isinstance(text_results, text_type):
                 text_results = text_results.decode('utf-8', 'replace')


I would encourage you to actually try a few simple cases before marking it as needs_review ;)

I wonder if we should put an example with interact=True in the doctests to catch this sort of thing; for example:

  • src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index c77f76f2a5..ff8021ffff 100644
    @@ def search_src(string, extra1='', extra2='', extra3='', extra4='', @@
             matrix/matrix0.pyx:924:        Set the 2 x 2 submatrix of M, starting at row index and column
             matrix/matrix0.pyx:933:        Set the 2 x 3 submatrix of M starting at row index and column

    +    One example with ``interact=True``::
    +
    +        sage: search_src('IdealMonoid') # random
    +        misc/sagedoc.py:1145:        sage: search_src('IdealMonoid')
    +        algebras/letterplace/free_algebra_letterplace.pyx:110:from sage.rings.noncommutative_ideals import IdealMonoid_nc
    +        algebras/letterplace/free_algebra_letterplace.pyx:556:            self.__monoid = IdealMonoid_nc(self)
    +        rings/ideal_monoid.py:12:def IdealMonoid(R):
    +        ...
         """
         return _search_src_or_doc('src', string, extra1=extra1, extra2=extra2,
                                   extra3=extra3, extra4=extra4, extra5=extra5,

Also, with multiline=True, it still doesn't work with the legacy notebook. The colons which used to be inserted seem to be necessary. You could keep the old code and defer any changes involving html_results to #26104.

comment:27 in reply to: ↑ 26 ; follow-up: Changed 14 months ago by embray

Replying to jhpalmieri:

I would encourage you to actually try a few simple cases before marking it as needs_review ;)

I'm actually now going to have to review the doctests for this code, because I did run the doctests and somehow it passed. That obviously can't be good.

comment:28 follow-up: Changed 14 months ago by embray

A related, but actually different problem, is that many of the tests in this module are (unsurprisingly) marked # long time, and I wasn't testing with --long.

comment:29 in reply to: ↑ 27 Changed 14 months ago by jhpalmieri

Replying to embray:

Replying to jhpalmieri:

I would encourage you to actually try a few simple cases before marking it as needs_review ;)

I'm actually now going to have to review the doctests for this code, because I did run the doctests and somehow it passed. That obviously can't be good.

The issue is that (I think) the doctests all use interact=False, so they bypass parts of the code. Hence my suggestion for a new doctest.

comment:30 in reply to: ↑ 26 Changed 14 months ago by embray

Replying to jhpalmieri:

Also, with multiline=True, it still doesn't work with the legacy notebook. The colons which used to be inserted seem to be necessary. You could keep the old code and defer any changes involving html_results to #26104.

I'll fix that. Just another symptom of the bad interface for format_search_as_html.

comment:31 in reply to: ↑ 28 Changed 14 months ago by jhpalmieri

Replying to embray:

A related, but actually different problem, is that many of the tests in this module are (unsurprisingly) marked # long time, and I wasn't testing with --long.

Another issue is that some doctests are marked # optional - dochtml, and they are not run by default, although I was led to believe in #25345 that they would be. To verify this, I added

    sage: 1+1==5 # optional - dochtml
    True

to the file, and doctests pass unless I explicitly use the --optional dochtml flag. I'll open a new ticket for that: #26110.

Last edited 14 months ago by jhpalmieri (previous) (diff)

comment:32 Changed 14 months ago by git

  • Commit changed from f7c4a65a210ee254f1ee8724145954ef245aef8c to 603d5a025ace47b19186007da8d48a5142f2848a

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

c2a2aa0  Partially fixes #24784 by ensuring that text passed to the IPython pager that contains non-ASCII characters is an actual unicode text object
691a224  cleaned up argument handling in this function
603d5a0  fixed a bug when passing a list of multiline=True results to format_search_as_html

comment:33 follow-up: Changed 14 months ago by embray

Should be fixed.

Given more time, I would like to redo the tests for this module. While maybe it's OK to have one or two tests that go over all of the sage sources, most of the # long time tests in this module should be done away with. The tests can instead run the search over just a small subset of docs/source to speed them up.

Last edited 14 months ago by embray (previous) (diff)

comment:34 Changed 14 months ago by embray

  • Status changed from needs_work to needs_review

comment:35 follow-up: Changed 14 months ago by jhpalmieri

Not a big deal, but in the code

    for L in results:
        filename = L.strip().split(':', 1)[0]
        if filename:
            files.add(filename)

as long as L is nonempty and doesn't start with a colon, filename will be nonempty as well. Do you expect L to start with a colon? I tried a few cases and didn't see that come up.
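A quick check of that observation (the sample inputs are made up):

```python
# filename from split(':', 1)[0] is empty only when the stripped line is
# blank or starts with a colon; any other nonempty line yields a
# nonempty filename.
files = set()
results = ["misc/sagedoc.py:12: some match",
           "",
           ":a line starting with a colon"]
for L in results:
    filename = L.strip().split(':', 1)[0]
    if filename:
        files.add(filename)
print(sorted(files))  # → ['misc/sagedoc.py']
```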

comment:36 in reply to: ↑ 33 Changed 14 months ago by jhpalmieri

Replying to embray:

Should be fixed.

Given more time, I would like to redo the tests for this module. While maybe it's OK to have one or two tests that go over all of the sage sources, most of the # long time tests in this module should be done away with. The tests can instead run the search over just a small subset of docs/source to speed them up.

This can wait for another ticket, but if you are thinking about changes like this, then that sounds like a good idea:

  • src/sage/misc/sagedoc.py

    diff --git a/src/sage/misc/sagedoc.py b/src/sage/misc/sagedoc.py
    index 94fe79bd6c..539f7420fd 100644
    @@ def _search_src_or_doc(what, string, extra1='', extra2='', extra3='', @@
         EXAMPLES::

             sage: from sage.misc.sagedoc import _search_src_or_doc
    -        sage: print(_search_src_or_doc('src', r'matrix\(',  # long time random
    +        sage: print(_search_src_or_doc('src', r'matrix\(',  # random
             ....:       'incidence_structures', 'self',
    -        ....:       '^combinat', interact=False))
    -        misc/sagedoc.py:        sage: _search_src_or_doc('src', 'matrix(', 'incidence_structures', 'self', '^combinat', interact=False)
    +        ....:       '^combinat', module='sage.combinat', interact=False))
             combinat/designs/incidence_structures.py:        M1 = self.incidence_matrix()
             combinat/designs/incidence_structures.py:        A = self.incidence_matrix()
             combinat/designs/incidence_structures.py:        M = transpose(self.incidence_matrix())
    @@ def _search_src_or_doc(what, string, extra1='', extra2='', extra3='', @@

             sage: from sage.misc.sagedoc import _search_src_or_doc
             sage: _search_src_or_doc('src', r'def _search_src_or_doc\(',
    -        ....:                    interact=True)  # long time
    +        ....:                    module='sage.misc', interact=True)
             misc/sagedoc.py:...:        def _search_src_or_doc(what, string, extra1='', extra2='', extra3='',
         """

    @@ def search_src(string, extra1='', extra2='', extra3='', extra4='', @@
         The following produces an error because the string 'fetch(' is a
         malformed regular expression::

    -        sage: print(search_src(" fetch(", "def", interact=False))
    +        sage: print(search_src(" fetch(", "def", module="sage.matrix", interact=False))
             Traceback (most recent call last):
             ...
             error: unbalanced parenthesis

         To fix this, *escape* the parenthesis with a backslash::

    -        sage: print(search_src(" fetch\(", "def", interact=False)) # random # long time
    +        sage: print(search_src(" fetch\(", "def", module="sage.matrix", interact=False)) # random
             matrix/matrix0.pyx:    cdef fetch(self, key):
             matrix/matrix0.pxd:    cdef fetch(self, key)

    -        sage: print(search_src(" fetch\(", "def", "pyx", interact=False)) # random # long time
    +        sage: print(search_src(" fetch\(", "def", "pyx", module="sage.matrix", interact=False)) # random
             matrix/matrix0.pyx:    cdef fetch(self, key):

         As noted above, the search is case-insensitive, but you can make it

comment:37 Changed 14 months ago by jhpalmieri

I get one doctest failure when run with sage -t --long --optional=sage,dochtml:

**********************************************************************
File "src/sage/misc/sagedoc.py", line 1179, in sage.misc.sagedoc.?
Failed example:
    len(search_doc('tree', whole_word=True, interact=False).splitlines()) < 2000  # optional - dochtml, long time
Expected:
    True
Got:
    False
**********************************************************************

Is this fixed by #26017? If so, should that ticket be a dependency of this one?

comment:38 Changed 14 months ago by embray

That's fixed by #26017. Also, I don't get that failure, but as I wrote there that failure can also happen if you just have "junk" in your doc build and goes away with doc-clean. I don't think it's related to this ticket.

comment:39 in reply to: ↑ 35 Changed 14 months ago by embray

Replying to jhpalmieri:

Not a big deal, but in the code

    for L in results:
        filename = L.strip().split(':', 1)[0]
        if filename:
            files.add(filename)

as long as L is nonempty and doesn't start with a colon, filename will be nonempty as well. Do you expect L to start with a colon? I tried a few cases and didn't see that come up.

That's a good point. The original code also would have broken if a filename started with a colon. This should never be the case though so I'm not worried about it.

comment:40 Changed 14 months ago by jhpalmieri

  • Authors changed from John Palmieri, Erik Bray to Erik Bray
  • Reviewers set to John Palmieri
  • Status changed from needs_review to positive_review

Okay, let's close this one. Cleaning up the doctests can happen on another ticket.

comment:41 Changed 14 months ago by embray

Agreed--thank you for filling in the gaps of my shoddy self-review.

comment:42 Changed 14 months ago by vbraun

  • Branch changed from u/embray/ticket-24784 to 603d5a025ace47b19186007da8d48a5142f2848a
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:43 Changed 6 months ago by slelievre

  • Commit 603d5a025ace47b19186007da8d48a5142f2848a deleted

Note: similar issue with "table" at #27656.
