Opened 8 years ago

Last modified 3 years ago

#11542 needs_work defect

Fix preparse_file to prevent constants from being assigned to

Reported by: nbruin Owned by: was
Priority: major Milestone: sage-6.4
Component: user interface Keywords:
Cc: kcrisman Merged in:
Authors: Martin von Gagern Reviewers:
Report Upstream: N/A Work issues:
Branch: u/gagern/ticket/11542 (Commits) Commit: 9a909f48f9f3fdeadefc2511d694a7b9ae37d214
Dependencies: Stopgaps:

Description

In this sage-devel thread Tom Boothby observed that misc.preparse_file has a habit to replace integers by identifiers that are supposed to be constants. This transformation turns some ungrammatical strings into grammatical ones, such as

1=5

which gets preparsed to

_sage_const_1 = Integer(1)
_sage_const_5 = Integer(5)
_sage_const_1 =_sage_const_5

Note that there does not have to be a "=" involved for this to happen:

[1^2 for j in range(10) for 1 in [1..2*1] ] 

which only differs one keyword from the perfectly legal

[1^2 for j in range(10) if 1 in [1..2*1] ] 

so it's likely that one has to fully parse the string to detect the problems.

Change History (20)

comment:1 Changed 8 years ago by nbruin

Since the code produced by the normal preparse does not suffer from the problem (it transforms the ungrammatical constructs into other ungrammatical ones), one could use its output to test for errors:

sage: from ast import parse
sage: def fixed_preparse_file(s):
...       t=preparse(s)
...       try:
...           parse(t)
...           return sage.misc.preparser.preparse_file(s)
...       except:
...           return t

some timings on an ungrammatical construct

sage: timeit('fixed_preparse_file("[1^2 for j in range(10) for 1 in [1..2*1]]")')
625 loops, best of 3: 309 µs per loop
sage: timeit('sage.misc.preparser.preparse_file("[1^2 for j in range(10) for 1 in [1..2*1]]")')
625 loops, best of 3: 477 µs per loop
sage: timeit('preparse("[1^2 for j in range(10) for 1 in [1..2*1]]")')
625 loops, best of 3: 231 µs per loop

On a grammatical one

sage: timeit('fixed_preparse_file("[1^2 for j in range(10) if 1 in [1..2*1]]")')
625 loops, best of 3: 840 µs per loop
sage: timeit('sage.misc.preparser.preparse_file("[1^2 for j in range(10) if 1 in [1..2*1]]")')
625 loops, best of 3: 476 µs per loop

And, for reference, the output generated

sage: fixed_preparse_file("[1^2 for j in range(10) for 1 in [1..2*1]]")
'[Integer(1)**Integer(2) for j in range(Integer(10)) for Integer(1) in (ellipsis_range(Integer(1),Ellipsis,Integer(2)*Integer(1)))]'
sage: fixed_preparse_file("[1^2 for j in range(10) if 1 in [1..2*1]]")
'_sage_const_2 = Integer(2); _sage_const_1 = Integer(1); _sage_const_10 = Integer(10)\n[_sage_const_1 **_sage_const_2  for j in range(_sage_const_10 ) if _sage_const_1  in (ellipsis_range(_sage_const_1 ,Ellipsis,_sage_const_2 *_sage_const_1 ))]'

This code simply tries to parse the result of preparse and if that leads to an error, it returns that result (so that the error can be raised in the normal way). If it does parse, it returns the result of preparse_file instead. As you can see from the timings, the preparsing (and the double preparsing doubly so) is more expensive than the parsing.

comment:2 Changed 8 years ago by robertwb

Another option is to parse all but the literals, and it should then be syntactically valid Python code ("fast" to check), and then throw the literals in there. This might be a better representation to raise syntax errors from as well.

comment:3 Changed 8 years ago by kcrisman

  • Cc kcrisman added

Here's another example by Tom Boothby:

class 0: 
    def 0(0): 
        return 0

comment:4 Changed 8 years ago by jason

Note that this constant refactoring is an *option* to preparse_file that can be turned off.

       * ``numeric_literals`` - bool (default: True), whether to factor
         out wrapping of integers and floats, so they don't get created
         repeatedly inside loops

comment:5 Changed 6 years ago by jdemeyer

  • Milestone changed from sage-5.11 to sage-5.12

comment:6 Changed 6 years ago by vbraun_spam

  • Milestone changed from sage-6.1 to sage-6.2

comment:7 Changed 6 years ago by vbraun_spam

  • Milestone changed from sage-6.2 to sage-6.3

comment:8 Changed 5 years ago by vbraun_spam

  • Milestone changed from sage-6.3 to sage-6.4

comment:9 Changed 5 years ago by gagern

  • Branch set to u/gagern/ticket/11542
  • Modified changed from 08/10/14 16:51:03 to 08/10/14 16:51:03

comment:10 Changed 5 years ago by gagern

  • Commit set to a8c10952b8ca987cd5c8623cbbaf857ee79bbecc

Quoting from the commit message of the commit I just pushed:

The core idea is to replace numeric literals not directly with the corresponding variable name, but instead with a formatting placeholder similar to those used for string literals and comments. That way we can substitute different things for different applications: for syntax checks we substitute real python numeric literals, namely the literal 1. For compilation and execution, we plug in the variable names. This approach was suggested by Robert Bradshaw.

This change deliberately breaks compatibility with regard to the values returned by preparse_numeric_literals(code, extract=True) and extract_numeric_literals. Where the first element of the returned tuple used to contain variable names, it now contains formatting termplates. There are also some slight deviations regarding spaces added after raw Python numeric literals. It is assumed that no user code will rely on any of this functionality, except indirectly via preparse_file.

This change might incur some performance penalty since the Python code has to be parsed twice. On the other hand, there are some situations where performance should become better. We avoid extracting string literals twice, in preparse_file and preparse. We also avoid running the preparser on the numeric literals. The latter apparently was a limiting factor for the number of assignments which could be placed in the first row. Perhaps that limit no longer applies now, and should be removed in a subsequent commit.

Please let me know if you consider my approach of breaking the API unsuitable, and if I should follow proper deprecation procedures instead. If so, getting a clean interface for all of this might require several years: I'd first introduce a keyword argument to extract_numeric_literals and deprecate calls without that, then I'd make code always follow the new behavior and deprecate calls with that argument. Two deprecation steps, means two years minimum.

Please also let me know what you thing of always placing all assignments into the first line, now that doing so should be possible without trouble from the preparser.

I'll not be able to work on this for the next few weeks, or even respond to input. If someone else wants to take this up and push it through review, be my guest. Otherwise I'll request review at some point next month.


New commits:

a8c1095preparse_file: trigger SyntaxError on assignment to numeric literal.

comment:11 Changed 5 years ago by gagern

  • Authors set to Martin von Gagern

comment:12 Changed 5 years ago by gagern

  • Status changed from new to needs_review

comment:13 Changed 5 years ago by kcrisman

  • Status changed from needs_review to needs_work

Fro some reason this doesn't apply, apparently.

comment:14 Changed 5 years ago by git

  • Commit changed from a8c10952b8ca987cd5c8623cbbaf857ee79bbecc to c8375cad159f50605c07da1447afc0ff1f6607b1

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

c8375capreparse_file: trigger SyntaxError on assignment to numeric literal.

comment:15 Changed 5 years ago by gagern

  • Status changed from needs_work to needs_review

Conflict due to a2598e1 which moved the code to a different package. I rebased the code to 6.5.beta2 by applying the patch to the new file using patch --merge.

comment:16 Changed 5 years ago by gagern

Now that 6.6 has reached its first release candidate, can we perhaps try to get this here addressed for 6.7? This change might theoretically affect pretty much all sage interactions, so it would be nice to see this introduced early in the next round of beta testing, to spot any issues in terms of performance or semantics. Since I'm still waiting for some basic feedback to comment:10, it would be nice if someone could look at my changes soon, to get things settled by the time 6.6 gets released 6.7 starts betas.

comment:17 follow-up: Changed 5 years ago by nbruin

I haven't checked or tested the approach at all, so I cannot comment on that. I did notice

    try:
        ...
    except:
        print ...
        raise

Unfortunately, that's a bad idea. The normal behaviour is that during exception raising, nothing is printed. Only once the exception reaches the REPL top level does the traceback get printed, and in a way that is subject to hooks.

Printing something that belongs in the traceback could get lost (if stdout is not stderr) or reach an inappropriate location (imagine that this routine itself is run inside a try/except where the exception is supposed to be caught).

I think the better solution here is to test if the failure of the "try" body is in the way that warrants the information contained in "print" and then raise an appropriate, new exception with an informative message (possibly with bits extracted from the original exception) and otherwise just raise, without printing new information.

If the "try" body fails in an expected way then the traceback from there shouldn't be relevant, so a fresh exception is OK. Otherwise the exception should appear uncaught.

comment:18 in reply to: ↑ 17 Changed 5 years ago by gagern

Replying to nbruin.

Thanks for the feedback. The situation is the following: we see an exception for any kind of syntax error, and the exception itself will likely contain code snippets describing its location, but these code snippets are neither for the code the user entered, nor for the code which will eventually get executed, but instead contains some dummy values in place of literals.

Perhaps I should do the following: try to parse the code with the actual values plugged back in, and if that throws, let the exception escape. If not, raise a new exception about there (probably?) being an assignment to a literal. Will write code for that soon.

Last edited 4 years ago by gagern (previous) (diff)

comment:19 Changed 5 years ago by git

  • Commit changed from c8375cad159f50605c07da1447afc0ff1f6607b1 to 9a909f48f9f3fdeadefc2511d694a7b9ae37d214

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

9a909f4preparse_file: trigger SyntaxError on assignment to numeric literal.

comment:20 Changed 3 years ago by chapoton

  • Status changed from needs_review to needs_work

does not apply

Note: See TracTickets for help on using tickets.