#28974 closed defect (fixed)

F-strings (PEP 498) are not correctly handled by Sage's preparser

Reported by: bouvier Owned by:
Priority: critical Milestone: sage-9.3
Component: user interface Keywords: f-string, preparse, preparser
Cc: gh-jcamp0x2a, gh-kliem, slelievre, embray Merged in:
Authors: Joshua Campbell Reviewers: Markus Wageringel
Report Upstream: N/A Work issues:
Branch: 3eedd36 (Commits, GitHub, GitLab) Commit: 3eedd36e37a5b1d1d966c078c2f02a05e5253060
Dependencies: Stopgaps:

Status badges

Description

F-strings are defined by PEP 498 (https://www.python.org/dev/peps/pep-0498/)

Currently, in sage, they are not touched by the preparser which can cause incoherences.

For example, the following code

print (f'{1/3}')
print (1/3)
a = 1
print (f'{a/3}')
print (a/3)

outputs

0.3333333333333333
1/3
1/3
1/3

The difference between the first and second print statement is due to the facts that:

  • in the second print statement, the preparser replace 1/3 by Integer(1)/Integer(3) which is a Rational
  • in the first print, the preparser does not touch the content of the string between the brackets

Other incoherences can be found with power:

print (f'{2^3}')
print (2^3)

outputs

1
8

Change History (50)

comment:1 Changed 17 months ago by mkoeppe

  • Milestone changed from sage-9.1 to sage-9.2

Batch modifying tickets that will likely not be ready for 9.1, based on a review of the ticket title, branch/review status, and last modification date.

comment:2 Changed 13 months ago by mkoeppe

  • Cc gh-jcamp0x2a gh-kliem added

comment:3 Changed 13 months ago by mkoeppe

  • Priority changed from major to critical

comment:4 Changed 13 months ago by gh-jcamp0x2a

Hmm, seems tricky due to the possibility of nesting strings -- including other f-strings -- inside their replacement sections, like in the following:

>>> f"{ 'abc-1-2'.upper() }" # Don't want 'ABC-Integer(1)-Integer(2)'
'ABC-1-2'
>>> f"""1 { f'''2 { f"4 { f'8 { 2**4 } 8' } 4" } 2''' } 1""" # Ideally 2^4
'1 2 4 8 16 8 4 2 1'

I can take a stab at implementing this since the preparser code is somewhat fresh in my mind from troubleshooting #30417.

comment:5 Changed 13 months ago by mkoeppe

  • Cc slelievre added

I think it will be a great improvement already if it can handle simple cases; I would not worry about nested f-strings too much.

comment:6 Changed 13 months ago by gh-jcamp0x2a

  • Authors set to Joshua Campbell
  • Branch set to u/gh-jcamp0x2a/28974-f-strings
  • Commit set to 14ba983f5c48c5399fe929de9ef9d6cc2fb026bf

The nested case turned out to be fairly straight-forward to implement after all. Have pushed a rough draft. It handles, among other things, the two examples I posted before.

Still need to:

  • deal with {{ and }} escape sequences in literal portion of F-string.
  • treat format specifier as literal except for portions surrounded by braces in order to avoid an error b/c of something like f'{x:5}' transforming into f'{x:Integer(5)}', which is invalid. f'{x:{5}}' is a workaround, though.
  • add lots of new doctests to deal with all these fun corner cases.

comment:7 follow-up: Changed 13 months ago by gh-mwageringel

Thanks for your work on the preparser. I have not looked at your code in detail yet, but I think the function containing_block might be useful for the nested cases. It could be used to extract a substring that is preparsed individually. For example:

sage: s = """1 { f'''2 { f"4 { f'8 { 2**4 } 8' } 4" } 2''' } 1"""
sage: sage.repl.preparse.containing_block(s, 2, delimiters=['{}'])
(2, 47)
sage: print(s[slice(*_)])
{ f'''2 { f"4 { f'8 { 2**4 } 8' } 4" } 2''' }
sage: for idx in [2, 10, 16, 22]: print(s[slice(*sage.repl.preparse.containing_block(s, idx, delimiters=['{}']))])
{ f'''2 { f"4 { f'8 { 2**4 } 8' } 4" } 2''' }
{ f"4 { f'8 { 2**4 } 8' } 4" }
{ f'8 { 2**4 } 8' }
{ 2**4 }

comment:8 in reply to: ↑ 7 Changed 13 months ago by gh-jcamp0x2a

Replying to gh-mwageringel:

Thanks for your work on the preparser. I have not looked at your code in detail yet, but I think the function containing_block might be useful for the nested cases. It could be used to extract a substring that is preparsed individually.

Thank you. I will take a look at containing_block and see if I can use it to simplify things. I'm concerned that it might not work for multi-line F-strings, though, since each line is preparsed separately, and so the braces on any one line may be unbalanced.

comment:9 Changed 13 months ago by embray

  • Cc embray added

comment:10 Changed 13 months ago by git

  • Commit changed from 14ba983f5c48c5399fe929de9ef9d6cc2fb026bf to 3f27c6ca456659976f1d5fb9a826664f1798708d

Branch pushed to git repo; I updated commit sha1. New commits:

b69674dProperly handle {{ and }} escape sequences in F-strings
ae7fda7Make sure quote.braces is never negative
3f27c6cTreat brace as code to avoid problem with generator expression

comment:11 Changed 13 months ago by git

  • Commit changed from 3f27c6ca456659976f1d5fb9a826664f1798708d to 6b29a45ed65ab2e8b4ac2b810f522549866c84d3

Branch pushed to git repo; I updated commit sha1. New commits:

6b29a45Fix issue with numeric wrapping inside single-quoted F-strings

comment:12 Changed 13 months ago by gh-jcamp0x2a

I ran into a snag with another, later stage of the preparser, preparse_numeric_literals, and would like feedback on how to proceed.

Expressions like 3.14 and 5j get turned into RealNumber('3.14') and ComplexNumber(0, '5'). That's fine when using a double-quoted F-string, but it of course fails when surrounded by single-quotes in an F-string:

sage: f"{3.14}"
'3.14000000000000'
sage: f"{5j}"
'5.00000000000000*I'
sage: f'{3.14}'
  File "<ipython-input-11-8042a075c204>", line 1
    f'{RealNumber('3.14')}'
                      ^
SyntaxError: invalid syntax

sage: f'{5j}'
  File "<ipython-input-12-a4948e5b5673>", line 1
    f'{ComplexNumber(0, '5')}'
                         ^
SyntaxError: invalid syntax

sage: f'{3.14r}'
'3.14'
sage: f'{5jr}'
'5j'

I've "solved" this by keeping track of which quotation marks (', ", ''', """) have been used by F-strings and telling preparse_numeric_literals to use one of the unused ones instead, if any. In the -- perhaps rare -- case of all of them being used up, numeric wrapping will not occur. For example:

sage: f'{3.14}' + f"{3.14}" + f'''{3.14}'''  # """ is available to use
'3.140000000000003.140000000000003.14000000000000'
sage: f'{3.14}' + f"{3.14}" + f'''{3.14}''' + f"""{3.14}"""  # no longer
'3.143.143.143.14'
sage: f"""{f'''{f"{3.14}"}'''}"""  # ' is available to use
'3.14000000000000'
sage: f"""{f'''{f"{f'{3.14}'}"}'''}"""  # no longer
'3.14'

Is this acceptable? The only other way forward I see at the moment would be to modify Sage's RealNumber and ComplexNumber to accept something like a (digits, exponent) tuple like ([3, 1, 4], 1) in place of the string '3.14'. Then have preparse_numeric_literals use those to avoid adding any of its own quotation marks whatsoever. I'm not familiar with the sage.rings code nor do I have much experience with Cython, and there's lots of Cython in there, so I could do some damage mucking around :)

On a brighter note, I figured out how to handle the {{ and }} escape sequences. I also found Python's own set of test cases for F-strings so that should be a good source of inspiration for doctests. I turned that file into a .sage so the preparser could have a go at it, and the only failures at this point seem to be related to the format specifier problem I already noted in a previous comment.

comment:13 follow-up: Changed 13 months ago by embray

This might be ugly, but would be more reliable and less messy than dealing with nesting of quotation marks. I don't know if it would be easy though. What if the preparser replaced all numerical literals in the format string (of any sort that are already handled by the preparser) with variables? Something like:

f'{3.14} {5j}' ->
__sage_tmp_literals = [RealNumber('3.14'), ComplexNumber(0, '5')]; f'{__sage_tmp_literals[0]} {__sage_tmp_literals[1]}'; del __sage_tmp_literals

comment:14 in reply to: ↑ 13 ; follow-up: Changed 13 months ago by gh-jcamp0x2a

Replying to embray:

This might be ugly, but would be more reliable and less messy than dealing with nesting of quotation marks. I don't know if it would be easy though. What if the preparser replaced all numerical literals in the format string (of any sort that are already handled by the preparser) with variables? Something like:

f'{3.14} {5j}' ->
__sage_tmp_literals = [RealNumber('3.14'), ComplexNumber(0, '5')]; f'{__sage_tmp_literals[0]} {__sage_tmp_literals[1]}'; del __sage_tmp_literals

I really like the idea, and I believe something similar is already occurring when preparsing an entire file (such as when running a .sage script). For example, one of the test case failures I mentioned looks like:

# in test_fstring.sage:
self.assertEqual(f'{3.14:10.10}', '      3.14')
# when running with ./sage test_fstring.sage:
======================================================================
ERROR: test_conversions (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_fstring.sage.py", line 866, in test_conversions
    self.assertEqual(f'{_sage_const_3p14 :_sage_const_10p10 }', '      3.14')
TypeError: unsupported format string passed to sage.rings.real_mpfr.RealLiteral.__format__

Those constants are seen at the top of the preparsed file, test_fstring.sage.py:

_sage_const_3p14 = RealNumber('3.14'); _sage_const_10p10 = RealNumber('10.10');

My concern is that preparse is invoked on individual lines by the interpreter and even on individual expressions elsewhere in Sage (ex: sage_eval and the sage(...) construct you can use from gap/maxima/singular). I wouldn't want to prepend an assignment in the middle of a multi-line statement or an arbitrary expression. Might be able to fix it on a case-by-case basis everywhere preparse is called, but I'd like a more self-contained solution.

I'm going to try that (sign, digits, exponent) idea and see how it works out. If it gets too messy, I'll come back to this idea instead.

comment:15 in reply to: ↑ 14 Changed 13 months ago by nbruin

Replying to gh-jcamp0x2a:

My concern is that preparse is invoked on individual lines by the interpreter and even on individual expressions elsewhere in Sage (ex: sage_eval and the sage(...) construct you can use from gap/maxima/singular). I wouldn't want to prepend an assignment in the middle of a multi-line statement or an arbitrary expression. Might be able to fix it on a case-by-case basis everywhere preparse is called, but I'd like a more self-contained solution.

These are different modes. preparse_file indeed factors out numerical constants and assigns them to variables, to avoid the repeated recreation of them in loops. However, preparse normally doesn't do that, and cannot do that for f-string components either.

comment:16 follow-up: Changed 13 months ago by gh-mwageringel

Inside expressions, one could use anonymous lambdas in order to avoid assignments, like this:

sage: f'π ≈ {(lambda __sage_tmp: f"{__sage_tmp}")(RealNumber("3.14"))}'
'π ≈ 3.14000000000000'

This would still be difficult in case of multiline strings – though, I am not really sure why the preparser needs to run line by line.

In reality though I think that nested f-strings are quite rare, so just raising a SyntaxError with a helpful message when all possible quotes have been used should be fine. This can already happen in the first nested level as in

f"{f'{3.14}'}"

but the example from comment:12 can be rewritten without conflicting quotes:

sage: f'{3.14}' + f"{3.14}" + f'''{3.14}''' + f"""{3.14}"""  # no longer

comment:17 Changed 13 months ago by git

  • Commit changed from 6b29a45ed65ab2e8b4ac2b810f522549866c84d3 to 4125a682b004e1a34e66d20718a92a7d255577c1

Branch pushed to git repo; I updated commit sha1. New commits:

4125a68Rebuild string using array of code-points when out of quotes to use

comment:18 follow-up: Changed 13 months ago by gh-jcamp0x2a

After thinking it over, I realized I was over-thinking it :)

I was able to fix it by just converting to/from a list of Unicode code-points when out of quote delimiters to use. The preparsed code doesn't look pretty, but it only happens in one of those rare cases mentioned. Ugly yet functional over pretty yet broken.

sage: preparse('f"""{f\'\'\'{f"{f\'{3.14}\'}"}\'\'\'"""')
'f"""{f\'\'\'{f"{f\'{RealNumber(str().join(map(chr, [51, 46, 49, 52])))}\'}"}\'\'\'"""'

I apologize for the churn. I expect to get that format specifier problem ironed out and doctests ready for review within a day or so.

comment:19 in reply to: ↑ 16 ; follow-up: Changed 13 months ago by embray

Replying to gh-mwageringel:

This would still be difficult in case of multiline strings – though, I am not really sure why the preparser needs to run line by line.

It's definitely a mess. No one has taken the time to sit down and write a proper grammar for Sage and a real parser.

comment:20 in reply to: ↑ 18 Changed 13 months ago by embray

Replying to gh-jcamp0x2a:

After thinking it over, I realized I was over-thinking it :)

I was able to fix it by just converting to/from a list of Unicode code-points when out of quote delimiters to use. The preparsed code doesn't look pretty, but it only happens in one of those rare cases mentioned. Ugly yet functional over pretty yet broken.

sage: preparse('f"""{f\'\'\'{f"{f\'{3.14}\'}"}\'\'\'"""')
'f"""{f\'\'\'{f"{f\'{RealNumber(str().join(map(chr, [51, 46, 49, 52])))}\'}"}\'\'\'"""'

I apologize for the churn. I expect to get that format specifier problem ironed out and doctests ready for review within a day or so.

That's still pretty hideous (which I don't mean as a dig at you; it's a clever solution). But if it works for now I say go with it. The only "good" solution I think is to completely rewrite the parser from bottom up but that's a bigger project.

comment:21 Changed 13 months ago by git

  • Commit changed from 4125a682b004e1a34e66d20718a92a7d255577c1 to 7c65600a0f188f5916b876c62823a42866e55721

Branch pushed to git repo; I updated commit sha1. New commits:

90fb3a8Merge branch 'develop' into u/gh-jcamp0x2a/28974-f-strings
7c65600Fix handling of format specifier and add doctests

comment:22 Changed 13 months ago by gh-jcamp0x2a

  • Status changed from new to needs_review

I was able to resolve the issue with format specifiers and added some documentation and doctests to make it ready for review.

I cleaned up a bunch in sage.repl.preparse.strip_string_literals where most of my changes were focused. It's harder to understand now (due to the necessary added complexity of F-strings), but hopefully easier to understand than it otherwise would've been.

comment:23 Changed 13 months ago by git

  • Commit changed from 7c65600a0f188f5916b876c62823a42866e55721 to 1d6b9d6dd9e8f25f9d7e7c4b342da6f4fbf8b4d0

Branch pushed to git repo; I updated commit sha1. New commits:

1d6b9d6Fix issue trying to test for '"""' in doctest

comment:24 follow-up: Changed 13 months ago by gh-kliem

sage -t --long --random-seed=0 src/sage/doctest/sources.py
**********************************************************************
File "src/sage/doctest/sources.py", line 765, in sage.doctest.sources.FileDocTestSource._test_enough_doctests
Failed example:
    for path, dirs, files in itertools.chain(os.walk('sage'), os.walk('doc')): # long time
        path = os.path.relpath(path)
        dirs.sort(); files.sort()
        for F in files:
            _, ext = os.path.splitext(F)
            if ext in ('.py', '.pyx', '.pxd', '.pxi', '.sage', '.spyx', '.rst'):
                filename = os.path.join(path, F)
                FDS = FileDocTestSource(filename, DocTestDefaults(long=True, optional=True, force_lib=True))
                FDS._test_enough_doctests(verbose=False)
Expected:
    There are 3 unexpected tests being run in sage/doctest/parsing.py
    There are 1 unexpected tests being run in sage/doctest/reporting.py
Got:
    There are 3 unexpected tests being run in sage/doctest/parsing.py
    There are 1 unexpected tests being run in sage/doctest/reporting.py
    There are 5 tests in sage/repl/preparse.py that are not being run

Also there are several methods that need coverage.

comment:25 in reply to: ↑ 24 ; follow-ups: Changed 13 months ago by gh-jcamp0x2a

Replying to gh-kliem:

sage -t --long --random-seed=0 src/sage/doctest/sources.py
**********************************************************************
File "src/sage/doctest/sources.py", line 765, in sage.doctest.sources.FileDocTestSource._test_enough_doctests
Failed example:
    for path, dirs, files in itertools.chain(os.walk('sage'), os.walk('doc')): # long time
        path = os.path.relpath(path)
        dirs.sort(); files.sort()
        for F in files:
            _, ext = os.path.splitext(F)
            if ext in ('.py', '.pyx', '.pxd', '.pxi', '.sage', '.spyx', '.rst'):
                filename = os.path.join(path, F)
                FDS = FileDocTestSource(filename, DocTestDefaults(long=True, optional=True, force_lib=True))
                FDS._test_enough_doctests(verbose=False)
Expected:
    There are 3 unexpected tests being run in sage/doctest/parsing.py
    There are 1 unexpected tests being run in sage/doctest/reporting.py
Got:
    There are 3 unexpected tests being run in sage/doctest/parsing.py
    There are 1 unexpected tests being run in sage/doctest/reporting.py
    There are 5 tests in sage/repl/preparse.py that are not being run

I think the latest commit, 1d6b9d6, should resolve this.

Also there are several methods that need coverage.

I will push a commit to add doctests to these methods shortly.


Perhaps related, perhaps not: I noticed a segfault in the patchbot results when it was testing src/sage/interfaces/singular.py, and I was able to reproduce it locally. Interestingly, I still see it happening intermittently on develop, and I also see it in the patchbot results for other tickets like #9407 and #29541.

comment:26 Changed 13 months ago by git

  • Commit changed from 1d6b9d6dd9e8f25f9d7e7c4b342da6f4fbf8b4d0 to 01e2ca16f674eef93ea947f5702ab7990dd71e90

Branch pushed to git repo; I updated commit sha1. New commits:

01e2ca1Add doctests to various methods and fix safe_delimiter bug

comment:27 in reply to: ↑ 25 Changed 13 months ago by gh-jcamp0x2a

Replying to gh-jcamp0x2a:

Replying to gh-kliem:

Also there are several methods that need coverage.

I will push a commit to add doctests to these methods shortly.

Thank you for mentioning the need for these doctests. In the course of adding them, I found a bug in QuoteStack.safe_delimiter. Was popping an element from the set when I should've just been peeking at the first one.

comment:28 follow-up: Changed 13 months ago by gh-mwageringel

  • Reviewers set to Markus Wageringel
  • Status changed from needs_review to needs_work

Nice. Overall the implementation looks ok to me. Here are a couple of remarks.

Triple quotes cannot be used inside single or double quotes:

sage: f'{ f"{ 3.14 }" }'
  File "<ipython-input-9-f05825bdea91>", line 1
    f'{ f"{ RealNumber('''3.14''') }" }'
                          ^
SyntaxError: invalid syntax

Also, could this

sage: preparse(""" f'{ 3.14 }' + f"{ 3.14 }"  """)
 f'{ RealNumber('''3.14''') }' + f"{ RealNumber('''3.14''') }"

return the following instead?

 f'{ RealNumber("3.14") }' + f"{ RealNumber('3.14') }"

Change: they're -> they are

Please avoid assigning to the _ identifier. I have used this pattern before myself, but in Python, this assigns a value to the variable named _ which is not a descriptive name. It does not avoid warnings that an assigned variable is not used. Since IPython 6, this also conflicts with the IPython convention that _ denotes the result of the last computation. If you want to avoid assigning a tuple component to a name, you could use slices. For example:

-        sage: s, literals, _ = strip_string_literals(''' f'{ {"x":1, "y":2} }' '''); s
-        ' f%(L1)s{ {%(L2)s:1, %(L3)s:2} }%(L4)s '
-        sage: literals
-        {'L1': "'", 'L2': '"x"', 'L3': '"y"', 'L4': "'"}
+        sage: strip_string_literals(''' f'{ {"x":1, "y":2} }' ''')[:2]
+        (' f%(L1)s{ {%(L2)s:1, %(L3)s:2} }%(L4)s ',
+         {'L1': "'", 'L2': '"x"', 'L3': '"y"', 'L4': "'"})

Use quotes in docstrings in cases like these: (default: False) -> (default: ``False``)

When specifying the type in docstrings, simply use "boolean"/"string" instead of "a boolean"/"a string".

Typo(?):

-    '{{' and '}}' only escape sequences work in the literal portion of an F-string::
+    '{{' and '}}' escape sequences only work in the literal portion of an F-string::

Typo: specier -> specifier

Preferably add a space after commas, e.g.:

-            new_code.append(code[start:q].replace('%','%%'))
+            new_code.append(code[start:q].replace('%', '%%'))

Did you think of colons in slices here?

        elif ch == ':' and quote and not quote.parens and (quote.braces == 1 or quote.fmt_spec):
            if quote.braces == 1:
                # In a replacement section but outside of any nested braces or
                # parentheses, the colon signals the beginning of the format specifier.

For example:

sage: strip_string_literals("f'{ [1,2,3][:1]  } '")
('f%(L1)s{ [1,2,3][:%(L2)s}%(L3)s', {'L1': "'", 'L2': '1]  ', 'L3': " '"}, [])

Besides lambdas, dictionaries and slices, are there other possible uses of colons?

Check that the indices are ≥ 0:

                if code[q-1] == '\\':
                    k = 2
                    while code[q-k] == '\\':

Please change the comment to an explanation about why this is done – something along the lines that Python 3 does not accept leading 0s, but Sage does and interprets them as decimal.

                num = re.sub(r'^0+', '', num) # Strip leading zeroes.

comment:29 in reply to: ↑ 19 ; follow-up: Changed 13 months ago by gh-mwageringel

Replying to embray:

Replying to gh-mwageringel:

This would still be difficult in case of multiline strings – though, I am not really sure why the preparser needs to run line by line.

It's definitely a mess. No one has taken the time to sit down and write a proper grammar for Sage and a real parser.

Do I understand correctly that such a change would replace the preparser by a lexer and a parser in order to convert Sage code directly to an AST that can be compiled to byte code? Is this a long-term goal?

comment:30 in reply to: ↑ 28 ; follow-up: Changed 13 months ago by gh-jcamp0x2a

Replying to gh-mwageringel:

Nice. Overall the implementation looks ok to me. Here are a couple of remarks.

Triple quotes cannot be used inside single or double quotes: [...]

My relative inexperience with Python is showing. A case of the blind "leading" the seeing :) I don't think this will be difficult to remedy; just popping ''' and """ out of the set of safe delimiters whenever a ' or " is used.

Also, could this

sage: preparse(""" f'{ 3.14 }' + f"{ 3.14 }"  """)
 f'{ RealNumber('''3.14''') }' + f"{ RealNumber('''3.14''') }"

return the following instead?

 f'{ RealNumber("3.14") }' + f"{ RealNumber('3.14') }"

Not with the current approach. Both the ' and " would be used up, and since triple quotes would be invalid, it'd have to fallback to the list of ordinals. Could try to maintain a mapping between ranges in the processed code and sets of valid delimiters within instead of the single set of delimiters that applies to the whole thing. I'm not sure if the effort or cycles would be worth it, though. Is there a need for the preparsed code to be particularly human-readable? Perhaps to make error messages occuring within it more readable?

Did you think of colons in slices here?

I did not, and that's a very good catch. I think keeping track of unmatched square brackets as I do braces and parentheses will solve the slice problem.

Besides lambdas, dictionaries and slices, are there other possible uses of colons?

I will look into this further and check the Python grammar. I know trying to define a function or start an if/while/try block inside an F-string will raise a SyntaxError.


I agree 100% with your other points and will work on remedying them. It seems like the 9.2.beta builds are wrapping up, and I don't think I'll be able to get this ready in time for the last one, so I'm going to go ahead and change the milestone to 9.3.

Thank you for your time reviewing this ticket; it is most appreciated!

comment:31 Changed 13 months ago by gh-jcamp0x2a

  • Milestone changed from sage-9.2 to sage-9.3

comment:32 in reply to: ↑ 25 ; follow-up: Changed 12 months ago by gh-mwageringel

Replying to gh-jcamp0x2a:

Not with the current approach. Both the ' and " would be used up, and since triple quotes would be invalid, it'd have to fallback to the list of ordinals.

Ok, let us keep it like this then to avoid unnecessary complication. This only happens in the rare case of nested f-strings anyway. Personally, I have never inspected the transpiled code, so I agree there is no need to make it particularly pretty.

Replying to gh-jcamp0x2a:

Perhaps related, perhaps not: I noticed a segfault in the patchbot results when it was testing src/sage/interfaces/singular.py, and I was able to reproduce it locally.

This is not related. I was able to reproduce it with 9.2.beta11 and it is most likely related to the Pexpect upgrade in that beta. With 9.2.beta12 after the upgrade to Python 3.8, I am unable to reproduce it anymore, so the problem seems to have solved itself.

comment:33 in reply to: ↑ 32 Changed 12 months ago by gh-jcamp0x2a

Replying to gh-mwageringel:

Replying to gh-jcamp0x2a:

Perhaps related, perhaps not: I noticed a segfault in the patchbot results when it was testing src/sage/interfaces/singular.py, and I was able to reproduce it locally.

This is not related. I was able to reproduce it with 9.2.beta11 and it is most likely related to the Pexpect upgrade in that beta. With 9.2.beta12 after the upgrade to Python 3.8, I am unable to reproduce it anymore, so the problem seems to have solved itself.

That's a relief. Didn't look forward to trying to troubleshoot that! :)

comment:34 in reply to: ↑ 30 ; follow-up: Changed 12 months ago by gh-jcamp0x2a

Markus, I shall be pushing an updated branch shortly that seeks to address the issues you raised in your review.

Also, as to...

Replying to gh-jcamp0x2a:

Replying to gh-mwageringel:

Besides lambdas, dictionaries and slices, are there other possible uses of colons?

I will look into this further and check the Python grammar. I know trying to define a function or start an if/while/try block inside an F-string will raise a SyntaxError.

...I believe I've identified all of the uses of the colon in Python's grammar. My findings:

Function and class definitions: def foo():, class Foo:
not allowed inside F-strings
Control flow statements: if x:, elif x:, else:, while x:, for x in y:, try:, except:, finally:, with x as y:
not allowed inside F-strings
Dictionary literals: {"x": 5}
enclosed in braces, and the brace count is already checked before handling the colon
Lambda expressions: lambda x: expr
"Because lambdas use the ':' character, they cannot appear outside of parentheses in an expression." PEP 498 "Lambdas inside expressions". The parenthesis count is already checked before handling the colon.
Slice notation: x[start:stop:step]
enclosed in square brackets, and my latest push adds a square bracket check before handling the colon
Assignment expressions: x := y)
"Assignment expressions inside of f-strings require parentheses."PEP 572 "Exceptional cases", last bullet point. The parenthesis count is already checked before handling the colon.
Parameter annotations: def foo(x: int, y: float):
Function definitions aren't allowed inside F-strings. "lambda's syntax does not support annotations." PEP 3107, "Lambda". Even if they did, the entire lambda expression would still require parentheses around it.
Variable annotations / annotated assignment: x: int, x: int = 5
I'm least sure about this one, and I've been unable to find any solid documentation relating it with F-strings. My reading of the grammar suggests it's not allowed, and sure enough Python seems to reject any attempts to use such an annotation in an F-String. For example: f'{x: int}' and f'{x: int = 5}' seem to treat int and int = 5 as format specifiers and end up raising an error to that effect.

comment:35 Changed 12 months ago by git

  • Commit changed from 01e2ca16f674eef93ea947f5702ab7990dd71e90 to e626438fc43cf10b1e5f6159937dbd1632ca004e

Branch pushed to git repo; I updated commit sha1. New commits:

79dcf82Fix some typos, whitespace, and wording issues
9ab0497Merge branch 'develop' into u/gh-jcamp0x2a/28974-f-strings
9315106Remove ''' or """ from safe delimiters when ' or " is used
260b343Prefix QuoteStack.stack and safe_delims with _
81d409bKeep track of brackets to support slices in F-strings
21e2d74Revert usage of _ as a throwaway variable
e626438Add lower bounds check when looking for escaped quotes

comment:36 Changed 12 months ago by gh-jcamp0x2a

  • Status changed from needs_work to needs_review

comment:37 in reply to: ↑ 34 Changed 12 months ago by gh-mwageringel

  • Status changed from needs_review to needs_work

Thanks for the detailed analysis.

Replying to gh-jcamp0x2a:

Assignment expressions: x := y)
"Assignment expressions inside of f-strings require parentheses."PEP 572 "Exceptional cases", last bullet point. The parenthesis count is already checked before handling the colon.

I did not know about this syntax yet, but indeed there seems to be no problem.

Variable annotations / annotated assignment: x: int, x: int = 5
I'm least sure about this one, and I've been unable to find any solid documentation relating it with F-strings. My reading of the grammar suggests it's not allowed, and sure enough Python seems to reject any attempts to use such an annotation in an F-String. For example: f'{x: int}' and f'{x: int = 5}' seem to treat int and int = 5 as format specifiers and end up raising an error to that effect.

As far as I understand, these are statements, not expressions, and therefore cannot occur in f-strings.

Here are a few more small comments about the code:

The method safe_delimiters does not seem to be used anywhere. I think it would be fine to remove it – the purpose of the class is only for internal use in Sage anyway. The tests in its docstring seem to be mainly replicating the tests of push, so are not really needed either.

Since Python 3.6, dictionaries are order-preserving, so you could replace the uses of OrderedDict by dict.

This could be shortened:

-        return next(iter(self._safe_delims)) if self._safe_delims else None
+        return next(iter(self._safe_delims), None)
    Similarly, a colon inside brackets doesn't start the format specifier in order
    to allow slices.::

The .:: is not optimal here. Either remove the dot or add a space. This is how Sphinx interprets this:

slices.::    ⇒   slices.:
slices::     ⇒   slices:
slices. ::   ⇒   slices.

Further, the first paragraph of docstrings should not be more than one sentence. If you have more than one sentence, start a new paragraph. Also, the first sentence should use the imperative form. For example:

-        Adds a frame to the stack.
+        Add a frame to the stack.

Finally, I would like to mention that, with this branch, preparsing files is about 60% slower than before. That is quite a big difference, but it is still quite fast, so it seems to me this is ok. If it turns out this is a problem, we can look into improving the performance later on.

comment:38 Changed 12 months ago by git

  • Commit changed from e626438fc43cf10b1e5f6159937dbd1632ca004e to 4ff53d6c933a6cfaaf88e01ce15538861487ecad

Branch pushed to git repo; I updated commit sha1. New commits:

cb59ce3Address some feedback items and also some contractions I had missed
4ff53d6Replace _safe_delims OrderedDict by a couple booleans for ' and "

comment:39 Changed 12 months ago by gh-jcamp0x2a

Thanks Markus. I believe I have addressed most of your feedback in my latest push.

Replying to gh-mwageringel:

The method safe_delimiters does not seem to be used anywhere. I think it would be fine to remove it – the purpose of the class is only for internal use in Sage anyway. The tests in its docstring seem to be mainly replicating the tests of push, so are not really needed either.

Since Python 3.6, dictionaries are order-preserving, so you could replace the uses of OrderedDict by dict.

I have removed the _safe_delims dict entirely in favor of a couple booleans to keep track of ' and ". The triple-quoted versions would never be chosen anyway, since by the time they would (' and " used up), they'd also be invalid (since ''' can't appear within an F-string delimited by ' for example). All the doctests related to safe delimiters are moved to the safe_delimiter method itself.

Finally, I would like to mention that, with this branch, preparsing files is about 60% slower than before. That is quite a big difference, but it is still quite fast, so it seems to me this is ok. If it turns out this is a problem, we can look into improving the performance later on.

That is a problem. A small overall drop in performance would be ok, I think, as would a bigger drop when dealing with code that uses F-strings extensively. That big of a drop for preparsing files in general seems unacceptable, though.

Looks like I've got some more testing and tweaking to do. I'll leave the ticket as needs_work.

comment:40 Changed 12 months ago by git

  • Commit changed from 4ff53d6c933a6cfaaf88e01ce15538861487ecad to 3eedd36e37a5b1d1d966c078c2f02a05e5253060

Branch pushed to git repo; I updated commit sha1. New commits:

3eedd36Performance improvements to strip_string_literals

comment:41 Changed 12 months ago by gh-jcamp0x2a

  • Status changed from needs_work to needs_review

I've pushed a commit that attempts to improve the performance by:

  • Jumping between special chars (parentheses, quotes, etc.) using re.search similar to what the original algorithm was doing with str.find instead of handling every character ourselves.
  • Avoiding peeking at the top of the stack every special char, maintaining a local copy of the frame at the top.
  • Re-organizing the control flow to bail out as soon as possible including performing the checks for each special char in order of frequency found in the examples in src/doc and src/sage.

As my test case, I concatenated all of the code examples in src/doc and src/sage into a single file of about 400,000 lines. There is a single line that needs to be removed due to it using a variable name with special meaning to preparse_file.

$ grep -Phro "(?<=(sage|\.\.\.\.): ).+$" src/doc src/sage | grep -v "f(_sage_const_)" > preparser_test_input

I used this file to also determine the frequency of special chars in the Sage examples and thus inform the order in which to place their conditions.

$ fgrep -o "(" preparser_test_input  | wc -l
508806

I then ran the following in Sage to time how long it took to preparse the large test file:

with open('preparser_test_input', 'r') as f:
    contents = f.read()
print(timeit('sage.repl.preparse.preparse_file(contents)'))

My results the first time were as follows:

develop 5 loops, best of 3: 16.1 s per loop
prev commit 5 loops, best of 3: 23.5 s per loop
new commit 5 loops, best of 3: 17.9 s per loop

...and a second time, mostly the same:

develop 5 loops, best of 3: 16.1 s per loop
prev commit 5 loops, best of 3: 23.5 s per loop
new commit 5 loops, best of 3: 18 s per loop

So from a ~45% increase to ~10%, which I'm personally OK with now. I'm curious if others will get the same results.

comment:42 follow-up: Changed 12 months ago by gh-mwageringel

Thank you for the analysis. The changes look good to me. Now, I also see a slowdown of only about 10%, which seems fine.

I have just one more question regarding this comment:

# Deal with escaped quotes (odd number of backslashes preceding).

Could it be a problem here if we are inside a raw string? In this case an odd number of backslashes might not indicate an escaped quote. I could not find any example where this is a problem, so maybe there is no problem. Could you confirm this?

comment:43 in reply to: ↑ 42 Changed 12 months ago by gh-jcamp0x2a

Thank you for your time looking through the code again; I know there's been quite a bit of churn.

Replying to gh-mwageringel:

I have just one more question regarding this comment:

# Deal with escaped quotes (odd number of backslashes preceding).

Could it be a problem here if we are inside a raw string? In this case an odd number of backslashes might not indicate an escaped quote. I could not find any example where this is a problem, so maybe there is no problem. Could you confirm this?

This threw me for a loop when I first encountered it, too. I had thought that backslashes had no meaning in a raw string whatsoever, but apparently they can still be used to escape a quote character, although the backslash still appears in the output. For example:

>>> r'\'
  File "<stdin>", line 1
    r'\'
       ^
SyntaxError: EOL while scanning string literal
>>>
>>> r'\''
"\\'"

...and strip_string_literals has the following test:

TESTS:

    Even for raw strings, a backslash can escape a following quote::

        sage: s, literals, state = strip_string_literals(r"r'somethin\' funny'"); s
        'r%(L1)s'
        sage: dep_regex = r'^ *(?:(?:cimport +([\w\. ,]+))|(?:from +(\w+) +cimport)|(?:include *[\'"]([^\'"]+)[\'"])|(?:cdef *extern *from *[\'"]([^\'"]+)[\'"]))' # Ticket 5821

The ticket mentioned is #5821. It looks like prior to that ticket, raw-strings were excluded from the check for escaped quotes, resulting in a bug.

comment:44 follow-up: Changed 12 months ago by gh-jcamp0x2a

Oh, just to add to the example to show how python treats raw-strings with multiple backslashes:

>>> r'\\'
'\\\\'
>>> r'\\\'
  File "<stdin>", line 1
    r'\\\'
         ^
SyntaxError: EOL while scanning string literal
>>> r'\\\\'
'\\\\\\\\'
>>> r'\\\\\'
  File "<stdin>", line 1
    r'\\\\\'
           ^
SyntaxError: EOL while scanning string literal

So the difference between even/odd numbers of backslashes seems to carry over into raw-strings as well.

comment:45 in reply to: ↑ 44 ; follow-up: Changed 12 months ago by gh-mwageringel

  • Status changed from needs_review to positive_review

Replying to gh-jcamp0x2a:

So the difference between even/odd numbers of backslashes seems to carry over into raw-strings as well.

Ok, I see. I was not aware of that.

Let us move forward with this ticket then. Thanks again for all the effort that went into this.

comment:46 in reply to: ↑ 45 Changed 12 months ago by gh-jcamp0x2a

  • Milestone changed from sage-9.3 to sage-9.2

Changing the milestone just in case it is still possible to get the code into 9.2 since I haven't seen a release candidate build for 9.2 yet.

comment:47 Changed 11 months ago by mkoeppe

  • Milestone changed from sage-9.2 to sage-9.3

Since this is a pretty large change, I think it is better to merge it at the beginning of the 9.3 series.

comment:48 in reply to: ↑ 29 Changed 11 months ago by embray

Replying to gh-mwageringel:

Replying to embray:

Replying to gh-mwageringel:

This would still be difficult in case of multiline strings – though, I am not really sure why the preparser needs to run line by line.

It's definitely a mess. No one has taken the time to sit down and write a proper grammar for Sage and a real parser.

Do I understand correctly that such a change would replace the preparser by a lexer and a parser in order to convert Sage code directly to an AST that can be compiled to byte code? Is this a long-term goal?

Yes, exactly. The Sage interpreter should have its own parser based on extending the exiting Python grammar where necessary, plus a node transformer that transforms the AST into one that implements other preparser transformations, such as converting int literals to sage Integers.

I don't think it's formally a "long-term goal" of anyone, but it really should be. It's just a big task that no one's been motivated to take on. But it would make issues like this one, and any other extensions we might want to add to the Sage language, much more feasible. Python 3.8's new PEG parser (which we don't necessarily need to depend on Python 3.8 to use for Sage) might help make this easier: https://www.python.org/dev/peps/pep-0617/

comment:49 Changed 11 months ago by embray

I opened #30760 to provide some description to this task and put it out there for anyone who wants to tackle it. I would work on it myself if I had the time...

comment:50 Changed 11 months ago by vbraun

  • Branch changed from u/gh-jcamp0x2a/28974-f-strings to 3eedd36e37a5b1d1d966c078c2f02a05e5253060
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.