Fix parsing of integer literals with base prefix #106

wnienhaus · 2025-06-19T20:24:57Z

MicroPython 1.25.0 introduced a breaking change, aligning the behaviour of the int() function more closely with CPython (something along the lines of: strings are assumed to represent a decimal number unless a base is specified; if a base of 0 is specified, the base is inferred from the string).

This broke our parsing logic, which relied on the previous behaviour of the int() function to automatically determine the base of the string literal, based on a base prefix present in the string. Specifying base 0 was not a solution, as this resulted in parsing behaviour different from GNU as.

Additionally, we never actually parsed octal in the format 0100 correctly - even before this PR; that number would have been interpreted as 100 rather than 64.
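To make the difference concrete, here is a small sketch of how int() treats the literals in question. It was written against CPython, which is the behaviour MicroPython 1.25.0 moved towards; MicroPython may still differ in edge cases. The helper name try_int is hypothetical and only exists to turn a ValueError into None for printing.

```python
def try_int(text, *args):
    """Return int(text, *args), or None if the literal is rejected."""
    try:
        return int(text, *args)
    except ValueError:
        return None


# No base given: the string must be decimal, so a base prefix is rejected.
print(try_int("0x10"))       # None
print(try_int("0100"))       # 100 (leading zeros are simply ignored)

# Base 0: the base is inferred from the prefix, as in Python source code.
print(try_int("0x10", 0))    # 16
print(try_int("0o100", 0))   # 64
print(try_int("0100", 0))    # None (no legacy octal in Python 3 syntax)

# What GNU as style octal needs instead: an explicit base of 8.
print(try_int("0100", 8))    # 64
```

Neither the default base nor base 0 yields 64 for 0100, which is why a dedicated parsing function is needed.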

So, to fix this, and to ensure our parsing matches the GNU assembler, this PR implements a custom parse_int() function, using the base prefix in a string to determine the correct base to pass to int(). The following are supported:

0x -> treated as hex
0b -> treated as binary
0... -> treated as octal
0o -> treated as octal
anything else parsed as decimal

The parse_int method also supports the negative prefix operator for all of the above cases.
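The rules above can be sketched roughly as follows. This is an illustration of the listed prefixes, not the code from this PR; only the name parse_int and the prefix rules come from the description, the rest is an assumption.

```python
def parse_int(literal):
    """Parse an integer literal using the prefix rules listed above (sketch)."""
    # Strip an optional leading minus sign and remember it.
    negative = literal.startswith('-')
    digits = literal[1:] if negative else literal

    prefix = digits[:2]
    if prefix == '0x':
        value = int(digits[2:], 16)   # hex
    elif prefix == '0b':
        value = int(digits[2:], 2)    # binary
    elif prefix == '0o':
        value = int(digits[2:], 8)    # Python-style octal (not valid in GNU as)
    elif len(digits) > 1 and digits[0] == '0':
        value = int(digits, 8)        # GNU as style octal, e.g. 0100
    else:
        value = int(digits, 10)       # plain decimal

    return -value if negative else value


print(parse_int('0x10'))    # 16
print(parse_int('0100'))    # 64
print(parse_int('-0b101'))  # -5
```

Invalid digits for the chosen base (for example 08) still raise ValueError from int(), so malformed literals are not silently accepted.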

This change also ensures .int, .long, .word directives correctly handle the above mentioned formats. This fixes the issue described in #104.

Note: GNU as does not actually accept the octal prefix 0o..., but we accept it as a convenience, since it is valid in Python code. This means, however, that our assembler accepts code which GNU as does not accept. Conversely, we still accept all code that GNU as accepts, which was one of our goals.

wnienhaus · 2025-06-19T20:45:10Z

After merging #107 the tests now pass.

dpgeorge

Specifying base 0 was not a solution, as this resulted in parsing behaviour different from GNU as.

I guess the simplest fix here would be to just replace int(x) with int(x, 0). That should restore the existing behaviour. But it looks like you want to improve things further, which is great!

dpgeorge · 2025-06-20T01:57:38Z

esp32_ulp/opcodes.py

    parts = "".join(parts)
    if not validate_expression(parts):
        raise ValueError('Unsupported expression: %s' % parts)
    return eval(parts)


+def parse_int(literal):


I'm not familiar with this code base, but would it make sense to factor this function out into a separate file, so it can be reused in opcodes_s2.py?

Similarly, could have a single unit test for this function in a separate testing file.

(Just a suggestion 😄 )

ThomasWaldmann · 2025-06-20

Sounds good, also from a code (de-)duplication aspect.

wnienhaus · 2025-06-30

Yes indeed. I tend to like keeping structural changes and logic changes in different commits/PRs, to make each change more clear. So in this case I kept the duplication we already had and therefore added parse_int twice, with a planned followup PR to reduce the duplication.

The existing duplication was there to make the various opcodes* modules have the same interface to the outside world.

But I notice now that parse_int is not used anywhere other than inside the opcodes* modules, so the function doesn't have to form part of the interface. I can simply have it once in the util module, like the existing split_tokens function.

I will make that change, to have only one parse_int implementation and therefore only 1 set of unit tests for the function. Thanks for challenging this.

(I might consider a future PR to see if other code can be de-duplicated.)

mjaspers2mtu · 2025-06-22T12:36:10Z

Hey @wnienhaus , tested with the ulp programs on my s2, and had no issues 👍

wnienhaus · 2025-06-30T10:24:17Z

Specifying base 0 was not a solution, as this resulted in parsing behaviour different from GNU as.

I guess the simplest fix here would be to just replace int(x) with int(x, 0). That should restore the existing behaviour. But it looks like you want to improve things further, which is great!

Yes, int(x, 0) would have restored the previous behaviour, but it wasn't the behaviour we needed. Of course it wasn't the behaviour we needed even before, so this PR technically fixes 2 things: it adapts to the new MicroPython behaviour and fixes the parsing behaviour to match GNU as.

That said, since int(x, 0) exists, and because the only extra thing we need is octal parsing, I just tried this simpler approach:

def parse_int(literal):
    if len(literal) > 2:
        prefix_start = 1 if literal[0] == '-' else 0  # skip over negative sign if present
        if literal[prefix_start] == '0' and literal[prefix_start+1] in '123456789':
            return int(literal, 8)

    return int(literal, 0)

and all tests still pass.
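For comparison, a slightly more defensive variant of the shorter approach could look like the sketch below. It is an assumption, not code from this PR, and the name parse_int_simple is hypothetical. It checks for any digit after the leading zero, so two-character literals such as 07 and literals such as 0012 are also routed to base 8; CPython's int(x, 0) rejects both.

```python
def parse_int_simple(literal):
    """Special-case only GNU as style octal; defer everything else to int(x, 0)."""
    # Look at the part after an optional minus sign.
    digits = literal[1:] if literal.startswith('-') else literal

    # A leading zero followed by another digit means octal in GNU as.
    if len(digits) > 1 and digits[0] == '0' and digits[1].isdigit():
        return int(literal, 8)

    # Hex (0x), binary (0b), Python-style octal (0o) and decimal.
    return int(literal, 0)


print(parse_int_simple('0100'))   # 64
print(parse_int_simple('-07'))    # -7
print(parse_int_simple('0x1F'))   # 31
```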

So it's really just the octal case that's different (and theoretically we should disallow Python-style octal (0o..), but I had already decided to support it).

Now I am starting to overthink this:

What is better for clarity and/or long-term stability? To handle all cases we support explicitly (as I am doing now)? Or to just handle the extra octal case we need?
I see very little performance difference, and the shorter code saves perhaps a few bytes of memory, but it's probably not worth quibbling over.

I think I'll keep the current approach, as it's very explicit about what we support (including explicitly supporting python style octal). I'll just remove the comment about legacy octal format, because from the GNU as perspective, it's the currently valid and only possible octal format.

(Happy to get feedback on my chosen approach)

wnienhaus · 2025-06-30T10:46:48Z

Ok. Fixes pushed. Will squash-merge this once approved.

wnienhaus self-assigned this Jun 19, 2025

wnienhaus requested a review from ThomasWaldmann June 19, 2025 20:25

wnienhaus removed their assignment Jun 19, 2025

wnienhaus mentioned this pull request Jun 19, 2025

Update builder image to ubuntu-22.04 #107

Merged

fix parsing of integer literals with base prefix

23f8ab4

wnienhaus force-pushed the fix-int-parsing-with-base-prefix branch from 9452423 to 23f8ab4 Compare June 19, 2025 20:42

dpgeorge reviewed Jun 20, 2025

mjaspers2mtu mentioned this pull request Jun 22, 2025

improve argument evaluation #104

Closed

wnienhaus force-pushed the fix-int-parsing-with-base-prefix branch from e4a7e33 to f7dfddc Compare June 30, 2025 11:09

move parse_int to util module

5c84d08

remove duplication

wnienhaus force-pushed the fix-int-parsing-with-base-prefix branch from f7dfddc to 5c84d08 Compare June 30, 2025 11:12
