Skip to content

Fix parsing of integer literals with base prefix #106

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

wnienhaus
Copy link
Collaborator

MicroPython 1.25.0 introduced a breaking change, aligning the behaviour of the int() function closer to the behaviour of CPython (something along the lines of: strings are assumed to represent a decimal number, unless a base is specified. if a base of 0 is specified, is the base is inferred from the string)

This broke our parsing logic, which relied on the previous behaviour of the int() function to automatically determine the base of the string literal, based on a base prefix present in the string. Specifying base 0 was not a solution, as this resulted in parsing behaviour different from GNU as.

Additionally, we never actually parsed octal in the format 0100 correctly - even before this PR; that number would have been interpreted as 100 rather than 64.

So, to fix this, and to ensure our parsing matches the GNU assembler, this PR implements a custom parse_int() function, using the base prefix in a string to determine the correct base to pass to int(). The following are supported:

  • 0x -> treated as hex
  • 0b -> treated as binary
  • 0... -> treated as octal
  • 0o -> treated as octal
  • anything else parsed as decimal

The parse_int method also supports the negative prefix operator for all of the above cases.

This change also ensures .int, .long, .word directives correctly handle the above mentioned formats. This fixes the issue described in #104.

Note: GNU as does not actually accept the octal prefix 0o..., but we accept it as a convenience, as this is accepted in Python code. This means however, that our assembler accepts code which GNU as does not accept. But the other way around, we still accept all code that GNU as accepts, which was one of our goals.

@wnienhaus wnienhaus self-assigned this Jun 19, 2025
@wnienhaus wnienhaus requested a review from ThomasWaldmann June 19, 2025 20:25
@wnienhaus wnienhaus removed their assignment Jun 19, 2025
@wnienhaus wnienhaus force-pushed the fix-int-parsing-with-base-prefix branch from 9452423 to 23f8ab4 Compare June 19, 2025 20:42
@wnienhaus
Copy link
Collaborator Author

After merging #107 the tests now pass.

Copy link
Member

@dpgeorge dpgeorge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying base 0 was not a solution, as this resulted in parsing behaviour different from GNU as.

I guess the simplest fix here would be to just replace int(x) with int(x, 0). That should restore the existing behaviour. But it looks like you want to improve things further, which is great!

parts = "".join(parts)
if not validate_expression(parts):
raise ValueError('Unsupported expression: %s' % parts)
return eval(parts)


def parse_int(literal):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with this code base, but would it make sense to factor this function out into a separate file, so it can be reused in opcodes_s2.py?

Similarly, could have a single unit test for this function in a separate testing file.

(Just a suggestion 😄 )

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, also from a code (de-)duplication aspect.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed. I tend to like keeping structural changes and logic changes in different commits/PRs, to make each change more clear. So in this case I kept the duplication we already had and therefore added parse_int twice, with a planned followup PR to reduce the duplication.

The existing duplication was there to make the various opcodes* modules have the same interface to the outside world.

But I notice now, that parse_int is not used anywhere else than inside the opcodes* modules, so the function doesn't have to form part of the interface. I can simply have it once in the util module like for example the existing split_tokens function.

I will make that change, to have only one parse_int implementation and therefore only 1 set of unit tests for the function. Thanks for challenging this.

(I might consider a future PR to see if other code can be de-duplicated.)

@mjaspers2mtu
Copy link

Hey @wnienhaus , tested with the ulp programs on my s2, and had no issues 👍

@wnienhaus
Copy link
Collaborator Author

wnienhaus commented Jun 30, 2025

Specifying base 0 was not a solution, as this resulted in parsing behaviour different from GNU as.

I guess the simplest fix here would be to just replace int(x) with int(x, 0). That should restore the existing behaviour. But it looks like you want to improve things further, which is great!

Yes, int(x, 0) would have restored the previous behaviour, but it wasn't the behaviour we needed. Of course it wasn't the behaviour we needed even before, so this PR technically fixes 2 things - adapt to the new MicroPython behaviour and fix parsing behaviour to match GNU as.

That said, since int(x, 0) exists, and because it's really just octal parsing we need extra, I just tried this simpler approach:

def parse_int(literal):
    if len(literal) > 2:
        prefix_start = 1 if literal[0] == '-' else 0  # skip over negative sign if present
        if literal[prefix_start] == '0' and literal[prefix_start+1] in '123456789':
            return int(literal, 8)

    return int(literal, 0)

and all tests still pass.

So it's really just the octal case that's different (and theoretically we should disallow python style octal (0b..), but I had already decided to support it).

Now I am starting the overthink this:

  • what is better for clarity and/or long-term stability? To handle all cases we support explicitly (as I am doing now)? Or to just handle the extra octal case we need?
  • I see very little performance difference, and the shorter code saves perhaps a few bytes of memory, but it's probably not worth quibbling over.

I think I'll keep the current approach, as it's very explicit about what we support (including explicitly supporting python style octal). I'll just remove the comment about legacy octal format, because from the GNU as perspective, it's the currently valid and only possible octal format.

(Happy to get feedback on my chosen approach)

@wnienhaus
Copy link
Collaborator Author

Ok. Fixes pushed. Will squash-merge this once approved.

@wnienhaus wnienhaus force-pushed the fix-int-parsing-with-base-prefix branch from e4a7e33 to f7dfddc Compare June 30, 2025 11:09
remove duplication
@wnienhaus wnienhaus force-pushed the fix-int-parsing-with-base-prefix branch from f7dfddc to 5c84d08 Compare June 30, 2025 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants