Interface version canonicalization #536

Draft
wants to merge 1 commit into
base: main

lann
Contributor

@lann lann commented Jun 25, 2025

See #534

  • I stuck fullversion in the import/export productions rather than interfacename because I wanted it to be clear that it wouldn't be lowered into the core name.
  • The version canonicalization rules are adapted from "Add BuildTargets.md" (#378). I'm still leaning toward omitting prerelease versions but I've only thought "medium hard" about it.
  • Still needs binary encoding; see comment below.
  • Not sure how best to capture the discussion about making canonicalization mandatory pre-1.0; the "Binary Warts" section doesn't seem quite right.

@lann lann force-pushed the truncated-versions branch 3 times, most recently from 7b6bd7d to 2f8eda8 Compare June 25, 2025 20:46
@lann lann changed the title from "WIP: Truncated interface versions" to "Interface version canonicalization" Jun 25, 2025
@lann
Contributor Author

lann commented Jun 25, 2025

For the binary encoding the most straightforward option from a quick review would seem to be adding variants of importname' / exportname' along the lines of:

importname' ::= 0x00 len:<u32> in:<importname>                       => in  (if len = |in|)
              | 0x01 len:<u32> in:<importname> fullverlen:<u16> fullver:<valid semver>

I suppose if we wanted to optimize the binary a bit this extra field could contain just the part of the original version that got lopped off by canonicalization.
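
To make the proposed layout concrete, here is a minimal decoding sketch in Python. It assumes both lengths are unsigned LEB128 integers (the field width of fullverlen is still an open question here), that strings are UTF-8, and that the 0x01 variant carries the canonical name followed by the full version. The function names are hypothetical and not taken from any existing tool.

```python
def read_u32_leb128(buf: bytes, pos: int):
    """Decode an unsigned LEB128 u32 starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if shift >= 35:  # a u32 never needs more than 5 LEB128 bytes
            raise ValueError("LEB128 value too long for u32")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            break
    if result > 0xFFFFFFFF:
        raise ValueError("LEB128 value out of range for u32")
    return result, pos


def read_string(buf: bytes, pos: int):
    """Read a length-prefixed UTF-8 string; return (text, next_pos)."""
    length, pos = read_u32_leb128(buf, pos)
    end = pos + length
    if end > len(buf):
        raise ValueError("truncated string")
    return buf[pos:end].decode("utf-8"), end


def decode_importname_prime(buf: bytes):
    """Return (name, fullver); fullver is None for the existing 0x00 variant."""
    tag = buf[0]
    name, pos = read_string(buf, 1)
    if tag == 0x00:
        return name, None
    if tag == 0x01:
        # Assumption: the extra field holds the complete original version.
        fullver, pos = read_string(buf, pos)
        return name, fullver
    raise ValueError(f"unknown importname' discriminant: {tag:#04x}")
```

A decoder built this way keeps the name used for linking separate from the extra version information, so tools that ignore the second field still see the canonical name.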

On this field width:

fullverlen:<u16>

https://semver.org/#does-semver-have-a-size-limit-on-the-version-string

No, but use good judgment. A 255 character version string is probably overkill, for example. Also, specific systems may impose their own limits on the size of the string.

🤷

@lukewagner
Member

@lann Thanks for starting this! For the binary encoding question: yes, taking over the 0x00 byte and using it as a discriminant is a nice coincidence we can take advantage of (and could you update the corresponding bullet in the "Warts" section at the end)?

I suppose if we wanted to optimize the binary a bit this extra field could contain just the part of the original version that got lopped off by canonicalization.

Is there a simplicity argument to be made that requiring the concatenation of the version and the fullversion to match <valid semver> is simpler than allowing the fullversion to be <valid semver> and then adding the additional validation requirement (which I assume we want) that the fullversion has to "match" the version? If so, that could be a second argument in favor in addition to size.
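
As a rough illustration of the two validation strategies being compared, the sketch below uses the regular expression suggested on semver.org. Both function names are hypothetical, and the "match" rule in the second function is only one loose reading of what matching could mean (a prefix that ends on a component boundary), not something settled in this thread.

```python
import re

# Regular expression suggested by semver.org for validating a version string.
SEMVER_RE = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?",
    re.ASCII,
)


def valid_by_concatenation(version: str, suffix: str) -> bool:
    """Option A: the extra field is only the removed part; one check suffices."""
    return SEMVER_RE.fullmatch(version + suffix) is not None


def valid_by_prefix_match(version: str, fullversion: str) -> bool:
    """Option B: the extra field is a full semver that must also match `version`."""
    if SEMVER_RE.fullmatch(fullversion) is None:
        return False
    if not fullversion.startswith(version):
        return False
    rest = fullversion[len(version):]
    # Assumed rule: the shared prefix must end on a component boundary.
    return rest == "" or rest[0] in ".-+"
```

Option A needs a single regular-expression check, while option B needs the same check plus a separate consistency rule, which is the simplicity argument in question.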

Member

@lukewagner lukewagner left a comment

Looking good! A few drive-by comments:

Member

@lukewagner lukewagner left a comment

(oops, meant to "comment" not approve before it's even ready to review 🙃 )

@alexcrichton
Collaborator

For the binary encoding, here's another possible encoding:

importname' ::= 0x00 len:<u32> in:<importname>                       => in  (if len = |in|)
              | 0x01 len:<u32> in:<importname>                       => "${in.name}@N"  (if len = |in|,  in.version = N.*)
              | 0x02 len:<u32> in:<importname>                       => "${in.name}@0.N"  (if len = |in|,  in.version = 0.N.*)
              | 0x03 len:<u32> in:<importname>                       => "${in.name}@0.0.N"  (if len = |in|,  in.version = 0.0.N.*)

maybe with affordances for rc/etc.; I'm unsure. The basic idea though is that the actual import name in the binary format would always carry a full version, of the form foo:bar/<interface>@<major>.<minor>.<patch>, but the semantic meaning (e.g. the text format) would be a subslice of such a string. This codifies that in the binary format it's always a valid semver and the discriminant byte says basically how to shorten it. The goal here would be to keep it pretty clear what the binary format can contain without changing the meaning at the parsed layer.
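
A small Python sketch of how a parser might apply the discriminant under this encoding. The function name and the example names are hypothetical, and dropping prerelease/build metadata before shortening is an assumption, since the treatment of rc versions is left open above.

```python
def semantic_name(discriminant: int, full_name: str) -> str:
    """Shorten a fully versioned name according to the discriminant byte."""
    if discriminant == 0x00:
        return full_name  # existing behaviour: the name is used as-is
    if discriminant not in (0x01, 0x02, 0x03):
        raise ValueError(f"unknown discriminant: {discriminant:#04x}")
    base, at, version = full_name.partition("@")
    if not at:
        raise ValueError("name has no version to shorten")
    # Assumption: prerelease and build metadata are ignored when shortening.
    core = version
    for sep in "-+":
        core = core.split(sep, 1)[0]
    parts = core.split(".")
    if len(parts) != 3:
        raise ValueError("version must have major.minor.patch components")
    keep = discriminant  # 0x01 keeps N, 0x02 keeps 0.N, 0x03 keeps 0.0.N
    if any(p != "0" for p in parts[: keep - 1]):
        raise ValueError("discriminant does not fit the version")
    return f"{base}@{'.'.join(parts[:keep])}"
```

For example, `semantic_name(0x02, "ns:pkg/iface@0.2.3")` yields `"ns:pkg/iface@0.2"`, while the same discriminant applied to a 1.x version is rejected.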

fullverlen:<u16>

https://semver.org/#does-semver-have-a-size-limit-on-the-version-string

No, but use good judgment. A 255 character version string is probably overkill, for example. Also, specific systems may impose their own limits on the size of the string.

For this I'd recommend using <u32> regardless. We already limit many strings far below the theoretical 4G limit with a 32-bit length and keeping <u32> makes it more consistent with the rest of the decoding process. Otherwise when implementing a decoder you'd have to implement a specific function for decoding a 16-bit LEB which is otherwise not required when parsing WebAssembly today. Basically while I agree that >255 characters for a version is silly, I'd say that for consistency with the rest of the binary format this'd want to be <u32> if we go with this variant.

@lukewagner
Member

@alexcrichton Good idea; that cleanly answers some of the questions above. My only light concern is that tools might just treat the <importname> as the name and miss the nuance of chopping off parts of the versions. I suppose tests and common low-level tools could catch/factor-out most of this though. But if we go this direction: I suppose technically we don't even need the {0, 1, 2, 3} opcode; it could just be derived from the full <valid semver> string, making version canonicalization a binary encoding detail. Thoughts?

@alexcrichton
Collaborator

I agree yeah there's risk since the name in the binary format is "so simple", but yeah that's also where I'd hope that tests could weed things out. It'd be pretty simple in parser libraries I'd imagine to avoid exposing the full name as the import name if the discriminant was present.

My thinking though was that the name always has a full and valid semver, as defined by semver itself. That way the discriminant says what the "real" import name is (e.g. chopping off other stuff) for linking/semantic purposes. Although I may be misunderstanding what you're thinking about how to drop the discriminant?

@lann
Contributor Author

lann commented Jun 30, 2025

I think @lukewagner is suggesting that the differences between 1/2/3 can be derived from the string itself. The algo would be something like:

  • starting at @:
  • if the string between @ and the first . isn't 0, truncate at the first . (keeping only the major version)
  • if the string between @ and the second . isn't 0.0, truncate at the second . (keeping 0.minor)
  • otherwise, truncate immediately after the digits that follow the second . (the next character, if any, should only be a - or +)
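
The steps above can be sketched in Python roughly as follows. This assumes the input has already been validated as a full semver; the function names and example names are hypothetical.

```python
def canonical_version(full: str) -> str:
    """Derive the canonical version from a full, already validated semver."""
    first_dot = full.index(".")
    major = full[:first_dot]
    if major != "0":
        return major  # e.g. 1.2.3 -> 1
    second_dot = full.index(".", first_dot + 1)
    if full[:second_dot] != "0.0":
        return full[:second_dot]  # e.g. 0.2.3 -> 0.2
    # 0.0.N: keep the patch digits, drop any -prerelease or +build part.
    end = second_dot + 1
    while end < len(full) and full[end] in "0123456789":
        end += 1
    return full[:end]  # e.g. 0.0.3-rc.1 -> 0.0.3


def canonical_name(name: str) -> str:
    """Apply canonical_version to the part after '@', if there is one."""
    base, at, version = name.partition("@")
    if not at:
        return name
    return f"{base}@{canonical_version(version)}"
```

Because the result depends only on the version string, no extra discriminant values are needed to tell the three cases apart.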

@alexcrichton
Collaborator

Ah I see! So something like (as a transition to the future):

importname' ::= 0x00 len:<u32> in:<importname>  => in  (if len = |in|)
              | 0x01 len:<u32> in:<importname>  => "${in.name}@${in.canonver}"  (if len = |in|)

where in the future we'd drop 0x00 entirely (and possibly rename 0x01 to 0x00). The <importname> is always required to have a full and valid semver too?

@lann
Contributor Author

lann commented Jun 30, 2025

I think of the "semver-aware" options I prefer @lukewagner's 1 (extra) discriminant option; if you are parsing semver anyway then the logic is only marginally more complex than the 3 discriminant option.

I'm more ambivalent on whether the parser should be semver-aware. I like the conceptual simplicity of "the name is the name" but this is a binary encoding and if we're going to require validation of semver then we're probably already committing to most of that code complexity anyway.

@lann lann force-pushed the truncated-versions branch from 2f8eda8 to 6d56eaf Compare June 30, 2025 19:27
@lann
Contributor Author

lann commented Jul 1, 2025

We discussed this in a meeting today and decided to simplify a bit:

  • In the text format fullversion will change to versionsuffix and hold just the part of the full version that is removed by canonicalization
  • The binary format will use two strings: the canonicalized import name and the versionsuffix
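
One way to picture the two-string representation is the Python sketch below, which splits a fully versioned name into the canonicalized name and the versionsuffix and joins them back. The regular expression, function names, and example names are illustrative assumptions rather than text from the explainer, and the sketch does not validate the suffix.

```python
import re

# Canonical prefix of a version: N (N != 0), 0.N (N != 0), or 0.0.N.
_CANON_RE = re.compile(r"[1-9][0-9]*|0\.[1-9][0-9]*|0\.0\.[0-9]+")


def split_name(full_name: str):
    """Return (canonical_name, versionsuffix) for a fully versioned name."""
    base, at, version = full_name.partition("@")
    if not at:
        return full_name, ""
    match = _CANON_RE.match(version)
    if match is None:
        raise ValueError(f"cannot canonicalize version: {version!r}")
    canon = match.group(0)
    return f"{base}@{canon}", version[len(canon):]


def join_name(canonical_name: str, versionsuffix: str) -> str:
    """Rebuild the fully versioned name from the two stored strings."""
    return canonical_name + versionsuffix
```

With this split, `join_name(*split_name(n))` returns `n` unchanged, so the full version stays recoverable while linking only needs the first string.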

@lann lann force-pushed the truncated-versions branch from 6d56eaf to d3efc82 Compare July 1, 2025 22:49
@lann
Contributor Author

lann commented Jul 1, 2025

After spending way too much time staring at SemVer and SemVer accessories I have a new draft of the explainer changes. I ran out of time to edit so hopefully it's still coherent...

I should be able to get to the binary format changes tomorrow.

3 participants