Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a regression test for url_search_params::sort taken from WPT #861

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

npaun
Copy link

@npaun npaun commented Jan 30, 2025

While working on integrating Web Platform Tests as a test suite for Cloudflare's worked, I discovered a situation where ada-url's implementation of url_search_params::sort appears not to be following the URL spec.

The URL spec provides the following requirements for URLSearchParams.sort:

The sort() method steps are:

  1. Sort all tuples in this’s list, if any, by their names. Sorting must be done by comparison of code units. The relative order between tuples with equal names must be preserved.

https://url.spec.whatwg.org/#example-searchparams-sort

WPT's urlsearchparams-sort.any.js test uses the following example to demonstrate this:

{
"input": "ffi&🌈", // 🌈 > code point, but < code unit because two code units
"output": [["🌈", ""], ["ffi", ""]]
},
https://github.com/web-platform-tests/wpt/blob/master/url/urlsearchparams-sort.any.js#L11

So the spec requires "comparsion of code units" but as far as I can tell, it isn't very clear about what encoding should be used to perform the comparsion. Based on the WPT test, and checking other implementations like Node and Chromium, the encoding used is UTF-16. This makes sense, as this matches the behaviour of Javascript's String type.

I can fix this by implementing something like ada::idna::utf8_to_utf32 to do a conversion from utf8 to utf16 before sorting the keys, but please let me know if you have an alternative approach in mind - I'd be happy to implement it that way instead.

@anonrig anonrig requested a review from lemire January 30, 2025 22:29
@npaun npaun force-pushed the npaun/urlsearchparams-sort-wpt-bug branch from aa8bdf3 to fe2efd1 Compare January 30, 2025 22:41
@lemire
Copy link
Member

lemire commented Jan 30, 2025

@npaun Sorting in UTF-32 and UTF-8 should be equivalent.

Regarding UTF-16, it will differ from UTF-32/UTF-8 but only when surrogate pairs are involved (I believe this is correct).

I asked my favorite LLM to help me and it suggests something like this...

int compareUtf8AsUtf16(const std::string& str1, const std::string& str2) {
    size_t i = 0;
    size_t j = 0;

    while (i < str1.length() && j < str2.length()) {
        uint32_t codePoint1 = 0;
        uint32_t codePoint2 = 0;

        // Decode UTF-8 code point from str1
        unsigned char c1 = static_cast<unsigned char>(str1[i]);
        if (c1 <= 0x7F) {
            codePoint1 = c1;
            i++;
        } else if (c1 <= 0xDF) {
            codePoint1 = ((c1 & 0x1F) << 6) | (static_cast<unsigned char>(str1[i + 1]) & 0x3F);
            i += 2;
        } else if (c1 <= 0xEF) {
            codePoint1 = ((c1 & 0x0F) << 12) | ((static_cast<unsigned char>(str1[i + 1]) & 0x3F) << 6) | (static_cast<unsigned char>(str1[i + 2]) & 0x3F);
            i += 3;
        } else  {
            codePoint1 = ((c1 & 0x07) << 18) | ((static_cast<unsigned char>(str1[i + 1]) & 0x3F) << 12) | ((static_cast<unsigned char>(str1[i + 2]) & 0x3F) << 6) | (static_cast<unsigned char>(str1[i + 3]) & 0x3F);
            i += 4;
        } 


        // Decode UTF-8 code point from str2 (same logic as above)
        unsigned char c2 = static_cast<unsigned char>(str2[j]);
        if (c2 <= 0x7F) {
            codePoint2 = c2;
            j++;
        } else if (c2 <= 0xDF) {
            codePoint2 = ((c2 & 0x1F) << 6) | (static_cast<unsigned char>(str2[j + 1]) & 0x3F);
            j += 2;
        } else if (c2 <= 0xEF) {
            codePoint2 = ((c2 & 0x0F) << 12) | ((static_cast<unsigned char>(str2[j + 1]) & 0x3F) << 6) | (static_cast<unsigned char>(str2[j + 2]) & 0x3F);
            j += 3;
        } else {
            codePoint2 = ((c2 & 0x07) << 18) | ((static_cast<unsigned char>(str2[j + 1]) & 0x3F) << 12) | ((static_cast<unsigned char>(str2[j + 2]) & 0x3F) << 6) | (static_cast<unsigned char>(str2[j + 3]) & 0x3F);
            j += 4;
        } 

        // Handle surrogate pairs (emulate UTF-16 comparison without conversion)
        if (codePoint1 >= 0x10000) {
            codePoint1 -= 0x10000;
            if (codePoint2 < 0x10000) return 1; // str1 is "greater" (surrogate pair)
        } else if (codePoint2 >= 0x10000) {
            codePoint2 -= 0x10000;
            return -1; // str2 is "greater" (surrogate pair)
        }


        if (codePoint1 != codePoint2) {
            return (codePoint1 < codePoint2) ? -1 : 1;
        }
    }

    if (i < str1.length()) return 1; // str1 is longer
    if (j < str2.length()) return -1; // str2 is longer

    return 0; // Strings are equal
}

It is assuredly not correct and needs review, but it should be doable after a few fixes.

I can probably get it done in a few minutes.

Importantly, we can avoid conversion and memory allocation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants