Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Allow users hook early on request to Support Unicode and sanitization uses #765

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

eanavitarte
Copy link

❓ Type of change

  • 📖 Documentation (updates to the documentation, readme, or JSdoc annotations)
  • 🐞 Bug fix (a non-breaking change that fixes an issue)
  • 👌 Enhancement (improving an existing functionality like performance)
  • ✨ New feature (a non-breaking change that adds functionality)
  • 🧹 Chore (updates to the build process or auxiliary tools and libraries)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)

The Problem:

Sometimes, when dealing with non-ASCII characters, for example, for people who speak Spanish, French, Portuguese, Farsi, German, Polish, Russian, Chinese or Hindi, if they want to get a site that uses directly their language in urls, they'll will face this problem: URLs containing non-ASCII characters (e.g., accents, diacritics) can be encoded in uppercase or lowercase.

I mean, the characters can be uri-encoded as lowercase or uppercase, because technically they must be decoded again as case insensitive (See RFC 3986, Section 6.2.2.1).

So a word like: encyclopédie, can be:

  • Lowercase: encyclop%c3%a9die
  • Uppercase: encyclop%C3%A9die

And what's the problem with that? Matches! Because although 'encyclopédie' === 'encyclopédie', is not the same for 'encyclop%c3%a9die' !== 'encyclop%C3%A9die'.

'encyclopédie' === 'encyclopédie' // true
'encyclop%c3%a9die' !== 'encyclop%C3%A9die' // false

As so, if your customers want proper localization in urls, you will have a lot of problems, because on the one hand you'll get Chrome requests that properly implement the uppercased version. But, on the other hand you'll see that NodeJS threat those urls as lowercase.

This mismatch is headache, because the only way for fixing this is creating custom rules on front servers to handle this issue. Or... lose incoming request that humans intent to match the 'encyclopédie' word, but because of the different implementation for some Internet actors, they wont get. When the client do not explicitly encode their request (which is each time more in more common for apis for example) they just get a 404 or a 500 (depends on how'll handle mismatched requests).

Why do that happens?

Well first, Internet was mainly made for English speaker consumers. And some standards will take time to adapt. A lot of frameworks already deal with this situation properly, for example if you work with Golang/Caddy you'll see that they uses Unicode.

But NodeJS dont.

How to solve it?

Allowing users to customize this behaviour as soon as possible, using a really early hook for sanitizing the incoming nodejs request.

This will allow each user, to hook there, and fills their needs.

For example:

  • Making a double encoding for normalizing the incoming requests in lowercase (from node default) with their configuration in uppercase (which must be the standard, see RFC 3986, Section 2.1).
  • Or... completely decoding the uri for working, in their project, with a full UNICODE support for both, their configuration files and the incoming requests.

This wont hurts nobody, but will allow users to safety choose how to face this problem.

The Solution

As you can see in the pull request, I just add a new hook called 'onSanitizeRequest', and each user will be able to use if they want, allowing users to customize the url behaviour before any url logic is set for the app, but without moving (for safety reasons) the current onRequest hook.

// Call onSanitizeRequest hook
if (options.onSanitizeRequest) {
    await options.onSanitizeRequest(event);
}

So, the user is the one who choose how to sanitize the incoming requests. For example passing a decoder encoder:

createAppEventHandler(stack, {
    onSanitizeRequest: (event) => {
        event.node.req.url = decodeURI(event.node.req.url).replace(/[^/]+/g, encodeURIComponent); // double encode decode for normalizing
  },
  // ... other options
});

This way for example, all any other posterior requests will works just as if the incoming one request was properly encoded. No problem for anyone!

This will break something?

No! anything, because it won't change any currently set configuration (that's why I'm not proposing to just place up the already existing 'onRequest' hook), because that hook could contain unexpected hooked things and (we don't know), brake something for a project. This will allow a new (and so no changing) option for hooking in there.

📝 Checklist

  • I have linked an issue or discussion.
  • I have updated the documentation accordingly.

Created an onSanitizeRequest hook, for allowing users to customize the url behaviour before onSanitizeRequest is set, but without moving (for safety reasons) the current onRequest hook.
@pi0
Copy link
Member

pi0 commented Jun 19, 2024

Thanks for nice explanations. While it all makes sense, i am thinking more about a more generic hook name. For now marking PR as draft.

@pi0 pi0 marked this pull request as draft June 19, 2024 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants