Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't load glTF URIs with unicode characters #973

Open
lilleyse opened this issue Oct 31, 2024 · 4 comments
Open

Can't load glTF URIs with unicode characters #973

lilleyse opened this issue Oct 31, 2024 · 4 comments
Labels
bug Something isn't working quality Improve code quality or encourage developer success

Comments

@lilleyse
Copy link
Contributor

According to the glTF URIs section:

Paths with non-ASCII characters MAY be written as-is, with JSON string escaping, or with percent-encoding; all these options are valid. For example, the following three paths point to the same resource:

{
   "images": [
       {
          "uri": "grande_sphère.png"
       },
       {
           "uri": "grande_sph\u00E8re.png"
       },
       {
           "uri": "grande_sph%C3%A8re.png"
       }
   ]
}

But when I try to use GltfReader to load a glTF with a URI to 🐶.bin I get a runtime error:

std::runtime_error caught: GET `🐶.bin` failed: Couldn't resolve host name

🐶.gltf.zip

This may be a limitation in uriparser which only parses RFC 3986 URIs which are ASCII-based. It does not handle RFC 3987 IRIs which allow Unicode characters verbatim.

@kring
Copy link
Member

kring commented Nov 4, 2024

Your explanation makes sense to me @lilleyse. I wonder if there's a better library for our purposes than uriparser. I couldn't find one when I originally selected it for cesium-native, but that was over four years ago now.

@j9liu j9liu added bug Something isn't working quality Improve code quality or encourage developer success labels Nov 8, 2024
@azrogers
Copy link
Contributor

I've investigated a few alternatives to uriparser that each come with their own drawbacks:

  • skyr-url is a more modern C++ URL parser as compared to uriparser's C API. It implements the WhatWG URL specification, which covers both RFC 3986 and RFC 3987 URLs. However, skyr-url seems to be unmaintained (last commit December 2023, compared to yesterday for uriparser). It also fails for URLs that uriparser can parse. For example, one of the test cases includes the URLs "test" and "buffer". skyr-url, perhaps correctly, believes these URLs to be invalid. We could probably work around this somehow, but that's a pain.
  • uriparser has wchar_t equivalents to all of its methods - uriParseSingleUriW instead of uriParseSingleUriA, etc. But not only would we need to convert to and from UTF-16 in every method call, copying the string multiple times (and how do we convert when the string/wstring conversion classes in C++ are deprecated?) - but I'm not even sure it would support RFC 3987 if we used the wide equivalents.
  • ada seems to be the current state-of-the-art in URL parsing. It supports the WhatWG specification and unicode, and is the URL parser used by Node.js among others. However, it seems to do things just slightly differently. Take a look at some of the failed test cases in the branch I pushed where I was trying it out. This is after a number of workarounds I implemented to try to get the tests passing. Some of these will probably not cause much of an issue - the difference between https://example.com/?some=parameter and https://example.com?some=parameter is probably meaningless to most web servers - but some of them, like not escaping colons in the pathname, could be a big problem.

Also, perhaps the biggest issue, neither alternative to uriparser has equivalents to uriUriStringToWindowsFilenameA, uriUnixFilenameToUriStringA, etc. These methods would have to be reproduced ourselves. They're not too complicated - about fifty or sixty lines of C - but it's yet another headache to deal with that makes switching just a bit more tricky.

Instead of an alternative parser, I'm going to try to see if I can work around the lack of unicode support in uriparser. While 🐶.bin might not parse, %f0%9f%90%b6.bin probably will.

@kring
Copy link
Member

kring commented Jan 16, 2025

Thanks for investigating @azrogers. ada looks really promising, and may be worth the switch even if it causes some pain. There is a lot of dodgy stuff in our Uri.cpp due to limitations in uriparser or just general incompleteness. For example, take a look at the TODO and terrible implementation of addQuery, even though we use it all over the place. I'm inclined to think that test failures caused by switching are much more likely to be bad tests than problems with ada.

The lack of the filename conversion functions would indeed be a hassle, but I suspect they're actually quite simple when based on a URL library with a less old-school interface.

Considering how fundamental URLs are to what we do, I think this is well worth another day or three of effort. Do you think making the switch to ada is possible in that kind of time frame?

@azrogers
Copy link
Contributor

It should be, as long as we're ok with changing around some of the tests where necessary. I can put some more time into it today once I finish catching up on reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working quality Improve code quality or encourage developer success
Projects
None yet
Development

No branches or pull requests

4 participants