LocalFileSystem restore _strip_protocol signature #1567
@fleming79 , this undoes much of your code, but maintains the tests for correctness, while reimplementing consistency for _strip_protocol. Do you have thoughts? |
I have no issue with removing the option 'remove_trailing_slash' if it isn't required. It was provided because there were The question I pose is - which code is more maintainable/performant/correct? |
Thanks for the quick response!
Regarding maintainability, I would prefer if the method signatures would agree with
Regarding performance, I'm not a big fan of micro-benchmarks anymore, but I ran one (checking 10000 generated URIs) and the implementation in this PR seems to be equally fast/slow (within margin of error) as
Benchmark for `_strip_protocol` and `_parent`:
# pip install asv
import itertools
import random
import string

from fsspec import get_filesystem_class

random.seed(0)


def _make_random_path():
    # build a random length path
    pth_len = random.randint(5, 40)
    return "".join(random.sample(string.ascii_letters + "/", k=pth_len))


def _make_uris(n):
    # create uris with and without protocols and with and without netlocs
    it_proto_netloc = itertools.cycle(
        itertools.product(
            ["file", "local", "wrong", None],
            ["netloc", ""]
        )
    )
    for _ in range(n):
        proto, netloc = next(it_proto_netloc)
        pth = _make_random_path()
        if proto and netloc:
            yield f"{proto}://{netloc}/{pth}"
        elif proto:
            yield f"{proto}:/{pth}"
        else:
            yield f"{netloc}/{pth}"


uris = list(_make_uris(10000))


class Suite:
    def setup(self):
        self.fs_cls = get_filesystem_class("file")
        self.uris = uris

    def teardown(self):
        del self.uris

    def time_split_protocol(self):
        for uri in self.uris:
            self.fs_cls._strip_protocol(uri)

    def time_parent(self):
        for uri in self.uris:
            self.fs_cls._parent(uri)

Answering whether this implementation is more or less correct would require more tests. I'll try to find some time to come up with more cases for the test-suite.
Cheers, |
I don't think we can trust timing differences <5%; in these cases it can even matter which order you run them (whether the CPU is warm, etc). The interesting thing would be to check on Windows (but they have lots of path styles). |
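One way to reduce the ordering and warm-up effects mentioned above is to interleave the candidates and keep the best of several repeats. The following is a minimal sketch using only the standard library; `impl_a`, `impl_b` and `compare` are hypothetical stand-ins for illustration, not the implementations discussed in this PR.

```python
import timeit


def impl_a(path):
    # hypothetical stand-in for one candidate implementation
    return path.rstrip("/") or "/"


def impl_b(path):
    # hypothetical stand-in for another candidate implementation
    return path[:-1] if len(path) > 1 and path.endswith("/") else path


def compare(funcs, data, rounds=5, number=20):
    """Return the best (minimum) total time per function, interleaving the runs."""
    best = {f.__name__: float("inf") for f in funcs}
    for _ in range(rounds):
        # alternate between candidates so neither one always runs first
        for f in funcs:
            t = timeit.timeit(lambda: [f(p) for p in data], number=number)
            best[f.__name__] = min(best[f.__name__], t)
    return best


paths = ["/a/b/c/", "/a/b", "/", "relative/path/"] * 250
results = compare([impl_a, impl_b], paths)
ratio = results["impl_a"] / results["impl_b"]
print(results, f"ratio={ratio:.2f}")
```

Even with interleaving, a ratio close to 1.0 is best read as "no measurable difference" rather than as a ranking.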
The original Here is a starting point.
|
Thank you @fleming79 for the proposed implementations. I'll try them out shortly, when I find some more time. I've been collecting more test cases for So far I noticed that between
Outdated. todo: update
on posix py310
on windows py311
|
Performance comparison on windows python==3.11, shows that the I'll update this benchmark once I test the implementation provided in the comment above. outdated benchmark
|
Hello @martindurant and @fleming79 I found some time to continue with this PR, and it would be great to get your feedback. NOTE: the failing tests seem to be Summarizing the changes:
Performance: I ran a benchmark on 10000 randomly generated URIs comparing performance of
Code for generating the URIs:

import itertools
import random
import string

from fsspec import get_filesystem_class

random.seed(0)


def _make_random_path(style):
    pth_len = random.randint(5, 40)
    if style == "posix":
        chars = string.ascii_letters + "/"
        prefix = ""
    elif style == "win":
        chars = string.ascii_letters + "\\"
        prefix = "c:\\"
    elif style == "win-posix":
        chars = string.ascii_letters + "/"
        prefix = "c:/"
    else:
        raise ValueError(f"Unknown style {style}")
    return prefix + "".join(random.sample(chars, k=pth_len))


def _make_uris(n):
    it_proto_netloc_style = itertools.cycle(
        itertools.product(
            ["file", "local", "wrong", None],
            ["netloc", ""],
            ["posix", "win", "win-posix"],
        )
    )
    for _ in range(n):
        proto, netloc, style = next(it_proto_netloc_style)
        pth = _make_random_path(style)
        if proto and netloc:
            yield f"{proto}://{netloc}/{pth}"
        elif proto:
            yield f"{proto}:/{pth}"
        else:
            yield f"{netloc}/{pth}"


uris = list(_make_uris(10000))


class Suite:
    def setup(self):
        self.fs_cls = get_filesystem_class("file")
        self.uris = uris

    def teardown(self):
        del self.uris

    def time_split_protocol(self):
        for uri in self.uris:
            self.fs_cls._strip_protocol(uri)

    def time_parent(self):
        for uri in self.uris:
            self.fs_cls._parent(uri)

Ubuntu under WSL2: [benchmark results for _strip_protocol and _parent]
Windows 11: [benchmark results for _strip_protocol and _parent]
Some open questions:
Have a great day! |
Note: the failing CI runs seem to be caused by a new git release? pypa/setuptools-scm#1038 |
@ap-- It looks like you've made some good changes.
I agree those tests are useful and an error should probably be raised for invalid types somehow. I guess removing the coercion would suffice, but maybe it would be better to also check the return type from the call to |
Do those fail with a different exception type now, or just return garbage pathstrings?
This is for the s3fs build, right? We could pin its requirement on setuptools_scm, wait for a new release, or move that build to hatch (as fsspec has been). |
Current master returns garbage:
>>> from fsspec.implementations.local import make_path_posix
>>> make_path_posix(object())
'/Users/poehlmann/development/filesystem_spec/<object object at 0x100664b80>'
Possible fixes are:
I think this happens when fsspec gets installed during the s3fs install. The traceback shows it failing in hatchling, which uses the hatch-vcs plugin to get the current version via setuptools_scm. setuptools_scm in turn has a compatibility issue with the newest git version, which seems to already be used on the GitHub Actions runner via the conda test-env environment. I'll see if downgrading git will fix the issue. |
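To illustrate the kind of type check discussed for the `make_path_posix(object())` case above, here is a minimal sketch that rejects non-path inputs instead of coercing them with `str()`. The helper name `checked_fspath` is hypothetical and this is not fsspec's implementation; it only assumes that valid inputs are `str` or `os.PathLike` objects.

```python
import os
import pathlib


def checked_fspath(path):
    """Hypothetical helper: reject non-path inputs instead of coercing via str()."""
    if isinstance(path, (str, os.PathLike)):
        # os.fspath returns a str unchanged and calls __fspath__ on PathLike objects
        return os.fspath(path)
    raise TypeError(f"expected str or os.PathLike, got {type(path).__name__}")


print(checked_fspath("some/dir"))
print(checked_fspath(pathlib.PurePosixPath("some/dir")))
try:
    checked_fspath(object())
except TypeError as exc:
    print("rejected:", exc)
```

With a check like this, an arbitrary object raises a `TypeError` early instead of ending up embedded in a path string.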
A specific version of git could be pinned in the CI conda environment file |
That made it work |
rerunning ... I have yet to come up with a way to get SMB to reliably pass every time. |
I'm happy with this. @fleming79, any comments? |
Awesome. So we just need a decision regarding
Update: I double checked, and |
|
I'm happy with In |
I made the changes and added tests for
|
Hello,
I noticed that the LocalFileSystem._strip_protocol signature changed in #1477, when running the universal_pathlib test-suite against the current fsspec version.
To me it seems that the intention of #1477 was initially to prevent "C:/" or "C:\\" from being reduced to "C:" by _strip_protocol, but in addition it introduced function signature changes in fsspec.implementations.local and minor changes in fsspec.mapping.
I created this PR as a basis for discussing whether the signature change could be avoided. In this PR I reverted the changes to LocalFileSystem and make_path_posix but kept the changes to the tests (first commit), and then provided an alternative implementation that avoids the function signature change.
I would also be happy to try to add more tests around the LocalFileSystem._parent, LocalFileSystem._strip_protocol and make_path_posix behavior for edge cases if there is interest. But Windows is not my main OS for daily work, so I am very likely not aware of most of them.
Cheers,
Andreas
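The drive-root edge case described above ("C:/" must not collapse to "C:") can be sketched in isolation. The helper `strip_trailing_sep` below is hypothetical and only illustrates the idea with the standard library's `ntpath.splitdrive`; it is not the implementation from this PR or from #1477.

```python
import ntpath


def strip_trailing_sep(path):
    """Hypothetical helper: drop trailing separators but keep roots intact."""
    drive, rest = ntpath.splitdrive(path)
    stripped = rest.rstrip("/\\")
    if rest and not stripped:
        # rest consisted only of separators, so the path is a root such as "C:/" or "/"
        return drive + rest[0]
    return drive + stripped


for example in ["C:/", "C:\\", "C:/foo/", "/", "/a/b/"]:
    print(repr(example), "->", repr(strip_trailing_sep(example)))
```

The point of the sketch is that the trailing separator of a root carries meaning (on Windows, "C:" refers to the current directory on drive C, not to its root), so it has to be treated differently from the trailing separator of an ordinary directory path.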