import fsspec
fileobj = fsspec.open('gs://<insert-your-bucket-here>/test-write-flush', 'w', auto_mkdirs=True)
f = fileobj.fs.open(fileobj.path, mode=fileobj.mode)
f.write('w' * (2**20)) # is guaranteed to be larger than minimal block size
f.flush() # does nothing visible - no file is created at the destination
f.close() # now the file is created and has content
Upon debugging the flush call, it seems that the check self.buffer.tell() < self.blocksize is always True because, the way things are implemented, self.buffer.tell() returns 0.
Furthermore, if I manually run what follows that check in the fsspec flush implementation, meaning this code:
if self.offset is None:
    # Initialize a multipart upload
    self.offset = 0
    try:
        self._initiate_upload()
    except:  # noqa: E722
        self.closed = True
        raise

if self._upload_chunk(final=force) is not False:
    self.offset += self.buffer.seek(0, 2)
    self.buffer = io.BytesIO()
the file is still not created, although the underlying code in _upload_chunk does something.
There are two ways to write a file ("key") to GCS: a single upload or a multi-part upload. The first is a one-shot deal, so close and flush are necessarily the same (we do this for small files).
For the latter, an upload container is created with the first flush (if the buffer is big enough - GCS limits how small each write can be!), and subsequent flushes will send more pieces; but on the remote API, the only way to patch the pieces together at the destination is when you are finally done with the file, i.e., the same as close. Sorry, GCS is not a real file system, and we do our best to emulate it, but cannot get around such shortcomings.
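To make the behaviour described above concrete, here is a toy model of it. It is only a sketch and not the gcsfs implementation: the class name ToyMultipartFile, the dict standing in for a bucket, and the tiny blocksize are all assumptions made for illustration. It shows a flush being deferred while the buffer is below the block size, and chunks staying invisible at the destination until close.

```python
import io


class ToyMultipartFile:
    """Hypothetical stand-in for a buffered multi-part writer (not gcsfs code)."""

    def __init__(self, bucket, key, blocksize=4):
        self.bucket = bucket        # dict of finalized, visible objects (assumption)
        self.key = key
        self.blocksize = blocksize  # smallest chunk a non-final flush may send
        self.buffer = io.BytesIO()
        self.pending = []           # chunks already sent, but not yet visible

    def write(self, data):
        return self.buffer.write(data)

    def flush(self, force=False):
        # A non-final flush is deferred while the buffer is below blocksize.
        if not force and self.buffer.tell() < self.blocksize:
            return
        self.pending.append(self.buffer.getvalue())
        self.buffer = io.BytesIO()

    def close(self):
        # Only here are the pieces stitched together into a visible object.
        self.flush(force=True)
        self.bucket[self.key] = b"".join(self.pending)


bucket = {}
f = ToyMultipartFile(bucket, "test-write-flush")
f.write(b"abcdefgh")
f.flush()                       # chunk is sent, object still not visible
visible_after_flush = "test-write-flush" in bucket
f.write(b"ij")
f.flush()                       # below blocksize, so nothing is sent
f.close()                       # now the object exists with all content
```

In this model, flush can move data out of the local buffer, but nothing appears under the key until close, which matches what the reproduction observes.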
After reading a bit more into the limitations, I think I understood them.
I suppose the only "easy" workaround is to re-upload the entire file every time there's a flush, but that doesn't sound practical for bigger files.
The workaround I am likely going to use, given the immutability of the final objects, is to upload parts of the stream into separate intermediate files whenever "flush" is called and, if that happened, perform the analogue of gsutil compose once the file is closed.
I wonder if this approach could be translated to a more general one.
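A rough sketch of that workaround, to show the shape of it. Everything here is hypothetical: the class name ComposeOnCloseWriter, the ".part-NNNNN" naming scheme, and the dict used in place of a real bucket are assumptions, and the join in close stands in for a server-side compose. It has not been run against GCS.

```python
import io


class ComposeOnCloseWriter:
    """Hypothetical writer: flush() uploads an intermediate object,
    close() concatenates the intermediates into the final object."""

    def __init__(self, store, path):
        self.store = store    # dict standing in for a bucket (assumption)
        self.path = path
        self.buffer = io.BytesIO()
        self.parts = []       # names of intermediate objects, in write order
        self.closed = False

    def write(self, data):
        return self.buffer.write(data)

    def flush(self):
        data = self.buffer.getvalue()
        if not data:
            return            # nothing new since the last flush
        part = f"{self.path}.part-{len(self.parts):05d}"
        self.store[part] = data   # this piece is visible immediately
        self.parts.append(part)
        self.buffer = io.BytesIO()

    def close(self):
        if self.closed:
            return
        self.flush()
        # Stand-in for the compose step: join the parts, then clean them up.
        self.store[self.path] = b"".join(self.store[p] for p in self.parts)
        for p in self.parts:
            del self.store[p]
        self.parts = []
        self.closed = True


store = {}
w = ComposeOnCloseWriter(store, "test-write-flush")
w.write(b"first;")
w.flush()
keys_after_flush = sorted(store)   # only the intermediate part exists
w.write(b"second")
w.close()
keys_after_close = sorted(store)   # only the final object remains
```

With a real bucket the final join would be done remotely rather than by reading the parts back. As far as I know a single GCS compose request accepts at most 32 source objects, so a file with many flushes would need to compose in several rounds.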
fsspec version: 2022.5.0
gcsfs version: 2022.5.0