
Use jl_adopt_thread instead of AsyncCondition for interop #39

Merged 9 commits into main from ag-adopt-thread on Jul 2, 2024

Conversation

andrebsguedes
Member

This PR introduces a number of improvements and fixes to RustyObjectStore:

  • Use jl_adopt_thread instead of AsyncCondition, reducing contention in the Julia <-> Rust interop and fixing the intermittent EOFErrors we saw a couple of times
  • New task scheduling on the Rust side that lets us roughly saturate a 100 Gbps NIC
  • Improved retry reporting on error
  • A fix for retries failing when the client gets evicted from the Rust cache
  • Complex objects on the Rust side are now properly destructed inside tokio threads
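
For context, a minimal sketch (all names assumed, not the actual implementation) of the Julia-side difference between the old AsyncCondition-based wakeup and waiting on a plain Base.Event that the Rust worker signals after adopting its thread into the Julia runtime via jl_adopt_thread:

```julia
# Before (sketch): an AsyncCondition handle is handed to Rust, which
# signals it through the libuv event loop (uv_async_send).
cond = Base.AsyncCondition()
# @ccall rust_lib.put(..., cond.handle::Ptr{Cvoid})::Cint
# wait(cond)

# After (sketch): the Rust worker thread calls jl_adopt_thread on the C
# side and then notifies a plain Base.Event directly, skipping the
# event-loop hop and the contention that came with it.
event = Base.Event()
# @ccall rust_lib.put(..., pointer_from_objref(event)::Ptr{Cvoid})::Cint
# wait(event)
```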

@andrebsguedes andrebsguedes requested a review from Drvi July 1, 2024 13:54
@Drvi Drvi requested a review from kpamnany July 1, 2024 14:31
Drvi commented Jul 2, 2024

The CI is red -- maybe we need to bump object_store_ffi_jll?

@andrebsguedes (Member, Author)

> The CI is red -- maybe we need to bump object_store_ffi_jll?

Yes, I am just waiting for the review here to conclude before releasing object_store_ffi, since feedback from the review could require changes there.

@kpamnany left a comment

I'm wondering if pointer_from_objref is the right way to do this. I'll check whether there's a more canonical way, but otherwise this looks good.

src/RustyObjectStore.jl: 5 review threads (outdated, resolved)
@andrebsguedes andrebsguedes merged commit 7c03077 into main Jul 2, 2024
5 checks passed
@NHDaly NHDaly deleted the ag-adopt-thread branch July 3, 2024 14:33
config = into_config(conf)
while true
    result = GC.@preserve buffer config response cond begin
        preserve_task(ct)

I don't understand this. Why would we need to GC-preserve our own task if we're also blocking, waiting for the event to finish? Of course the task will be kept alive, since the task itself is waiting on the event. ...

... oh, is the issue that the Event itself is only rooted from our own stack, and the task is only rooted by the Event's wait queue, so it's a cycle and that cycle can get GC'd!? 😅 😅 😅 Wow, that's wild.

I think this deserves some comments explaining what is going on and why the preserve is needed.
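
To make the hazard concrete, a hypothetical sketch of the pattern (preserve_task/unpreserve_task are taken from the diff; everything else is assumed):

```julia
# `event` is only rooted from this stack frame, and once we block, the
# task is only rooted from `event`'s wait queue -- a cycle the GC could
# collect while Rust still holds the raw pointer. Explicitly rooting the
# task rules that out.
event = Base.Event()
ct = current_task()
preserve_task(ct)                 # root the task for the duration of the call
try
    GC.@preserve event begin
        handle = pointer_from_objref(event)   # raw handle passed to Rust
        # @ccall rust_lib.some_op(..., handle::Ptr{Cvoid})::Cint
        wait(event)               # Rust's adopted thread notifies the event
    end
finally
    unpreserve_task(ct)
end
```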

@NHDaly (Member) Jul 3, 2024

Also, why couldn't this use Base.preserve_handle(ct)? Is it to reduce contention?

Please clarify this in comments. 🙏

Comment on lines +776 to 797
 result = GC.@preserve buffer config response event try
     result = @ccall rust_lib.put(
         path::Cstring,
         buffer::Ref{Cuchar},
         size::Culonglong,
         config::Ref{Config},
         response::Ref{Response},
-        cond_handle::Ptr{Cvoid}
+        handle::Ptr{Cvoid}
     )::Cint

-    wait_or_cancel(cond, response)
+    wait_or_cancel(event, response)

     result
 finally
     unpreserve_task(ct)
 end

 if result == 2
     # backoff
-    sleep(1.0)
+    sleep(0.01)
     continue
 end

Question: why do we check result == 2 after waiting? The wait_or_cancel cannot affect the value of result, so if result == 2 means we need to retry, why wait at all?

I think this could use a comment, and/or a global constant that gives a meaningful name to whatever 2 is.
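
A possible shape for that, as a sketch (the constant's name is hypothetical; its value is taken from the snippet above):

```julia
# Give the magic status code a meaningful name.
const FFI_TRY_AGAIN = Cint(2)   # Rust side asks the caller to back off and retry

while true
    result = submit_request()   # placeholder for the @ccall + wait_or_cancel above
    if result == FFI_TRY_AGAIN
        sleep(0.01)             # brief backoff before resubmitting
        continue
    end
    break
end
```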

# and should thus not be garbage collected.
# This copies the behavior of Base.preserve_handle.
const tasks_in_flight = IdDict{Task, Int64}()
const preserve_task_lock = Threads.SpinLock()

Last comment: if we're still seeing high contention here, it could make sense to shard this into multiple dicts and locks, keyed by a hash of the task's objectid? That would be a simple way to significantly reduce contention on the spinlock. Ideally, we'd be able to register a new IO without any spinning at all. I think even just 10 or 100 shards should be enough to completely eliminate that, yeah?
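
A sketch of that sharding idea (shard count and helper names are hypothetical):

```julia
const N_SHARDS = 16
const task_shards = [IdDict{Task,Int64}() for _ in 1:N_SHARDS]
const shard_locks = [Threads.SpinLock() for _ in 1:N_SHARDS]

# Pick a shard from the task's objectid so a given task always maps to
# the same dict/lock pair.
shard_index(t::Task) = Int(objectid(t) % N_SHARDS) + 1

function preserve_task(t::Task)
    i = shard_index(t)
    lock(shard_locks[i]) do
        task_shards[i][t] = get(task_shards[i], t, 0) + 1
    end
end

function unpreserve_task(t::Task)
    i = shard_index(t)
    lock(shard_locks[i]) do
        c = task_shards[i][t]
        c == 1 ? delete!(task_shards[i], t) : (task_shards[i][t] = c - 1)
    end
end
```

With independent locks, two tasks contend only when they hash to the same shard, so even a small shard count should cut spinlock contention sharply.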
