
Use jl_adopt_thread instead of AsyncCondition for interop #39

Merged 9 commits into main from ag-adopt-thread on Jul 2, 2024

Conversation

andrebsguedes
Member

This PR introduces a number of improvements and fixes to RustyObjectStore:

  • Use jl_adopt_thread instead of AsyncCondition, reducing contention in the Julia <-> Rust interop and fixing the intermittent EOFErrors we saw a couple of times
  • New task scheduling on the Rust side that lets us roughly saturate a 100 Gbps NIC
  • Improved retry reporting on error
  • A fix for retries failing when the client gets evicted from the Rust cache
  • Complex objects on the Rust side are now properly destructed inside tokio threads
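
For context, a minimal sketch (all names assumed, not the actual implementation) of the Julia-side difference between the old AsyncCondition-based wakeup and waiting on a plain Base.Event that the Rust worker signals after adopting its thread into the Julia runtime via jl_adopt_thread:

```julia
# Before (sketch): an AsyncCondition handle is handed to Rust, which
# signals it through the libuv event loop (uv_async_send).
cond = Base.AsyncCondition()
# @ccall rust_lib.put(..., cond.handle::Ptr{Cvoid})::Cint
# wait(cond)

# After (sketch): the Rust worker thread calls jl_adopt_thread on the C
# side and then notifies a plain Base.Event directly, skipping the
# event-loop hop and the contention that came with it.
event = Base.Event()
# @ccall rust_lib.put(..., pointer_from_objref(event)::Ptr{Cvoid})::Cint
# wait(event)
```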

@andrebsguedes andrebsguedes requested a review from Drvi July 1, 2024 13:54
@Drvi Drvi requested a review from kpamnany July 1, 2024 14:31
Drvi commented Jul 2, 2024

The CI is red -- maybe we need to bump object_store_ffi_jll?

@andrebsguedes (Member, Author)

> The CI is red -- maybe we need to bump object_store_ffi_jll?

Yes, I am just waiting for the review here to conclude before releasing object_store_ffi, since feedback from the review could require changes there.

@kpamnany left a comment

I'm wondering if pointer_from_objref is the right way to do this. I'll check whether there's a more canonical way, but otherwise this looks good.

src/RustyObjectStore.jl: 5 review threads (outdated, resolved)
@andrebsguedes andrebsguedes merged commit 7c03077 into main Jul 2, 2024
5 checks passed
@NHDaly NHDaly deleted the ag-adopt-thread branch July 3, 2024 14:33
config = into_config(conf)
while true
    result = GC.@preserve buffer config response cond begin
        preserve_task(ct)

I don't understand this. Why would we need to GC-preserve our own task if we're also blocking, waiting for the event to finish? Of course the task will be kept alive, since the task itself is waiting on the event. ...

... oh, is the issue that the Event itself is only rooted from our own stack, and the task is only rooted by the Event's wait queue, so it's a cycle and that cycle can get GC'd!? 😅 😅 😅 Wow, that's wild.

I think this deserves some comments explaining what is going on and why the preserve is needed.
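
To make the hazard concrete, a hypothetical sketch of the pattern (preserve_task/unpreserve_task are taken from the diff; everything else is assumed):

```julia
# `event` is only rooted from this stack frame, and once we block, the
# task is only rooted from `event`'s wait queue -- a cycle the GC could
# collect while Rust still holds the raw pointer. Explicitly rooting the
# task rules that out.
event = Base.Event()
ct = current_task()
preserve_task(ct)                 # root the task for the duration of the call
try
    GC.@preserve event begin
        handle = pointer_from_objref(event)   # raw handle passed to Rust
        # @ccall rust_lib.some_op(..., handle::Ptr{Cvoid})::Cint
        wait(event)               # Rust's adopted thread notifies the event
    end
finally
    unpreserve_task(ct)
end
```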

@NHDaly (Member) Jul 3, 2024

Also, why couldn't this use Base.preserve_handle(ct)? Is it to reduce contention?

Please clarify this in comments. 🙏

Comment on lines +776 to 797
 result = GC.@preserve buffer config response event try
     result = @ccall rust_lib.put(
         path::Cstring,
         buffer::Ref{Cuchar},
         size::Culonglong,
         config::Ref{Config},
         response::Ref{Response},
-        cond_handle::Ptr{Cvoid}
+        handle::Ptr{Cvoid}
     )::Cint

-    wait_or_cancel(cond, response)
+    wait_or_cancel(event, response)

     result
 finally
     unpreserve_task(ct)
 end

 if result == 2
     # backoff
-    sleep(1.0)
+    sleep(0.01)
     continue
 end

Question: why do we check result == 2 after waiting? The wait_or_cancel cannot affect the value of result, so if result == 2 means we need to retry, why wait at all?

I think this could use a comment, and/or a global constant that gives a meaningful name to whatever 2 is.
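
A possible shape for that, as a sketch (the constant's name is hypothetical; its value is taken from the snippet above):

```julia
# Give the magic status code a meaningful name.
const FFI_TRY_AGAIN = Cint(2)   # Rust side asks the caller to back off and retry

while true
    result = submit_request()   # placeholder for the @ccall + wait_or_cancel above
    if result == FFI_TRY_AGAIN
        sleep(0.01)             # brief backoff before resubmitting
        continue
    end
    break
end
```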

# and should thus not be garbage collected.
# This copies the behavior of Base.preserve_handle.
const tasks_in_flight = IdDict{Task, Int64}()
const preserve_task_lock = Threads.SpinLock()

Last comment: if we're still seeing high contention here, it could make sense to shard this into multiple dicts and locks, keyed by a hash of the task's objectid? That would be a simple way to significantly reduce contention on the spinlock. Ideally, we'd be able to register a new IO without any spinning at all. I think even just 10 or 100 shards should be enough to completely eliminate that, yeah?
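
A sketch of that sharding idea (shard count and helper names are hypothetical):

```julia
const N_SHARDS = 16
const task_shards = [IdDict{Task,Int64}() for _ in 1:N_SHARDS]
const shard_locks = [Threads.SpinLock() for _ in 1:N_SHARDS]

# Pick a shard from the task's objectid so a given task always maps to
# the same dict/lock pair.
shard_index(t::Task) = Int(objectid(t) % N_SHARDS) + 1

function preserve_task(t::Task)
    i = shard_index(t)
    lock(shard_locks[i]) do
        task_shards[i][t] = get(task_shards[i], t, 0) + 1
    end
end

function unpreserve_task(t::Task)
    i = shard_index(t)
    lock(shard_locks[i]) do
        c = task_shards[i][t]
        c == 1 ? delete!(task_shards[i], t) : (task_shards[i][t] = c - 1)
    end
end
```

With independent locks, two tasks contend only when they hash to the same shard, so even a small shard count should cut spinlock contention sharply.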
