Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage prefixes always #954

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

scober
Copy link
Contributor

@scober scober commented Mar 10, 2022

I propose (via PR) complicating both RDG and PropertyGraph. They can both now optionally be constructed as "ephemeral". An ephemeral graph is backed by a storage location that will be deleted when the in-memory graph goes out of scope. It cannot be committed to that location but it can be written to another location. An existing graph can not be made ephemeral but an ephemeral graph can be made non-ephemeral by writing it to a new location. This functionality is used in three primary places:

  1. By users directly via the katana.local.Graph python class
  2. By analytics functions like the clusterers that create PropertyGraphs on the fly as intermediate state
  3. By tests

Comment on lines +42 to 43
return KATANA_CHECKED(katana::PropertyGraph::MakeEphemeral(
TopologyFromCSR(edge_indices, edge_destinations)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reasonable usage of this python function is to call it and then call write on the result to store it to a specific location. Will that work correctly? It relies on writing the RDG to a different location compared to where it started (in the tmp dir).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a lot about that use case when I was writing this so it should work correctly. But I need to check to see if we are testing it currently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure I test it from Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you point me in the right direction to short-circuit my search a little?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pytest.mark.required_env("KATANA_TEST_DATASETS")

Through the power of searching for write then looking at the tests for importing.

Comment on lines +19 to +49
~EphemeralStoragePrefix() {
std::vector<std::string> files;
auto list_future = FileListAsync(prefix_.path(), &files);
if (!list_future.valid()) {
KATANA_LOG_WARN(
"unable to list files, not cleaning up ephemeral storage");
return;
}

auto list_future_res = list_future.get();
if (!list_future_res) {
KATANA_LOG_WARN(
"unable to list files, not cleaning up ephemeral storage: {}",
list_future_res.error());
}

std::unordered_set deletable_files(files.begin(), files.end());
auto delete_res = FileDelete(prefix_.path(), deletable_files);
if (!delete_res) {
KATANA_LOG_WARN(
"unable to delete files, not cleaning up ephemeral storage: {}",
delete_res.error());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized something you may not be aware of. On Linux/Unix, you can create, open, and immediately delete a file. The file disappears to anyone listing the directly, but it actually still exists as long as it is still open. You can use that for temp files, that you don't need to close and reopen. It avoids the need to explicitly clean up the files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contract here is that PropertyGraph has a storage prefix that it is free to do whatever it wants with. So I don't think we will be able to use those sorts of temporary files without invasive changes to the storage layer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. We might need to setup a signal handler to do this work then, since destructors don't run if the program crashes. But for now we should just document that our handling of temporary files is not complete and we will leak temp files in a number of cases. One case, BTW, is python interpreter exit with a live graph reference in the global scope. Python finalizers are not guaranteed to be called in that case, so C++ destructors may not be either. This could very easily happen with someone restarting their Jupyter "kernel" (python interpreter). The result could be creating a new set of temp files every time they restart the kernel.

@tylershunt
Copy link
Contributor

I have two thoughts:

  • this could be a lot simpler if we just designated an ephemeral storage prefix and then used some kind of recursive remove to just nuke that when we die (Becomes a check in the commit path to "is this in the ephereral place")
  • I would rather the above be what happens when you call Make without a URI prefix.

@scober
Copy link
Contributor Author

scober commented Mar 10, 2022

* this could be a lot simpler if we just designated an ephemeral storage prefix and then used some kind of recursive remove to just nuke that when we die (Becomes a check in the commit path to "is this in the ephereral place")

That is true. One downside is that in that model we don't nuke the ephemeral place unless we die so we could potentially build up a lot of cruft there.

* I would rather the above be what happens when you call `Make` without a URI prefix.

MakeEphemeral has the same arguments as the current Makes that don't accept a URI. My goal is to remove those Makes and replace their usage with MakeEphemeral because that makes it clear to callers that they are doing something different.

Defines a system-wide policy for choosing a temporary directory.
simon added 4 commits March 11, 2022 14:51
Add methods to URI to check if one URI is a prefix of another.
A utility class that wraps a storage prefix and deletes all files under
that prefix when it is destroyed.
RDG and PropertyGraph both now provide a MakeEphemeral(), which creates
a graph that is backed by an ephemeral storage location and approximates an
in-memory graph.
Some instances of this behavior can be replaced with MakeEphemeral() and
some can be replaced with a call that provides a storage prefix.
@tylershunt
Copy link
Contributor

tylershunt commented Mar 11, 2022

One downside is that in that model we don't nuke the ephemeral place unless we die so we could potentially build up a lot of cruft there.

If that's a concern we could choose random sub directories of that prefix and keep the parts of one property graph, and auto remove that sub-directory when the property graph goes away (leaving the global cleanup on library unload so that we don't have to know about all the live ephemeral objects when we die).

But the check in libtsuba would be just as simple.

@tylershunt
Copy link
Contributor

My goal is to remove those Makes and replace their usage with MakeEphemeral because that makes it clear to callers that they are doing something different.

This part confused me. Callers of Make without any mention of a storage location expected there graphs to be stored somehow?

@scober
Copy link
Contributor Author

scober commented Mar 11, 2022

One downside is that in that model we don't nuke the ephemeral place unless we die so we could potentially build up a lot of cruft there.

If that's a concern we could choose random sub directories of that prefix and keep the parts of one property graph, and auto remove that sub-directory when the property graph goes away (leaving the global cleanup on library unload so that we don't have to know about all the live ephemeral objects when we die).

But the check in libtsuba would be just as simple.

This is actually the current state of this PR (sort of). I don't know if this is an actual concern but since I had already written some code to manage per-graph locations I implemented exactly the hybrid you described. I didn't actually write the signal handler to do clean up but I can tack that onto this PR before I merge.

@scober
Copy link
Contributor Author

scober commented Mar 11, 2022

My goal is to remove those Makes and replace their usage with MakeEphemeral because that makes it clear to callers that they are doing something different.

This part confused me. Callers of Make without any mention of a storage location expected there graphs to be stored somehow?

I don't think it is impossible that we would get this complaint. But I don't feel all that strongly and it is definitely a cleaner interface to have all the Make functions have the same name.

@amberhassaan
Copy link
Contributor

I'm still confused as to why we need Ephemeral graphs? Why isn't an in-memory PropertyGraph not enough? @scober @arthurp

@arthurp
Copy link
Contributor

arthurp commented Mar 15, 2022

I'm still confused as to why we need Ephemeral graphs? Why isn't an in-memory PropertyGraph not enough? @scober @arthurp

Because we need to be able to return ephemeral graphs to the remote API as handles and we cannot guarantee that the workers will still be running when the next request related to the graph comes in. So the ephemeral graphs need to be persistented for the client session or something like that.

@scober
Copy link
Contributor Author

scober commented Mar 15, 2022

I'm still confused as to why we need Ephemeral graphs? Why isn't an in-memory PropertyGraph not enough? @scober @arthurp

There is no way to keep a PropertyGraph object from trying to write to storage. That is true in principle now but it is becoming more and more true in practice as well (see MemorySupervisor for an example of this).

So the proper (but maybe a little snotty) answer to your question is that there is no such thing as an in-memory PropertyGraph. We can hide the existence of ephemeral graphs from users more than I am in the current implementation by not using the word "Ephemeral" in the Make function name, but the current design is that every PropertyGraph has an associated writable storage location.

@amberhassaan
Copy link
Contributor

So is the idea with MemorySupervisor that we can unload a graph to storage to reduce our in-memory footprint?

Copy link
Contributor

@tylershunt tylershunt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be to drop Ephemeral from the name of the factory function and just promote the Make variant without a path to be what (I think) everyone expected it was anyway.


/// Make a property graph from topology and associate it with an ephemeral
/// storage prefix. This is approximates an in-memory graph.
static Result<std::unique_ptr<PropertyGraph>> MakeEphemeral(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support Tyler's suggestion of keeping the name Make

Copy link
Contributor

@amberhassaan amberhassaan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good from my side.

@scober
Copy link
Contributor Author

scober commented Mar 17, 2022

So is the idea with MemorySupervisor that we can unload a graph to storage to reduce our in-memory footprint?

Yes. In particular MemorySupervisor expects to be able to write certain large objects (I think just property arrays for now) to storage. So it may never persist the whole graph but it will want to write files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants