Storage prefixes always #954

scober · 2022-03-10T20:42:41Z

I propose (via PR) complicating both RDG and PropertyGraph. Both can now optionally be constructed as "ephemeral". An ephemeral graph is backed by a storage location that is deleted when the in-memory graph goes out of scope. It cannot be committed to that location, but it can be written to another location. An existing graph cannot be made ephemeral, but an ephemeral graph can be made non-ephemeral by writing it to a new location. This functionality is used in three primary places:

* By users directly, via the katana.local.Graph Python class
* By analytics functions, like the clusterers, that create PropertyGraphs on the fly as intermediate state
* By tests
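
To make the lifecycle above concrete, here is a minimal sketch using only `std::filesystem`. `ScratchGraph`, `Put`, and `WriteTo` are hypothetical names invented for illustration; they are not the Katana API, and the real implementation works with storage prefixes and URIs rather than local directories.

```cpp
#include <filesystem>
#include <fstream>
#include <random>
#include <stdexcept>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical stand-in for a graph with an ephemeral backing location.
// None of these names come from Katana.
class ScratchGraph {
public:
  ScratchGraph() : dir_(MakeUniqueDir()) {}

  ScratchGraph(const ScratchGraph&) = delete;
  ScratchGraph& operator=(const ScratchGraph&) = delete;

  // Delete the backing location when the in-memory object goes away.
  ~ScratchGraph() {
    std::error_code ec;
    fs::remove_all(dir_, ec);  // best effort; never throw from a destructor
  }

  const fs::path& location() const { return dir_; }

  // Stage a file in the scratch location.
  void Put(const std::string& name, const std::string& contents) {
    std::ofstream(dir_ / name) << contents;
  }

  // Writing to a *different* location is what makes the data durable.
  void WriteTo(const fs::path& dest) const {
    if (fs::exists(dest) && fs::equivalent(dest, dir_)) {
      throw std::invalid_argument("cannot commit to the ephemeral location");
    }
    fs::create_directories(dest);
    fs::copy(dir_, dest,
             fs::copy_options::recursive |
                 fs::copy_options::overwrite_existing);
  }

private:
  static fs::path MakeUniqueDir() {
    std::random_device rd;
    for (int attempt = 0; attempt < 16; ++attempt) {
      fs::path candidate =
          fs::temp_directory_path() / ("scratch-graph-" + std::to_string(rd()));
      // create_directory returns false if the directory already exists.
      if (fs::create_directory(candidate)) return candidate;
    }
    throw std::runtime_error("could not create a scratch directory");
  }

  fs::path dir_;
};
```

In this sketch, a caller stages data with `Put`, calls `WriteTo` with a destination of its choice, and then lets the object go out of scope; the scratch directory disappears while the written copy remains.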

arthurp · 2022-03-10T21:25:05Z

libkatana_python_native/src/ImportData.cpp

+        return KATANA_CHECKED(katana::PropertyGraph::MakeEphemeral(
            TopologyFromCSR(edge_indices, edge_destinations)));


A reasonable usage of this Python function is to call it and then call write on the result to store it to a specific location. Will that work correctly? It relies on writing the RDG to a different location from the one where it started (in the tmp dir).

I thought a lot about that use case when I was writing this so it should work correctly. But I need to check to see if we are testing it currently.

I'm pretty sure I test it from Python.

Could you point me in the right direction to short-circuit my search a little?

katana/python/test/test_convert_graph.py

Line 227 in 95ad4cf

@pytest.mark.required_env("KATANA_TEST_DATASETS")

Through the power of searching for write then looking at the tests for importing.

arthurp · 2022-03-10T21:28:38Z

libtsuba/include/katana/EphemeralStoragePrefix.h

+  ~EphemeralStoragePrefix() {
+    std::vector<std::string> files;
+    auto list_future = FileListAsync(prefix_.path(), &files);
+    if (!list_future.valid()) {
+      KATANA_LOG_WARN(
+          "unable to list files, not cleaning up ephemeral storage");
+      return;
+    }
+
+    auto list_future_res = list_future.get();
+    if (!list_future_res) {
+      KATANA_LOG_WARN(
+          "unable to list files, not cleaning up ephemeral storage: {}",
+          list_future_res.error());
+      return;
+    }
+
+    std::unordered_set deletable_files(files.begin(), files.end());
+    auto delete_res = FileDelete(prefix_.path(), deletable_files);
+    if (!delete_res) {
+      KATANA_LOG_WARN(
+          "unable to delete files, not cleaning up ephemeral storage: {}",
+          delete_res.error());
+    }
+  }


I just realized something you may not be aware of. On Linux/Unix, you can create, open, and immediately delete a file. The file disappears for anyone listing the directory, but it actually still exists as long as it is still open. You can use that for temp files that you don't need to close and reopen. It avoids the need to explicitly clean up the files.
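
A small POSIX-only sketch of that trick, for reference. `RoundTripThroughUnlinkedFile` is a hypothetical helper written for this illustration; it uses only `mkstemp`, `unlink`, `stat`, `write`, `lseek`, `read`, and `close`.

```cpp
#include <cstdlib>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Returns true if data written to an already-unlinked file can be read back.
// `dir` must be an existing, writable directory.
bool RoundTripThroughUnlinkedFile(const std::string& dir,
                                  const std::string& payload) {
  std::string path = dir + "/anon-XXXXXX";
  int fd = mkstemp(path.data());  // creates and opens a uniquely named file
  if (fd < 0) return false;

  unlink(path.c_str());  // the name is gone, the open file is not

  struct stat st;
  bool name_gone = (stat(path.c_str(), &st) != 0);

  bool ok = write(fd, payload.data(), payload.size()) ==
            static_cast<ssize_t>(payload.size());
  ok = ok && lseek(fd, 0, SEEK_SET) == 0;

  std::string readback(payload.size(), '\0');
  ok = ok && read(fd, readback.data(), readback.size()) ==
                 static_cast<ssize_t>(readback.size());

  close(fd);  // the kernel frees the storage here, or at process exit
  return name_gone && ok && readback == payload;
}
```

The limitation is that nothing can reopen the file by name, so this only works for code that holds on to the descriptor.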

The contract here is that PropertyGraph has a storage prefix that it is free to do whatever it wants with. So I don't think we will be able to use those sorts of temporary files without invasive changes to the storage layer.

OK. We might need to set up a signal handler to do this work then, since destructors don't run if the program crashes. But for now we should just document that our handling of temporary files is incomplete and that we will leak temp files in a number of cases. One case, BTW, is Python interpreter exit with a live graph reference in the global scope. Python finalizers are not guaranteed to be called in that case, so C++ destructors may not be either. This could very easily happen when someone restarts their Jupyter "kernel" (Python interpreter). The result could be a new set of temp files every time they restart the kernel.

tylershunt · 2022-03-10T23:18:54Z

I have two thoughts:

* this could be a lot simpler if we just designated an ephemeral storage prefix and then used some kind of recursive remove to just nuke that when we die (becomes a check in the commit path for "is this in the ephemeral place")
* I would rather the above be what happens when you call Make without a URI prefix.
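
The commit-path check could look roughly like the sketch below. It uses local `std::filesystem` paths as a stand-in for URIs, and `IsUnder` and `CommitAllowed` are hypothetical names, not functions from libtsuba or libsupport.

```cpp
#include <algorithm>
#include <filesystem>

namespace fs = std::filesystem;

// True if `candidate` is `root` itself or lies underneath it.
// Compares whole path components, so "/tmp/eph-other" is not under "/tmp/eph".
bool IsUnder(const fs::path& root, const fs::path& candidate) {
  fs::path r = root.lexically_normal();
  fs::path c = candidate.lexically_normal();
  // A trailing slash leaves an empty final component; drop it.
  if (!r.has_filename()) r = r.parent_path();
  // `root` is a prefix exactly when every one of its components matched.
  return std::mismatch(r.begin(), r.end(), c.begin(), c.end()).first == r.end();
}

// Hypothetical commit-path guard: refuse targets inside the ephemeral area.
bool CommitAllowed(const fs::path& ephemeral_root, const fs::path& target) {
  return !IsUnder(ephemeral_root, target);
}
```

Comparing components rather than raw strings avoids treating a sibling directory with a similar name as part of the ephemeral area. The comparison is purely lexical, so symlinks are not resolved.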

scober · 2022-03-10T23:26:51Z

> * this could be a lot simpler if we just designated an ephemeral storage prefix and then used some kind of recursive remove to just nuke that when we die (becomes a check in the commit path for "is this in the ephemeral place")

That is true. One downside is that in that model we don't nuke the ephemeral place unless we die so we could potentially build up a lot of cruft there.

> * I would rather the above be what happens when you call `Make` without a URI prefix.

MakeEphemeral has the same arguments as the current Makes that don't accept a URI. My goal is to remove those Makes and replace their usage with MakeEphemeral because that makes it clear to callers that they are doing something different.

tylershunt · 2022-03-11T22:05:31Z

> One downside is that in that model we don't nuke the ephemeral place unless we die so we could potentially build up a lot of cruft there.

If that's a concern, we could choose random subdirectories of that prefix, each holding the parts of one property graph, and automatically remove a subdirectory when its property graph goes away (leaving the global cleanup to library unload, so that we don't have to know about all the live ephemeral objects when we die).

But the check in libtsuba would be just as simple.
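
A rough sketch of that hybrid, again with local directories standing in for storage prefixes. `ScopedSubdir` and `EphemeralRoot` are hypothetical names, and destroying the `EphemeralRoot` object stands in for "library unload".

```cpp
#include <filesystem>
#include <random>
#include <string>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Removes one per-graph subdirectory when it goes out of scope.
class ScopedSubdir {
public:
  explicit ScopedSubdir(fs::path dir) : dir_(std::move(dir)) {
    fs::create_directories(dir_);
  }
  ScopedSubdir(const ScopedSubdir&) = delete;
  ScopedSubdir& operator=(const ScopedSubdir&) = delete;
  ~ScopedSubdir() {
    std::error_code ec;
    fs::remove_all(dir_, ec);  // best effort
  }
  const fs::path& path() const { return dir_; }

private:
  fs::path dir_;
};

// Owns the shared root; destroying it sweeps anything still left underneath,
// so it does not need to track the live per-graph subdirectories.
class EphemeralRoot {
public:
  explicit EphemeralRoot(fs::path root) : root_(std::move(root)) {
    fs::create_directories(root_);
  }
  EphemeralRoot(const EphemeralRoot&) = delete;
  EphemeralRoot& operator=(const EphemeralRoot&) = delete;
  ~EphemeralRoot() {
    std::error_code ec;
    fs::remove_all(root_, ec);
  }

  // Picks a random name under the root; collision handling is omitted.
  fs::path NewSubdirPath() {
    return root_ / ("graph-" + std::to_string(rng_()));
  }

private:
  fs::path root_;
  std::mt19937_64 rng_{std::random_device{}()};
};
```

Each graph would hold a `ScopedSubdir`, which keeps cruft from accumulating during a long-running process, while the root sweep catches subdirectories whose owners were never destroyed. Neither destructor runs if the process crashes.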

tylershunt · 2022-03-11T22:07:20Z

> My goal is to remove those Makes and replace their usage with MakeEphemeral because that makes it clear to callers that they are doing something different.

This part confused me. Callers of Make without any mention of a storage location expected their graphs to be stored somehow?

scober · 2022-03-11T22:11:04Z

> One downside is that in that model we don't nuke the ephemeral place unless we die so we could potentially build up a lot of cruft there.
>
> If that's a concern, we could choose random subdirectories of that prefix, each holding the parts of one property graph, and automatically remove a subdirectory when its property graph goes away (leaving the global cleanup to library unload, so that we don't have to know about all the live ephemeral objects when we die).
>
> But the check in libtsuba would be just as simple.

This is actually the current state of this PR (sort of). I don't know if this is an actual concern but since I had already written some code to manage per-graph locations I implemented exactly the hybrid you described. I didn't actually write the signal handler to do clean up but I can tack that onto this PR before I merge.

scober · 2022-03-11T22:14:51Z

> My goal is to remove those Makes and replace their usage with MakeEphemeral because that makes it clear to callers that they are doing something different.
>
> This part confused me. Callers of Make without any mention of a storage location expected their graphs to be stored somehow?

I don't think it is impossible that we would get this complaint. But I don't feel all that strongly and it is definitely a cleaner interface to have all the Make functions have the same name.

amberhassaan · 2022-03-15T16:44:52Z

I'm still confused as to why we need Ephemeral graphs. Why isn't an in-memory PropertyGraph enough? @scober @arthurp

arthurp · 2022-03-15T17:30:59Z

> I'm still confused as to why we need Ephemeral graphs. Why isn't an in-memory PropertyGraph enough? @scober @arthurp

Because we need to be able to return ephemeral graphs to the remote API as handles and we cannot guarantee that the workers will still be running when the next request related to the graph comes in. So the ephemeral graphs need to be persisted for the client session or something like that.

scober · 2022-03-15T17:35:42Z

> I'm still confused as to why we need Ephemeral graphs. Why isn't an in-memory PropertyGraph enough? @scober @arthurp

There is no way to keep a PropertyGraph object from trying to write to storage. That is true in principle now but it is becoming more and more true in practice as well (see MemorySupervisor for an example of this).

So the proper (but maybe a little snotty) answer to your question is that there is no such thing as an in-memory PropertyGraph. We can hide the existence of ephemeral graphs from users more than I am in the current implementation by not using the word "Ephemeral" in the Make function name, but the current design is that every PropertyGraph has an associated writable storage location.

amberhassaan · 2022-03-15T20:02:06Z

So is the idea with MemorySupervisor that we can unload a graph to storage to reduce our in-memory footprint?

tylershunt

My preference would be to drop Ephemeral from the name of the factory function and just promote the Make variant without a path to be what (I think) everyone expected it was anyway.

amberhassaan · 2022-03-17T17:21:48Z

libgraph/include/katana/PropertyGraph.h

+
+  /// Make a property graph from topology and associate it with an ephemeral
+  /// storage prefix. This approximates an in-memory graph.
+  static Result<std::unique_ptr<PropertyGraph>> MakeEphemeral(


I support Tyler's suggestion of keeping the name Make

amberhassaan

All good from my side.

scober · 2022-03-17T18:15:19Z

> So is the idea with MemorySupervisor that we can unload a graph to storage to reduce our in-memory footprint?

Yes. In particular MemorySupervisor expects to be able to write certain large objects (I think just property arrays for now) to storage. So it may never persist the whole graph but it will want to write files.

scober requested review from arthurp, amberhassaan, tylershunt and e-mcginnis March 10, 2022 20:42

scober force-pushed the fix/storage-prefix branch from 159a409 to 2719728 Compare March 10, 2022 20:56

arthurp reviewed Mar 10, 2022

libsupport: URI::MakeTempDir()

be231f7

Defines a system-wide policy for choosing a temporary directory.

scober force-pushed the fix/storage-prefix branch from 2719728 to 056f20d Compare March 11, 2022 20:38

simon added 4 commits March 11, 2022 14:51

libsupport: URI prefix checking

597f6ad

Add methods to URI to check if one URI is a prefix of another.

libtsuba: EphemeralStoragePrefix

4846cb5

A utility class that wraps a storage prefix and deletes all files under that prefix when it is destroyed.

libtsuba/libgraph: MakeEphemeral

4c1f8a3

RDG and PropertyGraph both now provide a MakeEphemeral(), which creates a graph that is backed by an ephemeral storage location and approximates an in-memory graph.

refactor: stop creating PropertyGraphs with no storage prefix

9163154

Some instances of this behavior can be replaced with MakeEphemeral() and some can be replaced with a call that provides a storage prefix.

scober force-pushed the fix/storage-prefix branch from 056f20d to 9163154 Compare March 11, 2022 20:51

tylershunt approved these changes Mar 16, 2022

amberhassaan reviewed Mar 17, 2022

amberhassaan approved these changes Mar 17, 2022
