Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State synthesis using quantum state aggregator #2637

Draft
wants to merge 61 commits into
base: main
Choose a base branch
from

Conversation

annagrin
Copy link
Collaborator

Description

State synthesis alternative implementation using state aggregator before argument conversion:

Add a stage to argument synthesis for quantum devices, StateAggregator, that

  1. processes the arguments and collects all the states and kernel names and their arguments from there recursively
  2. adds all the new init and num_qubit functions in there for the kernels collected to the module

Basically, the state synthesis would be split into 4 parts:

  1. create all the init and num_qubits functions for all states in StateAggregator
  2. create substitution quake.get_state @num_qubits @init for the cudaq::state* argument in ArgumentConverter
    run synthesis as usual
  3. replace quake.get_state instruction used in quake.inist_state quake.get_number_of_qubits in ReplaceStateByKernel pass by calls to the init and get_qubits functions

The user code (in RemoteBaseRESTQPU.h) then

  1. Creates a state aggregator from top level kernel and its args
  2. gets the list of kernels and their arguments from it
  3. creates a list of argument converters for each kernel+args
  4. uses the converters to create all substs
  5. Calls argument synthesis with the references to the kernel names and substs

bmhowe23 and others added 30 commits October 17, 2024 14:33
I, Ben Howe <[email protected]>, hereby add my Signed-off-by to this commit: 86681ef

Signed-off-by: Ben Howe <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
I, Anna Gringauze <[email protected]>, hereby add my Signed-off-by to this commit: 9563371

Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
…antum-state-aggregator

Signed-off-by: Anna Gringauze <[email protected]>
@annagrin annagrin marked this pull request as draft February 18, 2025 22:56
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Signed-off-by: Anna Gringauze <[email protected]>
Copy link
Collaborator

@schweitzpgi schweitzpgi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about the Python issues with all of this and it's really not clear that this is the right approach in that context.

Specifically, in order to achieve reasonable performance in python a kernel is effectively pre-compiled to (call it) assembly code. Then when that kernel is launched (from a call-site), very little overhead is incurred. Arguments are marshalled and the pre-compiled assembly code is run.

For state handling then we have these cases:

  1. It's a simulation, so a raw pointer to the data (same address space) can be forwarded to the simulator layer to do what it needs to do. Marshalling is a simple copy of the pointer.
  2. The simulation is happening in another address space (server), so the marshalling involves transmitting the data with the kernel or independently from the client.
  3. The data isn't really on the client, which requires the server runtime to have the smarts of figuring out what state data to send to what kernel when they are launched. (I don't know if that was designed or vetted.)
  4. There is no data and the "state data" is really calls to (chains) of other kernels. In this case, it's too late for Python since the assembly code doesn't contain those calls and was built without them. One possible design is for the assembly code to call back to the runtime with a state object encoding (TBD) that allows the runtime to launch the kernels such that the qubit set is initialized into some state. But the point is we wouldn't be doing argument synthesis here as the assembly code is already compiled. Specifically, the call
  Array* a = __quantum__rt__qubit_allocate_array_with_cudaq_state_ptr(state_ptr);

would itself understand how to unmarshall the data implied by the state_ptr and return a set (array) of qubits from running the implied kernels itself. This pushes almost the entire problem to be as late as possible and not something the compiler (at least in Python) can reason about.

@@ -313,4 +313,10 @@ def AnyStateInitLike : TypeConstraint<cc_PointerType.predicate,
"state initializer types">;
def AnyStateInitType : Type<AnyStateInitLike.predicate, "initial state type">;

def AnyStatePointerType : Type<
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll put up a PR to make pointer element type constraints easy to specify. (I thought I'd already done so, but I don't see it on main.)

#2640

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@@ -1418,7 +1418,7 @@ def quake_CreateStateOp : QuakeOp<"create_state", [Pure]> {
cc_PointerType:$data,
AnySignlessInteger:$length
);
let results = (outs cc_PointerType:$result);
let results = (outs AnyStatePointerType:$result);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let results = (outs AnyStatePointerType:$result);
let results = (outs PointerOf<[cc_StateType]>:$result);

with the above PR merged, this can be used instead.

"Replace `quake.init_state` instructions with call to the kernel generating the state";
let description = [{
This optimization replaces `quake.init_state`, `quake.get_number_of_qubits`,
and `quake.get_state` operations invoked on state pointers during argument
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't possible to call cudaq::get_state from within a kernel. But that isn't what this op means, so it is a confusing overloading of terms. Can we rename quake.get_state to something like quake.materialize_state?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

#include "mlir/IR/PatternMatch.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
#include "mlir/Transforms/Passes.h"
#include <span>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::span not used; not needed?

/// %0 = quake.get_state @callee.num_qubits_0 @callee.init_0 : !cc.ptr<!cc.state>
/// %1 = quake.get_number_of_qubits %0 : (!cc.ptr<!cc.state>) -> i64
/// ───────────────────────────────────────────
/// ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// ...

PatternRewriter &rewriter) const override {

auto stateOp = numQubits.getOperand();
if (auto getState = stateOp.getDefiningOp<quake::GetStateOp>()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typically we want the then block to emit the error and the fall-through code to be the "winning" case that does the transformation.

/// Replace `quake.init_state` by a call to a (modified) kernel that produced
/// the state.
///
/// ```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// ```
/// ```mlir

I believe that is legal in doxygen. It seems to be used in other files.

/// %0 = quake.get_state @callee.num_qubits_0 @callee.init_0 : !cc.ptr<!cc.state>
/// %3 = quake.init_state %2, %0 : (!quake.veq<?>, !cc.ptr<!cc.state>) -> !quake.veq<?>
/// ───────────────────────────────────────────
/// ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// ...

We can elide these triple dots in these transformation comments.

auto *ctx = &getContext();
auto func = getOperation();
RewritePatternSet patterns(ctx);
patterns.insert<ReplaceGetNumQubitsPattern, ReplaceInitStatePattern>(ctx);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just the 2 patterns.

Comment on lines +110 to +111
// TODO: add an option to use the kernel info if available, i.e. for
// remote simulators
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a clearer definition of what we mean by "remote simulators". There are several possibilities and I'm not sure what the issues are with this.

  1. There are no local processes. CPU and QPU code run on server (there) together. In this case, the simulator likely shares the same address space as the CPU code, so it can dereference a pointer (like from a span) and read whatever state data there is. There is no need to clone anything, since the pointer at the time of the call must be valid or the program is on the verge of issuing a SIGSEGV.
  2. CPU code runs on client (here) and QPU code runs on the server (there). In this case, the QPU cannot be responsible for cloning an object that resides on the client. The state must be passed to the server as a copy.
  3. Hybrid model of CPU runs here and QPU runs there, but rather than bundle the kernel along with the state data, they are sent separately. Here again, the server must have the ability to receive and associate the state data to a specific kernel. Furthermore, it must be able to correctly construct the state data in the server's address space such that it will use the calling conventions of the kernel in that address space. This may or may not require the server runtime to make a copy of the data. But if it does make a copy, the kernel itself would not manage that data or make yet another copy. It merely has to pass the unowned data on to the simulator.

One of the performance sticking points with all of this is trying to avoid making copies of state data. Hence it seems like we want the kernel design to simply refuse to accept anything other than unowned references to state data. The runtime machinery and the simulator will be then be 100% responsible for making the copies (and the poor performance that entails).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remote simulators here mean host and kernel code executed in different processes, and the kernel is executed on a simulator. I'll update the comment. Synthesis for those scenarios currently inline the state data in the kernel, but we could use the alternative option from this PR as well, if the kernel data is available. I was thinking about trying it out via the option in question. This "option" addition is not part of the PR, just something to think about in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the comment for now as the issue is unrelated to the PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants