Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding multi-device resources RFC. #137

Merged
merged 1 commit into from
Apr 13, 2023

Conversation

jhmueller-huawei
Copy link
Contributor

From #120

Signed-off-by: Joerg H. Mueller <[email protected]>
@moudgils moudgils merged commit 74f9471 into o3de:main Apr 13, 2023
@moudgils
Copy link
Contributor

Thanks

@martinwinter-huawei
Copy link

While we are in the midst of submitting our series of PRs, which introduce multi-device versions of our resource classes, we wanted to get feedback on two questions that would allow us to more easily continue with our final step, which is then the integration of the multi-device resources properly in o3de.

Naming Convention

The first questions relates to naming resource classes.

  1. Our current state retains single-device resources with their current name (e.g. a single-device buffer is RHI::Buffer) and introduces new multi-device resources with the prefix MultiDevice (e.g. a multi-device buffer is RHI::MultiDeviceBuffer).
    This would mean that with the introduction of multi-device resources on the RPI level and above, we need to rename all instances there. On the other hand, the backends would stay mostly the same (except for classes like the FrameGraph that needs to be altered anyways).
  2. Another way of doing it would be to instead rename the single-device variants (for example to SingleDevice*, e.g. RHI::SingleDeviceBuffer) while reusing the existing names used throughout the codebase for the multi-device versions (e.g. a multi-device buffer would now be RHI::Buffer).
    This would reduce the changes on the RPI level and above (though there are still plenty of changes related to passing device masks instead of the device itself), while the backends would see more changes.
  3. A last option would be to rename both, i.e. RHI::SingleDeviceBuffer and RHI::MultiDeviceBuffer for the buffer class, which probably be the clearest option in conveying the meaning best, but also would require the largest amount of code changes.

In our opinion, there is no clear choice here, but we wanted to get some feedback early which could reduce the cleanup procedure needed if name changes are then required when pushing this as a PR.

PR Structure

The second questions relates to structuring the final PR, which would introduce multi-device resources at the RPI level and above.
We would certainly acknowledge that having one (very) large PR is really hard to properly review, but with this final PR, there are a number of challenges when trying to split it up:

  • There are more dependencies between the resource classes that need to be kept in mind (e.g. SRGs need buffers, images, *Items etc. to function and we have seen before in our previous attempts that there are many such entanglements that are difficult or even impossible to untangle)
  • The framegraph changes would have to wait until everything is done, which also would mean that everything first would be setup as multi-device but would always have to be changed back to single-device when submitting to the CommandList's and the final commit would then revert all of that back

Therefore, we would like to get some other opinions on how to proceed, if we should go the route of more effort and less clean but submitted in smaller bites or with the less effort and more clean but large PR route.

@moudgils
Copy link
Contributor

moudgils commented Jun 16, 2023

Naming Convention - I actually prefer option 1 here. Basically we expose RHI::MultiDevicebuffer which internally contains RHI::Buffer which is assumed to be on a single device. It will unfortunately mean more changes at RPI when we transition to it. We are essentially saying that RPI can use RHI:: < ResourceType > or RHI::MultiDevice < ResourceType > to implement it's pipelines and both should work (in theory). If you want to take no perf impact then you could do something like this in RPI -> if your pipeline is not built for multiple device then could use RHI:: < ResourceType > and switch to RHI::MultiDevice < ResourceType > if it is built for multiple devices. Although it may make RPI quite a bit ugly.

PR Structure - Could we split RPI level work into multiple draft PRs split across different local branches that can go through PR process of getting 2 approvals. Once all the draft PRs have 2 approvals you create a big proper PR for the development branch that will hopefully have very few artifacts and should get the approvals needs and we can then run AR at that time.

@jhmueller-huawei
Copy link
Contributor Author

Naming Convention - I actually prefer option 1 here. Basically we expose RHI::MultiDevicebuffer which internally contains RHI::Buffer which is assumed to be on a single device. It will unfortunately mean more changes at RPI when we transition to it. We are essentially saying that RPI can use RHI:: < ResourceType > or RHI::MultiDevice < ResourceType > to implement it's pipelines and both should work (in theory). If you want to take no perf impact then you could do something like this in RPI -> if your pipeline is not built for multiple device then could use RHI:: < ResourceType > and switch to RHI::MultiDevice < ResourceType > if it is built for multiple devices. Although it may make RPI quite a bit ugly.

I think there may be some misunderstandings here? This point is just regarding the naming convention, not the use of the changed API at the RPI level. If I consider the perspective of various developers on the three options, I would judge the changes as follows:

  • Option 1 results in the least amount of changes in the RHI and backend code, but more changes in the RPI and higher up. For reviewers and developers of the RHI (like you), I understand that this is the preferred option as it leads to less disruption in the RHI code, since there won't be a commit/change that renames all single device classes and the names of the existing classes stay as they are.
  • Option 2 is kind of the reverse of option 1 and results in the least amount of changes in the RPI and higher up, but more changes in the RHI and backend code. This would be advantageous for developers using Atom, the RPI and higher level code, in O3DE directly, in Gems or in game projects. This option would, from the perspective of those higher levels, only change the API slightly (needing a device mask instead of a device, no renaming of types).
  • Option 3 changes both, higher and lower levels. The advantage here is that the introduction of multi-device resources is made very clear to everyone with the disadvantage of most code changes.

Martin tried to write his question without posting an opinion on these options, which I appreciate, to keep an open mind about the discussion, but with this I want to weigh in here: I think that option 2 would probably be best, since it affects the least amount of people, i.e., the developers using Atom in their projects. On the other hand, if we want to make the changes very clear instead, I'd go for option 3. Option 1 would be my least favorite option even though I acknowledge, that it would be the easiest to get through review, but it is not my goal to get code through as easily as possible. I'd rather get code through that is as good and usable as possible.

PR Structure - Could we split RPI level work into multiple draft PRs split across different local branches that can go through PR process of getting 2 approvals. Once all the draft PRs have 2 approvals you create a big proper PR for the development branch that will hopefully have very few artifacts and should get the approvals needs and we can then run AR at that time.

Not sure if that works, since even if we split the PRs somehow, they will depend heavily on each other. It may work, if we can create the merge requests in a way that they will be merged into each others branches.

@moudgils
Copy link
Contributor

moudgils commented Jun 16, 2023

Naming convention - To be honest based on our previous discussions I thought we were going down the Option 2 route but when the last big PR was created I figured we changed direction a little towards option 1 and didn't mind it as I was thinking we could make it work.

With option 1 the changes at RPI may not be as disruptive as it may seem at first. I figured we would try an ensure the RHI:: < ResourceType > for single device would work as well as RHI::MultiDevice < ResourceType > and we can take our time transitioning from one to another (within RPI) so it would be backwards compatible. The only con here is that now RHI needs to support both RHI:: < ResourceType > and RHI::MultiDevice < ResourceType > which will be more work and more risky. So you can create a frame graph with both RHI:: < ResourceType > and RHI::MultiDevice < ResourceType >.

Given that option 2 was our previously discussed solution I am ok with going back to option 2 (unless @akioCL or @VickyAtAZ has any other objections). It would just mean that RPI will automatically get multi device support for each resource type under the hood and we can also even ensure that it doesn't ever has to access SingleDevice < ResourceType >.

PR Structure - Yeah my proposed solution is not ideal. I was hoping we could create draft PRs to help address a lot of code style/guidelines/comments related minor artifacts before we create one big PR. If we go with option 1 this will become less of an issue.

@akioCL
Copy link
Contributor

akioCL commented Jun 16, 2023

I actually prefer option 2. The way I see it, all device resources are "multi device capable". An RHI::Texture for me is a resource that could be on one or multiple devices (depending on device mask at creation). So from outside the RHI, they would see this resource, that has the capacity of existing on multiple devices.

Option 2 is the less disruptive, since it keeps everything "as is" from the perspective of RHI clients. Option 1 is a breaking change that would affect any code that handle an RHI::Resource. This is mostly the RPI, but Gems and Samples (and maybe projects) hold references to RHI::Resources.

We still have some weird cases, like what would you do if you "map" a buffer that exist on multiple devices? Which memory you return? Do we use a staging buffer that gets copied to the multiple version in each device? Do we allow access to the single device version of a resource from outside of the RHI? or do we abstract everything through the multi device layer?

For the initial transition to resources with the capability of multi device, all function calls would be forward to the single device resource (until we actually have multiple devices registered to one resource), so the overhead would be one extra call. Even calls as "map" we would just call them on the only device resource until we have a better solution. This would in theory keep everything working without changes outside the RHI.

@jhmueller-huawei
Copy link
Contributor Author

jhmueller-huawei commented Jun 19, 2023

Great, we will go with option 2 then. We will rename the MultiDevice classes along the way but for the current PRs they have to be named differently of course. We just wanted to clarify again, since we didn't really finalize the decision last time. And yes our plan is to have multi device capable objects only in the RPI, otherwise everything would get too complicated.

Regarding the transition of the RPI code, we will create merge requests as small as possible and have you review them. We will try to do this already during the development to speed things up. There we can also discuss in detail about what to do with those weird cases like mapping a buffer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants