njhollinghurst
Contributor

Allocate space for level-2 IOMMU translation tables on demand. This should save memory in most cases but means that map_pages() can now fail with -ENOMEM. Unused pages are retained for re-use.

Add OF properties to override the default aperture size (2GB) and base address (40GB); this doesn't include any dma-iova-offset.

Various tidy-ups. The internal representation of aperture limits does include dma_iova_offset, as that is more commonly useful. Clarify the distinction between Linux pages and IOMMU table pages. Fix wrong definition of MMMU_CTRL_PT_INVALID_EN flag.


Signed-off-by: Nick Hollinghurst <[email protected]>

This is largely to test a previous change that made IOMMU aperture
configurable and allocated lazily; it may be useful in its own right.
We expect IOMMU2 to be well-utilized e.g. when using 64MPix cameras.

Signed-off-by: Nick Hollinghurst <[email protected]>
@njhollinghurst
Contributor Author

Would anyone like to review this? I know it's quite a sprawling change...
Or critique the "lazy allocation / never free" strategy? @pelwell @6by9 @popcornmix @jc-kynesim

The most visible difference is that it should save about 4MBytes of memory in most cases. With the second commit it ought to cope better with 2 x 64MPixel cameras, though I haven't tested that.

@jc-kynesim
Contributor

You end up allocating many IOMMU pages individually (presumably 1024 of them by the end). Would there be any benefit in trying to allocate in bigger chunks, or is that just pointlessly tricky?

@njhollinghurst
Contributor Author

That's what the old version did, using the scatter-gather-list style of DMA mapping (so as not to require contiguous memory).

I suppose one could allocate in chunks that were bigger than one Linux page but still smaller than the total. I did wonder about this. But introducing yet another "page/chunk size" would make the code even messier.

One 16K Linux page is enough for 16M of IOVA, which didn't seem ridiculously small.

@jc-kynesim
Contributor

You seem to have got rid of all the dma_sync_sgtable_for_cpu calls that were previously in the code. Are those now done in some other function, or have they just gone? Having just been reminded by upstream of their importance, I feel that they should be included even though they probably end up doing nothing in the current world.

@njhollinghurst
Contributor Author

Yes, that (along with the assumption that dma_addr == phys_addr) was copied from another IOMMU driver, but it might not have been one for ARM64 so perhaps not a good example.

Might redundant/repeated sync calls be even worse than not having them (e.g. if they invalidate wanted cache contents)? If so, it might require keeping track of the current sync status of each page. Bother.

I see that this driver does it for every single write! Maybe I should do that...

@njhollinghurst njhollinghurst marked this pull request as draft August 22, 2025 09:18
@jc-kynesim
Contributor

My understanding is that it should always look like this when writing (if the memory is cached - if not, no-one cares):

dma_sync_sgtable_for_cpu(ptr, DMA_TO_DEVICE)
<fill table at ptr with stuff>
dma_sync_sgtable_for_device(ptr, DMA_TO_DEVICE)
<do whatever is needed for the h/w to read the stuff just written>

You can add as much as you like between the syncs, but nothing that will be read by the h/w until after the sync_for_device. As it happens, dma_sync_sgtable_for_cpu(ptr, DMA_TO_DEVICE) is a no-op on most architectures including arm64, so this is all a bit hypothetical; but one could conceive of an optimization that skips the dma_sync_sgtable_for_device when there was no preceding dma_sync_sgtable_for_cpu, so the pair is probably worth having.

AFAIK the calls should always be paired. It is theoretically either wrong or pointless to do otherwise, though in practice you will get away with it if you pick the correct call and direction.

Move DMA sync calls into map()/unmap() functions and ensure
they are balanced.

Signed-off-by: Nick Hollinghurst <[email protected]>
@njhollinghurst
Contributor Author

I've moved dma_sync_*_for_*() into the map and unmap calls, which makes it easier to ensure they are balanced.

Also removed a couple of redundant ones from probe, because DMA documentation assures me that a newly-created mapping will be synchronized for the device.

@jc-kynesim
Contributor

That looks good to me

@njhollinghurst
Contributor Author

njhollinghurst commented Aug 26, 2025

Unfortunately something is intermittently going wrong now that I've switched back from VEC to HDMI. Possibly because the latter uses multi-hugepage-sized coherent blocks, which weren't adequately tested earlier.

Heisenbug. Not necessarily caused by this change (though I bet it is somehow related).

if (dirty_top)
dma_sync_single_for_device(mmu->dev, virt_to_phys(&mmu->top_table),
PAGE_SIZE, DMA_TO_DEVICE);
if (lxp <= (p_last >> TABLES_LXPAGE_SHIFT)) {
Contributor Author

Oops, the shift probably should be LX_PAGEWORDS_SHIFT, as at line 287.

Contributor Author

Fortunately this never happens (and the wrong shift makes it happen even less).
