DM-41966: Add Butler.transfer_dimension_records_from API #921

Merged 5 commits on Dec 7, 2023
4 changes: 4 additions & 0 deletions doc/changes/DM-41966.feature.rst
@@ -0,0 +1,4 @@
* Added new API ``Butler.transfer_dimension_records_from()`` to copy dimension records out of some refs and add them to the target butler.
* This and ``Butler.transfer_from()`` now copy related dimension records as well as the records associated directly with the refs.
For example, if ``visit`` is being transferred, additional records such as ``visit_definition`` will also be copied.
This requires a full Butler and not a limited Butler (such as the one backed by a quantum graph).
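A minimal usage sketch of the new API (the repository paths, collection name, and dataset type below are illustrative, not from this PR):

from lsst.daf.butler import Butler

# Hypothetical repositories for illustration.
source_butler = Butler("/repo/embargo")
target_butler = Butler("/repo/main", writeable=True)

# Resolved refs from the source repository; expanded data IDs make the
# transfer faster because fewer registry queries are needed.
refs = list(source_butler.registry.queryDatasets("raw", collections="LATISS/raw/all"))

# Copy the dimension records implied by the refs (exposure, visit, ...)
# plus related records such as visit_definition.
target_butler.transfer_dimension_records_from(source_butler, refs)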
21 changes: 21 additions & 0 deletions python/lsst/daf/butler/_butler.py
@@ -1248,6 +1248,27 @@ def import_(
"""
raise NotImplementedError()

@abstractmethod
def transfer_dimension_records_from(
self, source_butler: LimitedButler | Butler, source_refs: Iterable[DatasetRef]
) -> None:
"""Transfer dimension records to this Butler from another Butler.

Parameters
----------
source_butler : `LimitedButler` or `Butler`
Butler from which the records are to be transferred. If data IDs
in ``source_refs`` are not expanded then this has to be a full
`Butler` whose registry will be used to expand data IDs. If the
source refs contain coordinates that are used to populate other
records then this will also need to be a full `Butler`.
source_refs : iterable of `DatasetRef`
Datasets defined in the source butler whose dimension records
should be transferred to this butler. In most circumstances,
transfer is faster if the dataset refs are expanded.
Member:
Might be better to make this an iterable of DataCoordinate, since that's what's actually holding the records and you can get that from Iterable[DatasetRef] with (ref.dataId for ref in source_refs).

And even then I think that's a bit strange for a method with this name; it sounds like it should be taking a bunch of dimension records and transferring those (as well as other less-obvious associated records). But of course that's not actually the interface we need right here, so maybe this is just a naming problem.

One option might be to make this a private method, if the goal is really to support transferring datasets. On the DMTN-249 prototype I wrote a little helper class that I hoped to use to unify the transfer-from and import_/export interfaces, by providing an abstraction over "a bunch of stuff you want to transfer self-consistently". I think we might need that in the transfer APIs to avoid a bunch of methods with names like transfer_dimension_records_from_given_dataset_refs.
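The conversion mentioned above, as a one-line sketch:

# Derive the suggested Iterable[DataCoordinate] from the refs.
data_ids = {ref.dataId for ref in source_refs}

This is in fact what the implementation below does internally before extracting records.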

Member Author:
This needs to be a public API because the embargo transfers need to be able to call it before they transfer the raws from embargo to public. They can't use Butler.transfer_from() for raws because raws are relocated to public storage outside of butler before being ingested again (but without having to go through calling ingest-raws again since they have refs already from the embargo repo and are building up the FileDataset objects). That's also why refs are the interface and not dataIds.
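A sketch of that embargo workflow under the assumptions stated above (relocated_uri is a hypothetical helper for the out-of-butler copy step; the ingest transfer mode depends on where the files end up):

from lsst.daf.butler import Butler, FileDataset

embargo = Butler("/repo/embargo")
public = Butler("/repo/main", writeable=True)

# refs were resolved earlier against the embargo repo.
public.transfer_dimension_records_from(embargo, refs)

# The raw files were already copied to public storage outside of butler,
# so build FileDataset objects pointing at the new locations and ingest
# them in place instead of calling Butler.transfer_from().
datasets = [FileDataset(path=relocated_uri(ref), refs=[ref]) for ref in refs]
public.ingest(*datasets, transfer="direct")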

Member:
Ok. I'm not thrilled with it, but if it's got a clear use case, go ahead, since the kind of generalization I want is even more design work (and I'd probably want to be even more cautious about releasing that half-baked), so we can cross the bridge of replacing this if and when we come to it.

"""
raise NotImplementedError()

@abstractmethod
def transfer_from(
self,
2 changes: 1 addition & 1 deletion python/lsst/daf/butler/dimensions/_universe.py
@@ -596,7 +596,7 @@ def getEncodeLength(self) -> int:

def get_elements_populated_by(self, dimension: Dimension) -> NamedValueAbstractSet[DimensionElement]:
"""Return the set of `DimensionElement` objects whose
`~DimensionElement.populated_by` atttribute is the given dimension.
`~DimensionElement.populated_by` attribute is the given dimension.
"""
return self._populates[dimension.name]
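For example (a sketch; the exact element names depend on the configured dimension universe):

universe = butler.dimensions
# Elements declared with populated_by: visit, e.g. visit_definition and
# visit_detector_region in the default universe.
for element in universe.get_elements_populated_by(universe["visit"]):
    print(element.name)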

143 changes: 126 additions & 17 deletions python/lsst/daf/butler/direct_butler.py
@@ -37,6 +37,7 @@
import collections.abc
import contextlib
import io
import itertools
import logging
import numbers
import os
@@ -1879,6 +1880,125 @@
else:
doImport(filename) # type: ignore

def transfer_dimension_records_from(
self, source_butler: LimitedButler | Butler, source_refs: Iterable[DatasetRef]
) -> None:
# Allowed dimensions in the target butler.
elements = frozenset(
element for element in self.dimensions.elements if element.hasTable() and element.viewOf is None
)

data_ids = {ref.dataId for ref in source_refs}

dimension_records = self._extract_all_dimension_records_from_data_ids(
source_butler, data_ids, elements
)

# Insert order is important.
for element in self.dimensions.sorted(dimension_records.keys()):
records = [r for r in dimension_records[element].values()]
# Assume that if the record is already present that we can
# use it without having to check that the record metadata
# is consistent.
self._registry.insertDimensionData(element, *records, skip_existing=True)
_LOG.debug("Dimension '%s' -- number of records transferred: %d", element.name, len(records))

def _extract_all_dimension_records_from_data_ids(
self,
source_butler: LimitedButler | Butler,
data_ids: set[DataCoordinate],
allowed_elements: frozenset[DimensionElement],
) -> dict[DimensionElement, dict[DataCoordinate, DimensionRecord]]:
primary_records = self._extract_dimension_records_from_data_ids(
source_butler, data_ids, allowed_elements
)
Member:
There's an assumption here that if the destination butler already has records for some of these elements, insertDimensionData(..., skip_existing=True) is both efficient and correct as a way to resolve any conflicts. That's a reasonable-enough assumption for it to be the default, but we might want to provide more control for advanced users, especially if that could avoid queries against the source butler.

Member Author:
Ok. For the populated_by records, wouldn't we have to query the target butler to see if they existed already?

Member:
I agree it's less important and trickier to let the user control those. If we expressed the user control as an "opt out" list of elements rather than an opt-in list we could probably still trust it, though.

Member Author:
So you want a skip_elements: list[str] | None = None parameter to be added to butler.transfer_from and butler.transfer_dimension_records_from so that people could say "no detector/instrument/physical_filter or no visit_detector_region" or something. I agree that if detector/instrument/physical_filter are not present in the target repo then you probably do want to run register-instrument first, although the transfer being told to skip them wouldn't explain to people why the transfer failed in this case.

Member:
Yes, that's about what I was thinking vaguely of. I don't care deeply about adding it on this ticket if you just want to get somebody else unblocked right now.
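A sketch of the opt-out being discussed (skip_elements is hypothetical and not part of this PR):

def transfer_dimension_records_from(
    self,
    source_butler: LimitedButler | Butler,
    source_refs: Iterable[DatasetRef],
    skip_elements: Iterable[str] | None = None,  # hypothetical opt-out
) -> None:
    skip = set(skip_elements or ())
    # Filter opted-out elements up front so that no records are extracted
    # from, or queried against, the source butler for those elements.
    elements = frozenset(
        element
        for element in self.dimensions.elements
        if element.hasTable() and element.viewOf is None and element.name not in skip
    )
    ...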


can_query = isinstance(source_butler, Butler)

additional_records: dict[DimensionElement, dict[DataCoordinate, DimensionRecord]] = defaultdict(dict)
for original_element, record_mapping in primary_records.items():
# Get dimensions that depend on this dimension.
populated_by = self.dimensions.get_elements_populated_by(
self.dimensions[original_element.name] # type: ignore
)

for data_id in record_mapping.keys():
for element in populated_by:
if element not in allowed_elements:
continue

if element.name == original_element.name:
continue

if element.name in primary_records:
# If this element has already been stored avoid
# re-finding records since that may lead to additional
# spurious records. e.g. visit is populated_by
# visit_detector_region but querying
# visit_detector_region by visit will return all the
# detectors for this visit -- the visit dataId does not
# constrain this.
# To constrain the query the original dataIds would
# have to be scanned.
continue


if not can_query:
Member Author:
Does it make sense to have a flag to not try to copy the derived records? This will break if a quantum graph is used as the source butler but I think that's fine because we shouldn't be enabling records transfer from the graph back to the primary butler because it's got all the records by definition.

Member:
I think there's also a use case for making a new repo from a QG, but I'd prefer to explicitly add that (and control what the interface for it is) rather than accidentally make it work one way and then have to continue to support that way. So I don't think we need that flag for that reason.

I am a bit more worried about cases where somebody intentionally does not want to transfer something like detector because they'd rather assert that someone has done butler register-instrument correctly on the destination repo, but I think that's an argument for being able to control the elements being transferred as per a previous PR comment.

Member Author:
I have just realized that the butler transfer-from-graph command lets you specify whether to copy dimension records or not. This now breaks things because of the populated_by follow-up query. We seem to have the following options:

  • if we can't query the source butler we issue a warning and transfer what we have (see the sketch after this list).
  • if we can't query the source butler we query the target butler and if those populated_by records are found we return without complaint.
  • in the future we add the related records to the graph and add querying of records to the Graph Butler.
  • we remove the transfer dimensions option from the transfer-from-graph command (or change the default to False and currently raise if True). The default likely should be true anyhow since in all our cases we are transferring back to a butler that created the graph.
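A sketch of the first option (warn and transfer what we have), for illustration only:

if not can_query:
    # Downgrade the failure to a warning when the source butler (e.g. one
    # backed by a quantum graph) cannot be queried for populated_by
    # records, and keep only the records already in hand.
    _LOG.warning(
        "Cannot query source butler for %s records; transferring only "
        "records already attached to the given data IDs.",
        element.name,
    )
    continue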

raise RuntimeError(

f"Transferring populated_by records like {element.name} requires a full Butler."
)

records = source_butler.registry.queryDimensionRecords( # type: ignore
element.name, **data_id.mapping # type: ignore
)
Member:
I think this is probably as efficient as we can make it now, but it will be really helpful when we can upload tables of data IDs to the query methods and join against them, both for efficiency and for simplifying the logic here (which is effectively a bunch of table joins written in Python). Right now I think we'd better hope this almost never gets called.

for record in records:
additional_records[record.definition].setdefault(record.dataId, record)

# The next step is to walk back through the additional records to
# pick up any missing content (such as visit_definition needing to
# know the exposure). Want to ensure we do not request records we
# already have.
missing_data_ids = set()
for name, record_mapping in additional_records.items():
for data_id in record_mapping.keys():
if data_id not in primary_records[name]:
missing_data_ids.add(data_id)

# Fill out the new records. Assume that these new records do not
# also need to carry over additional populated_by records.
secondary_records = self._extract_dimension_records_from_data_ids(
source_butler, missing_data_ids, allowed_elements
)

# Merge the extra sets of records in with the original.
for name, record_mapping in itertools.chain(additional_records.items(), secondary_records.items()):
primary_records[name].update(record_mapping)

return primary_records

def _extract_dimension_records_from_data_ids(
self,
source_butler: LimitedButler | Butler,
data_ids: set[DataCoordinate],
allowed_elements: frozenset[DimensionElement],
) -> dict[DimensionElement, dict[DataCoordinate, DimensionRecord]]:
dimension_records: dict[DimensionElement, dict[DataCoordinate, DimensionRecord]] = defaultdict(dict)

for data_id in data_ids:
# Need an expanded record; if not expanded then we need a full
# butler with registry (allow mocks with registry too).
if not data_id.hasRecords():
if registry := getattr(source_butler, "registry", None):
data_id = registry.expandDataId(data_id)
else:
raise TypeError("Input butler needs to be a full butler to expand DataId.")

# If this butler doesn't know about a dimension in the source
# butler things will break later.
for element_name in data_id.dimensions.elements:
record = data_id.records[element_name]
if record is not None and record.definition in allowed_elements:
dimension_records[record.definition].setdefault(record.dataId, record)

return dimension_records

def transfer_from(
self,
source_butler: LimitedButler,
@@ -1972,30 +2092,19 @@
if element.hasTable() and element.viewOf is None
)
dataIds = {ref.dataId for ref in source_refs}
# This logic comes from saveDataIds.
for dataId in dataIds:
# Need an expanded record, if not expanded that we need a full
# butler with registry (allow mocks with registry too).
if not dataId.hasRecords():
if registry := getattr(source_butler, "registry", None):
dataId = registry.expandDataId(dataId)
else:
raise TypeError("Input butler needs to be a full butler to expand DataId.")
# If this butler doesn't know about a dimension in the source
# butler things will break later.
for element_name in dataId.dimensions.elements:
record = dataId.records[element_name]
if record is not None and record.definition in elements:
dimension_records[record.definition].setdefault(record.dataId, record)
dimension_records = self._extract_all_dimension_records_from_data_ids(
source_butler, dataIds, elements
)

handled_collections: set[str] = set()

# Do all the importing in a single transaction.
with self.transaction():
if dimension_records:
_LOG.verbose("Ensuring that dimension records exist for transferred datasets.")
for element, r in dimension_records.items():
records = [r[dataId] for dataId in r]
# Order matters.
for element in self.dimensions.sorted(dimension_records.keys()):
records = [r for r in dimension_records[element].values()]
# Assume that if the record is already present that we can
# use it without having to check that the record metadata
# is consistent.
6 changes: 6 additions & 0 deletions python/lsst/daf/butler/remote_butler/_remote_butler.py
@@ -407,6 +407,12 @@ def import_(
# Docstring inherited.
raise NotImplementedError()

def transfer_dimension_records_from(
self, source_butler: LimitedButler | Butler, source_refs: Iterable[DatasetRef]
) -> None:
# Docstring inherited.
raise NotImplementedError()

def transfer_from(
self,
source_butler: LimitedButler,
5 changes: 5 additions & 0 deletions tests/test_butler.py
@@ -2308,6 +2308,11 @@ def assertButlerTransfers(self, purge: bool = False, storageClassName: str = "St
# Can remove with DM-35498.
self.target_butler.registry.refresh()

# Transfer the records for one ref to test the alternative API.
with self.assertLogs(logger="lsst", level=logging.DEBUG) as log_cm:
self.target_butler.transfer_dimension_records_from(self.source_butler, [source_refs[0]])
self.assertIn("number of records transferred: 1", ";".join(log_cm.output))

# Now transfer them to the second butler, including dimensions.
with self.assertLogs(logger="lsst", level=logging.DEBUG) as log_cm:
transferred = self.target_butler.transfer_from(
22 changes: 21 additions & 1 deletion tests/test_simpleButler.py
@@ -324,7 +324,8 @@
try:
flat_id, _ = butler.get("flat", dataId=dataId, collections=coll, **kwds)
except Exception as e:
raise type(e)(f"{str(e)}: dataId={dataId}, kwds={kwds}") from e
e.add_note(f"dataId={dataId}, kwds={kwds}")
raise

self.assertEqual(flat_id, flat2g.id, msg=f"DataId: {dataId}, kwds: {kwds}")

# Check that bad combinations raise.
@@ -602,6 +603,25 @@
from_json = type(test_item).from_json(json_str, universe=butler.dimensions)
self.assertEqual(from_json, test_item, msg=f"From JSON '{json_str}' using universe")

def test_populated_by(self):
"""Test that dimension records can find other records."""
butler = self.makeButler(writeable=True)
butler.import_(filename=os.path.join(TESTDIR, "data", "registry", "hsc-rc2-subset.yaml"))

elements = frozenset(
element for element in butler.dimensions.elements if element.hasTable() and element.viewOf is None
)

# Get a visit-based dataId.
data_ids = set(butler.registry.queryDataIds("visit", visit=1232, instrument="HSC"))

# Request all the records related to it.
records = butler._extract_all_dimension_records_from_data_ids(butler, data_ids, elements)

self.assertIn(butler.dimensions["visit_detector_region"], records, f"Keys: {records.keys()}")
self.assertIn(butler.dimensions["visit_system_membership"], records)
self.assertIn(butler.dimensions["visit_system"], records)

def testJsonDimensionRecordsAndHtmlRepresentation(self):
# Dimension Records
butler = self.makeButler(writeable=True)