Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: EDM4HEPSchema and Newstyle FCCSchema #1245

Open
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

prayagyadav
Copy link
Contributor

@lgray @davidlange6 @gomber

Here is a clean draft for the EDM4HEPSchema (edm4hep1) and FCCSchema based on the same. I have not yet managed to add many comments and descriptions, but I plan to add them eventually.

Workings:

  • The EDM4HEPSchema reads the edm4hep.yaml file from the assets directory. I felt this was necessary to add maximum functionality to the schema. Reading the specifications of all the 'components' and 'datatypes' from the yaml file helps to identify the 'members' (example, energy is a member of edm4hep::ReconstructedParticle datatype) and which members correspond to the various types of cross-branch relations in EDM4HEP: vector members, OneToOneRelations, OneToManyRelations and Links.

  • The Schema fetches the comments in the edm4hep.yaml file and assigns them as docstrings to the relevant branches.

  • The EDM4HEPSchema supports all these relations (With Links needing some manual boilerplate code from the user).

  • The version of the edm4hep.yaml file used is here. Please note that the way Links are represented in EDM4HEP has changed in the latest commit. @tmadlener can comment more on this. In any case, it seems necessary to find a way to track the changes from edm4hep.yaml, so that the COFFEA EDM4HEPSchema does not become obsolete after a few version changes.

Link to example Notebooks:

Tests:

  • The test sample for EDM4HEP was obtained from here (which, I believe, was pointed out by @tmadlener).
  • The test sample for the Newstyle (edm4hep1) FCC was provided by @davidlange6
  • Both the schema pass the basic tests (the tests are similar to the ones that I wrote for the oldstyle (pre-edm4hep1) FCCSchema)

Other comments:

  • I understand that this commit includes a lot of code, and its purpose may not be immediately clear, but I can provide a detailed explanation for specific snippets if needed.
  • While I have tried to add most of the functionality to the schema, I cannot figure out how to add the ExtraCode sections mentioned in edm4hep.yaml. They appear to be declarations for C++ methods specific to certain collections.

@prayagyadav prayagyadav changed the title EDM4HEPSchema and Newstyle FCCSChema feat: EDM4HEPSchema and Newstyle FCCSChema Jan 15, 2025
assert field in delayed_fields


def test_MC_daughters(delayed_events):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests should also check correctness!

assert field in delayed_fields


def test_MC_daughters(delayed_events):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests should also check correctness!

#


def test_KaonParent_to_PionDaughters_Loop(eager_events):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, I guess correctness is checked here in the end.

Perhaps some more basic checks in the prior tests would still be useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, test_KaonParent_to_PionDaughters_Loop attempts to check the correctness of Parents and Daughters relations. Sure, I can add more basic tests

@lgray
Copy link
Collaborator

lgray commented Jan 15, 2025

can you rebase all the commits into 1 so that that 12MB file isn't into the repo history at all? Thanks!

@prayagyadav
Copy link
Contributor Author

can you rebase all the commits into 1 so that that 12MB file isn't into the repo history at all? Thanks!

Done

@prayagyadav prayagyadav changed the title feat: EDM4HEPSchema and Newstyle FCCSChema feat: EDM4HEPSchema and Newstyle FCCSchema Jan 21, 2025
@lgray
Copy link
Collaborator

lgray commented Jan 21, 2025

@prayagyadav is there anything more you want to do on this PR?

@prayagyadav
Copy link
Contributor Author

yeah, gotta sort out some features which I missed. Also, I have to add more comments ...

@lgray
Copy link
Collaborator

lgray commented Jan 21, 2025

OK - ping me when you are done! It's otherwise looking good so far.

@prayagyadav
Copy link
Contributor Author

Hi, @lgray. I am sorry for the radio silence.

  • There is an issue with EDM4HEPSchema that we stumbled upon: I have hard-coded the mixins, which is not desirable if I want this schema to recognize any general EDM4HEP-based root file. For example, this schema will work for the specific edm4hep sample file I have used but will fall apart for any other EDM4HEP-based root file (eg. FCC fullsim samples) unless I have the root-file specific mixin dictionary for the said root file.
  • The solution that we came up with: Build the mixin dictionary at run time by inferring the C++ typenames. Infering the C++ typenames is supported in eager mode, but unsupported in delayed mode. After a brief discussion, we found that we need to add that functionality to uproot.dask.
  • We have this PR at uproot5 to add the typename functionality. I have tested the changes and how they translate in coffea. Here is the branch that has the necessary changes
  • So, I believe, we have to wait for uproot to merge the change and also coffea to bump the uproot dependency

@lgray
Copy link
Collaborator

lgray commented Feb 27, 2025

@prayagyadav you should be able to determine the C++ just from the file metadata right? If so if you dig through uproot a bit you can extract it without needing to resort to materializing an array.

Otherwise let's wait for that PR to go through and an uproot release.

Also @nsmith- had some interesting things to say about dealing with streamers when we were talking yesterday. May help this out.

@prayagyadav
Copy link
Contributor Author

@lgray Using this script that I found, I can list out all the relevant metadata for a given EDM4HEP based file.

I found out that some generations of FCC samples have the typename info in the podio_metadata/events___CollectionTypeInfo._1 location in the podio_metadata tree.

Unfortunately, the older generations like the Spring2021 campaign do not have the typename info in metadata.

Another issue is that the collection typenames are incorrect! Maybe it's just the problem with this script, but I am not sure.

In the long term, it would be very beneficial to get access to the Collection IDs which are stored in the metadata. Is there a way to access the podio_metadata tree at runtime, within a schema?

image

@tmadlener
Copy link

I found out that some generations of FCC samples have the typename info in the podio_metadata/events___CollectionTypeInfo._1 location in the podio_metadata tree.

Regarding the format of this metadata: I would like to point out that they are technically not part of the guarantees we give for stability as we do in other places, see also the documentation, or a concrete example of how we want to change this AIDASoft/podio#711

If you want to have the ground truth of the generation you can get to the actual full definition of the datamodel as JSON encoded string from podio_metadata.EDMDefinitions.

Another issue is that the collection typenames are incorrect! Maybe it's just the problem with this script, but I am not sure.

Maybe this is because the script gives you the names of the classes in the user-layer (see documentation), while what actually lands in the files comes from the POD layer. So if you see vector<XYZData> instead of XYZCollection that is most likely the reason for that.

@davidlange6
Copy link

@tmadlener - instead the output of the script seem shuffled? I guess Particle is not really a UserDataCollection<float> under the hood, is it?

@tmadlener
Copy link

Ah right, didn't catch that, sorry. From a quick look, I think this zip here is the issue, because there is no guarantee that the order in the idTable that links collection ids and collection names is the same as the one in CollectionTypeInfo. However, the CollectionTypeInfo has a field with the collection ID, so it can be matched. The tuple is defined here.

So Looking at events___CollectionTypeInfo._0 and matching that with the collection ID retrieved from events___idTable should give you the correct type from events___CollectionTypeInfo._1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants