Proposal for an extensible IAEA phase space format #712

ftessier · 2021-04-28T21:20:57Z

ftessier
Apr 28, 2021
Maintainer

Proposal for an extensible IAEA phase space format

(This is an evolving document)

Issue

We recognize that there is a tension between two objectives in defining a formal
specification for IAEA phase space files (psf):

1. Minimize the file size;
2. Extend the format to provide additional information in each record.

One way to reconcile these objectives is to allow optional extra fields in
each record. The original IAEA psf format allows an array of extra floats and extra
integers to be added to each record, as specified in the header.

It is now realized that this approach has led users to define a multitude of
extra fields, which are not generally bound to any format, and often only
relevant in the software which produced the psf. For end users, it becomes
difficult, or at least confusing, to determine if a given psf can be handled by
a given software. In the worst case scenario, the extra fields are not
interpreted correctly but tacitly used in a simulation with a different
meaning.

Rectification: All extra integers and floats are well defined, with a fixed format. The list of definitions is kept by the IAEA. We have to make sure that the list is well documented (previously this was not considered necessary, on the assumption that people use the existing interface).

Requirement

An extensible IAEA psf format that is nevertheless constrained by a formal
specification.

Proposed solution

We propose to generalize the existing extra fields and "register" them with
IAEA, meaning that they are integrated in the official format
specification. This incurs additional management to incorporate new fields as
developers request them. But an immediate benefit is that IAEA psfs then remain
bound to a definite format, especially in time and across different software.

Ideally, the proposed solution would accommodate legacy IAEA psfs generated
before the implementation of this idea.

General idea

Associate one numeric ID to each registered field. These don't have to follow
any pre-defined structure (they could be assigned sequentially), but it will
prove useful to classify them loosely according to their meaning (as in library
book codes for subjects, remember those?). We want to keep the IDs numerical to
avoid all the idiosyncrasies of string manipulation and parsing.

Details

Here is a sample header section specifying the content of each record in a
current IAEA psf:

$RECORD_CONTENTS:
    1     // X is stored ?
    1     // Y is stored ?
    0     // Z is stored ?
    1     // U is stored ?
    1     // V is stored ?
    1     // W is stored ?
    1     // Weight is stored ?
    1     // Extra floats stored ?
    2     // Extra longs stored ?
    3     // ZLAST variable stored in the extrafloat array [ 0]
    1     // Incremental history number stored in the extralong array [ 0]
    2     // LATCH EGS variable stored in the extralong array [ 1]

We propose to continue using $RECORD_CONTENTS, but turn it into a list of
identifiers that specify the record fields, more generally. The numbers
in the record contents section are no longer flags, but IDs that refer to
a formal specification (variable type, byte size, units and meaning):

(edit: we have now moved on to use fixed token strings instead of numerical ID numbers).

$RECORD_CONTENTS:
    01100     // x value
    01200     // y value
    01300     // z value
    02100     // u value
    02200     // v value
    02300     // w value
    03000     // energy
    04000     // statistical weight
    05000     // particle

By starting valid IDs say at 100, the IAEA library can recognize legacy files (ID <
100) and then issue a warning and branch off to read the file using the legacy
interpreter. Better yet, the IAEA code would translate the legacy file to the
psf IDs and use the new interpreter.
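As a rough illustration of the legacy detection described above, here is a minimal sketch in Python (chosen for brevity; the IAEA library itself is not written in Python). The function names and the `LEGACY_ID_LIMIT` constant are hypothetical, and the threshold of 100 is simply the value suggested in the text.

```python
# Hypothetical threshold: the text suggests valid new-style IDs start at 100.
LEGACY_ID_LIMIT = 100


def parse_record_contents(header_text):
    """Return the integers listed under $RECORD_CONTENTS, ignoring // comments."""
    values = []
    in_section = False
    for line in header_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("$"):
            # Any "$..." line opens a new header section.
            in_section = stripped.startswith("$RECORD_CONTENTS")
            continue
        if not in_section:
            continue
        payload = stripped.split("//", 1)[0].strip()
        if payload:
            values.append(int(payload))  # int("01100") gives 1100
    return values


def is_legacy(values):
    """Legacy headers hold only flags and counts, all below the first valid ID."""
    return all(v < LEGACY_ID_LIMIT for v in values)


legacy_header = """$RECORD_CONTENTS:
    1     // X is stored ?
    1     // Y is stored ?
    0     // Z is stored ?
"""
new_header = """$RECORD_CONTENTS:
    01100     // x value
    01200     // y value
"""
```

A reader built along these lines would call `is_legacy` once per header and then either warn and use the legacy interpreter, or translate the flags to IDs before continuing.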

The reason for spacing out the IDs is to leave room for future definitions for
the same physical quantity. For example, the ID 01101 could refer to the x
value in 64-bit precision (double), and eventually ID 01102 might refer to the
x value in 128-bit precision, when technology gets there, and so on. The same
applies to units: 03000 might refer to energy in
MeV, while 03010 could be reserved for eV units, etc.

We could require that the IDs be sorted in ascending order in the header, to
simplify the reader algorithm. It would also tend to enforce a more consistent
format across psfs from various sources. But reordering can also be handled by
the reader or even the simulation software, given the list of IDs from the
record contents.

Classification

(edit: we have now moved on to use fixed token strings instead of numerical ID numbers).

As stated above, the IDs can be assigned willy-nilly without constraint, but
for mnemonic purposes it should prove useful to create a loose hierarchy, for
example:

x0000:           broad kind of information (e.g., physical parameters)
- 0x000:         kind of information (the position)
  - 00x00:       specific kind of information (the x position)
    - 000x0:     units (cm)
      - 0000x:   precision (4 bytes)
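To make the digit positions above concrete, here is a small Python sketch that splits a five-digit ID into the five mnemonic levels. The names `IdDigits` and `split_id` are hypothetical. This is only an aid for whoever maintains the ID list: a reader should still look up each ID in the specification rather than infer meaning from its digits.

```python
from typing import NamedTuple


class IdDigits(NamedTuple):
    """Mnemonic breakdown of a five-digit ID, following the loose hierarchy."""
    broad_kind: int   # x0000
    kind: int         # 0x000
    specific: int     # 00x00
    units: int        # 000x0
    precision: int    # 0000x


def split_id(field_id):
    """Split an ID such as 1101 (written 01101) into its five decimal digits."""
    if not 0 <= field_id <= 99999:
        raise ValueError(f"ID {field_id} does not fit in five decimal digits")
    return IdDigits(*(int(c) for c in f"{field_id:05d}"))
```

For instance, `split_id(1101)` separates the x position (digits 0, 1, 1) from its units digit (0) and precision digit (1).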

To set things concretely, here is what a partial expansion of the hierarchy
might look like in practice:

00000s:      Basic particle parameters (r, u, E, weight, ...)
  0000x:     Legacy format
  00x00s:    Pre-defined groupings
  01000s:    Position
    01100s:      x
    01200s:      y
    01300s:      z
    01400s:      r
    01500s:      theta
    01600s:      phi
  02000s:    Direction
    02100s:      u
    02200s:      v
    02300s:      w
  03000s:    Energy
    03000s:      energy
  04000s:    Statistical
    04000s:      weight
  05000s:    Particle
    05100s:      photon
    05200s:      electron
    05300s:      positron
    05400s:      proton
    05500s:      neutron
  06000s:    Region information
  07000s:    Medium information
  08000s:    Extended particle information (PDG label, half-life, ...)
20000s:      Time-related (fractional monitor units, time of flight, ...)
30000s:      History information (new history, history index, run, sequence, ...)
40000s:      Phase-space specific (npass, recycling, ...)
50000s:      Software specific (zlast, latch, ...)
(...)
90000s:      User-defined (unsupported, non-conforming to IAEA format)

In case these are not sufficient, eventually, additional or higher number ranges can be
assigned. The idea is not strict conformance, but a rough organization of the
ID numbers. In fact, this hierarchy should never be mentioned in the formal
specification intended for the user, so that the only way to know what an ID means
is to look it up in the specs. There should be nothing implicit in the
interpretation of the IDs.

Here are some possible ID examples that spring to mind:

01100:  x position, in cm, float, 4 bytes
01111:  x position, in nm, float, 8 bytes
03031:  energy, in Joules, float, 8 bytes
20000:  fractional monitor unit, dimensionless, integer, 4 bytes
21150:  time of flight in ms, float, 4 bytes
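One way to picture the formal specification behind these examples is a lookup table from ID to label, units and binary type, from which the record size follows. The sketch below is in Python and uses the standard `struct` module; the `FieldSpec` and `REGISTRY` names are hypothetical, and the little-endian, unpadded layout is an assumption rather than something the proposal specifies.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    label: str
    units: str
    fmt: str  # struct format character: "f" = 4-byte float, "d" = 8-byte float, "i" = 4-byte int


# Hypothetical registry holding only the example IDs listed above.
REGISTRY = {
    1100: FieldSpec("x position", "cm", "f"),
    1111: FieldSpec("x position", "nm", "d"),
    3031: FieldSpec("energy", "J", "d"),
    20000: FieldSpec("fractional monitor unit", "", "i"),
    21150: FieldSpec("time of flight", "ms", "f"),
}


def record_format(ids, byte_order="<"):
    """Build a struct format string for one record (assumed little-endian, no padding)."""
    return byte_order + "".join(REGISTRY[i].fmt for i in ids)


def record_size(ids):
    """Number of bytes occupied by one record containing the given IDs."""
    return struct.calcsize(record_format(ids))
```

With this table, a record holding IDs 01100, 03031 and 20000 would occupy 4 + 8 + 4 bytes.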

Pre-defined groups

For convenience, this system allows the definition of predefined groups of
variables, e.g.,

00100:  Basic format, expands to:
            01100   x
            01200   y
            01300   z
            02100   u
            02200   v
            03000   E
            04000   weight
            05000   particle
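Expanding such a group is a simple substitution when the header is read. A minimal Python sketch, assuming a hypothetical `GROUPS` table that contains only the basic format group shown above:

```python
# Hypothetical group table: group ID -> list of member field IDs, as listed above.
GROUPS = {
    100: [1100, 1200, 1300, 2100, 2200, 3000, 4000, 5000],
}


def expand_groups(ids):
    """Replace each group ID by its member IDs; ordinary IDs pass through unchanged."""
    expanded = []
    for field_id in ids:
        expanded.extend(GROUPS.get(field_id, [field_id]))
    return expanded
```

A header listing only 00100 followed by 21150 would then be read as the eight basic fields plus the time of flight.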

Constants

We can also leverage this system to define multiple record constants in the
header, for example:

$RECORD_CONSTANTS:
    01300   80.5    // z position
    04000   0.1     // statistical weight
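A reader could parse this section into a dictionary and merge it into every record it returns. The Python sketch below assumes the two-column layout shown above (ID, then value); the function names are hypothetical, and rejecting an ID that is both stored and constant is a design choice of this sketch, not part of the proposal.

```python
def parse_record_constants(header_text):
    """Return {ID: value} for the lines under $RECORD_CONSTANTS, ignoring // comments."""
    constants = {}
    in_section = False
    for line in header_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("$"):
            in_section = stripped.startswith("$RECORD_CONSTANTS")
            continue
        payload = stripped.split("//", 1)[0].split()
        if in_section and len(payload) == 2:
            constants[int(payload[0])] = float(payload[1])
    return constants


def complete_record(stored, constants):
    """Merge per-record values with header constants (an ID must not appear in both)."""
    overlap = stored.keys() & constants.keys()
    if overlap:
        raise ValueError(f"IDs both stored and constant: {sorted(overlap)}")
    return {**constants, **stored}


header = """$RECORD_CONSTANTS:
    01300   80.5    // z position
    04000   0.1     // statistical weight
"""
constants = parse_record_constants(header)
```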

Implementation

The implementation is left as an exercise for the reader! :-) Just kidding.
Inside the IAEA library, it implies reworking the parsing of the record
contents (and constants) sections in the header, ensuring that the legacy
format can still be processed. The new implementation would rely on an
include file which specifies the type and size associated with each ID. There
are various ways to generate a format description of the IDs from the include
file automatically (doxygen comes to mind).

On the software side, the developers of each software toolkit can decide which
IDs they implement (beyond the fundamental ones). They would rely on the same
include file definitions provided with the IAEA library, and their own patch
code to transfer the data from the IAEA psf records to the software's native
data structures.
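The "patch code" on the software side might look something like the following Python sketch, which unpacks one binary record into a native particle structure. Everything here is hypothetical: the `FIELDS` table stands in for the shared include file, `Particle` stands in for the software's own data structure, and the little-endian 4-byte float layout is an assumption.

```python
import struct
from dataclasses import dataclass

# Hypothetical stand-in for the shared include file: ID -> (attribute name, struct code).
FIELDS = {
    1100: ("x", "f"),
    1200: ("y", "f"),
    3000: ("energy", "f"),
    4000: ("weight", "f"),
}


@dataclass
class Particle:
    """Stand-in for a software's native particle structure."""
    x: float = 0.0
    y: float = 0.0
    energy: float = 0.0
    weight: float = 1.0


def read_record(raw, ids):
    """Unpack one record (assumed little-endian, unpadded) into a Particle."""
    fmt = "<" + "".join(FIELDS[i][1] for i in ids)
    values = struct.unpack(fmt, raw)
    return Particle(**{FIELDS[i][0]: v for i, v in zip(ids, values)})


ids = [1100, 1200, 3000]
raw = struct.pack("<fff", 1.5, -2.0, 6.0)
particle = read_record(raw, ids)
```

Fields that are absent from the record (here the weight) keep the default of the native structure.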

The IDs themselves are 4-byte integers.

Strategy

The best starting point is certainly not to form a committee to discuss at
length what should be included, nor the finer points of the classification.
Instead, let's start with what we have now, ensuring legacy support. Then let
everyone propose some IDs they can use right now, and draft a first
classification based on that, leaving a decent amount of ID space for future
definitions.

For example, I am not suggesting to implement all possible distance units
(km, m, cm, mm, um, nm). Rather, let's start with, say the common 'cm', but
leave space for other units in the future if there is a good case for it.

Caveats

If a simulation software does not know how to process some ID xxxxx, then it
should alert the user, and then proceed to ignore the corresponding field in
each record. This is possible as long as the IAEA header file containing all
the field IDs is up to date, because the reader needs the size of a field in
order to skip it. Otherwise, the simulation ought to quit. As a
last resort, the IAEA library could also provide a trim tool for users to
remove some fields, to allow portability in a case where they are using a
software that relies on an outdated ID header.

Care should be taken to attempt, as much as possible, to create common
unique IDs for the same physical quantity, recognized by all software. Of
course, it would be possible (and at times tempting!) to define a whole suite
of software-specific items, say an EGSnrc version of position data x, y, z (!).
But that makes no sense and it entirely defeats the purpose of an interchange
format. The registered IDs ought to be software agnostic. An example of an
EGSnrc-specific ID would be for latch (with no foreseen equivalent in other
software), but perhaps zlast should be declared more generally as "the
z-coordinate" of the last interaction, and be defined in the "History
information" segment (30000s in the sample above).
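
The handling of unknown IDs described at the start of this section could be sketched as follows, in Python. The `SIZES` table plays the role of the up-to-date IAEA header file and `SUPPORTED` lists what a given software implements; both names, and the little-endian layout, are assumptions made for illustration. The sketch warns on every call, whereas a real reader would more likely warn once per file.

```python
import struct
import warnings

# Hypothetical: every registered ID with its struct code (the "ID header").
SIZES = {1100: "f", 1200: "f", 3000: "f", 21150: "f"}
# Hypothetical: the subset of IDs this particular software knows how to use.
SUPPORTED = {1100, 1200, 3000}


def read_supported(raw, ids):
    """Return {ID: value} for supported fields, skipping registered but unsupported ones."""
    unregistered = [i for i in ids if i not in SIZES]
    if unregistered:
        # Without a size the record layout is unknown, so the reader has to give up.
        raise ValueError(f"unregistered IDs {unregistered}: cannot determine record layout")
    values = struct.unpack("<" + "".join(SIZES[i] for i in ids), raw)
    kept = {}
    for field_id, value in zip(ids, values):
        if field_id in SUPPORTED:
            kept[field_id] = value
        else:
            warnings.warn(f"ignoring unsupported field ID {field_id:05d}")
    return kept


raw = struct.pack("<ffff", 1.0, 2.0, 3.0, 4.0)
fields = read_supported(raw, [1100, 1200, 3000, 21150])
```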

ftessier · 2021-04-28T21:24:15Z

ftessier
Apr 28, 2021
Maintainer Author

(Inspired by discussions with developers across different Monte Carlo software, and more specific chat with the EGSnrc developers.)

2 replies

blakewalters Apr 28, 2021
Maintainer

Nice, @ftessier et al! I like your proposal. I'm wondering if we can back out one step further and map (even) more readable format specs onto the respective 4-byte integer IDs that you've proposed here, e.g., somehow the user/developer would be able to specify something like X:units/precision/etc, or, if this applies to all position variables, position:units/precision. Maybe then this gets too much into enforcing strict rules on strings and the idiosyncrasies of parsing, as you've alluded to above. On the other hand, why do we need to shy away from string parsing/matching? Are the required C++ functions not standard across all platforms?

ftessier Apr 29, 2021
Maintainer Author

Thanks @blakewalters! I propose to include units and precision within the ID number itself, in the lower digits (right-most). We can add more digits if we want, we have ~9 digits (decimal) to play with using 31 bits. I like string-based tokens, but in my experience one runs into issues related to sorting, nomenclature (wars), unicode chars, buffer size concerns, and generally significant overhead in the parser. I like that it is more expressive though, but at this point I would side with:

- 4-byte integer as the ID: the parser knows what to expect, in all cases.
- A clear (and mostly automated) format reference
- Clear string labels and explanations, expanded at will as part of the format spec
- A brief string to comment each ID in the IAEA header, when listing the IDs (not strictly required)

In short, offload all the stringy stuff to the format reference doc, with the data itself relying on integers only.

rtownson · 2021-04-28T21:54:19Z

rtownson
Apr 28, 2021
Maintainer

Very well fleshed out, it all makes sense to me. Just a couple typos:

for example 03000 might refer to energy in
MeV, while 03001 could be reserved for eV units, etc.

I think you meant 03010.

21150: time of flight in ms, float

should be: 21150: time of flight in ms, float, 4 bytes

1 reply

ftessier Apr 28, 2021
Maintainer Author

Done, thanks!

rtownson · 2021-04-28T22:10:09Z

rtownson
Apr 28, 2021
Maintainer

I think it would make sense for IAEA to include a few basic phase-space parsing tools intended for users (not developers), such as the trim tool you suggested. Other examples could be verify-psf-checksum, and if the header/psf are combined into one file then show-header. These functions could be bundled into one psf-tools app.

0 replies

mainegra · 2021-04-29T01:07:52Z

mainegra
Apr 29, 2021
Maintainer

@ftessier this is a very well thought out approach! To me it looks like the way to the future! For psf files at least! There is only one concern in this: Will then the IAEA maintain a database of all possible "official" formats? These files could get very large depending on the additional parameters added to a record.

1 reply

ftessier Apr 29, 2021
Maintainer Author

Not quite, if I understand your question correctly. IAEA would maintain a header file with all defined (registered) fields that are possible in a record, each with its unique ID. From our discussion today I think we would hardly ever reach 100 different IDs, at least for the foreseeable future... IAEA would continue to host submitted psf data in whatever format the authors wish (using only IAEA-approved fields), with authors otherwise free to choose which fields they include. For general usage one would expect to see the usual variables, and only a handful of extra fields, if any.

Some critical review would go into accepting a new field ID, but once added to the header file, the idea would be that the IAEA code and consumer software would be able to read and write it (with the updated header file of course).

This is all preliminary, there are probably issues that will crop up when one tries to implement this, in general. I will try my hand at a mock-up C++ implementation, just to see what it could look like!

ftessier · 2021-04-29T02:28:58Z

ftessier
Apr 29, 2021
Maintainer Author

Note that a general psf format could also come to serve RAM particle storage during a simulation. For instance, I can see egs_chamber using the same IDs and record structure internally for temporary phase-space storage, or for handing off particles from BEAMnrc to DOSXYZnrc, etc.

1 reply

ftessier Apr 29, 2021
Maintainer Author

Now that I think of it, particle IDs could also be used to define the data structures for particles in a software, e.g., specify the format of an EGS_Particle or the stack, etc. This way particle information in various software would become widely recognizable, or at least refer to a common format specification.

rtownson · 2021-04-29T12:57:25Z

rtownson
Apr 29, 2021
Maintainer

I like that this maintains and improves the flexibility of the current IAEA format. E.g. you can include any number of variables in the psf and have any number of them constant. The naming of the "extra" variables is now clearer because all variables are defined the same way. Using numbers for IDs makes them easily parsed, but you could include optional comments next to each ID in the header to describe what parameters are included in plain text (this could be automatically done in the IAEA reader/writer).

1 reply

ftessier Apr 29, 2021
Maintainer Author

Yes, I agree. The comments would be strongly recommended (and inserted by the writer), but not strictly required, so that one can optionally expand or adjust said comment.

ftessier · 2021-04-29T19:19:52Z

ftessier
Apr 29, 2021
Maintainer Author

Note: this discussion will be continued with the IAEA and other Monte Carlo developers on the side for a while. Please don't hesitate to continue commenting here with suggestions, requirements, concerns about the future of the IAEA phase space file format. I will relay your ideas to the discussion group, until a first proposal of the format is opened publicly for discussion.

0 replies

ojalaj · 2021-04-29T19:56:26Z

ojalaj
Apr 29, 2021

This sounds like a very good improvement! Do you have any thoughts on preserving backward compatibility with existing IAEA phsp files / file format?

2 replies

ftessier Apr 29, 2021
Maintainer Author

Yes, indeed this is already under consideration. Existing files will continue to be supported, internally by translating these on the fly to a format that is digestible for the new reader. Custom additions by users that were never documented will not be supported, of course, but things like zlast in EGSnrc or Penelope-specific extensions would be parsed, as much as possible!

ojalaj Apr 30, 2021

@ftessier Great - that is what I thought. Maybe the community would benefit the most if it is ensured that manufacturer-provided IAEA phsp files (e.g. the ones from Varian generated with Geant4) would work.

bwheelz36 · 2023-06-01T23:53:36Z

bwheelz36
Jun 1, 2023

Hi, @ftessier - I realize this discussion is a little old now, but I just wanted to:

1. Say I think this is a great idea and very overdue.
2. Draw your attention to the openPMD initiative being developed by @ax3l. It would make sense to me to align the IAEA standard with existing open standards in accelerator physics.

3 replies

ftessier Jun 2, 2023
Maintainer Author

@bwheelz36 the great thing about discussing this here is that it never gets old 🤓.

I will certainly look into what you refer to. In the meantime a number of key MC developers have worked with the IAEA and written a "2.0" format (which follows a better and simpler encoding compared to what I had suggested). The format is written and awaits implementation for official release. So it is entirely timely that I look up openPMD.

I am much in favour of sticking with open and standard formats (we work in Metrology after all!)

ftessier Jun 2, 2023
Maintainer Author

I have read the openPMD standard. It is interesting to find similarities with what is discussed in the IAEA 2.0 phase space file standard. It is not immediately clear (which standard ever is! 😄) what an actual particle store openPMD metadata file would look like; can you point to any (simple) example?

ftessier Jun 2, 2023
Maintainer Author

I am particularly interested in the projects that provide particle visualization. That would be extremely useful for Monte Carlo particle phase space file data!
