LifeCycleEntryType checksum is ambiguous and under-specified #91

Open
AxelMatstoms opened this issue Feb 12, 2025 · 0 comments

I'm currently working on my master's thesis at Saab, with Robert Hällqvist as my supervisor.
As part of my thesis work I'm implementing support for SSP Traceability in a Python library.

The specification says the following about the checksum attribute of the LifeCycleEntryType:

This attribute gives the checksum over the phase/step information stored in the enclosing phase/step element, calculated according to the #STMD' specification.

As far as I can tell, the specification makes no further mention of how to actually calculate the checksum. I'm assuming the use case is to verify that the information in the Step/Phase as it is being read is the same as when the current lifecycle status was assigned.

To me it isn't obvious how to do this in a way that is unambiguously specified and relatively easy to implement.

Calculating checksum over the XML as is

Let's say we want to assign the following step the Approved life cycle stage.

<stmd:AnalyzeSimulationTaskAndObjectives>
    <stc:Input>...</stc:Input>
    <stc:Procedure>...</stc:Procedure>
    <stc:Output>...</stc:Output>
    <stc:Rationale>...</stc:Rationale>
    <stc:LifeCycleInformation>
         <stc:Validated>...</stc:Validated>
    </stc:LifeCycleInformation>
</stmd:AnalyzeSimulationTaskAndObjectives>

We create a sha3-256 hash of the step, excluding the LifeCycleEntry that is about to be inserted, since otherwise we would be hashing data that contains the hash itself. The new LifeCycleEntry is then inserted with its hash, yielding something like the following:

<stmd:AnalyzeSimulationTaskAndObjectives>
    <stc:Input>...</stc:Input>
    <stc:Procedure>...</stc:Procedure>
    <stc:Output>...</stc:Output>
    <stc:Rationale>...</stc:Rationale>
    <stc:LifeCycleInformation>
         <stc:Validated>...</stc:Validated>
         <stc:Approved checksum="9dc7da9c722704815eaa1580065ad84e8a1ccb2eefb6cd0887e8fba68d077332" ...>...</stc:Approved>
    </stc:LifeCycleInformation>
</stmd:AnalyzeSimulationTaskAndObjectives>

When the reader wants to verify that the checksum matches the current version of the step, it is not sufficient to remove the LifeCycleEntry from the parsed XML and create a string representation of the step element, since the XML parser is likely to throw away some of the information required to create a byte-perfect string. For example, parsing and stringifying <a ></a> will give <a></a>, which has a different hash.
So let's instead assume we use something like a regular expression to remove the LifeCycleEntry from the raw text, and create a hash based on that. Since the reader can't assume anything about the writer's XML formatting, we remove only the element itself, and not any of the surrounding whitespace, giving the following:

<stmd:AnalyzeSimulationTaskAndObjectives>
    <stc:Input>...</stc:Input>
    <stc:Procedure>...</stc:Procedure>
    <stc:Output>...</stc:Output>
    <stc:Rationale>...</stc:Rationale>
    <stc:LifeCycleInformation>
         <stc:Validated>...</stc:Validated>
         
    </stc:LifeCycleInformation>
</stmd:AnalyzeSimulationTaskAndObjectives>

The sha3-256 checksum of this is 510c2632b52c73ec297433cb37c9d39d818042b872a2ceb51b44c9118fa4d46f, different from the checksum in the XML.

We can conclude that specifying that the checksum should be over the surrounding step/phase, as is, without the LifeCycleEntry inserted is likely to lead to errors, and would be difficult to implement correctly (since we can't simply rely on an XML parser to remove the element).
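The round-trip problem mentioned above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library's `xml.etree.ElementTree` and `hashlib`; the helper name `sha3_hex` is made up for illustration, and other XML libraries may normalize the markup differently.

```python
import hashlib
import xml.etree.ElementTree as ET


def sha3_hex(data: bytes) -> str:
    """Return the SHA3-256 digest of data as a lowercase hex string."""
    return hashlib.sha3_256(data).hexdigest()


original = b"<a ></a>"

# Parse and serialize again. short_empty_elements=False keeps the explicit
# end tag, so the only change is the space the parser discarded.
round_tripped = ET.tostring(ET.fromstring(original), short_empty_elements=False)

print(round_tripped)
print(sha3_hex(original) == sha3_hex(round_tripped))
```

The two byte strings describe the same element, yet their digests differ, which is why a checksum over the raw text cannot be verified from a parsed document.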

Calculating checksum of canonicalized XML

The alternative to the previous approach is to specify that the enclosing step/phase should be canonicalized first (for example according to the C14N 2.0 specification, https://www.w3.org/TR/xml-c14n2/). The data that would actually be hashed would then be something like:

<stmd:AnalyzeSimulationTaskAndObjectives>
    <stc:Input xmlns:stc="http://ssp-standard.org/SSPTraceability1/SSPTraceabilityCommon">...</stc:Input>
    <stc:Procedure xmlns:stc="http://ssp-standard.org/SSPTraceability1/SSPTraceabilityCommon">...</stc:Procedure>
    <stc:Output xmlns:stc="http://ssp-standard.org/SSPTraceability1/SSPTraceabilityCommon">...</stc:Output>
    <stc:Rationale xmlns:stc="http://ssp-standard.org/SSPTraceability1/SSPTraceabilityCommon">...</stc:Rationale>
    <stc:LifeCycleInformation xmlns:stc="http://ssp-standard.org/SSPTraceability1/SSPTraceabilityCommon">
         <stc:Validated>...</stc:Validated>
    </stc:LifeCycleInformation>
</stmd:AnalyzeSimulationTaskAndObjectives>

Note that the C14N algorithm does not remove whitespace around tags, since whitespace has semantic meaning in XML (but not in SSP Traceability?), so the reader still has to make a best guess at the writer's intention regarding formatting. A more suitable specification would therefore be that the hash should be calculated by:

  1. (Only when verifying the checksum) Removing the current LifeCycleEntry element.
  2. Removing all whitespace surrounding tags. (How should whitespace around text be handled?)
  3. Canonicalizing the whitespace-trimmed XML.
  4. Calculating the sha3-256 sum of the canonicalized XML.
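The four steps above could be sketched in Python roughly as follows. This is a sketch under several assumptions, not a proposal for the normative algorithm: it uses `xml.etree.ElementTree.canonicalize` (the standard library's C14N 2.0 function, Python 3.8+); `strip_text=True` stands in for step 2 and also trims whitespace around text content, which is only one possible answer to the open question; `rewrite_prefixes=True` is added because ElementTree does not preserve the original namespace prefixes. The names `step_checksum` and `entry_to_skip` are hypothetical, and the `stmd` namespace URI is a placeholder.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Optional

STC = "http://ssp-standard.org/SSPTraceability1/SSPTraceabilityCommon"
STMD = "urn:example:stmd"  # placeholder, NOT the real STMD namespace URI


def step_checksum(step_xml: str, entry_to_skip: Optional[str] = None) -> str:
    """Hash a step element, optionally ignoring one life cycle entry."""
    root = ET.fromstring(step_xml)
    if entry_to_skip is not None:
        # Step 1: drop the entry that carries the checksum being verified.
        info = root.find(f"{{{STC}}}LifeCycleInformation")
        if info is not None:
            for entry in info.findall(f"{{{STC}}}{entry_to_skip}"):
                info.remove(entry)
    # Steps 2 and 3: strip whitespace around text, then canonicalize.
    canonical = ET.canonicalize(
        ET.tostring(root, encoding="unicode"),
        strip_text=True,
        rewrite_prefixes=True,
    )
    # Step 4: SHA3-256 over the UTF-8 bytes of the canonical form.
    return hashlib.sha3_256(canonical.encode("utf-8")).hexdigest()


before = f"""<stmd:AnalyzeSimulationTaskAndObjectives xmlns:stmd="{STMD}" xmlns:stc="{STC}">
    <stc:Input>task description</stc:Input>
    <stc:LifeCycleInformation>
         <stc:Validated/>
    </stc:LifeCycleInformation>
</stmd:AnalyzeSimulationTaskAndObjectives>"""

checksum = step_checksum(before)

# The writer inserts the new entry; the formatting differs from `before`.
after = f"""<stmd:AnalyzeSimulationTaskAndObjectives xmlns:stmd="{STMD}" xmlns:stc="{STC}">
  <stc:Input>task description</stc:Input>
  <stc:LifeCycleInformation><stc:Validated/>
    <stc:Approved checksum="{checksum}"/>
  </stc:LifeCycleInformation>
</stmd:AnalyzeSimulationTaskAndObjectives>"""

print(step_checksum(after, entry_to_skip="Approved") == checksum)
```

Because the entry is removed from the parsed tree rather than from the raw text, the leftover whitespace from the first approach never reaches the hash.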

This specification still has a few issues, such as requiring implementers to find an XML library that correctly implements the C14N 2.0 algorithm, or alternatively to implement it themselves.
Additionally, there are still documents with different checksums under this algorithm that most SSP Traceability parsers would consider equivalent, for example documents that differ only in the order of elements whose order doesn't affect semantics. One example would be:

<stmd:AnalyzeSimulationTaskAndObjectives>
    <stc:Input>...</stc:Input>
    <stc:Procedure>...</stc:Procedure>
    <stc:Output>...</stc:Output>
    <stc:Rationale>...</stc:Rationale>
    <stc:LifeCycleInformation>
         <stc:Validated>...</stc:Validated>
    </stc:LifeCycleInformation>
    <stc:Classification type="com.example.internal">...</stc:Classification>
    <ssc:Annotations>
        <ssc:Annotation type="com.example.internal2">...</ssc:Annotation>
    </ssc:Annotations>
</stmd:AnalyzeSimulationTaskAndObjectives>

compared to

<stmd:AnalyzeSimulationTaskAndObjectives>
    <stc:Input>...</stc:Input>
    <stc:Procedure>...</stc:Procedure>
    <stc:Output>...</stc:Output>
    <stc:Rationale>...</stc:Rationale>
    <stc:LifeCycleInformation>
         <stc:Validated>...</stc:Validated>
    </stc:LifeCycleInformation>
    <ssc:Annotations>
        <ssc:Annotation type="com.example.internal2">...</ssc:Annotation>
    </ssc:Annotations>
    <stc:Classification type="com.example.internal">...</stc:Classification>
</stmd:AnalyzeSimulationTaskAndObjectives>

(Annotations and Classification have swapped order.)
Of course, only the former follows the standard, as defined by the XSD file, but I would assume many parsers would treat them as equivalent (the one I'm working on does, at least).
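The order sensitivity can be checked with a small Python sketch. It again relies on the standard library's `xml.etree.ElementTree.canonicalize`; the helper name `canonical_sha3`, the generic `step` wrapper element, and the `urn:example:...` namespace URIs are placeholders chosen for illustration, not the real SSP namespaces.

```python
import hashlib
import xml.etree.ElementTree as ET


def canonical_sha3(xml_text: str) -> str:
    """SHA3-256 of the whitespace-stripped C14N 2.0 form of xml_text."""
    canonical = ET.canonicalize(xml_text, strip_text=True)
    return hashlib.sha3_256(canonical.encode("utf-8")).hexdigest()


NS = 'xmlns:stc="urn:example:stc" xmlns:ssc="urn:example:ssc"'  # placeholders

classification_first = (
    f"<step {NS}>"
    '<stc:Classification type="com.example.internal"/>'
    "<ssc:Annotations/>"
    "</step>"
)
annotations_first = (
    f"<step {NS}>"
    "<ssc:Annotations/>"
    '<stc:Classification type="com.example.internal"/>'
    "</step>"
)
# Same content as classification_first, only reformatted.
reformatted = (
    f"<step {NS}>\n"
    '    <stc:Classification type="com.example.internal"/>\n'
    "    <ssc:Annotations/>\n"
    "</step>"
)

# Formatting differences disappear, element order does not.
print(canonical_sha3(classification_first) == canonical_sha3(reformatted))
print(canonical_sha3(classification_first) == canonical_sha3(annotations_first))
```

Canonicalization preserves document order by design, so any order-insensitivity would have to be specified on top of it.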

Unfortunately I have no good answer as to what the solution should be, but hopefully my thoughts can contribute to the discussion on how the standard should be changed in order to fully specify how checksums are calculated.
