[FEATURE] Improve batch data validation verbosity #872

Open
@mgovers


Related to #871.

Background

The PGM supports several data validation options (see https://power-grid-model.readthedocs.io/en/stable/api_reference/python-api-reference.html#validation ):

  • accumulating
    • validate_input_data
    • validate_batch_data
  • throwing
    • assert_valid_input_data
    • assert_valid_batch_data

The throwing versions first run the accumulating equivalent and then throw if the accumulated result is not empty. Below, we will therefore focus on the validate_* versions.
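The relationship between the two families can be sketched as follows. This is a toy illustration only, not the PGM implementation: `toy_validate_input_data`, `toy_assert_valid_input_data` and `ToyValidationError` are hypothetical stand-ins, and the "missing value" rule is an assumed placeholder for the real checks.

```python
from typing import List, Optional


class ToyValidationError(Exception):
    """Hypothetical stand-in for the exception raised by the throwing variants."""


def toy_validate_input_data(input_data: dict) -> Optional[List[str]]:
    """Accumulating variant: collect every problem; return None if there are none."""
    errors = []
    for component, rows in input_data.items():
        for row in rows:
            for attribute, value in row.items():
                if value is None:  # assumed placeholder rule: None means "missing"
                    errors.append(f"{component} {row.get('id')}: missing {attribute}")
    return errors or None


def toy_assert_valid_input_data(input_data: dict) -> None:
    """Throwing variant: run the accumulating variant, then raise if it found anything."""
    errors = toy_validate_input_data(input_data)
    if errors:
        raise ToyValidationError("\n".join(errors))
```

Because the throwing variant only wraps the accumulating one, any change to what the accumulating variant reports carries over to both.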

The issue with the existing approach

When running batch calculations, the following situations might be encountered, especially in performance-critical environments such as production:

  • The input data might omit values (NaN) for required but updateable attributes that are in fact provided in the update data. validate_input_data reports these absent attributes as errors, so validate_batch_data is required instead.
  • Conversely, the update data might update only a small subset of all updateable attributes. Because the update affects the batch data, validate_batch_data is still required.
  • A combination of the above two situations is also possible.

In all these cases, many values could be checked on the input data alone, either because they are not provided in the update data or because they are not updateable in the first place. As it is, validate_batch_data reports any such error for every scenario rather than just once, which results in an excessively large list of issues.

Example

  • Valid values for tap_nom of a transformer are: (tap_min <= tap_nom <= tap_max) or (tap_min >= tap_nom >= tap_max) (taken from https://power-grid-model.readthedocs.io/en/stable/user_manual/components.html#transformer). tap_nom, tap_min and tap_max are not updateable, so validating the input data alone should be enough to catch a violation.
  • However, suppose the input data contains no status_from (an updateable but required attribute), while the update data contains valid status_from and status_to. Then validate_input_data reports errors on status_from, even though those errors are resolved when calling validate_batch_data.
  • tap_nom, in contrast, is not updateable, so an invalid tap_nom remains invalid in every scenario when calling validate_batch_data.
  • validate_batch_data therefore correctly reports the error on tap_nom, but it does so for every scenario in the batch. That is a lot of unnecessary duplication for an error that validate_input_data could already catch.
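The duplication described above can be reproduced with a small self-contained sketch. It does not call the PGM; `toy_validate_batch` and the flat dict layout for a single transformer are assumptions made purely for illustration.

```python
import math


def tap_nom_in_range(transformer: dict) -> bool:
    """(tap_min <= tap_nom <= tap_max) or (tap_min >= tap_nom >= tap_max)."""
    low, high = sorted((transformer["tap_min"], transformer["tap_max"]))
    return low <= transformer["tap_nom"] <= high


def toy_validate_batch(input_row: dict, update_scenarios: list) -> dict:
    """Validate every scenario in full; return {scenario index: [error messages]}."""
    errors = {}
    for index, update in enumerate(update_scenarios):
        scenario = {**input_row, **update}  # update values override input values
        found = []
        if math.isnan(scenario["status_from"]):
            found.append("status_from is missing")
        if not tap_nom_in_range(scenario):
            found.append("tap_nom is not between tap_min and tap_max")
        if found:
            errors[index] = found
    return errors


# status_from is absent (NaN) in the input but provided by every update scenario;
# tap_nom is invalid in the input and cannot be changed by an update.
transformer = {"tap_min": -5, "tap_max": 5, "tap_nom": 7, "status_from": float("nan")}
updates = [{"status_from": 1}, {"status_from": 0}, {"status_from": 1}]
batch_errors = toy_validate_batch(transformer, updates)
```

Here `batch_errors` contains one entry per scenario, each holding the same tap_nom message, while the status_from problem disappears because every update provides it.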

Proposed solutions

New functionality

Two new types of functionality should be added:

  • extend validation functionality on input data with partial checks.
    • TBD: Either of the following options should be selected (or both):
      • check only non-updateable attributes
      • check only provided attributes
  • extend validation functionality on batch data with partial checks.
    • TBD: Either of the following options should be selected:
      • check only attributes on the input data that are not provided in any of the update data scenarios
      • check homogeneous attributes (that are the same for all update data scenarios)
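As a rough sketch of how the two batch options differ, the helpers below select the attributes each option would check once instead of per scenario. The function names and the per-component dict layout are hypothetical, and `tap_pos` is used only as an example of an attribute that appears in the update data.

```python
def never_updated(input_row: dict, update_scenarios: list) -> dict:
    """Input attributes that no update scenario provides."""
    updated = set()
    for scenario in update_scenarios:
        updated.update(scenario)  # collect the attribute names of this scenario
    return {name: value for name, value in input_row.items() if name not in updated}


def homogeneous(update_scenarios: list) -> dict:
    """Update attributes that are present with the same value in every scenario."""
    if not update_scenarios:
        return {}
    first, *rest = update_scenarios
    return {
        name: value
        for name, value in first.items()
        if all(name in scenario and scenario[name] == value for scenario in rest)
    }


input_row = {"tap_min": -5, "tap_max": 5, "tap_nom": 7, "status_from": float("nan")}
updates = [{"status_from": 1, "tap_pos": 0}, {"status_from": 0, "tap_pos": 0}]
static_part = never_updated(input_row, updates)
shared_part = homogeneous(updates)
```

In this example `static_part` holds tap_min, tap_max and tap_nom, and `shared_part` holds only tap_pos; status_from varies between scenarios and would still need a per-scenario check.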

Implementation

TBD:

  • Add new functions for the above functionality
    • Pro: not breaking
    • Con: more functions
  • Add new keyword arguments to existing validation functions (both validate_* and assert_valid_*). The default behavior should be the existing behavior (report all errors for all scenarios)
    • Pro: no new functions
    • Con: how to output?

Considered and rejected alternatives

  • The following changes to validate_batch_data were considered but would be breaking:
    • Adding early returns
      • This removes data from the output
    • Changing the output from a dict of scenario index + scenario errors to a dict with an "all" entry for the errors that are the same across all scenarios, plus the scenario-specific errors in the same format as before (scenario index + scenario errors)
      • This changes the output
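For reference, the rejected "all" output shape could look roughly like the sketch below. `factor_common_errors` is a hypothetical helper operating on plain strings, not part of the PGM; it shows why the change is breaking, since existing callers expect only scenario indices as keys.

```python
def factor_common_errors(per_scenario: dict, n_scenarios: int) -> dict:
    """Move errors shared by every scenario under an "all" key.

    per_scenario maps scenario index -> list of error messages; scenarios
    without errors are absent, so n_scenarios is needed to detect that case.
    """
    if n_scenarios == 0 or len(per_scenario) < n_scenarios:
        common = []  # at least one scenario is error-free, so nothing is shared
    else:
        first = next(iter(per_scenario.values()))
        common = [e for e in first if all(e in errs for errs in per_scenario.values())]

    result = {}
    if common:
        result["all"] = common
    for index, errs in per_scenario.items():
        remaining = [e for e in errs if e not in common]
        if remaining:
            result[index] = remaining
    return result


per_scenario = {0: ["tap_nom invalid", "u1 invalid"], 1: ["tap_nom invalid"], 2: ["tap_nom invalid"]}
factored = factor_common_errors(per_scenario, n_scenarios=3)
```

With three scenarios, `factored` holds the shared tap_nom error once under "all" and keeps only the scenario-specific error for scenario 0.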
