
The DataStream data downloaded in the Export has wrong JSON format and missing data #178

Open
bardram opened this issue Nov 15, 2024 · 6 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

bardram (Collaborator) commented Nov 15, 2024

When exporting data, I would expect the data-streams.json file to contain correctly formatted JSON according to the CARP Core domain model. But, alas, there are issues.

Here is an example of some JSON exported:

data-streams.json
  {
    "id": 4961,
    "data_stream_id": 631,
    "snapshot": {
      "syncPoint": {
        "synchronizedOn": "1970-01-01T00:00:00Z",
        "relativeClockSpeed": 1.0,
        "sensorTimestampAtSyncPoint": 0
      },
      "triggerIds": [
        4
      ],
      "measurements": [
        {
          "data": {
            "__type": "dk.cachet.carp.freememory",
            "freeVirtualMemory": 2999164928,
            "freePhysicalMemory": 164835328
          },
          "sensorStartTime": 1729864247872632
        },
        {
          "data": {
            "__type": "dk.cachet.carp.freememory",
            "freeVirtualMemory": 2952347648,
            "freePhysicalMemory": 117329920
          },
          "sensorStartTime": 1729864307857247
        },
        {
          "data": {
            "__type": "dk.cachet.carp.freememory",
            "freeVirtualMemory": 2957910016,
            "freePhysicalMemory": 124141568
          },
          "sensorStartTime": 1729864367870355
        },
        {
          "data": {
            "__type": "dk.cachet.carp.freememory",
            "freeVirtualMemory": 2950742016,
            "freePhysicalMemory": 113905664
          },
          "sensorStartTime": 1729864427889803
        }
      ]
    },
    "first_sequence_id": 29,
    "last_sequence_id": 32,
    "created_by": "20e2093c-e510-4e79-9385-478a19dc4723",
    "created_at": "2024-10-25T13:54:43.080504Z",
    "updated_by": "20e2093c-e510-4e79-9385-478a19dc4723",
    "updated_at": "2024-10-25T13:54:43.081204Z"
  },
  {
    "id": 4605,
    "data_stream_id": 655,
    "snapshot": {
      "syncPoint": {
        "synchronizedOn": "1970-01-01T00:00:00Z",
        "relativeClockSpeed": 1.0,
        "sensorTimestampAtSyncPoint": 0
      },
      "triggerIds": [
        9
      ],
      "measurements": [
        {
          "data": {
            "__type": "dk.cachet.carp.triggeredtask",
            "control": "Start",
            "taskName": "Task #17",
            "triggerId": 9,
            "destinationDeviceRoleName": "Polar HR Sensor"
          },
          "sensorStartTime": 1729776088592930
        }
      ]
    },
    "first_sequence_id": 1,
    "last_sequence_id": 1,
    "created_by": "20e2093c-e510-4e79-9385-478a19dc4723",
    "created_at": "2024-10-24T13:31:24.977530Z",
    "updated_by": "20e2093c-e510-4e79-9385-478a19dc4723",
    "updated_at": "2024-10-24T13:31:24.979868Z"
  },
...

Compared to the CARP Core domain model, there are several issues with this JSON:

  • The JSON uses a mixture of camelCase and snake_case serialization. Default JSON in CARP Core is always camelCase. For example, first_sequence_id should be firstSequenceId (see DataStreamSequence).
  • The data_stream_id (which should be dataStream) is a plain number rather than a DataStreamId JSON object. See DataStreamId for details.
  • The JSON contains data that is not part of the CARP Core domain model. For example id, last_sequence_id, created_by, etc.
  • The snapshot is not part of the DataStreamSequence domain model. The SyncPoint is, but it is embedded inside this odd "snapshot" wrapper.
bardram (Collaborator, Author) commented Nov 15, 2024

@yuanchen233 - I think the main problem here is that data is exported directly from the database rather than through the DataStreamService. We have talked before about this approach of bypassing the domain logic in CARP Core, and I really do not like it. It goes against everything that DDD recommends.

To quote Working_with_models:

In domain-driven design, an object's creation is often separated from the object itself.

A repository, for instance, is an object with methods for retrieving domain objects from a data store (e.g. a database). Similarly, a factory is an object with methods for directly creating domain objects.
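To make that concrete, here is a minimal Kotlin sketch of the separation the quote describes (hypothetical names, not actual CARP Core types):

// A repository retrieves domain objects from a data store (e.g. a database);
// callers never see the underlying tables or snapshots.
interface StudyProtocolRepository
{
    suspend fun getById( id: Int ): StudyProtocol?
}

// A factory creates domain objects directly.
object StudyProtocolFactory
{
    fun create( name: String ): StudyProtocol = StudyProtocol( id = 0, name = name )
}

data class StudyProtocol( val id: Int, val name: String )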

I am sure @Whathecode will agree ;-)

Whathecode (Member) commented

Yup. The endpoints on that service are severely lacking, however; this should be a prompt to add/work on those.

yuanchen233 (Collaborator) commented

  • The JSON uses a mixture of camelCase and snake_case serialization. Default JSON in CARP Core is always camelCase. For example, first_sequence_id should be firstSequenceId (see DataStreamSequence).

This is caused by JsonNaming being set to SnakeCaseStrategy; all columns generated from Auditable use the snake_case naming scheme. I am not sure whether those extra fields are still needed.
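For context, a minimal sketch of the kind of Jackson-annotated base class that could produce this (a hypothetical reconstruction, not the actual webservices code):

import com.fasterxml.jackson.databind.PropertyNamingStrategies
import com.fasterxml.jackson.databind.annotation.JsonNaming
import java.time.Instant

// With this naming strategy, Jackson serializes createdBy/createdAt as
// "created_by"/"created_at": exactly the snake_case fields in the export above.
@JsonNaming( PropertyNamingStrategies.SnakeCaseStrategy::class )
open class Auditable(
    val createdBy: String? = null,
    val createdAt: Instant? = null,
    val updatedBy: String? = null,
    val updatedAt: Instant? = null
)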

  • The data_stream_id (which should be dataStream) is a plain number rather than a DataStreamId JSON object. See DataStreamId for details.
  • The JSON contains data that is not part of the CARP Core domain model. For example id, last_sequence_id, created_by, etc.
  • The snapshot is not part of the DataStreamSequence domain model. The SyncPoint is, but it is embedded inside this odd "snapshot" wrapper.

This is caused by the webservices' weird design of the dataStream database tables/entities. We have these three tables and the notion of a snapshot inherited from the old Java project, and we also plan to change this in #137.

I think the main problem here is that data is exported directly from the database rather than through the DataStreamService. We have talked before about this approach of bypassing the domain logic in CARP Core, and I really do not like it. It goes against everything that DDD recommends.

Agreed. Currently, the export endpoint (the export button on the portal) performs a database dump. It does not use the API of each ApplicationService to retrieve data, and this should be changed. This change was proposed but postponed due to the need to maintain a consistent format for the exported data of active studies.

Yup. The endpoints on that service are severely lacking, however; this should be a prompt to add/work on those.

Most of the issues are caused by the current design of the database. Given that the tables work, retrieving data through core endpoints like DataStreamServiceRequest.GetDataStream should work properly without exposing the extra information, even though the database does need some re-design.
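Something like the following sketch, assuming the DataStreamService interface from carp.core-kotlin (the package paths and the getDataStream signature are my best recollection and should be checked against core; the surrounding function is hypothetical):

import dk.cachet.carp.common.application.UUID
import dk.cachet.carp.common.application.data.DataType
import dk.cachet.carp.data.application.DataStreamBatch
import dk.cachet.carp.data.application.DataStreamId
import dk.cachet.carp.data.application.DataStreamService

// Retrieve a data stream through the domain service instead of dumping database rows.
suspend fun exportHeartbeats( service: DataStreamService, deploymentId: UUID ): DataStreamBatch
{
    val streamId = DataStreamId(
        studyDeploymentId = deploymentId,
        deviceRoleName = "Primary Phone",
        dataType = DataType( "dk.cachet.carp", "heartbeat" )
    )

    // Returns all sequences from sequence ID 0 onwards, serialized the same
    // way as the DataStreamBatch example from CARP Core further down.
    return service.getDataStream( dataStream = streamId, fromSequenceId = 0 )
}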


In short, I think there are a few things we could/should do, some of which might be off-topic here:

  • Use the core dataStream endpoint to retrieve data, not export
  • The export feature in webservices should also use the core endpoints to retrieve data, instead of performing a database dump
  • Re-design the dataStream database and stop using the added snapshot
  • Revisit Auditable and decide whether it is still needed for the current implementation of webservices
  • At some point, a re-design of all databases is also desired. Currently, all data is saved as snapshots created with the core JSON serializer. This is not very efficient, does not utilize any of the advantages a database provides, and the processing time grows far too long as the number of instances goes up. For example, RecruitmentSnapshot keeps a list of all participantGroups; when we have 20000 participantGroups and a new participantGroup is created, it takes a huge amount of time to build the Recruitment object, update it, convert it back to JSON, and save it (see the sketch below).
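A simplified illustration of that last point (hypothetical types, not the actual webservices entities):

import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

@Serializable
data class ParticipantGroup( val id: String )

@Serializable
data class RecruitmentSnapshot( val participantGroups: List<ParticipantGroup> )

// Adding one group deserializes and reserializes the entire snapshot, so every
// insert costs O(#groups) in time and allocations; with 20000 groups this
// dominates the request time.
fun addParticipantGroup( storedJson: String, group: ParticipantGroup ): String
{
    val snapshot = Json.decodeFromString<RecruitmentSnapshot>( storedJson )
    val updated = snapshot.copy( participantGroups = snapshot.participantGroups + group )
    return Json.encodeToString( updated )
}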

If we decide to re-design the database later, some words of advice from @Whathecode would be very helpful and much appreciated.

yuanchen233 (Collaborator) commented

The current webservices' dataStream database implementation:
[UML diagram: DataStream_UML_Final]

bardram (Collaborator, Author) commented Jan 18, 2025

After restarting the server and collecting some data, I finally managed to look into the data format downloaded from CAWS.

It definitely looks much more correct now, but there is still something wrong.

Here is an example of the data I get from CAWS:

[
...
   {
    "sequenceId": 33568,
    "studyDeploymentId": "d173a6a7-878e-498a-98e1-9dc80877a3b1",
    "deviceRoleName": "Primary Phone",
    "measurement": {
      "sensorStartTime": 1737064846099240,
      "data": {
        "__type": "dk.cachet.carp.heartbeat",
        "period": 5,
        "deviceType": "dk.cachet.carp.common.application.devices.WeatherService",
        "deviceRoleName": "Weather Service"
      }
    },
    "triggerIds": [
      0
    ],
    "syncPoint": {
      "synchronizedOn": "1970-01-01T00:00:00Z",
      "sensorTimestampAtSyncPoint": 0,
      "relativeClockSpeed": 1.0
    },
    "dataStream": {
      "studyDeploymentId": "d173a6a7-878e-498a-98e1-9dc80877a3b1",
      "deviceRoleName": "Primary Phone",
      "dataType": {
        "namespace": "dk.cachet.carp",
        "name": "heartbeat"
      }
    }
  },
  {
    "sequenceId": 33569,
    "studyDeploymentId": "d173a6a7-878e-498a-98e1-9dc80877a3b1",
    "deviceRoleName": "Primary Phone",
    "measurement": {
      "sensorStartTime": 1737064846099253,
      "data": {
        "__type": "dk.cachet.carp.heartbeat",
        "period": 5,
        "deviceType": "dk.cachet.carp.common.application.devices.AirQualityService",
        "deviceRoleName": "Air Quality Service"
      }
    },
    "triggerIds": [
      0
    ],
    "syncPoint": {
      "synchronizedOn": "1970-01-01T00:00:00Z",
      "sensorTimestampAtSyncPoint": 0,
      "relativeClockSpeed": 1.0
    },
    "dataStream": {
      "studyDeploymentId": "d173a6a7-878e-498a-98e1-9dc80877a3b1",
      "deviceRoleName": "Primary Phone",
      "dataType": {
        "namespace": "dk.cachet.carp",
        "name": "heartbeat"
      }
    }
  },
...
]

And here is an example (from CARP Core) of what we should get from the getDataStream endpoint: a DataStreamBatch, which in turn is a list of DataStreamSequence objects:

[
    {
        "dataStream": {
            "studyDeploymentId": "c9cc5317-48da-45f2-958e-58bc07f34681",
            "deviceRoleName": "Participant's phone",
            "dataType": "dk.cachet.carp.geolocation"
        },
        "firstSequenceId": 0,
        "measurements": [
            {
                "sensorStartTime": 1642505045000000,
                "data": {
                    "__type": "dk.cachet.carp.geolocation",
                    "latitude": 55.68061908805645,
                    "longitude": 12.582050313435703,
                    "sensorSpecificData": {
                        "__type": "dk.cachet.carp.signalstrength",
                        "rssi": 0
                    }
                }
            },
            {
                "sensorStartTime": 1642505144000000,
                "data": {
                    "__type": "dk.cachet.carp.geolocation",
                    "latitude": 55.680802203873114,
                    "longitude": 12.581802212861367
                }
            }
        ],
        "triggerIds": [
            0
        ]
    }
]

bardram (Collaborator, Author) commented Jan 18, 2025

There are several issues with the data I get from CAWS as compared to the CARP Core Domain model:

  • CAWS returns a list of individual measurements (one measurement in each block) instead of sequences each holding a list of measurements
  • There is a sequenceId for each measurement instead of a firstSequenceId
  • The dataType in the dataStream id from CAWS is wrongly formatted. It should be "dataType": "dk.cachet.carp.geolocation", i.e., a single string instead of a JSON object split into namespace and name (see the serializer sketch after this list).
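For reference, the single-string form can be produced with a custom kotlinx.serialization serializer, along these lines (the DataType stand-in below is simplified; CARP Core's actual type and serializer live in dk.cachet.carp.common):

import kotlinx.serialization.KSerializer
import kotlinx.serialization.Serializable
import kotlinx.serialization.descriptors.PrimitiveKind
import kotlinx.serialization.descriptors.PrimitiveSerialDescriptor
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder

// Simplified stand-in for CARP Core's DataType.
@Serializable( with = DataTypeAsStringSerializer::class )
data class DataType( val namespace: String, val name: String )

// Serializes DataType as the single string "namespace.name", matching
// "dataType": "dk.cachet.carp.geolocation" instead of a two-field object.
object DataTypeAsStringSerializer : KSerializer<DataType>
{
    override val descriptor: SerialDescriptor =
        PrimitiveSerialDescriptor( "DataType", PrimitiveKind.STRING )

    override fun serialize( encoder: Encoder, value: DataType ) =
        encoder.encodeString( "${value.namespace}.${value.name}" )

    override fun deserialize( decoder: Decoder ): DataType
    {
        val fullName = decoder.decodeString()
        return DataType(
            namespace = fullName.substringBeforeLast( '.' ),
            name = fullName.substringAfterLast( '.' )
        )
    }
}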

In general, it still seems like you are not using the CARP Core domain model, but rather extracting data directly from the database, @yuanchen233 ...
