
The DataStream data downloaded in the Export has wrong JSON format and missing data #178

Open
bardram opened this issue Nov 15, 2024 · 6 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

bardram (Collaborator) commented Nov 15, 2024

When exporting data, I would expect the data-streams.json file to contain correctly formatted JSON according to the CARP Core domain model. But, alas, there are issues.

Here is an example of some JSON exported:

data-streams.json
  {
    "id": 4961,
    "data_stream_id": 631,
    "snapshot": {
      "syncPoint": {
        "synchronizedOn": "1970-01-01T00:00:00Z",
        "relativeClockSpeed": 1.0,
        "sensorTimestampAtSyncPoint": 0
      },
      "triggerIds": [
        4
      ],
      "measurements": [
        {
          "data": {
            "__type": "dk.cachet.carp.freememory",
            "freeVirtualMemory": 2999164928,
            "freePhysicalMemory": 164835328
          },
          "sensorStartTime": 1729864247872632
        },
        {
          "data": {
            "__type": "dk.cachet.carp.freememory",
            "freeVirtualMemory": 2952347648,
            "freePhysicalMemory": 117329920
          },
          "sensorStartTime": 1729864307857247
        },
        {
          "data": {
            "__type": "dk.cachet.carp.freememory",
            "freeVirtualMemory": 2957910016,
            "freePhysicalMemory": 124141568
          },
          "sensorStartTime": 1729864367870355
        },
        {
          "data": {
            "__type": "dk.cachet.carp.freememory",
            "freeVirtualMemory": 2950742016,
            "freePhysicalMemory": 113905664
          },
          "sensorStartTime": 1729864427889803
        }
      ]
    },
    "first_sequence_id": 29,
    "last_sequence_id": 32,
    "created_by": "20e2093c-e510-4e79-9385-478a19dc4723",
    "created_at": "2024-10-25T13:54:43.080504Z",
    "updated_by": "20e2093c-e510-4e79-9385-478a19dc4723",
    "updated_at": "2024-10-25T13:54:43.081204Z"
  },
  {
    "id": 4605,
    "data_stream_id": 655,
    "snapshot": {
      "syncPoint": {
        "synchronizedOn": "1970-01-01T00:00:00Z",
        "relativeClockSpeed": 1.0,
        "sensorTimestampAtSyncPoint": 0
      },
      "triggerIds": [
        9
      ],
      "measurements": [
        {
          "data": {
            "__type": "dk.cachet.carp.triggeredtask",
            "control": "Start",
            "taskName": "Task #17",
            "triggerId": 9,
            "destinationDeviceRoleName": "Polar HR Sensor"
          },
          "sensorStartTime": 1729776088592930
        }
      ]
    },
    "first_sequence_id": 1,
    "last_sequence_id": 1,
    "created_by": "20e2093c-e510-4e79-9385-478a19dc4723",
    "created_at": "2024-10-24T13:31:24.977530Z",
    "updated_by": "20e2093c-e510-4e79-9385-478a19dc4723",
    "updated_at": "2024-10-24T13:31:24.979868Z"
  },
...

Compared to the CARP Core domain model, there are several issues with this JSON:

  • The JSON uses a mixture of camelCase and snake_case serialization. Default JSON in CARP Core is always camelCase. For example, first_sequence_id should be firstSequenceId (see DataStreamSequence).
  • The data_stream_id (which should be dataStream) is a plain number rather than a DataStreamId JSON object. See DataStreamId for details.
  • The JSON contains data that is not part of the CARP Core domain model. For example id, last_sequence_id, created_by, etc.
  • The snapshot is not part of the DataStreamSequence domain model. The SyncPoint is, but it is embedded inside this odd "snapshot" wrapper.
bardram (Collaborator, Author) commented Nov 15, 2024

@yuanchen233 - I think the main problem here is that data is exported directly from the database rather than through the DataStreamService. We have talked before about this approach of bypassing the domain logic in CARP Core, and I really do not like it. It goes against everything that DDD recommends.

To quote Working_with_models:

In domain-driven design, an object's creation is often separated from the object itself.

A repository, for instance, is an object with methods for retrieving domain objects from a data store (e.g. a database). Similarly, a factory is an object with methods for directly creating domain objects.
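To make that concrete, here is a minimal Kotlin sketch of the separation the quote describes (hypothetical names, not actual CARP Core types):

// A repository retrieves domain objects from a data store (e.g. a database);
// callers never see the underlying tables or snapshots.
interface StudyProtocolRepository
{
    suspend fun getById( id: Int ): StudyProtocol?
}

// A factory creates domain objects directly.
object StudyProtocolFactory
{
    fun create( name: String ): StudyProtocol = StudyProtocol( id = 0, name = name )
}

data class StudyProtocol( val id: Int, val name: String )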

I am sure @Whathecode will agree ;-)

Whathecode (Member) commented

Yup. The endpoints on that service are severely lacking, however; this should be a prompt to add/work on those.

yuanchen233 (Collaborator) commented

  • The JSON uses a mixture of camelCase and snake_case serialization. Default JSON in CARP Core is always camelCase. For example, first_sequence_id should be firstSequenceId (see DataStreamSequence).

This is caused by JsonNaming being set to SnakeCaseStrategy; all columns generated from Auditable use the snake_case naming scheme. I am not sure whether those extra fields are still needed.
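For context, a minimal sketch of the kind of Jackson-annotated base class that could produce this (a hypothetical reconstruction, not the actual webservices code):

import com.fasterxml.jackson.databind.PropertyNamingStrategies
import com.fasterxml.jackson.databind.annotation.JsonNaming
import java.time.Instant

// With this naming strategy, Jackson serializes createdBy/createdAt as
// "created_by"/"created_at": exactly the snake_case fields in the export above.
@JsonNaming( PropertyNamingStrategies.SnakeCaseStrategy::class )
open class Auditable(
    val createdBy: String? = null,
    val createdAt: Instant? = null,
    val updatedBy: String? = null,
    val updatedAt: Instant? = null
)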

  • The data_stream_id (which should be dataStream) is a plain number rather than a DataStreamId JSON object. See DataStreamId for details.
  • The JSON contains data that is not part of the CARP Core domain model. For example id, last_sequence_id, created_by, etc.
  • The snapshot is not part of the DataStreamSequence domain model. The SyncPoint is, but it is embedded inside this odd "snapshot" wrapper.

This is caused by the webservices' weird design of the dataStream database tables/entities. We have these three tables and the notion of a snapshot inherited from the old Java project, and we also plan to change this in #137.

I think the main problem here is that data is exported directly from the database rather than through the DataStreamService. We have talked before about this approach of bypassing the domain logic in CARP Core, and I really do not like it. It goes against everything that DDD recommends.

Agreed. Currently, the export endpoint (the export button on the portal) performs a database dump. It does not use the API of each ApplicationService to retrieve data, and this should be changed. This change was proposed but postponed due to the need to maintain a consistent format for the exported data of active studies.

Yup. The endpoints on that service are severely lacking, however; this should be a prompt to add/work on those.

Most of the issues are caused by the current design of the database. Given that the tables work, retrieving data through core endpoints like DataStreamServiceRequest.GetDataStream should work properly without exposing the extra information, even though the database does need some re-design.
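Something like the following sketch, assuming the DataStreamService interface from carp.core-kotlin (the package paths and the getDataStream signature are my best recollection and should be checked against core; the surrounding function is hypothetical):

import dk.cachet.carp.common.application.UUID
import dk.cachet.carp.common.application.data.DataType
import dk.cachet.carp.data.application.DataStreamBatch
import dk.cachet.carp.data.application.DataStreamId
import dk.cachet.carp.data.application.DataStreamService

// Retrieve a data stream through the domain service instead of dumping database rows.
suspend fun exportHeartbeats( service: DataStreamService, deploymentId: UUID ): DataStreamBatch
{
    val streamId = DataStreamId(
        studyDeploymentId = deploymentId,
        deviceRoleName = "Primary Phone",
        dataType = DataType( "dk.cachet.carp", "heartbeat" )
    )

    // Returns all sequences from sequence ID 0 onwards, serialized the same
    // way as the DataStreamBatch example from CARP Core further down.
    return service.getDataStream( dataStream = streamId, fromSequenceId = 0 )
}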


In short, I think there are a few things we could/should do, some of which might be off-topic here:

  • Use the core dataStream endpoint to retrieve data, not export
  • The export feature in webservices should also use the core endpoints to retrieve data, instead of performing a database dump
  • Re-design the dataStream database and stop using the added snapshot
  • Revisit Auditable and decide whether it is still needed for the current implementation of webservices
  • At some point, a re-design of all databases is also desired. Currently, all data is saved as snapshots created with the core JSON serializer. This is not very efficient, does not utilize any of the advantages a database provides, and the processing time grows far too long as the number of instances goes up. For example, RecruitmentSnapshot keeps a list of all participantGroups; when we have 20000 participantGroups and a new participantGroup is created, it takes a huge amount of time to build the Recruitment object, update it, convert it back to JSON, and save it (see the sketch below).
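A simplified illustration of that last point (hypothetical types, not the actual webservices entities):

import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

@Serializable
data class ParticipantGroup( val id: String )

@Serializable
data class RecruitmentSnapshot( val participantGroups: List<ParticipantGroup> )

// Adding one group deserializes and reserializes the entire snapshot, so every
// insert costs O(#groups) in time and allocations; with 20000 groups this
// dominates the request time.
fun addParticipantGroup( storedJson: String, group: ParticipantGroup ): String
{
    val snapshot = Json.decodeFromString<RecruitmentSnapshot>( storedJson )
    val updated = snapshot.copy( participantGroups = snapshot.participantGroups + group )
    return Json.encodeToString( updated )
}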

If we decide to re-design the database later, some words of advice from @Whathecode would be very helpful and much appreciated.

yuanchen233 (Collaborator) commented

The current webservices' dataStream database implementation:
[UML diagram: DataStream_UML_Final]

bardram (Collaborator, Author) commented Jan 18, 2025

After restarting the server and collecting some data, I finally managed to look into the data format downloaded from CAWS.

It definitely looks much more correct now, but there is still something wrong.

Here is an example of the data I get from CAWS:

[
...
   {
    "sequenceId": 33568,
    "studyDeploymentId": "d173a6a7-878e-498a-98e1-9dc80877a3b1",
    "deviceRoleName": "Primary Phone",
    "measurement": {
      "sensorStartTime": 1737064846099240,
      "data": {
        "__type": "dk.cachet.carp.heartbeat",
        "period": 5,
        "deviceType": "dk.cachet.carp.common.application.devices.WeatherService",
        "deviceRoleName": "Weather Service"
      }
    },
    "triggerIds": [
      0
    ],
    "syncPoint": {
      "synchronizedOn": "1970-01-01T00:00:00Z",
      "sensorTimestampAtSyncPoint": 0,
      "relativeClockSpeed": 1.0
    },
    "dataStream": {
      "studyDeploymentId": "d173a6a7-878e-498a-98e1-9dc80877a3b1",
      "deviceRoleName": "Primary Phone",
      "dataType": {
        "namespace": "dk.cachet.carp",
        "name": "heartbeat"
      }
    }
  },
  {
    "sequenceId": 33569,
    "studyDeploymentId": "d173a6a7-878e-498a-98e1-9dc80877a3b1",
    "deviceRoleName": "Primary Phone",
    "measurement": {
      "sensorStartTime": 1737064846099253,
      "data": {
        "__type": "dk.cachet.carp.heartbeat",
        "period": 5,
        "deviceType": "dk.cachet.carp.common.application.devices.AirQualityService",
        "deviceRoleName": "Air Quality Service"
      }
    },
    "triggerIds": [
      0
    ],
    "syncPoint": {
      "synchronizedOn": "1970-01-01T00:00:00Z",
      "sensorTimestampAtSyncPoint": 0,
      "relativeClockSpeed": 1.0
    },
    "dataStream": {
      "studyDeploymentId": "d173a6a7-878e-498a-98e1-9dc80877a3b1",
      "deviceRoleName": "Primary Phone",
      "dataType": {
        "namespace": "dk.cachet.carp",
        "name": "heartbeat"
      }
    }
  },
...
]

And here is an example (from CARP Core) of what we should get from the getDataStream endpoint: a DataStreamBatch, which in turn is a list of DataStreamSequence objects:

[
    {
        "dataStream": {
            "studyDeploymentId": "c9cc5317-48da-45f2-958e-58bc07f34681",
            "deviceRoleName": "Participant's phone",
            "dataType": "dk.cachet.carp.geolocation"
        },
        "firstSequenceId": 0,
        "measurements": [
            {
                "sensorStartTime": 1642505045000000,
                "data": {
                    "__type": "dk.cachet.carp.geolocation",
                    "latitude": 55.68061908805645,
                    "longitude": 12.582050313435703,
                    "sensorSpecificData": {
                        "__type": "dk.cachet.carp.signalstrength",
                        "rssi": 0
                    }
                }
            },
            {
                "sensorStartTime": 1642505144000000,
                "data": {
                    "__type": "dk.cachet.carp.geolocation",
                    "latitude": 55.680802203873114,
                    "longitude": 12.581802212861367
                }
            }
        ],
        "triggerIds": [
            0
        ]
    }
]

bardram (Collaborator, Author) commented Jan 18, 2025

There are several issues with the data I get from CAWS as compared to the CARP Core Domain model:

  • CAWS returns a list of individual measurements (one measurement in each block) instead of sequences each holding a list of measurements
  • There is a sequenceId for each measurement instead of a firstSequenceId
  • The dataType in the dataStream id from CAWS is wrongly formatted. It should be "dataType": "dk.cachet.carp.geolocation", i.e., a single string instead of a JSON object split into namespace and name (see the serializer sketch after this list).
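For reference, the single-string form can be produced with a custom kotlinx.serialization serializer, along these lines (the DataType stand-in below is simplified; CARP Core's actual type and serializer live in dk.cachet.carp.common):

import kotlinx.serialization.KSerializer
import kotlinx.serialization.Serializable
import kotlinx.serialization.descriptors.PrimitiveKind
import kotlinx.serialization.descriptors.PrimitiveSerialDescriptor
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder

// Simplified stand-in for CARP Core's DataType.
@Serializable( with = DataTypeAsStringSerializer::class )
data class DataType( val namespace: String, val name: String )

// Serializes DataType as the single string "namespace.name", matching
// "dataType": "dk.cachet.carp.geolocation" instead of a two-field object.
object DataTypeAsStringSerializer : KSerializer<DataType>
{
    override val descriptor: SerialDescriptor =
        PrimitiveSerialDescriptor( "DataType", PrimitiveKind.STRING )

    override fun serialize( encoder: Encoder, value: DataType ) =
        encoder.encodeString( "${value.namespace}.${value.name}" )

    override fun deserialize( decoder: Decoder ): DataType
    {
        val fullName = decoder.decodeString()
        return DataType(
            namespace = fullName.substringBeforeLast( '.' ),
            name = fullName.substringAfterLast( '.' )
        )
    }
}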

In general, it still seems like you are not using the CARP Core domain model, but rather extracting data directly from the database, @yuanchen233 ...
