Feature/sdk reporting #316

Open · wants to merge 181 commits into main

Conversation

@EspressoTrip-v2 commented Aug 4, 2025

Connection SDK reporting.
Stores SDK data for user connections on the instance. Open edition storage is disabled by default, so there is no impact on attached databases. The event emitter engine is attached to the service context, so it is extendable for other use cases.

Additions:

  1. Event emitter engine
  2. Postgres report storage factory
  3. Mongo report storage factory

Basic flow:

  • User connects: user information, JWT, and meta are stored.
  • User refreshes or reconnects: if the connection event falls within the same day, the stored initial connection is updated; otherwise a new connection is stored.
  • User disconnects: the connection is updated as disconnected, keeping the previously stored connected time for long-running connections.

export type SubscribeEvents = {
  [EventsEngineEventType.SDK_CONNECT_EVENT]: ClientConnectionEventData;
  [EventsEngineEventType.SDK_DISCONNECT_EVENT]: ClientDisconnectionEventData;
  [EventsEngineEventType.SDK_DELETE_OLD]: DeleteOldConnectionData;
};

Contributor

What emits this event?

Author

In this instance it's the report module.
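
For context, a minimal sketch of how the report module might emit this event through the engine. The enum string values, the payload shape, and the emit signature are assumptions for illustration; only the type names come from the PR:

// Sketch only: enum values and engine shape are assumed.
enum EventsEngineEventType {
  SDK_CONNECT_EVENT = 'sdk-connect-event',
  SDK_DISCONNECT_EVENT = 'sdk-disconnect-event',
  SDK_DELETE_OLD = 'sdk-delete-old'
}

// Assumed payload, matching the fields shown elsewhere in this PR.
interface ClientConnectionEventData {
  client_id?: string;
  user_id: string;
  user_agent?: string;
  jwt_exp?: Date;
}

// Hypothetical handle for the events engine attached to the service context.
interface EventsEngine {
  emit(event: EventsEngineEventType, data: ClientConnectionEventData): void;
}

// The report module is the producer: it emits the connect event,
// and the storage factories subscribe and persist the data.
function reportClientConnection(engine: EventsEngine, data: ClientConnectionEventData): void {
  engine.emit(EventsEngineEventType.SDK_CONNECT_EVENT, data);
}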

client_id?: string;
user_id: string;
user_agent?: string;
jwt_exp?: Date;

Contributor

We can make this required - added comments elsewhere.

Author

This is for the most part a side effect of the Context object.

Comment on lines +219 to +221
user_id = ${{ type: 'varchar', value: user_id }}
AND client_id = ${{ type: 'varchar', value: client_id }}
AND connected_at = ${{ type: 1184, value: connectIsoString }}

Contributor

While the probability is low, there could be multiple connections with the same (user_id, client_id, connected_at) combination. I'd recommend using a unique id per connection instead. Two options:

  1. Use the existing rid (request id) we generate for the logger - that could perhaps form the primary key here. This would be nice in that we could correlate these connections with the logs, but it would require some refactoring to make that id available in the request context.
  2. Return the id from the reportClientConnection event, then use that id when disconnecting.
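
A rough sketch of option 2, with every name here hypothetical:

import { randomUUID } from 'node:crypto';

// Hypothetical storage interface for this sketch.
interface ConnectionStore {
  insertConnection(id: string, userId: string, clientId?: string): Promise<void>;
  markDisconnected(id: string): Promise<void>;
}

// reportClientConnection returns a unique id per connection...
async function reportClientConnection(store: ConnectionStore, userId: string, clientId?: string): Promise<string> {
  const connectionId = randomUUID();
  await store.insertConnection(connectionId, userId, clientId);
  return connectionId;
}

// ...and the disconnect path uses that id instead of assuming
// (user_id, client_id, connected_at) is unique.
async function reportClientDisconnection(store: ConnectionStore, connectionId: string): Promise<void> {
  await store.markDisconnected(connectionId);
}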

Author

I originally used that, but when we moved to updating an existing document it was no longer a viable option.

Author

@EspressoTrip-v2 commented Aug 22, 2025

> While the probability is low, there could be multiple connections with the same (user_id, client_id, connected_at) combination. I'd recommend using a unique id per connection instead. Two options:
>
>   1. Use the existing rid (request id) we generate for the logger - that could perhaps form the primary key here. This would be nice in that we could correlate these connections with the logs, but it would require some refactoring to make that id available in the request context.
>   2. Return the id from the reportClientConnection event, then use that id when disconnecting.

Could you give me an example of when this (user_id, client_id, connected_at) collision might occur?
My understanding is that the client_id is basically linked to the client db and would only change if the application were reinstalled... So does that mean there is a case where the same user could install the application on a different device connected to the same instance and end up with the same client_id generated, but with a different SDK?

Contributor

In most cases the client_id should be unique, but we don't have any guarantees on that. For example, the user can copy the database to another device (maybe not on Android/iOS, but on desktop it's easy), in which case the same client_id would be present on both. Older SDKs do not send client_id at all. And in some cases, the user_id is not unique either.

When combined with the connected_at timestamp it should be very rare that you get duplicates, but it is possible.

That said, since we're only "pre-aggregating" these connections per (user_id, client_id, day) anyway, I guess that doesn't really make a difference, and we can keep this as-is.

Comment on lines +68 to +70
client_id,
user_id,
connected_at

Contributor

Same as for postgres - we should use connection ids here, rather than assume this combination is unique.

});
}

describe('SDK reporting storage', async () => {

Contributor

Tests should also specifically include a case where client_id is undefined.
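
For example, something along these lines (the storage API names are assumed, as is a `storage` handle created in the surrounding describe block):

it('stores a connection when client_id is undefined', async () => {
  // client_id intentionally omitted: older SDKs do not send it.
  await storage.reportClientConnection({
    user_id: 'user-1',
    user_agent: 'test-agent'
  });

  const connections = await storage.getConnections({ user_id: 'user-1' });
  expect(connections).toHaveLength(1);
  expect(connections[0].client_id).toBeUndefined();
});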

});
}

describe('Connection report storage', async () => {

Contributor

Same as for MongoDB, tests should also specifically include a case where client_id is undefined.

Contributor

Additionally, is it feasible for these tests to primarily use the public storage APIs, rather than raw SQL? That way the tests could be shared between the Postgres and MongoDB implementations, potentially with only a small number of db-specific tests if you do still need the lower-level access in some places.
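
One possible shape for that, with the factory and storage API names all assumed:

// Shared suite: the same assertions run against any storage implementation.
interface ReportStorage {
  reportClientConnection(data: { user_id: string; client_id?: string }): Promise<void>;
  getConnections(filter: { user_id: string }): Promise<Array<{ client_id?: string }>>;
}

// Hypothetical per-backend factories.
declare const createPostgresReportStorage: () => Promise<ReportStorage>;
declare const createMongoReportStorage: () => Promise<ReportStorage>;

function describeReportStorage(name: string, createStorage: () => Promise<ReportStorage>) {
  describe(`Connection report storage (${name})`, () => {
    it('stores a connection when client_id is undefined', async () => {
      const storage = await createStorage();
      await storage.reportClientConnection({ user_id: 'user-1' });
      const rows = await storage.getConnections({ user_id: 'user-1' });
      expect(rows).toHaveLength(1);
    });
  });
}

// Each backend only supplies its own factory.
describeReportStorage('postgres', createPostgresReportStorage);
describeReportStorage('mongodb', createMongoReportStorage);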

Comment on lines +175 to +177
$gte: new Date(year, month, today),
$lt: new Date(year, month, nextDay)
}

Contributor

This depends on the timezone of the server - not sure if that is what we want?

But I'm also wondering - do we specifically want this behavior ("if the connection event is within the same day the stored initial connection is updated")? Why not store these as separate connection events, and then handle any other logic in the aggregations?

Author

Dylan was concerned about storing the connections as logs in the ps service db, due to volume. Considering we aggregate per day, per week, and per month, we only need day granularity. When a user refreshes the web SDK it actually causes a disconnect and a reconnect, which would create multiple copies within a very short period if they refresh a few times. So to reduce redundant connections by the same user I did it this way, which still gives us the minimal per-day granularity.

The date objects get converted to UTC time by Mongo.
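
To illustrate the same-day handling described above, a sketch only: the collection and document shape are assumed, and it keeps the PR's local-time day boundaries, which the next comment picks up on.

import { Collection } from 'mongodb';

interface ConnectionDoc {
  user_id: string;
  client_id?: string;
  connected_at: Date;
  disconnected_at: Date | null;
}

// If a connection for this (user_id, client_id) already exists within the
// current day, update it; otherwise insert a new document (upsert).
async function recordConnection(connections: Collection<ConnectionDoc>, userId: string, clientId: string | undefined, now: Date): Promise<void> {
  // NOTE: local-time day boundaries, as in the PR; see the timezone
  // discussion in the next comment.
  const dayStart = new Date(now.getFullYear(), now.getMonth(), now.getDate());
  const dayEnd = new Date(now.getFullYear(), now.getMonth(), now.getDate() + 1);

  await connections.updateOne(
    {
      user_id: userId,
      client_id: clientId,
      connected_at: { $gte: dayStart, $lt: dayEnd }
    },
    {
      $set: { disconnected_at: null },
      $setOnInsert: { connected_at: now }
    },
    { upsert: true }
  );
}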

Contributor

Ok, we can keep it as daily connections for now - it looks like we can change that later if needed without too much effort.

> The date objects get converted to UTC time by Mongo.

If we do keep daily values, we need to be very explicit about time zones. As is, if I run the server in UTC+2, the filter here would use 2025-08-21T22:00:00.000Z. If the reporting job happens to use a different timezone, it would not match up with the way it is persisted here.

Now in our hosted environment at least, it is likely that the service runs in the UTC timezone already, and this won't make a difference. But if we're relying on that, it is better to be explicit - use UTC timezones here (new Date(Date.UTC(year, month, today))), and document that UTC-based timestamps should be used when querying the data.
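
Concretely, the suggested fix might look like this:

// Compute the day window in UTC, independent of the server's timezone.
function utcDayRange(now: Date): { $gte: Date; $lt: Date } {
  const year = now.getUTCFullYear();
  const month = now.getUTCMonth();
  const day = now.getUTCDate();
  return {
    $gte: new Date(Date.UTC(year, month, day)),     // start of the UTC day
    $lt: new Date(Date.UTC(year, month, day + 1))   // start of the next UTC day
  };
}

// A server in UTC+2 at 2025-08-22T00:30 local time still filters on the
// UTC day 2025-08-21, matching how the timestamps are persisted.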
