Description
Managed Service Monitoring
"As an operator, I want to monitor the health of various Unity services."
NOTE: S3 bucket is defined here.
An example of the health dashboard from AWS:
Per venue, you'd have something like:
Service | Current |
---|---|
JupyterHub | 🟢 |
Airflow | 🟢 |
Data Catalog | 🟢 |
Application Catalog | 🟢 |
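As a rough illustration of how a per-venue table like the one above could be produced from dynamic data, here is a minimal sketch. The function name `render_status_table`, the boolean input shape, and the use of 🔴 for an unhealthy service are assumptions made for this example, not agreed design.

```python
from typing import Dict


def render_status_table(statuses: Dict[str, bool]) -> str:
    """Render a service -> healthy mapping in the table layout shown above.

    Hypothetical helper: the input shape and the red icon for unhealthy
    services are assumptions, not part of the epic.
    """
    lines = ["Service | Current |", "---|---|"]
    for service, healthy in statuses.items():
        icon = "🟢" if healthy else "🔴"
        lines.append(f"{service} | {icon} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_status_table({"JupyterHub": True, "Airflow": False}))
```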
Marketplace options:
- Jupyter (integrate)
- SPS (integrate)
- DS (buckets)
- SDAP (integrate)
Each Service needs a health endpoint
- includes Data Catalog and Application Catalog, app-pack-gen health metrics
UI team to design endpoint to show health dashboard https://apigateway-for-unity/project/venue/health
Questions to answer during planning:
- Where is the health dashboard hosted? On the mgmt console? On the UIUX dashboard?
- How do we add the health check endpoint into the marketplace description/output?
- How will the mgmt console query the health endpoint?
- What format will the health endpoint return for the dashboard to consume?
- What permissions issues arise when accessing the health endpoints?
Health check SSM params should be defined in the project venue as:
/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>
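The naming convention above could be captured in a small helper, sketched here for illustration. The function names `health_check_param_name` and `parse_health_check_param` are hypothetical, as is the rule that a path segment may not be empty or contain a slash; the item and component names in the usage lines are placeholders, not agreed values.

```python
from typing import Tuple

HEALTH_CHECK_PREFIX = "/unity/healthCheck"


def health_check_param_name(marketplace_item: str, component_name: str) -> str:
    """Build the SSM parameter name for one component's health check."""
    for part in (marketplace_item, component_name):
        # Assumption: segments must be non-empty and must not contain "/".
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return f"{HEALTH_CHECK_PREFIX}/{marketplace_item}/{component_name}"


def parse_health_check_param(name: str) -> Tuple[str, str]:
    """Split a parameter name back into (marketplace_item, component_name)."""
    prefix = HEALTH_CHECK_PREFIX + "/"
    if not name.startswith(prefix):
        raise ValueError(f"not a health check parameter: {name!r}")
    parts = name[len(prefix):].split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected <item>/<component> after prefix: {name!r}")
    return parts[0], parts[1]


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    print(health_check_param_name("shared-services", "data-catalog"))
    print(parse_health_check_param("/unity/healthCheck/shared-services/data-catalog"))
```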
Acceptance Criteria
- JupyterHub is integrated into the marketplace
- includes health endpoint
- Airflow/SPS is integrated into the marketplace
- includes health endpoint
- U-DS data bucket is integrated into the marketplace
- includes health endpoint
- Shared Service health endpoints
- takes the form: /unity/healthCheck/shared-services/
- Data Catalog
- Algorithm Catalog
- App-pack-gen (?)
- Process Mapper health check
- A service can query health endpoints and generate or store a JSON health response (e.g. every 5 minutes)
- Don't overengineer this: it could simply be a lambda that queries the health endpoints every 5 minutes and stores the resulting JSON document in a bucket. We can optimize later.
- Unity-py client to request health status from venue endpoint and return results
- UIUX developed dashboard for displaying health of existing services
- should respond to dynamic content
- should read the JSON file generated above (from a bucket, web service, etc.)
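The "simple lambda" idea in the criteria above might look roughly like the sketch below. It is a starting point under stated assumptions, not a settled design: the response format is still an open planning question, so the JSON field names (`generatedAt`, `services`, `httpStatus`, `healthy`) are invented for illustration, as are the function names and the rule that any 2xx response counts as healthy. Reading endpoints from SSM and writing the document to the bucket are left out; a Lambda would likely do those with an AWS SDK.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except Exception:
        return 0  # DNS failure, timeout, connection refused, etc.


def collect_health(
    endpoints: Dict[str, str],
    probe: Callable[[str], int] = http_status,
) -> dict:
    """Probe each endpoint (service name -> URL) and build one report.

    Assumption: the report layout below is hypothetical.
    """
    services = []
    for name, url in sorted(endpoints.items()):
        code = probe(url)
        services.append({
            "service": name,
            "endpoint": url,
            "httpStatus": code,
            "healthy": 200 <= code < 300,  # assumed definition of healthy
        })
    return {
        "generatedAt": datetime.now(timezone.utc).isoformat(),
        "services": services,
    }


def to_json_document(report: dict) -> str:
    """Serialize the report; the caller decides where to store it."""
    return json.dumps(report, indent=2)
```

Passing `probe` as a parameter keeps the collection logic testable without network access; a scheduled job would call `collect_health` with the real endpoint map and store the output of `to_json_document` wherever the dashboard reads from.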
Work Tickets
Link to work tickets required to implement the epic
- TBC (to be created)
- Implement lambda in Venue account to periodically gather health status unity-cs#367
- Create SSM parameter for monitoring S3 bucket name unity-cs#370
- Venue Creation scripts would copy shared services health check SSM params into venue unity-cs#374
- Add health check endpoint for U-DS catalog unity-data-services#351
- Create Backend API to Serve Health Statuses of Deployed services in a proj/venue unity-cs#381
Dependencies
Other epics or outside tickets required for this to work
- JupyterHub integration into Unity Marketplace #90
- Rework SPS Marketplace installation to use EKS/httpd unity-cs#351
Associated Risks
links to risk issues associated with this epic
- TBC
Out of scope but future work:
- Historical health (last 7 days)
- Alerting a user based on health
- Aggregating health - e.g. another dashboard that can monitor health across all venues (e.g. MMO)
- Degradation vs healthy/not healthy
- Measuring 'uptime' of a service for SLA metrics
previous
This overlaps with the idea of Common metrics /logs aggregation service #92.
How do we plan to monitor the deployed managed services? I think that to evolve into a full multi-tenant system we need to make sure we are monitoring:
- Health of a service
- Uptime of a service
- Degradation (health?) - if it's responding to requests, how fast does it respond?
Think of a single console that can monitor all of the managed services across multiple accounts. What does this look like? How are logs/metrics/events propagated to the "central" dashboard? Or does the dashboard reach into different accounts to view things?