Managed Service Monitoring #101

Open
@mike-gangl

Managed Service Monitoring

"As an operator, I want to monitor the health of various Unity services."
[Image: Screenshot 2024-04-09 at 8 34 28 PM]
NOTE: S3 bucket is defined here.

An example of the health dashboard from AWS:

[Image: Screenshot 2024-03-29 at 1 03 27 PM]

Per venue, you'd have something like:

Service              Current
JupyterHub           🟢
Airflow              🟢
Data Catalog         🟢
Application Catalog  🟢
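
The per-venue table above could be produced from a simple mapping of service name to health flag. This is a minimal sketch only: the function name `render_status_table` is hypothetical, and the red indicator for an unhealthy service is an assumption (the mock-up only shows green).

```python
# Hypothetical indicator mapping; only the green indicator appears in the mock-up,
# the red one is an assumption for illustration.
INDICATORS = {True: "🟢", False: "🔴"}


def render_status_table(services: dict[str, bool]) -> str:
    """Return a plain-text 'Service / Current' table, one row per service."""
    # Pad the first column to the longest service name (or the header).
    width = max([len("Service")] + [len(name) for name in services])
    lines = [f"{'Service'.ljust(width)}  Current"]
    for name, healthy in services.items():
        lines.append(f"{name.ljust(width)}  {INDICATORS[healthy]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_status_table({"JupyterHub": True, "Airflow": True}))
```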

Marketplace options:

  • Jupyter (integrate)
  • SPS (integrate)
  • DS (buckets)
  • SDAP (integrate)

Each service needs a health endpoint.

  • This includes the Data Catalog, the Application Catalog, and app-pack-gen health metrics.

UI team to design an endpoint to show the health dashboard: https://apigateway-for-unity/project/venue/health

Questions to answer during planning:

  • Where is the health dashboard hosted? On the management console? On the UIUX dashboard?
  • How do we add the health check endpoint to the marketplace description/output?
  • How will the management console query the health endpoint?
  • What format will the health endpoint return for the dashboard to consume?
  • What permissions issues arise when accessing the health endpoints?

Health check SSM params should be defined in the project venue as:
/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>
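
As an illustration of the naming convention above, here is a minimal sketch that builds and parses these parameter names. The helper names and the example component name `data-catalog` are hypothetical, and the sketch makes no assumption about what value is stored under each parameter.

```python
# Prefix taken from the convention in this issue.
HEALTH_CHECK_PREFIX = "/unity/healthCheck"


def health_check_param_name(marketplace_item: str, component_name: str) -> str:
    """Build /unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>."""
    for part in (marketplace_item, component_name):
        # Reject empty segments and embedded slashes, which would change the path depth.
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return f"{HEALTH_CHECK_PREFIX}/{marketplace_item}/{component_name}"


def parse_health_check_param_name(name: str) -> tuple[str, str]:
    """Split a parameter name back into (marketplace_item, component_name)."""
    prefix = HEALTH_CHECK_PREFIX + "/"
    if not name.startswith(prefix):
        raise ValueError(f"not a health check parameter: {name!r}")
    parts = name[len(prefix):].split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected two path segments after the prefix: {name!r}")
    return parts[0], parts[1]
```

With this layout, a consumer could list every health check in a venue by reading all parameters under the `/unity/healthCheck` path, for example with the boto3 SSM client's `get_parameters_by_path` call using `Recursive=True`.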

Acceptance Criteria

  • JupyterHub is integrated into the marketplace
    • includes health endpoint
  • Airflow/SPS is integrated into the marketplace
    • includes health endpoint
  • U-DS data bucket is integrated into the marketplace
    • includes health endpoint
  • Shared Service health endpoints
    • takes the form: /unity/healthCheck/shared-services/
    • Data Catalog
    • Algorithm Catalog
    • App-pack-gen (?)
    • Process Mapper health check
  • A service can query the health endpoints and generate or store a JSON health response (e.g. every 5 minutes)
    • Don't overengineer this: it could simply be a Lambda that queries the health endpoints and writes a JSON document to a bucket every 5 minutes. We can optimize later.
  • Unity-py client to request health status from venue endpoint and return results
  • UIUX-developed dashboard for displaying the health of existing services
    • should respond to dynamic content
    • should read the JSON file generated above (from a bucket, web service, etc.)
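
The "simple Lambda" idea in the criteria above could look roughly like the sketch below: query each health endpoint and collect the results into one JSON document. Everything specific here is an assumption, since the response format is still an open planning question: the document shape (`generatedAt`, `services`, `healthy`), the rule that any HTTP 2xx response means healthy, and the function names are all hypothetical. Scheduling and writing the document to a bucket are left out.

```python
import http.client
import json
import urllib.request
from datetime import datetime, timezone
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> bool:
    """Assumption: any HTTP 2xx response counts as healthy, anything else as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Covers connection errors, timeouts, HTTP error statuses, and malformed URLs.
        return False


def build_health_document(
    endpoints: dict[str, str],
    fetch: Callable[[str], bool] = fetch_status,
) -> str:
    """Query every endpoint once and return a JSON health document (hypothetical shape)."""
    services = [
        {"name": name, "url": url, "healthy": fetch(url)}
        for name, url in endpoints.items()
    ]
    document = {
        "generatedAt": datetime.now(timezone.utc).isoformat(),
        "services": services,
    }
    return json.dumps(document, indent=2)
```

Passing `fetch` as a parameter keeps the document-building logic testable without network access. A dashboard or client would then only need to read this one document rather than call each service.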

Work Tickets

Link to work tickets required to implement the epic

Dependencies

Other epics or outside tickets required for this to work

Associated Risks

links to risk issues associated with this epic

  • TBC

Out of scope but future work:

  • Historical health (last 7 days)
  • Alerting a user based on health
  • Aggregating health, e.g. another dashboard that can monitor health across all venues (e.g. MMO)
  • Degradation vs healthy/not healthy
  • Measuring 'uptime' of a service for SLA metrics

previous


This overlaps with the idea of Common metrics/logs aggregation service #92.

How do we plan to monitor the deployed managed services? I think that to evolve into a full multi-tenant system we need to make sure we are monitoring:

  • Health of a service
  • Uptime of a service
  • Degradation (health?): if it's responding to requests, how fast does it respond?

Think of a single console that can monitor all of the managed services across multiple accounts. What does this look like? How are logs/metrics/events propagated to the "central" dashboard? Or does the dashboard reach into different accounts to view things?

