Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically capture resource metrics for every sandbox #216

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

0div
Copy link
Contributor

@0div 0div commented Dec 6, 2024

Description

Periodically call a envd's /metrics endpoint from orchestrator to get basic resource usage metrics. Calls it on start, on 5s interval and on end of Sandbox lifecycle

  • add logging methods for CPU percentage and Memory MB
  • bump envd version and check that version is gte this version to call this new endpoint
  • replace time.After in check loop with time.Ticker to avoid allocating a new time on every iteration

Test

Let's test this manually for now

Description by Callstackai

This PR introduces a new feature to periodically capture resource metrics for every sandbox by calling the /metrics endpoint of envd. It includes new types for metrics, updates to the API, and logging enhancements.

Diagrams of code changes
sequenceDiagram
    participant Client
    participant API
    participant Sandbox
    participant Loki
    participant EnvD

    Client->>API: GET /sandboxes/{sandboxID}/metrics
    
    API->>Loki: Query metrics for sandboxID
    Loki-->>API: Return metrics data
    
    alt Metrics Found
        API->>API: Parse metrics data
        API->>API: Transform to SandboxMetric objects
        API-->>Client: Return SandboxMetrics response
    else No Metrics
        API-->>Client: Return 404 error
    end

    loop Every 5 seconds
        Sandbox->>EnvD: GET /metrics
        EnvD-->>Sandbox: Return CPU & Memory metrics
        Sandbox->>Loki: Log metrics data
    end
Loading
Files Changed
FileSummary
packages/api/internal/api/api.gen.goAdded a new API endpoint for fetching sandbox metrics.
packages/api/internal/api/spec.gen.goUpdated Swagger specification to include the new metrics endpoint.
packages/api/internal/api/types.gen.goDefined new types for sandbox metrics.
packages/api/internal/handlers/sanbox_metrics.goImplemented the handler for fetching sandbox metrics from Loki.
packages/docker-reverse-proxy/main.tfSpecified the platform for the Docker image.
packages/envd/internal/host/metrics.goUpdated metrics structure to include memory in MiB.
packages/envd/main.goBumped the version of envd.
packages/orchestrator/internal/sandbox/checks.goReplaced time.After with time.Ticker for periodic metric logging.
packages/orchestrator/internal/sandbox/metrics.goDefined the structure for sandbox metrics.
packages/orchestrator/internal/sandbox/sandbox.goUpdated sandbox initialization to check envd version.
packages/orchestrator/internal/server/sandboxes.goAdded logging of metrics before stopping the sandbox.
packages/shared/pkg/logs/logger.goUpdated logger to handle new metrics logging.
spec/openapi.ymlUpdated OpenAPI specification to include new metrics schema.

This PR includes files in programming languages that we currently do not support. We have not reviewed files with the extensions .go, .tf, .yml. See list of supported languages.

Copy link

linear bot commented Dec 6, 2024

Copy link
Contributor

@callstackai callstackai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key Issues

Reducing the HTTP client timeout to 1 second may lead to frequent timeouts in network-dependent functions, potentially affecting reliability under network latency or high load conditions.

packages/orchestrator/internal/sandbox/sandbox.go Outdated Show resolved Hide resolved
packages/envd/internal/host/metrics.go Outdated Show resolved Hide resolved
packages/shared/pkg/logs/logger.go Outdated Show resolved Hide resolved
packages/orchestrator/internal/sandbox/checks.go Outdated Show resolved Hide resolved
packages/orchestrator/internal/sandbox/checks.go Outdated Show resolved Hide resolved
@0div 0div requested a review from jakubno December 6, 2024 20:19
Copy link
Member

@jakubno jakubno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also remove the old method for getting metrics (stats)?

packages/envd/internal/host/metrics.go Outdated Show resolved Hide resolved
packages/envd/internal/host/metrics.go Outdated Show resolved Hide resolved
@0div
Copy link
Contributor Author

0div commented Dec 6, 2024

Can we also remove the old method for getting metrics (stats)?

Ok, i was thinking maybe we could compare side by side but if you're confident that it won't be useful, ill remove it.

@jakubno
Copy link
Member

jakubno commented Dec 9, 2024

Can we also remove the old method for getting metrics (stats)?

Ok, i was thinking maybe we could compare side by side but if you're confident that it won't be useful, ill remove it.

It should be basically the same and the new ones are shinny and better

packages/shared/pkg/logs/logger.go Outdated Show resolved Hide resolved
packages/shared/pkg/logs/logger.go Outdated Show resolved Hide resolved
packages/shared/pkg/logs/logger.go Outdated Show resolved Hide resolved
Copy link
Contributor

@callstackai callstackai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key Issues

The internal variable memoryMB still uses the old name and potentially wrong unit, which can lead to incorrect calculations when compared with l.memoryMiBMax due to the difference between MiB and MB units.

packages/shared/pkg/logs/logger.go Show resolved Hide resolved
@@ -152,6 +152,41 @@ components:
items:
$ref: "#/components/schemas/SandboxLog"

SandboxMetric:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed in RPC, add total CPU 🙏🏻

@jakubno jakubno changed the base branch from main to dev December 20, 2024 13:22
@jakubno
Copy link
Member

jakubno commented Dec 20, 2024

I changed the base branch to dev we want to merge to this branch, could you please solve merge conflicts 🙏

Base automatically changed from dev to main January 1, 2025 17:09
@ValentaTomas ValentaTomas added the feature New feature label Jan 7, 2025
Copy link
Member

@jakubno jakubno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing and it's ready to go

spec/openapi.yml Show resolved Hide resolved
@@ -563,6 +598,29 @@ paths:
"500":
$ref: "#/components/responses/500"

/sandboxes/{sandboxID}/metrics:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want return only the latest? What is the expected use case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only the latest for now

packages/orchestrator/internal/server/sandboxes.go Outdated Show resolved Hide resolved
0div added 2 commits January 17, 2025 17:15
…ally-capture-resource-metrics-for-every-sandbox-e2b-1024
* remove dup code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants