Add enhanced GitHub Actions raw logs viewer #2076

Open
Urgau wants to merge 5 commits into master from gha_logs_ansi_up

Conversation

@Urgau (Member) commented Jun 13, 2025

This PR adds an enhanced (as in colored) GitHub Actions raw logs viewer.

This is useful for us because the GitHub Actions log viewer is just unusable in rust-lang/rust given the size of our logs (it loads indefinitely), and the raw logs are full of ANSI sequences cluttering the output.

This works by adding a new endpoint /gha-logs/:owner/:repo/:log-id (restricted to team repos) that fetches the raw logs from the GitHub REST API and embeds them directly in the HTML, which is then processed by the ansi_up JavaScript library (loaded from the jsdelivr.com CDN).

To prevent XSS injection (or just bugs in ansi_up), I added the Content-Security-Policy header with a restriction on script-src. I also tried adding an integrity attribute for ansi_up, but jsdelivr doesn't guarantee the same content for minified libraries.
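For illustration, a minimal sketch of such a response, assuming the http crate; the HTML shape, CDN URL, ansi_up import shape, and CSP value are simplified placeholders, not the actual triagebot implementation:

use http::{header, Response, StatusCode};

// Hypothetical sketch: build the HTML page for one log, embedding the
// escaped raw log text and loading ansi_up from the CDN to colorize it
// client-side.
fn render_gha_log_page(raw_logs: &str) -> Response<String> {
    // Escape the logs so arbitrary log content cannot inject markup.
    let escaped = raw_logs
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;");

    let body = format!(
        r#"<!DOCTYPE html>
<html>
  <body>
    <pre id="logs">{escaped}</pre>
    <script type="module">
      import {{ AnsiUp }} from "https://cdn.jsdelivr.net/npm/ansi_up/ansi_up.js";
      const pre = document.getElementById("logs");
      pre.innerHTML = new AnsiUp().ansi_to_html(pre.textContent);
    </script>
  </body>
</html>"#
    );

    Response::builder()
        .status(StatusCode::OK)
        .header(header::CONTENT_TYPE, "text/html; charset=utf-8")
        // Only allow scripts from the CDN plus this page's inline bootstrap;
        // a real implementation would prefer a nonce or hash over
        // 'unsafe-inline'.
        .header(
            header::CONTENT_SECURITY_POLICY,
            "default-src 'none'; script-src 'unsafe-inline' https://cdn.jsdelivr.net",
        )
        .body(body)
        .unwrap()
}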

After merging, we should probably adjust rust-log-analyzer to link to this new endpoint instead of to GitHub, so people can use it directly.

cc @Mark-Simulacrum

@Urgau Urgau requested a review from Kobzol June 13, 2025 17:50
@Kobzol (Contributor) left a comment

This is a really wonderful idea ❤️ And I can't wait to use it! I can take care of the RLA integration once/if this is merged.

Ok(res)
}

async fn process_logs(
Contributor

Could we add some cache for this in Context? Maybe an LRU hashmap or something like that? The logs can be large, and we probably don't want to spam the GH API when the page is refreshed or a couple of people visit the link when a job fails.
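As a sketch of the idea only (using the lru crate; the key shape, capacity of 20, and wiring into Context are made up):

use std::num::NonZeroUsize;

use lru::LruCache;

// Hypothetical cache shape: keyed by (owner, repo, log id), storing the raw
// log text. In triagebot this would live in Context behind a Mutex.
type LogsCache = LruCache<(String, String, String), String>;

fn new_logs_cache() -> LogsCache {
    // Capacity is a guess; once full, the least recently used log is evicted.
    LruCache::new(NonZeroUsize::new(20).unwrap())
}

fn get_or_fetch(
    cache: &mut LogsCache,
    key: (String, String, String),
    fetch: impl FnOnce() -> String,
) -> String {
    // Serve from cache when possible, otherwise fetch and remember the result.
    if let Some(logs) = cache.get(&key) {
        return logs.clone();
    }
    let logs = fetch();
    cache.put(key, logs.clone());
    logs
}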

Member Author

I thought of it, but given the size of some of the logs (some of them reach 3-4 MB each), I'm worried we'll have a high cache-miss rate due to the limited number of logs we'll be able to cache.

I also don't know how much we can safely cache; I don't know how much RAM the triagebot machine has left. Or maybe we'll want to have the cache on disk?

@Kobzol (Contributor) Jun 14, 2025

I wouldn't go for a disk cache. I think that even storing a Vec of ~10-100 last results would be enough. Logs are usually accessed only for a brief period of time after CI fails, when a bunch of people may get notified and they go check the logs. After a few days, people usually don't care about the log anymore. So remembering the last N logs to have a history of at least a few hours would help to avoid needless API requests, IMO.

Not sure about RAM size, but storing e.g. 20 results should hopefully be fine. @Mark-Simulacrum should know the machine specs, I think.

Member Author

Okay, I added a small 50 MB in-RAM cache for the raw logs.
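For reference, a byte-budget cache along these lines can be sketched with just std (the struct and names below are illustrative, not the code in this PR):

use std::collections::VecDeque;

// Minimal sketch of a byte-budget cache: newest entries at the back, oldest
// evicted first once the total size of the stored logs exceeds the budget.
struct BoundedLogsCache {
    max_bytes: usize,
    used_bytes: usize,
    entries: VecDeque<(String, String)>, // (log id, raw log text)
}

impl BoundedLogsCache {
    fn new(max_bytes: usize) -> Self {
        Self { max_bytes, used_bytes: 0, entries: VecDeque::new() }
    }

    fn get(&self, log_id: &str) -> Option<&str> {
        self.entries
            .iter()
            .find(|(id, _)| id == log_id)
            .map(|(_, logs)| logs.as_str())
    }

    fn insert(&mut self, log_id: String, logs: String) {
        self.used_bytes += logs.len();
        self.entries.push_back((log_id, logs));
        // Drop the oldest entries until we are back under the budget.
        while self.used_bytes > self.max_bytes {
            match self.entries.pop_front() {
                Some((_, evicted)) => self.used_bytes -= evicted.len(),
                None => break,
            }
        }
    }
}

Constructing it as BoundedLogsCache::new(50 * 1024 * 1024) would match the 50 MB budget mentioned above.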

Member

We currently allocate 1/4 of a CPU and 0.5 GB of memory: https://github.com/rust-lang/simpleinfra/blob/bcba51d544106fba945c904bf213d64a7bb7a473/terraform/shared/services/triagebot/main.tf#L47-L48

Looks like most of the memory is currently unused:

[screenshot: memory usage graph]

I think a small cache isn't bad, but I'd probably suggest that if we want this in the long run it should be backed by S3 (with, say, a 15-day TTL) -- that's pretty cheap, avoids worrying about how much space it uses, and S3 rate limiting is basically impossible to hit. If we did that we could also serve the logs from a CDN rather than through triagebot, which also seems nice.

@Kobzol (Contributor) commented Jun 13, 2025

We could also just host the ansi_up file ourselves through triagebot; it's relatively small (~14 KiB).
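A rough sketch of that option (the vendored file path and route wiring are hypothetical; the http crate is assumed):

use http::{header, Response, StatusCode};

// Hypothetical handler: serve a vendored copy of ansi_up from the triagebot
// binary itself instead of a CDN. ~14 KiB embedded via include_str! is
// negligible.
const ANSI_UP_JS: &str = include_str!("gha_logs/ansi_up.min.js");

fn serve_ansi_up() -> Response<String> {
    Response::builder()
        .status(StatusCode::OK)
        .header(header::CONTENT_TYPE, "text/javascript; charset=utf-8")
        // The vendored file never changes for a given deploy, so let
        // browsers cache it.
        .header(header::CACHE_CONTROL, "public, max-age=86400")
        .body(ANSI_UP_JS.to_string())
        .unwrap()
}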

@Urgau Urgau force-pushed the gha_logs_ansi_up branch 4 times, most recently from 32638f3 to 3242c4f on June 14, 2025 15:11
@Urgau Urgau force-pushed the gha_logs_ansi_up branch from 3242c4f to a85fafb on June 14, 2025 18:22
@Kobzol (Contributor) left a comment

Impl looks good, but I'd like to hear from Mark or someone else from t-infra about potential security concerns.

@Urgau Urgau force-pushed the gha_logs_ansi_up branch from a85fafb to 68996b0 on June 14, 2025 22:16

tracing::info!("gha_logs: cache miss for {log_uuid}");
let logs = ctx
    .github
    .raw_job_logs(
Member

So, I'm wondering if we actually have to make the request from triagebot. Could the page we serve use JS to request directly from GitHub, using the user's own GitHub credentials? The plain logs appear to be served with CORS: * -- though I'm not sure if the redirect to them is.

I seem to recall we had even built (or found?) such a page at some point...

Member Author

How would we get the user's GitHub credentials? By using a GitHub App or OAuth app for triagebot?

Contributor

Doing GH login in triagebot is not trivial and it has been rejected before for the back-office review work, so I don't suppose that's what Mark meant (?). I guess what we could do is just fetch the plain logs URL from the user's browser.

However, it doesn't seem to work? I tried this:

<html>
    <head>
    </head>
    <body>
        <div id="body"></div>
        <script>
            async function load() {
                const data = await fetch("https://github.com/rust-lang/rust/commit/d04c7466850b10b56c28211ccb880ae34915d431/checks/44186934959/logs");
                // fetch() resolves to a Response; the body has to be read as text first
                document.getElementById("body").innerHTML = await data.text();
            }
            load();
        </script>
    </body>
</html>

and got this:
[screenshot of the resulting error]

Contributor

You can get a CORS: * URL by visiting the original logs URL, but it is specific to each user and temporary (it lasts something like 5 minutes, IIRC). And you can only get it after you visit the original URL in your browser, not via CORS. So I think that we'll have to download it through the API.

Member Author

I see. Just tried with the API, and it leads to a 403 about being unauthenticated:

{
  "message": "Must have admin rights to Repository.",
  "documentation_url": "https://docs.github.com/rest/actions/workflow-jobs#download-job-logs-for-a-workflow-run",
  "status": "403"
}

Member

Ok, yeah, sounds like the final link is available - in theory, that means we could still leave the download to users, right? I.e., on the triagebot side we cache, with a ~1 minute TTL, the resolution of the (stable) API URL to the short-lived plaintext logs link, and then have the user's browser download that and render it as needed. That way we don't have triagebot in the critical path for the big object, and the cache entries are much smaller, etc.

That said, I personally suspect that we might plausibly want to keep the rendering on the server; I'm not sure how well browsers will cope with megabytes of HTML plus JavaScript rendering all the ANSI codes... it might be better to strip them on the server side.
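If server-side stripping ever becomes necessary, a minimal sketch (handling only SGR color sequences, using the regex crate) could be:

use regex::Regex;

// Strip ANSI SGR color sequences (ESC [ ... m) from the raw logs before
// serving them. Real logs can also contain other escape sequences (cursor
// movement, OSC titles, ...) that this does not handle.
fn strip_ansi_colors(logs: &str) -> String {
    // \x1b is ESC; SGR sequences look like "\x1b[1;31m".
    let re = Regex::new(r"\x1b\[[0-9;]*m").unwrap();
    re.replace_all(logs, "").into_owned()
}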

Member

So let's not change anything here but keep it in mind if we run into issues.

Contributor

I don't think there's an API endpoint for figuring out the temporary URL though.

@Urgau (Member Author) Jun 17, 2025

> That said, I personally suspect that we might plausibly want to keep the rendering on the server; I'm not sure how well browsers will cope with megabytes of HTML plus JavaScript rendering all the ANSI codes... it might be better to strip them on the server side.

I have looked at our longest CI jobs (around 5-6 MB of raw logs) and Firefox renders them surprisingly well (in less than 1 s). The longest part is downloading the raw logs, not rendering them.

> I don't think there's an API endpoint for figuring out the temporary URL though.

It's the same API, but instead of following the redirect we need to look at the Location: header of the GitHub API response rather than following it to its destination.
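A rough sketch of that approach with reqwest (the endpoint is the documented job-logs API; token handling and error cases are simplified placeholders):

use reqwest::{header::LOCATION, redirect::Policy, Client};

// Call the job-logs API with redirects disabled and read the short-lived
// plaintext URL out of the Location header instead of downloading the body.
async fn temporary_logs_url(
    owner: &str,
    repo: &str,
    job_id: u64,
    token: &str,
) -> reqwest::Result<Option<String>> {
    let client = Client::builder().redirect(Policy::none()).build()?;
    let url = format!("https://api.github.com/repos/{owner}/{repo}/actions/jobs/{job_id}/logs");
    let resp = client
        .get(url)
        .header("Authorization", format!("Bearer {token}"))
        .header("User-Agent", "triagebot-sketch")
        .send()
        .await?;
    // GitHub answers the authenticated call with a 302 to a temporary,
    // pre-signed download URL.
    Ok(resp
        .headers()
        .get(LOCATION)
        .and_then(|v| v.to_str().ok())
        .map(str::to_owned))
}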

@Urgau Urgau force-pushed the gha_logs_ansi_up branch from 68996b0 to 93efa64 on June 16, 2025 18:40
/**
* Bundled by jsDelivr using Rollup v2.79.2 and Terser v5.39.0.
* Original file: /npm/[email protected]/ansi_up.js
*/
Member

What's the license on this? Can we get it copy/pasted into this comment?

Member Author

Oh yeah, I forgot about the license. It's the MIT license. Included it.

@Urgau Urgau force-pushed the gha_logs_ansi_up branch from 93efa64 to 0201040 on June 17, 2025 05:39