Add enhanced GitHub Actions raw logs viewer #2076
Conversation
This is a really wonderful idea ❤️ And I can't wait to use it! I can take care of the RLA integration once/if this is merged.
Ok(res)
}

async fn process_logs(
Could we add some cache for this in `Context`? Maybe an LRU hashmap or something like that? The logs can be large and we probably don't want to spam the GH API when the page is refreshed or a couple of people go visit a link when a job fails.
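For illustration, a minimal sketch of what such a cache could look like (hedged: the `lru` crate, the type, and the capacity are assumptions, not the actual triagebot code):

```rust
use std::num::NonZeroUsize;
use std::sync::Mutex;

use lru::LruCache; // assumed dependency: the `lru` crate

// Hypothetical cache that could live in triagebot's shared `Context`.
pub struct GhaLogsCache {
    // Keyed by log id; keeps only the most recently used entries.
    inner: Mutex<LruCache<String, String>>,
}

impl GhaLogsCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            inner: Mutex::new(LruCache::new(NonZeroUsize::new(capacity).unwrap())),
        }
    }

    pub fn get(&self, log_id: &str) -> Option<String> {
        // `LruCache::get` bumps the entry to most-recently-used.
        self.inner.lock().unwrap().get(log_id).cloned()
    }

    pub fn put(&self, log_id: String, logs: String) {
        // Evicts the least-recently-used entry once capacity is reached.
        self.inner.lock().unwrap().put(log_id, logs);
    }
}
```

With something like `cache.get(&log_uuid)` checked before hitting the GitHub API, page refreshes and link-sharing bursts would mostly be served from memory.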
I thought of it, but given the size of the logs (some of them reach 3-4 MB each) I'm worried we'll have a high cache-miss rate due to the limited number of logs we'll be able to cache.
I also don't know how much we can safely cache; I don't know how much RAM the triagebot machine has left. Or maybe we'll want to have the cache on disk?
I wouldn't go for a disk cache. I think that even storing a Vec of ~10-100 last results would be enough. Logs are usually accessed only for a brief period of time after CI fails, when a bunch of people may get notified and they go check the logs. After a few days, people usually don't care about the log anymore. So remembering the last N logs to have a history of at least a few hours would help to avoid needless API requests, IMO.
Not sure about RAM size, but storing e.g. 20 results should hopefully be fine. @Mark-Simulacrum should know the machine specs, I think.
Okay, added a small 50 MB in-RAM cache for the raw logs.
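Roughly the idea, as a hedged sketch (names are hypothetical; the actual implementation in this PR may differ): a FIFO keyed by log id that evicts the oldest entries once the total stored bytes exceed the budget.

```rust
use std::collections::VecDeque;

const MAX_CACHE_BYTES: usize = 50 * 1024 * 1024; // 50 MB budget for raw logs

// Hypothetical byte-bounded cache: instead of capping the entry count,
// evict the oldest logs until the total size fits the budget again.
struct LogsCache {
    entries: VecDeque<(String, String)>, // (log id, raw logs)
    total_bytes: usize,
}

impl LogsCache {
    fn new() -> Self {
        Self { entries: VecDeque::new(), total_bytes: 0 }
    }

    fn insert(&mut self, id: String, logs: String) {
        self.total_bytes += logs.len();
        self.entries.push_back((id, logs));
        // A single 5 MB log evicts many small ones; even a burst of big
        // logs keeps memory under the 50 MB ceiling.
        while self.total_bytes > MAX_CACHE_BYTES {
            match self.entries.pop_front() {
                Some((_, evicted)) => self.total_bytes -= evicted.len(),
                None => break,
            }
        }
    }

    fn get(&self, id: &str) -> Option<&str> {
        self.entries
            .iter()
            .find(|(key, _)| key == id)
            .map(|(_, logs)| logs.as_str())
    }
}
```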
We currently allocate 1/4 of a CPU and 0.5 GB of memory: https://github.com/rust-lang/simpleinfra/blob/bcba51d544106fba945c904bf213d64a7bb7a473/terraform/shared/services/triagebot/main.tf#L47-L48
Looks like most of the memory is currently unused:
[screenshot: memory usage graph]
I think a small cache isn't bad, but I'd probably suggest that if we want this in the long run it should be backed by S3 (with a, say, 15-day TTL) -- that's pretty cheap, avoids worrying about how much space it uses, and hitting S3 rate limits is basically impossible. If we did that we could also serve the logs from a CDN rather than through triagebot, which also seems nice.
We could also just host the ansi_up file ourselves through triagebot; it's relatively small (~14 KiB).
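If the S3 route were taken, the write path could be as small as this hedged sketch (assuming the `aws-sdk-s3` crate; the bucket name is made up, and the 15-day expiry would live in a lifecycle rule on the bucket rather than in code):

```rust
use aws_sdk_s3::primitives::ByteStream;

// Hypothetical helper: store logs in S3 so they can be served from a CDN
// instead of through triagebot. A lifecycle rule on the bucket
// (e.g. expire objects after 15 days) handles cleanup automatically.
async fn cache_logs_in_s3(
    client: &aws_sdk_s3::Client,
    log_id: &str,
    logs: &str,
) -> Result<(), aws_sdk_s3::Error> {
    client
        .put_object()
        .bucket("triagebot-gha-logs") // hypothetical bucket name
        .key(format!("gha-logs/{log_id}"))
        .content_type("text/plain; charset=utf-8")
        .body(ByteStream::from(logs.as_bytes().to_vec()))
        .send()
        .await?;
    Ok(())
}
```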
Force-pushed from 32638f3 to 3242c4f.
Impl looks good, but I'd like to hear from Mark or someone else from t-infra about potential security concerns.
tracing::info!("gha_logs: cache miss for {log_uuid}");
let logs = ctx
    .github
    .raw_job_logs(
So, I'm wondering if we actually have to make the request from triagebot. Could the page we serve use JS to request directly from GitHub, using the user's own GitHub credentials? The plain logs appear to be served with `CORS: *` -- though not sure if the redirect to them is.
I seem to recall we had even built (or found?) such a page at some point...
How would we get the user's GitHub credentials? By using a GitHub App or OAuth app for triagebot?
Doing GH login in triagebot is not trivial and it has been rejected before for the back-office review work, so I don't suppose that's what Mark meant (?). I guess what we could do is just fetch the plain logs URL from the user's browser.
However, it doesn't seem to work? I tried this:
<html>
<head>
</head>
<body>
  <div id="body"></div>
  <script>
    async function load() {
      const response = await fetch("https://github.com/rust-lang/rust/commit/d04c7466850b10b56c28211ccb880ae34915d431/checks/44186934959/logs");
      // Read the response body out as text before inserting it into the page.
      document.getElementById("body").innerHTML = await response.text();
    }
    load();
  </script>
</body>
</html>
You can get a `CORS: *` URL by visiting the original logs URL, but it is specific to each user and temporary (it lasts like 5 minutes, IIRC). And you can only get it after you visit the original URL in your browser, not via CORS. So I think that we'll have to download it through the API.
I see. Just tried with the API, and it leads to a 403 about being unauthenticated:
{
  "message": "Must have admin rights to Repository.",
  "documentation_url": "https://docs.github.com/rest/actions/workflow-jobs#download-job-logs-for-a-workflow-run",
  "status": "403"
}
Ok, yeah, sounds like the final link is available - in theory, that means we could still leave the download to users, right? I.e., on the triagebot side we cache (with a ~1 minute TTL) the resolution of the stable API URL to the short-lived plaintext logs link, and then have the user's browser download that and render it as needed. That way we don't have triagebot in the critical path for the big object, and cache entries are much smaller, etc.
That said, I personally suspect that we might want to keep the rendering on the server; I'm not sure how well browsers will cope with megabytes of HTML + JavaScript rendering all the ANSI codes... it might be better to strip them on the server side.
So let's not change anything here but keep it in mind if we run into issues.
I don't think there's an API endpoint for figuring out the temporary URL though.
> That said, I personally suspect that we might want to keep the rendering on the server; I'm not sure how well browsers will cope with megabytes of HTML + JavaScript rendering all the ANSI codes... it might be better to strip them on the server side.

I have looked at our longest CI jobs (around 5-6 MB of raw logs) and Firefox renders them surprisingly well (in less than 1s). The slowest part is downloading the raw logs, not rendering them.
> I don't think there's an API endpoint for figuring out the temporary URL though.

It's the same API, but instead of following the redirect we need to look at the `Location:` header of the GitHub API response rather than following it to its destination.
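For illustration, roughly how that could look with `reqwest` (a hedged sketch: the client setup and function are hypothetical, not triagebot's actual HTTP layer; the endpoint path is the documented jobs-logs API linked above):

```rust
// Sketch: ask the GitHub API for the job logs but stop at the redirect,
// so we only capture the short-lived plaintext logs URL.
async fn resolve_logs_url(
    token: &str,
    owner: &str,
    repo: &str,
    job_id: u64,
) -> Result<Option<String>, reqwest::Error> {
    let client = reqwest::Client::builder()
        .redirect(reqwest::redirect::Policy::none()) // do NOT follow the redirect
        .build()?;
    let resp = client
        .get(format!(
            "https://api.github.com/repos/{owner}/{repo}/actions/jobs/{job_id}/logs"
        ))
        .header("Authorization", format!("Bearer {token}"))
        .header("User-Agent", "triagebot-sketch") // GitHub requires a User-Agent
        .send()
        .await?;
    // GitHub answers with a redirect whose `Location:` header holds the
    // temporary storage URL.
    Ok(resp
        .headers()
        .get(reqwest::header::LOCATION)
        .and_then(|v| v.to_str().ok())
        .map(String::from))
}
```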
/**
 * Bundled by jsDelivr using Rollup v2.79.2 and Terser v5.39.0.
 * Original file: /npm/ansi_up@<version>/ansi_up.js
 */
What's the license on this? Can we get it copy/pasted into this comment?
Oh yeah, forgot about the license. It's the MIT license. Included it.
This PR adds an enhanced (as in colored) GitHub Actions raw logs viewer.

This is useful for us as the GitHub Actions logs viewer is just unusable in rust-lang/rust given the size of our logs (it loads indefinitely), and the raw logs are full of ANSI sequences cluttering the output.

This works by adding a new endpoint `/gha-logs/:owner/:repo/:log-id` (restricted to team repos) that fetches the raw logs from the GitHub REST API and embeds them directly in the HTML, which is then processed by the `ansi_up` JavaScript library (loaded from the jsdelivr.com CDN).

To prevent XSS injection (or just bugs in `ansi_up`) I added the `Content-Security-Policy` header with a restriction in `script-src`. I also tried adding integrity for `ansi_up`, but jsDelivr doesn't guarantee the same content for minified libraries.

After merging, we should probably adjust rust-log-analyzer to link to this new endpoint instead of to GitHub, so people can use it directly.
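For reference, a hedged sketch of attaching such a CSP header to the generated page (using the `http` crate's response builder; the exact policy string and response plumbing in the PR may differ):

```rust
// Sketch: attach a Content-Security-Policy header to the logs page so
// only the expected script source can run.
fn logs_response(html: String) -> http::Response<String> {
    http::Response::builder()
        .status(http::StatusCode::OK)
        .header("Content-Type", "text/html; charset=utf-8")
        .header(
            "Content-Security-Policy",
            // Illustrative policy: scripts only from the jsDelivr CDN
            // (where ansi_up is loaded from); everything else is denied
            // by the 'none' default.
            "default-src 'none'; script-src https://cdn.jsdelivr.net; style-src 'unsafe-inline'",
        )
        .body(html)
        .expect("valid response")
}
```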
cc @Mark-Simulacrum