Feat - implement queue #2081

Draft · Jamesbarford wants to merge 4 commits into master

Conversation

Jamesbarford (Contributor)

This is a bit prototype-y at the moment; however, I thought it wise to get some early feedback, as the main ideas are here.

I do have some questions that I thought of while implementing the queue, namely:

  • Do we want to add a retry column? As it is theoretically possible for a machine to keep picking up a job that it failed to do
  • How to handle a "release" build, which I've hacked together an endpoint for
  • How to allocate a machine uuid
  • How to let a machine know what architecture it is on (could do a shell uname -a or something similar)?

I don't think the SQLite implementation would work at all, as the design assumes a centralised database or database cluster that the whole system speaks to. I could be wrong, but the SQLite implementation would create a version of the database local to each machine (presently 2), and thus would not work. With that in mind, I've not implemented anything to handle concurrent access to SQLite. The implementation, presently, is only useful for local development and perhaps tests?


@Kobzol (Contributor) left a comment


I left some mostly arbitrary review remarks, but to have a proper review, we should split this PR into more atomic changes. The overall structure looks fine; I think it corresponds to what we have discussed before.

I would start with:

  1. Implementing the basic DB representation and the website cron job.
  2. Adding some visualization to the status page, so that we can observe the status of the queue and confirm that it is being correctly filled.
  3. Working on the collector side of things: start with a basic implementation, then add retrying and error handling.

To answer your questions more explicitly:

  • Do we want to add a retry column? As it is theoretically possible for a machine to keep picking up a job that it failed to do

The first thing is that we have to be able to detect a situation where some job is being recomputed. That shouldn't be that hard; if a collector starts and sees something in the DB marked as "in progress" with its machine ID, it should assume that it has failed before. In that case, I think that a retry could be useful. Usually, the errors are not transient, but it could happen sometimes. What we should support eventually is detecting that an invalid SHA has been pushed to the queue (that can happen e.g. when we super rarely have to force push master or bors somehow messes up), and either somehow mark the job as invalid (which we could see on the status webpage) or just remove it from the queue completely. Removing it could work, because the invalid SHA should eventually disappear from the master commit list, in which case it won't be re-added to the queue later by the site.
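
A rough sketch of what that startup check could look like (all names here are hypothetical, not part of this PR):

const MAX_RETRIES: u32 = 3;

// On collector startup: anything still "in progress" under our machine ID
// must be a job we crashed on last time.
async fn recover_stale_job(conn: &dyn Connection, machine_id: &str) {
    if let Some(job) = conn.find_in_progress_job(machine_id).await {
        if job.retries >= MAX_RETRIES {
            // Give up and surface the failure on the status page instead
            // of picking the job up forever.
            conn.mark_commit_job_failed(&job.sha).await;
        } else {
            // Put the job back into the queue with a bumped retry counter.
            conn.requeue_commit_job(&job.sha, job.retries + 1).await;
        }
    }
}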

  • How to handle a "release" build, which I've hacked together an endpoint for

As noted in the review comments, the site should take care of pushing new releases to the queue. Ideally, the collector shouldn't concern itself with special-casing releases (it might download them in a different way, but it should still resolve these jobs from the queue, the same as for master and try runs).

  • How to allocate a machine uuid

I wouldn't complicate this and just require the user to pass it explicitly to the BenchQueue command. We will then have some light automation (e.g. Ansible) to sync the collector commands amongst the few machines that we will have.

  • How to let a machine know what architecture it is on (could do a shell uname -a or something similar)?

Again, I would pass this as an input argument. We could have a sanity check where the collector runs rustc -vV on the downloaded compiler and checks that its host: value corresponds to the target. Or we can just deduce the target automatically from rustc and avoid passing it, but I would appreciate having the double check in the collector, just in case.
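
A minimal sketch of that double check, assuming the collector knows the path to the downloaded rustc and gets the target string on the command line:

use std::process::Command;

/// Run `rustc -vV` and verify that its `host:` line matches the target this
/// collector was started with (sketch; error handling via anyhow assumed).
fn check_rustc_host(rustc: &std::path::Path, expected_target: &str) -> anyhow::Result<()> {
    let out = Command::new(rustc).arg("-vV").output()?;
    let stdout = String::from_utf8(out.stdout)?;
    let host = stdout
        .lines()
        .find_map(|line| line.strip_prefix("host: "))
        .ok_or_else(|| anyhow::anyhow!("no `host:` line in rustc -vV output"))?;
    anyhow::ensure!(
        host == expected_target,
        "downloaded rustc targets `{host}`, but the collector was given `{expected_target}`"
    );
    Ok(())
}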

Regarding SQLite, I agree that it doesn't make sense to use it for the queue. I would just ignore it for now. As long as you don't explicitly use BenchQueue, it should be fine, although the website should either not run the cron job when executed locally, or it should not fail horribly when the logic isn't implemented in SQLite. We could either make the queue opt-in with some environment variable or CLI flag of the site, or add something like fn supports_queue(&self) -> bool to the DB pool implementations.
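
One possible shape for that escape hatch (a sketch; whether it lives on the pool or the connection trait is open):

pub trait Connection: Send + Sync {
    // ...existing methods...

    /// Whether this backend implements the benchmark queue; the site's cron
    /// job would skip queue maintenance when this returns false.
    fn supports_queue(&self) -> bool {
        false
    }
}

// Postgres opts in; SQLite keeps the default and the queue stays disabled.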

@@ -152,6 +152,16 @@ impl FromStr for CommitType {
}
}

impl ToString for CommitType {

This is not super important, but usually it's idiomatic to implement Display, which is more general, and then get an implementation of ToString "for free" (there is a blanket impl of ToString for types that implement Display).
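
For illustration, assuming CommitType has Try and Master variants matching the FromStr impl above:

use std::fmt;

impl fmt::Display for CommitType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = match self {
            CommitType::Try => "try",
            CommitType::Master => "master",
        };
        f.write_str(s)
    }
}

// `ToString` now comes for free via the blanket impl:
// assert_eq!(CommitType::Master.to_string(), "master");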

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CommitJob {
pub sha: String,
pub parent_sha: Option<String>,

If we keep the invariant that the parent SHA is always present when we insert things into the queue, then this shouldn't be nullable.

@@ -581,6 +581,21 @@ enum Commands {
self_profile: SelfProfileOption,
},

/// Benchmarks commits from the queue sequentially

We should pass the machine ID and the target as required command-line parameters, rather than somehow inferring them from the environment (primarily the ID, which should just be hardcoded; the target we could in theory infer, or we could check that the rustc host target corresponds to the passed target, just to be sure).
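
Something along these lines, assuming a recent clap derive setup (flag names hypothetical):

/// Benchmarks commits from the queue
BenchFromQueue {
    /// Stable identifier of this collector machine.
    #[arg(long)]
    machine_id: String,

    /// Target triple this machine benchmarks, e.g. `x86_64-unknown-linux-gnu`.
    #[arg(long)]
    target: String,
},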

@@ -581,6 +581,21 @@ enum Commands {
self_profile: SelfProfileOption,
},

/// Benchmarks commits from the queue sequentially
BenchFromQueue {

Unless we implement some form of self-patching, the command should probably behave in the same way as BenchNext, and only run a single benchmark before exiting, so that we can periodically update the binary from GitHub. It does behave like this now; it's just that the word "sequentially" kind of hints that it maybe runs a loop.

log_db(&db);
println!("processing artifacts");
let client = reqwest::blocking::Client::new();
/* @Queue - How do we do this per architecture if it's not queued?

The collector shouldn't be concerned with getting released artifacts in any special way. The website's cron job should insert all kinds of artifacts (master/try/released) into the queue, and the queue table should be prepared for representing all three kinds of artifacts.

);

if let Some(sha) = commit_sha {
let _ = std::panic::catch_unwind(async || {

Maybe to clarify: the catch_unwind was in the previous code only because we had to trigger the onpush endpoint in all cases, even when a panic happened, otherwise the site could behave weirdly. With the queuing approach, we don't trigger that endpoint anymore, so we shouldn't need the catch_unwind calls.

}
});
// We need to send a message to this endpoint even if the collector panics
client.post(format!("{}/perf/onpush", site_url)).send()?;

We shouldn't hit this endpoint from the queuing logic yet, because it could break the website if we hit the endpoint both from BenchNext and BenchQueue. I guess that only the benchmark logic could be extracted into a separate function, which will be called both by BenchNext and BenchQueue, but only BenchNext will trigger the endpoint.
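
A sketch of that split (function names and signatures are illustrative):

// Shared benchmarking logic, used by both commands.
fn bench_commit(job: &CommitJob) -> anyhow::Result<()> {
    // ...run the benchmarks for `job`...
    Ok(())
}

// Legacy path: always notifies the site afterwards.
fn run_bench_next(job: &CommitJob, site_url: &str) -> anyhow::Result<()> {
    let result = bench_commit(job);
    let client = reqwest::blocking::Client::new();
    client.post(format!("{site_url}/perf/onpush")).send()?;
    result
}

// Queue path: the DB row carries the job state, so no endpoint is hit.
fn run_bench_from_queue(job: &CommitJob) -> anyhow::Result<()> {
    bench_commit(job)
}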

pub sha: String,
pub parent_sha: Option<String>,
pub commit_type: CommitType,
pub pr: u32,

Since we don't implement any automatic from/to SQL row conversions and we build these structs manually anyway, I would appreciate if they were more "domain-driven" and didn't allow representing invalid states.

So I would model it such that the job status looks something like this:

pub enum CommitJobStatus {
    Queued,
    InProgress { started_at: DateTime<Utc> },
    Finished { started_at: DateTime<Utc>, finished_at: DateTime<Utc> },
}

The artifact data looks something like this:

enum CommitType {
    Try { pr: u32 },
    Master { pr: u32 },
    Release { label: String },
}

and so on, so that on the Rust side, we can work more easily with these structs and avoid invalid states. Then in the DB layer, we'll just translate the domain structs into the corresponding atomic SQL attributes.
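
For example, the status enum could flatten into three columns when writing a row, with the inverse mapping on reads (column names hypothetical):

use chrono::{DateTime, Utc};

/// Translate the domain status into (status, started_at, finished_at)
/// attributes for the DB layer.
fn status_columns(
    status: &CommitJobStatus,
) -> (&'static str, Option<DateTime<Utc>>, Option<DateTime<Utc>>) {
    match status {
        CommitJobStatus::Queued => ("queued", None, None),
        CommitJobStatus::InProgress { started_at } => ("in_progress", Some(*started_at), None),
        CommitJobStatus::Finished { started_at, finished_at } => {
            ("finished", Some(*started_at), Some(*finished_at))
        }
    }
}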

@@ -178,6 +179,15 @@ pub trait Connection: Send + Sync {

/// Removes all data associated with the given artifact.
async fn purge_artifact(&self, aid: &ArtifactId);

/// Add jobs to the queue
async fn enqueue_commit_job(&self, target: Target, jobs: &[CommitJob]);

Isn't the target already a part of CommitJob?

async fn enqueue_commit_job(&self, target: Target, jobs: &[CommitJob]);

/// Dequeue jobs
async fn dequeue_commit_job(&self, machine_id: &str, target: Target) -> Option<CommitJob>;

This should probably also set the commit job to be in progress, so maybe it should be named something like "start_commit_job"? Or something like that.
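
Roughly (a sketch of the suggested rename):

/// Atomically picks the next queued job for `target`, marks it as in
/// progress for `machine_id`, and returns it.
async fn start_commit_job(&self, machine_id: &str, target: Target) -> Option<CommitJob>;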
