Feat - implement queue #2081

Draft · Jamesbarford wants to merge 4 commits into master

Conversation

Jamesbarford (Contributor)

This is a bit prototype-y at the moment; however, I thought it wise to get some early feedback, as the main ideas are here.

I do have some questions that I thought of while implementing the queue, namely:

  • Do we want to add a retry column? As it is theoretically possible for a machine to keep picking up a job that it failed to do
  • How to handle a "release" build, which I've hacked together an endpoint for
  • How to allocate a machine uuid
  • How to let a machine know what architecture it is on (could do a shell uname -a or something similar)?

I don't think the SQLite implementation would work at all, as the design assumes a centralised database or database cluster that the whole system speaks to. I could be wrong, but the SQLite implementation would create a version of the database local to each machine (presently 2), and thus would not work. With that in mind, I've not implemented anything to handle concurrent access to SQLite. The implementation, presently, is only useful for local development and perhaps tests?


@Kobzol (Contributor) left a comment


I left some mostly arbitrary review remarks, but to have a proper review, we should split this PR into more atomic changes. The overall structure looks fine; I think it corresponds to what we have discussed before.

I would start with:

  1. Implementing the basic DB representation and the website cron job.
  2. Adding some visualization to the status page, so that we can observe the status of the queue and confirm that it is being correctly filled.
  3. Working on the collector side of things: start with a basic implementation, then add retrying and error handling.

To answer your questions more explicitly:

  • Do we want to add a retry column? As it is theoretically possible for a machine to keep picking up a job that it failed to do

The first thing is that we have to be able to detect a situation where some job is being recomputed. That shouldn't be that hard; if a collector starts and sees something in the DB marked as "in progress" with its machine ID, it should assume that it has failed before. In that case, I think that a retry could be useful. Usually, the errors are not transient, but it could happen sometimes. What we should support eventually is detecting that an invalid SHA has been pushed to the queue (that can happen e.g. when we super rarely have to force push master or bors somehow messes up), and either somehow mark the job as invalid (which we could see on the status webpage) or just remove it from the queue completely. Removing it could work, because the invalid SHA should eventually disappear from the master commit list, in which case it won't be re-added to the queue later by the site.
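
A rough sketch of what that startup check could look like (all names here are hypothetical, not part of this PR):

const MAX_RETRIES: u32 = 3;

// On collector startup: anything still "in progress" under our machine ID
// must be a job we crashed on last time.
async fn recover_stale_job(conn: &dyn Connection, machine_id: &str) {
    if let Some(job) = conn.find_in_progress_job(machine_id).await {
        if job.retries >= MAX_RETRIES {
            // Give up and surface the failure on the status page instead
            // of picking the job up forever.
            conn.mark_commit_job_failed(&job.sha).await;
        } else {
            // Put the job back into the queue with a bumped retry counter.
            conn.requeue_commit_job(&job.sha, job.retries + 1).await;
        }
    }
}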

  • How to handle a "release" build, which I've hacked together an endpoint for

As noted in the review comments, the site should take care of pushing new releases to the queue. Ideally, the collector shouldn't concern itself with special-casing releases (it might download them in a different way, but it should still resolve these jobs from the queue, the same as for master and try runs).

  • How to allocate a machine uuid

I wouldn't complicate this and just require the user to pass it explicitly to the BenchQueue command. We will then have some light automation (e.g. Ansible) to sync the collector commands amongst the few machines that we will have.

  • How to let a machine know what architecture it is on (could do a shell uname -a or something similar)?

Again, I would pass this as an input argument. We could have a sanity check where the collector runs rustc -vV on the downloaded compiler and checks that its host: value corresponds to the target. Or we can just deduce the target automatically from rustc and avoid passing it, but I would appreciate having the double check in the collector, just in case.
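
A minimal sketch of that double check, assuming the collector knows the path to the downloaded rustc and gets the target string on the command line:

use std::process::Command;

/// Run `rustc -vV` and verify that its `host:` line matches the target this
/// collector was started with (sketch; error handling via anyhow assumed).
fn check_rustc_host(rustc: &std::path::Path, expected_target: &str) -> anyhow::Result<()> {
    let out = Command::new(rustc).arg("-vV").output()?;
    let stdout = String::from_utf8(out.stdout)?;
    let host = stdout
        .lines()
        .find_map(|line| line.strip_prefix("host: "))
        .ok_or_else(|| anyhow::anyhow!("no `host:` line in rustc -vV output"))?;
    anyhow::ensure!(
        host == expected_target,
        "downloaded rustc targets `{host}`, but the collector was given `{expected_target}`"
    );
    Ok(())
}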

Regarding SQLite, I agree that it doesn't make sense to use it for the queue. I would just ignore it for now. As long as you don't explicitly use BenchQueue, it should be fine, although the website should either not run the cron job when executed locally, or it should not fail horribly when the logic isn't implemented in SQLite. We could either make the queue opt-in with some environment variable or CLI flag of the site, or add something like fn supports_queue(&self) -> bool to the DB pool implementations.
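
One possible shape for that escape hatch (a sketch; whether it lives on the pool or the connection trait is open):

pub trait Connection: Send + Sync {
    // ...existing methods...

    /// Whether this backend implements the benchmark queue; the site's cron
    /// job would skip queue maintenance when this returns false.
    fn supports_queue(&self) -> bool {
        false
    }
}

// Postgres opts in; SQLite keeps the default and the queue stays disabled.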

@@ -152,6 +152,16 @@ impl FromStr for CommitType {
}
}

impl ToString for CommitType {

This is not super important, but usually it's idiomatic to implement Display, which is more general, and then get an implementation of ToString "for free" (there is a blanket impl of ToString for types that implement Display).
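
For illustration, assuming CommitType has Try and Master variants matching the FromStr impl above:

use std::fmt;

impl fmt::Display for CommitType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = match self {
            CommitType::Try => "try",
            CommitType::Master => "master",
        };
        f.write_str(s)
    }
}

// `ToString` now comes for free via the blanket impl:
// assert_eq!(CommitType::Master.to_string(), "master");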

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CommitJob {
pub sha: String,
pub parent_sha: Option<String>,

If we keep the invariant that the parent SHA is always present when we insert things into the queue, then this shouldn't be nullable.

@@ -581,6 +581,21 @@ enum Commands {
self_profile: SelfProfileOption,
},

/// Benchmarks commits from the queue sequentially

We should pass the machine ID and the target as required command-line parameters, rather than somehow inferring them from the environment (primarily the ID, which should just be hardcoded; the target we could in theory infer, or we could check that the rustc host target corresponds to the passed target, just to be sure).
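
Something along these lines, assuming a recent clap derive setup (flag names hypothetical):

/// Benchmarks commits from the queue
BenchFromQueue {
    /// Stable identifier of this collector machine.
    #[arg(long)]
    machine_id: String,

    /// Target triple this machine benchmarks, e.g. `x86_64-unknown-linux-gnu`.
    #[arg(long)]
    target: String,
},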

@@ -581,6 +581,21 @@ enum Commands {
self_profile: SelfProfileOption,
},

/// Benchmarks commits from the queue sequentially
BenchFromQueue {

Unless we implement some form of self-patching, the command should probably behave in the same way as BenchNext, and only run a single benchmark before exiting, so that we can periodically update the binary from GitHub. It does behave like this now; it's just that the word "sequentially" kind of hints that it maybe runs a loop.

log_db(&db);
println!("processing artifacts");
let client = reqwest::blocking::Client::new();
/* @Queue - How do we do this per architecture if it's not queued?

The collector shouldn't be concerned with getting released artifacts in any special way. The website's cron job should insert all kinds of artifacts (master/try/released) into the queue, and the queue table should be prepared for representing all three kinds of artifacts.

);

if let Some(sha) = commit_sha {
let _ = std::panic::catch_unwind(async || {

Maybe to clarify: the catch_unwind was in the previous code only because we had to trigger the onpush endpoint in all cases, even when a panic happened, otherwise the site could behave weirdly. With the queuing approach, we don't trigger that endpoint anymore, so we shouldn't need the catch_unwind calls.

}
});
// We need to send a message to this endpoint even if the collector panics
client.post(format!("{}/perf/onpush", site_url)).send()?;

We shouldn't hit this endpoint from the queuing logic yet, because it could break the website if we hit the endpoint both from BenchNext and BenchQueue. I guess that only the benchmark logic could be extracted into a separate function, which will be called both by BenchNext and BenchQueue, but only BenchNext will trigger the endpoint.
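
A sketch of that split (function names and signatures are illustrative):

// Shared benchmarking logic, used by both commands.
fn bench_commit(job: &CommitJob) -> anyhow::Result<()> {
    // ...run the benchmarks for `job`...
    Ok(())
}

// Legacy path: always notifies the site afterwards.
fn run_bench_next(job: &CommitJob, site_url: &str) -> anyhow::Result<()> {
    let result = bench_commit(job);
    let client = reqwest::blocking::Client::new();
    client.post(format!("{site_url}/perf/onpush")).send()?;
    result
}

// Queue path: the DB row carries the job state, so no endpoint is hit.
fn run_bench_from_queue(job: &CommitJob) -> anyhow::Result<()> {
    bench_commit(job)
}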

pub sha: String,
pub parent_sha: Option<String>,
pub commit_type: CommitType,
pub pr: u32,

Since we don't implement any automatic from/to SQL row conversions and we build these structs manually anyway, I would appreciate if they were more "domain-driven" and didn't allow representing invalid states.

So I would model it such that the job status looks something like this:

pub enum CommitJobStatus {
    Queued,
    InProgress { started_at: DateTime<Utc> },
    Finished { started_at: DateTime<Utc>, finished_at: DateTime<Utc> },
}

The artifact data looks something like this:

enum CommitType {
    Try { pr: u32 },
    Master { pr: u32 },
    Release { label: String },
}

and so on, so that on the Rust side, we can work more easily with these structs and avoid invalid states. Then in the DB layer, we'll just translate the domain structs into the corresponding atomic SQL attributes.
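
For example, the status enum could flatten into three columns when writing a row, with the inverse mapping on reads (column names hypothetical):

use chrono::{DateTime, Utc};

/// Translate the domain status into (status, started_at, finished_at)
/// attributes for the DB layer.
fn status_columns(
    status: &CommitJobStatus,
) -> (&'static str, Option<DateTime<Utc>>, Option<DateTime<Utc>>) {
    match status {
        CommitJobStatus::Queued => ("queued", None, None),
        CommitJobStatus::InProgress { started_at } => ("in_progress", Some(*started_at), None),
        CommitJobStatus::Finished { started_at, finished_at } => {
            ("finished", Some(*started_at), Some(*finished_at))
        }
    }
}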

@@ -178,6 +179,15 @@ pub trait Connection: Send + Sync {

/// Removes all data associated with the given artifact.
async fn purge_artifact(&self, aid: &ArtifactId);

/// Add jobs to the queue
async fn enqueue_commit_job(&self, target: Target, jobs: &[CommitJob]);

Isn't the target already a part of CommitJob?

async fn enqueue_commit_job(&self, target: Target, jobs: &[CommitJob]);

/// Dequeue jobs
async fn dequeue_commit_job(&self, machine_id: &str, target: Target) -> Option<CommitJob>;

This should probably also set the commit job to be in progress, so maybe it should be named something like "start_commit_job"? Or something like that.
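
Roughly (a sketch of the suggested rename):

/// Atomically picks the next queued job for `target`, marks it as in
/// progress for `machine_id`, and returns it.
async fn start_commit_job(&self, machine_id: &str, target: Target) -> Option<CommitJob>;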
