-
Notifications
You must be signed in to change notification settings - Fork 16
Let's rework PAPR's architecture! #62
Comments
👍
This sound a lot like a a message+worker system (EG: celery). Homu has a queue built in, but if the requirement is within PAPR itself to be able to handle it's own testing while keeping Homu strictly merge work reusing an established message+worker system would be close to trivial.
I'm not against it, other than the need to hold our own Homu fork as we'd need to rebase and modify any time we want to pull in new features from upstream. I worry that combining merge gating and distributed testing may be coupling things that should stay separate but I wouldn't complain if it ends up being easier to code. |
Also as part of this consider being coming an app maybe? https://developer.github.com/apps/building-integrations/setting-up-a-new-integration/about-choosing-an-integration-type/ This way papr could be installed organization-wide more easily. |
So... I think a good strategy here would be to start with the minimal amount of work needed to get into CentOS CI. And then we can slowly iterate over changing the architecture? There are some changes that will require a lot of upfront investment clearly, but trying to keep things gradual ensures we can timebox improvements. Basically, I think the path of least resistance towards migrating to our CCI OCP project is something like:
Once we got that down, we can talk about swapping out GHPRB, doing fancier queueing, web UIs, etc... Does that make sense? I'll try my hand at (1) and (2) today. |
I see what you're saying, but the thing I struggle with now is basically that the current monolithic architecture is hard at least for me to understand/hack. I guess a basic problem is I just don't have a consistent amount of time to learn it, so it's more when things fail that it becomes critical path. I keep circling back to that queuing approach, it feels orthogonal to the other work. If we stood up a simple service that received the webhooks and also did the conversion to Kube jobs, and posted that for anything else to read - it wouldn't conflict with the ocp branch, and would allow iteration on a more decoupled architecture. Maybe? What still feels fuzzy to me is how result posting works wrt credentials and large log files. |
One of my main goal is to make it really easy to get started and contribute. E.g. the instructions in https://github.com/projectatomic/papr/blob/ocp/docs/RUNNING.md should hopefully get anyone get up and running within 5 minutes. In the long-term, we'll need more people to help maintain this thing if we really want to make it part of our upstream CI story.
Yeah, definitely would like to get to something like that (though I pictured using custom resources much like what Prow does). Between that and getting a One thing I'm still struggling with is whether to eventually support |
One note I'd like to make here, for lack of a better place, is that recently I was playing with replacing GHPRB with Prow. And it worked surprisingly well with the PAPR-in-OCP model. Notably, this involves setting The config I got running with was this: presubmits:
jlebon/papr-sandbox:
- name: papr-sandbox-pull
agent: kubernetes
branches:
- master
always_run: true
skip_report: true
# XXX: need to teach trigger to pass down full comment so PAPR can learn
# to handle retesting failed testsuites
#trigger: "(?m)^/retest(\\s+.*)*$"
#rerun_command: "/retest [testsuite1] [testsuite2] ..."
spec:
restartPolicy: Never
serviceAccountName: papr
volumes:
- name: config-mount
configMap:
name: papr-config
containers:
- name: papr
image: 172.30.1.1:5000/myproject/papr:latest
imagePullPolicy: Always
args:
- "--debug"
- "runtest"
- "--conf"
- "/etc/papr/config"
- "--repo"
- "$(REPO_OWNER)/$(REPO_NAME)"
- "--pull"
- "$(PULL_NUMBER)"
- "--expected-sha1"
- "$(PULL_PULL_SHA)"
securityContext:
runAsUser: 0
volumeMounts:
- name: config-mount
mountPath: /etc/papr
env:
- name: GITHUB_TOKEN
valueFrom:
secretKeyRef:
name: github-token
key: token
optional: false
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-access-key
key: id
optional: false
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-access-key
key: secret
optional: false Then, e.g. |
Hmmm. I definitely see the advantages of taking GHPRB out of the picture, though my primary focus tends to go to the merge queue aspect - I see it as a lot more important than the per-PR testing. I have a worry that prow-but-not-tide etc. is going to leave us with gaps and things to glue; although that's definitely true today with GHPRB/papr/homu, so not really a change for better/worse. In this proposal we'd still be using Homu, correct? |
Yeah, that's one missing piece. I was looking at this some more today. To summarize: there is no prioritization of prowjobs; This means that any prioritization should happen at the Kube level. Kube does support pod priorities, which means we could e.g. label merge pods as having higher priority than PR pods so that they are scheduled before them. However, by default, it also does preemption, which means that e.g. it could kick out an already running PR job to schedule a merge job, which I don't think we want. You can disable preemption, but not per-pod, only at the kubelet level. One idea is to write a custom scheduler for pods coming from prowjobs which would know how to prioritize them (could even do neat things like prioritizing some repos over others), though not sure how much work it'd be (e.g. we might have to implement node selection ourselves if there's no way to reuse the default scheduler logic).
I was looking at possibly going all in at this point. Tide seems to offer the same guarantee of never-broken-master as Homu by rerunning tests before merge. IIUC though, the way it does this is very different: instead of pushing a branch like But at merge time, tide uses the GitHub merge API, rather than pushing it itself. This does bring back discussions from servo/homu#57 -- it assumes that whatever strategy is used by GitHub will be exactly equivalent to what was tested ( There's also no autosquashing feature, though that should be trivial to implement if tide learns to do the push itself. |
Hum, that's assuming the cluster is full. The But if the cluster is full (which actually gets into a more interesting question in a cloud world about just how much money we want to spend, since there is no "full") - yes, I'd say that merge jobs should preempt PR jobs, right? |
Rough problem statement
Right now, our architecture is very much like a waterfall.
Events on GitHub cause a linear cascade of events that
eventually fires off PAPR to run for those specific events.
This has severe limitations:
and thus harder to contribute
Jenkins, OpenStack) results in a higher fault rate
Other issues plaguing the current architecture:
tell what's going on without manually checking the
internal Jenkins queue (also homu queue is sort of visible
but could be way better)
to be able to apply a set of rules as to how jobs should
prioritized. E.g. 'auto' and 'try' branch before PRs, PRs
with certain labels before others, etc...
an inconsistent user experience
Rough solution
We split up the architecture into multiple small services:
Proposed to run in CentOS CI
Need either OpenStack or Docker/Kubernetes
Bin packing problem - can we e.g. reuse Kubernetes
https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
The scheduler receives events from GitHub and queues up the
jobs. It understands .papr.yml and splits them into
individual jobs each representing a testsuite. This allows
e.g. workers that only support containerized workloads to
still participate in the pool. It also allows container work
to be weighed differently from VM/cluster work.
Workers periodically (with forced polls on events if
implementable) query PAPRQ for available jobs. PAPRQ
prioritizes jobs by a given set of rules.
What it would entail
to also handle PR events and add them to its queue. This
naturally resolves the confusing UX experience, and makes
optimizations like Add status-based test exemptions servo/homu#54
trivial to implement.
E.g. @rh-atomic-bot retry will actually know whether to
retry testing the PR, or retry testing the merge.
It also allows for more sophisticated syntax, like:
@rh-atomic-bot retry f26-sanity
Teach PAPR to connect to PAPRQ for jobs. This is either a
long-running service that polls, or is periodically started
by an external service (e.g. Jenkins, OCP)
This can come later. Rather than the workers publishing
to e.g. S3 themselves, do similar to what Cockpit does and
stream logs and updates back to PAPRQ itself. This allows us
to (1) have publicly visible streaming logs, and (2) keep
all the secrets in PAPRQ and only require workers to have a
single token.
Let's finalize this work and split it up amongst team
members so that everyone understands how it works, and can
help manage it.
Risks
Sub-alternative:
Sidecar/wrapper for Homu - PAPR intercepts github
events and forwards them to Homu as well, but also builds
its own state.
(Investigate organization-wide github events)
Transition plan
Can take down per-PR testing while keeping up testing on auto branch.
Alternatives
Customize Jenkins (Integrate better with CentOS CI)
Relationship with GHPRB there?
Hop on http://prow.k8s.io/ with Origin
Rely on Travis more
Other discussion
Standard test roles vs PAPR
PAPR describes more things, handles tasks like provisioning more declaratively
Could have stdtest in upstream git?
Colin: PAPR runs stdtest? Jonathan: Problem: Test in separate git repo. Unless upstream repo also holds stdtest definition?
The text was updated successfully, but these errors were encountered: