chore(orchestrator): startup and remove containers based on dependencies #5509
Conversation
🍕 Here are the new binary sizes!
```go
for {
	isRunning, err := o.docker.IsContainerRunning(ctx, id)
	switch {
	case err != nil:
		var errContainerExited *dockerengine.ErrContainerExited
		if errors.As(err, &errContainerExited) && !isEssential {
```
do you know what ECS does when a non-essential container fails to start?
yeah I know ECS doesn't error out or anything if the non-essential container has started and exited. I am curious what happens if the container was never able to start. I think it is not super important for us though, just in case you knew off the top of your mind (if you don't, don't bother testing it 😂 !)
```diff
@@ -434,16 +464,19 @@ func (o *Orchestrator) stopTask(ctx context.Context, task Task) error {
 }

 // waitForContainerToStart blocks until the container specified by id starts.
-func (o *Orchestrator) waitForContainerToStart(ctx context.Context, id string) error {
+func (o *Orchestrator) waitForContainerToStart(ctx context.Context, id string, isEssential bool) error {
```
ah, we take `isEssential` as a boolean parameter, instead of letting the callers swallow the exit error from a non-essential container - I assume - because all the callers of `waitForContainerToStart` want the same error handling, right? Essentially, we are passing a boolean parameter to avoid code duplication. Is this guess close to what you thought?
This makes sense to me, though I am wary of boolean parameters: among all the cons they have 😂 , they make code harder to read. For example, it is not obvious what `o.waitForContainerToStart(ctx, opts.ContainerName, true)` means.
yeah your guess is absolutely right 😆
I want to reduce the duplication where each caller has to handle the error accordingly.
I have two options in this case:
- Adjust the doc comment to make it clearer
- Let each caller handle the error scenario of a non-essential container exiting.
Which one do you suggest among these two?
After a brief search I think there are 3 calls to `waitForContainerToStart`, and only 2 of them need to handle the non-essential case, so the duplication is not bad. Therefore, if you ask me - I would probably have opted for option 2. But I will leave the final decision to you 👍🏼
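For illustration only, a minimal sketch of what option 2 could look like, assuming `waitForContainerToStart` keeps its original signature and each interested caller unwraps the exit error itself. The `waitForStart` helper and the `a *runTaskAction` receiver are assumptions invented for this sketch; only `waitForContainerToStart`, `dockerengine.ErrContainerExited`, and `IsEssential` come from this PR.

```go
// Sketch of option 2 (assumed names, not the merged code): the caller decides
// how to treat a non-essential container that exited before it was seen running.
func (a *runTaskAction) waitForStart(ctx context.Context, o *Orchestrator, name string) error {
	err := o.waitForContainerToStart(ctx, o.containerID(name))
	if err == nil {
		return nil
	}
	var exited *dockerengine.ErrContainerExited
	if errors.As(err, &exited) && !a.task.Containers[name].IsEssential {
		// A non-essential container exiting early is logged, not fatal.
		fmt.Printf("non-essential container %q exited: %v\n", name, err)
		return nil
	}
	return fmt.Errorf("wait for container %q to start: %w", name, err)
}
```

The tradeoff is the one discussed above: no boolean at the call sites, at the cost of repeating this handling in the callers that need it.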
```go
healthy, err := o.docker.IsContainerHealthy(ctx, ctrId)
if err != nil {
	if !isEssential {
		fmt.Printf("non-essential container %q is not healthy: %v\n", ctrId, err)
```
nit nit: just because the container could be in a non-running state, or maybe `docker inspect` has some unexpected error (which does not indicate "unhealthiness" of the container):

```diff
-fmt.Printf("non-essential container %q is not healthy: %v\n", ctrId, err)
+fmt.Printf("check health status for container %q: %v\n", ctrId, err)
```

The wrapped error returned from `IsContainerHealthy` is enough to tell the users whether it's because the container is indeed unhealthy, or it doesn't have a health check, or something else.
also.....on second thought, if I have a manifest like this:

```yaml
sidecars:
  container1:
    depends_on:
      container2: healthy
  container2:
    # not essential
    healthcheck: ...
```

And then container2 turns out to be unhealthy, should I really start container1 🤔 It seems from the implementation that we will just log and start container1. Curious to hear your thoughts!
I tested this scenario. ECS tries to deploy `container2` and it becomes unhealthy. ECS does not try to start `container1`, nor is the circuit breaker triggered. CloudFormation just errors out after 3 hours:

Resource timed out waiting for completion

So essentiality and `dependsOn` are two different entities.
Changed the logic to completely stop and remove all the containers even if a non-essential container becomes unhealthy.
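A minimal sketch of that behavior under one assumption: a dependency wait on health that does not special-case non-essential containers, so a failed health check bubbles up and the whole task gets stopped and removed. The helper name and the polling loop below are illustrative, not the merged implementation; only `o.docker.IsContainerHealthy` appears in this PR.

```go
// Illustrative only (assumed helper): waiting on a depended-on container's
// health, with no isEssential branch. Any failure propagates to the caller,
// which then tears down every container in the task.
func (o *Orchestrator) waitForDependencyHealthy(ctx context.Context, ctrID string) error {
	for {
		healthy, err := o.docker.IsContainerHealthy(ctx, ctrID)
		switch {
		case err != nil:
			return fmt.Errorf("check health status for container %q: %w", ctrID, err)
		case healthy:
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```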
```go
name, state := name, state
eg.Go(func() error {
	ctrId := o.containerID(name)
	isEssential := definitions[name].IsEssential
```
To me, it's a bit counter-intuitive that a container would care about whether the containers that it depends on are essential 💭 For example, if the `depends_on` specifies that container A can start only after container B has started, why would container A care whether container B is essential?
It seems like we are using `isEssential` here mainly to decide:
1. whether to error out from `waitForContainerToStart`
2. whether to `log.Successf` or to just `fmt.Printf`
3. whether to error out from `IsContainerHealthy`

For 1,
return o.waitForContainerToStart(ctx, o.containerID(containerName), a.task.Containers[containerName].IsEssential)
For 2, I think
return o.waitForContainerToStart(ctx, o.containerID(containerName), a.task.Containers[containerName].IsEssential)
For 3, I have the same question in https://github.com/aws/copilot-cli/pull/5509/files#r1428611649.
Essentially, I feel like we are mixing two routes of logic here:
1. To complain or just log when an essential/non-essential container is not functioning
2. To determine whether to start a container based on its dependency

The function that handles 2, I think, typically should not care about the essentiality of the containers.
Let me know what you think!
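One hypothetical way to picture the separation being suggested, where dependency handling knows only about `depends_on` conditions and essentiality stays with whoever owns the task lifecycle. Every name in this sketch is an assumption for illustration (it reuses the `waitForDependencyHealthy` helper sketched earlier); it is not code from this PR.

```go
// Hypothetical split (illustrative names only): dependency logic decides *when*
// a container may start, based purely on its depends_on conditions.
func (o *Orchestrator) waitForDependencies(ctx context.Context, deps map[string]string) error {
	for depName, condition := range deps {
		id := o.containerID(depName)
		var err error
		switch condition {
		case "start":
			err = o.waitForContainerToStart(ctx, id)
		case "healthy":
			err = o.waitForDependencyHealthy(ctx, id)
		}
		if err != nil {
			return fmt.Errorf("dependency %q: %w", depName, err)
		}
	}
	return nil
}
```

Essentiality would then only influence how failures are reported by the caller, not whether a dependent container is allowed to start.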
Thank you for pointing this out. Yeah, essentiality and dependsOn are different entities.
I changed the logic to reflect the above.
Awesome! LGTM overall
Co-authored-by: Wanxian Yang <[email protected]>
Two more questions and we are good to go!!!! Amazing work Adi!
awesome!
```diff
@@ -475,6 +475,7 @@ func (o *Orchestrator) waitForContainerToStart(ctx context.Context, id string) error {
 	case err != nil:
 		return fmt.Errorf("check if %q is running: %w", id, err)
 	case isRunning:
+		log.Successf("Successfully started container %s\n", id)
```
Not for this PR, but some time later after this is merged -
I feel like it'd be easier to refactor the code in the future if we add a logger to Orchestrator:

```go
type Orchestrator struct {
	logger log.Logger
}

func New() *Orchestrator {
	return &Orchestrator{logger: log.New()}
}

func (o *Orchestrator) waitForContainerToStart(ctx context.Context, id string) error {
	o.logger.Successf("...")
}
```

This would make it easier for us to a) mute the logger by creating a no-op logger in New(), b) create a logger that logs to an open file instead of stdout/err, and c) refactor the code, because we can easily locate all the logging by "Go to reference" on the logger (vs. now, where we need to do a global search on the keyword "log").
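For what it's worth, point a) could look roughly like the sketch below. The `Logger` interface and `noopLogger` type are assumptions invented here for illustration; they are not part of the copilot log package.

```go
// Sketch only (assumed types): a minimal logger abstraction plus a no-op
// implementation that a constructor could swap in to silence the Orchestrator,
// for example in unit tests.
type Logger interface {
	Successf(format string, args ...interface{})
	Errorf(format string, args ...interface{})
}

type noopLogger struct{}

func (noopLogger) Successf(format string, args ...interface{}) {}
func (noopLogger) Errorf(format string, args ...interface{})   {}
```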
This is the final PR of the Essential and DependsOn UX, with the below manifest config