
Random / Sporadic 502 gateway timeouts #4433

Closed
DP19 opened this issue Aug 12, 2019 · 21 comments

DP19 commented Aug 12, 2019

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
Bug Report

NGINX Ingress controller version:
0.25.0

Kubernetes version (use kubectl version):
v1.12.10

Environment:

  • Cloud provider or hardware configuration: AWS / EKS
  • OS (e.g. from /etc/os-release): Amazon Linux
  • Kernel (e.g. uname -a): 4.14.106-97.85.amzn2.x86_64

What happened:
We're seeing random, sporadic 502s being returned and are unable to reproduce the problem reliably.

What you expected to happen:
The ingress should respond with a 200.

How to reproduce it (as minimally and precisely as possible):
Unsure, as it happens very sporadically.

Anything else we need to know:

Messages from the ingress controllers:
"*2169 upstream prematurely closed connection while reading response header from upstream"
"*1360038 connect() failed (113: No route to host) while connecting to upstream"
"*1655177 upstream timed out (110: Connection timed out) while connecting to upstream"

This was working a week ago; now we're receiving these 502s from multiple deployments, some of which have not changed in over a month. We've checked the load on the upstream pods and they are handling traffic well, and we can port-forward to them directly without seeing any 502s or connection issues.

aledbf (Member) commented Aug 12, 2019

"*2169 upstream prematurely closed connection while reading response header from upstream"

This means your app closed the connection.

"*1360038 connect() failed (113: No route to host) while connecting to upstream"
"*1655177 upstream timed out (110: Connection timed out) while connecting to upstream"

These two could be related to networking issues (No route to host), or your pod died.
If these errors are random, you can enable retries for HTTP 502 using the annotation nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502

https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#proxy-next-upstream
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#custom-timeouts
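For reference, the annotation goes on the Ingress object, roughly like this (the name, host and backend are placeholders, and proxy-next-upstream-tries is optional):

```yaml
# Illustrative Ingress with retries enabled for connect errors, timeouts and 502s.
# Name, host and backend service are placeholders.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
spec:
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: my-app
              servicePort: 80
```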

aledbf (Member) commented Aug 12, 2019

Also, keep in mind that POST requests are not retried unless you enable that via https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#retry-non-idempotent
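For example, in the controller's ConfigMap (the ConfigMap name and namespace must match your ingress-nginx installation):

```yaml
# Illustrative ingress-nginx ConfigMap entry; adjust name/namespace to your install.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  retry-non-idempotent: "true"
```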

aledbf (Member) commented Aug 13, 2019

Closing. Please reopen if the behavior persists after changing the settings.

aledbf closed this as completed Aug 13, 2019
@Timvissers

We're having identical issues (EKS 1.13). Any reference is welcome.

DP19 (Author) commented Oct 15, 2019

@Timvissers - After seeing some other issues recently I stumbled upon this issue in k8s - kubernetes/kubernetes#74839
and a corresponding blog post
https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

I think this is actually the root cause of the issue I was seeing here. The best bet is to upgrade to 1.15 if possible, as some of the popular fixes will actually cause all communication to some nodes to start failing after a while. Hope this helps.

@Timvissers

Thanks @DP19
I'm not sure what you mean by your remark on the two suggested workarounds. Did they not work for you?

DP19 (Author) commented Oct 16, 2019

@Timvissers - Sorry, they're linked in some other related issues. Here's the Docker libnetwork repo where they're discussing practically the same issue, and where two workarounds are proposed:

moby/libnetwork#1090

The blog post suggests adding a DaemonSet that runs a startup script to set this conntrack flag: "echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal". However, others in the PR for this issue in k8s are suggesting that this will actually "cause the conntrack table to be full and all connectivity to the servers are lost":
kubernetes/kubernetes#74840 (comment)
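For anyone who wants to try it anyway, the blog post's workaround boils down to a privileged DaemonSet roughly like this (just a sketch, names are illustrative, and we have not run it):

```yaml
# Sketch of the conntrack workaround discussed in the blog post; untested here.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: conntrack-liberal
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: conntrack-liberal
  template:
    metadata:
      labels:
        app: conntrack-liberal
    spec:
      hostNetwork: true            # apply the setting in the host's network namespace
      containers:
        - name: sysctl
          image: busybox
          securityContext:
            privileged: true       # required to write to /proc/sys on the node
          command:
            - sh
            - -c
            # Tell conntrack to be liberal about out-of-window TCP packets
            # (so they are not marked INVALID), then keep the pod running.
            - |
              echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
              while true; do sleep 3600; done
```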

So we haven't implemented any fix for our environments yet.

@Timvissers

Just to say that my issue was AWS-specific: aws/amazon-vpc-cni-k8s#641

@miclefebvre

@DP19 We are experiencing the same situation on EKS but using 1.14.
It doesn't seem to be the CNI, because we're not using the version stated in that bug.

Also, why did it work for you weeks ago and then stop working? Was it an update?

Did you find any solution?

I'll try another ingress, but it seems to be network-related.

DP19 (Author) commented Jan 8, 2020

@miclefebvre - So it's actually two issues for us. We were affected by the CNI issue, which was resolved by rolling back the CNI driver. But we have a long-standing issue while running 1.14 that will be resolved once EKS releases 1.15. I opened this issue when the CNI plugin issue came up, but it never really worked without issues due to the kube-proxy bug in 1.14 described in the blog post in my previous comment.

@miclefebvre

@DP19 Thanks,

I think my problem is a little bit different because what I receive is

*14892883 connect() failed (111: Connection refused) while connecting to upstream

Strange

aledbf (Member) commented Jan 8, 2020

*14892883 connect() failed (111: Connection refused) while connecting to upstream

@miclefebvre please check that your applications have liveness and readiness probes.
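Roughly what that looks like in a Deployment (image, port and probe path are placeholders that have to match the application):

```yaml
# Illustrative Deployment with readiness/liveness probes on the upstream container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - containerPort: 8080
          readinessProbe:          # endpoints only receive traffic once this passes
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:           # container is restarted if this keeps failing
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```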

aledbf (Member) commented Jan 8, 2020

@miclefebvre if you can't do that, you could use the annotation nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502 to activate retries in that case

@miclefebvre

@aledbf It has probes and I did add the annotation, but it's still failing.

miclefebvre commented Jan 9, 2020

Note that the app loads; only some resources return 502, and it's random rather than constant.

DP19 (Author) commented Jan 9, 2020

@miclefebvre - I would bypass the ingress using port-forward and see if it still happens. This is also a great place to start if you haven't tried everything there already, and it will help get to a root cause: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/

aledbf (Member) commented Jan 9, 2020

@miclefebvre I suggest you check the ingress controller pod logs to check the retries (pods). Also, maybe your app is restarting?

@miclefebvre

@DP19 I already did, and it works perfectly using port-forward.

@aledbf No, the restart count on my pods stays the same.

@miclefebvre

@DP19 @aledbf I tried with Ambassador and it does the same thing.

I found out that if my cluster has only 1 node, the problem disappears. I'm starting to think it's an AWS EKS networking problem, because I only have 1 pod in my deployment.

@miclefebvre

I have repro steps, but I don't know what I can do with them. Maybe they can help someone.

On an EKS cluster with 2 nodes:

  • Create a deployment with an Nginx Ingress in a namespace X
  • Everything works
  • Delete the namespace X (so everything in it is deleted)
  • Recreate the same deployment with the same namespace X
  • Random 503s when calling the app
  • Roll the 2 EKS nodes without removing the deployment
  • The problem disappears
  • Delete the namespace X (so everything in it is deleted)
  • Recreate the same deployment with the same namespace X
  • Again, random 503s when calling the app

Note that when I only delete the pods, the problem doesn't occur.

@GregCKrause

Wanted to leave this here in case it helps anyone else. I was getting intermittent 502s (the browser confusingly reported a CORS error), and eventually realized that my Kubernetes manifests were using a shared metadata selector label. Be sure these are unique for each service!
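For example (names made up), each Service's selector should match only its own Deployment's pods; a label shared across services can route traffic to the wrong pods and produce intermittent 502s:

```yaml
# Illustrative Services with unique selector labels; names are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: service-a
spec:
  selector:
    app: service-a       # unique to service-a's pods
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: service-b
spec:
  selector:
    app: service-b       # not shared with service-a
  ports:
    - port: 80
      targetPort: 8080
```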
