
Random / Sporadic 502 gateway timeouts #4433

Closed
DP19 opened this issue Aug 12, 2019 · 21 comments

DP19 commented Aug 12, 2019

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
Bug Report

NGINX Ingress controller version:
0.25.0

Kubernetes version (use kubectl version):
v1.12.10

Environment:

  • Cloud provider or hardware configuration: AWS / EKS
  • OS (e.g. from /etc/os-release): Amazon Linux
  • Kernel (e.g. uname -a): 4.14.106-97.85.amzn2.x86_64

What happened:
We're seeing random, sporadic 502s being returned and are unable to reproduce the problem reliably.

What you expected to happen:
The ingress should respond with a 200.

How to reproduce it (as minimally and precisely as possible):
Unsure, as it happens very sporadically.

Anything else we need to know:

Messages from the ingress controllers:
"*2169 upstream prematurely closed connection while reading response header from upstream"
"*1360038 connect() failed (113: No route to host) while connecting to upstream"
"*1655177 upstream timed out (110: Connection timed out) while connecting to upstream"

This was working a week ago; now we're receiving these 502s from multiple deployments, some of which have not changed in over a month. We've checked the load on the upstream pods and they are handling traffic well, and we can port-forward to them directly without seeing any 502s or connection issues.

aledbf (Member) commented Aug 12, 2019

"*2169 upstream prematurely closed connection while reading response header from upstream"

This means your app closed the connection.

"*1360038 connect() failed (113: No route to host) while connecting to upstream"
"*1655177 upstream timed out (110: Connection timed out) while connecting to upstream"

These two could be related to networking issues (No route to host), or your pod died.
If these errors are random, you can enable retries for HTTP 502 using the annotation nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502

https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#proxy-next-upstream
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#custom-timeouts
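For reference, the annotation goes on the Ingress object, roughly like this (the name, host and backend are placeholders, and proxy-next-upstream-tries is optional):

```yaml
# Illustrative Ingress with retries enabled for connect errors, timeouts and 502s.
# Name, host and backend service are placeholders.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
spec:
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: my-app
              servicePort: 80
```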

aledbf (Member) commented Aug 12, 2019

Also, keep in mind that POST requests are not retried unless you enable that via https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#retry-non-idempotent
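For example, in the controller's ConfigMap (the ConfigMap name and namespace must match your ingress-nginx installation):

```yaml
# Illustrative ingress-nginx ConfigMap entry; adjust name/namespace to your install.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  retry-non-idempotent: "true"
```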

aledbf (Member) commented Aug 13, 2019

Closing. Please reopen if the behavior persists after changing the settings.

aledbf closed this as completed Aug 13, 2019
@Timvissers

We're having identical issues (EKS 1.13). Any reference is welcome.

DP19 (Author) commented Oct 15, 2019

@Timvissers - After seeing some other issues recently I stumbled upon this issue in k8s - kubernetes/kubernetes#74839
and a corresponding blog post
https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

I think this is actually the root cause of the issue I was seeing here. The best bet is to upgrade to 1.15 if possible, as some of the popular fixes will actually cause all communication to some nodes to start failing after a while. Hope this helps.

@Timvissers

Thanks @DP19
I'm not sure what you mean by your remark on the two suggested workarounds. Did they not work for you?

DP19 (Author) commented Oct 16, 2019

@Timvissers - Sorry, they're linked in some other related issues. Here's the Docker libnetwork repo where they're discussing practically the same issue, and where two workarounds are proposed:

moby/libnetwork#1090

The blog post suggests adding a DaemonSet that runs a startup script to set this conntrack flag: "echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal". However, others in the PR for this issue in k8s are suggesting that this will actually "cause the conntrack table to be full and all connectivity to the servers are lost":
kubernetes/kubernetes#74840 (comment)
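For anyone who wants to try it anyway, the blog post's workaround boils down to a privileged DaemonSet roughly like this (just a sketch, names are illustrative, and we have not run it):

```yaml
# Sketch of the conntrack workaround discussed in the blog post; untested here.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: conntrack-liberal
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: conntrack-liberal
  template:
    metadata:
      labels:
        app: conntrack-liberal
    spec:
      hostNetwork: true            # apply the setting in the host's network namespace
      containers:
        - name: sysctl
          image: busybox
          securityContext:
            privileged: true       # required to write to /proc/sys on the node
          command:
            - sh
            - -c
            # Tell conntrack to be liberal about out-of-window TCP packets
            # (so they are not marked INVALID), then keep the pod running.
            - |
              echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
              while true; do sleep 3600; done
```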

So we haven't implemented any fix for our environments yet.

@Timvissers

Just to say that my issue was AWS-specific: aws/amazon-vpc-cni-k8s#641

@miclefebvre

@DP19 We are experiencing the same situation on EKS but using 1.14.
It doesn't seem to be the CNI, because we're not using the version stated in that bug.

Also, why did it work for you weeks ago and then stop working? Was it an update?

Did you find any solution?

I'll try another ingress, but it seems to be network-related.

DP19 (Author) commented Jan 8, 2020

@miclefebvre - So it's actually two issues for us. We were affected by the CNI issue, which was resolved by rolling back the CNI driver. But we have a long-standing issue while running 1.14 that will be resolved once EKS releases 1.15. I opened this issue when the CNI plugin issue came up, but it never really worked without issues due to the kube-proxy bug in 1.14 described in the blog post in my previous comment.

@miclefebvre

@DP19 Thanks,

I think my problem is a little bit different because what I receive is

*14892883 connect() failed (111: Connection refused) while connecting to upstream

Strange

aledbf (Member) commented Jan 8, 2020

*14892883 connect() failed (111: Connection refused) while connecting to upstream

@miclefebvre please check that your applications have liveness and readiness probes.
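Roughly what that looks like in a Deployment (image, port and probe path are placeholders that have to match the application):

```yaml
# Illustrative Deployment with readiness/liveness probes on the upstream container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - containerPort: 8080
          readinessProbe:          # endpoints only receive traffic once this passes
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:           # container is restarted if this keeps failing
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```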

aledbf (Member) commented Jan 8, 2020

@miclefebvre if you can't do that, you could use the annotation nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502 to activate retries in that case

@miclefebvre

@aledbf It has probes and I did add the annotation, but it's still failing.

miclefebvre commented Jan 9, 2020

Note that the app loads; only some resources return 502, and it's random rather than constant.

DP19 (Author) commented Jan 9, 2020

@miclefebvre - I would bypass the ingress using port-forward and see if it still happens. This is also a great place to start if you haven't tried everything there already, and it will help get to a root cause: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/

aledbf (Member) commented Jan 9, 2020

@miclefebvre I suggest you check the ingress controller pod logs to check the retries (pods). Also, maybe your app is restarting?

@miclefebvre

@DP19 I already did, and it works perfectly using port-forward.

@aledbf No, the restart count on my pods stays the same.

@miclefebvre

@DP19 @aledbf I tried with Ambassador and it does the same thing.

I found out that if my cluster has only 1 node, the problem disappears. I'm starting to think it's an AWS EKS networking problem, because I only have 1 pod in my deployment.

@miclefebvre

I have repro steps, but I don't know what I can do with them. Maybe they can help someone.

On an EKS cluster with 2 nodes:

  • Create a deployment with an Nginx Ingress in a namespace X
  • Everything works
  • Delete the namespace X (so everything in it is deleted)
  • Recreate the same deployment with the same namespace X
  • Random 503s when calling the app
  • Roll the 2 EKS nodes without removing the deployment
  • The problem disappears
  • Delete the namespace X (so everything in it is deleted)
  • Recreate the same deployment with the same namespace X
  • Again, random 503s when calling the app

Note that when I only delete the pods, the problem doesn't occur.

@GregCKrause

Wanted to leave this here in case it helps anyone else. I was getting intermittent 502s (the browser confusingly reported a CORS error), and eventually realized that my Kubernetes manifests were using a shared metadata selector label. Be sure these are unique for each service!
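For example (names made up), each Service's selector should match only its own Deployment's pods; a label shared across services can route traffic to the wrong pods and produce intermittent 502s:

```yaml
# Illustrative Services with unique selector labels; names are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: service-a
spec:
  selector:
    app: service-a       # unique to service-a's pods
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: service-b
spec:
  selector:
    app: service-b       # not shared with service-a
  ports:
    - port: 80
      targetPort: 8080
```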
