Fix bridge connection reset due to invalid packets #2275
base: master
Conversation
Add drop of conntrack INVALID packets in input, so that invalid packets due to TCP window overflow do not cause a connection reset.

Due to some netfilter/conntrack limitations, invalid packets are never treated as NAT'ed; they are reassigned to the host and considered martians. This causes an RST response from the host, which resets the connection. As soon as NAT is set up, for bridge networks for instance, invalid packets have to be dropped in input.

The implementation adds a generic DOCKER-INPUT chain prefilled with a rule for dropping invalid packets and a return rule. As soon as a bridge network is set up, a jump to the DOCKER-INPUT chain is inserted in the filter table INPUT chain.

Fixes moby#1090.

Signed-off-by: Christophe Guillon <[email protected]>
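The chain setup described above corresponds roughly to the following iptables rules. This is a hand-written sketch of what the PR's Go code would program, not the literal implementation; only the DOCKER-INPUT chain name is taken from the description, the exact flags are an assumption:

```
# Create the generic DOCKER-INPUT chain in the filter table.
iptables -t filter -N DOCKER-INPUT

# Drop packets that conntrack classifies as INVALID (e.g. out-of-window
# TCP segments), before the host can answer them with an RST.
iptables -t filter -A DOCKER-INPUT -m conntrack --ctstate INVALID -j DROP

# Return to the calling chain for everything else.
iptables -t filter -A DOCKER-INPUT -j RETURN

# Once some bridge network exists, insert the jump at the top of INPUT.
iptables -t filter -I INPUT -j DOCKER-INPUT
```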
531bb7c to fc8f042
@guillon thanks for addressing this issue
IMHO https://github.com/docker/libnetwork/blob/83e2bc1e11f2faa907737f96f55a45ba2634ded3/iptables/iptables.go#L202
would be the right place to add the invalid rule, only for the specified interface, in the FORWARD chain.
Can you please add test cases as well in
https://github.com/docker/libnetwork/blob/master/drivers/bridge/setup_ip_tables_test.go
Unfortunately, invalid packets, including benign out-of-window packets, must be handled (actually dropped) by an input rule before the FORWARD chain is processed. Otherwise, netfilter will reset the connection before passing the packet to the FORWARD chain.
Yes, I will allocate some time as soon as possible to add tests.
@guillon I have not tried this out, but …
Seems odd, but I will verify this again, thanks for the pointer.
I can confirm that adding the rules in the FORWARD chain does not work.
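For context, the difference between the two placements the thread discusses can be sketched as follows. These are hypothetical rules (the exact rules were elided from the comments above); the interface name docker0 is an assumption, and per the discussion the FORWARD variant is the one that does not work:

```
# Variant in the FORWARD chain -- reported NOT to work: since the invalid
# packet is never DNAT'ed, it is rerouted to the host itself and the RST
# is emitted before the FORWARD chain is ever traversed.
iptables -I FORWARD -o docker0 -m conntrack --ctstate INVALID -j DROP

# Variant in the INPUT chain -- works, but cannot be restricted to
# docker0, because the packet now takes the host's input path.
iptables -I INPUT -m conntrack --ctstate INVALID -j DROP
```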
@guillon if you can share your reproduction steps, I'd be happy to take a stab at it as well
Should we open a ticket for that in the kubernetes issue tracker?
@arkodg I have forked a testbench for the kubernetes issue which I modified to exhibit the issue solved by this pull request. Get it at: https://github.com/guillon/k8s-issue-74839
@thaJeztah I still have to exhibit the problem under kubernetes, in order to sort things out, I'm working on it...
Actually, when installing kubernetes on some physical Ubuntu hosts as described in https://vitux.com/install-and-deploy-kubernetes-on-ubuntu/, the kubernetes networking does not use NAT, hence I can't reproduce the issue over kubernetes.
@guillon awesome work with the repro steps and the comments, was able to recreate this issue very easily. I observed that these … My hunch is that for some reason … which is why adding this rule did the trick
I verified the above using rules such as … and …
I'm hesitant to add these rules to the INPUT chain because you cannot even filter it on the …
Indeed the filtering can't be done on docker0; all invalid packets in conntrack need to be dropped. The first routing decision in your diagram is where an invalid packet in prerouting is routed to the host as fallback and DNAT is abandoned. Note that since last year all our services (swarms + bare docker clients) have been working perfectly with the fix. I don't remember if I mentioned it, but the window overflow error is typical in WAN setups due to WAN optimizations at the WAN router level (packet deduplication in particular, and also some aggressive packet streaming optimization strategies). The use case which revealed this issue was dockerized clients downloading from a distant artifacts database service, where WAN deduplication is quite effective. At that time I also ran a test over a WAN where I could reproduce this by "artificially" making a client very slow, actually a bare …
Thanks for describing the background. In this case, is there any way to capture packets on both ends and determine whether the WAN router and host are abiding by the TCP congestion control RFCs or not? Dropping these packets might solve your issue, but it still masks the problem, because TCP is supposed to handle this case, and the workaround causes more retransmissions on the network, eating up bandwidth.
I'm afraid I can't tell whether the WAN routers behave correctly; I'm not expert enough. It happens when the WAN is loaded, when the routers are being updated or [badly] tuned by the IT department, or on other events over which I have no control. In fact, when the network behaves incorrectly, containerized services fail while bare services remain functional. This has been reported multiple times in other contexts, and even in IaaS vendor setups, when following cross references from issue #1090. What I've seen recommended is that as soon as NAT tables are set up, INVALID packets should be dropped. I agree that invalid packets can be created by bogus WAN routers, but I think we shouldn't care. This seems to be a surprising flaw in the design of the iptables conntrack module; I couldn't really find an explanation for this weird behavior, which makes NAT less stable when INVALID packets are not dropped explicitly.
Would it be possible to raise this issue with the conntrack folks, and see whether this behavior is intentional or not, before we consider adding such a rule?
This is still an issue, and still something that needs to be fixed. Is there any other place to subscribe for potential changes and/or fixes?
Did someone take the time to discuss this with the conntrack people? Reading the discussion above, that's what's currently stalling this PR |
Is there any progress in this PR?
There seems to be a widespread way to add the above as a workaround, as shown in the link below. I don't know whether it's bad network equipment behavior or a libnetwork issue, but I think it would be in the community's best interest to provide a solution or conclusion. https://imbstack.com/2020/05/03/debugging-docker-connection-resets.html
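For reference, the workaround circulating in articles like the one linked above generally takes one of two forms. This is a sketch, not part of this PR's code, and the sysctl alternative is a commonly cited mitigation that is not discussed elsewhere in this thread:

```
# Drop conntrack-INVALID packets before the host can reply with an RST
# (the behavior this PR automates for bridge networks):
iptables -I INPUT -m conntrack --ctstate INVALID -j DROP

# A commonly cited alternative: tell conntrack to accept out-of-window
# TCP segments instead of marking them INVALID.
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
```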
Is there any plan to merge this fix? It has been outstanding for more than 4 years, and people are wasting tons of time tracing down such issues.