Added required call to allocate VIPs when endpoints are restored #2468

Merged
merged 2 commits into moby:master from balrajsingh:libnetwork_1790
Dec 15, 2017

Conversation

balrajsingh
Contributor

Tracking down libnetwork/1790#issuecomment-308222053. The report is that if on a single node several services are started, and if this node is then rebooted, all the services appear to come back but some of them are no longer reachable.

On probing, the cause turned out to be an invalid assignment of IP addresses to services when they were restored. Specifically, the same IP address was assigned to one service's VIP and also a different service's endpoint. The result was that packets got delivered to the wrong container and caused symptoms like services or ports unreachable.

This is very likely to also be the cause of moby/moby#35675 and other duplicate-IP or overlapping-IP issues.

The reason for this problem seems to be that the code path followed when services are restored never contacts or informs IPAM about the IP addresses used as the restored services' VIPs. So IPAM thinks those IP addresses are still available and hands them out to endpoints and new services, causing the observed chaos.

To work out the right fix, I compared the code path when a service is created from the CLI with the code path when a service is restored on reboot. To me this fix is the bit that should always have been on the restore path but was omitted. With this fix IPAM is correctly informed and its state is consistent with what I see on the network.

I have tested this fix on a single node running several services, and with multiple nodes and multiple managers running many services (specifically 2 nodes and 2 managers). In both cases, without the fix a reboot would cause IP address overlaps on the ingress network. With the fix there are no overlaps.

While the fix seems to work, I'm not sure if it is at exactly the right point in this function, or indeed if it is the right or complete fix. Please take a look and let me know.
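
To make the missing step concrete, here is a minimal sketch of the idea, not the actual swarmkit code (the helper name and wiring are illustrative): on restore the VIP address is already known, so it just needs to be re-reserved with IPAM.

package allocator

import (
	"fmt"
	"net"

	"github.com/docker/libnetwork/ipamapi"
	"github.com/docker/swarmkit/api"
)

// reserveRestoredVIP re-reserves a restored service's VIP with IPAM.
// Passing the already-known address as the preferred address makes IPAM
// mark it as in use instead of leaving it free to hand out to another
// endpoint or service.
func reserveRestoredVIP(ipam ipamapi.Ipam, poolID string, vip *api.Endpoint_VirtualIP) error {
	ip, _, err := net.ParseCIDR(vip.Addr) // VIPs are stored in CIDR form
	if err != nil {
		return fmt.Errorf("invalid restored VIP %q: %v", vip.Addr, err)
	}
	if _, _, err := ipam.RequestAddress(poolID, ip, nil); err != nil {
		return fmt.Errorf("could not re-reserve VIP %s with IPAM: %v", ip, err)
	}
	return nil
}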

@dperny @fcrisciani

Signed-off-by: Balraj Singh [email protected]

@codecov

codecov bot commented Dec 6, 2017

Codecov Report

Merging #2468 into master will decrease coverage by 0.35%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2468      +/-   ##
==========================================
- Coverage   62.07%   61.72%   -0.35%     
==========================================
  Files          49      128      +79     
  Lines        6842    21120   +14278     
==========================================
+ Hits         4247    13037    +8790     
- Misses       2162     6677    +4515     
- Partials      433     1406     +973

@jbscare

jbscare commented Dec 6, 2017

One thing I wanted to mention, which came up on moby/moby#35675, is that our situation includes a current service (A) that overlaps with a deleted service (K). (In particular, docker network inspect -v output still includes a VIP and an endpoint for K, and the addresses for that VIP and endpoint are allocated to two endpoints in A.)

We're not sure if the overlap happened before or after K was deleted (I can try to find out, although I'm not sure where to look), but I wanted to mention it just in case it's relevant to this PR. We definitely have restarted Docker and/or rebooted nodes at some point in here.

Thanks for tracking this down!

@thaJeztah
Member

Thanks for the additional information @jbscare - appreciated 👍

@balrajsingh
Contributor Author

Thanks @jbscare. I can't immediately see a sequence of operations that would end up in the state you see, but I'm still tempted to say that this issue is the cause of what you see. The reason is that when the module that tracks IP addresses in use goes out of sync with the addresses actually in use, there is no immediate fatal error. The system just limps along and can get itself into very unusual states, including the one you see. Even when it's in a bad way, some of the mesh networking keeps working, so the actual failure is discovered quite late (unless of course you do a sensitive test like yours where you track connections to individual containers).

If it's possible to do the following test it would help a lot. It's a bit crude but should give a strong indication of whether this is the cause of the problem:

  • First reset to a clean state. So remove all services and then reboot the VMs.
  • On a freshly booted machine with no services running, start several of your services.
  • Run docker network inspect -v ingress and verify that there are no overlaps between VIPs and EndpointIPs (also save the output).
  • Reboot the Leader node (use docker node ls to find the leader).
  • Do a docker network inspect -v ingress once all the services are running again after reboot. This time you should see some IP overlaps.

These steps were just to verify that you do see overlaps.

Now for a crude way to temporarily avoid overlaps - if this works then you are very likely hitting this issue:

  • Reset again to a clean state by removing all services and rebooting all VMs.
  • Start all your services.
  • Now remove all your services, and repeat the start/remove sequence a few times. The effect of this is to move the VIPs in use further down the IP range.
  • Now start your services again as you would actually like them.
  • Verify that there are no IP overlaps in docker network inspect -v ingress (and save the output).
  • Reboot the Leader node.
  • A docker network inspect -v ingress after this reboot should not have overlaps. Also all your tests that track the connections should work fine. Obviously this isn't a workaround since the overlaps will still happen, just a bit later.

Thanks again!

@dperny
Collaborator

dperny commented Dec 6, 2017

This is definitely in the wrong place, because we shouldn't be allocating in IsServiceAllocated; we should just be returning false, and then allocating in the next function. But the fix looks good.

God, this part of the code is such a mess. It desperately needs to be refactored.

@dperny
Collaborator

dperny commented Dec 6, 2017

Ok, so taking a deeper dig... this doesn't LGTM, because we shouldn't be allocating VIPs in IsServiceAllocated. Though it might fix the problem, it makes the code worse. And it doesn't explain why the VIP loop in AllocateService isn't doing this work.

@anshulpundir
Contributor

anshulpundir commented Dec 7, 2017

I agree with @dperny that the function IsServiceAllocated() doesn't seem like the right place to be doing this. From the sound of it, it seems like IsServiceAllocated() should be a 'const' function with no side effects.
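
Concretely, the split being asked for looks roughly like this from the caller's side (a sketch, assuming the NetworkAllocator interface exposes both methods, as the diff below suggests):

import (
	"github.com/docker/swarmkit/api"
	"github.com/docker/swarmkit/manager/allocator/networkallocator"
)

// ensureServiceAllocated sketches the proposed shape: IsServiceAllocated
// stays a pure state check, and every IPAM call happens inside
// AllocateService, on the restore path just like the CLI path.
func ensureServiceAllocated(na networkallocator.NetworkAllocator, s *api.Service) error {
	if na.IsServiceAllocated(s) {
		return nil // purely a check; no side effects here
	}
	return na.AllocateService(s)
}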

Contributor

@anshulpundir anshulpundir left a comment


@balrajsingh can you also please see if a unit test can be added for the fix?

@balrajsingh
Contributor Author

Fair enough, it should then be done where it makes most sense per the overall design.
Note that AllocateService is only called on the "create from CLI" path and not on the "restore" path. IsServiceAllocated is called at roughly the equivalent point on the "restore" path.
If I make IsServiceAllocated return false, the current logic doesn't go on to make the required calls through to IPAM.
Some refactoring is going to be needed to do this right.

@GordonTheTurtle

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "libnetwork_1790" git@github.com:balrajsingh/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~2
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@@ -404,6 +404,9 @@ func (na *cnmNetworkAllocator) IsServiceAllocated(s *api.Service, flags ...func(
vipLoop:
for _, vip := range s.Endpoint.VirtualIPs {
if na.IsVIPOnIngressNetwork(vip) && networkallocator.IsIngressNetworkNeeded(s) {
if _, ok := na.services[s.ID]; !ok {
Contributor


Please add a comment here explaining the scenario, which is: the ingress network is required but allocation has not been done for the ingress network.
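
For example, something along these lines; the comment wording is only a suggestion, and the surrounding lines are the hunk above:

if na.IsVIPOnIngressNetwork(vip) && networkallocator.IsIngressNetworkNeeded(s) {
	// The service needs the ingress network, but allocation for the
	// ingress network has not been done (the allocator is not yet
	// tracking this service), so report the service as unallocated.
	if _, ok := na.services[s.ID]; !ok {
		return false
	}
	continue vipLoop
}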

@balrajsingh
Contributor Author

Thanks @abhi and thanks also for suggesting this fix.
I'm trying to write a unit test to check for the IP overlap condition that this fixes but it will take me a bit more time to get it done.
In the meantime, please check if this fix looks OK.


VirtualIPs: []*api.Endpoint_VirtualIP{
{
NetworkID: "ingress-nw-id",
Contributor


I think the ingress subnet is /16. This should be fine for the unit test.
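
For reference, an ingress fixture along these lines gives the test a /16 pool to draw from (field values assumed, not copied from the PR):

ingress := &api.Network{
	ID: "ingress-nw-id",
	Spec: api.NetworkSpec{
		Annotations: api.Annotations{Name: "ingress"},
		Ingress:     true,
		IPAM: &api.IPAMOptions{
			// 10.255.0.0/16 is the default ingress subnet; /16 leaves
			// ample room for the VIPs and task IPs the test allocates
			Configs: []*api.IPAMConfig{{Subnet: "10.255.0.0/16"}},
		},
	},
}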

s := store.NewMemoryStore(nil)
assert.NotNil(t, s)
defer s.Close()

Contributor


extra empty line

PublishedPort: uint32(8001 + i),
},
},

Contributor


empty line

PublishedPort: uint32(8001 + i),
},
},

Contributor


empty line

DesiredState: api.TaskStateRunning,
}
assert.NoError(t, store.CreateTask(tx, tsk))

Contributor


empty line

panic("missing task networks")
}
if len(task.Networks[0].Addresses) == 0 {
panic("missing task network address")
Contributor

@abhi abhi Dec 13, 2017


Why panic? We should use assert, like it's used in all the other tests.
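
For instance, roughly (a sketch; this assumes the checks live in a helper that can report through the test's TestingT and bail out early instead of panicking):

if !assert.NotEmpty(t, task.Networks, "missing task networks") {
	return false
}
if !assert.NotEmpty(t, task.Networks[0].Addresses, "missing task network address") {
	return false
}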


assignedIP := task.Networks[0].Addresses[0]
if assignedIPs[assignedIP] {
t.Fatalf("task %s assigned duplicate IP %s", task.ID, assignedIP)
Contributor


same as above. use assert

t.Fatalf("task %s assigned duplicate IP %s", task.ID, assignedIP)
}
assignedIPs[assignedIP] = true

Contributor


empty line

assignedIPs[assignedIP] = true

if assignedVIPs[assignedIP] {
t.Fatalf("a service and task %s have the same IP %s", task.ID, assignedIP)
Contributor


use assert

if assignedVIPs[assignedIP] {
t.Fatalf("a service and task %s have the same IP %s", task.ID, assignedIP)
}

Contributor


empty line

Contributor

@abhi abhi left a comment


Few comments

}
assignedIP := task.Networks[0].Addresses[0]
if assignedIPs[assignedIP] {
t.Fatalf("task %s assigned duplicate IP %s", task.ID, assignedIP)
Contributor


I don't think this t.Fatalf is right everywhere. We need to return the assert result instead.

Contributor


On leader change or leader reboot, the restore logic in the allocator
was allocating overlapping IP addresses for VIPs and endpoints in the
ingress network. The fix added as part of this commit ensures
that during restore we allocate the existing VIP and endpoint.

Signed-off-by: Balraj Singh <[email protected]>
Contributor

@abhi abhi left a comment


LGTM

@abhi
Contributor

abhi commented Dec 14, 2017

@anshulpundir @dperny, @balrajsingh has found the root cause. This issue happens only when a service is attached to the ingress network (which is handled a little differently). The fix is ready for a final review. PTAL

}

assignedVIPs := make(map[string]bool)
assignedIPs := make(map[string]bool)


you could have used one map; in any case there must be no overlap between VIPs and task IPs
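
A sketch of the single-map variant (the helper name is illustrative):

// One set suffices because VIPs and task IPs must never overlap; any
// repeated address, whichever kind owned it first, is a failure.
seen := make(map[string]bool)
assertUniqueIP := func(t assert.TestingT, owner, ip string) bool {
	if !assert.False(t, seen[ip], "%s assigned duplicate IP %s", owner, ip) {
		return false
	}
	seen[ip] = true
	return true
}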

Collaborator

@dperny dperny left a comment


LGTM after fixing the one style nit.


assignedVIPs := make(map[string]bool)
assignedIPs := make(map[string]bool)
hasNoIPOverlapServices := func(fakeT assert.TestingT, service *api.Service) bool {
Collaborator


strictly speaking, you should either use fakeT instead of t in your calls to the assert functions, or you should change the function signature to replace this variable with an underscore _. The former is probably the correct option. Or, you can just rename fakeT to t and shadow the variable t instead of closing over it.
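
The shadowing option looks like this; the body is elided since only the parameter name changes:

// Naming the parameter t shadows the outer *testing.T, so every assert
// call inside the closure reports through the TestingT the caller passes.
hasNoIPOverlapServices := func(t assert.TestingT, service *api.Service) bool {
	// ...same checks as before; assert.* calls now receive this t...
	return assert.NotNil(t, service.Endpoint)
}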

@fcrisciani

LGTM

Contributor

@abhi abhi left a comment


LGTM

@dperny
Collaborator

dperny commented Dec 15, 2017

Yeah LGTM.

@dperny dperny merged commit a6519e2 into moby:master Dec 15, 2017
@dperny
Collaborator

dperny commented Dec 15, 2017

strictly speaking, abhi and flavio aren't maintainers, but they know this code probably better than most of the maintainers, so i'm merging off their LGTMs.

abhi pushed a commit to abhi/swarmkit that referenced this pull request Jan 4, 2018
Added required call to allocate VIPs when endpoints are restored

Signed-off-by: abhi <[email protected]>
nishanttotla added a commit that referenced this pull request Jan 4, 2018
[Backport 17.06.5] #2468 Added required call to allocate VIPs when endpoints are restored