Added required call to allocate VIPs when endpoints are restored #2468

Merged
merged 2 commits into moby:master from balrajsingh:libnetwork_1790
Dec 15, 2017

Conversation

balrajsingh
Contributor

Tracking down libnetwork/1790#issuecomment-308222053. The report is that if on a single node several services are started, and if this node is then rebooted, all the services appear to come back but some of them are no longer reachable.

On probing, the cause turned out to be an invalid assignment of IP addresses to services when they were restored. Specifically, the same IP address was assigned to one service's VIP and also a different service's endpoint. The result was that packets got delivered to the wrong container and caused symptoms like services or ports unreachable.

This is very likely to also be the cause of moby/moby#35675 and other duplicate-IP or overlapping-IP issues.

The reason for this problem seems to be that the code path followed when services are restored never contacts or informs IPAM about the IP addresses used as the restored services' VIPs. So IPAM thinks those IP addresses are still available and hands them out to endpoints and new services, causing the observed chaos.

To work out the right fix, I compared the code path when a service is created from the CLI with the code path when a service is restored on reboot. To me this fix is the bit that should always have been on the restore path but was omitted. With this fix IPAM is correctly informed and its state is consistent with what I see on the network.

I have tested this fix on a single node running several services, and with multiple nodes and multiple managers running many services (specifically 2 nodes and 2 managers). In both cases, without the fix a reboot would cause IP address overlaps on the ingress network. With the fix there are no overlaps.

While the fix seems to work, I'm not sure if it is at exactly the right point in this function, or indeed if it is the right or complete fix. Please take a look and let me know.
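
To make the missing step concrete, here is a minimal sketch of the idea, not the actual swarmkit code (the helper name and wiring are illustrative): on restore the VIP address is already known, so it just needs to be re-reserved with IPAM.

package allocator

import (
	"fmt"
	"net"

	"github.com/docker/libnetwork/ipamapi"
	"github.com/docker/swarmkit/api"
)

// reserveRestoredVIP re-reserves a restored service's VIP with IPAM.
// Passing the already-known address as the preferred address makes IPAM
// mark it as in use instead of leaving it free to hand out to another
// endpoint or service.
func reserveRestoredVIP(ipam ipamapi.Ipam, poolID string, vip *api.Endpoint_VirtualIP) error {
	ip, _, err := net.ParseCIDR(vip.Addr) // VIPs are stored in CIDR form
	if err != nil {
		return fmt.Errorf("invalid restored VIP %q: %v", vip.Addr, err)
	}
	if _, _, err := ipam.RequestAddress(poolID, ip, nil); err != nil {
		return fmt.Errorf("could not re-reserve VIP %s with IPAM: %v", ip, err)
	}
	return nil
}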

@dperny @fcrisciani

Signed-off-by: Balraj Singh [email protected]

@codecov

codecov bot commented Dec 6, 2017

Codecov Report

Merging #2468 into master will decrease coverage by 0.35%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2468      +/-   ##
==========================================
- Coverage   62.07%   61.72%   -0.35%     
==========================================
  Files          49      128      +79     
  Lines        6842    21120   +14278     
==========================================
+ Hits         4247    13037    +8790     
- Misses       2162     6677    +4515     
- Partials      433     1406     +973

@jbscare

jbscare commented Dec 6, 2017

One thing I wanted to mention, which came up on moby/moby#35675, is that our situation includes a current service (A) that overlaps with a deleted service (K). (In particular, docker network inspect -v output still includes a VIP and an endpoint for K, and the addresses for that VIP and endpoint are allocated to two endpoints in A.)

We're not sure if the overlap happened before or after K was deleted (I can try to find out, although I'm not sure where to look), but I wanted to mention it just in case it's relevant to this PR. We definitely have restarted Docker and/or rebooted nodes at some point in here.

Thanks for tracking this down!

@thaJeztah
Member

Thanks for the additional information @jbscare - appreciated 👍

@balrajsingh
Contributor Author

Thanks @jbscare. I can't immediately see a sequence of operations that would end up in the state you see, but I'm still tempted to say that this issue is the cause of what you see. The reason is that when the module that tracks IP addresses in use goes out of sync with the addresses actually in use, there is no immediate fatal error. The system just limps along and can get itself into very unusual states, including the one you see. Even when it's in a bad way, some of the mesh networking keeps working, so the actual failure is discovered quite late (unless of course you do a sensitive test like yours where you track connections to individual containers).

If it's possible to do the following test it would help a lot. It's a bit crude but should give a strong indication of whether this is the cause of the problem:

  • First reset to a clean state. So remove all services and then reboot the VMs.
  • On a freshly booted machine with no services running, start several of your services.
  • Run docker network inspect -v ingress and verify that there are no overlaps between VIPs and EndpointIPs (also save the output).
  • Reboot the Leader node (use docker node ls to find the leader).
  • Do a docker network inspect -v ingress once all the services are running again after reboot. This time you should see some IP overlaps.

These steps were just to verify that you do see overlaps.

Now for a crude way to temporarily avoid overlaps - if this works then you are very likely hitting this issue:

  • Reset again to a clean state by removing all services and rebooting all VMs.
  • Start all your services.
  • Now remove all your services, and repeat the start/remove sequence a few times. The effect of this is to move the VIPs in use further down the IP range.
  • Now start your services again as you would actually like them.
  • Verify that there are no IP overlaps in docker network inspect -v ingress (and save the output).
  • Reboot the Leader node.
  • A docker network inspect -v ingress after this reboot should not have overlaps. Also all your tests that track the connections should work fine. Obviously this isn't a workaround since the overlaps will still happen, just a bit later.

Thanks again!

@dperny
Collaborator

dperny commented Dec 6, 2017

This is definitely in the wrong place, because we shouldn't be allocating in IsServiceAllocated; we should just be returning false, and then allocating in the next function. But the fix looks good.

God, this part of the code is such a mess. It desperately needs to be refactored.

@dperny
Collaborator

dperny commented Dec 6, 2017

Ok, so taking a deeper dig... this doesn't LGTM, because we shouldn't be allocating VIPs in IsServiceAllocated. Though it might fix the problem, it makes the code worse. And it doesn't explain why the VIP loop in AllocateService isn't doing this work.

@anshulpundir
Contributor

anshulpundir commented Dec 7, 2017

I agree with @dperny that the function IsServiceAllocated() doesn't seem like the right place to be doing this. From the sound of it, it seems like IsServiceAllocated() should be a 'const' function with no side effects.
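
Concretely, the split being asked for looks roughly like this from the caller's side (a sketch, assuming the NetworkAllocator interface exposes both methods, as the diff below suggests):

import (
	"github.com/docker/swarmkit/api"
	"github.com/docker/swarmkit/manager/allocator/networkallocator"
)

// ensureServiceAllocated sketches the proposed shape: IsServiceAllocated
// stays a pure state check, and every IPAM call happens inside
// AllocateService, on the restore path just like the CLI path.
func ensureServiceAllocated(na networkallocator.NetworkAllocator, s *api.Service) error {
	if na.IsServiceAllocated(s) {
		return nil // purely a check; no side effects here
	}
	return na.AllocateService(s)
}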

Contributor

@anshulpundir anshulpundir left a comment


@balrajsingh can you also please see if a unit test can be added for the fix?

@balrajsingh
Contributor Author

Fair enough, it should then be done where it makes most sense per the overall design.
Note that AllocateService is only called on the "create from CLI" path and not on the "restore" path. IsServiceAllocated is called at roughly the equivalent point on the "restore" path.
If I make IsServiceAllocated return false, the current logic doesn't go on to make the required calls through to IPAM.
Some refactoring is going to be needed to do this right.

@GordonTheTurtle

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "libnetwork_1790" git@github.com:balrajsingh/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~2
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@@ -404,6 +404,9 @@ func (na *cnmNetworkAllocator) IsServiceAllocated(s *api.Service, flags ...func(
vipLoop:
for _, vip := range s.Endpoint.VirtualIPs {
if na.IsVIPOnIngressNetwork(vip) && networkallocator.IsIngressNetworkNeeded(s) {
if _, ok := na.services[s.ID]; !ok {
Contributor


Please add a comment here explaining the scenario, which is: the ingress network is required but allocation has not been done for the ingress network.
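
For example, something along these lines; the comment wording is only a suggestion, and the surrounding lines are the hunk above:

if na.IsVIPOnIngressNetwork(vip) && networkallocator.IsIngressNetworkNeeded(s) {
	// The service needs the ingress network, but allocation for the
	// ingress network has not been done (the allocator is not yet
	// tracking this service), so report the service as unallocated.
	if _, ok := na.services[s.ID]; !ok {
		return false
	}
	continue vipLoop
}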

@balrajsingh
Contributor Author

Thanks @abhi and thanks also for suggesting this fix.
I'm trying to write a unit test to check for the IP overlap condition that this fixes but it will take me a bit more time to get it done.
In the meantime, please check if this fix looks OK.


VirtualIPs: []*api.Endpoint_VirtualIP{
{
NetworkID: "ingress-nw-id",
Contributor


I think the ingress subnet is /16. This should be fine for the unit test.
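
For reference, an ingress fixture along these lines gives the test a /16 pool to draw from (field values assumed, not copied from the PR):

ingress := &api.Network{
	ID: "ingress-nw-id",
	Spec: api.NetworkSpec{
		Annotations: api.Annotations{Name: "ingress"},
		Ingress:     true,
		IPAM: &api.IPAMOptions{
			// 10.255.0.0/16 is the default ingress subnet; /16 leaves
			// ample room for the VIPs and task IPs the test allocates
			Configs: []*api.IPAMConfig{{Subnet: "10.255.0.0/16"}},
		},
	},
}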

s := store.NewMemoryStore(nil)
assert.NotNil(t, s)
defer s.Close()

Contributor


extra empty line

PublishedPort: uint32(8001 + i),
},
},

Contributor


empty line

PublishedPort: uint32(8001 + i),
},
},

Contributor


empty line

DesiredState: api.TaskStateRunning,
}
assert.NoError(t, store.CreateTask(tx, tsk))

Contributor


empty line

panic("missing task networks")
}
if len(task.Networks[0].Addresses) == 0 {
panic("missing task network address")
Contributor

@abhi abhi Dec 13, 2017


Why panic? We should use assert, like it's used in all the other tests.
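
For instance, roughly (a sketch; this assumes the checks live in a helper that can report through the test's TestingT and bail out early instead of panicking):

if !assert.NotEmpty(t, task.Networks, "missing task networks") {
	return false
}
if !assert.NotEmpty(t, task.Networks[0].Addresses, "missing task network address") {
	return false
}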


assignedIP := task.Networks[0].Addresses[0]
if assignedIPs[assignedIP] {
t.Fatalf("task %s assigned duplicate IP %s", task.ID, assignedIP)
Contributor


same as above. use assert

t.Fatalf("task %s assigned duplicate IP %s", task.ID, assignedIP)
}
assignedIPs[assignedIP] = true

Contributor


empty line

assignedIPs[assignedIP] = true

if assignedVIPs[assignedIP] {
t.Fatalf("a service and task %s have the same IP %s", task.ID, assignedIP)
Contributor


use assert

if assignedVIPs[assignedIP] {
t.Fatalf("a service and task %s have the same IP %s", task.ID, assignedIP)
}

Contributor


empty line

Contributor

@abhi abhi left a comment


Few comments

}
assignedIP := task.Networks[0].Addresses[0]
if assignedIPs[assignedIP] {
t.Fatalf("task %s assigned duplicate IP %s", task.ID, assignedIP)
Contributor


I don't think this t.Fatalf is right everywhere. We need to return the assert result instead.

Contributor


On leader change or leader reboot, the restore logic in the allocator
was allocating overlapping IP addresses for VIPs and endpoints in the
ingress network. The fix added as part of this commit ensures
that during restore we allocate the existing VIP and endpoint.

Signed-off-by: Balraj Singh <[email protected]>
Contributor

@abhi abhi left a comment


LGTM

@abhi
Contributor

abhi commented Dec 14, 2017

@anshulpundir @dperny, @balrajsingh has found the root cause. This issue happens only when a service is attached to the ingress network (which is handled a little differently). The fix is ready for a final review. PTAL

}

assignedVIPs := make(map[string]bool)
assignedIPs := make(map[string]bool)


you could have used one map; in any case there must be no overlap between VIPs and task IPs
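
A sketch of the single-map variant (the helper name is illustrative):

// One set suffices because VIPs and task IPs must never overlap; any
// repeated address, whichever kind owned it first, is a failure.
seen := make(map[string]bool)
assertUniqueIP := func(t assert.TestingT, owner, ip string) bool {
	if !assert.False(t, seen[ip], "%s assigned duplicate IP %s", owner, ip) {
		return false
	}
	seen[ip] = true
	return true
}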

Collaborator

@dperny dperny left a comment


LGTM after fixing the one style nit.


assignedVIPs := make(map[string]bool)
assignedIPs := make(map[string]bool)
hasNoIPOverlapServices := func(fakeT assert.TestingT, service *api.Service) bool {
Collaborator


strictly speaking, you should either use fakeT instead of t in your calls to the assert functions, or you should change the function signature to replace this variable with an underscore _. The former is probably the correct option. Or, you can just rename fakeT to t and shadow the variable t instead of closing over it.
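
The shadowing option looks like this; the body is elided since only the parameter name changes:

// Naming the parameter t shadows the outer *testing.T, so every assert
// call inside the closure reports through the TestingT the caller passes.
hasNoIPOverlapServices := func(t assert.TestingT, service *api.Service) bool {
	// ...same checks as before; assert.* calls now receive this t...
	return assert.NotNil(t, service.Endpoint)
}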

@fcrisciani

LGTM

Contributor

@abhi abhi left a comment


LGTM

@dperny
Collaborator

dperny commented Dec 15, 2017

Yeah LGTM.

@dperny dperny merged commit a6519e2 into moby:master Dec 15, 2017
@dperny
Collaborator

dperny commented Dec 15, 2017

strictly speaking, abhi and flavio aren't maintainers, but they know this code probably better than most of the maintainers, so i'm merging off their LGTMs.

abhi pushed a commit to abhi/swarmkit that referenced this pull request Jan 4, 2018
Added required call to allocate VIPs when endpoints are restored

Signed-off-by: abhi <[email protected]>
nishanttotla added a commit that referenced this pull request Jan 4, 2018
[Backport 17.06.5] #2468 Added required call to allocate VIPs when endpoints are restored