Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport 17.06.5] #2468 Added required call to allocate VIPs when endpoints are restored #2482

Merged
merged 1 commit into from
Jan 4, 2018

Conversation

abhi
Copy link
Contributor

@abhi abhi commented Jan 4, 2018

Since cherry pick doesnt apply cleanly, I manually backported #2468

Cherry Picked from #2468:
Addressing duplicate IP issue on ingress network from 5fd25d2
Added required call to allocate VIPs when endpoints are restored 2397ddf

<----Original Commit Message----->
Tracking down libnetwork/1790#issuecomment-308222053. The report is that if on a single node several services are started, and if this node is then rebooted, all the services appear to come back but some of them are no longer reachable.
On probing, the cause turned out to be an invalid assignment of IP addresses to services when they were restored. Specifically, the same IP address was assigned to one service's VIP and also a different service's endpoint. The result was that packets got delivered to the wrong container and caused symptoms like services or ports unreachable.
This is very likely to also be the cause of moby/#35675 and other duplicate-IP or overlapping IP issues.
The reason for this problem seems to be that the code path followed when services are restored, at no point contacts or informs IPAM about the IP addresses used as the restored service's VIP. So IPAM thinks that those IP addresses are still available and hands them out to endpoints and new services, causing the observed chaos.
To work out the right fix, I compared the code path when a service is created from the CLI to the code path when a service is restored on reboot. To me this fix is the bit that should have always been on the restore path but was omitted. With this fix IPAM gets correctly informed and it's state is consistent with what I see on the network.
I have tested this fix on a single node running several services and when there multiple nodes with multiple managers running many services (specifically 2 nodes and 2 managers). In both cases, without the fix a reboot would cause IP address overlaps on the ingress network. With the fix there are no overlaps.
While the fix seems to work, I'm not sure if it is at exactly the right point in this function, or indeed if it is the right or complete fix. Please take a look and let me know.

Signed-off-by: abhi [email protected]

Tracking down libnetwork/1790#issuecomment-308222053. The report is that if on a single node several services are started, and if this node is then rebooted, all the services appear to come back but some of them are no longer reachable.
On probing, the cause turned out to be an invalid assignment of IP addresses to services when they were restored. Specifically, the same IP address was assigned to one service's VIP and also a different service's endpoint. The result was that packets got delivered to the wrong container and caused symptoms like services or ports unreachable.
This is very likely to also be the cause of moby/#35675 and other duplicate-IP or overlapping IP issues.
The reason for this problem seems to be that the code path followed when services are restored, at no point contacts or informs IPAM about the IP addresses used as the restored service's VIP. So IPAM thinks that those IP addresses are still available and hands them out to endpoints and new services, causing the observed chaos.
To work out the right fix, I compared the code path when a service is created from the CLI to the code path when a service is restored on reboot. To me this fix is the bit that should have always been on the restore path but was omitted. With this fix IPAM gets correctly informed and it's state is consistent with what I see on the network.
I have tested this fix on a single node running several services and when there multiple nodes with multiple managers running many services (specifically 2 nodes and 2 managers). In both cases, without the fix a reboot would cause IP address overlaps on the ingress network. With the fix there are no overlaps.
While the fix seems to work, I'm not sure if it is at exactly the right point in this function, or indeed if it is the right or complete fix. Please take a look and let me know.

Signed-off-by: abhi <[email protected]>
@codecov
Copy link

codecov bot commented Jan 4, 2018

Codecov Report

Merging #2482 into bump_v17.06.5 will increase coverage by <.01%.
The diff coverage is 0%.

@@                Coverage Diff                @@
##           bump_v17.06.5    #2482      +/-   ##
=================================================
+ Coverage          60.56%   60.57%   +<.01%     
=================================================
  Files                121      121              
  Lines              20124    20126       +2     
=================================================
+ Hits               12189    12191       +2     
- Misses              6583     6598      +15     
+ Partials            1352     1337      -15

@dperny
Copy link
Collaborator

dperny commented Jan 4, 2018

LGTM. Small change, mostly test.

@nishanttotla nishanttotla merged commit 93d12c7 into moby:bump_v17.06.5 Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants