fix: remove veth pair in vm ns if previously leaked and fix validation #3940
base: master
Pull Request Overview
This PR fixes issues in transparent-vlan mode related to veth interface management and validation. It addresses leaked veth pairs in the VM namespace and corrects validation logic for moved interfaces.
- Proactively removes potentially leaked vnet and container veth interfaces before creating new ones
- Fixes validation logic to check the correct interface name in the correct namespace after moving veth interfaces
- Updates test cases to handle the new function signature and mock behavior
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| network/transparent_vlan_endpointclient_linux.go | Adds proactive cleanup of veth interfaces and fixes validation parameters |
| network/transparent_vlan_endpointclient_linux_test.go | Updates test calls and mock configuration for the modified validation function |
/azp run Azure Container Networking PR
Azure Pipelines successfully started running 1 pipeline(s).
33e3145 to 4db2645
lgtm
return ExecuteInNS(client.nsClient, client.vnetNSName, func() error {
_, ifDetectedErr := client.netioshim.GetNetworkInterfaceByName(client.vlanIfName)
return errors.Wrap(ifDetectedErr, "failed to get vlan veth in namespace")
return ExecuteInNS(client.nsClient, nsName, func() error {
What's the difference between passing nsName and using client.vnetNSName? We always pass client.vnetNSName as nsName. Is the client different in setLinkNetNSAndConfirm calls?
There isn't a difference right now, but if we later want to move a link to a namespace other than client.vnetNSName, we can do so without changing this code.
@@ -310,6 +310,16 @@ func (client *TransparentVlanEndpointClient) PopulateVM(epInfo *EndpointInfo) er
logger.Info("Failed to parse the mac address", zap.String("defaultHostVethHwAddr", defaultHostVethHwAddr))
}

// Proactively clean up any leftover veth interfaces before creating new ones
if err = client.netlink.DeleteLink(client.vnetVethName); err != nil {
logger.Info("Could not proactively clean up vnet veth",
How is this cleanup different from the one we have during the ADD call? Can it happen at this point too? Referring to this:
The above shouldn't be necessary since we already validate the existence of the veth interfaces on ADD, but it seems like sometimes the veth creation can pass validation, but then disappear for a short period of time before re-appearing, bypassing the cleanup logic during the add.
Curious whether we know why the interfaces disappear for a short period of time.
Yes, we believe the delete link could silently fail, but if that happens, the next call to create the veth pair will fail the ADD call (file exists... ), the container runtime will retry, and eventually the leaked interfaces should be deleted. We are investigating the reason for the disappearance with the Azure Linux team at the moment.
Reason for Change:
In transparent-vlan mode, removes the vnet veth interface and container veth interface in the vm namespace if they exist prior to creating the pair in the vm namespace. This won't disrupt existing connections because these pairs are one per container, and if either side of the veth pair were in the vm namespace, the container's networking would be broken. The vnet veth interface must be in the vnet namespace and the container veth interface must be in the container namespace in a working setup (otherwise it is broken and we need to clean up). Removing one side of the veth should remove the other.
The above shouldn't be necessary since we already validate the existence of the veth interfaces on ADD, but it seems like sometimes the veth creation can pass validation, but then disappear for a short period of time before re-appearing, bypassing the cleanup logic during the add.
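The cleanup-then-create behaviour described above can be sketched with an in-memory model. This is an illustration only, not the code in this PR: `fakeNetlink`, `AddVethPair`, `createVethWithCleanup` and the error values are hypothetical names, and the fake only approximates the kernel by failing with "file exists" when a name is already taken and by removing both ends when one end of a pair is deleted.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values for this sketch; the real netlink errors differ.
var (
	errLinkExists   = errors.New("file exists")
	errLinkNotFound = errors.New("link not found")
)

// fakeNetlink models the interfaces of a single namespace (assumption: the VM
// namespace). Each interface name maps to the name of its veth peer.
type fakeNetlink struct {
	peers map[string]string
}

func newFakeNetlink() *fakeNetlink {
	return &fakeNetlink{peers: map[string]string{}}
}

// AddVethPair fails if either end already exists, mimicking "file exists".
func (f *fakeNetlink) AddVethPair(name, peer string) error {
	for _, n := range []string{name, peer} {
		if _, ok := f.peers[n]; ok {
			return fmt.Errorf("add veth %s: %w", n, errLinkExists)
		}
	}
	f.peers[name] = peer
	f.peers[peer] = name
	return nil
}

// DeleteLink removes the named interface and its peer, since deleting one
// side of a veth pair removes the other.
func (f *fakeNetlink) DeleteLink(name string) error {
	peer, ok := f.peers[name]
	if !ok {
		return fmt.Errorf("delete %s: %w", name, errLinkNotFound)
	}
	delete(f.peers, name)
	delete(f.peers, peer)
	return nil
}

// createVethWithCleanup deletes any leftover interfaces with the target names
// on a best-effort basis, then creates the pair. Delete errors are only
// reported, because in the normal case there is nothing to clean up.
func createVethWithCleanup(nl *fakeNetlink, vnetVeth, containerVeth string) error {
	for _, name := range []string{vnetVeth, containerVeth} {
		if err := nl.DeleteLink(name); err != nil {
			fmt.Printf("could not proactively clean up %s: %v\n", name, err)
		}
	}
	return nl.AddVethPair(vnetVeth, containerVeth)
}
```

In this model, a plain `AddVethPair` against a namespace that still holds a leaked pair fails, while `createVethWithCleanup` succeeds because the first delete removes both leaked ends and the second delete is a harmless miss.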
Also fixes an improper validation check after moving the vnet veth into the vnet namespace (though this did not cause the issue that triggered this fix). Previously it checked the wrong interface name; now it checks the interface name passed in, within the namespace passed in.
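To illustrate why the validation has to use the same interface name and namespace as the move, here is a minimal sketch over an in-memory model. The `namespaces` type and the `moveLink`, `confirmLinkInNS` and `moveLinkAndConfirm` functions are hypothetical and are not the repository's implementation; the interface and namespace names in the usage note are made up.

```go
package main

import "fmt"

// namespaces is a hypothetical model: namespace name -> set of interface
// names present in that namespace.
type namespaces map[string]map[string]bool

// hasLink reports whether ifName exists in nsName.
func (n namespaces) hasLink(nsName, ifName string) bool {
	return n[nsName][ifName]
}

// moveLink moves ifName from fromNS to toNS in the model.
func (n namespaces) moveLink(ifName, fromNS, toNS string) error {
	if !n.hasLink(fromNS, ifName) {
		return fmt.Errorf("link %q not found in namespace %q", ifName, fromNS)
	}
	delete(n[fromNS], ifName)
	if n[toNS] == nil {
		n[toNS] = map[string]bool{}
	}
	n[toNS][ifName] = true
	return nil
}

// confirmLinkInNS checks the interface name passed in, in the namespace
// passed in, rather than a name or namespace fixed elsewhere.
func confirmLinkInNS(n namespaces, ifName, nsName string) error {
	if !n.hasLink(nsName, ifName) {
		return fmt.Errorf("link %q not found in namespace %q", ifName, nsName)
	}
	return nil
}

// moveLinkAndConfirm moves a link and then validates that exact link in the
// destination namespace.
func moveLinkAndConfirm(n namespaces, ifName, fromNS, toNS string) error {
	if err := n.moveLink(ifName, fromNS, toNS); err != nil {
		return err
	}
	return confirmLinkInNS(n, ifName, toNS)
}
```

With a model such as `namespaces{"vm": {"vnet-veth": true}, "vnet": {"vlan-if": true}}`, a check that always looks up `"vlan-if"` in `"vnet"` would pass whether or not `"vnet-veth"` was actually moved, which is the kind of mismatch the fix removes.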
Issue Fixed:
See above
Requirements:
Notes:
Tested on a multitenancy linux transparent vlan setup with no issues