fix: remove veth pair in vm ns if previously leaked and fix validation #3940

Open
wants to merge 5 commits into
base: master

@QxBytes QxBytes commented Aug 15, 2025

Reason for Change:

In transparent-vlan mode, this change removes the vnet veth interface and the container veth interface from the VM namespace, if they exist, before creating the pair in the VM namespace. This won't disrupt existing connections: there is one such pair per container, and if either side of the pair were still in the VM namespace, that container's networking would already be broken. In a working setup the vnet veth interface must be in the vnet namespace and the container veth interface must be in the container namespace; anything else is a broken state that we need to clean up. Removing one side of the veth pair should remove the other.

The above shouldn't be necessary since we already validate the existence of the veth interfaces on ADD, but it seems like sometimes the veth creation can pass validation, but then disappear for a short period of time before re-appearing, bypassing the cleanup logic during the add.
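
A minimal sketch of the proactive cleanup described above, written against an in-memory fake rather than the real netlink client. The `linkClient` interface, the `fakeLinks` type, the error values, and `createVethPair` are all hypothetical names introduced for illustration; only the idea (delete both names first, log failures, then create the pair) comes from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors for the in-memory fake below.
var (
	errNotFound = errors.New("link not found")
	errExists   = errors.New("file exists")
)

// linkClient is an assumed, minimal interface; the real netlink client in
// the repository may look different.
type linkClient interface {
	DeleteLink(name string) error
	AddVethPair(name, peer string) error
}

// fakeLinks models one namespace: each interface name maps to its veth peer.
type fakeLinks struct{ peers map[string]string }

// DeleteLink removes the named interface and, as with a veth pair,
// its peer along with it.
func (f *fakeLinks) DeleteLink(name string) error {
	peer, ok := f.peers[name]
	if !ok {
		return errNotFound
	}
	delete(f.peers, name)
	delete(f.peers, peer)
	return nil
}

// AddVethPair fails if either name is already taken.
func (f *fakeLinks) AddVethPair(name, peer string) error {
	if _, ok := f.peers[name]; ok {
		return errExists
	}
	if _, ok := f.peers[peer]; ok {
		return errExists
	}
	f.peers[name], f.peers[peer] = peer, name
	return nil
}

// createVethPair deletes any leftover interfaces first. Deletion errors are
// only logged, because "not found" is the expected case on a clean node.
func createVethPair(c linkClient, vnetVeth, containerVeth string) error {
	for _, name := range []string{vnetVeth, containerVeth} {
		if err := c.DeleteLink(name); err != nil {
			fmt.Printf("could not proactively clean up %s: %v\n", name, err)
		}
	}
	return c.AddVethPair(vnetVeth, containerVeth)
}
```

In this sketch a leaked pair makes a bare `AddVethPair` fail with "file exists", while `createVethPair` succeeds because the stale interfaces are removed first.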

Also fixes an improper validation check after moving the vnet veth into the vnet namespace (though this did not cause the issue that triggered this fix). Previously it checked the wrong interface name; now it checks the interface name passed in, within the namespace passed in.

Issue Fixed:

See above

Requirements:

Notes:
Tested on a multitenancy Linux transparent-vlan setup with no issues.

@QxBytes QxBytes self-assigned this Aug 15, 2025
@QxBytes QxBytes added the cni, fix, and multitenancy labels Aug 15, 2025
@QxBytes QxBytes requested a review from Copilot August 15, 2025 20:04
@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes issues in transparent-vlan mode related to veth interface management and validation. It addresses leaked veth pairs in the VM namespace and corrects validation logic for moved interfaces.

  • Proactively removes potentially leaked vnet and container veth interfaces before creating new ones
  • Fixes validation logic to check the correct interface name in the correct namespace after moving veth interfaces
  • Updates test cases to handle the new function signature and mock behavior

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
network/transparent_vlan_endpointclient_linux.go | Adds proactive cleanup of veth interfaces and fixes validation parameters
network/transparent_vlan_endpointclient_linux_test.go | Updates test calls and mock configuration for the modified validation function

QxBytes commented Aug 15, 2025

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

@QxBytes QxBytes marked this pull request as ready for review August 15, 2025 21:11
@QxBytes QxBytes requested a review from a team as a code owner August 15, 2025 21:11
@QxBytes QxBytes requested review from nairashu and behzad-mir August 15, 2025 21:11
@QxBytes QxBytes force-pushed the alew/transparent-vlan-clean-vnet-container-nic branch from 33e3145 to 4db2645 Compare August 15, 2025 22:34
behzad-mir previously approved these changes Aug 20, 2025
@behzad-mir behzad-mir left a comment

lgtm

-	return ExecuteInNS(client.nsClient, client.vnetNSName, func() error {
-		_, ifDetectedErr := client.netioshim.GetNetworkInterfaceByName(client.vlanIfName)
-		return errors.Wrap(ifDetectedErr, "failed to get vlan veth in namespace")
+	return ExecuteInNS(client.nsClient, nsName, func() error {
Contributor

What's the difference between passing nsName and using client.vnetNSName? We still always pass client.vnetNSName as nsName. Is the client different in the setLinkNetNSAndConfirm calls?

Contributor Author

There isn't a difference right now, but if we later want to move a link to a namespace other than client.vnetNSName, we can do so without changing this code.
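
To illustrate the point about parameterizing the namespace, here is a rough sketch using an in-memory model instead of real network namespaces. The `host` and `netNS` types, `moveLink`, and the three-argument signature of `setLinkNetNSAndConfirm` are assumptions made for this example; the real method's signature is not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// netNS is a hypothetical in-memory namespace: a set of interface names.
type netNS map[string]bool

// host maps namespace names to namespaces; it stands in for the kernel.
type host map[string]netNS

var errNoSuchLink = errors.New("no such network interface")

// moveLink moves ifName from namespace `from` to namespace `to`.
func (h host) moveLink(ifName, from, to string) error {
	if !h[from][ifName] {
		return errNoSuchLink
	}
	delete(h[from], ifName)
	if h[to] == nil {
		h[to] = netNS{}
	}
	h[to][ifName] = true
	return nil
}

// setLinkNetNSAndConfirm moves ifName into nsName and then confirms that
// the same ifName is visible in the same nsName. Both the name and the
// namespace come from the arguments, not from fixed fields on a client.
func (h host) setLinkNetNSAndConfirm(ifName, from, nsName string) error {
	if err := h.moveLink(ifName, from, nsName); err != nil {
		return fmt.Errorf("failed to move %s into %s: %w", ifName, nsName, err)
	}
	if !h[nsName][ifName] {
		return fmt.Errorf("failed to find %s in namespace %s: %w", ifName, nsName, errNoSuchLink)
	}
	return nil
}
```

Because the confirmation step uses the arguments it was given, the same function can target a different namespace later without any change to its body.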

@@ -310,6 +310,16 @@ func (client *TransparentVlanEndpointClient) PopulateVM(epInfo *EndpointInfo) er
 		logger.Info("Failed to parse the mac address", zap.String("defaultHostVethHwAddr", defaultHostVethHwAddr))
 	}
 
+	// Proactively clean up any leftover veth interfaces before creating new ones
+	if err = client.netlink.DeleteLink(client.vnetVethName); err != nil {
+		logger.Info("Could not proactively clean up vnet veth",
Contributor

How is this cleanup different from the one we have during the ADD call? Couldn't the same thing happen at this point too? Referring to this:

The above shouldn't be necessary since we already validate the existence of the veth interfaces on ADD, but it seems like sometimes the veth creation can pass validation, but then disappear for a short period of time before re-appearing, bypassing the cleanup logic during the add.

Curious to know if we know the reason for disappearing for a short period of time.

Contributor Author

Yes, we believe the delete link could silently fail. If that happens, the next call to create the veth pair will fail the ADD call (file exists... ), the container runtime will retry, and eventually the leaked interfaces should be deleted. We are investigating the reason for the disappearance with the Azure Linux team at the moment.
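
The retry behavior described in this reply can be modeled with a small simulation. Everything here is hypothetical: `flakyHost`, its fields, and `addWithRetry` are invented for illustration and only mimic the sequence "delete silently fails, create returns file exists, runtime retries".

```go
package main

import "errors"

var errFileExists = errors.New("file exists")

// flakyHost is a hypothetical model of a node with a leaked veth.
// deleteFailures is how many upcoming deletes silently do nothing.
type flakyHost struct {
	leaked         bool
	deleteFailures int
}

// deleteLink reports success even when it did not remove the leaked
// interface, which is the "silent failure" case.
func (h *flakyHost) deleteLink() error {
	if h.deleteFailures > 0 {
		h.deleteFailures--
		return nil
	}
	h.leaked = false
	return nil
}

// createVeth fails while the leaked interface still occupies the name.
func (h *flakyHost) createVeth() error {
	if h.leaked {
		return errFileExists
	}
	return nil
}

// add is one ADD attempt: proactive cleanup, then creation.
func (h *flakyHost) add() error {
	_ = h.deleteLink() // best effort; a failure here is not fatal
	return h.createVeth()
}

// addWithRetry plays the role of the container runtime, which retries a
// failed ADD. It returns the number of attempts made and the last error.
func addWithRetry(h *flakyHost, maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = h.add(); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}
```

In this model, a host whose first two deletes silently fail succeeds on the third ADD attempt; if retries run out first, the caller sees the "file exists" error.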
