
UDP QUIC Hole Punching Fails due to Port Reuse #3165

Open
ethan-gallant opened this issue Feb 1, 2025 · 2 comments

Comments

@ethan-gallant

We're leveraging libp2p's hole punching functionality to mesh together several private networks. The network layout looks something like this:

We have multiple apps running in Kubernetes clusters. Each cluster has:

  • 1 Relay Node with:
    • A public address with port 443 open
    • A private address with port 443 also open (for local peers to register with the Relay for relaying)
  • 2 Peer Nodes with:
    • A public address, different from the Relay's, where ports cannot be manually opened or forwarded
    • A private address on the same network as the cluster's Relay

The diagram below visualizes how this looks in practice:

```mermaid
flowchart TD
  subgraph Cluster1
    R1["Relay Node<br>Public: 443 open<br>Private: 443 open"]
    P1["Peer Node 1<br>Public: no ports<br>Private: 443"]
    P2["Peer Node 2<br>Public: no ports<br>Private: 443"]
    P1 -- "register via Private:443" --> R1
    P2 -- "register via Private:443" --> R1
  end

  subgraph Cluster2
    R2["Relay Node<br>Public: 443 open<br>Private: 443 open"]
    P3["Peer Node 1<br>Public: no ports<br>Private: 443"]
    P4["Peer Node 2<br>Public: no ports<br>Private: 443"]
    P3 -- "register via Private:443" --> R2
    P4 -- "register via Private:443" --> R2
  end

  subgraph Cluster3
    R3["Relay Node<br>Public: 443 open<br>Private: 443 open"]
    P5["Peer Node 1<br>Public: no ports<br>Private: 443"]
    P6["Peer Node 2<br>Public: no ports<br>Private: 443"]
    P5 -- "register via Private:443" --> R3
    P6 -- "register via Private:443" --> R3
  end
```

The connection flow looks something like below:

  1. The peers connect locally over the network to each other:

```mermaid
graph TD
    subgraph "Initial State & Local Discovery"
        subgraph "Cluster A"
            PA1[Peer A1]
            PA2[Peer A2]
            PA1 --- PA2
        end
        subgraph "Cluster B"
            PB1[Peer B1]
            PB2[Peer B2]
            PB1 --- PB2
        end
    end
```
  2. The peers connect to each other through the Relays:

```mermaid
graph TD
    subgraph "Relay Connection Phase"
        subgraph "Cluster A"
            PA1[Peer A1]
            PA2[Peer A2]
            RA[Relay A]
        end
        subgraph "Cluster B"
            PB1[Peer B1]
            PB2[Peer B2]
            RB[Relay B]
        end
        PA1 --> RA
        PA2 --> RA
        RA --> RB
        RB --> PB1
        RB --> PB2
    end
```
  3. They use DCUtR to establish direct P2P connections:

```mermaid
graph TD
    subgraph "Final Direct P2P State"
        subgraph "Cluster A"
            PA1[Peer A1]
            PA2[Peer A2]
            PA1 --- PA2
        end
        subgraph "Cluster B"
            PB1[Peer B1]
            PB2[Peer B2]
            PB1 --- PB2
        end
        PA1 --- PB1
        PA1 --- PB2
        PA2 --- PB1
        PA2 --- PB2
    end
```

The problem we're experiencing is that during the hole punch from the Cluster A peers to the Cluster B peers, we need the hole punch to happen over a fixed range of ports, since port 443 does not work for hole punching in our environment.

We've ensured that libp2p is listening on these ports and added a filter function so that both peers agree on the same port range (40000-40050). However, when observing the traffic leaving both peer pods, it originates from port 443.

```mermaid
graph TD
    subgraph "Cluster A"
        A1["Peer A1<br>Outbound: \"443\""]
        A2["Peer A2<br>Outbound: \"443\""]
    end
    subgraph "Cluster B"
        B1["Peer B1<br>Inbound: \"40000-40050\""]
        B2["Peer B2<br>Inbound: \"40000-40050\""]
    end
    A1 -- "443 -> 40000-40050" --> B1
    A1 -- "443 -> 40000-40050" --> B2
    A2 -- "443 -> 40000-40050" --> B1
    A2 -- "443 -> 40000-40050" --> B2
```

This means the hole punch fails, since each peer's outbound port does not match the inbound port the other side expects.

We were able to work around the bug by removing this particular line:

```go
ctx = quicreuse.WithAssociation(ctx, t)
```

@MarcoPolo mentioned in another issue that this is worth filing a proper bug to track, as PR #2936 appears to be the root cause.

Version Information

@MarcoPolo
Collaborator

Thank you for taking the time to file this and include nice diagrams. A couple of questions:

  • Can you share the ListenAddrs or ListenAddrStrings you use to configure the Peers?
  • Does disabling reuse port in QUIC solve this for you? (e.g. by passing libp2p.QUICReuse(quicreuse.NewConnManager, quicreuse.DisableReuseport()) into the constructor)
  • Can you explain why port 443 doesn't work for holepunching in your setup?
  • Does this setup have a NAT? Do all peers in the same cluster share the same public IP address that is NAT'd?
  • Are you actually hole punching, or are you simply dialing back a peer on its public IP+port?
    • In other words, when it works does only this code get run, or does it fall through to the hole punching logic below (L132)?
    • Could you please add some logging here and here to see what observed addresses are being shared?
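For reference, the option suggested in the second question above would be wired into the host constructor roughly as follows. This is a configuration sketch, not verified against a specific go-libp2p version; the listen address is illustrative:

```go
package main

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/p2p/transport/quicreuse"
)

func main() {
	// Disable QUIC port reuse so dials don't share the listener's socket.
	// Option names follow the maintainer's suggestion above.
	host, err := libp2p.New(
		libp2p.QUICReuse(quicreuse.NewConnManager, quicreuse.DisableReuseport()),
		libp2p.ListenAddrStrings("/ip4/0.0.0.0/udp/443/quic-v1"),
	)
	if err != nil {
		panic(err)
	}
	defer host.Close()
}
```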

@ethan-gallant
Author

ethan-gallant commented Feb 2, 2025

Thank you for taking the time to file this and include nice diagrams. A couple of questions:

  • Can you share the ListenAddrs or ListenAddrStrings you use to configure the Peers?
    Here is the output from the Peers. I use the ListenAddrsFactory to filter out the hole punch addresses, to avoid attempting a direct connection when it's known the connection will fail without hole punching.

Peer A:

My Listen Addrs are: [/ip4/10.42.211.60/udp/443/quic-v1 /ip4/10.42.211.60/udp/443/quic-v1/webtransport /ip4/10.42.211.60/udp/42060/quic-v1 /ip4/10.42.211.60/udp/42060/quic-v1/webtransport /ip4/10.42.211.60/udp/42073/quic-v1 /ip4/10.42.211.60/udp/42073/quic-v1/webtransport /ip4/10.42.211.60/udp/42086/quic-v1 /ip4/10.42.211.60/udp/42086/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_A]/udp/42060/quic-v1 /ip4/[REDACTED: PUBLIC_IP_A]/udp/42060/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_A]/udp/42073/quic-v1 /ip4/[REDACTED: PUBLIC_IP_A]/udp/42073/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_A]/udp/42086/quic-v1 /ip4/[REDACTED: PUBLIC_IP_A]/udp/42086/quic-v1/webtransport]

My Factory Addrs are: [/ip4/10.42.211.60/udp/443/quic-v1 /ip4/10.42.211.60/udp/443/quic-v1/webtransport/certhash/uEiA7jU2InPtJjAsh3yjlJYlAnkUGhHzSs4TmBD7EEUTVJQ/certhash/uEiB7X1De7JiClvw8AN5YScWeOFoS_D-qV7o3LxNfj_XwGg]

Peer B:

My Listen Addrs are: [/ip4/10.49.210.19/udp/443/quic-v1 /ip4/10.49.210.19/udp/443/quic-v1/webtransport /ip4/10.49.210.19/udp/42060/quic-v1 /ip4/10.49.210.19/udp/42060/quic-v1/webtransport /ip4/10.49.210.19/udp/42073/quic-v1 /ip4/10.49.210.19/udp/42073/quic-v1/webtransport /ip4/10.49.210.19/udp/42086/quic-v1 /ip4/10.49.210.19/udp/42086/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_B]/udp/42060/quic-v1 /ip4/[REDACTED: PUBLIC_IP_B]/udp/42060/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_B]/udp/42073/quic-v1 /ip4/[REDACTED: PUBLIC_IP_B]/udp/42073/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_B]/udp/42086/quic-v1 /ip4/[REDACTED: PUBLIC_IP_B]/udp/42086/quic-v1/webtransport]

My Factory Addrs are: [/ip4/10.49.210.19/udp/443/quic-v1 /ip4/10.49.210.19/udp/443/quic-v1/webtransport/certhash/uEiCqvguXff8aIGikGSNAIuGWRsv8bhWeBwMeV42KwEoZlw/certhash/uEiD56kV9I10sQH92dzH9huWLMGYXE62n9ZX6Wxo6AcHxYQ]
  • Does disabling reuse port in QUIC solve this for you? (e.g. by passing libp2p.QUICReuse(quicreuse.NewConnManager, quicreuse.DisableReuseport()) into the constructor)

Unfortunately it does not appear so

  • Can you explain why port 443 doesn't work for holepunching in your setup?

A few reasons:

  • AWS hole punching tends to work better on high port numbers (40000+). I've found this range is much more reliable and consistent for hole punching on EC2 and OVH Cloud.
  • Since many peers can be behind the same NAT, I've found that using many ports in a higher, less traffic-congested range increases the chances of a successful punch.
  • Does this setup have a NAT?

Yes. Each Peer in this case gets scheduled onto a Kubernetes Node (an EC2 instance) that is part of the cluster.

  • Do all peers in the same cluster share the same public IP address that is NAT'd?

No, but some peers in the cluster might be scheduled behind the same NAT, since the NAT is per EC2 instance and many peers can be scheduled onto one instance.

  • Are you actually hole punching, or are you simply dialing back a peer on its public IP+port?

    • In other words, when it works does only this code get run, or does it fall through to the hole punching logic below (L132)?

I did confirm, based on the logging, that the hole punching logic is called:

2025-02-02T18:23:22.137Z	DEBUG	p2p-holepunch	holepunch/holepuncher.go:130	got inbound proxy conn	{"peer": "12D3KooWGzGM12WyWALYUJmrGoeEb6trNRzoRXrcWQkxNWvWfWHj"}
  • Could you please add some logging here and here to see what observed address are being shared?

Added. It should be noted that I have a filter function which forwards only the addresses that will work for hole punching. I've added the relevant logs below:

Peer A:

[OwnAddrs] Own addrs post-filter [/ip4/[REDACTED: PUBLIC_IP_A]/udp/42060/quic-v1 /ip4/[REDACTED: PUBLIC_IP_A]/udp/42060/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_A]/udp/42073/quic-v1 /ip4/[REDACTED: PUBLIC_IP_A]/udp/42073/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_A]/udp/42086/quic-v1 /ip4/[REDACTED: PUBLIC_IP_A]/udp/42086/quic-v1/webtransport]

[ObsAddrs] Post-filtered addresses for hole punch: [/ip4/[REDACTED: PUBLIC_IP_A]/udp/42060/quic-v1 /ip4/[REDACTED: PUBLIC_IP_A]/udp/42060/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_A]/udp/42073/quic-v1 /ip4/[REDACTED: PUBLIC_IP_A]/udp/42073/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_A]/udp/42086/quic-v1 /ip4/[REDACTED: PUBLIC_IP_A]/udp/42086/quic-v1/webtransport]

Peer B:

[OwnAddrs] Own addrs post-filter [/ip4/[REDACTED: PUBLIC_IP_B]/udp/42060/quic-v1 /ip4/[REDACTED: PUBLIC_IP_B]/udp/42060/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_B]/udp/42073/quic-v1 /ip4/[REDACTED: PUBLIC_IP_B]/udp/42073/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_B]/udp/42086/quic-v1 /ip4/[REDACTED: PUBLIC_IP_B]/udp/42086/quic-v1/webtransport]

[ObsAddrs] Post-filtered addresses for hole punch: [/ip4/[REDACTED: PUBLIC_IP_B]/udp/42060/quic-v1 /ip4/[REDACTED: PUBLIC_IP_B]/udp/42060/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_B]/udp/42073/quic-v1 /ip4/[REDACTED: PUBLIC_IP_B]/udp/42073/quic-v1/webtransport /ip4/[REDACTED: PUBLIC_IP_B]/udp/42086/quic-v1 /ip4/[REDACTED: PUBLIC_IP_B]/udp/42086/quic-v1/webtransport]
