[Bug]: WiFi / Network connectivity issues with 2.4.13+ #5458

Open
koliha opened this issue Nov 26, 2024 · 27 comments
Labels
bug Something isn't working

Comments

@koliha

koliha commented Nov 26, 2024

Category

WiFi

Hardware

T-Beam, Heltec V3, Station G2

Firmware Version

2.5.13

Description

After a period of time (6-18 hours, never longer than 24 hours), WiFi-enabled ESP32-based devices lose network connectivity (the wireless link stays up, but the device no longer passes traffic). I have tested this with 2.5.13 and 2.5.14, but not older 2.5.x builds. I'm going to set up a Heltec V3 for testing so I can capture console output. I'll post that here when I'm able to do so.

When the issue occurs:

  • You can de-auth the node or power off the access point that it's connected to and it reconnects to wifi (same or another AP)
  • When it reconnects you can see the node request DHCP and get a response (dhcp server logs)
  • The DHCP exchange is successfully sniffed by my Ubiquiti setup
  • Traffic logs (ubiquiti) show WiFi rx but no tx for the node during this issue

I am seeing this on:

  • Two different wireless networks using Ubiquiti hardware
  • One wireless network running off a Netgear Nighthawk
  • Station G2, Heltec V3, and a T-Beam

Relevant log output

No response

@koliha added the "bug" label Nov 26, 2024
@koliha
Author

koliha commented Nov 26, 2024

This is what I see from the WiFi side -- this pattern repeats.
Screenshot 2024-11-26 at 4 48 49 PM

It's not related to signal, here is a closer AP with the same pattern/behavior:
Screenshot 2024-11-26 at 4 42 43 PM

If I check my DHCP logs for today, I see DHCP lease renewals up until the WiFi drop at 2:57pm. The 3:16pm WiFi drop is seen as well, but there is nothing for the 3:24/26/27 reconnects.
Screenshot 2024-11-26 at 4 50 49 PM

Not sure if I can set a static IP, but that's my next step in attempting to troubleshoot.

@garthvh
Member

garthvh commented Nov 27, 2024

Are you connected to the public mqtt server? Do you have device logs?

@CTassisF
Contributor

Have you tried the latest "Bleeding" firmware? A bug affecting Wi-Fi connectivity (reported in #5387) was recently fixed in PR #5439.

@koliha
Author

koliha commented Nov 27, 2024

> Are you connected to the public mqtt server? Do you have device logs?

No device logs. Using a public MQTT server (used by tens of others), but not "the" public MQTT server. MQTT server resides on my local network (connecting via private address).

> Have you tried the latest "Bleeding" firmware? A bug affecting Wi-Fi connectivity (reported in #5387) was recently fixed in PR #5439.

This seems extremely relevant. Only using pre-built/published builds, so I'll probably wait for a release to test.

I have a static IP set as of 8pm on the problematic node and will report back the results. Based on #5387 I would assume that it's still going to occur with a static. Rather than try to set something up to collect logs, I'll probably wait out the next alpha to see if it solves.

@CTassisF
Contributor

There were recent changes to MQTT servers in "private" (RFC1918) networks (discussed in #5203) that might be related to your issue. However, if I were to make an educated guess, most (if not all) of these Wi-Fi issues were likely caused by the memory leaks fixed in MQTT::onReceive.

@koliha
Author

koliha commented Nov 27, 2024

> There were recent changes to MQTT servers in "private" (RFC1918) networks (discussed in #5203) that might be related to your issue. However, if I were to make an educated guess, most (if not all) of these Wi-Fi issues were likely caused by the memory leaks fixed in MQTT::onReceive.

I am assuming the same re: MQTT::onReceive. I'm using public DNS and a public IP, relying on a NAT mirroring rule to route traffic back inside, so the RFC1918 issue shouldn't have an impact, but it's still great info to know.
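
For anyone who wants to check quickly whether the broker address a node is configured with falls inside the RFC 1918 private ranges mentioned above, here is a minimal Python sketch. The helper name `is_rfc1918` is hypothetical, and the check only looks at a literal IPv4 address. It does not resolve hostnames and says nothing about how the firmware itself classifies addresses.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the literal IP address lies in an RFC 1918 block."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


if __name__ == "__main__":
    for candidate in ("192.168.5.85", "172.32.0.1", "8.8.8.8"):
        print(candidate, is_rfc1918(candidate))
```

Python's built-in `is_private` attribute covers more than RFC 1918 (loopback and link-local, for example), which is why the sketch lists the three blocks explicitly.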

@fifieldt
Contributor

Potentially also relevant: we were power saving even when wifi was connected. That's fixed now: #5443

@koliha
Author

koliha commented Nov 27, 2024

> Potentially also relevant: we were power saving even when wifi was connected. That's fixed now: #5443

Saw that but it's not related.

Static IP has kept it from falling off completely overnight, but it's disconnecting+reconnecting to wifi over and over and seems a little wonky overall (pretty sure it's not beaconing device metrics to MQTT consistently). 2 minutes of no ping responses, responds for 5-7 seconds, then back to no response.

I'll wait for a release that incorporates #5387 before I attempt to troubleshoot further.
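
A minimal Python sketch for timing the ping pattern described above (long stretches with no response, then a few seconds of replies). It assumes a Unix-like `ping` that accepts `-c 1`. The function names `host_responds`, `collect`, and `outage_windows` are hypothetical, not part of any Meshtastic tooling.

```python
import subprocess
import time
from typing import Iterable, List, Tuple


def host_responds(host: str, timeout_s: float = 3.0) -> bool:
    """Send a single ping; treat a timeout or a missing ping binary as 'down'."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def collect(host: str, count: int, interval_s: float = 1.0) -> List[Tuple[float, bool]]:
    """Probe the host `count` times and return (timestamp, reachable) samples."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), host_responds(host)))
        time.sleep(interval_s)
    return samples


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse samples into (start, end) windows during which the host was down.

    A window opens at the first failed sample and closes at the next successful
    one. A window still open at the end is closed at the last sample's timestamp.
    """
    windows = []
    start = None
    last_ts = None
    for ts, reachable in samples:
        if not reachable and start is None:
            start = ts
        elif reachable and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows
```

Running `outage_windows(collect("<node address>", 600))` over ten minutes would give a list of down periods that can be lined up against the access point and DHCP logs.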

@leshniak

I'm also experiencing the issue on 2.5.14.f2ee0df and latest beta. Waiting for the mentioned fixes.

@LowVoltagePirate

I've built the files from the master repo and can confirm that the issue seems fixed; my T-Beam has now been online for >48 hours without any issues.

@leshniak

I also decided to make my own build yesterday. So far so good: it maintains a stable WiFi connection, and heap usage stabilized at 90% after 24h.

image

@leshniak

leshniak commented Nov 29, 2024

Hmm, the same issue sometimes occurs while connecting to the node via the web interface and TCP. The device rebooted itself after a few minutes. No logs for now, unfortunately, due to the remote location.

I have ~70 nodes in my NodeDB, if that matters.

Edit: looks like all is fine if I connect to the node shortly after the reboot.

@CTassisF
Contributor

Meshtastic Firmware 2.5.15.79da236 Alpha was released with fixes for #5387.

@fifieldt
Contributor

Fixed by #5387

@Matzebhv

Matzebhv commented Dec 1, 2024

Had to reopen: firmware 2.5.15.79da236 does not fix this issue. My 2 Supremes are disconnecting after a couple of hours.
Same behavior as described here #5458 (comment)

@commanderts

Same problem here with a Heltec V3...

@koliha
Author

koliha commented Dec 2, 2024

Also seeing the same behavior with my Station G2s on 2.5.15. I will try to pull logs later today.

@thebentern reopened this Dec 2, 2024
@Xaositek

Xaositek commented Dec 2, 2024

I notice you're on a UniFi Network, here's my settings for you to compare against. I'm running UniFi Network 9.0.92, Firmware 6.7.9 and physical devices are two U6 Pros and a U6 In-Wall - I have not yet been able to reproduce any issues. I have a Heltec V3 and a T-Beam (Original) connected via WiFi, they run for weeks and zero issues; my Heltec V3 is on 2.5.15, flashed 3 days ago and no problems.

image

@koliha
Author

koliha commented Dec 2, 2024

> I notice you're on a UniFi Network, here's my settings for you to compare against. I'm running UniFi Network 9.0.92, Firmware 6.7.9 and physical devices are two U6 Pros and a U6 In-Wall - I have not yet been able to reproduce any issues. I have a Heltec V3 and a T-Beam (Original) connected via WiFi, they run for weeks and zero issues; my Heltec V3 is on 2.5.15, flashed 3 days ago and no problems.

Thanks for the reply. I'm seeing this on two fairly complex networks with Ubiquiti hardware as well as on a very simple cable modem + Netgear Nighthawk consumer router setup. Reportedly the issue does not occur if MQTT is disabled. Do you have your node connected via MQTT?

I'm currently timing the issue, re-checking the behavior post-patch, and capturing logs. After that I'll try turning MQTT off and see if it makes any difference. It would be nice to narrow it down. I highly doubt it's actually something to do with the code around networking.

@Xaositek

Xaositek commented Dec 2, 2024

Yes, it is connected via MQTT to my local MQTT server, not public MQTT.

It is using a DNS name which will internally resolve to a local IP.

@Matzebhv

Matzebhv commented Dec 2, 2024

It has nothing to do with UniFi or anything like that. No special setup here: one node is connected to my Fritzbox and the other is connected at my workplace to an Extreme enterprise AP. LongFast is uplink only, and 3 channels with moderate traffic are set up with uplink/downlink. If I disable MQTT, the node stays online.

@garthvh
Member

garthvh commented Dec 2, 2024

> It has nothing to do with UniFi or anything like that. No special setup here: one node is connected to my Fritzbox and the other is connected at my workplace to an Extreme enterprise AP. LongFast is uplink only, and 3 channels with moderate traffic are set up with uplink/downlink. If I disable MQTT, the node stays online.

If your mqtt is on the default topic that is likely too much traffic for the node to handle.

@leshniak

leshniak commented Dec 3, 2024

> If your mqtt is on the default topic that is likely too much traffic for the node to handle.

Mine is connecting to a local mosquitto instance (not bridged) and behaves the same way.

@koliha
Author

koliha commented Dec 3, 2024

> > If your mqtt is on the default topic that is likely too much traffic for the node to handle.
>
> Mine is connecting to a local mosquitto instance (not bridged) and behaves the same way.

Similar. One node is connecting to self-hosted mosquitto on LAN, the other is connecting to the same through the internet (NAT). Decently low traffic overall (way less than the official/dev server)

@leshniak

leshniak commented Dec 3, 2024

The attached log from mosquitto shows how unstable the connection is. Then I've triggered a reboot via LoRa and it's all gone.

I think this looks a bit weird:

1733217530: New connection from 192.168.5.85:55328 on port 1883.
1733217530: New client connected from 192.168.5.85:55328 as !2f93dc9c (p2, c1, k15).
1733217554: Client !2f93dc9c has exceeded timeout, disconnecting.
1733217561: New connection from 192.168.5.85:55329 on port 1883.
1733217561: New client connected from 192.168.5.85:55329 as !2f93dc9c (p2, c1, k15).
1733217584: Client !2f93dc9c has exceeded timeout, disconnecting.
1733217589: New connection from 192.168.5.85:50628 on port 1883.
1733217589: New client connected from 192.168.5.85:50628 as !2f93dc9c (p2, c1, k15).
1733217678: New connection from 192.168.5.85:51534 on port 1883.
1733217678: Client !2f93dc9c already connected, closing old connection.
1733217678: New client connected from 192.168.5.85:51534 as !2f93dc9c (p2, c1, k15).

I have a Wemos D1 Mini placed in the exact same location, with ESPHome firmware and it's super stable despite having a ~10 dBm weaker WiFi signal.

log.txt
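
A minimal Python sketch for measuring the pattern in the excerpt above, assuming the log lines have exactly the shape shown there and that the `k15` field is the client's keepalive in seconds (my reading of the mosquitto log format, not something confirmed in this thread). The names `Session` and `timed_out_sessions` are hypothetical. It pairs each "exceeded timeout" line with that client's most recent connect and reports how long the session lasted.

```python
import re
from typing import Iterable, List, NamedTuple

# "<ts>: New client connected from <ip:port> as <id> (p2, c1, k15)."
CONNECT_RE = re.compile(
    r"^(\d+): New client connected from \S+ as (\S+) \(p\d+, c\d+, k(\d+)\)\.$"
)
# "<ts>: Client <id> has exceeded timeout, disconnecting."
TIMEOUT_RE = re.compile(
    r"^(\d+): Client (\S+) has exceeded timeout, disconnecting\.$"
)


class Session(NamedTuple):
    client: str
    keepalive: int  # seconds, taken from the k<N> field
    lifetime: int   # seconds between connect and the broker-side timeout


def timed_out_sessions(lines: Iterable[str]) -> List[Session]:
    """Pair each keepalive timeout with the client's most recent connect."""
    last_connect = {}  # client id -> (connect timestamp, keepalive)
    sessions = []
    for raw in lines:
        line = raw.strip()
        match = CONNECT_RE.match(line)
        if match:
            last_connect[match.group(2)] = (int(match.group(1)), int(match.group(3)))
            continue
        match = TIMEOUT_RE.match(line)
        if match and match.group(2) in last_connect:
            connected_at, keepalive = last_connect.pop(match.group(2))
            lifetime = int(match.group(1)) - connected_at
            sessions.append(Session(match.group(2), keepalive, lifetime))
    return sessions


# The first six lines of the excerpt above.
SAMPLE = """\
1733217530: New connection from 192.168.5.85:55328 on port 1883.
1733217530: New client connected from 192.168.5.85:55328 as !2f93dc9c (p2, c1, k15).
1733217554: Client !2f93dc9c has exceeded timeout, disconnecting.
1733217561: New connection from 192.168.5.85:55329 on port 1883.
1733217561: New client connected from 192.168.5.85:55329 as !2f93dc9c (p2, c1, k15).
1733217584: Client !2f93dc9c has exceeded timeout, disconnecting.
""".splitlines()

if __name__ == "__main__":
    for session in timed_out_sessions(SAMPLE):
        print(session)
```

On these lines the two timed-out sessions last 24 and 23 seconds. The MQTT specification lets a broker drop a client that sends nothing for 1.5 times the keepalive, which is 22.5 seconds for a 15-second keepalive. The timing therefore suggests, without proving it, that the broker received no packets from the node after CONNECT, and that would match the rx-but-no-tx observation on the WiFi side.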

@garthvh
Member

garthvh commented Dec 3, 2024

Is the mqtt JSON functionality being used? Who is hosting the private brokers?

@koliha
Author

koliha commented Dec 3, 2024

> Is the mqtt JSON functionality being used? Who is hosting the private brokers?

Not in any of my use cases.

10 participants