State consistency and "stuck" states #274

iay · 2024-12-25T12:29:23Z

I am seeing some inconsistencies in reported states in Home Assistant similar to some of the things @clintebbesen has reported. I'm making a separate issue because I don't want to assume that we have the same underlying problem. Maybe we do, maybe we don't.

Sometimes I will see collections of states reported in Home Assistant which are essentially impossible to reconcile with the code. For example, the lambda that drives all of this will only set a target's active state to false at the same time it sets angle to 0... and yet, sometimes I will see active as false and all other states for that target as 0 except the angle. This implies that the state in ESPHome is probably not the same as the state in Home Assistant.

My working theory for why this might be happening is that digging into the ESPHome stack it looks like if it can't report an update without blocking, the update is sometimes just discarded. There will be a log message to that effect: Cannot send message because of TCP buffer space but that won't be visible remotely because there's, um, no buffer space to put it in. But it's probably visible on the local UART.

If an update is discarded, Home Assistant won't get it and the states will be inconsistent. That missed update will never be re-sent, as far as I can tell. The Home Assistant state will of course be updated the next time the EPL updates the state, but the lambda is careful to never do this if it doesn't think the state has changed.

This doesn't affect things like luminance because they always update every minute. The template entities used for target states don't do that. I think this can be tested by setting up a couple of buttons to explicitly publish the entities for a target, waiting for the target to become inconsistent and then pushing the button to see if that makes things consistent again. I'll report once I have had a chance to try this.

The text was updated successfully, but these errors were encountered:

iay · 2024-12-25T22:11:56Z

Added this to my config:

button:
  - platform: template
    name: Republish target 1
    on_press:
      - lambda: |-
          id(target1_x).publish_state(id(target1_x).state);
          id(target1_y).publish_state(id(target1_y).state);
          id(target1_speed).publish_state(id(target1_speed).state);
          id(target1_resolution).publish_state(id(target1_resolution).state);
          id(target1_distance).publish_state(id(target1_distance).state);
          id(target1_angle).publish_state(id(target1_angle).state);
          id(target1_active).publish_state(id(target1_active).state);
    entity_category: "diagnostic"

  - platform: template
    name: Republish occupancy
    on_press:
      - lambda: |-
          id(occupancy).publish_state(id(occupancy).state);
          id(zone1_occupancy).publish_state(id(zone1_occupancy).state);
          id(zone2_occupancy).publish_state(id(zone2_occupancy).state);
          id(zone3_occupancy).publish_state(id(zone3_occupancy).state);
          id(zone4_occupancy).publish_state(id(zone4_occupancy).state);
    entity_category: "diagnostic"

Now waiting for the issue to recur.

iay · 2025-01-02T14:47:15Z

I'm seeing another syndrome in which targets are detected in zone 1, so zone 1 target count is non-zero, but zone 1 occupancy (and overall occupancy) is clear. This also seems like it should be impossible.

I tried my "fix occupancy" button but it didn't help. I note that the occupancy sensors have delayed_off filters so my lambda is probably wrong: I should be republishing raw_state and not state, as the latter might be different if the internal timer hasn't expired yet. I will adjust my test code for another attempt.

Meanwhile, rebooting the device has made the issue hide again for now.

iay · 2025-01-02T14:49:18Z

Hm. Because they are binary sensors they don't have raw_state in the way that ordinary ones do. I guess I will have to read the ESPHome code again.

iay · 2025-01-02T15:06:52Z

Looks like binary_sensor aggressively de-duplicates published states very early in the process, discarding published states before even passing them to filters. So if one gets lost, it has to change away and back again before that will be picked up. I think this might force publishing the current state to HA, though, so I'm trying that next:

  - platform: template
    name: Fix occupancy
    on_press:
      - lambda: |-
          id(occupancy).send_state_internal(id(occupancy).state, false);
          id(zone1_occupancy).send_state_internal(id(zone1_occupancy).state, false);
          id(zone2_occupancy).send_state_internal(id(zone2_occupancy).state, false);
          id(zone3_occupancy).send_state_internal(id(zone3_occupancy).state, false);
          id(zone4_occupancy).send_state_internal(id(zone4_occupancy).state, false);
    entity_category: "diagnostic"

MenesesPT · 2025-01-02T21:19:45Z

First of all, what a great find! 🕵️

I set my debug level to Verbose in order to test your theory. And after a few hours and using USB, I finally managed to capture some (3) Cannot send message because of TCP buffer space messages.

I suppose that your last template should pretty much confirm if this is the problem related to the out-of-sync values (if the tcp buffer manages to free itself a bit).
In my case, it didn't happen to the occupancy values, so I could not test it, and it hasn't happened since... I'll keep monitoring it to also try to test it.

Maybe increasing the log level from Debug to Info in production would reduce the possibility of this happening, since I suspect the buffer is getting full of the log messages? However, it would probably make troubleshooting users' problems harder.
And/or force an occupancy state every minute or so to avoid/minimize the damage caused by it getting out of sync?

[D][bh1750.sensor:159]: 'Illuminance': Got illuminance=0.7lx
[V][sensor:043]: 'Illuminance': Received new state 0.679134
[D][sensor:093]: 'Illuminance': Sending state 0.67913 lx with 1 decimals of accuracy
[V][esp32_ble:348]: (BLE) gap_event_handler - 3
[V][bluetooth_proxy:058]: Proxying raw packet from 08:D1:F9:00:11:22, length 3. RSSI: -67 dB
[V][bluetooth_proxy:061]: Proxying 1 packets
[V][api.connection:1534]: Cannot send message because of TCP buffer space
[V][sensor:043]: 'Target 1 X': Received new state 249.000000
[D][sensor:093]: 'Target 1 X': Sending state 249.00000 mm with 0 decimals of accuracy
[V][api.connection:1534]: Cannot send message because of TCP buffer space
[V][sensor:043]: 'Target 1 Y': Received new state 1581.000000
[D][sensor:093]: 'Target 1 Y': Sending state 1581.00000 mm with 0 decimals of accuracy
[V][api.connection:1534]: Cannot send message because of TCP buffer space
[V][sensor:043]: 'Target 1 Distance': Received new state 1601.439697
[D][sensor:093]: 'Target 1 Distance': Sending state 1601.43970 mm with 0 decimals of accuracy
[V][sensor:043]: 'Target 1 Angle': Received new state -33.977837
[D][sensor:093]: 'Target 1 Angle': Sending state -33.97784 ° with 0 decimals of accuracy

iay · 2025-01-02T21:41:44Z

Thanks for the confirmation that the "buffer space" thing is happening in real life. I can't get USB logs off my unit where it has to live, so I haven't been able to confirm it at my end yet, it's been purely theoretical for me so far. Even now I guess it's possible we're not seeing the same things.

I did suspect that the occupancy values might be more likely to see this issue because they are set and published towards the end of the lambda. So I thought perhaps they might be most likely to be dropped. What you're seeing is interesting, though, as it's not those: it's happening quite early in the lambda and then apparently recovering for later values.

As to how one might fix this (if this is the cause, and I'm not 100% on that yet) one option would be to have a way of not publishing all of the rapidly varying data for target tracking, or to drastically slow it down. At the moment, there's a throttle in the code that means it won't happen more than (IIRC) four times a second but honestly I personally don't need any positional information except when drawing zones in the UI and even then not at anything like that rate. However, if I understand the code correctly all of that data is always being sent on every change, even if the Home Assistant host is just going to ignore it because the entity is disabled. Perhaps some way to modify the data rate and/or mask publication of the various target entities would help reduce the data rate to the point where failures went away in normal use.

One other thing I'll note from your logs is that the event immediately prior to the sensor update that causes the first drop is very, very suspicious. I deduce that you have the Bluetooth proxy running, and I do as well. I do wonder whether proxying packets across causes a disruption that lets data get backed up and the buffer filled. In that case this might happen more in situations where there's more Bluetooth traffic to proxy which might account for the variations. I guess just turning off the proxy and seeing whether the problem went away might be a valuable experiment (I'm leaning towards using some more cheap ESP32 dev boards as dedicated proxies anyway, although at the moment my EPL is doing really important work proxying my Bluetooth fridge thermometer...)

Any thoughts welcome.

MenesesPT · 2025-01-03T11:41:38Z

I've been for the past couple weeks trying to recreate the problem of entities not being coherent (the problem you mentioned) without much luck. It had only happened once to me (I've had an automation set to notify me of it in case I missed it) during Christmas Eve. At that time, I had more people home than normal, so the 2.4GHz Wi-Fi and Bluetooth would be a little more crowded. When I saw your theory I decided to test it via USB but wasn't getting anywhere as well. So yesterday I enabled a few BLE transmitters and switched to using Wi-Fi on the devices I could, and after an hour or so I managed to get the messages on the logs above.

My EPL receive great Wi-Fi coverage, maybe moving it to a spot where transmissions might fail more often might increase this problem since the buffer doesn't get cleared so fast. I agree with you and suspect that transmitting all the Bluetooth proxy data might also play a big role in this.

Moreover, prior to this for the past couple months I've been using a modified version that run the occupancy code without throttling (every 100ms), but was only updating the position sensors every second (angle, speed and resolution were removed) and I've never noticed the issue.

I'll continue testing whenever I get time to try to get more info on this (hopefully the problem resurges), and I am also waiting for a new unit to arrive that will be installed farther away from the access point so that one might face this problem more often.

I'll try to test using/not using proxy; throttling less/more the sensor updates; hindering the Wi-Fi communication. Am I missing something?

iay · 2025-01-03T12:13:49Z

I think that list covers the things I can think of that might affect this.

I'm particularly interested in the proxy/not-proxy axis. I've had good success with bundling the proxy functionality in with everything else in the past, but the firmware has a few more sensors now and maybe more data to shift so it might be more sensitive to disruption.

I'm not aware of any way to get things like packet retransmission counts out of the network stack, but it might be down there somewhere and shed some light on what's happening. The reason that I'm interested in the Bluetooth component is that even with small amounts of Bluetooth proxy traffic, the way the radio in the ESP32 works might cause issues even without bad network connectivity: as I understand it, the ESP's radio can only work with either BTLE or WiFi at any given time; when it's listening for BTLE packets to proxy, it's entirely blind to WiFi and vice versa. So just looking for packets to proxy disrupts the transmission of data packets to Home Assistant, or of receiving acknowledgements for data already sent.

Of course the protocols would normally correct for those losses by retransmitting, but there seems to be part of the stack that seems to feel that if it can't find space right now then it shouldn't in fact wait around.

One interpretation of your log data is that the Illuminance data was sent but the acknowledgement came back while the radio was looking for Bluetooth, meaning the buffer wasn't cleared when it tried to switch back and forward the BTLE data. I can't help but think that's someone's bug, but it may not be one I can dig down to without an unreasonable investment of effort.

iay · 2025-01-05T10:46:02Z

I saw another inconsistent state case, this one with overall occupancy "Detected" while everything else was clear and target values all zero.

Poking my "Fix occupancy" button to force republication of all of the occupancy values did cause that to clear down properly, which adds more weight to the idea that the states on the ESPHome side are consistent as expected and the problem is the lost update to Home Assistant for whatever reason that's happening.

As the occupancy is all that I'm using in my automations just now, I'm tempted to follow your (@MenesesPT) earlier suggestion of just forcing a call on that internally every minute or so; I'm fairly sure that would be effective. It's really treating the symptoms and not the disease, though so I will think about it a bit more.

iay · 2025-01-05T11:49:32Z

Perhaps related:

Update to HA 2024.10.4 - update of SOME values is not working syssi/esphome-jk-bms#621
Warn about data loss on API communication esphome/esphome#7321
[api] ensure fair network sharing + prevent lost state changes via deferred publish at high event load esphome/esphome#7547

It looks like "if too many things happen at once or in the wrong order then stuff gets lost" is something the ESPHome folks are very aware of. That doesn't help us directly, but it's nice to know we're not alone. The proposed mitigation sounds workable to me although it seems like it's something of an overcomplicated way of compensating for an underlying design issue. As one commenter says:

Looking at the API and the way it handles TCP, it's a miracle this isn't happening more often.

I also don't know when the proposed fix might hit the codebase.

I note that the claim is made there that if you publish states through MQTT rather than the HA API, the dropped data issue doesn't occur. I don't know whether that might be an option for me. Seems like it might be quite the digression.

iay · 2025-01-05T12:18:35Z

I found a couple of claims in old ESPHome issues that the TCP buffer space message "usually indicates you are running out of ram". I don't think that is what is happening here, but on the off chance I've added the following sensor to a couple of my nodes:

sensor:
    - platform: template
      name: Free memory
      lambda: return heap_caps_get_free_size(MALLOC_CAP_INTERNAL)/1024;
      unit_of_measurement: 'kB'
      state_class: measurement
      update_interval: 60s
      entity_category: "diagnostic"

A bare proxy comes in at 97kB a couple of minutes after boot, the EPL is showing 62kB. It will be interesting to see what happens to these over time.

MenesesPT · 2025-01-06T09:37:31Z

At the time I was able to log the TCP buffer space messages, I had already added a sensor to measure the free memory. Here is the memory available and the maximum continuous block as well around the time it happened.

I don't know exactly the timestamp but I have it stored and will check it out later today.
It would be curious if it happened around 20h32.

I have a few extra sensors added to my config, so it rests at ~57kB instead of the 62kB.

MenesesPT · 2025-01-21T22:19:13Z

In my testing, not having Bluetooth enabled helped, but I suspect it was just because there was less data being transmitted, and not the actual cause of the problem. The occurrences were so rare that I couldn't get any consistent data out of it.

Since last week I've been using the modified code you mentioned:

api: ensure fair network sharing + prevent lost state changes via deferred publish at high event load esphome/esphome#7547

And since then, it seems to be working fine, but I haven't faced this problem often, so maybe you could try it and check if it mitigates your problem as well, @iay ?

iay · 2025-01-21T22:31:29Z

I have to admit to not knowing how to run with an alternative set of ESPHome sources and I probably won't get time to get educated until at least the weekend. I was kind of hoping they might merge that code for the next release, but the issue has gone quiet...

Meanwhile, I've been lucky enough not to see the problem at my end in a week. I had to swap out some of my WiFi gear, and that's probably reduced error rates a bit and reduced the chance of it happening. Or it could be just chance.

MenesesPT · 2025-01-21T22:50:59Z

I was also hopeful for it to get merged soon!
For reference, you could add this to your configuration to use the PR:

external_components:
  - source: github://pr#7547
    components: [api]
    refresh: 0s

It seems to be really inconsistent. I'm glad it's working better for you now!

iay · 2025-02-04T14:42:21Z

(Belatedly) Thanks for the pointer to how to make tweaks like this locally, that will be really useful going forward. I didn't have time to try it on my system, unfortunately, and I still haven't had the issue again (at least, I haven't noticed it).

I note that the 1.2.2 release of the EPL firmware incorporates the external component spec you included, so I have updated my fork to that and am running it now. I'm guessing that if no-one sees continuing problems over the next while then this and all the similar issues can be closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

State consistency and "stuck" states #274

State consistency and "stuck" states #274

iay commented Dec 25, 2024 •

edited

Loading

iay commented Dec 25, 2024

iay commented Jan 2, 2025

iay commented Jan 2, 2025

iay commented Jan 2, 2025

MenesesPT commented Jan 2, 2025

iay commented Jan 2, 2025

MenesesPT commented Jan 3, 2025

iay commented Jan 3, 2025

iay commented Jan 5, 2025

iay commented Jan 5, 2025

iay commented Jan 5, 2025

MenesesPT commented Jan 6, 2025

MenesesPT commented Jan 21, 2025

iay commented Jan 21, 2025

MenesesPT commented Jan 21, 2025

iay commented Feb 4, 2025

State consistency and "stuck" states #274

State consistency and "stuck" states #274

Comments

iay commented Dec 25, 2024 • edited Loading

iay commented Dec 25, 2024

iay commented Jan 2, 2025

iay commented Jan 2, 2025

iay commented Jan 2, 2025

MenesesPT commented Jan 2, 2025

iay commented Jan 2, 2025

MenesesPT commented Jan 3, 2025

iay commented Jan 3, 2025

iay commented Jan 5, 2025

iay commented Jan 5, 2025

iay commented Jan 5, 2025

MenesesPT commented Jan 6, 2025

MenesesPT commented Jan 21, 2025

iay commented Jan 21, 2025

MenesesPT commented Jan 21, 2025

iay commented Feb 4, 2025

iay commented Dec 25, 2024 •

edited

Loading