[VLAN]: Orchagent reports VLAN removal failure due to invalid order of event processing #20941
Comments
Discuss with @qiluo-msft and @prsunny to see if there is a generic way to maintain the order of objects in swss. |
@nazariig, based on this section of code (https://github.com/sonic-net/sonic-swss/blob/caed910210cba0ff7f6cbbdc403d58a2ffc24d55/orchagent/portsorch.cpp#L5590C1-L5631C6), there is a different consumer m_tosync for each table, meaning that different tables have different multimaps. Can you clarify, or provide evidence for, this statement? "Since the events are stored in a multimap container (SONiC implementation from day 1), the ordering of items eventually can be changed due to key sorting algorithm." |
@liuh-80 to check |
The issue may be related to this part:
When the VLAN config in CONFIG_DB changes, vlanmgrd handles the CONFIG_DB change and updates APPL_DB; orchagent then handles the APPL_DB change. However, VLAN_TABLE has a higher priority than VLAN_MEMBER_TABLE, so according to the following code, the VLAN delete event will be selected first:
I will validate my theory. If it is true, changing the priority may fix the issue, but the risk is that a VLAN member create event may then be selected before the VLAN create event. |
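The trade-off described above can be sketched with a toy highest-priority-first selector. This is a minimal illustration, not the swss-common selection code: `ReadyTable`, `drainByPriority`, and the priority numbers passed in are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of a table that has pending data. The priority values
// supplied by callers are illustrative, not the ones used by orchagent.
struct ReadyTable {
    std::string table;
    int priority;
    std::string op;
};

// Return "TABLE:OP" strings in the order a highest-priority-first selector
// would hand them out. The input order models the arrival order.
std::vector<std::string> drainByPriority(std::vector<ReadyTable> ready) {
    // stable_sort keeps arrival order among entries with equal priority.
    std::stable_sort(ready.begin(), ready.end(),
                     [](const ReadyTable& a, const ReadyTable& b) {
                         return a.priority > b.priority;
                     });
    std::vector<std::string> out;
    for (const auto& r : ready) {
        out.push_back(r.table + ":" + r.op);
    }
    return out;
}
```

If VLAN_TABLE is given the higher number, a member DEL that arrived first is still drained after the VLAN DEL. If the numbers are swapped, a member SET is drained before the VLAN SET, which is the create-flow risk mentioned above.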
Agree. If we change the order, it will break the create flow. |
@prsunny I didn't check that statement. It was only my guess, since I did not dive too deep into the bug. Based on the code snippet you provided, this is still an ordering issue. |
So this is a classic chicken-and-egg issue: if you fix one part, you will definitely break the other. IMHO, solving the object dependency issue for the init flow (currently done in SWSS) using some magic priority numbers is not reliable enough. |
After changing the priority, the issue still happens. I found another issue: the compare operation sorts the last selected table to the front, which means this is not a stable sort. I will check and update later:
|
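To illustrate why a time-based tie-breaker makes the selection order depend on history, here is a minimal sketch. It is not the swss-common comparator: `Source`, `selectBefore`, and `selectionOrder` are hypothetical names, and the tie-break direction (smaller last-used value first) is an assumption. Either direction leads to the same observation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical selectable source with a priority and a last-used timestamp.
struct Source {
    std::string name;
    int priority;
    std::uint64_t lastUsed;
};

// Higher priority first. On equal priority, the source with the smaller
// lastUsed value goes first (assumed direction, for illustration only).
bool selectBefore(const Source& a, const Source& b) {
    if (a.priority != b.priority) {
        return a.priority > b.priority;
    }
    return a.lastUsed < b.lastUsed;
}

// Names of the sources in the order the comparator would select them.
std::vector<std::string> selectionOrder(std::vector<Source> sources) {
    std::stable_sort(sources.begin(), sources.end(), selectBefore);
    std::vector<std::string> names;
    for (const auto& s : sources) {
        names.push_back(s.name);
    }
    return names;
}
```

With equal priorities, changing only `lastUsed` flips the result, so the order between two tables is decided by which one was served recently rather than by which event was written first.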
After removing the last-used-time part from the compare method, events are selected correctly. However, the VLAN and port events are then blocked because the following check fails; it seems the last used time is important for handling port events:
|
Verified that the issue is caused by LastUsedTime with the following commands:
sudo truncate -s 0 /var/log/syslog
sudo kill -s SIGSTOP $(pgrep -f /usr/bin/orchagent)
sudo config vlan member del 4094 Ethernet64
sudo truncate -s 0 /var/log/syslog
admin@vlab-01:~$ sudo cat /var/log/syslog
The next step is to find a draft fix. |
Currently, when a VLAN still has members, orchagent delays the VLAN delete, which means the error message is just a warning; the VLAN will still be deleted later.
Also, swss-common pops with the 'SPOP' command, which does not keep the original insertion order: https://redis.io/docs/latest/commands/spop/ So, if orchagent wants to keep the order of events, it needs:
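The delayed delete can be pictured with a small sketch. It is a simplified model under stated assumptions, not the orchagent code: `TaskStatus`, `VlanMembers`, `removeVlan`, and `removeMember` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical outcome of handling one task.
enum class TaskStatus { Done, Retry };

// Hypothetical model: VLAN name -> set of member ports.
using VlanMembers = std::map<std::string, std::set<std::string>>;

// Try to remove a VLAN. While members remain, report Retry so the caller
// can keep the task pending and try again on a later pass.
TaskStatus removeVlan(VlanMembers& vlans, const std::string& vlan) {
    auto it = vlans.find(vlan);
    if (it == vlans.end()) {
        return TaskStatus::Done;   // nothing left to remove
    }
    if (!it->second.empty()) {
        return TaskStatus::Retry;  // members still attached, defer
    }
    vlans.erase(it);
    return TaskStatus::Done;
}

// Remove one member port from a VLAN, if both exist.
void removeMember(VlanMembers& vlans, const std::string& vlan,
                  const std::string& port) {
    auto it = vlans.find(vlan);
    if (it != vlans.end()) {
        it->second.erase(port);
    }
}
```

In this model, a VLAN delete that is handled before the member delete returns `Retry` and succeeds on the next attempt once the member is gone.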
|
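As a generic illustration of the ordering problem with an unordered pop (not the approach proposed in the comment above), one way to recover producer order is to tag each event with a monotonically increasing sequence number and sort on it after popping. `SeqEvent` and `restoreOrder` are hypothetical names, and the sequence field is an assumption that does not exist in the current schema as far as this thread shows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical event tagged by the producer with a sequence number.
struct SeqEvent {
    std::uint64_t seq;
    std::string key;
    std::string op;
};

// Given events popped in arbitrary order, return them in producer order.
std::vector<SeqEvent> restoreOrder(std::vector<SeqEvent> popped) {
    std::sort(popped.begin(), popped.end(),
              [](const SeqEvent& a, const SeqEvent& b) {
                  return a.seq < b.seq;
              });
    return popped;
}
```

This only restores order within one batch of popped events; ordering across batches or across tables would need additional handling.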
Description
This appears to be a timing issue caused by SWSS event processing while in a busy state.
It looks like two tasks (VLAN member removal and VLAN removal) get stuck in the SWSS queue and are then processed at once in a single loop.
Since the events are stored in a multimap container (SONiC implementation from day 1), the ordering of items eventually can be changed due to key sorting algorithm. This means that SWSS will process events in a different order than the one originally generated by the CLI or controller.
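The container property referred to here can be shown with a minimal sketch. It only demonstrates how `std::multimap` iterates within a single container. Whether this explains the reported failure across separate tables is the question raised in the comments. `Event` and `iterationOrder` are hypothetical names, and the keys are examples.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pending-event record: {key, operation}.
using Event = std::pair<std::string, std::string>;

// Insert events into a std::multimap in arrival order and return the keys
// in the order the container iterates over them.
std::vector<std::string> iterationOrder(const std::vector<Event>& arrivals) {
    std::multimap<std::string, std::string> pending;
    for (const auto& ev : arrivals) {
        pending.emplace(ev.first, ev.second);
    }
    std::vector<std::string> keys;
    for (const auto& entry : pending) {
        keys.push_back(entry.first);  // sorted by key, not by arrival
    }
    return keys;
}
```

Events for different keys come back sorted by key, so an event that arrived first can be visited second. Events with the same key keep their insertion order.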
Steps to reproduce the issue:
syslog:
swss.rec
sairedis.rec
Describe the results you received:
Describe the results you expected:
No errors are expected
Output of show version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):