Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VLAN]: Orchagent reports VLAN removal failure due to invalid order of event processing #20941

Open
nazariig opened this issue Nov 27, 2024 · 11 comments
Labels
Bug 🐛 Issue for 202405 Triaged this issue has been triaged

Comments

@nazariig
Copy link
Collaborator

Description

This appears to be a timing issue caused by SWSS event processing while being in a busy state.
Looks like we have a situation, when two tasks are stuck in SWSS queue (VLAN member and VLAN removal) and then being processed at once via single loop.

Since the events are stored in a multimap container (SONiC implementation from day 1), the ordering of items eventually can be changed due to key sorting algorithm. This means, that SWSS will be processing events in a different order, comparing to what was originally generated by CLI or controller.

Steps to reproduce the issue:

  1. Create VLAN and members
sudo config vlan add 4094
sudo config vlan member add --untagged 4094 PortChannel0001
sudo config vlan member add 4094 Ethernet64
  1. Remove LAG from VLAN
sudo config vlan member del 4094 PortChannel0001
  1. Pause SWSS
docker exec -ti swss bash
kill -s SIGSTOP $(pgrep -f /usr/bin/orchagent)
  1. Remove VLAN and member
sudo config vlan member del 4094 Ethernet64
sudo config vlan del 4094
  1. Resume SWSS
docker exec -ti swss bash
kill -s SIGCONT $(pgrep -f /usr/bin/orchagent)

syslog:

2024 Nov 27 16:34:36.642969 sonic ERR swss#orchagent: :- removeVlan: Failed to remove non-empty VLAN Vlan4094
2024 Nov 27 16:34:36.643100 sonic ERR swss#orchagent: :- removeVlan: Failed to remove non-empty VLAN Vlan4094
2024 Nov 27 16:34:36.644014 sonic INFO swss#orchagent: :- removeRoute: Failed to find route entry, vrf_id 0x3000000000002, prefix fd00::/80
2024 Nov 27 16:34:36.644162 sonic INFO swss#orchagent: :- removeRoute: Failed to find route entry, vrf_id 0x3000000000002, prefix fe80::/64
2024 Nov 27 16:34:36.644280 sonic ERR swss#orchagent: :- removeVlan: Failed to remove non-empty VLAN Vlan4094
2024 Nov 27 16:34:36.645386 sonic INFO syncd#SDK: :- processSingleEvent: key: SAI_OBJECT_TYPE_VLAN_MEMBER:oid:0x27000000000744 op: remove
2024 Nov 27 16:34:36.645386 sonic NOTICE syncd#SDK: [SAI_VLAN.NOTICE] ./src/mlnx_sai_vlan.c[1385]- mlnx_remove_vlan_member: Remove VLAN_MEMBER [OID:0x20FFE0027] [bridge_ports_db[2], vlan:4094]
2024 Nov 27 16:34:36.647624 sonic INFO syncd#SDK: :- sendApiResponse: sending response for SAI_COMMON_API_REMOVE api with status: SAI_STATUS_SUCCESS
2024 Nov 27 16:34:36.647866 sonic INFO syncd#SDK: :- sendApiResponse: response for SAI_COMMON_API_REMOVE api was send
2024 Nov 27 16:34:36.648166 sonic NOTICE swss#orchagent: :- removeVlanMember: Remove member Ethernet64 from VLAN Vlan4094 lid:ffe vmid:27000000000744
2024 Nov 27 16:34:36.651882 sonic INFO syncd#SDK: :- processSingleEvent: key: SAI_OBJECT_TYPE_VLAN:oid:0x26000000000741 op: remove
2024 Nov 27 16:34:36.652010 sonic NOTICE syncd#SDK: [SAI_VLAN.NOTICE] ./src/mlnx_sai_vlan.c[951]- mlnx_remove_vlan: Remove VLAN [OID:0xFFE00000026] [vlan:4094]
2024 Nov 27 16:34:36.656430 sonic INFO syncd#SDK: :- sendApiResponse: sending response for SAI_COMMON_API_REMOVE api with status: SAI_STATUS_SUCCESS
2024 Nov 27 16:34:36.656609 sonic INFO syncd#SDK: :- sendApiResponse: response for SAI_COMMON_API_REMOVE api was send
2024 Nov 27 16:34:36.656785 sonic NOTICE swss#orchagent: :- removeVlan: Remove VLAN Vlan4094 vid:4094

swss.rec

2024-11-27.14:34:04.172889|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:up|mtu:9100
2024-11-27.14:34:04.175687|VLAN_MEMBER_TABLE:Vlan4094:PortChannel0001|DEL
2024-11-27.14:34:36.642337|VLAN_TABLE:Vlan4094|DEL
2024-11-27.14:34:36.643107|ROUTE_TABLE:fe80::/64|DEL
2024-11-27.14:34:36.643164|ROUTE_TABLE:fd00::/80|DEL
2024-11-27.14:34:36.644311|VLAN_MEMBER_TABLE:Vlan4094:Ethernet64|DEL

sairedis.rec

2024-11-27.14:34:04.175890|r|SAI_OBJECT_TYPE_VLAN_MEMBER:oid:0x27000000000743
2024-11-27.14:34:04.180936|s|SAI_OBJECT_TYPE_LAG:oid:0x2000000000672|SAI_LAG_ATTR_PORT_VLAN_ID=1
2024-11-27.14:34:04.183784|f|SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000|SAI_FDB_FLUSH_ATTR_BRIDGE_PORT_ID=oid:0x3a000000000742|SAI_FDB_FLUSH_ATTR_BV_ID=oid:0x26000000000741|SAI_FDB_FLUSH_ATTR_ENTRY_TYPE=SAI_FDB_FLUSH_ENTRY_TYPE_DYNAMIC
2024-11-27.14:34:04.185607|F|SAI_STATUS_SUCCESS
2024-11-27.14:34:04.187259|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000741\",\"mac\":\"00:00:00:00:00:00\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_FLUSHED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000742"},{"id":"SAI_FDB_ENTRY_ATTR_TYPE","value":"SAI_FDB_ENTRY_TYPE_DYNAMIC"},{"id":"SAI_FDB_ENTRY_ATTR_PACKET_ACTION","value":"SAI_PACKET_ACTION_FORWARD"}]}]|
2024-11-27.14:34:04.188340|s|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000742|SAI_BRIDGE_PORT_ATTR_ADMIN_STATE=false
2024-11-27.14:34:04.191534|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd000000000623|SAI_HOSTIF_ATTR_VLAN_TAG=SAI_HOSTIF_VLAN_TAG_STRIP
2024-11-27.14:34:04.192420|f|SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000|SAI_FDB_FLUSH_ATTR_BRIDGE_PORT_ID=oid:0x3a000000000742|SAI_FDB_FLUSH_ATTR_ENTRY_TYPE=SAI_FDB_FLUSH_ENTRY_TYPE_DYNAMIC
2024-11-27.14:34:04.193983|F|SAI_STATUS_SUCCESS
2024-11-27.14:34:04.194680|r|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000742
2024-11-27.14:34:04.195296|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x0\",\"mac\":\"00:00:00:00:00:00\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_FLUSHED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000742"},{"id":"SAI_FDB_ENTRY_ATTR_TYPE","value":"SAI_FDB_ENTRY_TYPE_DYNAMIC"},{"id":"SAI_FDB_ENTRY_ATTR_PACKET_ACTION","value":"SAI_PACKET_ACTION_FORWARD"}]}]|
2024-11-27.14:34:36.644547|r|SAI_OBJECT_TYPE_VLAN_MEMBER:oid:0x27000000000744
2024-11-27.14:34:36.647986|f|SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000|SAI_FDB_FLUSH_ATTR_BRIDGE_PORT_ID=oid:0x3a000000000682|SAI_FDB_FLUSH_ATTR_BV_ID=oid:0x26000000000741|SAI_FDB_FLUSH_ATTR_ENTRY_TYPE=SAI_FDB_FLUSH_ENTRY_TYPE_DYNAMIC
2024-11-27.14:34:36.649813|F|SAI_STATUS_SUCCESS
2024-11-27.14:34:36.650390|r|SAI_OBJECT_TYPE_VLAN:oid:0x26000000000741
2024-11-27.14:34:36.651179|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000741\",\"mac\":\"00:00:00:00:00:00\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_FLUSHED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000682"},{"id":"SAI_FDB_ENTRY_ATTR_TYPE","value":"SAI_FDB_ENTRY_TYPE_DYNAMIC"},{"id":"SAI_FDB_ENTRY_ATTR_PACKET_ACTION","value":"SAI_PACKET_ACTION_FORWARD"}]}]|
2024-11-27.14:34:36.658068|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000741\",\"mac\":\"00:00:00:00:00:00\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_FLUSHED","list":[{"id":"SAI_FDB_ENTRY_ATTR_TYPE","value":"SAI_FDB_ENTRY_TYPE_DYNAMIC"},{"id":"SAI_FDB_ENTRY_ATTR_PACKET_ACTION","value":"SAI_PACKET_ACTION_FORWARD"}]}]|

Describe the results you received:

2024 Nov 27 16:34:36.642969 sonic ERR swss#orchagent: :- removeVlan: Failed to remove non-empty VLAN Vlan4094

Describe the results you expected:

No errors are expected

Output of show version:

  • N/A

Output of show techsupport:

  • N/A

Additional information you deem important (e.g. issue happens only occasionally):

  • N/A
@liat-grozovik liat-grozovik changed the title [202405][VLAN]: Orchagent reports VLAN removal failure due to invalid order of event processing [VLAN]: Orchagent reports VLAN removal failure due to invalid order of event processing Dec 1, 2024
@prabhataravind
Copy link
Contributor

Discuss with @qiluo-msft and @prsunny to see if there is a generic way to maintain order of objects in swss

@prabhataravind prabhataravind added the Triaged this issue has been triaged label Dec 4, 2024
@prsunny
Copy link
Contributor

prsunny commented Dec 6, 2024

@nazariig , based on this section of code - https://github.com/sonic-net/sonic-swss/blob/caed910210cba0ff7f6cbbdc403d58a2ffc24d55/orchagent/portsorch.cpp#L5590C1-L5631C6, its different consumer m_tosync for each table. Meaning, different tables have different multimaps. So can you clarify or have an evidence of this statement?

Since the events are stored in a multimap container (SONiC implementation from day 1), the ordering of items eventually can be changed due to key sorting algorithm.

@qiluo-msft
Copy link
Collaborator

@liuh-80 to check

@liuh-80
Copy link
Contributor

liuh-80 commented Dec 10, 2024

The issue may relate with this part:

https://github.com/sonic-net/sonic-swss/blob/8387d616532badfd8b8856f37096d7740b1a8b4d/orchagent/orchdaemon.cpp#L139

vector<table_name_with_pri_t> ports_tables = {
    { APP_PORT_TABLE_NAME,        portsorch_base_pri + 5 },
    { APP_SEND_TO_INGRESS_PORT_TABLE_NAME,        portsorch_base_pri + 5 },
    { APP_VLAN_TABLE_NAME,        portsorch_base_pri + 2 }, <== VLAN_TABLE priority higher than VLAN_MEMBER_TABLE
    { APP_VLAN_MEMBER_TABLE_NAME, portsorch_base_pri     },
    { APP_LAG_TABLE_NAME,         portsorch_base_pri + 4 }, <== This may also have issue
    { APP_LAG_MEMBER_TABLE_NAME,  portsorch_base_pri     }
};

When VLAN config in CONFIG_DB change, vlanmgrd will handle CONFIG_DB change and change APPL_DB. then orchagent will handle APPL_DB change. however the VLAN_TABLE has higher priority than VLAN_MEMBER_TABLE, so according to following code, VLAN delete event will select first:
https://github.com/liuh-80/sonic-swss-common/blob/ebd2afb0a2946420f6a42ba1f54f8b2c971781be/common/select.h#L52

struct cmp
{
    bool operator()(const Selectable* a, const Selectable* b) const
    {
        /* Choose Selectable with highest priority first */
        if (a->getPri() > b->getPri())
            return true;
        else if (a->getPri() < b->getPri())
            return false;

I will validate my theory, but if this is true, then change the priority may fix the issue. but a risk is VLAN member create event may select before VLAN create event.

@prsunny
Copy link
Contributor

prsunny commented Dec 10, 2024

The issue may relate with this part:

https://github.com/sonic-net/sonic-swss/blob/8387d616532badfd8b8856f37096d7740b1a8b4d/orchagent/orchdaemon.cpp#L139

vector<table_name_with_pri_t> ports_tables = {
    { APP_PORT_TABLE_NAME,        portsorch_base_pri + 5 },
    { APP_SEND_TO_INGRESS_PORT_TABLE_NAME,        portsorch_base_pri + 5 },
    { APP_VLAN_TABLE_NAME,        portsorch_base_pri + 2 }, <== VLAN_TABLE priority higher than VLAN_MEMBER_TABLE
    { APP_VLAN_MEMBER_TABLE_NAME, portsorch_base_pri     },
    { APP_LAG_TABLE_NAME,         portsorch_base_pri + 4 }, <== This may also have issue
    { APP_LAG_MEMBER_TABLE_NAME,  portsorch_base_pri     }
};

When VLAN config in CONFIG_DB change, vlanmgrd will handle CONFIG_DB change and change APPL_DB. then orchagent will handle APPL_DB change. however the VLAN_TABLE has higher priority than VLAN_MEMBER_TABLE, so according to following code, VLAN delete event will select first: https://github.com/liuh-80/sonic-swss-common/blob/ebd2afb0a2946420f6a42ba1f54f8b2c971781be/common/select.h#L52

struct cmp
{
    bool operator()(const Selectable* a, const Selectable* b) const
    {
        /* Choose Selectable with highest priority first */
        if (a->getPri() > b->getPri())
            return true;
        else if (a->getPri() < b->getPri())
            return false;

I will validate my theory, but if this is true, then change the priority may fix the issue. but a risk is VLAN member create event may select before VLAN create event.

agree. if we change order, it will break create flow

@nazariig
Copy link
Collaborator Author

@nazariig , based on this section of code - https://github.com/sonic-net/sonic-swss/blob/caed910210cba0ff7f6cbbdc403d58a2ffc24d55/orchagent/portsorch.cpp#L5590C1-L5631C6, its different consumer m_tosync for each table. Meaning, different tables have different multimaps. So can you clarify or have an evidence of this statement?

Since the events are stored in a multimap container (SONiC implementation from day 1), the ordering of items eventually can be changed due to key sorting algorithm.

@prsunny i didn't check that statement. That was only my guess, since i did not dive too deep into the bug. Based on the code snippet you provided, this is still an ordering issue

@nazariig
Copy link
Collaborator Author

nazariig commented Dec 11, 2024

The issue may relate with this part:
https://github.com/sonic-net/sonic-swss/blob/8387d616532badfd8b8856f37096d7740b1a8b4d/orchagent/orchdaemon.cpp#L139

vector<table_name_with_pri_t> ports_tables = {
    { APP_PORT_TABLE_NAME,        portsorch_base_pri + 5 },
    { APP_SEND_TO_INGRESS_PORT_TABLE_NAME,        portsorch_base_pri + 5 },
    { APP_VLAN_TABLE_NAME,        portsorch_base_pri + 2 }, <== VLAN_TABLE priority higher than VLAN_MEMBER_TABLE
    { APP_VLAN_MEMBER_TABLE_NAME, portsorch_base_pri     },
    { APP_LAG_TABLE_NAME,         portsorch_base_pri + 4 }, <== This may also have issue
    { APP_LAG_MEMBER_TABLE_NAME,  portsorch_base_pri     }
};

When VLAN config in CONFIG_DB change, vlanmgrd will handle CONFIG_DB change and change APPL_DB. then orchagent will handle APPL_DB change. however the VLAN_TABLE has higher priority than VLAN_MEMBER_TABLE, so according to following code, VLAN delete event will select first: https://github.com/liuh-80/sonic-swss-common/blob/ebd2afb0a2946420f6a42ba1f54f8b2c971781be/common/select.h#L52

struct cmp
{
    bool operator()(const Selectable* a, const Selectable* b) const
    {
        /* Choose Selectable with highest priority first */
        if (a->getPri() > b->getPri())
            return true;
        else if (a->getPri() < b->getPri())
            return false;

I will validate my theory, but if this is true, then change the priority may fix the issue. but a risk is VLAN member create event may select before VLAN create event.

agree. if we change order, it will break create flow

So this is a classic chicken-egg issue. If you fix one part, you will definitely break the other part.
A quick and solid fix would be introducing a retry mechanism.

IMHO, solving object dependency issue for init flow (currently done in SWSS) using some magic prio numbers is not reliable enough.

@liuh-80
Copy link
Contributor

liuh-80 commented Dec 13, 2024

After change priority, issue still happen, found another issue, the compare operation will sort last selected table to first, which means this is not a stable sort algorithm, will check and update later:

struct cmp
{
    bool operator()(const Selectable* a, const Selectable* b) const
    {
        /* Choose Selectable with highest priority first */
        if (a->getPri() > b->getPri())
            return true;
        else if (a->getPri() < b->getPri())
            return false;

        /* if the priorities are equal */
        /* use Selectable which was selected earlier */
        if (a->getLastUsedTime() < b->getLastUsedTime())
            return true;
        else if (a->getLastUsedTime() > b->getLastUsedTime())
            return false;

        /* when a == b */
        return false;
    }
};

@liuh-80
Copy link
Contributor

liuh-80 commented Dec 17, 2024

After remove the last used time part from compare method, event can be select correctly, however the vlan and ports event been blocked because following check failed, seems last used time is important for handle port event:

    /* Wait for all ports to be initialized */
    if (!allPortsReady())
    {
        return;
    }

@liuh-80
Copy link
Contributor

liuh-80 commented Dec 23, 2024

Verified the issue caused by LastUsedTime with following commands:

sudo truncate -s 0 /var/log/syslog
sudo config vlan add 4094
sudo config vlan member add 4094 Ethernet64
sudo config vlan add 4093 <== this will make VLAN table last user time bigger than VLAN_MEMBER table, then will cause VLAN_MEMBER pops first.
sudo cat /var/log/syslog

sudo kill -s SIGSTOP $(pgrep -f /usr/bin/orchagent)

sudo config vlan member del 4094 Ethernet64
sudo config vlan del 4094

sudo truncate -s 0 /var/log/syslog
sudo kill -s SIGCONT $(pgrep -f /usr/bin/orchagent)
sudo cat /var/log/syslog

admin@vlab-01:~$ sudo cat /var/log/syslog
2024 Dec 23 02:45:02.822362 vlab-01 NOTICE swss#orchagent: :- removeVlanMember: Remove member Ethernet64 from VLAN Vlan4094 lid:ffe vmid:270000000006b8
2024 Dec 23 02:45:02.822362 vlab-01 NOTICE swss#orchagent: :- flushFdbEntries: flush key: SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000, fields: 3
2024 Dec 23 02:45:02.823078 vlab-01 NOTICE swss#orchagent: :- recordFlushFdbEntries: flush key: SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000, fields: 3
2024 Dec 23 02:45:02.825418 vlab-01 NOTICE syncd#syncd: :- flushFdbEntries: generating fdb flush notifications
2024 Dec 23 02:45:02.825418 vlab-01 NOTICE syncd#syncd: :- meta_sai_on_fdb_flush_event_consolidated: processing consolidated fdb flush event of type: SAI_FDB_ENTRY_TYPE_DYNAMIC
2024 Dec 23 02:45:02.827917 vlab-01 NOTICE syncd#syncd: :- meta_sai_on_fdb_flush_event_consolidated: fdb flush took 0.004426 sec
2024 Dec 23 02:45:02.828591 vlab-01 NOTICE syncd#syncd: :- processFdbFlush: fdb flush succeeded, updating redis database
2024 Dec 23 02:45:02.828591 vlab-01 NOTICE syncd#syncd: :- processFlushEvent: received a flush port fdb event, portVid = oid:0x3a00000000061b, bvId = oid:0x260000000006b7
2024 Dec 23 02:45:02.828591 vlab-01 NOTICE syncd#syncd: :- processFlushEvent: pattern ASIC_STATE:SAI_OBJECT_TYPE_FDB_ENTRY:oid:0x260000000006b7, portStr oid:0x3a00000000061b
2024 Dec 23 02:45:02.833625 vlab-01 NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: processing consolidated fdb flush event of type: SAI_FDB_ENTRY_TYPE_DYNAMIC
2024 Dec 23 02:45:02.837905 vlab-01 NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: fdb flush took 0.004390 sec
2024 Dec 23 02:45:02.840065 vlab-01 NOTICE swss#orchagent: :- removeVlan: Remove VLAN Vlan4094 vid:4094

Next step will find a draft fix solution.

@liuh-80
Copy link
Contributor

liuh-80 commented Dec 26, 2024

Currently when vlan still have member, orchagent will delay the vlan delete, which means the error message just a warning, the vlan still will be delete later.

        if (removeVlan(vlan))
            it = consumer.m_toSync.erase(it);
        else
            it++;

Also swss-common pops with 'SPOP' command, which will not keep the original insert order: https://redis.io/docs/latest/commands/spop/

So, if the orchagent want keeps the order of events, need:

  1. change to Redis list commands: https://redis.io/docs/latest/develop/data-types/lists/
  2. change to ZMQ, which is message queue and will keeps order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug 🐛 Issue for 202405 Triaged this issue has been triaged
Projects
None yet
Development

No branches or pull requests

6 participants