Bug 2095836 - SR-IOV port status switching between ACTIVE and BUILD
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-06-10 17:27 UTC by Matt Flusche
Modified: 2023-06-22 02:56 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-12 13:53:45 UTC
Target Upstream Version:
Embargoed:
pambre: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-15629 0 None None None 2022-06-10 17:32:28 UTC

Description Matt Flusche 2022-06-10 17:27:39 UTC
Description of problem:
In this environment, multiple SR-IOV ports are flipping between the ACTIVE and BUILD states.

The ports are on active instances and are functioning properly; the status is just flipping in the Neutron DB.

Example port: f00dc545-a450-42f7-8c90-9f8dc7717d20

while (true); do openstack port show f00dc545-a450-42f7-8c90-9f8dc7717d20 -f value -c status; done
ACTIVE
ACTIVE
ACTIVE
ACTIVE
ACTIVE
ACTIVE
ACTIVE
ACTIVE
ACTIVE
BUILD
BUILD
BUILD
BUILD
BUILD
BUILD
BUILD
[...]
ACTIVE
[...]
BUILD
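The flip frequency in a capture like the one above can be summarized with a short post-processing step. A minimal sketch (pure Python; the status list is assumed to come from redirecting the poll loop's output to a file, one status per line):

```python
# Hypothetical post-processing of the polled statuses above: count how many
# times the port flips between ACTIVE and BUILD. The poll loop itself is
# assumed to have captured one status per line, e.g. via
# `openstack port show <port> -f value -c status >> statuses.txt`.

def count_flips(statuses):
    """Return the number of status transitions in a polled status sequence."""
    flips = 0
    for prev, cur in zip(statuses, statuses[1:]):
        if prev != cur:
            flips += 1
    return flips

# Shape of the truncated sample above: 9x ACTIVE, 7x BUILD, ACTIVE, BUILD.
statuses = ["ACTIVE"] * 9 + ["BUILD"] * 7 + ["ACTIVE"] + ["BUILD"]
print(count_flips(statuses))  # 3
```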

I'll provide full debug logs for review; here is a summary:

In /var/log/containers/neutron/sriov-nic-agent.log we see a loop of errors while setting the VF link-state to auto:

$ grep -A4 "'ens2f1', 'vf', '23', 'state', 'auto'" var/log/containers/neutron/sriov-nic-agent.log |head -4
2022-06-07 18:10:28.159 374439 DEBUG neutron.agent.linux.utils [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] Running command (rootwrap daemon): ['ip', 'link', 'set', 'ens2f1', 'vf', '23', 'state', 'auto'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2022-06-07 18:10:28.180 374439 ERROR neutron.agent.linux.utils [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Operation not supported

2022-06-07 18:10:28.181 374439 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] Device fa:16:3e:f1:eb:50 does not support state change: IpCommandOperationNotSupportedError: Operation not supported on device ens2f1


$ grep "'ens2f1', 'vf', '23', 'state', 'auto'" var/log/containers/neutron/sriov-nic-agent.log |wc -l
81

However, the VF is already set to auto:

from sos_commands/networking/ip_-s_-d_link

    vf 23 MAC fa:16:3e:f1:eb:50, spoof checking off, link-state auto, trust on, query_rss off
    RX: bytes  packets  mcast   bcast   dropped 
    0          0        0       0       0        
    TX: bytes  packets   dropped 
    0          0        0        
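The redundancy is visible from the sosreport output above: the agent keeps issuing `ip link set ... vf 23 state auto` even though the VF already reports `link-state auto`. A purely illustrative guard (not actual neutron code) that parses the `ip -s -d link` output could look like:

```python
import re

# Illustrative check for whether a VF's link-state is already "auto",
# parsing the `ip -s -d link` output shown above. This is a sketch for
# discussion only, not code from the neutron SR-IOV agent.

def vf_link_state(ip_link_output, vf_num):
    """Return the link-state string for the given VF, or None if not found."""
    pattern = rf"vf {vf_num} .*link-state (\S+?),"
    m = re.search(pattern, ip_link_output)
    return m.group(1) if m else None

sample = ("vf 23 MAC fa:16:3e:f1:eb:50, spoof checking off, "
          "link-state auto, trust on, query_rss off")
print(vf_link_state(sample, 23))  # auto
```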


In the nova compute log, there is a loop of these unexpected event messages for the port:

$ grep 'Received unexpected event network-vif-plugged-f00dc545' var/log/containers/nova/nova-compute.log|head -1
2022-06-07 16:00:22.155 7 WARNING nova.compute.manager [req-a272ce3e-3712-42d4-b5f3-f3b83fd8f323 6c14b1f098ed4192a9c4e0add240cb77 4f36a7c9956e4a56b9b0924839461b69 - default default] [instance: 458704af-9c5f-4131-a7d8-127b27b69dc1] Received unexpected event network-vif-plugged-f00dc545-a450-42f7-8c90-9f8dc7717d20 for instance with vm_state active and task_state None.

$ grep 'Received unexpected event network-vif-plugged-f00dc545' var/log/containers/nova/nova-compute.log|wc -l
331



In the neutron server log, we see the following (note the "Ignoring state change previous_port_status: ACTIVE current_port_status: BUILD" line):

2022-06-07 18:51:26.156 42 DEBUG neutron.plugins.ml2.rpc [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] Device fa:16:3e:f1:eb:50 details requested by agent nic-switch-agent.COMPUTE with host None get_device_details /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:85
2022-06-07 18:51:26.156 42 DEBUG neutron.plugins.ml2.db [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] get_port_from_device_mac() called for mac fa:16:3e:f1:eb:50 get_port_from_device_mac /usr/lib/python2.7/site-packages/neutron/plugins/ml2/db.py:147
2022-06-07 18:51:26.173 42 DEBUG oslo_concurrency.lockutils [-] Lock "event-dispatch" acquired by "neutron.plugins.ml2.ovo_rpc.dispatch_events" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2022-06-07 18:51:26.250 42 DEBUG neutron.api.rpc.handlers.resources_rpc [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] Pushing event updated for resources: {'Port': ['ID=8ea35eb9-1f63-4464-a19e-c286b3301f44,revision_number=390645']} push /usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/resources_rpc.py:241
2022-06-07 18:51:26.255 42 DEBUG oslo_concurrency.lockutils [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] Lock "event-dispatch" released by "neutron.plugins.ml2.ovo_rpc.dispatch_events" :: held 0.082s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2022-06-07 18:51:26.268 42 DEBUG neutron.plugins.ml2.db [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] For port f00dc545-a450-42f7-8c90-9f8dc7717d20, host COMPUTE, got binding levels [<neutron.plugins.ml2.models.PortBindingLevel[object at 7f3e144fee90] {port_id=u'f00dc545-a450-42f7-8c90-9f8dc7717d20', host=u'COMPUTE', level=0, driver=u'sriovnicswitch', segment_id=u'3980ae42-ea2f-4edf-b9e5-34be572b7204'}>] get_binding_levels /usr/lib/python2.7/site-packages/neutron/plugins/ml2/db.py:77
2022-06-07 18:51:26.283 42 DEBUG neutron.plugins.ml2.rpc [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] Returning: {'profile': {u'pci_slot': u'0000:87:15.7', u'physical_network': u'sriov-n1-p1', u'trusted': u'true', u'pci_vendor_info': u'8086:10ed'}, 'network_qos_policy_id': None, 'qos_policy_id': None, 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': u'ae4d9ee3-69e9-427c-bfcb-604ac2326459', 'segmentation_id': None, 'fixed_ips': [{'subnet_id': u'6d349465-4f13-4d2e-9d0b-7181a7013505', 'ip_address': '172.18.5.71'}], 'mtu': 9000, 'device_owner': u'compute:nova', 'physical_network': u'sriov-n1-p1', 'mac_address': u'fa:16:3e:f1:eb:50', 'device': u'fa:16:3e:f1:eb:50', 'port_security_enabled': False, 'port_id': u'f00dc545-a450-42f7-8c90-9f8dc7717d20', 'network_type': u'flat'} _get_device_details /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:152
2022-06-07 18:51:26.319 42 DEBUG neutron_lib.callbacks.manager [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] Notify callbacks ['neutron.plugins.ml2.plugin.Ml2Plugin._ensure_default_security_group_handler-3820441'] for port, before_update _notify_loop /usr/lib/python2.7/site-packages/neutron_lib/callbacks/manager.py:167
2022-06-07 18:51:26.330 42 DEBUG neutron.notifiers.nova [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] Ignoring state change previous_port_status: ACTIVE current_port_status: BUILD port_id f00dc545-a450-42f7-8c90-9f8dc7717d20 record_port_status_changed /usr/lib/python2.7/site-packages/neutron/notifiers/nova.py:209
2022-06-07 18:51:26.338 42 DEBUG neutron.plugins.ml2.db [req-3b464202-45dc-4a61-a51e-118daf85ed9c - - - - -] For port f00dc545-a450-42f7-8c90-9f8dc7717d20, host COMPUTE, got binding levels [<neutron.plugins.ml2.models.PortBindingLevel[object at 7f3e13d652d0] {port_id=u'f00dc545-a450-42f7-8c90-9f8dc7717d20', host=u'COMPUTE', level=0, driver=u'sriovnicswitch', segment_id=u'3980ae42-ea2f-4edf-b9e5-34be572b7204'}>] get_binding_levels /usr/lib/python2.7/site-packages/neutron/plugins/ml2/db.py:77
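The two log excerpts fit together. A rough sketch of the notifier decision implied by the log above (simplified; not the actual neutron record_port_status_changed code) shows why nova keeps receiving "unexpected" plug events even though the ACTIVE -> BUILD direction is ignored: every cycle back to ACTIVE emits a network-vif-plugged event, and with vm_state active / task_state None nova has no waiter registered for it.

```python
# Simplified model of the behaviour seen in the neutron server log above:
# transitions *to* ACTIVE notify nova with network-vif-plugged, transitions
# to DOWN/ERROR with network-vif-unplugged, and anything else (such as
# ACTIVE -> BUILD) is ignored. This is an assumption-laden sketch for
# discussion, not the upstream implementation.

def event_for_transition(previous, current):
    """Return the nova event a port status transition would trigger, if any."""
    if current == "ACTIVE":
        return "network-vif-plugged"
    if current in ("DOWN", "ERROR"):
        return "network-vif-unplugged"
    return None  # e.g. ACTIVE -> BUILD: "Ignoring state change"

print(event_for_transition("ACTIVE", "BUILD"))   # None
print(event_for_transition("BUILD", "ACTIVE"))   # network-vif-plugged
```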



There was a thought that there might be multiple port bindings for the interface, but there is only one in the DB:

MariaDB [ovs_neutron]> select * from ports where id like 'f00dc545%';
+----------------------------------+--------------------------------------+--------------------------------------------------------+--------------------------------------+-------------------+----------------+--------+--------------------------------------+--------------+------------------+---------------+
| project_id                       | id                                   | name                                                   | network_id                           | mac_address       | admin_state_up | status | device_id                            | device_owner | standard_attr_id | ip_allocation |
+----------------------------------+--------------------------------------+--------------------------------------------------------+--------------------------------------+-------------------+----------------+--------+--------------------------------------+--------------+------------------+---------------+
| 9ba1d817a52247bcabb6889818d8a1e8 | f00dc545-a450-42f7-8c90-9f8dc7717d20 | REMOVED | ae4d9ee3-69e9-427c-bfcb-604ac2326459 | fa:16:3e:f1:eb:50 |              1 | ACTIVE | 458704af-9c5f-4131-a7d8-127b27b69dc1 | compute:nova |            61900 | immediate     |
+----------------------------------+--------------------------------------+--------------------------------------------------------+--------------------------------------+-------------------+----------------+--------+--------------------------------------+--------------+------------------+---------------+
1 row in set (0.001 sec)


select * from ml2_port_bindings where port_id like 'f00dc545%';
+--------------------------------------+---------------------------------+----------+-----------+--------------------------------------------------------------------------------------------------------------------+-------------------------------------+--------+
| port_id                              | host                            | vif_type | vnic_type | profile                                                                                                            | vif_details                         | status |
+--------------------------------------+---------------------------------+----------+-----------+--------------------------------------------------------------------------------------------------------------------+-------------------------------------+--------+
| f00dc545-a450-42f7-8c90-9f8dc7717d20 | COMPUTE | hw_veb   | direct    | {"pci_slot": "0000:87:15.7", "physical_network": "sriov-n1-p1", "trusted": "true", "pci_vendor_info": "8086:10ed"} | {"port_filter": false, "vlan": "0"} | ACTIVE |
+--------------------------------------+---------------------------------+----------+-----------+--------------------------------------------------------------------------------------------------------------------+-------------------------------------+--------+
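As a side note, the single-binding check above can be reproduced with an in-memory SQLite stand-in (the real queries run against the MariaDB ovs_neutron schema; the column set is trimmed for brevity and the row data mirrors the output above):

```python
import sqlite3

# Illustrative re-creation of the duplicate-binding check above using an
# in-memory SQLite table. Not a tool from this BZ, just a sketch of the
# query logic: count bindings per port to rule out duplicates.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ml2_port_bindings (port_id TEXT, host TEXT, status TEXT)")
conn.execute("INSERT INTO ml2_port_bindings VALUES "
             "('f00dc545-a450-42f7-8c90-9f8dc7717d20', 'COMPUTE', 'ACTIVE')")
rows = conn.execute(
    "SELECT port_id, COUNT(*) FROM ml2_port_bindings "
    "WHERE port_id LIKE 'f00dc545%' GROUP BY port_id").fetchall()
print(rows)  # [('f00dc545-a450-42f7-8c90-9f8dc7717d20', 1)] -> one binding
```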




Version-Release number of selected component (if applicable):
OSP 13z16

How reproducible:
Only seen in this environment

Additional info:

More details in private comments

Comment 14 Artom Lifshitz 2022-07-12 14:01:33 UTC
We think this is caused by one or more instances using PCI devices that they're not supposed to, probably due to [1]. In this particular case, we think the PCI addresses are on the wrong physical network, so the Neutron agent is reconfiguring all ports, as Sean explained in the previous comment. We wrote a script for a similar BZ to detect which instances, if any, have a mismatch between the PCI information in the Nova database and the PCI information in the Neutron ports attached to the instance. We're asking that you run the script (don't worry, it doesn't do anything destructive, just some SELECT queries on the Nova and Neutron databases) and report back. If it finds any instances with PCI address mismatches, those will have to be manually repaired, ideally by just cold migrating them.
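The core of the consistency check described above can be sketched as a comparison of per-instance PCI address sets. The data-fetching is assumed (in the real script it comes from SELECTs on the Nova and Neutron databases); only the comparison logic is shown, and all names and values here are hypothetical:

```python
# Hedged sketch of the mismatch detection described above: compare the PCI
# addresses nova recorded for an instance's allocated devices against the
# pci_slot values in the neutron port binding profiles. Inputs are assumed
# to be pre-fetched as {instance_uuid: set-of-pci-addresses} mappings.

def find_pci_mismatches(nova_allocations, neutron_bindings):
    """Return instance UUIDs whose nova and neutron PCI address sets differ."""
    mismatched = []
    for instance, nova_addrs in nova_allocations.items():
        neutron_addrs = neutron_bindings.get(instance, set())
        if nova_addrs != neutron_addrs:
            mismatched.append(instance)
    return mismatched

nova = {"458704af": {"0000:87:15.7"}}
neutron = {"458704af": {"0000:87:10.1"}}   # hypothetical wrong slot
print(find_pci_mismatches(nova, neutron))  # ['458704af']
```

An instance flagged by a check like this would be a candidate for the cold migration repair mentioned above.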

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1767797

Comment 29 Artom Lifshitz 2022-10-12 13:53:45 UTC
Customer case is closed, closing the BZ. Looks like this will remain a mystery.

