Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2179328

Summary: [OSP13] openvswitch segfault on multiple nodes
Product: Red Hat OpenStack
Component: openvswitch
Version: 13.0 (Queens)
Reporter: ggrimaux
Assignee: RHOSP:NFV_Eng <rhosp-nfv-int>
QA Contact: Eran Kuris <ekuris>
CC: apevec, chrisw, fleitner, jlibosva
Status: CLOSED EOL
Severity: high
Priority: medium
Keywords: Triaged
Target Milestone: z6
Type: Bug
Clones: 2182087 (view as bug list)
Bug Depends On: 2182087
Last Closed: 2024-03-12 15:10:13 UTC

Description ggrimaux 2023-03-17 11:25:37 UTC
Description of problem:
openvswitch (ovs-vswitchd) segfaults randomly. It has happened twice, on two different compute nodes, within a few days.

Here's one example:
Mar 17 09:51:38 cmp27 kernel: pmd327[5838]: segfault at 7f8da1104000 ip 000055f487a46bc9 sp 00007f8da0837cb0 error 4 in ovs-vswitchd[55f487770000+617000]
Mar 17 09:51:40 cmp27 systemd: ovs-vswitchd.service: main process exited, code=killed, status=11/SEGV
Mar 17 09:51:40 cmp27 systemd: Unit ovs-vswitchd.service entered failed state.
Mar 17 09:51:40 cmp27 systemd: ovs-vswitchd.service failed.
Mar 17 09:51:40 cmp27 systemd: ovs-vswitchd.service holdoff time over, scheduling restart.
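The kernel segfault line above already encodes where in ovs-vswitchd the pmd thread crashed: `ip` is the faulting instruction address and `ovs-vswitchd[55f487770000+617000]` gives the mapping base, so their difference is the offset inside the binary. A minimal sketch of resolving it (the `addr2line` step is an assumption that the matching openvswitch2.11 debuginfo package is installed, e.g. via `debuginfo-install`):

```shell
# Addresses copied from the kernel log line:
#   segfault at ... ip 000055f487a46bc9 ... in ovs-vswitchd[55f487770000+617000]
ip=0x55f487a46bc9
base=0x55f487770000
# File offset of the faulting instruction = ip - mapping base
offset=$(printf '0x%x' $(( ip - base )))
echo "faulting offset in ovs-vswitchd: $offset"
# With debuginfo installed, map the offset to a function and source line:
#   addr2line -f -C -e /usr/sbin/ovs-vswitchd "$offset"
```

For this crash the offset works out to 0x2d6bc9; resolving it against the exact openvswitch2.11-2.11.3-86.el7fdp build would identify the crashing function.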

From ovs-vswitchd.log:
2023-03-17T04:21:38.743Z|10591967|netdev_dpdk|INFO|State of queue 0 ( tx_qid 0 ) of vhost device '/var/lib/vhost_sockets/vhu78be832d-6b' changed to 'enabled'
2023-03-17T04:21:38.743Z|10591968|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
2023-03-17T04:21:38.743Z|10591969|dpdk|INFO|VHOST_CONFIG: vring kick idx:1 file:232
2023-03-17T04:21:38.743Z|10591970|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591971|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 1
2023-03-17T04:21:38.743Z|10591972|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
2023-03-17T04:21:38.743Z|10591973|dpdk|INFO|VHOST_CONFIG: vring call idx:1 file:240
2023-03-17T04:21:38.743Z|10591974|dpdk|INFO|VHOST_CONFIG: virtio is now ready for processing.
2023-03-17T04:21:38.743Z|10591975|netdev_dpdk|INFO|vHost Device '/var/lib/vhost_sockets/vhu25fa1368-ed' has been added on numa node 1
2023-03-17T04:21:38.743Z|10591976|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591977|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 0
2023-03-17T04:21:38.743Z|10591978|netdev_dpdk|INFO|State of queue 0 ( tx_qid 0 ) of vhost device '/var/lib/vhost_sockets/vhu78be832d-6b' changed to 'enabled'
2023-03-17T04:21:38.743Z|10591979|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591980|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 0
2023-03-17T04:21:38.743Z|10591981|netdev_dpdk|INFO|State of queue 0 ( tx_qid 0 ) of vhost device '/var/lib/vhost_sockets/vhu25fa1368-ed' changed to 'enabled'
2023-03-17T04:21:38.743Z|10591982|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591983|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 1
2023-03-17T04:21:38.743Z|10591984|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591985|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 1
2023-03-17T04:21:38.743Z|10591986|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
2023-03-17T04:21:38.743Z|10591987|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
2023-03-17T04:21:38.743Z|10591988|dpdk|INFO|VHOST_CONFIG: (6) memory regions not changed
2023-03-17T04:21:38.743Z|10591989|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
2023-03-17T04:21:38.743Z|10591990|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
2023-03-17T04:21:38.743Z|10591991|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
2023-03-17T04:21:38.743Z|10591992|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
2023-03-17T04:21:38.743Z|10591993|dpdk|INFO|VHOST_CONFIG: vring kick idx:0 file:242
2023-03-17T04:21:38.743Z|10591994|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
2023-03-17T04:21:38.743Z|10591995|dpdk|INFO|VHOST_CONFIG: vring call idx:0 file:243
2023-03-17T04:21:38.743Z|10591996|dpdk|INFO|VHOST_CONFIG: virtio is now ready for processing.
2023-03-17T04:21:41.154Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2023-03-17T04:21:41.167Z|00002|ovs_numa|INFO|Discovered 40 CPU cores on NUMA node 0
2023-03-17T04:21:41.167Z|00003|ovs_numa|INFO|Discovered 40 CPU cores on NUMA node 1
2023-03-17T04:21:41.167Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 80 CPU cores
2023-03-17T04:21:41.167Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-03-17T04:21:41.167Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2023-03-17T04:21:41.175Z|00007|dpdk|INFO|Using DPDK 18.11.9


From ovsdb-server.log:
2023-03-17T02:57:14.674Z|19865|jsonrpc|WARN|unix#35315: receive error: Connection reset by peer
2023-03-17T02:57:14.674Z|19866|reconnect|WARN|unix#35315: connection dropped (Connection reset by peer)
2023-03-17T04:21:41.006Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2023-03-17T04:21:41.025Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.11.6
2023-03-17T04:21:41.035Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2023-03-17T04:21:41.035Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2023-03-17T04:21:41.049Z|00005|jsonrpc|WARN|unix#1: receive error: Connection reset by peer
2023-03-17T04:21:41.049Z|00006|reconnect|WARN|unix#1: connection dropped (Connection reset by peer)
2023-03-17T04:21:51.031Z|00007|memory|INFO|7076 kB peak resident set size after 10.0 seconds
2023-03-17T04:21:51.031Z|00008|memory|INFO|cells:7492 json-caches:1 monitors:3 sessions:2
2023-03-17T04:21:52.835Z|00009|jsonrpc|WARN|unix#6: receive error: Connection reset by peer
2023-03-17T04:21:52.836Z|00010|reconnect|WARN|unix#6: connection dropped (Connection reset by peer)


And inside openvswitch-agent.log:
2023-03-17 09:51:36.506 44188 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0x7173259c,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) result [OFPFlowStatsReply(body=[OFPFlowStats(byte_count=0,cookie=15396433291096763469L,duration_nsec=132000000,duration_sec=10830417,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],length=56,match=OFPMatch(oxm_fields={}),packet_count=0,priority=0,table_id=23)],flags=0,type=1)] _send_msg /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:98
2023-03-17 09:51:36.509 44188 DEBUG neutron.agent.linux.utils [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Found cmdline ['ovsdb-client', 'monitor', 'tcp:127.0.0.1:6640', 'Interface', 'name,ofport,external_ids', '--format=json'] for process with PID 46883. get_cmdline_from_pid /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:329
2023-03-17 09:51:36.510 44188 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Agent rpc_loop - iteration:5413269 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.014
2023-03-17 09:51:38.497 44188 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Agent rpc_loop - iteration:5413270 started
2023-03-17 09:51:38.502 44188 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0x7173259e,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) result [OFPFlowStatsReply(body=[OFPFlowStats(byte_count=0,cookie=15396433291096763469L,duration_nsec=128000000,duration_sec=10830419,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],length=56,match=OFPMatch(oxm_fields={}),packet_count=0,priority=0,table_id=23)],flags=0,type=1)] _send_msg /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:98
2023-03-17 09:51:38.505 44188 DEBUG neutron.agent.linux.utils [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Found cmdline ['ovsdb-client', 'monitor', 'tcp:127.0.0.1:6640', 'Interface', 'name,ofport,external_ids', '--format=json'] for process with PID 46883. get_cmdline_from_pid /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:329
2023-03-17 09:51:38.506 44188 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Agent rpc_loop - iteration:5413270 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.009
2023-03-17 09:51:40.498 44188 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Agent rpc_loop - iteration:5413271 started
2023-03-17 09:51:40.763 44188 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: ovsdb-client: tcp:127.0.0.1:6640: Open_vSwitch database was removed
2023-03-17 09:51:40.764 44188 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: None
2023-03-17 09:51:40.764 44188 DEBUG neutron.agent.linux.async_process [-] Halting async process [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json] in response to an error. _handle_process_error /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:217
2023-03-17 09:51:40.765 44188 DEBUG neutron.agent.linux.utils [-] Running command: ['kill', '-9', '46883'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
2023-03-17 09:51:40.784 44188 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: ovsdb-client: tcp:127.0.0.1:6640: Open_vSwitch database was removed
2023-03-17 09:51:40.785 44188 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: None _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:260
2023-03-17 09:51:40.786 44188 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: None _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:260
2023-03-17 09:51:40.786 44188 DEBUG neutron.agent.linux.async_process [-] Halting async process [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json] in response to an error. _handle_process_error /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:217
2023-03-17 09:51:40.786 44188 DEBUG neutron.agent.linux.utils [-] Running command: ['kill', '-9', '46885'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
2023-03-17 09:51:40.805 44188 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: None

We need to identify why this suddenly started happening.
We have a sosreport from a compute node where it happened (0040).
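Since sosreports from the affected computes are available, the same kernel signature can be counted per node to confirm how many crashes each one has seen. A minimal sketch (the helper name and the extracted-sosreport path in the usage comment are illustrative; sosreports store the kernel log under var/log/messages):

```shell
# Hypothetical helper: count ovs-vswitchd segfaults recorded in a kernel log
# collected by sosreport.
count_ovs_segfaults() {
    if [ -r "$1" ]; then
        # grep -c prints the match count (0 when there are none)
        grep -c 'segfault.*in ovs-vswitchd\[' "$1" || true
    else
        echo 0
    fi
}

# Usage against an extracted sosreport, e.g.:
#   count_ovs_segfaults sosreport-cmp27/var/log/messages
```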

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 13.0.16 (Queens)
rhosp13_openstack-neutron-openvswitch-agent   13.0-160
openstack-neutron-openvswitch-12.1.1-42.1.el7ost.noarch
openvswitch2.11-2.11.3-86.el7fdp.x86_64

How reproducible:
Happens randomly; has already happened on 2 nodes.

Steps to Reproduce:
1. None

Actual results:
Processes restart unexpectedly (ovs-vswitchd, ovsdb-server, openvswitch-agent).

Expected results:
No unexpected restarts.

Additional info:
A sosreport from the affected compute node is available.