Bug 2179328
| Summary: | [OSP13] openvswitch segfault on multiple nodes |
|---|---|
| Product: | Red Hat OpenStack |
| Reporter: | ggrimaux |
| Component: | openvswitch |
| Assignee: | RHOSP:NFV_Eng <rhosp-nfv-int> |
| Status: | CLOSED EOL |
| QA Contact: | Eran Kuris <ekuris> |
| Severity: | high |
| Docs Contact: | |
| Priority: | medium |
| Version: | 13.0 (Queens) |
| CC: | apevec, chrisw, fleitner, jlibosva |
| Target Milestone: | z6 |
| Keywords: | Triaged |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | If docs needed, set a value |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| : | 2182087 (view as bug list) |
| Environment: | |
| Last Closed: | 2024-03-12 15:10:13 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 2182087 |
| Bug Blocks: | |
Description of problem:

openvswitch (ovs-vswitchd) segfaults randomly. It has happened twice, on two different compute nodes, within a few days. Here is one example:

```
Mar 17 09:51:38 cmp27 kernel: pmd327[5838]: segfault at 7f8da1104000 ip 000055f487a46bc9 sp 00007f8da0837cb0 error 4 in ovs-vswitchd[55f487770000+617000]
Mar 17 09:51:40 cmp27 systemd: ovs-vswitchd.service: main process exited, code=killed, status=11/SEGV
Mar 17 09:51:40 cmp27 systemd: Unit ovs-vswitchd.service entered failed state.
Mar 17 09:51:40 cmp27 systemd: ovs-vswitchd.service failed.
Mar 17 09:51:40 cmp27 systemd: ovs-vswitchd.service holdoff time over, scheduling restart.
```

From ovs-vswitchd.log:

```
2023-03-17T04:21:38.743Z|10591967|netdev_dpdk|INFO|State of queue 0 ( tx_qid 0 ) of vhost device '/var/lib/vhost_sockets/vhu78be832d-6b' changed to 'enabled'
2023-03-17T04:21:38.743Z|10591968|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
2023-03-17T04:21:38.743Z|10591969|dpdk|INFO|VHOST_CONFIG: vring kick idx:1 file:232
2023-03-17T04:21:38.743Z|10591970|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591971|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 1
2023-03-17T04:21:38.743Z|10591972|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
2023-03-17T04:21:38.743Z|10591973|dpdk|INFO|VHOST_CONFIG: vring call idx:1 file:240
2023-03-17T04:21:38.743Z|10591974|dpdk|INFO|VHOST_CONFIG: virtio is now ready for processing.
2023-03-17T04:21:38.743Z|10591975|netdev_dpdk|INFO|vHost Device '/var/lib/vhost_sockets/vhu25fa1368-ed' has been added on numa node 1
2023-03-17T04:21:38.743Z|10591976|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591977|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 0
2023-03-17T04:21:38.743Z|10591978|netdev_dpdk|INFO|State of queue 0 ( tx_qid 0 ) of vhost device '/var/lib/vhost_sockets/vhu78be832d-6b' changed to 'enabled'
2023-03-17T04:21:38.743Z|10591979|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591980|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 0
2023-03-17T04:21:38.743Z|10591981|netdev_dpdk|INFO|State of queue 0 ( tx_qid 0 ) of vhost device '/var/lib/vhost_sockets/vhu25fa1368-ed' changed to 'enabled'
2023-03-17T04:21:38.743Z|10591982|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591983|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 1
2023-03-17T04:21:38.743Z|10591984|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
2023-03-17T04:21:38.743Z|10591985|dpdk|INFO|VHOST_CONFIG: set queue enable: 1 to qp idx: 1
2023-03-17T04:21:38.743Z|10591986|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
2023-03-17T04:21:38.743Z|10591987|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
2023-03-17T04:21:38.743Z|10591988|dpdk|INFO|VHOST_CONFIG: (6) memory regions not changed
2023-03-17T04:21:38.743Z|10591989|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
2023-03-17T04:21:38.743Z|10591990|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
2023-03-17T04:21:38.743Z|10591991|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
2023-03-17T04:21:38.743Z|10591992|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
2023-03-17T04:21:38.743Z|10591993|dpdk|INFO|VHOST_CONFIG: vring kick idx:0 file:242
2023-03-17T04:21:38.743Z|10591994|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
2023-03-17T04:21:38.743Z|10591995|dpdk|INFO|VHOST_CONFIG: vring call idx:0 file:243
2023-03-17T04:21:38.743Z|10591996|dpdk|INFO|VHOST_CONFIG: virtio is now ready for processing.
2023-03-17T04:21:41.154Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2023-03-17T04:21:41.167Z|00002|ovs_numa|INFO|Discovered 40 CPU cores on NUMA node 0
2023-03-17T04:21:41.167Z|00003|ovs_numa|INFO|Discovered 40 CPU cores on NUMA node 1
2023-03-17T04:21:41.167Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 80 CPU cores
2023-03-17T04:21:41.167Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-03-17T04:21:41.167Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2023-03-17T04:21:41.175Z|00007|dpdk|INFO|Using DPDK 18.11.9
```

From ovsdb-server.log:

```
2023-03-17T02:57:14.674Z|19865|jsonrpc|WARN|unix#35315: receive error: Connection reset by peer
2023-03-17T02:57:14.674Z|19866|reconnect|WARN|unix#35315: connection dropped (Connection reset by peer)
2023-03-17T04:21:41.006Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2023-03-17T04:21:41.025Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.11.6
2023-03-17T04:21:41.035Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2023-03-17T04:21:41.035Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2023-03-17T04:21:41.049Z|00005|jsonrpc|WARN|unix#1: receive error: Connection reset by peer
2023-03-17T04:21:41.049Z|00006|reconnect|WARN|unix#1: connection dropped (Connection reset by peer)
2023-03-17T04:21:51.031Z|00007|memory|INFO|7076 kB peak resident set size after 10.0 seconds
2023-03-17T04:21:51.031Z|00008|memory|INFO|cells:7492 json-caches:1 monitors:3 sessions:2
2023-03-17T04:21:52.835Z|00009|jsonrpc|WARN|unix#6: receive error: Connection reset by peer
2023-03-17T04:21:52.836Z|00010|reconnect|WARN|unix#6: connection dropped (Connection reset by peer)
```

And inside openvswitch-agent.log:

```
2023-03-17 09:51:36.506 44188 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0x7173259c,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) result [OFPFlowStatsReply(body=[OFPFlowStats(byte_count=0,cookie=15396433291096763469L,duration_nsec=132000000,duration_sec=10830417,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],length=56,match=OFPMatch(oxm_fields={}),packet_count=0,priority=0,table_id=23)],flags=0,type=1)] _send_msg /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:98
2023-03-17 09:51:36.509 44188 DEBUG neutron.agent.linux.utils [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Found cmdline ['ovsdb-client', 'monitor', 'tcp:127.0.0.1:6640', 'Interface', 'name,ofport,external_ids', '--format=json'] for process with PID 46883. get_cmdline_from_pid /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:329
2023-03-17 09:51:36.510 44188 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Agent rpc_loop - iteration:5413269 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.014
2023-03-17 09:51:38.497 44188 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Agent rpc_loop - iteration:5413270 started
2023-03-17 09:51:38.502 44188 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0x7173259e,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) result [OFPFlowStatsReply(body=[OFPFlowStats(byte_count=0,cookie=15396433291096763469L,duration_nsec=128000000,duration_sec=10830419,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],length=56,match=OFPMatch(oxm_fields={}),packet_count=0,priority=0,table_id=23)],flags=0,type=1)] _send_msg /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:98
2023-03-17 09:51:38.505 44188 DEBUG neutron.agent.linux.utils [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Found cmdline ['ovsdb-client', 'monitor', 'tcp:127.0.0.1:6640', 'Interface', 'name,ofport,external_ids', '--format=json'] for process with PID 46883. get_cmdline_from_pid /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:329
2023-03-17 09:51:38.506 44188 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Agent rpc_loop - iteration:5413270 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.009
2023-03-17 09:51:40.498 44188 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fbe39231-bd63-4f11-894d-599c60f47c56 - - - - -] Agent rpc_loop - iteration:5413271 started
2023-03-17 09:51:40.763 44188 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: ovsdb-client: tcp:127.0.0.1:6640: Open_vSwitch database was removed
2023-03-17 09:51:40.764 44188 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: None
2023-03-17 09:51:40.764 44188 DEBUG neutron.agent.linux.async_process [-] Halting async process [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json] in response to an error. _handle_process_error /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:217
2023-03-17 09:51:40.765 44188 DEBUG neutron.agent.linux.utils [-] Running command: ['kill', '-9', '46883'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
2023-03-17 09:51:40.784 44188 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: ovsdb-client: tcp:127.0.0.1:6640: Open_vSwitch database was removed
2023-03-17 09:51:40.785 44188 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: None _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:260
2023-03-17 09:51:40.786 44188 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: None _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:260
2023-03-17 09:51:40.786 44188 DEBUG neutron.agent.linux.async_process [-] Halting async process [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json] in response to an error. _handle_process_error /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:217
2023-03-17 09:51:40.786 44188 DEBUG neutron.agent.linux.utils [-] Running command: ['kill', '-9', '46885'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
2023-03-17 09:51:40.805 44188 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: None
```

We need to identify why this has suddenly started happening. We have a sosreport from a compute node where it happened (0040).

Version-Release number of selected component (if applicable):

Red Hat OpenStack Platform release 13.0.16 (Queens)
rhosp13_openstack-neutron-openvswitch-agent 13.0-160
openstack-neutron-openvswitch-12.1.1-42.1.el7ost.noarch
openvswitch2.11-2.11.3-86.el7fdp.x86_64

How reproducible:

Happens randomly; it has already happened on 2 nodes.

Steps to Reproduce:
1. None (no known reproducer).

Actual results:

The OVS and Neutron processes restart (ovs-vswitchd, ovsdb-server, openvswitch-agent).

Expected results:

No sudden restarts.

Additional info:

A sosreport is available.
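As a triage note (not part of the original report): the kernel segfault line already pins down the faulting instruction, since `ip` is the absolute fault address and `ovs-vswitchd[55f487770000+617000]` gives the base of the mapping, so the offset inside the binary can be computed directly:

```python
# Compute the offset of the faulting instruction inside ovs-vswitchd
# from the kernel segfault line quoted above.
ip = 0x55F487A46BC9    # "ip" field: absolute address of the faulting instruction
base = 0x55F487770000  # mapping base from "ovs-vswitchd[55f487770000+617000]"

offset = ip - base
print(hex(offset))  # prints 0x2d6bc9
```

With the matching debuginfo package installed, a command along the lines of `addr2line -f -e /usr/sbin/ovs-vswitchd 0x2d6bc9` (binary path assumed) should resolve this offset to the crashing function, which would help confirm whether the crash sits in the vhost/DPDK datapath that the log excerpt points at.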