RHOSP 10 compute nodes with ovs-dpdk ports hit an issue that causes frequent restarts of the neutron openvswitch agent. The issue is a semi-consistent OVS crash during physical link failover. Coredumps were generated during the latest crash.

Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfi'.

Packages:
~~~
openvswitch-2.9.0-122.el7fdp.x86_64
openvswitch-debuginfo-2.9.0-122.el7fdp.x86_64
~~~

Backtrace:
~~~
#0  0x00007f92bf906377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f92bf907a68 in __GI_abort () at abort.c:90
#2  0x0000563dfe76dad5 in dp_packet_resize__ (b=b@entry=0x2d0bb2940, new_headroom=new_headroom@entry=64, new_tailroom=<optimized out>) at ../lib/dp-packet.c:264
#3  0x0000563dfe76de1f in dp_packet_prealloc_headroom (b=b@entry=0x2d0bb2940, size=size@entry=50) at ../lib/dp-packet.c:294
#4  0x0000563dfe76e351 in dp_packet_push_uninit (b=b@entry=0x2d0bb2940, size=size@entry=50) at ../lib/dp-packet.c:406
#5  0x0000563dfe8244cc in netdev_tnl_push_ip_header (packet=packet@entry=0x2d0bb2940, header=0x7f9288287690, size=50, ip_tot_size=ip_tot_size@entry=0x7f92b970d3d4) at ../lib/netdev-native-tnl.c:154
#6  0x0000563dfe8245ca in netdev_tnl_push_udp_header (packet=0x2d0bb2940, data=<optimized out>) at ../lib/netdev-native-tnl.c:224
#7  0x0000563dfe7a03f6 in netdev_push_header (netdev=0x563e00ca61a0, batch=batch@entry=0x7f92b970df80, data=data@entry=0x7f9288287680) at ../lib/netdev.c:858
#8  0x0000563dfe77a0c2 in push_tnl_action (batch=0x7f92b970df80, attr=0x7f92b970df80, pmd=0x7f92b9711010) at ../lib/dpif-netdev.c:6134
#9  dp_execute_cb (aux_=aux_@entry=0x7f92b970def0, packets_=packets_@entry=0x7f92b970df80, a=a@entry=0x7f928828767c, may_steal=false) at ../lib/dpif-netdev.c:6225
#10 0x0000563dfe7a93d8 in odp_execute_actions (dp=dp@entry=0x7f92b970def0, batch=batch@entry=0x7f92b970df80, steal=steal@entry=true, actions=<optimized out>, actions_len=<optimized out>, dp_execute_action=dp_execute_action@entry=0x563dfe779bb0 <dp_execute_cb>) at ../lib/odp-execute.c:717
#11 0x0000563dfe7779a9 in dp_netdev_execute_actions (actions_len=<optimized out>, actions=<optimized out>, flow=0x7f92b970e490, may_steal=true, packets=0x7f92b970df80, pmd=0x7f92b9711010) at ../lib/dpif-netdev.c:6496
#12 handle_packet_upcall (put_actions=0x7f92b970df40, actions=0x7f92b970df00, key=0x7f92b970f380, packet=0x2d0bb2940, pmd=0x7f92b9711010) at ../lib/dpif-netdev.c:5788
#13 fast_path_processing (pmd=pmd@entry=0x7f92b9711010, packets_=packets_@entry=0x7f92b970f750, keys=keys@entry=0x7f92b970f370, flow_map=flow_map@entry=0x7f92b970f220, index_map=index_map@entry=0x7f92b970f210 "", in_port=<optimized out>) at ../lib/dpif-netdev.c:5878
#14 0x0000563dfe7787a1 in dp_netdev_input__ (pmd=pmd@entry=0x7f92b9711010, packets=packets@entry=0x7f92b970f750, md_is_valid=md_is_valid@entry=false, port_no=port_no@entry=2) at ../lib/dpif-netdev.c:5966
#15 0x0000563dfe778f76 in dp_netdev_input (port_no=2, packets=0x7f92b970f750, pmd=0x7f92b9711010) at ../lib/dpif-netdev.c:6004
#16 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7f92b9711010, rxq=0x563e00d09e20, port_no=2) at ../lib/dpif-netdev.c:3798
#17 0x0000563dfe77934a in pmd_thread_main (f_=<optimized out>) at ../lib/dpif-netdev.c:4680
#18 0x0000563dfe7f749f in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:354
#19 0x00007f92c05d0ea5 in start_thread (arg=0x7f92b9710700) at pthread_create.c:307
#20 0x00007f92bf9ce8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
~~~

This is very similar to an old discussion:
- https://mail.openvswitch.org/pipermail/ovs-dev/2018-May/346911.html

I couldn't find the source of OVS 2.9 on https://code.engineering.redhat.com/gerrit/
The bug you mention has been fixed in 2.9.0-127, along with many additional fixes (BZ1770408). I would suggest moving to the latest 2.9 build, -130, which has even more potential crashes fixed. Please let me know if you still observe crashes with -130. If not, please close the BZ.
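For anyone checking a batch of compute nodes, a minimal sketch of comparing an installed package string against the fixed release might look like the following. It assumes the package string has the same shape as the one in the report (`openvswitch-2.9.0-<release>.el7fdp.x86_64`); the function names and the `FIXED_RELEASE` constant are hypothetical helpers, not part of any Red Hat tooling.

```python
import re

# Per the comment above, the crash is fixed starting with 2.9.0-127.
FIXED_RELEASE = 127


def release_number(nvr: str) -> int:
    """Return the numeric release from a string such as
    'openvswitch-2.9.0-122.el7fdp.x86_64' (hypothetical helper)."""
    match = re.match(r"^openvswitch-2\.9\.0-(\d+)\b", nvr)
    if match is None:
        raise ValueError(f"not an openvswitch 2.9.0 package string: {nvr!r}")
    return int(match.group(1))


def has_crash_fix(nvr: str) -> bool:
    """True if the package release is at or above the fixed release."""
    return release_number(nvr) >= FIXED_RELEASE


if __name__ == "__main__":
    # The first string is from the report; the second assumes -130 uses
    # the same dist tag and architecture suffix.
    for pkg in ("openvswitch-2.9.0-122.el7fdp.x86_64",
                "openvswitch-2.9.0-130.el7fdp.x86_64"):
        print(pkg, "fixed" if has_crash_fix(pkg) else "needs update")
```

The input string can be taken from `rpm -q openvswitch` on each node. Other packages, such as openvswitch-debuginfo, are deliberately rejected by the pattern.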
Sounds good, I have requested the client to update their environment and rerun testing to validate. Will keep you posted.
Do we have any update on this BZ from the customer?
Hello,

Yes, -130 appears to have fixed it after some preliminary testing. We are, however, running into issues with SELinux, which can't be run in enforcing mode for the time being. I *think* this may relate to BZ 1759695, since this update introduced a change in the runtime directory, so I've reopened that bug and added the relevant audit info there.

Thanks,
Gabriel Diotte
(In reply to gdiotte from comment #5)
> Hello,
>
> Yes, the -130 appears to have fixed it after some preliminary testing. We
> are however running into issues with SELINUX which can't be run into
> enforcing for the time being. I *think* this may relate to BZ 1759695 since
> this update introduced a change in their runtime directory so I've reopened
> that bug and added the relevant audit info there.
>
> Thanks,
> Gabriel Diotte

Thanks, Gabriel, for the update. I'll close this BZ as a duplicate of BZ1770408, as that fixed the crash. The remaining issue can be dealt with through BZ1759695.

*** This bug has been marked as a duplicate of bug 1770408 ***