1840436 – ovs tnl push crash

Keywords:
Status:	CLOSED DUPLICATE of bug 1770408
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openvswitch
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Eelco Chaudron
QA Contact:	Eran Kuris
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-05-26 23:21 UTC by Marc Methot
Modified:	2023-10-06 20:16 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-06-08 06:53:08 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	5111851	0	None	None	None	2020-05-27 15:57:44 UTC

Description Marc Methot 2020-05-26 23:21:01 UTC

RHOSP 10 compute nodes with ovs-dpdk ports hit an issue that causes frequent restarts of the neutron openvswitch agent.

The issue is a semi-consistent ovs crash during physical link failover.
Coredumps were generated during the latest crash.
Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfi'.


Packages:
~~~
openvswitch-2.9.0-122.el7fdp.x86_64
openvswitch-debuginfo-2.9.0-122.el7fdp.x86_64
~~~

bt:
~~~
#0  0x00007f92bf906377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f92bf907a68 in __GI_abort () at abort.c:90
#2  0x0000563dfe76dad5 in dp_packet_resize__ (b=b@entry=0x2d0bb2940, new_headroom=new_headroom@entry=64, new_tailroom=<optimized out>) at ../lib/dp-packet.c:264
#3  0x0000563dfe76de1f in dp_packet_prealloc_headroom (b=b@entry=0x2d0bb2940, size=size@entry=50) at ../lib/dp-packet.c:294
#4  0x0000563dfe76e351 in dp_packet_push_uninit (b=b@entry=0x2d0bb2940, size=size@entry=50) at ../lib/dp-packet.c:406
#5  0x0000563dfe8244cc in netdev_tnl_push_ip_header (packet=packet@entry=0x2d0bb2940, header=0x7f9288287690, size=50, 
    ip_tot_size=ip_tot_size@entry=0x7f92b970d3d4) at ../lib/netdev-native-tnl.c:154
#6  0x0000563dfe8245ca in netdev_tnl_push_udp_header (packet=0x2d0bb2940, data=<optimized out>) at ../lib/netdev-native-tnl.c:224
#7  0x0000563dfe7a03f6 in netdev_push_header (netdev=0x563e00ca61a0, batch=batch@entry=0x7f92b970df80, data=data@entry=0x7f9288287680) at ../lib/netdev.c:858
#8  0x0000563dfe77a0c2 in push_tnl_action (batch=0x7f92b970df80, attr=0x7f92b970df80, pmd=0x7f92b9711010) at ../lib/dpif-netdev.c:6134
#9  dp_execute_cb (aux_=aux_@entry=0x7f92b970def0, packets_=packets_@entry=0x7f92b970df80, a=a@entry=0x7f928828767c, may_steal=false)
    at ../lib/dpif-netdev.c:6225
#10 0x0000563dfe7a93d8 in odp_execute_actions (dp=dp@entry=0x7f92b970def0, batch=batch@entry=0x7f92b970df80, steal=steal@entry=true, actions=<optimized out>, 
    actions_len=<optimized out>, dp_execute_action=dp_execute_action@entry=0x563dfe779bb0 <dp_execute_cb>) at ../lib/odp-execute.c:717
#11 0x0000563dfe7779a9 in dp_netdev_execute_actions (actions_len=<optimized out>, actions=<optimized out>, flow=0x7f92b970e490, may_steal=true, 
    packets=0x7f92b970df80, pmd=0x7f92b9711010) at ../lib/dpif-netdev.c:6496
#12 handle_packet_upcall (put_actions=0x7f92b970df40, actions=0x7f92b970df00, key=0x7f92b970f380, packet=0x2d0bb2940, pmd=0x7f92b9711010)
    at ../lib/dpif-netdev.c:5788
#13 fast_path_processing (pmd=pmd@entry=0x7f92b9711010, packets_=packets_@entry=0x7f92b970f750, keys=keys@entry=0x7f92b970f370, 
    flow_map=flow_map@entry=0x7f92b970f220, index_map=index_map@entry=0x7f92b970f210 "", in_port=<optimized out>) at ../lib/dpif-netdev.c:5878
#14 0x0000563dfe7787a1 in dp_netdev_input__ (pmd=pmd@entry=0x7f92b9711010, packets=packets@entry=0x7f92b970f750, md_is_valid=md_is_valid@entry=false, 
    port_no=port_no@entry=2) at ../lib/dpif-netdev.c:5966
#15 0x0000563dfe778f76 in dp_netdev_input (port_no=2, packets=0x7f92b970f750, pmd=0x7f92b9711010) at ../lib/dpif-netdev.c:6004
#16 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7f92b9711010, rxq=0x563e00d09e20, port_no=2) at ../lib/dpif-netdev.c:3798
#17 0x0000563dfe77934a in pmd_thread_main (f_=<optimized out>) at ../lib/dpif-netdev.c:4680
#18 0x0000563dfe7f749f in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:354
#19 0x00007f92c05d0ea5 in start_thread (arg=0x7f92b9710700) at pthread_create.c:307
#20 0x00007f92bf9ce8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
~~~

This is very similar to an old discussion:
- https://mail.openvswitch.org/pipermail/ovs-dev/2018-May/346911.html
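To check whether another core shows the same crash, the frame sequence can be compared mechanically rather than by eye. The following is a minimal sketch, not an established tool: the helper names `frame_functions` and `has_signature` are made up for illustration, and the regular expression only assumes gdb frame lines shaped like the ones in the trace above (with or without the `0x... in` address part).

```python
import re

# Start of a gdb frame line; the "0x... in" address part is optional
# because inlined frames (such as #9 above) are printed without it.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?([A-Za-z_][A-Za-z0-9_]*) \(")


def frame_functions(backtrace):
    """Return the function names in frame order, skipping continuation lines."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(2))
    return names


def has_signature(backtrace, signature):
    """Return True if `signature` appears as consecutive frames in the backtrace."""
    names = frame_functions(backtrace)
    n = len(signature)
    return any(names[i:i + n] == signature for i in range(len(names) - n + 1))


# A few frames copied from the trace in this report.
SAMPLE = """\
#1  0x00007f92bf907a68 in __GI_abort () at abort.c:90
#2  0x0000563dfe76dad5 in dp_packet_resize__ (b=b@entry=0x2d0bb2940, new_headroom=new_headroom@entry=64, new_tailroom=<optimized out>) at ../lib/dp-packet.c:264
#3  0x0000563dfe76de1f in dp_packet_prealloc_headroom (b=b@entry=0x2d0bb2940, size=size@entry=50) at ../lib/dp-packet.c:294
#4  0x0000563dfe76e351 in dp_packet_push_uninit (b=b@entry=0x2d0bb2940, size=size@entry=50) at ../lib/dp-packet.c:406
#5  0x0000563dfe8244cc in netdev_tnl_push_ip_header (packet=packet@entry=0x2d0bb2940, header=0x7f9288287690, size=50,
    ip_tot_size=ip_tot_size@entry=0x7f92b970d3d4) at ../lib/netdev-native-tnl.c:154
#9  dp_execute_cb (aux_=aux_@entry=0x7f92b970def0, packets_=packets_@entry=0x7f92b970df80, a=a@entry=0x7f928828767c, may_steal=false)
"""

# The abort-on-resize path seen in this report, innermost frame first.
CRASH_SIGNATURE = [
    "__GI_abort",
    "dp_packet_resize__",
    "dp_packet_prealloc_headroom",
    "dp_packet_push_uninit",
]

if __name__ == "__main__":
    print(has_signature(SAMPLE, CRASH_SIGNATURE))
```

A matching frame sequence only suggests the same failure path; it does not prove the same root cause.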

I couldn't find the source for ovs 2.9 on https://code.engineering.redhat.com/gerrit/

Comment 2 Eelco Chaudron 2020-05-27 13:38:01 UTC

The bug you mention has been fixed in 2.9.0-127, along with many other fixes (BZ1770408). I would suggest moving to the latest 2.9 build, -130, which fixes even more potential crashes.

Please let me know if you still observe crashes with -130. If not, please close the BZ.
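
For anyone triaging several compute nodes, a rough check of the installed build against the releases named above could look like the sketch below. It is only an illustration: the function names and the `FIXED_RELEASE` / `RECOMMENDED_RELEASE` constants are hypothetical, the package strings for -127 and -130 are assumed to follow the same naming as the -122 package in the description, and comparing only the leading integer of the release field is a simplification of real RPM version comparison.

```python
import re

FIXED_RELEASE = 127        # first 2.9.0 build said to carry the fix
RECOMMENDED_RELEASE = 130  # latest 2.9 build suggested in this comment

# Matches e.g. "openvswitch-2.9.0-122.el7fdp.x86_64" (main package only).
NVR_RE = re.compile(r"^openvswitch-(\d+(?:\.\d+)*)-(\d+)\.")


def parse_release(package):
    """Split a package string into (version, leading release number)."""
    match = NVR_RE.match(package)
    if match is None:
        raise ValueError("unrecognised package string: %r" % package)
    return match.group(1), int(match.group(2))


def assess(package):
    """Classify a 2.9.0 build relative to the releases named in this bug."""
    version, release = parse_release(package)
    if version != "2.9.0":
        return "unknown"      # this bug only discusses 2.9.0 builds
    if release < FIXED_RELEASE:
        return "affected"     # predates the build said to contain the fix
    if release < RECOMMENDED_RELEASE:
        return "fixed"
    return "recommended"


if __name__ == "__main__":
    # Package string taken from the description of this bug.
    print(assess("openvswitch-2.9.0-122.el7fdp.x86_64"))
```

On a node, the input string would typically come from `rpm -q openvswitch`.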

Comment 3 Marc Methot 2020-05-27 14:24:19 UTC

Sounds good. I have asked the client to update their environment and rerun the tests to validate.
Will keep you posted.

Comment 4 Eelco Chaudron 2020-06-05 09:08:58 UTC

Do we have any update on this BZ from the customer?

Comment 5 Gabriel Diotte 2020-06-05 13:07:23 UTC

Hello,

Yes, the -130 build appears to have fixed it after some preliminary testing. We are, however, running into issues with SELinux, which can't be run in enforcing mode for the time being. I *think* this may relate to BZ 1759695, since this update introduced a change in their runtime directory, so I've reopened that bug and added the relevant audit info there.

Thanks,
Gabriel Diotte

Comment 6 Eelco Chaudron 2020-06-08 06:53:08 UTC

(In reply to gdiotte from comment #5)

Thanks, Gabriel, for the update. I'll close this BZ as a duplicate of BZ1770408, as the fix for that bug resolved the crash. The remaining issue can be dealt with through BZ1759695.

*** This bug has been marked as a duplicate of bug 1770408 ***
