Bug 2226809 - OpenShift 4.14 kernel panic IBM Power
Summary: OpenShift 4.14 kernel panic IBM Power
Keywords:
Status: CLOSED DUPLICATE of bug 2223310
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: RHEL 9.0
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Timothy Redaelli
QA Contact: Ping Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-26 15:52 UTC by Jeremy Poulin
Modified: 2023-08-28 17:07 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-28 17:07:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-3062 0 None None None 2023-07-26 15:54:00 UTC
Red Hat Issue Tracker OCPBUGS-15573 0 None None None 2023-07-26 16:00:53 UTC
Red Hat Issue Tracker RHEL-463 0 None None None 2023-07-26 16:00:53 UTC

Description Jeremy Poulin 2023-07-26 15:52:40 UTC
Description of problem:
I don't know if I routed this to the right component, but we are running end-to-end tests on OpenShift 4.14 using libvirt to create a 5-node cluster.

During deployment (or sometimes during the tests), one of the worker nodes will crash, causing a cascade of failures for the test run. We were able to catch one of these in this state and found this report in the virsh console logs:

[  823.876958] Kernel attempted to read user page (3fdb90000) - exploit attempt? (uid: 0)
[  823.878704] BUG: Unable to handle kernel data access on read at 0x3fdb90000
[  823.878971] Faulting instruction address: 0xc0080000024095bc
[  823.879249] Oops: Kernel access of bad area, sig: 11 [#1]
[  823.879442] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  823.879639] Modules linked in: xt_REDIRECT xt_addrtype veth xt_mark ipt_REJECT nf_reject_ipv4 xt_nat nft_chain_nat nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel tls xt_CT xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink_cttimeout nfnetlink rfkill openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay nvram ext4 mbcache jbd2 virtio_balloon drm ip_tables drm_panel_orientation_quirks xfs libcrc32c nvme_tcp nvme_fabrics nvme nvme_core nvme_common t10_pi vmx_crypto virtio_net net_failover failover virtio_console virtio_blk dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
[  823.881123] CPU: 3 PID: 24814 Comm: kworker/3:9 Not tainted 5.14.0-284.23.1.el9_2.ppc64le #1
[  823.881565] Workqueue: mld mld_ifc_work
[  823.881664] NIP:  c0080000024095bc LR: c008000002409614 CTR: c000000000c28080
[  823.882002] REGS: c0000003ffebf390 TRAP: 0300   Not tainted  (5.14.0-284.23.1.el9_2.ppc64le)
[  823.882470] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28002844  XER: 20040000
[  823.882725] CFAR: c00800000240962c DAR: 00000003fdb90000 DSISR: 40000000 IRQMASK: 3
[  823.882725] GPR00: c008000002409614 c0000003ffebf630 c008000002458100 0000000000000000
[  823.882725] GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000002
[  823.882725] GPR08: 0000000000000003 0000000000000000 00000003fdb90000 c0080000024263d8
[  823.882725] GPR12: c000000000c28080 c0000003ffffd480 c000000000184be8 0000000000000000
[  823.882725] GPR16: 0000000000000001 c000000002b33a00 c000000002b4c720 0000000000000000
[  823.882725] GPR20: 000000000000dd86 0000000000000000 000000000000a888 c0000000021e18a8
[  823.882725] GPR24: c0000003ffebfad0 c0000003ffebfac8 c00000000b8d2280 c00000000b8d2280
[  823.882725] GPR28: 0000000000000000 0000000000000000 c0000003ffebf6f0 c000000003b32400
[  823.883928] NIP [c0080000024095bc] ovs_dp_upcall+0xc4/0x240 [openvswitch]
[  823.884053] LR [c008000002409614] ovs_dp_upcall+0x11c/0x240 [openvswitch]
[  823.884188] Call Trace:
[  823.884243] [c0000003ffebf630] [c008000002409614] ovs_dp_upcall+0x11c/0x240 [openvswitch] (unreliable)
[  823.884445] [c0000003ffebf680] [c008000002409a18] ovs_dp_process_packet+0x1c0/0x2e0 [openvswitch]
[  823.884636] [c0000003ffebf750] [c00800000241dfe4] ovs_vport_receive+0x8c/0x130 [openvswitch]
[  823.884831] [c0000003ffebf960] [c00800000241f03c] netdev_port_receive+0xf4/0x280 [openvswitch]
[  823.885018] [c0000003ffebf9a0] [c00800000241f1fc] netdev_frame_hook+0x34/0x70 [openvswitch]
[  823.885188] [c0000003ffebf9c0] [c000000000c50168] __netif_receive_skb_core.constprop.0+0x388/0x1210
[  823.885377] [c0000003ffebfaa0] [c000000000c510bc] __netif_receive_skb_list_core+0xcc/0x380
[  823.885599] [c0000003ffebfb50] [c000000000c51cb8] netif_receive_skb_list_internal+0x278/0x3c0
[  823.885788] [c0000003ffebfbe0] [c000000000c52628] napi_complete_done+0x88/0x250
[  823.885956] [c0000003ffebfc20] [c008000002cf4b3c] veth_poll+0x124/0x234 [veth]
[  823.886133] [c0000003ffebfd40] [c000000000c52858] __napi_poll+0x68/0x300
[  823.886283] [c0000003ffebfdd0] [c000000000c530dc] net_rx_action+0x39c/0x460
[  823.886429] [c0000003ffebfe90] [c000000000f09abc] __do_softirq+0x15c/0x3e0
[  823.886570] [c0000003ffebff90] [c000000000017120] do_softirq_own_stack+0x40/0x60
[  823.886733] [c0000003149179f0] [c000000000154b74] do_softirq+0xa4/0xb0
[  823.886872] [c000000314917a20] [c000000000154c78] __local_bh_enable_ip+0xf8/0x120
[  823.887048] [c000000314917a40] [c000000000e3f748] ip6_finish_output2+0x228/0x6c0
[  823.887213] [c000000314917af0] [c000000000e84fb0] NF_HOOK.constprop.0+0x100/0x110
[  823.887366] [c000000314917b70] [c000000000e851cc] mld_sendpack+0x20c/0x360
[  823.887513] [c000000314917c40] [c000000000e87e40] mld_ifc_work+0x50/0x250
[  823.887653] [c000000314917c90] [c000000000177558] process_one_work+0x298/0x580
[  823.887828] [c000000314917d30] [c000000000177ee8] worker_thread+0xa8/0x620
[  823.887968] [c000000314917dc0] [c000000000184d04] kthread+0x124/0x130
[  823.888098] [c000000314917e10] [c00000000000cd64] ret_from_kernel_thread+0x5c/0x64
[  823.888244] Instruction dump:
[  823.888326] 5529063e 28090001 4181009c e92a0048 2c1d0000 e94d0030 7d095214 40820158
[  823.888492] 39000003 886d0932 7c684378 990d0932 <7d09502a> 39080001 7d09512a 4801ba89
[  823.888668] ---[ end trace e9a622ddc1d1d390 ]---
[  823.897877]
[  824.897947] Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):
4.14.0-0.nightly-ppc64le-2023-07-20-202654
RHCOS: https://releases-rhcos-art.apps.ocp-virt.prod.psi.redhat.com/?arch=ppc64le&release=414.92.202307182026-0&stream=prod%2Fstreams%2F4.14-9.2#414.92.202307182026-0

How reproducible:
We've set up a dummy CI job that delays the clean-up of a cluster to leave it in a broken state.

But right now, we can recreate this by deploying a cluster using the openshift installer built with libvirt support, and running the openshift-tests openshift/conformance/parallel test suite.


Actual results:
Nodes go into NotReady state. Inspecting the virsh domains shows that the node has been shut down, and the console reports a panic.


Expected results:
Nodes stay up for the full test suite.


Additional info:
We can probably get you access to the environment in question since it is test hardware.

Comment 1 Jeremy Poulin 2023-07-26 16:02:25 UTC
https://issues.redhat.com/browse/OCPBUGS-15573 is where we're tracking the regression in OpenShift. https://issues.redhat.com/browse/RHEL-463 is a potentially related issue we've recently hit on ARM involving MicroShift.

Comment 2 Prashanth Sundararaman 2023-07-26 22:07:54 UTC
Looks very similar to the bug seen on arm64 systems, as Jeremy pointed out.

@Eelco Chaudron - could it be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=2203263 ?

Comment 3 Flavio Leitner 2023-08-02 18:39:12 UTC
Hi,

We only have half of the problem signature with that backtrace, but it looks very likely to be the same case.
Can you see if it reproduces with kernel-5.14.0-284.26.1.el9_2 ?
That is the kernel with the fix from https://bugzilla.redhat.com/show_bug.cgi?id=2223310

Thanks,
fbl
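
[Editorial sketch] The check requested above boils down to comparing the panicking node's kernel release (5.14.0-284.23.1.el9_2 in the trace) against the fixed build (5.14.0-284.26.1.el9_2). A minimal, hypothetical helper for that comparison, not part of any Red Hat tooling:

```python
# Illustrative only: compare RHEL kernel release strings field by field.
# The helper names and parsing approach are assumptions for this sketch.
import re

FIXED = "5.14.0-284.26.1.el9_2"

def release_key(release: str) -> tuple:
    """Split a release like '5.14.0-284.23.1.el9_2' into a numeric
    tuple so versions compare correctly field by field."""
    return tuple(int(n) for n in re.findall(r"\d+", release))

def has_fix(running: str, fixed: str = FIXED) -> bool:
    """True if the running kernel is at or past the fixed build."""
    return release_key(running) >= release_key(fixed)

print(has_fix("5.14.0-284.23.1.el9_2"))  # kernel from the first trace -> False
print(has_fix("5.14.0-284.26.1.el9_2"))  # the fixed build -> True
```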

Comment 4 Jeremy Poulin 2023-08-02 18:50:42 UTC
I'll be out for a few days, so I'll forward the verification request to our PEs for input.

Comment 5 Jeremy Poulin 2023-08-10 15:34:38 UTC
So far we've only been able to reproduce this in CI, which isn't an environment that we can expose unreleased kernels into. I tried reproducing locally, but ran into some issues with our hardware that I've since found a workaround for. I've written a script to hammer my cluster with the same e2e tests as CI, in hopes that I can get it to crash and pull the relevant signature.

If I am successful, I'll build a custom RHCOS with the kernel in question and perform the same test. With any luck I will either have more information OR a confirmation that the issue is already resolved.
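
[Editorial sketch] The "hammer the cluster" approach described above is just a retry loop: run the e2e suite repeatedly until a run fails, which indicates a node crashed mid-suite. A hypothetical sketch (the command and function name are placeholders, not taken from the report's actual script):

```python
# Illustrative only: re-run a test command until it fails or a run
# budget is exhausted. In practice `cmd` would invoke openshift-tests.
import subprocess

def hammer(cmd, max_runs=50):
    """Run `cmd` up to `max_runs` times. Return the 1-based number of
    the first failing run, or None if every run passed."""
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return run
    return None
```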

Comment 6 Manoj Kumar 2023-08-16 17:31:12 UTC
The latest trap we hit on Power was:

[16729.131470] GPR08: 00000003fd9b0000 0000000000000000 0000000000000000 c008000002675168
[16729.131470] GPR12: c000000000844a70 c000000002ea0000 000000010749f5c8 00007fffa3de8f30
[16729.131470] GPR16: 0000000000000000 00000001076262a0 00007fffa3def6ac 00007fffa3deb598
[16729.131470] GPR20: 00007fffa3def69c 0000000000ff3fc0 0000000001745ab4 0000000000000320
[16729.131470] GPR24: c00000000b70bb80 0000000000000170 c00000000c759b00 c000000002b472b8
[16729.131470] GPR28: 0000000000000004 c000000024f4a880 0000000000000000 0000000000000000
[16729.132975] NIP [c00800000266d8e8] ovs_vport_get_upcall_stats+0x90/0x1f0 [openvswitch]
[16729.133158] LR [c00800000266d904] ovs_vport_get_upcall_stats+0xac/0x1f0 [openvswitch]
[16729.133329] Call Trace:
[16729.133400] [c00000000814f6b0] [c00800000266d904] ovs_vport_get_upcall_stats+0xac/0x1f0 [openvswitch] (unreliable)
[16729.133620] [c00000000814f710] [c008000002654fdc] ovs_vport_cmd_fill_info+0x224/0x340 [openvswitch]
[16729.133815] [c00000000814f7c0] [c008000002655270] ovs_vport_cmd_dump+0x178/0x1c0 [openvswitch]
[16729.134011] [c00000000814f820] [c000000000d13448] netlink_dump+0x138/0x370
[16729.134167] [c00000000814f8b0] [c000000000d14fc8] __netlink_dump_start+0x238/0x3b0
[16729.134339] [c00000000814f900] [c000000000d18ed4] genl_family_rcv_msg_dumpit+0xa4/0x1a0
[16729.134510] [c00000000814f9a0] [c000000000d1a9d0] genl_rcv_msg+0x1e0/0x280
[16729.134656] [c00000000814fa40] [c000000000d182e4] netlink_rcv_skb+0x84/0x1d0
[16729.134825] [c00000000814fac0] [c000000000d18dfc] genl_rcv+0x4c/0x80
[16729.134971] [c00000000814faf0] [c000000000d17788] netlink_unicast+0x308/0x3e0
[16729.135143] [c00000000814fb60] [c000000000d17abc] netlink_sendmsg+0x25c/0x560
[16729.135314] [c00000000814fc10] [c000000000c0f490] sock_sendmsg+0x80/0xc0
[16729.135460] [c00000000814fc40] [c000000000c123b4] __sys_sendto+0x164/0x1c0
[16729.135605] [c00000000814fd90] [c000000000c12480] sys_send+0x30/0x40
[16729.135751] [c00000000814fdb0] [c00000000002f544] system_call_exception+0x164/0x310
[16729.135946] [c00000000814fe10] [c00000000000bfe8] system_call_vectored_common+0xe8/0x278
[16729.136136] --- interrupt: 3000 at 0x7fffb1760a44
[16729.136272] NIP:  00007fffb1760a44 LR: 0000000000000000 CTR: 0000000000000000
[16729.136493] REGS: c00000000814fe80 TRAP: 3000   Not tainted  (5.14.0-284.25.1.el9_2.ppc64le)
[16729.136744] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 44028842  XER: 00000000
[16729.137005] IRQMASK: 0
[16729.137005] GPR00: 000000000000014e 00007fffa3de8b50 00007fffb1857200 0000000000000034
[16729.137005] GPR04: 00007fffa4007e30 0000000000000018 0000000000000000 00007fffa3df53e0
[16729.137005] GPR08: 00007fffa3dedca8 0000000000000000 0000000000000000 0000000000000000
[16729.137005] GPR12: 0000000000000000 00007fffa3df53e0 000000010749f5c8 00007fffa3de8f30
[16729.137005] GPR16: 0000000000000000 00000001076262a0 00007fffa3def6ac 00007fffa3deb598
[16729.137005] GPR20: 00007fffa3def69c 0000000000ff3fc0 0000000001745ab4 00007fffa3de8fb0
[16729.137005] GPR24: 00007fffa3de9008 0000000126dfc010 00007fffa3de8d10 00007fffa415fc80
[16729.137005] GPR28: 00007fffa4007e30 0000000000000034 0000000000000000 0000000000000000
[16729.139072] NIP [00007fffb1760a44] 0x7fffb1760a44
[16729.139219] LR [0000000000000000] 0x0
[16729.139312] --- interrupt: 3000
[16729.139403] Instruction dump:
[16729.139496] 39400000 83890000 48000038 60000000 60000000 60000000 3d220000 e95d0048
[16729.139701] e90984a8 7c691ef4 7d08482a 7cca4214 <7d4a402a> e9260008 7fff5214 7fde4a14
[16729.139912] ---[ end trace b78c94f8b77966c8 ]---
[16729.184437]
[16730.184567] Kernel panic - not syncing: Fatal exception

Comment 7 Jeremy Poulin 2023-08-25 20:27:47 UTC
Since the merge of kernel https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2634530 in OpenShift, we have not seen this issue recur. I'm going to double check the CI runs again come Monday, but if those are also clean, I'm going to say that we are no longer able to reproduce this issue, and that we believe it was a duplicate of the other linked bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=2203263
https://bugzilla.redhat.com/show_bug.cgi?id=2223310

Comment 8 Jeremy Poulin 2023-08-28 17:07:23 UTC
Closing this as a duplicate after reviewing this weekend's runs and seeing no recurrences since the kernel update in OpenShift.

*** This bug has been marked as a duplicate of bug 2223310 ***

