Bug 2226809
| Summary: | OpenShift 4.14 kernel panic IBM Power | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Jeremy Poulin <jpoulin> |
| Component: | openvswitch | Assignee: | Timothy Redaelli <tredaelli> |
| openvswitch sub component: | ovs-dpdk | QA Contact: | Ping Zhang <pizhang> |
| Status: | NEW --- | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | ctrautma, echaudro, eterrell, fleitner, jhsiao, ktraynor, manokuma, mtarsel, psundara, qding |
| Version: | RHEL 9.0 | Flags: | fleitner: needinfo? (jpoulin) |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | ppc64le | ||
| OS: | Linux | ||
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jeremy Poulin
2023-07-26 15:52:40 UTC
https://issues.redhat.com/browse/OCPBUGS-15573 is where we're tracking the regression in OpenShift.

https://issues.redhat.com/browse/RHEL-463 is a potentially related issue we've recently hit on ARM related to MicroShift.

This looks very similar to the bug seen on arm64 systems, as Jeremy pointed out. @Eelco Chaudron - could it be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=2203263 ?

Hi,

We only have half of the problem signature with that backtrace, but it looks very likely to be the same case. Can you see if it reproduces with kernel-5.14.0-284.26.1.el9_2 ? That is the kernel with the fix from https://bugzilla.redhat.com/show_bug.cgi?id=2223310

Thanks,
fbl

I'll be out for a few days, so I'll forward the verification request for input to our PEs.

So far we've only been able to reproduce this in CI, which isn't an environment that we can expose unreleased kernels to. I tried reproducing locally, but ran into some issues with our hardware that I've since found a workaround for. I've written a script to hammer my cluster with the same e2e tests as CI, in hopes that I can get it to crash and pull the relevant signature. If I am successful, I'll build a custom RHCOS with the kernel in question and perform the same test. With any luck I will either have more information OR a confirmation that the issue is already resolved.
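For anyone else trying to reproduce this outside CI, the "hammer the cluster" idea can be sketched roughly as below. This is not the script mentioned above, which is not attached to this bug. The `hammer` helper and the `E2E_CMD` placeholder are hypothetical names, and the real e2e invocation has to be substituted by whoever runs it. The sketch only assumes that a failed or interrupted test run shows up as a non-zero exit status.

```python
import subprocess
import sys
from typing import Optional, Sequence


def hammer(cmd: Sequence[str], max_runs: int) -> Optional[int]:
    """Run cmd up to max_runs times.

    Returns the 1-based index of the first run that exited non-zero,
    or None if every run succeeded.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return run
    return None


# Hypothetical placeholder so the sketch runs anywhere; replace it with
# the actual e2e test command used against the cluster.
E2E_CMD = [sys.executable, "-c", "print('e2e placeholder')"]

first_failure = hammer(E2E_CMD, max_runs=3)
print("first failing run:", first_failure)
```

Stopping at the first failure keeps the console log from the crashing node close to the end of the capture, which should make it easier to pull the full signature.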
The latest trap we hit on Power was:

[16729.131470] GPR08: 00000003fd9b0000 0000000000000000 0000000000000000 c008000002675168
[16729.131470] GPR12: c000000000844a70 c000000002ea0000 000000010749f5c8 00007fffa3de8f30
[16729.131470] GPR16: 0000000000000000 00000001076262a0 00007fffa3def6ac 00007fffa3deb598
[16729.131470] GPR20: 00007fffa3def69c 0000000000ff3fc0 0000000001745ab4 0000000000000320
[16729.131470] GPR24: c00000000b70bb80 0000000000000170 c00000000c759b00 c000000002b472b8
[16729.131470] GPR28: 0000000000000004 c000000024f4a880 0000000000000000 0000000000000000
[16729.132975] NIP [c00800000266d8e8] ovs_vport_get_upcall_stats+0x90/0x1f0 [openvswitch]
[16729.133158] LR [c00800000266d904] ovs_vport_get_upcall_stats+0xac/0x1f0 [openvswitch]
[16729.133329] Call Trace:
[16729.133400] [c00000000814f6b0] [c00800000266d904] ovs_vport_get_upcall_stats+0xac/0x1f0 [openvswitch] (unreliable)
[16729.133620] [c00000000814f710] [c008000002654fdc] ovs_vport_cmd_fill_info+0x224/0x340 [openvswitch]
[16729.133815] [c00000000814f7c0] [c008000002655270] ovs_vport_cmd_dump+0x178/0x1c0 [openvswitch]
[16729.134011] [c00000000814f820] [c000000000d13448] netlink_dump+0x138/0x370
[16729.134167] [c00000000814f8b0] [c000000000d14fc8] __netlink_dump_start+0x238/0x3b0
[16729.134339] [c00000000814f900] [c000000000d18ed4] genl_family_rcv_msg_dumpit+0xa4/0x1a0
[16729.134510] [c00000000814f9a0] [c000000000d1a9d0] genl_rcv_msg+0x1e0/0x280
[16729.134656] [c00000000814fa40] [c000000000d182e4] netlink_rcv_skb+0x84/0x1d0
[16729.134825] [c00000000814fac0] [c000000000d18dfc] genl_rcv+0x4c/0x80
[16729.134971] [c00000000814faf0] [c000000000d17788] netlink_unicast+0x308/0x3e0
[16729.135143] [c00000000814fb60] [c000000000d17abc] netlink_sendmsg+0x25c/0x560
[16729.135314] [c00000000814fc10] [c000000000c0f490] sock_sendmsg+0x80/0xc0
[16729.135460] [c00000000814fc40] [c000000000c123b4] __sys_sendto+0x164/0x1c0
[16729.135605] [c00000000814fd90] [c000000000c12480] sys_send+0x30/0x40
[16729.135751] [c00000000814fdb0] [c00000000002f544] system_call_exception+0x164/0x310
[16729.135946] [c00000000814fe10] [c00000000000bfe8] system_call_vectored_common+0xe8/0x278
[16729.136136] --- interrupt: 3000 at 0x7fffb1760a44
[16729.136272] NIP: 00007fffb1760a44 LR: 0000000000000000 CTR: 0000000000000000
[16729.136493] REGS: c00000000814fe80 TRAP: 3000 Not tainted (5.14.0-284.25.1.el9_2.ppc64le)
[16729.136744] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 44028842 XER: 00000000
[16729.137005] IRQMASK: 0
[16729.137005] GPR00: 000000000000014e 00007fffa3de8b50 00007fffb1857200 0000000000000034
[16729.137005] GPR04: 00007fffa4007e30 0000000000000018 0000000000000000 00007fffa3df53e0
[16729.137005] GPR08: 00007fffa3dedca8 0000000000000000 0000000000000000 0000000000000000
[16729.137005] GPR12: 0000000000000000 00007fffa3df53e0 000000010749f5c8 00007fffa3de8f30
[16729.137005] GPR16: 0000000000000000 00000001076262a0 00007fffa3def6ac 00007fffa3deb598
[16729.137005] GPR20: 00007fffa3def69c 0000000000ff3fc0 0000000001745ab4 00007fffa3de8fb0
[16729.137005] GPR24: 00007fffa3de9008 0000000126dfc010 00007fffa3de8d10 00007fffa415fc80
[16729.137005] GPR28: 00007fffa4007e30 0000000000000034 0000000000000000 0000000000000000
[16729.139072] NIP [00007fffb1760a44] 0x7fffb1760a44
[16729.139219] LR [0000000000000000] 0x0
[16729.139312] --- interrupt: 3000
[16729.139403] Instruction dump:
[16729.139496] 39400000 83890000 48000038 60000000 60000000 60000000 3d220000 e95d0048
[16729.139701] e90984a8 7c691ef4 7d08482a 7cca4214 <7d4a402a> e9260008 7fff5214 7fde4a14
[16729.139912] ---[ end trace b78c94f8b77966c8 ]---
[16729.184437]
[16730.184567] Kernel panic - not syncing: Fatal exception
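Since only part of the signature is available, it may help to read the faulting instruction straight out of the dump. In the "Instruction dump" lines, the word in angle brackets is the instruction at the faulting NIP. The sketch below pulls that word out and splits it into fields, assuming the Power ISA X-form layout (6-bit primary opcode, three 5-bit register fields, 10-bit extended opcode). The helper names and the one-entry mnemonic table are illustrative only, and reading the GPR08 row as r8 through r11 is my assumption about the dump format, not something stated in this report.

```python
import re

# Last instruction-dump line from the trace above.
DUMP_LINE = (
    "e90984a8 7c691ef4 7d08482a 7cca4214 <7d4a402a> "
    "e9260008 7fff5214 7fde4a14"
)

# Illustration-only lookup: (primary opcode, extended opcode) -> mnemonic.
KNOWN_X_FORM = {(31, 21): "ldx"}


def faulting_word(dump_line: str) -> int:
    """Return the 32-bit instruction word marked with <...>."""
    match = re.search(r"<([0-9a-f]{8})>", dump_line)
    if match is None:
        raise ValueError("no <...> marker in instruction dump")
    return int(match.group(1), 16)


def decode_x_form(word: int) -> dict:
    """Split a 32-bit word into X-form fields (assumed layout)."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rt": (word >> 21) & 0x1F,
        "ra": (word >> 16) & 0x1F,
        "rb": (word >> 11) & 0x1F,
        "xo": (word >> 1) & 0x3FF,
    }


fields = decode_x_form(faulting_word(DUMP_LINE))
mnemonic = KNOWN_X_FORM.get((fields["opcode"], fields["xo"]), "unknown")
print(f"{mnemonic} r{fields['rt']},r{fields['ra']},r{fields['rb']}")

# Values copied from the GPR08 row of the kernel-side register dump,
# assuming the row lists r8, r9, r10, r11 in that order.
R8, R10 = 0x00000003FD9B0000, 0x0
print("load address (r10 + r8):", hex(R10 + R8))
```

If this decode is right, the trap is on an indexed load whose base register (r10) is zero, so the load address is just the r8 offset. That would need to be checked against the disassembly of ovs_vport_get_upcall_stats for this kernel build before drawing conclusions.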