Bug 1859590 - [efi]RT kernel crash with "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel-rt
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Luiz Capitulino
QA Contact: Pei Zhang
URL:
Whiteboard:
Depends On: 1684462 1859857
Blocks: 1817732 1823810
Reported: 2020-07-22 14:20 UTC by Pei Zhang
Modified: 2020-11-04 02:27 UTC

Fixed In Version: 4.18.0-230
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-04 02:26:35 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Pei Zhang 2020-07-22 14:20:12 UTC
Description of problem:
After installing kernel-rt on hpe-dl380gen10-01.hpe2.lab.eng.bos.redhat.com, rebooting the host into kernel-rt causes the host to crash.

Version-Release number of selected component (if applicable):
4.18.0-226.rt7.38.el8.x86_64
4.18.0-227.rt7.39.el8.x86_64

How reproducible:
1. 100% reproducible on hpe-dl380gen10-01.hpe2.lab.eng.bos.redhat.com
2. Not reproduced on another server: dell-per430-09.lab.eng.pek2.redhat.com

Steps to Reproduce:
1. Install a rhel8.3 host

2. Install kernel-rt packages

3. Reboot the RT host; it always crashes

hpe-dl380gen10-01 login: [  123.418078] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  123.418083] rcu:         Tasks blocked on level-1 rcu_node (CPUs 0-15): P3356
[  123.418096]         (detected by 58, t=60002 jiffies, g=22237, q=26747)
[  123.418101] NetworkManager  D    0  3356      1 0x00004080
[  123.418105] Call Trace:
[  123.418117]  __schedule+0x342/0x830
[  123.418122]  schedule+0x3c/0xf0
[  123.418126]  schedule_timeout+0x1b7/0x410
[  123.418135]  ? raise_timer_softirq+0x10/0x10
[  123.418142]  ? update_load_avg+0x80/0x6a0
[  123.418146]  ? account_entity_enqueue+0x64/0x90
[  123.418150]  wait_for_completion_timeout+0x7e/0xf0
[  123.418161]  ionic_adminq_post_wait+0x111/0x350 [ionic]
[  123.418171]  ? ionic_addr_del+0x20/0x20 [ionic]
[  123.418178]  ionic_lif_addr_add+0xef/0x140 [ionic]
[  123.418188]  __hw_addr_sync_dev+0x9d/0xd0
[  123.418195]  ? ionic_lif_addr+0x1f0/0x1f0 [ionic]
[  123.418202]  ionic_set_rx_mode+0xa4/0x1f0 [ionic]
[  123.418207]  __dev_mc_add+0x89/0x90
[  123.418214]  igmp_group_added+0x1a3/0x1c0
[  123.418220]  __ip_mc_inc_group+0x128/0x1f0
[  123.418225]  ip_mc_up+0x4f/0xb0
[  123.418229]  inetdev_event+0x395/0x580
[  123.418238]  ? notifier_call_chain+0x47/0x70
[  123.418241]  ? inetdev_init+0x170/0x170
[  123.418245]  notifier_call_chain+0x47/0x70
[  123.418249]  __dev_notify_flags+0x5b/0xf0
[  123.418254]  dev_change_flags+0x48/0x60
[  123.418258]  do_setlink+0x314/0xf00
[  123.418265]  ? __nla_validate_parse+0x51/0x840
[  123.418273]  ? cpumask_next+0x16/0x20
[  123.418277]  ? __snmp6_fill_stats64.isra.56+0x6b/0x110
[  123.418281]  ? __nla_validate_parse+0x51/0x840
[  123.418286]  __rtnl_newlink+0x53d/0x890
[  123.418292]  ? migrate_enable+0x118/0x3a0
[  123.418295]  ? migrate_enable+0x3b/0x3a0
[  123.418302]  ? preempt_count_add+0x49/0xa0
[  123.418305]  ? migrate_enable+0x118/0x3a0
[  123.418308]  ? migrate_disable+0x38/0xc0
[  123.418313]  ? __switch_to_asm+0x41/0x70
[  123.418316]  ? __switch_to_asm+0x35/0x70
[  123.418319]  ? __switch_to_asm+0x41/0x70
[  123.418322]  ? __switch_to_asm+0x35/0x70
[  123.418325]  ? __switch_to_asm+0x41/0x70
[  123.418328]  ? __switch_to_asm+0x35/0x70
[  123.418339]  ? entry_SYSCALL_64_after_hwframe+0xbb/0xca
[  123.418342]  ? __switch_to_asm+0x35/0x70
[  123.418345]  ? __switch_to_asm+0x41/0x70
[  123.418352]  ? kmem_cache_alloc_trace+0xcf/0x1d0
[  123.418356]  rtnl_newlink+0x43/0x60
[  123.418360]  rtnetlink_rcv_msg+0x126/0x390
[  123.418367]  ? sock_has_perm+0x78/0xa0
[  123.418370]  ? rtnl_calcit.isra.37+0x110/0x110
[  123.418375]  netlink_rcv_skb+0x4c/0x120
[  123.418380]  netlink_unicast+0x197/0x230
[  123.418384]  netlink_sendmsg+0x204/0x3d0
[  123.418391]  sock_sendmsg+0x4c/0x50
[  123.418395]  ____sys_sendmsg+0x1eb/0x250
[  123.418400]  ? copy_msghdr_from_user+0x5c/0x90
[  123.418406]  ? __check_object_size+0xae/0x166
[  123.418411]  ___sys_sendmsg+0x7c/0xc0
[  123.418415]  ? netdev_run_todo+0x5e/0x290
[  123.418421]  ? addrconf_sysctl_forward+0x113/0x250
[  123.418425]  ? preempt_count_add+0x49/0xa0
[  123.418429]  ? preempt_count_add+0x49/0xa0
[  123.418432]  ? migrate_enable+0x118/0x3a0
[  123.418436]  ? __fget+0x73/0xb0
[  123.418441]  __sys_sendmsg+0x57/0xa0
[  123.418448]  do_syscall_64+0x87/0x1a0
[  123.418452]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  123.418456] RIP: 0033:0x7ff83a439857
[  123.418464] Code: Bad RIP value.
[  123.418466] RSP: 002b:00007fffb0c75a70 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[  123.418469] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007ff83a439857
[  123.418470] RDX: 0000000000000000 RSI: 00007fffb0c75ac0 RDI: 000000000000000e
[  123.418472] RBP: 00007fffb0c75ac0 R08: 0000000000000000 R09: 0000000000000000
[  123.418473] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[  123.418474] R13: 0000000000000000 R14: 00007fffb0c75c78 R15: 00007fffb0c75c6c
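
For triage, the stall report above can be condensed into the RCU flavor, the blocked task PIDs, and the kernel modules that appear in reliable (non-"?") stack frames. The following is a minimal sketch; the regular expressions are assumptions based only on the line shapes seen in this log, and `summarize_stall` is a hypothetical helper name, not an existing tool.

```python
import re

# Patterns are assumptions derived from the log lines in this report.
STALL_RE = re.compile(r"rcu: INFO: (\w+) detected stalls on CPUs/tasks")
BLOCKED_RE = re.compile(r"Tasks blocked on level-\d+ rcu_node \(CPUs [\d-]+\):(.*)$")
FRAME_RE = re.compile(
    r"\]\s+(\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def summarize_stall(log_text):
    """Return the RCU flavor, blocked PIDs, and modules seen in reliable frames."""
    flavor = None
    pids = []
    modules = []
    for line in log_text.splitlines():
        stall = STALL_RE.search(line)
        if stall:
            flavor = stall.group(1)
        blocked = BLOCKED_RE.search(line)
        if blocked:
            # Blocked tasks are printed as "P<pid>".
            pids.extend(int(pid) for pid in re.findall(r"P(\d+)", blocked.group(1)))
        frame = FRAME_RE.search(line)
        # Frames prefixed with "? " are unreliable, so they are skipped.
        if frame and frame.group(1) is None and frame.group(3):
            if frame.group(3) not in modules:
                modules.append(frame.group(3))
    return {"flavor": flavor, "blocked_pids": pids, "modules": modules}
```

Applied to the trace above, such a summary would point at PID 3356 (NetworkManager) waiting inside the ionic driver, which is where the call chain ends in `wait_for_completion_timeout`.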


Actual results:
The RT host crashes on hpe-dl380gen10-01.hpe2.lab.eng.bos.redhat.com

Expected results:
RT host should not crash.

Additional info:
1. This appears to be a rhel8.3-rt issue, since rhel8.2.z works well, so I am setting the Regression keyword.

(1)rhel8.2.z-rt: 4.18.0-193.13.2.rt13.65.el8_2.x86_64   works well
(2)rhel8.3-rt:   4.18.0-227.rt7.39.el8.x86_64           crash

2. Currently I only hit this issue on hpe-dl380gen10-01.hpe2.lab.eng.bos.redhat.com. 

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:            4
CPU MHz:             2400.209
BogoMIPS:            4800.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            28160K
NUMA node0 CPU(s):   0-19,40-59
NUMA node1 CPU(s):   20-39,60-79
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts pku ospke md_clear flush_l1d

Comment 1 Pei Zhang 2020-07-22 14:26:26 UTC
On the reproducing server, hpe-dl380gen10-01.hpe2.lab.eng.bos.redhat.com, Hyper-Threading is enabled.
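
The Hyper-Threading state can also be read from the `lscpu` output in the description. A minimal sketch, assuming the "key: value" layout shown there; `parse_lscpu`, `smt_enabled`, and `expected_cpus` are hypothetical helper names introduced for illustration.

```python
def parse_lscpu(text):
    """Turn 'Key:   value' lines from lscpu into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def smt_enabled(fields):
    """More than one thread per core means SMT (Hyper-Threading) is on."""
    return int(fields["Thread(s) per core"]) > 1


def expected_cpus(fields):
    """Logical CPU count implied by the reported topology."""
    return (
        int(fields["Thread(s) per core"])
        * int(fields["Core(s) per socket"])
        * int(fields["Socket(s)"])
    )
```

With the values from the description (2 threads per core, 20 cores per socket, 2 sockets), the implied count is 80, which matches the reported `CPU(s)` line.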

Comment 2 Juri Lelli 2020-07-22 14:30:06 UTC
Hi Pei,

could you append or point to a full console output?

Thanks!

Comment 3 Luiz Capitulino 2020-07-23 13:52:52 UTC
Pei,

More questions:

o Do you have any VMs running at all? Or the real-time host profile applied?

o How similar is this to bug 1817045 ?

o Since your system is a skylake, make sure you have microcode version
0x2000064

Comment 4 Pei Zhang 2020-07-24 11:59:08 UTC
(In reply to Luiz Capitulino from comment #3)
> Pei,
> 
> More questions:
> 
> o Do you have any VMs running at all? Or the real-time host profile applied?

Luiz,

No, just installing an RT host causes this issue. No VMs are running, and no real-time host profile is applied.

> 
> o How similar is this to bug 1817045 ?

They are different. No VM is involved in this bug.

> 
> o Since your system is a skylake, make sure you have microcode version
> 0x2000064

Currently this system has 8.2-rt installed. The microcode:
[root@hpe-dl380gen10-01 ~]# dmesg | grep microcode
[    0.000000] microcode: microcode updated early to revision 0x2006906, date = 2020-04-24
[   11.127493] microcode: sig=0x50654, pf=0x80, revision=0x2006906
[   11.131307] microcode: Microcode Update Driver: v2.2.
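
The revision reported by `dmesg` can be compared against a requested one programmatically. The sketch below is illustrative only: the regular expression is an assumption based on the two line shapes above, the helper names are hypothetical, and treating a numerically larger revision as satisfying the request in comment #3 is an assumption that this report does not confirm.

```python
import re

# Matches both "revision 0x..." and "revision=0x..." on microcode lines.
REVISION_RE = re.compile(r"microcode: .*revision[= ]\s*(0x[0-9a-fA-F]+)")


def microcode_revisions(dmesg_text):
    """Return every microcode revision found in the dmesg text as an integer."""
    return [int(m.group(1), 16) for m in REVISION_RE.finditer(dmesg_text)]


def meets_minimum(dmesg_text, minimum):
    """True if at least one revision was found and all are >= minimum.

    Assumption: revisions can be ordered by plain integer comparison.
    """
    revisions = microcode_revisions(dmesg_text)
    return bool(revisions) and all(rev >= minimum for rev in revisions)
```

For example, `meets_minimum(output, 0x2000064)` could be called with the captured `dmesg | grep microcode` output.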


Best regards,

Pei

Comment 8 Pei Zhang 2020-07-24 15:39:52 UTC
*** Bug 1859857 has been marked as a duplicate of this bug. ***

Comment 11 Luiz Capitulino 2020-07-29 13:49:19 UTC
Since this is fixed by bug 1859857 and bug 1684462,
I'm making this a TestOnly BZ and added both bugs
as dependencies.

Comment 19 Pei Zhang 2020-08-21 07:04:43 UTC
Verified with 4.18.0-233.rt7.45.el8.x86_64:

The RT host works well on the UEFI server, and the hang can no longer be reproduced.

So this issue has been fixed. Moving to Verified.
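
A rough way to check whether a given kernel-rt build is at or past the "Fixed In Version" field (4.18.0-230) is to compare the build number that follows the base version. This is a heuristic sketch under two assumptions: that the kernel-rt build number tracks the base kernel build, and that no z-stream backports are involved. `kernel_build` and `has_fix` are hypothetical helper names.

```python
import re

# Base version (e.g. 4.18.0) followed by the build number (e.g. 233).
BUILD_RE = re.compile(r"^(\d+\.\d+\.\d+)-(\d+)")


def kernel_build(version):
    """Split '4.18.0-233.rt7.45.el8.x86_64' into ('4.18.0', 233)."""
    match = BUILD_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized kernel version: {version!r}")
    return match.group(1), int(match.group(2))


def has_fix(version, fixed_in="4.18.0-230"):
    """Heuristic: same base version and build number >= the fixed build."""
    base, build = kernel_build(version)
    fixed_base, fixed_build = kernel_build(fixed_in)
    return base == fixed_base and build >= fixed_build
```

Under these assumptions, the crashing builds (226 and 227) fall before the fixed build, and the verified build (233) falls after it.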

Comment 22 errata-xmlrpc 2020-11-04 02:26:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: kernel-rt security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4609

