Bug 1896982

Summary: kernel-rt: kernel BUG at lib/list_debug.c:28!
Product: Red Hat Enterprise Linux 8
Component: kernel-rt
kernel-rt sub component: Memory Management
Version: 8.4
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Reporter: Chunyu Hu <chuhu>
Assignee: Juri Lelli <jlelli>
QA Contact: Chunyu Hu <chuhu>
CC: bhu, liwan, mm-maint, pifang, rt-maint, rt-qe
Flags: pm-rhel: mirror+
Target Milestone: rc
Target Release: 8.4
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-03-12 08:42:23 UTC

Description Chunyu Hu 2020-11-12 01:08:36 UTC
Description of problem:

kernel-rt panics when running the VMM-FUNCTION test. Vmcore:
http://ibm-x3250m4-03.rhts.eng.pek2.redhat.com/vmcore/chuhu/4.18.0-246.rt14.11.el8.x86_64/4720837/hp-dl585g7-01.rhts.eng.pek2.redhat.com/10.73.194.85-2020-11-11-04:56:02/vmcore-dmesg.txt

Job:
https://beaker.engineering.redhat.com/jobs/4720837

[  167.857379] Key type id_legacy registered
[  173.675021] irq 3: Affinity broken due to vector space exhaustion.
[  173.764118] list_add corruption. prev->next should be next (ffffc3b5fbefec48), but was ffff9159ffb34640. (prev=ffff9159ffb34640).
[  173.825794] ------------[ cut here ]------------
[  173.825798] kernel BUG at lib/list_debug.c:28!
[  173.825816] invalid opcode: 0000 [#1] PREEMPT_RT SMP NOPTI
[  173.825820] CPU: 17 PID: 1110 Comm: kworker/17:2 Kdump: loaded Not tainted 4.18.0-246.rt14.11.el8.x86_64 #1
[  173.825821] Hardware name: HP ProLiant DL585 G7, BIOS A16 06/04/2013
[  173.825831] Workqueue: memcg_kmem_cache kmemcg_workfn
[  173.825840] RIP: 0010:__list_add_valid.cold.0+0x26/0x28
[  173.825845] Code: 00 00 00 c3 48 89 d1 48 c7 c7 a0 c9 ae b1 48 89 c2 e8 50 73 cc ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 f8 c9 ae b1 e8 3c 73 cc ff <0f> 0b 48 89 fe 48 89 c2 48 c7 c7 88 ca ae b1 e8 28 73 cc ff 0f 0b
[  173.825848] RSP: 0018:ffffa3509b82faa8 EFLAGS: 00010246
[  173.825851] RAX: 0000000000000075 RBX: ffff9159ffb34630 RCX: 0000000000000001
[  173.825854] RDX: 0000000000000000 RSI: ffffffffb1ada3c3 RDI: 00000000ffffffff
[  173.825855] RBP: ffffc3b5fbefec48 R08: ffff913a84d548c0 R09: 0000000000000000
[  173.825858] R10: 0000000000c45f1e R11: 00000000f5257d14 R12: ffff9159ffb34640
[  173.825859] R13: 0000000000000000 R14: ffffc3b5fbf2fbc8 R15: ffffc3b5fbf2fbc0
[  173.825863] FS:  0000000000000000(0000) GS:ffff9159ffb00000(0000) knlGS:0000000000000000
[  173.825865] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  173.825867] CR2: 00007fdc51400000 CR3: 0000005d9e668000 CR4: 00000000000406e0
[  173.825870] Call Trace:
[  173.825878]  free_unref_page_commit+0xd2/0x190
[  173.825886]  free_unref_page+0x143/0x2b0
[  173.825894]  free_delayed+0x61/0x80
[  173.825902]  flush_all+0xa5/0xe0
[  173.825906]  __kmem_cache_shrink+0x3a/0x300
[  173.825912]  ? __switch_to_asm+0x35/0x70
[  173.825916]  ? __switch_to_asm+0x41/0x70
[  173.825919]  ? __switch_to_asm+0x35/0x70
[  173.825922]  ? __switch_to_asm+0x41/0x70
[  173.825926]  ? __switch_to_asm+0x35/0x70
[  173.825929]  ? __switch_to_asm+0x41/0x70
[  173.825932]  ? __switch_to_asm+0x35/0x70
[  173.825936]  ? __switch_to_asm+0x35/0x70
[  173.825939]  ? __switch_to_asm+0x41/0x70
[  173.825942]  ? __switch_to_asm+0x35/0x70
[  173.825945]  ? __switch_to_asm+0x41/0x70
[  173.825949]  ? __switch_to_asm+0x35/0x70
[  173.825951]  ? __switch_to_asm+0x41/0x70
[  173.825954]  ? __switch_to_asm+0x35/0x70
[  173.825957]  ? __switch_to_asm+0x41/0x70
[  173.825962]  ? __switch_to_asm+0x35/0x70
[  173.825965]  ? __switch_to_asm+0x41/0x70
[  173.825968]  ? __switch_to_asm+0x35/0x70
[  173.825971]  ? __switch_to_asm+0x41/0x70
[  173.825974]  ? __switch_to_asm+0x35/0x70
[  173.825977]  ? __switch_to_asm+0x41/0x70
[  173.825979]  ? __switch_to_asm+0x35/0x70
[  173.825983]  ? __switch_to_asm+0x41/0x70
[  173.825986]  ? __switch_to_asm+0x35/0x70
[  173.825989]  ? _raw_spin_unlock_irq+0x1d/0x50
[  173.825995]  ? finish_task_switch+0x9e/0x2e0
[  173.825999]  ? __switch_to+0x147/0x470
[  173.826005]  ? __schedule+0x355/0x8a0
[  173.826010]  ? _raw_spin_lock+0x13/0x40
[  173.826015]  ? __try_to_take_rt_mutex+0x100/0x1e0
[  173.826019]  ? __rt_mutex_slowlock+0x42/0x130
[  173.826024]  ? rt_mutex_slowlock_locked+0xbc/0x260
[  173.826027]  ? try_to_wake_up+0x294/0x5e0
[  173.826032]  ? _raw_spin_unlock_irqrestore+0x20/0x60
[  173.826048]  ? rt_mutex_slowlock.constprop.30+0x6c/0x90
[  173.826056]  __kmemcg_cache_deactivate_after_rcu+0xe/0x40
[  173.826070]  kmemcg_cache_deactivate_after_rcu+0xe/0x20
[  173.826075]  kmemcg_workfn+0x2f/0x50
[  173.826081]  process_one_work+0x18f/0x420
[  173.826087]  worker_thread+0x30/0x370
[  173.826092]  ? process_one_work+0x420/0x420
[  173.826096]  kthread+0x112/0x130
[  173.826099]  ? kthread_flush_work_fn+0x10/0x10
[  173.826104]  ret_from_fork+0x22/0x40
[  173.826110] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc amd64_edac_mod joydev edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif pcspkr hpilo hpwdt ipmi_si ipmi_devintf ipmi_msghandler sp5100_tco k10temp fam15h_power i2c_piix4 acpi_power_meter ip_tables xfs libcrc32c sd_mod t10_pi sg radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci crc32c_intel serio_raw libahci ata_generic drm hpsa netxen_nic scsi_transport_sas libata dm_mirror dm_region_hash dm_log dm_mod

Version-Release number of selected component (if applicable):
4.18.0-246.rt14.11.el8.x86_64

How reproducible:
always

Steps to Reproduce:
1. Run VMM-FUNCTION on the rt kernel.

Actual results:
The kernel panics on list corruption. The stock kernel (kernel-4.18.0-246.el8) runs VMM-FUNCTION without the problem:
https://beaker.engineering.redhat.com/jobs/4721495

Expected results:
kernel-rt does not hit the list corruption and does not panic.

Additional info:
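
The list_add corruption line in the log above carries three addresses. Below is a minimal sketch for pulling them out and comparing them, assuming the message format exactly as it appears in this report. The function name `parse_list_add_corruption` and the `prev_points_to_self` key are made up for illustration and are not part of any kernel tooling.

```python
import re

# Message format as it appears in the dmesg line quoted in this report.
PATTERN = re.compile(
    r"list_add corruption\. prev->next should be next \((?P<next>[0-9a-f]+)\), "
    r"but was (?P<was>[0-9a-f]+)\. \(prev=(?P<prev>[0-9a-f]+)\)"
)


def parse_list_add_corruption(line):
    """Return the addresses from a 'prev->next should be next' message, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = {name: int(value, 16) for name, value in match.groupdict().items()}
    # True when the value found in prev->next is the address of prev itself.
    fields["prev_points_to_self"] = fields["was"] == fields["prev"]
    return fields


# The line from comment 0, copied verbatim.
LINE = (
    "[  173.764118] list_add corruption. prev->next should be next "
    "(ffffc3b5fbefec48), but was ffff9159ffb34640. (prev=ffff9159ffb34640)."
)

if __name__ == "__main__":
    info = parse_list_add_corruption(LINE)
    for key in ("next", "was", "prev"):
        print(f"{key:>4} = {info[key]:#x}")
    print("prev->next == prev:", info["prev_points_to_self"])
```

For the line in this report the last field is true: the value read from prev->next is prev's own address, which is how a self-linked (empty) list head looks. That is only an observation from the message, not a diagnosis of the root cause.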

Comment 1 Chunyu Hu 2020-11-12 01:28:52 UTC
The original job ran on a dt build of kernel-rt (4.18.0-246.rt4.11.el8.dt2.x86_64) and also panicked on list corruption, but at a different line of list_debug.c. I then ran the mainline version and got the similar panic shown in comment #0.

Vmcore from the dt run:
http://ibm-x3250m4-03.rhts.eng.pek2.redhat.com/vmcore/chuhu/4.18.0-246.rt4.11.el8.dt2.x86_64/4718130/hp-dl380eg8-01.rhts.eng.pek2.redhat.com/10.73.194.73-2020-11-10-06:23:36/vmcore-dmesg.txt


[ 2211.243652] ------------[ cut here ]------------
[ 2211.243655] kernel BUG at lib/list_debug.c:56!
[ 2211.243673] invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
[ 2211.243678] CPU: 18 PID: 4154990 Comm: runtest.sh Kdump: loaded Tainted: G          I      --------- -  - 4.18.0-246.rt4.11.el8.dt2.x86_64 #1
[ 2211.243679] Hardware name: HP ProLiant DL380e Gen8, BIOS P73 07/01/2013
[ 2211.243701] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x4c
[ 2211.243706] Code: 43 4f 87 e8 7c e7 cb ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 b0 43 4f 87 e8 68 e7 cb ff 0f 0b 48 c7 c7 60 44 4f 87 e8 5a e7 cb ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 20 44 4f 87 e8 46 e7 cb ff 0f 0b
[ 2211.243708] RSP: 0018:ffffbb1921a6fc70 EFLAGS: 00010246
[ 2211.243711] RAX: 0000000000000054 RBX: fffff76aa13b5208 RCX: 0000000000000001
[ 2211.243712] RDX: 0000000000000000 RSI: ffffffff874e1ce3 RDI: 00000000ffffffff
[ 2211.243714] RBP: 00000000000005b7 R08: ffffffff8698aa90 R09: 0000000000000544
[ 2211.243715] R10: 000000000001d3d4 R11: ffffbb1921a6fb20 R12: ffff9a846f4f4630
[ 2211.243717] R13: fffff76aa0f57388 R14: ffffbb1921a6fcd0 R15: ffff9a846f4f4650
[ 2211.243720] FS:  00007f229a208740(0000) GS:ffff9a846f580000(0000) knlGS:0000000000000000
[ 2211.243721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2211.243723] CR2: 00005643b9d6bb20 CR3: 000000084dd10003 CR4: 00000000000606e0
[ 2211.243725] Call Trace:
[ 2211.243737]  isolate_pcp_pages+0xf3/0x1c0
[ 2211.243747]  drain_pages_zone+0x17c/0x250
[ 2211.243751]  drain_pages+0x39/0x50
[ 2211.243754]  drain_all_pages+0xce/0x120
[ 2211.243759]  start_isolate_page_range+0x1ce/0x2f0
[ 2211.243768]  __offline_pages+0xfa/0x8f0
[ 2211.243776]  ? rt_spin_unlock+0x13/0x40
[ 2211.243784]  ? klist_next+0xd5/0xe0
[ 2211.243790]  ? device_is_dependent+0xa0/0xa0
[ 2211.243800]  memory_subsys_offline+0x45/0x60
[ 2211.243806]  device_offline+0x84/0xb0
[ 2211.243812]  state_store+0x63/0xb0
[ 2211.243821]  kernfs_fop_write+0xf6/0x1a0
[ 2211.243827]  vfs_write+0xa5/0x1a0
[ 2211.243833]  ksys_write+0x52/0xc0
[ 2211.243840]  do_syscall_64+0x87/0x1a0
[ 2211.243845]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 2211.243849] RIP: 0033:0x7f22998ea198
[ 2211.243852] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 c5 43 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 2211.243853] RSP: 002b:00007ffdb428e4c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2211.243856] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f22998ea198
[ 2211.243857] RDX: 0000000000000008 RSI: 00005643b9b30380 RDI: 0000000000000001
[ 2211.243858] RBP: 00005643b9b30380 R08: 000000000000000a R09: 00007f229997a4c0
[ 2211.243860] R10: 000000000000000a R11: 0000000000000246 R12: 00007f2299bba6c0
[ 2211.243861] R13: 0000000000000008 R14: 00007f2299bb5880 R15: 0000000000000008
[ 2211.243865] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr iTCO_wdt iTCO_vendor_support intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore ipmi_ssif intel_rapl_perf pcspkr hpwdt hpilo ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ioatdma lpc_ich ip_tables xfs sd_mod t10_pi sg mgag200 drm_kms_helper uas syscopyarea sysfillrect sysimgblt usb_storage fb_sys_fops drm_vram_helper drm_ttm_helper ttm ahci sfc serio_raw igb bnx2x libahci drm dca mtd libcrc32c i2c_algo_bit mdio crc32c_intel libata dm_mirror dm_region_hash dm_log dm_mod
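
The call traces in comment 0 and in this comment mix confirmed frames with entries prefixed by "?", which mark addresses found on the stack that the unwinder could not confirm as part of the call chain. The following is a rough sketch, assuming the dmesg layout shown in this report, that keeps only the unprefixed frames so the two crashes are easier to compare. The helper name `reliable_frames` is hypothetical, and `SAMPLE` is an abbreviated excerpt of the trace above, not the full log.

```python
import re

# A frame line looks like "[ 2211.243737]  isolate_pcp_pages+0xf3/0x1c0".
# An optional leading "? " marks an entry the unwinder could not confirm.
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+$"
)


def reliable_frames(dmesg_text):
    """Return symbols of the first call trace, skipping '?'-prefixed entries."""
    frames = []
    in_trace = False
    for raw in dmesg_text.splitlines():
        line = raw.strip()
        if not in_trace:
            in_trace = "Call Trace:" in line
            continue
        match = FRAME.match(line)
        if match is None:
            break  # first non-frame line (e.g. "RIP:") ends the trace
        if not match.group("guess"):
            frames.append(match.group("sym"))
    return frames


# Abbreviated excerpt of the trace in this comment.
SAMPLE = """\
[ 2211.243725] Call Trace:
[ 2211.243737]  isolate_pcp_pages+0xf3/0x1c0
[ 2211.243747]  drain_pages_zone+0x17c/0x250
[ 2211.243768]  __offline_pages+0xfa/0x8f0
[ 2211.243776]  ? rt_spin_unlock+0x13/0x40
[ 2211.243800]  memory_subsys_offline+0x45/0x60
[ 2211.243849] RIP: 0033:0x7f22998ea198
"""

if __name__ == "__main__":
    for symbol in reliable_frames(SAMPLE):
        print(symbol)
```

Run against the full vmcore-dmesg.txt of comment 0, the same filter would drop the long run of `? __switch_to_asm` entries and leave the path from `kmemcg_workfn` down to `free_unref_page_commit`.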

Comment 2 Chunyu Hu 2020-11-12 02:51:19 UTC
There's no such issue with 8.4 GA version kernel-rt-4.18.0-240.rt7.54.el8
https://beaker.engineering.redhat.com/jobs/4723325

Comment 3 Juri Lelli 2021-03-11 07:06:55 UTC
Hi,

Would it be possible to test again with the latest 8.4-rt build
(kernel-rt-4.18.0-296.rt7.63.el8 at the time of writing)?

We recently merged an RT-specific change, and it would be interesting
to see whether it plays a role here.

Thanks!

Comment 4 Chunyu Hu 2021-03-11 08:32:55 UTC
(In reply to Juri Lelli from comment #3)
> Hi,
> 
> Would it be possible to test again with latest 8.4-rt build
> (kernel-rt-4.18.0-296.rt7.63.el8 at the time of writing).

Job submitted. I will update when the job finishes running.

> 
> We merged an RT specific change lately that it would be interesting
> to see if it might play a role here.
> 
> Thanks!