Bug 1896982

Summary: kernel-rt: kernel BUG at lib/list_debug.c:28!
Product: Red Hat Enterprise Linux 8
Component: kernel-rt
kernel-rt sub component: Memory Management
Version: 8.4
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: 8.4
Hardware: Unspecified
OS: Unspecified
Reporter: Chunyu Hu <chuhu>
Assignee: Juri Lelli <jlelli>
QA Contact: Chunyu Hu <chuhu>
CC: bhu, liwan, mm-maint, pifang, rt-maint, rt-qe
Last Closed: 2021-03-12 08:42:23 UTC
Type: Bug

Description Chunyu Hu 2020-11-12 01:08:36 UTC
Description of problem:

kernel-rt panics when running the VMM-FUNCTION test. Vmcore:
http://ibm-x3250m4-03.rhts.eng.pek2.redhat.com/vmcore/chuhu/4.18.0-246.rt14.11.el8.x86_64/4720837/hp-dl585g7-01.rhts.eng.pek2.redhat.com/10.73.194.85-2020-11-11-04:56:02/vmcore-dmesg.txt

Job:
https://beaker.engineering.redhat.com/jobs/4720837

[  167.857379] Key type id_legacy registered
[  173.675021] irq 3: Affinity broken due to vector space exhaustion.
[  173.764118] list_add corruption. prev->next should be next (ffffc3b5fbefec48), but was ffff9159ffb34640. (prev=ffff9159ffb34640).
[  173.825794] ------------[ cut here ]------------
[  173.825798] kernel BUG at lib/list_debug.c:28!
[  173.825816] invalid opcode: 0000 [#1] PREEMPT_RT SMP NOPTI
[  173.825820] CPU: 17 PID: 1110 Comm: kworker/17:2 Kdump: loaded Not tainted 4.18.0-246.rt14.11.el8.x86_64 #1
[  173.825821] Hardware name: HP ProLiant DL585 G7, BIOS A16 06/04/2013
[  173.825831] Workqueue: memcg_kmem_cache kmemcg_workfn
[  173.825840] RIP: 0010:__list_add_valid.cold.0+0x26/0x28
[  173.825845] Code: 00 00 00 c3 48 89 d1 48 c7 c7 a0 c9 ae b1 48 89 c2 e8 50 73 cc ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 f8 c9 ae b1 e8 3c 73 cc ff <0f> 0b 48 89 fe 48 89 c2 48 c7 c7 88 ca ae b1 e8 28 73 cc ff 0f 0b
[  173.825848] RSP: 0018:ffffa3509b82faa8 EFLAGS: 00010246
[  173.825851] RAX: 0000000000000075 RBX: ffff9159ffb34630 RCX: 0000000000000001
[  173.825854] RDX: 0000000000000000 RSI: ffffffffb1ada3c3 RDI: 00000000ffffffff
[  173.825855] RBP: ffffc3b5fbefec48 R08: ffff913a84d548c0 R09: 0000000000000000
[  173.825858] R10: 0000000000c45f1e R11: 00000000f5257d14 R12: ffff9159ffb34640
[  173.825859] R13: 0000000000000000 R14: ffffc3b5fbf2fbc8 R15: ffffc3b5fbf2fbc0
[  173.825863] FS:  0000000000000000(0000) GS:ffff9159ffb00000(0000) knlGS:0000000000000000
[  173.825865] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  173.825867] CR2: 00007fdc51400000 CR3: 0000005d9e668000 CR4: 00000000000406e0
[  173.825870] Call Trace:
[  173.825878]  free_unref_page_commit+0xd2/0x190
[  173.825886]  free_unref_page+0x143/0x2b0
[  173.825894]  free_delayed+0x61/0x80
[  173.825902]  flush_all+0xa5/0xe0
[  173.825906]  __kmem_cache_shrink+0x3a/0x300
[  173.825912]  ? __switch_to_asm+0x35/0x70
[  173.825916]  ? __switch_to_asm+0x41/0x70
[  173.825919]  ? __switch_to_asm+0x35/0x70
[  173.825922]  ? __switch_to_asm+0x41/0x70
[  173.825926]  ? __switch_to_asm+0x35/0x70
[  173.825929]  ? __switch_to_asm+0x41/0x70
[  173.825932]  ? __switch_to_asm+0x35/0x70
[  173.825936]  ? __switch_to_asm+0x35/0x70
[  173.825939]  ? __switch_to_asm+0x41/0x70
[  173.825942]  ? __switch_to_asm+0x35/0x70
[  173.825945]  ? __switch_to_asm+0x41/0x70
[  173.825949]  ? __switch_to_asm+0x35/0x70
[  173.825951]  ? __switch_to_asm+0x41/0x70
[  173.825954]  ? __switch_to_asm+0x35/0x70
[  173.825957]  ? __switch_to_asm+0x41/0x70
[  173.825962]  ? __switch_to_asm+0x35/0x70
[  173.825965]  ? __switch_to_asm+0x41/0x70
[  173.825968]  ? __switch_to_asm+0x35/0x70
[  173.825971]  ? __switch_to_asm+0x41/0x70
[  173.825974]  ? __switch_to_asm+0x35/0x70
[  173.825977]  ? __switch_to_asm+0x41/0x70
[  173.825979]  ? __switch_to_asm+0x35/0x70
[  173.825983]  ? __switch_to_asm+0x41/0x70
[  173.825986]  ? __switch_to_asm+0x35/0x70
[  173.825989]  ? _raw_spin_unlock_irq+0x1d/0x50
[  173.825995]  ? finish_task_switch+0x9e/0x2e0
[  173.825999]  ? __switch_to+0x147/0x470
[  173.826005]  ? __schedule+0x355/0x8a0
[  173.826010]  ? _raw_spin_lock+0x13/0x40
[  173.826015]  ? __try_to_take_rt_mutex+0x100/0x1e0
[  173.826019]  ? __rt_mutex_slowlock+0x42/0x130
[  173.826024]  ? rt_mutex_slowlock_locked+0xbc/0x260
[  173.826027]  ? try_to_wake_up+0x294/0x5e0
[  173.826032]  ? _raw_spin_unlock_irqrestore+0x20/0x60
[  173.826048]  ? rt_mutex_slowlock.constprop.30+0x6c/0x90
[  173.826056]  __kmemcg_cache_deactivate_after_rcu+0xe/0x40
[  173.826070]  kmemcg_cache_deactivate_after_rcu+0xe/0x20
[  173.826075]  kmemcg_workfn+0x2f/0x50
[  173.826081]  process_one_work+0x18f/0x420
[  173.826087]  worker_thread+0x30/0x370
[  173.826092]  ? process_one_work+0x420/0x420
[  173.826096]  kthread+0x112/0x130
[  173.826099]  ? kthread_flush_work_fn+0x10/0x10
[  173.826104]  ret_from_fork+0x22/0x40
[  173.826110] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc amd64_edac_mod joydev edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif pcspkr hpilo hpwdt ipmi_si ipmi_devintf ipmi_msghandler sp5100_tco k10temp fam15h_power i2c_piix4 acpi_power_meter ip_tables xfs libcrc32c sd_mod t10_pi sg radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci crc32c_intel serio_raw libahci ata_generic drm hpsa netxen_nic scsi_transport_sas libata dm_mirror dm_region_hash dm_log dm_mod
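
For readers unfamiliar with the message at the top of this trace, here is a minimal Python model of the kind of neighbour check that produces it. It is a sketch only: the real check is C code in lib/list_debug.c, and the class and function names below (ListHead, list_add_check, list_add) are made up for illustration. The "stale snapshot" scenario is one hypothetical way to reach the reported state, where prev->next points back at prev itself. It is not a diagnosis of this bug.

```python
class ListHead:
    """Minimal stand-in for the kernel's struct list_head (illustrative only)."""

    def __init__(self, name):
        self.name = name
        # An initialised, empty list head points at itself in both directions.
        self.next = self
        self.prev = self


def list_add_check(new, prev, next_):
    """Return a description of the first inconsistency, or None if the insert looks safe."""
    if next_.prev is not prev:
        return f"next->prev should be prev ({prev.name}), but was {next_.prev.name}"
    if prev.next is not next_:
        return f"prev->next should be next ({next_.name}), but was {prev.next.name}"
    if new is prev or new is next_:
        return "double add"
    return None


def list_add(new, head):
    """Insert new right after head, validating both neighbours first."""
    next_ = head.next
    problem = list_add_check(new, head, next_)
    if problem:
        raise RuntimeError("list_add corruption. " + problem)
    next_.prev = new
    new.next = next_
    new.prev = head
    head.next = new


# Healthy case: one element on the list, neighbours agree with each other.
head = ListHead("head")
page_a = ListHead("page_a")
list_add(page_a, head)
healthy = list_add_check(ListHead("page_b"), head, head.next)

# Hypothetical stale snapshot: the caller sampled head.next, then the head
# was re-initialised by someone else before the check ran.
stale_next = head.next
head.next = head
head.prev = head
stale = list_add_check(ListHead("page_c"), head, stale_next)

print(healthy)
print(stale)
```

In the stale case the model reports that prev->next "was head", which has the same shape as the log line above, where the reported value equals prev (ffff9159ffb34640).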

Version-Release number of selected component (if applicable):
4.18.0-246.rt14.11.el8.x86_64

How reproducible:
always

Steps to Reproduce:
1. Run VMM-FUNCTION on the rt kernel.

Actual results:
The kernel panics on list corruption. The stock kernel, kernel-4.18.0-246.el8, passes VMM-FUNCTION:
https://beaker.engineering.redhat.com/jobs/4721495

Expected results:
kernel-rt does not hit the list corruption and does not panic.

Additional info:

Comment 1 Chunyu Hu 2020-11-12 01:28:52 UTC
The original job ran on a dt kernel-rt build, 4.18.0-246.rt4.11.el8.dt2.x86_64, which also panicked on list corruption, but at a different line of list_debug.c (log below). I then ran the mainline version and got the similar panic reported in comment #0.

Vmcore:
http://ibm-x3250m4-03.rhts.eng.pek2.redhat.com/vmcore/chuhu/4.18.0-246.rt4.11.el8.dt2.x86_64/4718130/hp-dl380eg8-01.rhts.eng.pek2.redhat.com/10.73.194.73-2020-11-10-06:23:36/vmcore-dmesg.txt


[ 2211.243652] ------------[ cut here ]------------
[ 2211.243655] kernel BUG at lib/list_debug.c:56!
[ 2211.243673] invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
[ 2211.243678] CPU: 18 PID: 4154990 Comm: runtest.sh Kdump: loaded Tainted: G          I      --------- -  - 4.18.0-246.rt4.11.el8.dt2.x86_64 #1
[ 2211.243679] Hardware name: HP ProLiant DL380e Gen8, BIOS P73 07/01/2013
[ 2211.243701] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x4c
[ 2211.243706] Code: 43 4f 87 e8 7c e7 cb ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 b0 43 4f 87 e8 68 e7 cb ff 0f 0b 48 c7 c7 60 44 4f 87 e8 5a e7 cb ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 20 44 4f 87 e8 46 e7 cb ff 0f 0b
[ 2211.243708] RSP: 0018:ffffbb1921a6fc70 EFLAGS: 00010246
[ 2211.243711] RAX: 0000000000000054 RBX: fffff76aa13b5208 RCX: 0000000000000001
[ 2211.243712] RDX: 0000000000000000 RSI: ffffffff874e1ce3 RDI: 00000000ffffffff
[ 2211.243714] RBP: 00000000000005b7 R08: ffffffff8698aa90 R09: 0000000000000544
[ 2211.243715] R10: 000000000001d3d4 R11: ffffbb1921a6fb20 R12: ffff9a846f4f4630
[ 2211.243717] R13: fffff76aa0f57388 R14: ffffbb1921a6fcd0 R15: ffff9a846f4f4650
[ 2211.243720] FS:  00007f229a208740(0000) GS:ffff9a846f580000(0000) knlGS:0000000000000000
[ 2211.243721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2211.243723] CR2: 00005643b9d6bb20 CR3: 000000084dd10003 CR4: 00000000000606e0
[ 2211.243725] Call Trace:
[ 2211.243737]  isolate_pcp_pages+0xf3/0x1c0
[ 2211.243747]  drain_pages_zone+0x17c/0x250
[ 2211.243751]  drain_pages+0x39/0x50
[ 2211.243754]  drain_all_pages+0xce/0x120
[ 2211.243759]  start_isolate_page_range+0x1ce/0x2f0
[ 2211.243768]  __offline_pages+0xfa/0x8f0
[ 2211.243776]  ? rt_spin_unlock+0x13/0x40
[ 2211.243784]  ? klist_next+0xd5/0xe0
[ 2211.243790]  ? device_is_dependent+0xa0/0xa0
[ 2211.243800]  memory_subsys_offline+0x45/0x60
[ 2211.243806]  device_offline+0x84/0xb0
[ 2211.243812]  state_store+0x63/0xb0
[ 2211.243821]  kernfs_fop_write+0xf6/0x1a0
[ 2211.243827]  vfs_write+0xa5/0x1a0
[ 2211.243833]  ksys_write+0x52/0xc0
[ 2211.243840]  do_syscall_64+0x87/0x1a0
[ 2211.243845]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 2211.243849] RIP: 0033:0x7f22998ea198
[ 2211.243852] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 c5 43 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 2211.243853] RSP: 002b:00007ffdb428e4c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2211.243856] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f22998ea198
[ 2211.243857] RDX: 0000000000000008 RSI: 00005643b9b30380 RDI: 0000000000000001
[ 2211.243858] RBP: 00005643b9b30380 R08: 000000000000000a R09: 00007f229997a4c0
[ 2211.243860] R10: 000000000000000a R11: 0000000000000246 R12: 00007f2299bba6c0
[ 2211.243861] R13: 0000000000000008 R14: 00007f2299bb5880 R15: 0000000000000008
[ 2211.243865] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr iTCO_wdt iTCO_vendor_support intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore ipmi_ssif intel_rapl_perf pcspkr hpwdt hpilo ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ioatdma lpc_ich ip_tables xfs sd_mod t10_pi sg mgag200 drm_kms_helper uas syscopyarea sysfillrect sysimgblt usb_storage fb_sys_fops drm_vram_helper drm_ttm_helper ttm ahci sfc serio_raw igb bnx2x libahci drm dca mtd libcrc32c i2c_algo_bit mdio crc32c_intel libata dm_mirror dm_region_hash dm_log dm_mod
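
When comparing the two traces, it can help to drop the frames prefixed with "?", which the kernel prints for return addresses it found on the stack but could not confirm as part of the call chain. The following Python sketch does that for a dmesg excerpt. The helper name reliable_frames and the regular expression are assumptions made for this example, not part of any existing tool, and the sample is a shortened excerpt of the trace in comment #0.

```python
import re

# Matches "] symbol+0xoff/0xsize" with an optional "? " marker before the symbol.
FRAME_RE = re.compile(r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(dmesg_text):
    """Return the symbols of the first call trace, skipping '?' (unverified) frames."""
    frames = []
    in_trace = False
    for line in dmesg_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match is None:
            break  # the first non-frame line ends the trace
        if match.group(1) is None:
            frames.append(match.group(2))
    return frames


SAMPLE = """\
[  173.825870] Call Trace:
[  173.825878]  free_unref_page_commit+0xd2/0x190
[  173.825886]  free_unref_page+0x143/0x2b0
[  173.825894]  free_delayed+0x61/0x80
[  173.825902]  flush_all+0xa5/0xe0
[  173.825906]  __kmem_cache_shrink+0x3a/0x300
[  173.825912]  ? __switch_to_asm+0x35/0x70
[  173.826056]  __kmemcg_cache_deactivate_after_rcu+0xe/0x40
[  173.826110] Modules linked in: rpcsec_gss_krb5
"""

frames = reliable_frames(SAMPLE)
for name in frames:
    print(name)
```

Run against the full logs, the same helper would list free_unref_page_commit at the top of the trace in comment #0 and isolate_pcp_pages at the top of the trace in this comment.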

Comment 2 Chunyu Hu 2020-11-12 02:51:19 UTC
There's no such issue with 8.4 GA version kernel-rt-4.18.0-240.rt7.54.el8
https://beaker.engineering.redhat.com/jobs/4723325

Comment 3 Juri Lelli 2021-03-11 07:06:55 UTC
Hi,

Would it be possible to test again with the latest 8.4-rt build
(kernel-rt-4.18.0-296.rt7.63.el8 at the time of writing)?

We recently merged an RT-specific change, and it would be interesting
to see whether it plays a role here.

Thanks!

Comment 4 Chunyu Hu 2021-03-11 08:32:55 UTC
(In reply to Juri Lelli from comment #3)
> Hi,
> 
> Would it be possible to test again with latest 8.4-rt build
> (kernel-rt-4.18.0-296.rt7.63.el8 at the time of writing).

Job submitted. I will update when the job finishes running.

> 
> We merged an RT specific change lately that it would be interesting
> to see if it might play a role here.
> 
> Thanks!