Bug 2218717

Summary: on RHEL 8.8, Infoscale 7.4.2.4100 freezes a few seconds after boot (see crashdumps)
Product: Red Hat Enterprise Linux 8
Reporter: Vincent S. Cojot <vcojot>
Component: kernel
Assignee: Ming Lei <minlei>
Kernel sub component: Block Layer
QA Contact: Storage QE <storage-qe>
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
CC: jmoyer, loberman, minlei
Version: 8.8
Target Milestone: rc
Hardware: All
OS: Linux
Last Closed: 2023-07-12 21:41:15 UTC
Type: Bug

Description Vincent S. Cojot 2023-06-30 01:19:35 UTC
Description of problem:

The latest patch level of Veritas Infoscale 7.4.2 for RHEL 8 works well on every RHEL 8.y release except RHEL 8.8. On RHEL 8.8 the system boots and then, shortly thereafter:

[  261.570000] watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [kworker/15:3:28838]
[  261.570052] watchdog: BUG: soft lockup - CPU#82 stuck for 22s! [migration/82:508]
[  281.570037] watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [migration/9:70]
[  281.570123] watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [lltdlv:26988]
[  285.570146] watchdog: BUG: soft lockup - CPU#66 stuck for 22s! [odm_clust_start:27115]
[  289.570129] watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [kworker/15:3:28838]
[  289.570182] watchdog: BUG: soft lockup - CPU#82 stuck for 22s! [migration/82:508]

Version-Release number of selected component (if applicable):

RHEL 8.8 with kernel-4.18.0-477.13.1.el8_8.x86_64
RHEL 8.8 with kernel-4.18.0-477.15.1.el8_8.x86_64

How reproducible:

100%. Infoscale 7.4.2 works well on all RHEL8 releases < 8.8.  (verified on 8.4, 8.6 EUS, 8.7)

On RHEL 8.8 the kernel freezes shortly after boot and the machine has to be crashdumped.


Actual results:

The system freezes a few seconds after boot and the machine must be crashdumped.

Expected results:
The system should work as well as it does on previous minor releases.

Additional info:

Comment 1 Vincent S. Cojot 2023-06-30 01:20:44 UTC
When booted into 4.18.0-477.13.1.el8_8 on a VM, the following backtrace is seen:
 vxspec(POE) vxio(POE) vxdmp(POE) vxcafs(POE) vxportal(POE) fdd(POE) amf(POE) vxfs(POE) veki(POE) dell_rbu cfg80211 rfkill dcdbas intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm snd_hda_codec_generic kvm_intel ledtrig_audio kvm irqbypass snd_hda_intel crct10dif_pclmul snd_intel_dspcfg snd_intel_sdw_acpi crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hda_core rapl snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer joydev pcspkr snd virtio_balloon soundcore i2c_piix4 nfsd binfmt_misc nfs_acl lockd auth_rpcgss grace sunrpc xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ata_generic qxl drm_ttm_helper ttm crc32c_intel drm_kms_helper syscopyarea serio_raw sysfillrect sysimgblt ahci fb_sys_fops virtio_console virtio_net libahci ata_piix net_failover virtio_blk virtio_scsi drm failover libata dm_mirror
[   72.873839]  dm_region_hash dm_log dm_mod fuse bridge stp llc
[   72.873845] CPU: 14 PID: 12616 Comm: lltdlv Kdump: loaded Tainted: P           OEL   --------- -  - 4.18.0-477.15.1.el8_8.x86_64 #1
[   72.873848] Hardware name: Red Hat KVM, BIOS 1.16.0-3.module+el8.8.0+16781+9f4724c2 04/01/2014
[   72.873850] RIP: 0010:native_safe_halt+0xe/0x20
[   72.873856] Code: 00 f0 80 48 02 20 48 8b 00 a8 08 75 c0 e9 79 ff ff ff 90 90 90 90 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 06 a0 60 00 fb f4 <e9> 1d 12 40 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 e9 07 00 00
[   72.873859] RSP: 0018:ffffbe6b45babae0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[   72.873861] RAX: 0000000000000003 RBX: ffffffffc0da7598 RCX: 0000000000000008
[   72.873863] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffffc0da7598
[   72.873864] RBP: ffff985b9fdb3d40 R08: 0000000000000008 R09: 000000000000006c
[   72.873865] R10: ffffbe6b45babb80 R11: 0000000000000000 R12: 0000000000000000
[   72.873866] R13: 0000000000000001 R14: 0000000000000100 R15: 00000000003c0000
[   72.873868] FS:  0000000000000000(0000) GS:ffff985b9fd80000(0000) knlGS:0000000000000000
[   72.873869] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   72.873871] CR2: 00007fcbb34131a0 CR3: 00000002b9410003 CR4: 0000000000770ee0
[   72.873874] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   72.873874] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   72.873875] PKRU: 55555554
[   72.873876] Call Trace:
[   72.873878]  kvm_wait+0x58/0x60
[   72.873884]  __pv_queued_spin_lock_slowpath+0x268/0x2a0
[   72.873893]  _raw_spin_lock+0x1e/0x30
[   72.873895]  gms_gab_recv+0x153/0x280 [vxgms]
[   72.873902]  gab_deliver_loop+0x240/0x830 [gab]
[   72.873913]  ? gms_msg_register+0x30/0x30 [vxgms]
[   72.873916]  gab_receive+0x138c/0x1a50 [gab]
[   72.873924]  ? gms_msg_register+0x30/0x30 [vxgms]
[   72.873928]  ? gab_receive_port_que+0x12c/0x9b0 [gab]
[   72.873935]  ? _raw_spin_unlock_bh+0xa/0x20
[   72.873937]  gab_receive_port_que+0x12c/0x9b0 [gab]
[   72.873944]  ? gab_receive_que+0xce/0x260 [gab]
[   72.873951]  gab_receive_que+0xce/0x260 [gab]
[   72.873959]  gab_lrecv+0x153/0x470 [gab]
[   72.873968]  llt_make_recvupcall+0x7b/0x110 [llt]
[   72.873981]  llt_lrsrv_port+0x4ac/0xee0 [llt]
[   72.873991]  llt_deliver+0x11f/0x210 [llt]
[   72.874001]  ? llt_lrsrv_port+0xee0/0xee0 [llt]
[   72.874010]  kthread+0x134/0x150
[   72.874013]  ? set_kthread_struct+0x50/0x50
[   72.874015]  ret_from_fork+0x1f/0x40

Comment 2 Vincent S. Cojot 2023-06-30 01:23:41 UTC
I've reproduced the issue on two systems:

rh8x64 : a KVM guest running RHEL 8.8 + 4.18.0-477.13.1.el8_8
palanthas : a T630 Poweredge running RHEL 8.8 + 4.18.0-477.15.1.el8_8

Both systems were crashdumped while hanging. These are the crashdumps:
-rw-r--r--.   1 raistlin users 826305902 Jun 29 20:34 vmcore-palanthas-20230629-230624191607.zip
-rw-r--r--.   1 raistlin users 321936741 Jun 29 20:26 vmcore-rh8x64-20230624-230624191607.zip

Comment 5 Vincent S. Cojot 2023-06-30 01:37:58 UTC
On a physical el8.8 host, I see backtraces like this:

[  493.581426] Call Trace:
[  493.581429]  _raw_spin_lock+0x1e/0x30
[  493.581436]  gms_event_wait+0xf/0x50 [vxgms]
[  493.581447]  gms_msg_gab_register+0x1b5/0x220 [vxgms]
[  493.581456]  ? odm_clust_start+0x1e0/0x1e0 [vxodm]
[  493.581472]  ? put_cred+0x20/0x20 [vxodm]
[  493.581484]  odm_clust_start+0x86/0x1e0 [vxodm]
[  493.581497]  ? put_cred+0x20/0x20 [vxodm]
[  493.581509]  odm_clust_start_thread+0xe/0x20 [vxodm]
[  493.581521]  odm_kthread_init+0x78/0xa0 [vxodm]
[  493.581534]  kthread+0x134/0x150
[  493.581539]  ? set_kthread_struct+0x50/0x50
[  493.581545]  ret_from_fork+0x1f/0x40

Comment 6 Vincent S. Cojot 2023-06-30 17:00:20 UTC
I tried disabling a few modules (VRTSodm) and am still getting soft lockups, but they don't always point to a VRTS module code path, e.g.:
[   80.716873] watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [kworker/10:2:233]
[   80.718221] Modules linked in: vxfen(POE) vxgms(POE) vxglm(POE) gab(POE) nft_counter nft_compat nf_tables nfnetlink llt(POE) dmpjbod(POE) dmpap(POE) dmpaa(POE) vxspec(POE) vxio(POE) vxdmp(POE) vxcafs(POE) vxportal(POE) fdd(POE) amf(POE) vxfs(POE) veki(POE) dell_rbu cfg80211 rfkill dcdbas intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm snd_hda_codec_generic ledtrig_audio kvm_intel snd_hda_intel kvm snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core irqbypass crct10dif_pclmul snd_hwdep crc32_pclmul ghash_clmulni_intel snd_seq snd_seq_device rapl snd_pcm snd_timer joydev pcspkr snd soundcore virtio_balloon i2c_piix4 nfsd nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc xfs libcrc32c ata_generic sr_mod cdrom sd_mod t10_pi sg qxl drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ata_piix drm ahci libahci virtio_console libata virtio_net crc32c_intel net_failover serio_raw failover virtio_scsi virtio_blk dm_mirror dm_region_hash
[   80.718296]  dm_log dm_mod fuse bridge stp llc
[   80.718303] CPU: 10 PID: 233 Comm: kworker/10:2 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-477.15.1.el8_8.x86_64 #1
[   80.718306] Hardware name: Red Hat KVM, BIOS 1.16.0-3.module+el8.8.0+16781+9f4724c2 04/01/2014
[   80.718309] Workqueue: events drm_fb_helper_damage_work [drm_kms_helper]
[   80.718327] RIP: 0010:smp_call_function_many_cond+0x256/0x290
[   80.718337] Code: 89 c7 e8 7d a9 82 00 3b 05 1b 98 e0 01 0f 83 34 fe ff ff 48 63 d0 49 8b 0e 48 03 0c d5 40 68 3b 94 8b 11 83 e2 01 74 09 f3 90 <8b> 11 83 e2 01 75 f7 eb c9 48 c7 c2 a0 43 fb 94 4c 89 ee 44 89 e7
[   80.718339] RSP: 0018:ffffac33c4fe3c80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[   80.718342] RAX: 0000000000000002 RBX: 0000000000000000 RCX: ffff9c9e9faba840
[   80.718343] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c9780029340
[   80.718344] RBP: 0000000000000000 R08: 0000000080000000 R09: ffff9c9780029f30
[   80.718345] R10: ffff9c9791b58bc0 R11: 0000000000000000 R12: ffff9c9e9fdfa840
[   80.718345] R13: 000000000000000f R14: ffff9c9e9fcb4180 R15: 0000000000000010
[   80.718347] FS:  0000000000000000(0000) GS:ffff9c9e9fc80000(0000) knlGS:0000000000000000
[   80.718348] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   80.718349] CR2: 0000564271ccc0c8 CR3: 00000001e6210002 CR4: 0000000000770ee0
[   80.718352] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   80.718353] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   80.718353] PKRU: 55555554
[   80.718354] Call Trace:
[   80.718368]  ? load_new_mm_cr3+0xe0/0xe0
[   80.718373]  ? load_new_mm_cr3+0xe0/0xe0
[   80.718375]  on_each_cpu+0x2b/0x60
[   80.718378]  flush_tlb_kernel_range+0x48/0x90
[   80.718380]  ? _cond_resched+0x15/0x30
[   80.718386]  ? unmap_kernel_range_noflush+0x3f6/0x4f0
[   80.718389]  __purge_vmap_area_lazy+0x70/0x730
[   80.718392]  free_vmap_area_noflush+0xed/0x100
[   80.718394]  remove_vm_area+0x95/0xa0
[   80.718395]  __vunmap+0x59/0x220
[   80.718397]  ttm_bo_vunmap+0x27/0xb0 [ttm]
[   80.718405]  qxl_bo_vunmap+0xa1/0xc0 [qxl]
[   80.718411]  drm_gem_vunmap+0x24/0x50 [drm]
[   80.718439]  drm_fb_helper_damage_work+0x179/0x310 [drm_kms_helper]
[   80.718447]  process_one_work+0x1a7/0x360
[   80.718452]  ? create_worker+0x1a0/0x1a0
[   80.718454]  worker_thread+0x30/0x390
[   80.718456]  ? create_worker+0x1a0/0x1a0
[   80.718458]  kthread+0x134/0x150
[   80.718461]  ? set_kthread_struct+0x50/0x50
[   80.718463]  ret_from_fork+0x1f/0x40
[   80.844934] LLT INFO V-14-1-11049 softirq not called for 1852 ticks,spawned on cpu 14, link 2
[root@rh8x64 ~]#

Comment 7 Vincent S. Cojot 2023-06-30 17:01:45 UTC
VRTS suggested trying this setting, but it made no difference:

# grep thre /etc/llttab 
set-misc hbthread:0

[root@rh8x64 ~]# lltconfig -H query
Current LLT miscellaneous values:
  sleepalloc   = 1
  hbthread   = 0

Comment 8 loberman 2023-06-30 19:27:24 UTC
Galvatron has the vmcores
 retrace-server-interact 957873383 crash

Hey Vince, can you recapture the panic, but set
kernel.softlockup_panic = 1 in /etc/sysctl.conf and then run sysctl -p,
or use echo 1 > /proc/sys/kernel/softlockup_panic,
so we get the panic right at the soft lockup, please?

You used SysRq-C:

[  128.046040] CPU: 13 PID: 226 Comm: kworker/13:1 Kdump: loaded Tainted: P           OEL   --------- -  - 4.18.0-477.13.1.el8_8.x86_64 #1
[  128.046043] Hardware name: Red Hat KVM, BIOS 1.16.0-3.module+el8.8.0+16781+9f4724c2 04/01/2014
[  128.046045] Workqueue: events drm_fb_helper_damage_work [drm_kms_helper]
[  128.046058] RIP: 0010:smp_call_function_many_cond+0x256/0x290
[  128.046065] Code: 89 c7 e8 4d ac 82 00 3b 05 eb 9a e0 01 0f 83 34 fe ff ff 48 63 d0 49 8b 0e 48 03 0c d5 40 68 5b 84 8b 11 83 e2 01 74 09 f3 90 <8b> 11 83 e2 01 75 f7 eb c9 48 c7 c2 a0 43 1b 85 4c 89 ee 44 89 e7
[  128.046067] RSP: 0018:ffffa028c4fd3b58 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[  128.046070] RAX: 0000000000000009 RBX: 0000000000000000 RCX: ffff8d001fc7a8a0
[  128.046071] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8cf900029ab0
[  128.046072] RBP: 0000000000000000 R08: 0000000080000000 R09: ffff8cf900029920
[  128.046073] R10: ffff8cf9415f7b78 R11: 0000000000000000 R12: ffff8d001fdfa8a0
[  128.046074] R13: 000000000000000f R14: ffff8d001fd74100 R15: 0000000000000010
[  128.046075] FS:  0000000000000000(0000) GS:ffff8d001fd40000(0000) knlGS:0000000000000000
[  128.046077] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  128.046078] CR2: 00007f7971d0c1a0 CR3: 00000000a8610002 CR4: 0000000000770ee0
[  128.046081] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  128.046082] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  128.046083] PKRU: 55555554
[  128.046084] Call Trace:
[  128.046086]  ? load_new_mm_cr3+0xe0/0xe0
[  128.046089]  ? load_new_mm_cr3+0xe0/0xe0
[  128.046091]  on_each_cpu+0x2b/0x60
[  128.046094]  flush_tlb_kernel_range+0x48/0x90
[  128.046096]  ? _cond_resched+0x15/0x30
[  128.046102]  ? unmap_kernel_range_noflush+0x3f6/0x4f0
[  128.046104]  __purge_vmap_area_lazy+0x70/0x730
[  128.046106]  free_vmap_area_noflush+0xed/0x100
[  128.046108]  remove_vm_area+0x95/0xa0
[  128.046109]  __vunmap+0x59/0x220
[  128.046111]  ttm_bo_vunmap+0x27/0xb0 [ttm]
[  128.046117]  qxl_draw_dirty_fb+0x2ad/0x450 [qxl]
[  128.046123]  qxl_framebuffer_surface_dirty+0xf8/0x1d0 [qxl]
[  128.046126]  ? kfree+0xd3/0x250
[  128.046130]  drm_fb_helper_damage_work+0x1aa/0x310 [drm_kms_helper]
[  128.046150]  process_one_work+0x1a7/0x360
[  128.046154]  ? create_worker+0x1a0/0x1a0
[  128.046156]  worker_thread+0x30/0x390
[  128.046158]  ? create_worker+0x1a0/0x1a0
[  128.046160]  kthread+0x134/0x150
[  128.046163]  ? set_kthread_struct+0x50/0x50
[  128.046164]  ret_from_fork+0x1f/0x40
[  128.223025] GAB WARNING V-15-1-20126 Port d[GAB_LEGACY_CLIENT (refcount 0)] not ready for reconfiguration, will retry
[  129.186986] LLT INFO V-14-1-11049 softirq not called for 11371 ticks,spawned on cpu 5, link 2
[  129.315293] sysrq: SysRq : Trigger a crash
[  129.316602] Kernel panic - not syncing: sysrq triggered crash
               
[  129.320939] CPU: 10 PID: 12696 Comm: bash Kdump: loaded Tainted: P           OEL   --------- -  - 4.18.0-477.13.1.el8_8.x86_64 #1
[  129.327278] Hardware name: Red Hat KVM, BIOS 1.16.0-3.module+el8.8.0+16781+9f4724c2 04/01/2014
[  129.329840] Call Trace:
[  129.330504]  dump_stack+0x41/0x60
[  129.331474]  panic+0xe7/0x2ac
[  129.332348]  ? printk+0x58/0x73
[  129.332902]  sysrq_handle_crash+0x11/0x20
[  129.333592]  __handle_sysrq.cold.13+0x48/0xff
[  129.334329]  write_sysrq_trigger+0x2b/0x40
[  129.335044]  proc_reg_write+0x39/0x60
[  129.335675]  vfs_write+0xa5/0x1b0
[  129.336184]  ksys_write+0x4f/0xb0
[  129.336599]  do_syscall_64+0x5b/0x1b0
[  129.337060]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  129.337683] RIP: 0033:0x7fa026554a28
[  129.338124] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 15 4d 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  129.340371] RSP: 002b:00007ffc64bc2498 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  129.341298] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa026554a28
[  129.342171] RDX: 0000000000000002 RSI: 0000561a82528700 RDI: 0000000000000001
[  129.343033] RBP: 0000561a82528700 R08: 000000000000000a R09: 00007fa0265b4ae0
[  129.343890] R10: 000000000000000a R11: 0000000000000246 R12: 00007fa0267f56e0
[  129.344736] R13: 0000000000000002 R14: 00007fa0267f0860 R15: 0000000000000002
crash> bt
PID: 12696    TASK: ffff8cfb04f4c000  CPU: 10   COMMAND: "bash"
 #0 [ffffa028c6bb3cd8] machine_kexec at ffffffff8326bec3
 #1 [ffffa028c6bb3d30] __crash_kexec at ffffffff833b564a
 #2 [ffffa028c6bb3df0] panic at ffffffff832f70bf
 #3 [ffffa028c6bb3e70] sysrq_handle_crash at ffffffff837fadb1
 #4 [ffffa028c6bb3e78] __handle_sysrq.cold.13 at ffffffff837fb6d4
 #5 [ffffa028c6bb3ea8] write_sysrq_trigger at ffffffff837fb57b
 #6 [ffffa028c6bb3eb8] proc_reg_write at ffffffff835ea7b9
 #7 [ffffa028c6bb3ed0] vfs_write at ffffffff83564bf5
 #8 [ffffa028c6bb3f00] ksys_write at ffffffff83564e7f
 #9 [ffffa028c6bb3f38] do_syscall_64 at ffffffff832052fb
#10 [ffffa028c6bb3f50] entry_SYSCALL_64_after_hwframe at ffffffff83e000a9
    RIP: 00007fa026554a28  RSP: 00007ffc64bc2498  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000002  RCX: 00007fa026554a28
    RDX: 0000000000000002  RSI: 0000561a82528700  RDI: 0000000000000001
    RBP: 0000561a82528700   R8: 000000000000000a   R9: 00007fa0265b4ae0
    R10: 000000000000000a  R11: 0000000000000246  R12: 00007fa0267f56e0
    R13: 0000000000000002  R14: 00007fa0267f0860  R15: 0000000000000002
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
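The softlockup_panic toggle that comment 8 asks for can be applied along these lines (a sketch only: the sysctl.d drop-in path is an assumed convention, and this makes the very next soft lockup panic the box and trigger kdump, so it is a debug-only setting):

```shell
# Sketch (assumes root): panic at the first soft lockup so kdump captures
# a vmcore at the moment of the hang, instead of after a manual SysRq-C.
sysctl -w kernel.softlockup_panic=1

# Assumed convention for persisting the setting across reboots.
echo 'kernel.softlockup_panic = 1' > /etc/sysctl.d/90-softlockup-panic.conf
sysctl -p /etc/sysctl.d/90-softlockup-panic.conf
```

Once the vmcore is captured, the setting should be reverted to 0.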

Comment 9 Ming Lei 2023-07-02 14:17:19 UTC
(In reply to Vincent S. Cojot from comment #0)
> Description of problem:
> 
> The latest patchlevel of Veritas Infoscale 7.4.2 for RHEL8 works well on
> RHEL8.y -except- RHEL 8.8. On RHEL 8.8 the system boots and then shortly
> thereafter:
> 
> [  261.570000] watchdog: BUG: soft lockup - CPU#15 stuck for 22s!
> [kworker/15:3:28838]
> [  261.570052] watchdog: BUG: soft lockup - CPU#82 stuck for 22s!
> [migration/82:508]
> [  281.570037] watchdog: BUG: soft lockup - CPU#9 stuck for 23s!
> [migration/9:70]
> [  281.570123] watchdog: BUG: soft lockup - CPU#35 stuck for 23s!
> [lltdlv:26988]
> [  285.570146] watchdog: BUG: soft lockup - CPU#66 stuck for 22s!
> [odm_clust_start:27115]
> [  289.570129] watchdog: BUG: soft lockup - CPU#15 stuck for 22s!
> [kworker/15:3:28838]
> [  289.570182] watchdog: BUG: soft lockup - CPU#82 stuck for 22s!
> [migration/82:508]
> 
> Version-Release number of selected component (if applicable):
> 
> RHEL 8.8 with kernel-4.18.0-477.13.1.el8_8.x86_64
> RHEL 8.8 with kernel-4.18.0-477.15.1.el8_8.x86_64
> 
> How reproducible:
> 
> 100%. Infoscale 7.4.2 works well on all RHEL8 releases < 8.8.  (verified on
> 8.4, 8.6 EUS, 8.7)
> 
> On RHEL 8.8 the kernel freezes shortly after boot and the machine has to be
> crashdump'ed.

Hi Vincent, 

From the debug log, nothing related to the block layer is dumped, and I'm not sure why
you set the sub-component to Block Layer. :-)

I am not familiar with the related code path involved in your log.

But for this soft lockup issue, given that you can trigger it reliably on a KVM guest, I'd suggest
narrowing down which 8.8 kernel release first shows the issue; then it should be easier
to see which commit causes it.

Thanks,

Comment 10 Vincent S. Cojot 2023-07-07 02:13:14 UTC
Hi Ming,
Thank you for your reply. I've reached out to VRTS to let them know.
I could also reproduce the issue on a physical machine and on each of the RHEL8.8 kernels released so far.

Comment 11 Vincent S. Cojot 2023-07-12 21:39:28 UTC
Traced this down to Infoscale loading /etc/vx/kernel/vxgms.ko.4.18.0-425.3.1.el8.x86_64 and /etc/vx/kernel/vxglm.ko.4.18.0-425.3.1.el8.x86_64 on RHEL 8.8 instead of the more recent /etc/vx/kernel/vxgms.ko.4.18.0-425.10.1.el8_7.x86_64 and /etc/vx/kernel/vxglm.ko.4.18.0-425.10.1.el8_7.x86_64.
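Since the Veritas module files encode their target kernel in a `.ko.<kver>` file-name suffix, the mismatch can be spotted by comparing that suffix with the running kernel. A hypothetical helper (`check_module_kernel` is my name, not a VRTS tool; the module paths are the ones from this comment):

```shell
# Hypothetical helper: compare the kernel a Veritas module file was built for
# (taken from its .ko.<kver> file-name suffix) with the running kernel.
check_module_kernel() {
    mod="$1"        # e.g. /etc/vx/kernel/vxgms.ko.4.18.0-425.3.1.el8.x86_64
    running="$2"    # e.g. the output of `uname -r`
    mod_kernel="${mod##*.ko.}"   # strip everything up to and including ".ko."
    if [ "$mod_kernel" = "$running" ]; then
        echo "match"
    else
        echo "mismatch: module built for $mod_kernel, running $running"
    fi
}

# Example: the combination seen on the affected RHEL 8.8 hosts
check_module_kernel /etc/vx/kernel/vxgms.ko.4.18.0-425.3.1.el8.x86_64 "$(uname -r)"
```

On the hosts above this would report a module built for 4.18.0-425.3.1.el8 while running a 4.18.0-477.x el8_8 kernel.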