Bug 2175905

Summary: Server stalls with CPU soft lockups after some hours, starting from kernel 4.18.0-425.13.1.el8_7.x86_64
Product: Red Hat Enterprise Linux 8
Reporter: svenvd.github
Component: kernel
Assignee: LVM Team <lvm-team>
kernel sub component: Crypt
QA Contact: Storage QE <storage-qe>
Status: NEW
Severity: urgent
Priority: unspecified
CC: agk, msnitzer, okozina
Version: 8.7
Target Milestone: rc
Hardware: x86_64
OS: Linux
Type: Bug

Description svenvd.github 2023-03-06 18:12:46 UTC
Description of problem:

The server stalls with CPU soft lockups after running for some hours, starting from kernel 4.18.0-425.13.1.el8_7.x86_64.

How reproducible:

Steps to Reproduce:
1. Upgrade from kernel-4.18.0-372.32.1.el8_6.x86_64 to kernel-4.18.0-425.13.1.el8_7.x86_64


Result:

A new soft lockup message is logged every minute or so. All VMs then use a lot of CPU, and both the server and the VMs (virt-manager/KVM) become unusable. A hard reset is needed.

Mar  6 18:18:26 server000 kernel: CPU: 14 PID: 32788 Comm: kworker/u256:14 Tainted: G             L   --------- -  - 4.18.0-425.13.1.el8_7.x86_64 #1
Mar  6 18:18:26 server000 kernel: Hardware name: Supermicro AS -5019D-FTN4/M11SDV-8C-LN4F, BIOS 1.0b 02/15/2020
Mar  6 18:18:26 server000 kernel: Workqueue: kcryptd/253:3 kcryptd_crypt [dm_crypt]
Mar  6 18:18:26 server000 kernel: RIP: 0010:aesni_xts_crypt8+0x11e/0x270
Mar  6 18:18:26 server000 kernel: Code: 0f 6f 26 66 41 0f ef c4 f3 0f 7f 06 66 44 0f 70 db 13 66 0f d4 db 66 41 0f 72 e3 1f 66 45 0f db da 66 41 0f ef db 66 0f 6f c3 <f3> 44 0f 6f 62 40 66 41 0f ef c4 f3 0f 7f 5e 40 f3 44 0f 6f 66 10
Mar  6 18:18:26 server000 kernel: RSP: 0018:ffffb0ce85a93c18 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
Mar  6 18:18:26 server000 kernel: RAX: ffffffff9d08edb0 RBX: 0000000000000080 RCX: 0000000000000000
Mar  6 18:18:26 server000 kernel: RDX: ffff898f42bf9e00 RSI: ffff898e89791e00 RDI: ffff898fc8574260
Mar  6 18:18:26 server000 kernel: RBP: ffffb0ce85a93d08 R08: ffff898f5adaf5a0 R09: 0000000000000020
Mar  6 18:18:26 server000 kernel: R10: ffff898fc85742d0 R11: ffffffff9d08eb50 R12: 0000000000000200
Mar  6 18:18:26 server000 kernel: R13: ffffffff9de0ab68 R14: ffffffff9de0ab68 R15: 0000000000000000
Mar  6 18:18:26 server000 kernel: FS:  0000000000000000(0000) GS:ffff89967d380000(0000) knlGS:0000000000000000
Mar  6 18:18:26 server000 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  6 18:18:26 server000 kernel: CR2: 000055610f6253e2 CR3: 0000000166810000 CR4: 00000000003506e0
Mar  6 18:18:26 server000 kernel: Call Trace:
Mar  6 18:18:26 server000 kernel: ? glue_xts_req_128bit+0xe6/0x1a0
Mar  6 18:18:26 server000 kernel: ? _aesni_enc1+0xb0/0xb0
Mar  6 18:18:26 server000 kernel: ? crypt_convert+0x9d3/0x1040 [dm_crypt]
Mar  6 18:18:26 server000 kernel: ? crypt_page_alloc+0x49/0x60 [dm_crypt]
Mar  6 18:18:26 server000 kernel: ? mempool_alloc+0x67/0x180
Mar  6 18:18:26 server000 kernel: ? kcryptd_crypt+0x33c/0x460 [dm_crypt]
Mar  6 18:18:26 server000 kernel: ? process_one_work+0x1a7/0x360
Mar  6 18:18:26 server000 kernel: ? worker_thread+0x30/0x390
Mar  6 18:18:26 server000 kernel: ? create_worker+0x1a0/0x1a0
Mar  6 18:18:26 server000 kernel: ? kthread+0x10b/0x130
Mar  6 18:18:26 server000 kernel: ? set_kthread_struct+0x50/0x50
Mar  6 18:18:26 server000 kernel: ? ret_from_fork+0x35/0x40
Mar  6 18:18:54 server000 kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [kworker/u256:14:32788]
Mar  6 18:18:54 server000 kernel: Modules linked in: vhost_net tun vhost vhost_iotlb macvtap macvlan tap vfio_pci vfio_virqfd vfio_iommu_type1 vfio nfnetlink sunrpc dm_crypt ipmi_ssif igbvf intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd kvm_amd kvm raid1 irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl joydev sp5100_tco pcspkr k10temp i2c_piix4 ccp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler i2c_designware_platform i2c_designware_core acpi_cpufreq ext4 mbcache jbd2 xfs libcrc32c sd_mod sg ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm nvme igb ahci libahci nvme_core drm dca crc32c_intel libata i2c_algo_bit t10_pi dm_mirror dm_region_hash dm_log dm_mod
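
For anyone triaging a similar report, the frequency of the lockups and the CPUs involved can be summarized from the log. The following is a minimal sketch, not part of the original report. It assumes the watchdog line format shown above, and the function names (parse_soft_lockups, lockups_per_cpu) are hypothetical helpers introduced only for illustration.

```python
import re
from collections import Counter

# Assumption: watchdog lines look like the one in this report, e.g.
# "watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [kworker/u256:14:32788]"
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return one dict per soft lockup line found in an iterable of log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append(
                {
                    "cpu": int(match["cpu"]),
                    "seconds": int(match["secs"]),
                    "task": match["task"],
                    "pid": int(match["pid"]),
                }
            )
    return events


def lockups_per_cpu(events):
    """Count soft lockup events per CPU number."""
    return Counter(event["cpu"] for event in events)
```

A possible use is to feed it the lines of a saved kernel log, for example `lockups_per_cpu(parse_soft_lockups(open(path)))` where `path` points to a copy of the log. Lockups spread over many CPUs and many different tasks would point away from a single code path.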

cryptsetup luksDump /dev/md0
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
UUID:           REDACTED
Label:          (no label)
Subsystem:      (no subsystem)
Flags:          (no flags)

Data segments:
  0: crypt
        offset: 16777216 [bytes]
        length: (whole device)
        cipher: aes-xts-plain64
        sector: 512 [bytes]

Reverting to kernel-4.18.0-372.32.1.el8_6.x86_64 resolves the issue.

Comment 1 svenvd.github 2023-03-17 18:04:37 UTC
Update:

This was caused by faulty ECC RAM. Please close the bug.
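
The report does not say how the faulty RAM was identified. As one possible check on systems with ECC memory, the kernel EDAC subsystem exposes corrected and uncorrected error counters in sysfs. The following is a minimal sketch under that assumption: it presumes an EDAC driver is loaded and that the counters are at /sys/devices/system/edac/mc/mc*/ce_count and ue_count. The function name read_edac_counts is a hypothetical helper, not something the reporter used.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Read corrected (ce) and uncorrected (ue) error counts per memory controller.

    Assumption: an EDAC driver is loaded and exposes mc<N>/ce_count and
    mc<N>/ue_count under edac_root. Returns an empty dict otherwise.
    """
    root = Path(edac_root)
    counts = {}
    if not root.is_dir():
        return counts
    for controller in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((controller / name).read_text().strip())
            except (OSError, ValueError):
                # Counter file missing or unreadable: record it as unknown.
                entry[name] = None
        counts[controller.name] = entry
    return counts
```

Nonzero or steadily increasing counts would be a reason to test the memory before suspecting a kernel regression. Counts of zero do not prove the memory is healthy.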