Description of problem:
The server stalls with CPU soft lockups after a while (hours), starting with kernel 4.18.0-425.13.1.el8_7.x86_64.

How reproducible:

Steps to Reproduce:
1. Upgrade from kernel-4.18.0-372.32.1.el8_6.x86_64 to kernel-4.18.0-425.13.1.el8_7.x86_64.

Result:
A new soft lockup message is logged roughly every minute. All VMs use a lot of CPU, and both the server and the VMs (virt-manager/KVM) become unusable. A hard reset is needed.

Kernel log excerpt:
Mar 6 18:18:26 server000 kernel: CPU: 14 PID: 32788 Comm: kworker/u256:14 Tainted: G L --------- - - 4.18.0-425.13.1.el8_7.x86_64 #1
Mar 6 18:18:26 server000 kernel: Hardware name: Supermicro AS -5019D-FTN4/M11SDV-8C-LN4F, BIOS 1.0b 02/15/2020
Mar 6 18:18:26 server000 kernel: Workqueue: kcryptd/253:3 kcryptd_crypt [dm_crypt]
Mar 6 18:18:26 server000 kernel: RIP: 0010:aesni_xts_crypt8+0x11e/0x270
Mar 6 18:18:26 server000 kernel: Code: 0f 6f 26 66 41 0f ef c4 f3 0f 7f 06 66 44 0f 70 db 13 66 0f d4 db 66 41 0f 72 e3 1f 66 45 0f db da 66 41 0f ef db 66 0f 6f c3 <f3> 44 0f 6f 62 40 66 41 0f ef c4 f3 0f 7f 5e 40 f3 44 0f 6f 66 10
Mar 6 18:18:26 server000 kernel: RSP: 0018:ffffb0ce85a93c18 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
Mar 6 18:18:26 server000 kernel: RAX: ffffffff9d08edb0 RBX: 0000000000000080 RCX: 0000000000000000
Mar 6 18:18:26 server000 kernel: RDX: ffff898f42bf9e00 RSI: ffff898e89791e00 RDI: ffff898fc8574260
Mar 6 18:18:26 server000 kernel: RBP: ffffb0ce85a93d08 R08: ffff898f5adaf5a0 R09: 0000000000000020
Mar 6 18:18:26 server000 kernel: R10: ffff898fc85742d0 R11: ffffffff9d08eb50 R12: 0000000000000200
Mar 6 18:18:26 server000 kernel: R13: ffffffff9de0ab68 R14: ffffffff9de0ab68 R15: 0000000000000000
Mar 6 18:18:26 server000 kernel: FS: 0000000000000000(0000) GS:ffff89967d380000(0000) knlGS:0000000000000000
Mar 6 18:18:26 server000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 6 18:18:26 server000 kernel: CR2: 000055610f6253e2 CR3: 0000000166810000 CR4: 00000000003506e0
Mar 6 18:18:26 server000 kernel: Call Trace:
Mar 6 18:18:26 server000 kernel: ? glue_xts_req_128bit+0xe6/0x1a0
Mar 6 18:18:26 server000 kernel: ? _aesni_enc1+0xb0/0xb0
Mar 6 18:18:26 server000 kernel: ? crypt_convert+0x9d3/0x1040 [dm_crypt]
Mar 6 18:18:26 server000 kernel: ? crypt_page_alloc+0x49/0x60 [dm_crypt]
Mar 6 18:18:26 server000 kernel: ? mempool_alloc+0x67/0x180
Mar 6 18:18:26 server000 kernel: ? kcryptd_crypt+0x33c/0x460 [dm_crypt]
Mar 6 18:18:26 server000 kernel: ? process_one_work+0x1a7/0x360
Mar 6 18:18:26 server000 kernel: ? worker_thread+0x30/0x390
Mar 6 18:18:26 server000 kernel: ? create_worker+0x1a0/0x1a0
Mar 6 18:18:26 server000 kernel: ? kthread+0x10b/0x130
Mar 6 18:18:26 server000 kernel: ? set_kthread_struct+0x50/0x50
Mar 6 18:18:26 server000 kernel: ? ret_from_fork+0x35/0x40
Mar 6 18:18:54 server000 kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [kworker/u256:14:32788]
Mar 6 18:18:54 server000 kernel: Modules linked in: vhost_net tun vhost vhost_iotlb macvtap macvlan tap vfio_pci vfio_virqfd vfio_iommu_type1 vfio nfnetlink sunrpc dm_crypt ipmi_ssif igbvf intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd kvm_amd kvm raid1 irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl joydev sp5100_tco pcspkr k10temp i2c_piix4 ccp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler i2c_designware_platform i2c_designware_core acpi_cpufreq ext4 mbcache jbd2 xfs libcrc32c sd_mod sg ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm nvme igb ahci libahci nvme_core drm dca crc32c_intel libata i2c_algo_bit t10_pi dm_mirror dm_region_hash dm_log dm_mod

Output of cryptsetup luksDump /dev/md0:
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
UUID:           REDACTED
Label:          (no label)
Subsystem:      (no subsystem)
Flags:          (no flags)

Data segments:
  0: crypt
        offset: 16777216 [bytes]
        length: (whole device)
        cipher: aes-xts-plain64
        sector: 512 [bytes]

Reverting to kernel-4.18.0-372.32.1.el8_6.x86_64 solves the issue.
Update: This was caused by faulty ECC RAM. Please close the bug.