Description of problem:
The kernel gets into a deadlock, with the following error in dmesg:

kernel: ------------[ cut here ]------------
kernel: cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || cfs_rq->avg.runnable_avg
kernel: WARNING: CPU: 62 PID: 383337 at kernel/sched/fair.c:3348 update_blocked_averages+0x62a/0x650
kernel: Modules linked in: ip_vs_rr xt_mark xt_ipvs xt_state ip_vs xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_conntrack ipt_MASQUERADE nf_conntrack_netlink nft_counter xt_addrtype nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink br_netfilter bridge stp llc rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache overlay intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp ledtrig_audio dell_smbios rfkill coretemp iTCO_wdt iTCO_vendor_support video crct10dif_pclmul wmi_bmof dell_wmi_descriptor crc32_pclmul dcdbas ghash_clmulni_intel ipmi_ssif rapl intel_cstate intel_uncore pcspkr ses enclosure scsi_transport_sas joydev i2c_i801 lpc_ich mei_me mei acpi_ipmi wmi ext4 ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mbcache jbd2 auth_rpcgss sunrpc xfs sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea bnx2x sysfillrect sysimgblt fb_sys_fops drm ahci libahci mdio
kernel: libcrc32c megaraid_sas libata crc32c_intel i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod fuse
kernel: CPU: 62 PID: 383337 Comm: kworker/62:0 Kdump: loaded Not tainted 4.18.0-365.el8.x86_64 #1
kernel: Hardware name: Dell Inc. PowerEdge M640/05YC4P, BIOS 2.12.2 07/12/2021
kernel: Workqueue: 0x0 (events)
kernel: RIP: 0010:update_blocked_averages+0x62a/0x650
kernel: Code: c0 99 ad 9b c6 05 78 2e c3 01 01 e8 39 2f fc ff 0f 0b e9 47 fa ff ff 48 c7 c7 e0 9d ad 9b c6 05 5a 2e c3 01 01 e8 1f 2f fc ff <0f> 0b 8b 93 38 01 00 00 e9 8a fc ff ff 80 3d 46 2e c3 01 00 75 93
kernel: RSP: 0018:ffffa9e9a04efd68 EFLAGS: 00010086
kernel: RAX: 0000000000000000 RBX: ffff8e377ffeaec0 RCX: 0000000000000007
kernel: RDX: 0000000000000007 RSI: 00000000ffff7fff RDI: ffff8e377ffd6750
kernel: RBP: ffff8e377ffeb000 R08: 0000000000000000 R09: c0000000ffff7fff
kernel: R10: 0000000000000001 R11: ffffa9e9a04efb80 R12: ffff8e377ffeb668
kernel: R13: 0000000000000001 R14: ffff8e377ffeae40 R15: 0000000000000000
kernel: FS: 0000000000000000(0000) GS:ffff8e377ffc0000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00000000000000b0 CR3: 000000275ba10001 CR4: 00000000007706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: ? entry_SYSCALL_64_after_hwframe+0xb9/0xca
kernel: newidle_balance+0xcb/0x3c0
kernel: pick_next_task_fair+0x3e/0x3b0
kernel: __schedule+0x146/0x830
kernel: ? create_worker+0x1a0/0x1a0
kernel: schedule+0x35/0xa0
kernel: worker_thread+0xb7/0x390
kernel: ? create_worker+0x1a0/0x1a0
kernel: kthread+0x10a/0x120
kernel: ? set_kthread_struct+0x40/0x40
kernel: ret_from_fork+0x35/0x40
kernel: ---[ end trace 00c4093b0733bf91 ]---

Version-Release number of selected component (if applicable):
Linux version 4.18.0-365.el8.x86_64 (mockbuild.centos.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Thu Feb 10 16:11:23 UTC 2022

How reproducible:
It happens consistently when the system is under load, but I am unable to reproduce the error at will.

Steps to Reproduce:
1. Run the system for a few days and wait. Sorry for the poor description, I haven't been able to reproduce it on demand.
2. Has occurred on multiple systems.
3. Did not occur on CentOS 8; it has only started since moving to CentOS Stream.

Actual results:
The system freezes. The console doesn't respond to the keyboard. Existing processes may continue to run, but will eventually lock up. The system responds to ping, but will not accept an ssh session.

Expected results:
The system runs normally.

Additional info:
I am not sure what else to do to try to resolve this error. I'm happy to try any suggestions.

Edit: typo
If I leave the box for longer, I start to see the following errors too:

INFO: task kworker/64:2:125809 blocked for more than 120 seconds.
Tainted: G W --------- - - 4.18.0-365.el8.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/64:2 state:D stack: 0 pid:125809 ppid: 2 flags:0x80004080
Workqueue: cgroup_destroy css_free_rwork_fn
Call Trace:
__schedule+0x2d1/0x830
schedule+0x35/0xa0
schedule_timeout+0x274/0x300
? load_balance+0x163/0xc20
? recalibrate_cpu_khz+0x10/0x10
? ktime_get+0x3e/0xa0
wait_for_completion+0x96/0x100
flush_workqueue+0x14d/0x440
? __switch_to_asm+0x35/0x70
cgroup1_pidlist_destroy_all+0x7c/0xa0
css_free_rwork_fn+0xe3/0x3a0
process_one_work+0x1a7/0x360
? create_worker+0x1a0/0x1a0
worker_thread+0x30/0x390
? create_worker+0x1a0/0x1a0
kthread+0x10a/0x120
? set_kthread_struct+0x40/0x40
ret_from_fork+0x35/0x40
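Since the problem cannot be triggered on demand, it may help to scan saved kernel logs across the affected systems for the two signatures reported above: the scheduler WARNING and the hung-task message. The following is only a rough sketch, not a diagnostic tool; the function name scan_kernel_log, the regular expressions, and the SAMPLE lines (copied from the logs in this report) are illustrative assumptions, and other kernels may format these messages differently.

```python
import re

# Assumed patterns, derived only from the two messages quoted in this report.
WARN_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at (?P<loc>\S+) (?P<sym>[^\s+]+)\+"
)
HUNG_RE = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def scan_kernel_log(lines):
    """Return one dict per WARNING or hung-task line found in `lines`."""
    events = []
    for line in lines:
        m = WARN_RE.search(line)
        if m:
            events.append({
                "kind": "warning",
                "cpu": int(m["cpu"]),
                "pid": int(m["pid"]),
                "location": m["loc"],   # source file and line number
                "symbol": m["sym"],     # function the warning fired in
            })
            continue
        m = HUNG_RE.search(line)
        if m:
            events.append({
                "kind": "hung_task",
                "task": m["task"],
                "pid": int(m["pid"]),
                "seconds": int(m["secs"]),
            })
    return events


# Sample lines taken from the logs in this report.
SAMPLE = [
    "kernel: WARNING: CPU: 62 PID: 383337 at kernel/sched/fair.c:3348 "
    "update_blocked_averages+0x62a/0x650",
    "kernel: CPU: 62 PID: 383337 Comm: kworker/62:0 Kdump: loaded Not tainted",
    "INFO: task kworker/64:2:125809 blocked for more than 120 seconds.",
]

for event in scan_kernel_log(SAMPLE):
    print(event)
```

The same function could be fed the lines of a saved dmesg or journalctl -k capture; counting events per host might show whether the WARNING always precedes the hung tasks.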
*** Bug 2079179 has been marked as a duplicate of this bug. ***
*** Bug 2061658 has been marked as a duplicate of this bug. ***
*** Bug 2046454 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7683