Bug 2056383

Summary: System freezes with callstack in dmesg: ret_from_fork
Product: Red Hat Enterprise Linux 8
Reporter: Mark Assad <mark.assad>
Component: kernel
Assignee: Phil Auld <pauld>
kernel sub component: Scheduler
QA Contact: Waylon Cude <wcude>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: unspecified
CC: aleksandar.ivanisevic, bhu, bstinson, chuhu, hmatsumo, jim, junichi.nomura, jwboyer, vbendel, wcude
Version: 8.6
Keywords: Triaged, ZStream
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-4.18.0-392.el8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2096305
Environment:
Last Closed: 2022-11-08 10:21:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2037123
Bug Blocks: 2090535, 2096305

Description Mark Assad 2022-02-21 06:24:06 UTC
Description of problem:

The kernel gets into a deadlock, with the following warning in dmesg:

kernel: ------------[ cut here ]------------
kernel: cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || cfs_rq->avg.runnable_avg
kernel: WARNING: CPU: 62 PID: 383337 at kernel/sched/fair.c:3348 update_blocked_averages+0x62a/0x650
kernel: Modules linked in: ip_vs_rr xt_mark xt_ipvs xt_state ip_vs xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_conntrack ipt_MASQUERADE nf_conntrack_netlink nft_counter xt_addrtype nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink br_netfilter bridge stp llc rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache overlay intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp ledtrig_audio dell_smbios rfkill coretemp iTCO_wdt iTCO_vendor_support video crct10dif_pclmul wmi_bmof dell_wmi_descriptor crc32_pclmul dcdbas ghash_clmulni_intel ipmi_ssif rapl intel_cstate intel_uncore pcspkr ses enclosure scsi_transport_sas joydev i2c_i801 lpc_ich mei_me mei acpi_ipmi wmi ext4 ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mbcache jbd2 auth_rpcgss sunrpc xfs sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea bnx2x sysfillrect sysimgblt fb_sys_fops drm ahci libahci mdio
kernel:  libcrc32c megaraid_sas libata crc32c_intel i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod fuse
kernel: CPU: 62 PID: 383337 Comm: kworker/62:0 Kdump: loaded Not tainted 4.18.0-365.el8.x86_64 #1
kernel: Hardware name: Dell Inc. PowerEdge M640/05YC4P, BIOS 2.12.2 07/12/2021
kernel: Workqueue:  0x0 (events)
kernel: RIP: 0010:update_blocked_averages+0x62a/0x650
kernel: Code: c0 99 ad 9b c6 05 78 2e c3 01 01 e8 39 2f fc ff 0f 0b e9 47 fa ff ff 48 c7 c7 e0 9d ad 9b c6 05 5a 2e c3 01 01 e8 1f 2f fc ff <0f> 0b 8b 93 38 01 00 00 e9 8a fc ff ff 80 3d 46 2e c3 01 00 75 93
kernel: RSP: 0018:ffffa9e9a04efd68 EFLAGS: 00010086
kernel: RAX: 0000000000000000 RBX: ffff8e377ffeaec0 RCX: 0000000000000007
kernel: RDX: 0000000000000007 RSI: 00000000ffff7fff RDI: ffff8e377ffd6750
kernel: RBP: ffff8e377ffeb000 R08: 0000000000000000 R09: c0000000ffff7fff
kernel: R10: 0000000000000001 R11: ffffa9e9a04efb80 R12: ffff8e377ffeb668
kernel: R13: 0000000000000001 R14: ffff8e377ffeae40 R15: 0000000000000000
kernel: FS:  0000000000000000(0000) GS:ffff8e377ffc0000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00000000000000b0 CR3: 000000275ba10001 CR4: 00000000007706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  ? entry_SYSCALL_64_after_hwframe+0xb9/0xca
kernel:  newidle_balance+0xcb/0x3c0
kernel:  pick_next_task_fair+0x3e/0x3b0
kernel:  __schedule+0x146/0x830
kernel:  ? create_worker+0x1a0/0x1a0
kernel:  schedule+0x35/0xa0
kernel:  worker_thread+0xb7/0x390
kernel:  ? create_worker+0x1a0/0x1a0
kernel:  kthread+0x10a/0x120
kernel:  ? set_kthread_struct+0x40/0x40
kernel:  ret_from_fork+0x35/0x40
kernel: ---[ end trace 00c4093b0733bf91 ]---
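
The WARNING line in the trace above carries the CPU, PID, source location, and symbol, so occurrences can be pulled out of a saved log mechanically. The following is a minimal sketch in Python, assuming the kernel log has been saved as text (for example from dmesg). The name find_sched_warnings and the regular expression are illustrative assumptions, not part of any Red Hat tooling.

```python
import re

# Matches lines such as:
#   kernel: WARNING: CPU: 62 PID: 383337 at kernel/sched/fair.c:3348 update_blocked_averages+0x62a/0x650
# The pattern is an assumption based on the single trace in this report.
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<location>\S+) (?P<symbol>[^+\s]+)\+"
)


def find_sched_warnings(log_text, symbol="update_blocked_averages"):
    """Return one dict per WARNING line raised from the given symbol."""
    hits = []
    for line in log_text.splitlines():
        match = WARNING_RE.search(line)
        if match and match.group("symbol") == symbol:
            hits.append(
                {
                    "cpu": int(match.group("cpu")),
                    "pid": int(match.group("pid")),
                    "location": match.group("location"),
                }
            )
    return hits


# Two lines copied from the trace in this report; only the first is a WARNING.
sample = (
    "kernel: WARNING: CPU: 62 PID: 383337 at kernel/sched/fair.c:3348 "
    "update_blocked_averages+0x62a/0x650\n"
    "kernel: CPU: 62 PID: 383337 Comm: kworker/62:0 Kdump: loaded Not tainted "
    "4.18.0-365.el8.x86_64 #1\n"
)
print(find_sched_warnings(sample))
```

Counting the hits per host is one way to tell which of several systems have already logged the warning before they freeze.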

Version-Release number of selected component (if applicable):
Linux version 4.18.0-365.el8.x86_64 (mockbuild.centos.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Thu Feb 10 16:11:23 UTC 2022

How reproducible:
It happens consistently when the system is under load, but I am unable to reproduce the error at will.


Steps to Reproduce:
1. Run the system for a few days and wait. Sorry for the poor description, I haven't been able to reproduce it on demand.
2. Has occurred on multiple systems.
3. Did not occur on CentOS 8; it has only started since moving to CentOS Stream.


Actual results:
System freezes. Console doesn't respond to keyboard. Existing processes may continue to run, but will eventually lock up.

System will respond to ping, but will not accept an ssh session.


Expected results:
System will run normally. 


Additional info:
I am not sure what else to do to try and resolve this error. I'm happy to try any suggestions.


Comment 1 Mark Assad 2022-02-22 03:14:16 UTC
If I leave the box for longer, I start to see the following errors too:

 INFO: task kworker/64:2:125809 blocked for more than 120 seconds.
       Tainted: G        W        --------- -  - 4.18.0-365.el8.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/64:2    state:D stack:    0 pid:125809 ppid:     2 flags:0x80004080
 Workqueue: cgroup_destroy css_free_rwork_fn
 Call Trace:
  __schedule+0x2d1/0x830
  schedule+0x35/0xa0
  schedule_timeout+0x274/0x300
  ? load_balance+0x163/0xc20
  ? recalibrate_cpu_khz+0x10/0x10
  ? ktime_get+0x3e/0xa0
  wait_for_completion+0x96/0x100
  flush_workqueue+0x14d/0x440
  ? __switch_to_asm+0x35/0x70
  cgroup1_pidlist_destroy_all+0x7c/0xa0
  css_free_rwork_fn+0xe3/0x3a0
  process_one_work+0x1a7/0x360
  ? create_worker+0x1a0/0x1a0
  worker_thread+0x30/0x390
  ? create_worker+0x1a0/0x1a0
  kthread+0x10a/0x120
  ? set_kthread_struct+0x40/0x40
  ret_from_fork+0x35/0x40
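
The hung-task report above names the blocked task, its PID, the timeout, and the workqueue it was servicing. A small sketch, again in Python and assuming a saved text log, can summarise such reports. The function name parse_hung_tasks and both patterns are illustrative assumptions derived from the single report shown here.

```python
import re

# "INFO: task kworker/64:2:125809 blocked for more than 120 seconds."
# The task name may itself contain colons, so the greedy group takes
# everything up to the last ":<pid>" before " blocked".
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# "Workqueue: cgroup_destroy css_free_rwork_fn"
WORKQUEUE_RE = re.compile(r"Workqueue:\s+(?P<wq>\S+)\s+(?P<fn>\S+)")


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report, with the first Workqueue line after it."""
    reports = []
    for line in log_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            reports.append(
                {
                    "comm": hung.group("comm"),
                    "pid": int(hung.group("pid")),
                    "blocked_secs": int(hung.group("secs")),
                    "workqueue": None,
                    "work_fn": None,
                }
            )
            continue
        wq = WORKQUEUE_RE.search(line)
        # Attach only the first Workqueue line that follows a report.
        if wq and reports and reports[-1]["workqueue"] is None:
            reports[-1]["workqueue"] = wq.group("wq")
            reports[-1]["work_fn"] = wq.group("fn")
    return reports


# Lines copied from the hung-task report in this comment.
sample = """\
 INFO: task kworker/64:2:125809 blocked for more than 120 seconds.
       Tainted: G        W        --------- -  - 4.18.0-365.el8.x86_64 #1
 task:kworker/64:2    state:D stack:    0 pid:125809 ppid:     2 flags:0x80004080
 Workqueue: cgroup_destroy css_free_rwork_fn
"""
print(parse_hung_tasks(sample))
```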

Comment 6 Phil Auld 2022-04-27 11:31:43 UTC
*** Bug 2079179 has been marked as a duplicate of this bug. ***

Comment 8 Phil Auld 2022-04-28 13:59:31 UTC
*** Bug 2061658 has been marked as a duplicate of this bug. ***

Comment 19 Phil Auld 2022-07-07 16:58:42 UTC
*** Bug 2046454 has been marked as a duplicate of this bug. ***

Comment 21 errata-xmlrpc 2022-11-08 10:21:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7683
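
To check whether a host runs a kernel build at or after the Fixed In Version of this bug (kernel-4.18.0-392.el8), the release string can be compared numerically. The following Python sketch makes two assumptions that should be stated loudly: the host runs a RHEL 8 or CentOS Stream 8 kernel whose release string looks like 4.18.0-365.el8.x86_64, and only the main build number is compared. Z-stream kernels with a lower build number may also carry the fix through a separate advisory, and this check cannot detect that. The function names are hypothetical.

```python
import platform
import re

# Taken from "Fixed In Version: kernel-4.18.0-392.el8" in this bug.
FIXED_BUILD = (4, 18, 0, 392)

# Leading "major.minor.patch-build" of a release string such as
# "4.18.0-365.el8.x86_64"; anything after the build number is ignored.
RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)")


def parse_release(release):
    """Return (major, minor, patch, build) as ints, or None if unparseable."""
    match = RELEASE_RE.match(release)
    return tuple(int(part) for part in match.groups()) if match else None


def at_or_after_fixed_build(release, fixed=FIXED_BUILD):
    """True/False by numeric comparison; None when the string cannot be parsed.

    Assumption: a False result only means the main build number is lower.
    It does not prove the fix is absent (z-stream backports are not covered).
    """
    parsed = parse_release(release)
    if parsed is None:
        return None
    return parsed >= fixed


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r` on Linux
    print(running, at_or_after_fixed_build(running))
```

For the kernel in the original report, at_or_after_fixed_build("4.18.0-365.el8.x86_64") evaluates to False, which matches the affected version.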