Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2136769

Summary: [Azure][RHEL-7] WALinuxAgent hangs indefinitely
Product: Red Hat Enterprise Linux 7
Reporter: Klaas Demter <klaas>
Component: WALinuxAgent
Assignee: Vitaly Kuznetsov <vkuznets>
Status: CLOSED MIGRATED
QA Contact: Yuxin Sun <yuxisun>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.9
CC: narrieta, nnandigam, vkuznets, yacao, yuxisun
Target Milestone: rc
Keywords: Extras, MigratedToJIRA
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-09-22 15:47:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Klaas Demter 2022-10-21 09:35:17 UTC
Description of problem:
On RHEL 7.9 in Azure, WALinuxAgent sometimes appears to be throttled via cgroups to the point that it never runs.


Version-Release number of selected component (if applicable):
RHEL 7.9
Kernel 3.10.0-1160.76.1.el7.x86_64
WALinuxAgent-2.3.0.2-4.el7_9.noarch (updated at runtime to 2.8.0.11 by MS)


How reproducible:
I am not sure how this happens. I cannot reproduce it at will.


Steps to Reproduce:
1. Have a RHEL 7.9 system in Azure with WALinuxAgent running

Actual results:
Hang

Expected results:
No Hang

Additional info:
This seems to be related to private BZ 2135539

Comment 3 Klaas Demter 2022-10-21 09:38:25 UTC
Workaround:
echo "-1" > /sys/fs/cgroup/cpu,cpuacct/azure.slice/waagent.service/cpu.cfs_quota_us
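To confirm that the agent is actually being throttled before lifting the quota, something like the following Python sketch may help. It assumes the cgroup v1 layout from the path above and that cpu.stat exposes the usual nr_periods, nr_throttled and throttled_time counters. The function names are illustrative only and are not part of WALinuxAgent.

```python
from pathlib import Path

# Path taken from the workaround above (cgroup v1 layout on RHEL 7).
CGROUP_DIR = Path("/sys/fs/cgroup/cpu,cpuacct/azure.slice/waagent.service")


def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup v1 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def inspect_and_lift_quota(cgroup_dir=CGROUP_DIR, apply=False):
    """Return (quota, stats) as read before any change.

    With apply=True, write -1 to cpu.cfs_quota_us, which removes the CPU quota
    (the same effect as the echo workaround; requires root).
    """
    cgroup_dir = Path(cgroup_dir)
    quota_file = cgroup_dir / "cpu.cfs_quota_us"
    quota = int(quota_file.read_text().strip())
    stats = parse_cpu_stat((cgroup_dir / "cpu.stat").read_text())
    if apply and quota != -1:
        quota_file.write_text("-1\n")
    return quota, stats
```

Calling inspect_and_lift_quota() without arguments only reads the counters. Passing apply=True as root applies the workaround. As with the echo command, the change does not necessarily persist if the agent or systemd rewrites the quota later.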

Comment 4 Klaas Demter 2022-10-21 09:39:58 UTC
https://github.com/Azure/WALinuxAgent/issues/2674

Comment 5 Nageswara Nandigam 2022-10-24 17:23:53 UTC
@vkuznets @yuxisun What is the latest update? Have we identified whether this is due to a kernel bug, and which distros and versions may be affected?

Comment 6 Vitaly Kuznetsov 2022-10-25 12:26:44 UTC
The problem seems to be specific to RHEL7 and the (possible) kernel problem is being investigated
in https://bugzilla.redhat.com/show_bug.cgi?id=2135539. WALinuxAgent itself is likely behaving
correctly but let's keep this BZ open until the root cause is known.

Comment 7 Yuxin Sun 2022-10-26 01:49:42 UTC
Hi Vitaly,

The upstream WALA (v2.9.0.0) dropped the CGroup support for RHEL (https://github.com/Azure/WALinuxAgent/pull/2685). Shall we first drop it as a workaround?

Comment 8 Vitaly Kuznetsov 2022-10-26 08:52:43 UTC
(In reply to Yuxin Sun from comment #7)
> Hi Vitaly,
> 
> The upstream WALA (v2.9.0.0) dropped the CGroup support for
> RHEL (https://github.com/Azure/WALinuxAgent/pull/2685). Shall we first drop
> it as a workaround?

I have yet to understand the reasoning behind the change. RHEL kernels support cgroups
and the problem is only observed with RHEL7. RHEL8 and RHEL9 should not be affected,
so why disable cgroup support there? Also, I think https://bugzilla.redhat.com/show_bug.cgi?id=2136769#c3
is enough as a workaround.

Comment 9 norberto 2022-10-27 19:06:56 UTC
@Vitaly Kuznetsov 

> 
> the problem is only observed with RHEL7. RHEL8 and RHEL9 should not be affected,
>

Has it been confirmed that the issue affects only RHEL 7? For the moment we are disabling it on all versions until we know more details about what conditions lead to this problem.

Not sure if this is helpful, but here is some data we collected from cpu.stat and cpuacct.stat.

Last signs of life:        2022-10-19T18:06:15.364334Z
Resumed via workaround:    2022-10-21T09:25:29.341336Z


Counter	TIMESTAMP	Value
Throttled Time	2022-10-19 17:00:32.9493940	0
Throttled Time	2022-10-19 17:05:33.1687969	0
Throttled Time	2022-10-19 17:10:33.4096282	0
Throttled Time	2022-10-19 17:15:33.6295604	0
Throttled Time	2022-10-19 17:20:33.8707334	0
Throttled Time	2022-10-19 17:25:34.0135357	0
Throttled Time	2022-10-19 17:30:34.2547307	0
Throttled Time	2022-10-19 17:35:34.4759241	0
Throttled Time	2022-10-19 17:40:34.7170992	0
Throttled Time	2022-10-19 17:45:34.9621925	0
Throttled Time	2022-10-19 17:50:35.1538511	0
Throttled Time	2022-10-19 17:55:35.4004396	0
Throttled Time	2022-10-19 18:00:35.5915517	0
Throttled Time	2022-10-19 18:05:35.8766400	0
Throttled Time	2022-10-21 09:26:29.5076150	141549.699674
% Processor Time	2022-10-19 17:00:32.9493917	0.329
% Processor Time	2022-10-19 17:05:33.1687955	0.319
% Processor Time	2022-10-19 17:10:33.4096266	0.33
% Processor Time	2022-10-19 17:15:33.6295588	0.343
% Processor Time	2022-10-19 17:20:33.8707320	0.333
% Processor Time	2022-10-19 17:25:34.0135335	0.333
% Processor Time	2022-10-19 17:30:34.2547293	0.312
% Processor Time	2022-10-19 17:35:34.4759223	0.36
% Processor Time	2022-10-19 17:40:34.7170978	0.336
% Processor Time	2022-10-19 17:45:34.9621908	0.326
% Processor Time	2022-10-19 17:50:35.1538495	0.326
% Processor Time	2022-10-19 17:55:35.4004379	0.326
% Processor Time	2022-10-19 18:00:35.5915502	0.33
% Processor Time	2022-10-19 18:05:35.8766384	0.319
% Processor Time	2022-10-21 09:26:29.5076138	0.001
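As a rough cross-check of the numbers above, the final Throttled Time sample can be compared with the gap between the two timestamps. The Python sketch below rests on an assumption that the data does not state: that Throttled Time is reported in seconds and that the last sample covers the whole gap.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

# Timestamps copied from the data above.
last_alive = datetime.strptime("2022-10-19T18:06:15.364334Z", FMT)
resumed = datetime.strptime("2022-10-21T09:25:29.341336Z", FMT)

# Wall-clock time during which the agent showed no sign of life.
gap_seconds = (resumed - last_alive).total_seconds()

# Final Throttled Time sample; ASSUMPTION: the unit is seconds.
throttled_seconds = 141549.699674

ratio = throttled_seconds / gap_seconds
print(f"gap: {gap_seconds:.3f} s, throttled: {throttled_seconds:.3f} s, ratio: {ratio:.5f}")
```

Under that assumption the throttled time accounts for almost the entire gap (a ratio very close to 1). That would fit the agent being throttled for the whole period rather than hanging for some other reason.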

Comment 10 norberto 2022-10-27 19:08:28 UTC
(The above data is from one of Klaas' machines)

Comment 11 Vitaly Kuznetsov 2022-10-31 09:47:59 UTC
(In reply to norberto from comment #9)
> @Vitaly Kuznetsov 
> 
> > 
> > the problem is only observed with RHEL7. RHEL8 and RHEL9 should not be affected,
> >
> 
> Has it been confirmed that the issue affects only RHEL 7? For the moment we
> are disabling it on all versions until we know more details about what
> conditions lead to this problem.

The problem was reported with RHEL7, where the kernel is based on 3.10. RHEL8
is 4.18 based, RHEL9 is 5.14. If the issue were observed not only with RHEL7,
it would mean that it should also occur with other Linux distros and thus cgroup
support in WALinuxAgent should be disabled globally. I still think that is too big
a hammer here.

Comment 12 norberto 2022-10-31 14:10:42 UTC
@Vitaly Kuznetsov Thanks for the reply. Could you give us some info about what the issue is, and/or what conditions may lead to it?

Comment 13 Vitaly Kuznetsov 2022-10-31 14:24:09 UTC
(In reply to norberto from comment #12)
> @Vitaly Kuznetsov Thanks for the reply. Could you give us some info about
> what the issue is, and/or what conditions may lead to it?

A (possible) kernel problem in RHEL7 is being investigated but there's no conclusion yet.
I've Cc:ed you on BZ2135539 to follow.

Comment 14 norberto 2022-10-31 14:27:12 UTC
Thanks!

Comment 15 Nageswara Nandigam 2022-11-01 18:45:19 UTC
@vkuznets Can you add me to BZ2135539 as well so I can follow it? Thanks.

Comment 16 Vitaly Kuznetsov 2022-11-02 10:09:02 UTC
(In reply to Nageswara Nandigam from comment #15)
> @vkuznets Can you add me to BZ2135539 as well so I can follow it? Thanks.

Done!

Comment 18 Klaas Demter 2023-02-21 08:51:41 UTC
The WALinuxAgent hangs are fixed in version 2.9.0.4 of WALinuxAgent, which stopped using cgroups to limit CPU usage. The kernel issue is still unsolved.

Comment 19 Nageswara Nandigam 2023-08-28 21:39:27 UTC
@vkuznets We have not found the root cause yet, but we want to revisit this and re-enable cgroups in WALinuxAgent safely. Is it safe to use cgroups on Red Hat/CentOS 8 and later?

Comment 20 Vitaly Kuznetsov 2023-09-01 14:57:28 UTC
(In reply to Nageswara Nandigam from comment #19)
> @vkuznets We have not found the root cause yet, but we want to revisit
> this and re-enable cgroups in WALinuxAgent safely. Is it safe to use cgroups
> on Red Hat/CentOS 8 and later?

Yes, cgroups are heavily used in RHEL8/9 and the kernel issue has never been reported there.

Comment 22 RHEL Program Management 2023-09-22 15:47:09 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 23 RHEL Program Management 2023-09-22 15:47:39 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.