Bug 2136769
| Summary: | [Azure][RHEL-7] WALinuxAgent hangs indefinitely | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Klaas Demter <klaas> |
| Component: | WALinuxAgent | Assignee: | Vitaly Kuznetsov <vkuznets> |
| Status: | CLOSED MIGRATED | QA Contact: | Yuxin Sun <yuxisun> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.9 | CC: | narrieta, nnandigam, vkuznets, yacao, yuxisun |
| Target Milestone: | rc | Keywords: | Extras, MigratedToJIRA |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-09-22 15:47:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Klaas Demter
2022-10-21 09:35:17 UTC
Workaround: echo "-1" > /sys/fs/cgroup/cpu,cpuacct/azure.slice/waagent.service/cpu.cfs_quota_us

@vkuznets @yuxisun What is the latest update? Have we identified whether it is due to a kernel bug, and which distros and versions may be affected?

The problem seems to be specific to RHEL7 and the (possible) kernel problem is being investigated in https://bugzilla.redhat.com/show_bug.cgi?id=2135539. WALinuxAgent itself is likely behaving correctly but let's keep this BZ open until the root cause is known.

Hi Vitaly, The upstream WALA (v2.9.0.0) dropped the CGroup support for RHEL (https://github.com/Azure/WALinuxAgent/pull/2685). Shall we first drop it as a workaround?

(In reply to Yuxin Sun from comment #7)
> Hi Vitaly,
>
> The upstream WALA (v2.9.0.0) dropped the CGroup support for
> RHEL (https://github.com/Azure/WALinuxAgent/pull/2685). Shall we first drop
> it as a workaround?

I'm yet to understand the reasoning behind the change. RHEL kernels support cgroups and the problem is only observed with RHEL7. RHEL8 and RHEL9 should not be affected, so why disable cgroup support there? Also, I think https://bugzilla.redhat.com/show_bug.cgi?id=2136769#c3 is enough as a workaround.

@Vitaly Kuznetsov
>
> the problem is only observed with RHEL7. RHEL8 and RHEL9 should not be affected,
>
Has it been confirmed that the issue affects only RHEL 7? For the moment we are disabling it on all versions until we know more details about what conditions lead to this problem.
Not sure if this is helpful, but here is some data we collected from cpu.stat and cpuacct.stat.
Last signs of life: 2022-10-19T18:06:15.364334Z
Resumed via workaround: 2022-10-21T09:25:29.341336Z
Counter TIMESTAMP Value
Throttled Time 2022-10-19 17:00:32.9493940 0
Throttled Time 2022-10-19 17:05:33.1687969 0
Throttled Time 2022-10-19 17:10:33.4096282 0
Throttled Time 2022-10-19 17:15:33.6295604 0
Throttled Time 2022-10-19 17:20:33.8707334 0
Throttled Time 2022-10-19 17:25:34.0135357 0
Throttled Time 2022-10-19 17:30:34.2547307 0
Throttled Time 2022-10-19 17:35:34.4759241 0
Throttled Time 2022-10-19 17:40:34.7170992 0
Throttled Time 2022-10-19 17:45:34.9621925 0
Throttled Time 2022-10-19 17:50:35.1538511 0
Throttled Time 2022-10-19 17:55:35.4004396 0
Throttled Time 2022-10-19 18:00:35.5915517 0
Throttled Time 2022-10-19 18:05:35.8766400 0
Throttled Time 2022-10-21 09:26:29.5076150 141549.699674
% Processor Time 2022-10-19 17:00:32.9493917 0.329
% Processor Time 2022-10-19 17:05:33.1687955 0.319
% Processor Time 2022-10-19 17:10:33.4096266 0.33
% Processor Time 2022-10-19 17:15:33.6295588 0.343
% Processor Time 2022-10-19 17:20:33.8707320 0.333
% Processor Time 2022-10-19 17:25:34.0135335 0.333
% Processor Time 2022-10-19 17:30:34.2547293 0.312
% Processor Time 2022-10-19 17:35:34.4759223 0.36
% Processor Time 2022-10-19 17:40:34.7170978 0.336
% Processor Time 2022-10-19 17:45:34.9621908 0.326
% Processor Time 2022-10-19 17:50:35.1538495 0.326
% Processor Time 2022-10-19 17:55:35.4004379 0.326
% Processor Time 2022-10-19 18:00:35.5915502 0.33
% Processor Time 2022-10-19 18:05:35.8766384 0.319
% Processor Time 2022-10-21 09:26:29.5076138 0.001
(The above data is from one of Klaas' machines)

(In reply to norberto from comment #9)
> @Vitaly Kuznetsov
>
> > the problem is only observed with RHEL7. RHEL8 and RHEL9 should not be affected,
>
> Has it been confirmed that the issue affects only RHEL 7? For the moment we
> are disabling it on all versions until we know more details about what
> conditions lead to this problem.

The problem was reported with RHEL7, where the kernel is 3.10 based. RHEL8 is 4.18 based, RHEL9 is 5.14. If the issue were observed not only with RHEL7, it would mean that it should also occur with other Linux distros and thus cgroup support in WALinuxAgent should be disabled globally. I still think that is too big of a hammer here.

@Vitaly Kuznetsov Thanks for the reply. Could you give us some info about what the issue is, and/or what conditions may lead to it?

(In reply to norberto from comment #12)
> @Vitaly Kuznetsov Thanks for the reply. Could you give us some info about
> what the issue is, and/or what conditions may lead to it?

A (possible) kernel problem in RHEL7 is being investigated but there's no conclusion yet. I've Cc:ed you on BZ2135539 to follow.

Thanks!

@vkuznets Can you add me as well on BZ2135539 to follow? Thanks

(In reply to Nageswara Nandigam from comment #15)
> @vkuznets Can you add me as well on BZ2135539 to follow? Thanks

Done!

The WALinuxAgent hangs are fixed in the 2.9.0.4 version of WALinuxAgent; they stopped using cgroups to limit CPU usage. The kernel issue is still unsolved.

@vkuznets We didn't get to the root cause yet, but we want to revisit and enable it in WALinuxAgent safely. Is it safe to use cgroups for RedHat/CentOS 8+ versions?

(In reply to Nageswara Nandigam from comment #19)
> @vkuznets We didn't get to the root cause yet, but we want to revisit
> and enable it in WALinuxAgent safely. Is it safe to use cgroups for
> RedHat/CentOS 8+ versions?

Yes, cgroups are heavily used in RHEL8/9 and the kernel issue was never reported.
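For anyone who wants to check whether a machine is in the throttled state discussed in this thread, here is a minimal Python sketch. It assumes the cgroup v1 layout and the path from the workaround above. The function names (`parse_cpu_stat`, `read_throttling`, `lift_quota`) are made up for illustration and are not part of WALinuxAgent. It reads `cpu.cfs_quota_us` and the throttling counters from `cpu.stat`, and can apply the same `-1` write as the `echo` workaround.

```python
from pathlib import Path

# Path taken from the workaround in this report (cgroup v1 layout on RHEL 7).
WAAGENT_CGROUP = Path("/sys/fs/cgroup/cpu,cpuacct/azure.slice/waagent.service")


def parse_cpu_stat(text):
    """Parse cgroup v1 cpu.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_throttling(cgroup_dir=WAAGENT_CGROUP):
    """Return the CFS quota (microseconds, -1 means unlimited) and throttling counters."""
    cgroup_dir = Path(cgroup_dir)
    quota_us = int((cgroup_dir / "cpu.cfs_quota_us").read_text().strip())
    stats = parse_cpu_stat((cgroup_dir / "cpu.stat").read_text())
    return {
        "quota_us": quota_us,
        "nr_throttled": stats.get("nr_throttled", 0),
        "throttled_time_ns": stats.get("throttled_time", 0),
    }


def lift_quota(cgroup_dir=WAAGENT_CGROUP):
    """Apply the workaround: write -1 to cpu.cfs_quota_us (needs root on a real system)."""
    (Path(cgroup_dir) / "cpu.cfs_quota_us").write_text("-1\n")
```

Calling `read_throttling()` as root on an affected VM should show whether a quota is set and whether the throttling counters keep growing; `lift_quota()` is equivalent to the `echo "-1"` command and, like it, does not persist across a service restart or reboot.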
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug. This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information. To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567 In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information. |