Bug 1746415 - The system.slice and kubepod.slice cpusets can get out of sync
Summary: The system.slice and kubepod.slice cpusets can get out of sync
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.8
Hardware: ppc64le
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Manoj Kumar
QA Contact: Douglas Slavens
URL:
Whiteboard: multi-arch
Depends On:
Blocks: 1619379 1824893
 
Reported: 2019-08-28 12:21 UTC by Zvonko Kosic
Modified: 2022-09-12 22:14 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-12 22:14:55 UTC
Target Upstream Version:
Embargoed:




Links:
IBM Linux Technology Center 182701 (Last Updated: 2019-12-03 22:42:36 UTC)

Description Zvonko Kosic 2019-08-28 12:21:01 UTC
Description of problem:

The system.slice and kubepod.slice cpusets can get out of sync with the actual memory configuration as described in cpuset.mems.
 
To illustrate: when the NVIDIA driver is unloaded or the persistence daemon is stopped, the GPU memory is offlined.
normal:
# cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8,252-255         <-- The GPU memory enumerates as numa nodes 252-255
 
nv unloaded; memory offlined:
# cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8
 
restarting the persistence daemon will bring the GPU memory back online:
# cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8,252-255 
 
In some cases, kubepods.slice is not updated properly to reflect the re-onlined GPU memory
# cat /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems
0,8,253-255
 
The behavior is similar but less frequent for /sys/fs/cgroup/cpuset/system.slice/cpuset.mems, which can affect usage of bare Docker.
 
The workaround is to shut down the kube daemon, reset the affected cpuset.mems manually, and then restart the kube daemon.
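
For reference, a minimal sketch of that workaround (the unit name and slice path are assumptions; adapt them to the node in question, and note that child slices may need the same treatment):

# stop the node service so the kubelet is not running while the cpuset is edited (unit name assumed)
systemctl stop atomic-openshift-node
# copy the root cpuset's memory nodes into the stale slice
cat /sys/fs/cgroup/cpuset/cpuset.mems > /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems
# restart the node service so new pod cgroups inherit the corrected value
systemctl start atomic-openshift-node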


Version-Release number of selected component (if applicable):


How reproducible:

Every time.


Steps to Reproduce:
1. Install OpenShift 3.11 on Power
2. Follow this document to enable GPUs: https://docs.google.com/document/d/1tCWqutJeUzjQeQd9mcJjFMVYiSUAvzXDPjunJ7-UBio/edit
3. Try to run a CUDA workload on each of the GPUs

Actual results:

CUDA Device Init: Error Code -3

Expected results:

CUDA examples (vectorAdd, etc.) run without error.


Additional info:

Comment 1 Doug Lehr 2019-09-27 14:31:58 UTC
Andy,
  I helped root-cause this with Zvonko; if there's anything you need from my end, please let me know!  This issue is becoming more problematic as time goes on, for some reason.

Essentially, since GPU memory is treated as an extension of system memory on POWER9 AC922 machines (ppc64le architecture), the GPU memory nodes are added to the cpuset.mems cpuset attribute.  The issue is that GPUs only come online when someone tries to use them or when the `nvidia-persistenced` daemon is started, neither of which is guaranteed to occur before Kubernetes, OpenShift, or Docker create their cgroup slices.

I'm not sure if this is the correct place for this bug to finally end up, but either way we need to try to work towards getting this resolved.  Let me know if there's anything I can do to help!

Comment 2 Manoj Kumar 2019-09-28 14:32:35 UTC
Doug: From your comments it sounds like this is an underlying issue with cgroups. If so, should this be re-assigned to the kernel team?

Comment 3 Andy McCrae 2019-09-30 08:09:48 UTC
Hi Doug,

Thanks for the update! I'm waiting for some hardware to recreate the issue on ppc64le, but I did take an initial look at the parts of Kube that handle cpuset.mems, and there doesn't seem to be anything arch-specific there, which lines up with what you're saying.

I also followed up with Zvonko to confirm that he hasn't seen this impact x86_64, which he confirmed.

Andy

Comment 4 Dennis Gilmore 2019-10-04 14:23:56 UTC
Doug: what provides the unit files for the `nvidia-persistenced` daemon?  I suspect that the needed change here is for it to tell systemd that it has to be started before Kubernetes, OpenShift, and Docker.
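
For example, a drop-in along these lines could express that ordering (a sketch only; it assumes the daemon ships a unit named nvidia-persistenced.service and that the runtime is docker.service, and ordering alone does not wait for the GPU memory to finish onlining):

# /etc/systemd/system/docker.service.d/10-after-nvidia-persistenced.conf  (path assumed)
[Unit]
Wants=nvidia-persistenced.service
After=nvidia-persistenced.service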

Comment 5 Andy McCrae 2019-10-07 18:22:50 UTC
Hi Doug, I'm seeing some slightly different behaviour when attempting to recreate the bug.

If I start up the host with nvidia-persistenced disabled and OpenShift starts up, then when I enable nvidia-persistenced the kubepods.slice cpuset.mems isn't updated automatically (which I think is in line with what is mentioned above).

That said, I believe the udev rules I have set up may be incorrect, since stopping nvidia-persistenced doesn't offline the memory (that step actually fails; journalctl output follows):

-- Unit nvidia-persistenced.service has begun shutting down.
Oct 07 14:09:02 hostname nvidia-persistenced[81292]: Socket closed.
Oct 07 14:09:02 hostname nvidia-persistenced[81292]: device 0004:04:00.0 - persistence mode disabled.
Oct 07 14:09:02 hostname nvidia-persistenced[81292]: NUMA: Failed ioctl call to set device NUMA status: Device or resourc
Oct 07 14:09:02 hostname nvidia-persistenced[81292]: NUMA: Failed to set NUMA status to offline_in_progress
Oct 07 14:09:02 hostname nvidia-persistenced[81292]: device 0004:04:00.0 - NUMA: Failed to offline memory
Oct 07 14:09:02 hostname nvidia-persistenced[81292]: device 0004:04:00.0 - failed to offline memory.

Initially I saw similar issues when setting the memory online, but that was fixed by adjusting the udev rules:

sed -i 's/SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"/SUBSYSTEM=="*", GOTO="memory_hotplug_end"/' /etc/udev/rules.d/40-redhat.rules

That means the memory hotadd request section looks like this:

# Memory hotadd request
SUBSYSTEM=="*", GOTO="memory_hotplug_end"
PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"

ENV{.state}="online"
PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
ATTR{state}=="offline", ATTR{state}="$env{.state}"

LABEL="memory_hotplug_end"

Does that look correct to you? I'd expect to see nvidia-persistenced offline/online the memory appropriately when it is stopped/started - which is more in line with what Zvonko has above.

I want to make sure I have everything configured correctly, so that I can recreate this in the same way.
Andy

Comment 6 Zvonko Kosic 2019-10-07 19:01:02 UTC
Another thing we have seen is that on the Power systems I am using, only one of the four GPU memory nodes was missing (not online):

0-8,253-255 

whereas on Andy's systems, none of the GPU memory was showing online in the kubelet.slice. 

0-8 


When we edited the top-level slice, the kubelet did not propagate the settings to the child slices; we had to edit them by hand (besteffort, burstable), and only then were the cpuset.mems values propagated to the Pods, after a restart of the OpenShift node.

Comment 7 Doug Lehr 2019-10-07 19:14:52 UTC
Right...We feel Zvonko hit the race condition that's not 100% recreatable.  In this case, NVIDIA hadn't fully brought online all 4 GPUs, but it was in the process of doing so.

In Andy's case we're forcing the GPUs to not be online until we're ready, which essentially causes the same root problem.

Zvonko's is the case we'd expect most clients to hit, especially with the 64GB GPUs, as those obviously take longer to online.  Does this help?

I'll talk with Kris M. to make sure the udev rules are 100% correct.

Comment 12 Frank Novak 2019-12-03 20:20:49 UTC
What's the OS level on this?  Latest RHEL 7.6-alt zStream?

Comment 13 Frank Novak 2019-12-03 20:22:22 UTC
(In reply to Frank Novak from comment #12)
> What's the OS level on this?  Latest RHEL 7.6-alt zStream?

And what is the FW level on the systems, the Nvidia driver level?

It would also be good to get the typical sosreport-level info,
at least the key logs, dmesg, etc.

Comment 14 Doug Lehr 2019-12-03 23:15:31 UTC
Hey Frank,
  Yup, it's:
RHEL 7.6-alt
NVIDIA driver 418.87

FW info
 Product Name          : OpenPOWER Firmware
 Product Version       : witherspoon-OP9-v2.2-3.2
 Product Extra         : 	occ-58e422d
 Product Extra         : 	skiboot-v6.2-190-gc470806a2e5e
 Product Extra         : 	buildroot-2018.11.3-12-g222837a
 Product Extra         : 	capp-ucode-p9-dd2-v4
 Product Extra         : 	petitboot-v1.10.2
 Product Extra         : 	sbe-1410677
 Product Extra         : 	hostboot-binaries-hw021419a.930
 Product Extra         : 	bmc-firmware-version-2.03
 Product Extra         : 	hcode-hw031619a.940
 Product Extra         : 	machine-xml-e3e9aef
 Product Extra         : 	hostboot-3653c5d-p9f41e02
 Product Extra         : 	linux-4.19.26-openpower1-p2974ab8


Let me know if you need more. I don't think there are any sosreports; no errors of any kind come out during startup.  This is literally just a race condition between when Kubernetes, Docker, etc. create their cgroup slices and when GPU memory becomes available.  The .slice files that are created are entirely valid and correct for the time they were created.


We can recreate this pretty easily, and it's mostly unrelated to driver/FW levels.  It's specifically a POWER9-only problem.

Comment 15 Frank Novak 2019-12-04 03:19:48 UTC
(In reply to Doug Lehr from comment #14)
> Hey Frank,
>   Yup it's 
> Rhel 7.6-alt
> nvidia driver 418.87
> 
> FW info
>  Product Name          : OpenPOWER Firmware
>  Product Version       : witherspoon-OP9-v2.2-3.2
>  Product Extra         : 	occ-58e422d
>  Product Extra         : 	skiboot-v6.2-190-gc470806a2e5e
>  Product Extra         : 	buildroot-2018.11.3-12-g222837a
>  Product Extra         : 	capp-ucode-p9-dd2-v4
>  Product Extra         : 	petitboot-v1.10.2
>  Product Extra         : 	sbe-1410677
>  Product Extra         : 	hostboot-binaries-hw021419a.930
>  Product Extra         : 	bmc-firmware-version-2.03
>  Product Extra         : 	hcode-hw031619a.940
>  Product Extra         : 	machine-xml-e3e9aef
>  Product Extra         : 	hostboot-3653c5d-p9f41e02
>  Product Extra         : 	linux-4.19.26-openpower1-p2974ab8
> 
> 
> Let me know if you need more. I don't think there are any sosreports; no
> errors of any kind come out during startup.  This is literally just a race
> condition between when Kubernetes, Docker, etc. create their cgroup slices
> and when GPU memory becomes available.  The .slice files that are created
> are entirely valid and correct for the time they were created.
> 
> 
> We can recreate this pretty easily, and it's mostly unrelated to driver/FW
> levels.  It's specifically a POWER9-only problem.

What level of RHEL-alt 7.6 zStream?

Maybe not addressed yet, but I'd say it would be good to make sure existing issues aren't a factor.

We would really like to see the logs, dmesg for sure, though some of it may already be in the various logs captured.

There's also no description of a reproducer.

Again, maybe for some it's obvious, but there's so little here...
and maybe there are also other interactions and discussions internally, and someone else has a repro they are working on?

Comment 16 Doug Lehr 2019-12-04 17:09:52 UTC
Frank...here's the writeup describing the problem.

Background
AC922 Hardware configuration

Before we jump into how cpusets affect running NVIDIA GPUs in a container, we need to understand what IBM and NVIDIA did with their joint POWER9/NVLink 2.0 venture. The POWER9 servers, specifically AC922s, come with two physical POWER9 CPUs and up to four NVIDIA V100 GPUs. Section 2.1 of the IBM AC922 Redbook describes the hardware layout. In short, AC922s use NVLink 2.0 to connect the GPUs directly to the CPUs, instead of the traditional PCIe bus. This allows for higher bandwidth, lower latency, and the most important part of this whole discussion: coherent access to GPU memory.

It’s because of this coherency that we experience the uniqueness of this problem. To allow the GPU memory to be accessible by applications running on the CPU, the decision was made to online the GPU memory as NUMA nodes.

A sample numactl --hardware command from an AC922 illustrates this setup:


numactl --hardware
available: 6 nodes (0,8,252-255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 257742 MB
node 0 free: 48358 MB
node 8 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 8 size: 261735 MB
node 8 free: 186807 MB
node 252 cpus:
node 252 size: 16128 MB
node 252 free: 16115 MB
node 253 cpus:
node 253 size: 16128 MB
node 253 free: 16117 MB
node 254 cpus:
node 254 size: 16128 MB
node 254 free: 16117 MB
node 255 cpus:
node 255 size: 16128 MB
node 255 free: 16117 MB
 
Note: To avoid any potential collisions between CPU node and GPU node numbering, the GPU nodes start at 255 and count backwards, while CPU nodes start at 0. On an AC922, we have two CPU sockets (0, 8) with 80 threads and 256GB of memory each, and four GPUs (252-255) with 16GB of memory each. (GPU threads aren’t listed here.)

CPUSETS
Now that you understand the hardware makeup of an AC922, let’s dive into a little background on cpusets (https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt). Cpusets are a mechanism that allows CPU and memory nodes to be assigned to tasks, services, virtual machines, or containers, which lets the kernel limit what resources can be seen. There are many aspects to cpusets and you can spend hours reading about all of them. In our case, we’re mostly interested in the cpuset.mems file under sysfs. cpuset.mems lists which memory nodes are available at a given time. The default values are kept in /sys/fs/cgroup/cpuset/cpuset.mems, with various subdirectories keeping their own copy of cpuset.mems.

The GPU nodes, however, don’t come online by default. The systemd service nvidia-persistenced onlines the GPU memory, and the cpusets then get updated.

For example:


nvidia-persistenced service up:
systemctl start nvidia-persistenced
cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8,252-255

nvidia-persistenced service down:
systemctl stop nvidia-persistenced
cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8

Slices
One final piece of background before we get to the crux of the issue is the concept of a slice unit. In systemd’s terms, “A slice unit is a concept for hierarchically managing resources of a group of processes.”

In this case, there are three “slices” that we need to be concerned about. With RHEL 7.6, using Red Hat’s version of Docker, or Podman, the slice in question is “system.slice”, normally located at /sys/fs/cgroup/cpuset/system.slice.

Kubernetes and OpenShift use “kubepods.slice”, which is located at /sys/fs/cgroup/cpuset/kubepods.slice.

Finally, later docker-ce versions appear to use the “docker” slice, which is at /sys/fs/cgroup/cpuset/docker. I’m not sure why the “.slice” was dropped from the name, but that’s neither here nor there.

Within these slices, a subslice is created each time a container gets spun up, passing along the necessary cpuset information. Each slice and subslice contains various details, including the cpuset.mems file that contains our memory nodes.
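
To make the hierarchy concrete, on a Kubernetes/OpenShift node the per-slice copies of cpuset.mems might be laid out roughly like this (a sketch; the QoS subslice names follow the usual kubelet besteffort/burstable naming mentioned elsewhere in this bug and will vary per node):

find /sys/fs/cgroup/cpuset/kubepods.slice -name cpuset.mems
/sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems
/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/cpuset.mems
/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/cpuset.mems
/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podXXXX.slice/cpuset.mems

Each of these is its own copy, which is why a stale value in the parent does not simply correct itself once the GPU memory comes online later.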

So, what happened?
We talked about AC922 CPU memory being coherently attached to GPU memory. For that to work, GPU memory needs to stay online at all times. Normally, when a device is no longer in use, the kernel will tear down the kernel module and the devices in question. To keep the GPUs online, a systemd service was created, aptly named nvidia-persistenced. With this service, we can guarantee that the GPU memory stays online regardless of whether the GPUs are actively in use. The problem? This service is started by systemd, the same as Docker and Kubernetes. Unless Docker or Kubernetes explicitly waits for the nvidia-persistenced service to start up and finish onlining GPU memory, which could take up to 5 minutes past startup, they will take what’s available in the master cpuset and use it as the base system configuration.

When a process grabs the cgroup too early, the cpuset.mems will reflect an incomplete list of memory resources. For example, “0,8,253-255”, which tells us there are two CPU nodes and only three GPU nodes. If a system actually had just three GPU nodes, this would be a valid description, but odds are the system has four GPUs and the value should have been “0,8,252-255” to signify that all four GPUs are present.

Once a containerization platform has an incomplete list of the GPU memory nodes, the problem stays masked until CUDA tries to initialize memory against the missing node. Upon starting a container, the NVIDIA driver and devices will be passed through, depending on what rules you have set up, regardless of what memory nodes are specified in the cgroup. This means that, although your cpuset.mems says you have 253-255 (nvidia0-nvidia2) and 252 (nvidia3) is missing, the NVIDIA container plugins or hooks can still pass nvidia3 into a container, because by the time the container was started, all four GPUs were online. We now have a case where GPU devices that, as far as the cgroup is concerned, don’t exist are being passed into the container.

Why doesn’t this fail all the time?
Once a machine is in this incorrect state, GPU devices and drivers can be added to a container, and even driver-based commands such as nvidia-smi will provide the correct output. This is because none of those commands try to allocate memory on the GPU. I’m sure someone will speak up and tell me I’m wrong, and that driver commands do in fact allocate “some” memory on a GPU; they’re probably right, but those commands aren’t using the cgroup values to do so, and odds are the request is being sent to the host and executed by the driver itself.

When code in a container tries to allocate CUDA memory against a device that doesn’t have a corresponding value in the cpuset.mems file, errors start to occur. Normally it shows up as a CUDA initialization error (error code 3 rather than cudaSuccess, which is 0), but other flavors can show up depending on how the memory is being allocated. A lot of code, such as deviceQuery from the CUDA sample package, will try to touch all devices available to it, and when CUDA tries to allocate memory against the missing device, things start to go wrong. Normally, if you knew which device wasn’t in the cpuset.mems file, you could set CUDA_VISIBLE_DEVICES to cordon off that device, and the rest of the code should work. However, this isn’t a viable long-term solution, as it effectively makes a GPU unusable in a containerized environment.
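
For example, if nvidia3 is the device whose memory node is missing, something like this (the device ordinals are illustrative) hides it from the CUDA runtime so the remaining GPUs can still be exercised:

CUDA_VISIBLE_DEVICES=0,1,2 ./deviceQuery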

The Solution?
A bug has been created to track this problem: https://bugzilla.redhat.com/show_bug.cgi?id=1746415. While it’s being worked on, there are some workarounds, most of which involve correcting the problematic cpuset slices. I’ve written a script ( https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh ) that checks the slices used by the common containerization platforms (Docker, Kubernetes, and OpenShift). If it detects a mismatch between a slice folder’s cpuset.mems and the master cpuset.mems, it notifies the user. If desired, the script will also correct the problem by removing the slice folders altogether. This is necessary because the slice folders aren’t deleted when the respective services are shut down or restarted, so bouncing Kubernetes, for example, will keep the same kubepods.slice as before, and you’ll still have the problem.

If we remove the slice folders altogether prior to (re)starting the respective service, the service will regenerate the cgroup slice from the master version, allowing the correct values to be picked up and applied.
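
A minimal sketch of that check (not the linked script itself; the slice paths are the ones discussed above, and the removal step is left commented out because the owning service must be stopped first):

#!/bin/bash
# compare each containerization slice's cpuset.mems against the master copy
master=$(cat /sys/fs/cgroup/cpuset/cpuset.mems)
for slice in /sys/fs/cgroup/cpuset/system.slice \
             /sys/fs/cgroup/cpuset/kubepods.slice \
             /sys/fs/cgroup/cpuset/docker; do
    [ -d "$slice" ] || continue
    current=$(cat "$slice/cpuset.mems")
    if [ "$current" != "$master" ]; then
        echo "MISMATCH: $slice has '$current', master has '$master'"
        # with the owning service stopped, removing the cgroup directories lets the
        # service regenerate the slice from the master cpuset on its next start:
        #   find "$slice" -depth -type d -exec rmdir {} \;
    fi
done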

I have dabbled a bit in trying to edit the cpuset.mems by hand for certain slice groups, and with the right permissions you should be able to do this. However, I don’t recommend it as you’ll end up with containers that may have differing copies of the cpuset.mems within a single orchestration, leading to some pretty unpredictable results. The best scenario I can think of at the moment is to bring down the service, run the script to remove the existing incorrect values, and let the service come up naturally.

One last caveat to mention: cgroups and cpusets all live in in-memory filesystems under /sys/fs/cgroup, which means they are regenerated after each reboot. So any time a system is restarted, there’s a risk that this issue could happen again. One workaround that has been explored is delaying the startup of Docker and/or Kubernetes, OpenShift, etc. until the NVIDIA GPUs have had time to come online. This may not be ideal, but it is still a better alternative than having to shut down the service mid-production to address this problem.
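
As one possible shape for that delay (a sketch; the node list, timeout, and unit name are assumptions for a four-GPU AC922), the container runtime's unit could be given a pre-start wait for the GPU memory nodes:

# /etc/systemd/system/docker.service.d/20-wait-for-gpu-mems.conf  (path assumed)
[Service]
# wait up to ~5 minutes for nodes 252-255 to appear in the root cpuset
ExecStartPre=/bin/sh -c 'for i in `seq 60`; do grep -q 252-255 /sys/fs/cgroup/cpuset/cpuset.mems && break; sleep 5; done'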

In summary, this is an issue that’s unique to a specific server, due to its ability to have coherently attached GPU memory. In creating this feature, an exposure in cgroups was discovered where memory nodes can be added after startup and are not passed along to existing slices.

Comment 17 IBM Bug Proxy 2019-12-04 18:50:24 UTC
------- Comment From fnovak.com 2019-12-04 13:49 EDT-------
Doug, thanks for the detailed explanation.

Comment 18 IBM Bug Proxy 2019-12-04 22:10:20 UTC
------- Comment From fnovak.com 2019-12-04 17:01 EDT-------
Dennis Gilmore (RH) had asked me to sync this BZ from the RH side to IBM...

I'm not sure whether this was simply for broader awareness, discussion, or whether RH is looking for some help here..

I gather RH has been working this for a while, but I am unclear on where it stands, etc.

Comment 19 Zvonko Kosic 2019-12-06 13:09:34 UTC
There was a bug for the kernel where "cpuset cgroup: when a CPU goes offline, it is removed from all cgroup's cpuset.cpus, but when it comes online, it is only restored to the root cpuset.cpus"

https://bugzilla.kernel.org/show_bug.cgi?id=42789

There is a patch for the CPU case, but was this fix also applied to cpuset.mems? When memory comes online, are all dependent cpuset.mems repopulated?

Comment 20 Douglas Slavens 2020-10-08 21:17:13 UTC
Reassigning to Manoj and Archana. Discussed with Jeremy - Andy doesn't have the hardware to reproduce and Manoj is managing the nvidia gpu story for OCP 4.6 on power.

Comment 21 Douglas Slavens 2020-10-28 20:54:22 UTC
turning off the blocker flag since this is not a blocker for 3.11.

Comment 22 Dan Li 2020-12-02 15:22:49 UTC
Adding "UpcomingSprint" and de-escalating as the team does not believe the bug is "Urgent"

Comment 23 Dan Li 2020-12-02 15:22:57 UTC
Also de-escalating due to the bug lacking activities

Comment 24 Dan Li 2021-02-24 15:19:02 UTC
The bug triage team has reviewed this bug; since there has been no activity in the past two and a half months, we are closing it. Please re-open if this bug is still needed.

Comment 25 Manoj Kumar 2021-03-05 21:21:49 UTC
Reopening, as Zvonko is getting the RHEL team engaged to look at this bug.

Comment 27 Waiman Long 2021-03-09 18:29:51 UTC
(In reply to Zvonko Kosic from comment #19)
> There was a bug for the kernel where "cpuset cgroup: when a CPU goes
> offline, it is removed from all cgroup's cpuset.cpus, but when it comes
> online, it is only restored to the root cpuset.cpus"
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=42789
> 
> There is a patch for the CPU case, but was this fix also applied to cpuset.mems?
> When memory comes online, are all dependent cpuset.mems repopulated?

Actually, this is an inherent limitation of cpuset v1, because there is only one CPU and memory mask per cpuset. There is no way to save a previous state, so a CPU offline/online or memory offline/online event will change the CPU and memory mask. This can only be fully addressed by switching to cpuset v2, where there is a separate effective mask and a designated mask that is invariant.

IOW, this problem cannot be fully fixed except by switching to cgroup v2.

-Longman
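
For illustration, a sketch of what the cgroup v2 cpuset interface exposes (paths assumed for a unified-hierarchy node; the values shown are what an AC922 would be expected to report while the GPU memory is offline, not captured output):

# the configured mask survives memory offline/online events
cat /sys/fs/cgroup/kubepods.slice/cpuset.mems
0,8,252-255
# the effective mask reflects what is currently online
cat /sys/fs/cgroup/kubepods.slice/cpuset.mems.effective
0,8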

Comment 28 Dan Li 2021-03-17 14:25:04 UTC
@manokuma do you think this bug will be resolved before the end of this sprint (before March 20th)? If not, can we set the "Reviewed-in-Sprint" flag to "+"?

Comment 29 Dan Li 2021-03-18 18:01:19 UTC
Since it does not seem that this bug will be fixed before the end of this sprint per the latest email today, I'm setting the "Reviewed-in-Sprint" flag to "+"

Comment 30 Dan Li 2021-04-07 14:05:59 UTC
I'm setting the "Reviewed-in-Sprint" flag to "+" for this sprint

Comment 31 Dan Li 2021-04-28 14:06:45 UTC
I'm setting the "Reviewed-in-Sprint" flag to "+" for this sprint

Comment 32 Dan Li 2021-05-19 14:08:48 UTC
I'm setting the "Reviewed-in-Sprint" flag to "+" for this sprint

Comment 33 Dan Li 2021-06-09 15:03:36 UTC
Can we set the "Target Release" to "---" since the bug may not be fixed during 4.8?

Comment 34 Dan Li 2021-06-30 14:05:23 UTC
Setting the target release as "---" as 4.8 is GA'ing in 2 weeks. Also setting reviewed in sprint

Comment 35 Dan Li 2021-07-21 14:13:19 UTC
Rafael Sene will ping Manoj to see if there are any updates on this bug. In the meantime adding reviewed-in-sprint

Comment 37 Dan Li 2021-08-11 14:24:15 UTC
Adding "reviewed-in-sprint" flag as this bug is unlikely to be fixed during this sprint.

Comment 38 Manoj Kumar 2021-08-11 14:32:54 UTC
No significant update.  Will have to wait for OpenShift to move to cGroups v2 and for us to retest this.

Comment 39 Dan Li 2021-08-30 16:33:14 UTC
Hi Manoj, do you think this bug will be resolved before the end of the sprint? If not, I'd like to set the "Reviewed-in-Sprint" flag.

Comment 40 Manoj Kumar 2021-08-31 15:56:30 UTC
It has not been clear to me when the move to cGroups v2 occurs, whether it is in RHEL 8.5 or RHEL 8.6.   I am guessing not in this sprint.

Comment 41 Dan Li 2021-09-20 18:48:38 UTC
Hi Manoj, do you think this bug will continue to be open in the upcoming sprint? If so, I'd like to set the "Reviewed-in-Sprint" flag.

Comment 42 Dan Li 2021-09-22 14:05:21 UTC
Adding "reviewed-in-sprint"

Comment 44 Dan Li 2021-10-11 14:41:05 UTC
Hi Manoj, do you think this bug will continue to be open in the upcoming sprint? If so, I'd like to set the "Reviewed-in-Sprint" flag.

Comment 45 Dan Li 2021-10-13 14:09:18 UTC
Adding "reviewed-in-sprint"

Comment 46 Dan Li 2021-11-01 13:22:26 UTC
Hi Manoj, do you think this bug will continue to be open in the upcoming sprint? If so, I'd like to set the "Reviewed-in-Sprint" flag.

Comment 47 Dan Li 2021-11-05 13:35:17 UTC
Adding "reviewed-in-sprint". This bug is waiting for cgroup v2 in RHCOS, so no actions are taken currently.

Comment 48 Dan Li 2021-11-22 21:08:19 UTC
Setting "reviewed-in-sprint" flag as we are waiting at this point.

Comment 52 Dan Li 2022-01-04 17:39:13 UTC
Setting "reviewed-in-sprint" flag as we are waiting at this point.

Comment 54 Dan Li 2022-01-24 16:24:10 UTC
Setting "reviewed-in-sprint" flag as we are waiting at this point (after talking with the Power team)

Comment 55 Dan Li 2022-02-14 18:44:35 UTC
Setting "reviewed-in-sprint" flag as we are waiting at this point (after talking with the Power team)

Comment 56 Dan Li 2022-03-07 14:48:42 UTC
Setting "reviewed-in-sprint" flag as we are waiting at this point (after talking with the Power team). The change may happen during the 4.11 timeframe.

Comment 57 Dan Li 2022-04-18 13:56:21 UTC
Setting "reviewed-in-sprint" as per discussion during the Multi-Arch Devel meeting last week, that this bug is waiting on a later release of RHEL to be able to be verified.

Comment 58 Dan Li 2022-05-09 13:40:20 UTC
Hi Manoj, I think we mentioned that RHEL 8.6 may introduce support for cgroups v2 in crio. Do we think that this bug is ready for verification in the upcoming weeks?

Comment 59 Dan Li 2022-05-13 12:35:23 UTC
Adding reviewed-in-sprint as it is unlikely that verification is complete before the end of the current sprint.

Comment 60 Dan Li 2022-05-31 19:15:21 UTC
Though this bug might be verifiable with cgroups v2 being in 8.6, it is not the default, so it will take some effort for someone to validate. I'm keeping this bug as reviewed-in-sprint+ as validation may flow into future sprints.

Comment 63 Douglas Slavens 2022-09-12 22:14:55 UTC
This BZ was not fixed in 4.11, and it looks like it won't be fixed for 4.12, so closing.

