Created attachment 1694806 [details]
Pod spec

Description of problem:

Version-Release number of selected component (if applicable):

```
$ oc version
Client Version: 4.4.3
Server Version: 4.4.0-rc.13
Kubernetes Version: v1.17.1
```

How reproducible:
We create a pod with a container which includes a small C binary (see attachment). The pod is guaranteed and has the following limits:

```
limits:
  cpu: 100m
  memory: 40M
```

Steps to Reproduce:
1. Create the guaranteed pod from the attached spec.
2. Let it run for a few hours.
3. Check the container status reported by the kubelet.

Actual results:
After a few hours the container is reported by the kubelet as exited:

```
state:
  terminated:
    containerID: cri-o://796e0736e1c1e02f7fde386a3977c2caafacb4dde1eaba64ddd4e3e5d2ccf0f3
    exitCode: 255
    finishedAt: "2020-06-03T08:02:07Z"
    reason: Error
    startedAt: "2020-06-03T06:00:52Z"
```

When looking at `ps aux` on the node, I see that the binary is still running:

```
root 2918543 0.0 0.0 1112 4 ? Ss 06:00 0:01 /usr/bin/container-disk --copy-path /var/run/kubevirt-ephemeral-disks/container-disk-data/11717a0e-d695-4fa0-a15a-4003b64dae22/disk_0
```

It matches the UID of the VMI and it was started at 06:00 like the pod. Further, the pod status says that the cri-o container ID is `containerID: cri-o://796e0736e1c1e02f7fde386a3977c2caafacb4dde1eaba64ddd4e3e5d2ccf0f3`. When doing

```
# crictl ps -a | grep 796
796e0736e1c1e  quay.io/yadu/kubevirt-centos7@sha256:549ef79f0bb73f6385855db2384b8a21873b2f9a243031f4b40a70bbbee50ab7  4 hours ago  Exited  volumerootdisk  0  a2a770cf8dfa6
```

we see that it claims that the container exited. Further, we can ask cri-o for the PID of the spawned process:

```
# crictl inspect 796e0736e1c1e
{
  [...]
  "pid": 2918543,
  "sandboxId": "a2a770cf8dfa65e3c91f6718e08a2c6434a5068aeabfea35bb34fe700e608538"
}
```

There we see that the process with PID `2918543`, which we saw healthy and alive above with `ps`, is indeed the one inside the "dead" container. I could not find anything in the logs, neither for cri-o nor for the kubelet, which would indicate why this is happening.

Expected results:
cri-o should report the container as still alive.
Additional info: Logs and additional output are available in https://bugzilla.redhat.com/show_bug.cgi?id=1837182. Yan Du can further give access to the cluster if this issue is unknown and further investigation is needed.
Note that sometimes we also just see that the processes exited with 255 but no processes are left over. I am not sure if they would still have been around if only one container out of the pod had failed, because once the whole pod is torn down, the namespaces get deleted and with them all processes are killed.
would you be able to give me access to a node on which this is happening?
(In reply to Peter Hunt from comment #2) > would you be able to give me access to a node on which this is happening? Will provide them to you.
sadly, the cluster on which this occurred was torn down, making debugging difficult. Since this was originally reported against 4.4, I would not consider it a 4.5 issue. I will not close it because it certainly seems like a bug, but I won't be able to make progress unless I (or someone else) can reproduce it.
The issue also appears on the new clusters. There is still a chance that this is just an issue introduced by CNV, but if cri-o has issues with small limits on longer-running processes, this may deserve wider attention. Something like this may also slip through the periodic e2e tests, since they typically don't stay up for days. Proposing a blocker.

```
$ oc version
Client Version: 4.5.0-0.nightly-2020-05-27-032756
Server Version: 4.5.0-0.nightly-2020-05-27-032756
Kubernetes Version: v1.18.2+9835af5
```

```
# crictl version
Version:            0.1.0
RuntimeName:        cri-o
RuntimeVersion:     1.18.1-1.dev.rhaos4.5.git60ac541.el8
RuntimeApiVersion:  v1alpha1
```
We in CNV would like to escalate this to Urgent severity to raise attention, also considering it is a blocker. In CNV, Bug 1837182 depends on this bug: the VMI virt-launcher pod reaches the Error state after running for 10-24 hours. This severely impacts mission-critical workloads of CNV customers, since it can mean that VMIs are not functioning, and right now no procedural workaround exists.
In doing a bit of debugging, I have figured out this is because of a faulty `runc state` call, which fails for a reason we do not currently know, after which the state is incorrectly set in cri-o. I have a PR up that adds additional debugging in this scenario to see if we should retry the `runc state` call in this instance.
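For context, the retry idea could look roughly like the following sketch. This is not CRI-O's actual code: `runState` here is a hypothetical stand-in for invoking `runc state <id>`, rigged to fail twice so the retry path is exercised.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// calls counts invocations so the stand-in can simulate two
// transient "exit status 1" failures before succeeding.
var calls int

// runState is a hypothetical stand-in for running `runc state <id>`
// and capturing its JSON output.
func runState(id string) (string, error) {
	calls++
	if calls < 3 {
		return "", errors.New("exit status 1")
	}
	return `{"status": "running"}`, nil
}

// stateWithRetry retries the state query a few times before giving up,
// so a single spurious failure does not mark the container as Exited.
func stateWithRetry(id string, attempts int) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		out, err := runState(id)
		if err == nil {
			return out, nil
		}
		lastErr = err
		time.Sleep(10 * time.Millisecond) // brief backoff between attempts
	}
	return "", lastErr
}

func main() {
	out, err := stateWithRetry("796e0736e1c1e", 5)
	fmt.Println(out, err)
}
```

The point of the sketch is only that a bounded number of retries with a short backoff would absorb a transient `runc state` failure instead of flipping the container to Exited on the first error.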
Hi Peter,

I assume they were all running well after you replaced the crio binary, but one pod is in Error status now.

```
$ oc get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vma-tnn2v   2/2     Running   0          16h   10.129.2.10   host-172-16-0-12   <none>           <none>
virt-launcher-vmb-f5kb5   1/2     Error     0          13h   10.129.2.13   host-172-16-0-12   <none>           <none>
```

And I also created a new vm (vmc) on the node host-172-16-0-12:

```
$ oc get vmi vmc
NAME   AGE     PHASE     IP               NODENAME
vmc    3m31s   Running   10.129.2.25/23   host-172-16-0-12

$ oc get pod virt-launcher-vmc-dv7j2
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vmc-dv7j2   2/2     Running   0          3m39s
```
All pods are in Error status now, even the newly created one.

```
$ oc get pod -o wide
NAME                      READY   STATUS   RESTARTS   AGE     IP            NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vma-tnn2v   1/2     Error    0          21h     10.129.2.10   host-172-16-0-12   <none>           <none>
virt-launcher-vmb-f5kb5   1/2     Error    0          18h     10.129.2.13   host-172-16-0-12   <none>           <none>
virt-launcher-vmc-dv7j2   1/2     Error    0          4h32m   10.129.2.25   host-172-16-0-12   <none>           <none>
```
(In reply to Yan Du from comment #13) > All pods are in Error status now, even the new created one. > > $ oc get pod -o wide > NAME READY STATUS RESTARTS AGE IP > NODE NOMINATED NODE READINESS GATES > virt-launcher-vma-tnn2v 1/2 Error 0 21h 10.129.2.10 > host-172-16-0-12 <none> <none> > virt-launcher-vmb-f5kb5 1/2 Error 0 18h 10.129.2.13 > host-172-16-0-12 <none> <none> > virt-launcher-vmc-dv7j2 1/2 Error 0 4h32m 10.129.2.25 > host-172-16-0-12 <none> <none> @Yan Du, thanks. That is what we were waiting for. Peter can hopefully now extract more useful information from the logs.
Peter, I hope this helps you a little:

```
Jun 09 02:42:59 host-172-16-0-12 crio[3200244]: time="2020-06-09 02:42:59.518150789Z" level=error msg="Failed to update container state for 15e886c54b54bf3234a24684daf020cda21a082cfeeef137e6c5e8c535a9c686: exit status 1"
Jun 09 04:13:30 host-172-16-0-12 crio[3200244]: time="2020-06-09 04:13:30.520530875Z" level=error msg="Failed to update container state for f6f9e8480bd75c12615071ad085365e31f58ba70a94025d71be7944214e595d6: exit status 1"
Jun 09 05:42:18 host-172-16-0-12 crio[3200244]: time="2020-06-09 05:42:18.831563200Z" level=error msg="Failed to update container state for 33c27e0c0e7fa5b193a517caffad510c385c7143b872aca868b4ea8fe075c179: exit status 1"
Jun 09 06:45:06 host-172-16-0-12 crio[3200244]: time="2020-06-09 06:45:06.459904099Z" level=error msg="Failed to update container state for b54fb87d96bc7abf5a116c724fca168a443b5c39bba75cd780c29495760943f4: exit status 1"
```

I was hoping for something more intuitive :)
ah dang, I too was hoping for a more informative error. Let me put up a new version with more informative output.
Since we did not make any progress with trying to get useful output, I am now trying to narrow down under which conditions this happens by starting pods with different specs:

```
all-like-virt-launcher                     1/1     Running   0          3m46s
all-like-virt-launcher-huge-cpu            1/1     Running   0          3m46s
all-like-virt-launcher-no-devices          1/1     Running   0          3m46s
all-like-virt-launcher-no-exec-readiness   1/1     Running   0          3m46s
all-like-virt-launcher-no-privileges       1/1     Running   0          3m46s
```

To describe the differences:

* all-like-virt-launcher looks exactly like our VMI pods, but without running a VMI
* all-like-virt-launcher-huge-cpu requests a lot of cpu, so that we can see if it is connected to https://bugzilla.redhat.com/show_bug.cgi?id=1819908
* all-like-virt-launcher-no-devices does not request devices like kvm, vhost-net and so forth
* all-like-virt-launcher-no-exec-readiness does not use an exec readiness probe, in case this has something to do with an exec probe
* all-like-virt-launcher-no-privileges does not request privileges like NET_ADMIN.
So the first container restarted:

```
$ oc get pods all-like-virt-launcher-no-devices -o yaml | grep terminated -A 3
    terminated:
      containerID: cri-o://5831bb1e269b6d92652e713d15edba3495953be23256c06dc4585da5608170d3
      exitCode: 255
      finishedAt: "2020-06-10T14:06:37Z"
```

I can see that pretty much exactly at this time this is in the logs:

```
Jun 10 14:06:38 host-172-16-0-30 systemd[1]: crio-d6030614e3464603e9511574447bdbaeb0c0fa8cceaa06cef592a5778d14ec51.scope: Failed to set cpuset.cpus: No space left on device
```

So they are connected. However,

```
$ oc get pods all-like-virt-launcher-no-devices -o yaml | grep containerStatuses -A 1
        f:containerStatuses: {}
        f:hostIP: {}
--
  containerStatuses:
  - containerID: cri-o://d6030614e3464603e9511574447bdbaeb0c0fa8cceaa06cef592a5778d14ec51
```

shows that this message seems to be related to the newly created container and not to the old one. This new container is still running.
Can we try removing cpu limits from pods and use requests instead? That way we can rule out CFS (which has known issues).
(In reply to Mrunal Patel from comment #27) > Can we try removing cpu limits from pods and use requests instead? That way > we can rule out CFS (which has known issues). There are no cpu limits on the pods, just requests.
As an update here: we have determined there are cases where runc does not write the state.json file atomically. This, paired with CRI-O calling `runc state` for each exec sync request (all probes), leads to spurious failures, causing the container to be labeled as Exited. Because the virt-launcher pods have a RestartPolicy: Never, it was more obvious when this happened. The solution is threefold:

1. update runc to write state.json atomically
2. update CRI-O to retry `runc state` calls on error (just in case a similar issue pops up later)
3. update CRI-O to not use `runc state` for exec sync calls (reducing the likelihood of these failures happening)

Linked is the PR for the second item. The first will likely make it into 4.5, but we may want to wait on the third until 4.6 to give it more time to soak.
Good to see that there is an understanding of the (likely) root cause. However, you say there is a fix for (2). And (3) has to wait. But what about (1)? And in general: How much will the fixes that are planned for 4.5 improve the situation there?
I am verifying the fix for 1 now, it seems likely that it is working. If so, it will make 4.5.0 as well.
Tested on the latest OCP build; the issue cannot be reproduced.

```
Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409