Bug 1843470
| Summary: | [CRI-O] CRI-O reports container as exited although the process is still running |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Node |
| Node sub component: | CRI-O |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | unspecified |
| Version: | 4.4 |
| Target Milestone: | --- |
| Target Release: | 4.5.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Roman Mohr <roman> |
| Assignee: | Peter Hunt <pehunt> |
| QA Contact: | MinLi <minmli> |
| CC: | aos-bugs, cpaquin, danken, dvossel, dwalsh, fdeutsch, jokerman, lbednar, mpatel, nagrawal, ncredi, pehunt, pelauter, rgarcia, rmohr, tsweeney, vromanso, xtian, yadu, ycui |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| Clones: | 1850168 |
| Last Closed: | 2020-07-13 17:42:56 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| Category: | --- |
| oVirt Team: | --- |
| Cloudforms Team: | --- |
| Bug Blocks: | 1837182, 1850168 |
Description (Roman Mohr, 2020-06-03 11:30:04 UTC)
Note that sometimes we also see only that the processes exited with 255, with no processes left over. I am not sure whether they would still have been around if only one container of the pod had failed, because once the whole pod is torn down, the namespaces get deleted and with them all processes are killed.

---

would you be able to give me access to a node on which this is happening?

---

(In reply to Peter Hunt from comment #2)
> would you be able to give me access to a node on which this is happening?

Will provide them to you.

---

sadly, the cluster on which this occurred was torn down, making debugging difficult. Since this was originally reported against 4.4, I would not consider this a 4.5 issue. I will not close it, because it certainly seems like a bug, but I won't be able to make progress unless I (or someone else) can reproduce it.

---

The issue also appears on the new clusters. There is still a chance that this is just an issue introduced by CNV, but if cri-o has issues with small limits on longer-running processes, this may deserve wider attention. Something like this may also slip through the periodic e2e tests, since they typically don't stay up for days. Proposing a blocker.

```
$ oc version
Client Version: 4.5.0-0.nightly-2020-05-27-032756
Server Version: 4.5.0-0.nightly-2020-05-27-032756
Kubernetes Version: v1.18.2+9835af5
```

```
# crictl version
Version:            0.1.0
RuntimeName:        cri-o
RuntimeVersion:     1.18.1-1.dev.rhaos4.5.git60ac541.el8
RuntimeApiVersion:  v1alpha1
```

---

We (CNV) would like to escalate this to urgent severity to raise attention, also considering that it is a blocker. In CNV, Bug 1837182 depends on this bug: the VMI virt-launcher pod reaches the Error state after running for 10-24 hours. This severely impacts mission-critical CNV customers, since it may mean that VMIs are not functioning, and right now no procedural workaround exists.

---

In doing a bit of debugging, I have figured out this is because of a faulty `runc state`, where the call fails for a reason we do not currently know, and the state is incorrectly set in cri-o. I have a PR up that adds additional debugging in this scenario to see if we should retry the `runc state` call in this instance.
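As a hedged illustration of the retry idea (not the actual CRI-O change), a state query with retries could look roughly like the sketch below. The helper name, retry count, and back-off are assumptions for illustration; it only relies on `runc state <id>` printing the container status as JSON.

```go
// Hypothetical sketch: retry "runc state" a few times before trusting a failure.
// Names, retry count, and back-off are illustrative, not CRI-O's implementation.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// runcState mirrors the fields of interest from `runc state <id>` JSON output.
type runcState struct {
	ID     string `json:"id"`
	Status string `json:"status"`
	Pid    int    `json:"pid"`
}

// queryStateWithRetry runs `runc state` up to `attempts` times and returns the
// first successfully parsed result instead of failing on a transient error.
func queryStateWithRetry(containerID string, attempts int) (*runcState, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		out, err := exec.Command("runc", "state", containerID).Output()
		if err == nil {
			var st runcState
			if err = json.Unmarshal(out, &st); err == nil {
				return &st, nil
			}
		}
		lastErr = err
		// A partially written state.json may become readable on the next attempt.
		time.Sleep(100 * time.Millisecond)
	}
	return nil, fmt.Errorf("runc state %s failed after %d attempts: %w", containerID, attempts, lastErr)
}

func main() {
	// Hypothetical container ID, for illustration only.
	st, err := queryStateWithRetry("15e886c54b54", 3)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Printf("container %s is %s (pid %d)\n", st.ID, st.Status, st.Pid)
}
```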
---

Hi Peter, I assume they were all running well after you replaced the crio binary, but one pod is in Error status now.

```
$ oc get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vma-tnn2v   2/2     Running   0          16h   10.129.2.10   host-172-16-0-12   <none>           <none>
virt-launcher-vmb-f5kb5   1/2     Error     0          13h   10.129.2.13   host-172-16-0-12   <none>           <none>
```

And I also created a new vm (vmc) on the node host-172-16-0-12:

```
$ oc get vmi vmc
NAME   AGE     PHASE     IP               NODENAME
vmc    3m31s   Running   10.129.2.25/23   host-172-16-0-12

$ oc get pod virt-launcher-vmc-dv7j2
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vmc-dv7j2   2/2     Running   0          3m39s
```

---

All pods are in Error status now, even the newly created one.

```
$ oc get pod -o wide
NAME                      READY   STATUS   RESTARTS   AGE     IP            NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vma-tnn2v   1/2     Error    0          21h     10.129.2.10   host-172-16-0-12   <none>           <none>
virt-launcher-vmb-f5kb5   1/2     Error    0          18h     10.129.2.13   host-172-16-0-12   <none>           <none>
virt-launcher-vmc-dv7j2   1/2     Error    0          4h32m   10.129.2.25   host-172-16-0-12   <none>           <none>
```

---

(In reply to Yan Du from comment #13)
> All pods are in Error status now, even the newly created one.

@Yan Du, thanks. That is what we were waiting for. Peter can hopefully now extract more useful information from the logs.

Peter, I hope this helps you a little:

```
Jun 09 02:42:59 host-172-16-0-12 crio[3200244]: time="2020-06-09 02:42:59.518150789Z" level=error msg="Failed to update container state for 15e886c54b54bf3234a24684daf020cda21a082cfeeef137e6c5e8c535a9c686: exit status 1"
Jun 09 04:13:30 host-172-16-0-12 crio[3200244]: time="2020-06-09 04:13:30.520530875Z" level=error msg="Failed to update container state for f6f9e8480bd75c12615071ad085365e31f58ba70a94025d71be7944214e595d6: exit status 1"
Jun 09 05:42:18 host-172-16-0-12 crio[3200244]: time="2020-06-09 05:42:18.831563200Z" level=error msg="Failed to update container state for 33c27e0c0e7fa5b193a517caffad510c385c7143b872aca868b4ea8fe075c179: exit status 1"
Jun 09 06:45:06 host-172-16-0-12 crio[3200244]: time="2020-06-09 06:45:06.459904099Z" level=error msg="Failed to update container state for b54fb87d96bc7abf5a116c724fca168a443b5c39bba75cd780c29495760943f4: exit status 1"
```

I was hoping for something more intuitive :)

---

ah dang, I too was hoping for a more informative error. Let me put up a new version with more informative output.

---

Since we did not make any progress with trying to get useful output, I am now trying to narrow down under which conditions this happens by starting pods with different specs:

```
all-like-virt-launcher                     1/1   Running   0   3m46s
all-like-virt-launcher-huge-cpu            1/1   Running   0   3m46s
all-like-virt-launcher-no-devices          1/1   Running   0   3m46s
all-like-virt-launcher-no-exec-readiness   1/1   Running   0   3m46s
all-like-virt-launcher-no-privileges       1/1   Running   0   3m46s
```

To describe the differences:

* all-like-virt-launcher looks exactly like our VMI pods, but without running a VMI
* all-like-virt-launcher-huge-cpu requests a lot of cpu, so that we can see if it is connected to https://bugzilla.redhat.com/show_bug.cgi?id=1819908
* all-like-virt-launcher-no-devices does not request devices like kvm, vhost-net and so forth
* all-like-virt-launcher-no-exec-readiness does not use an exec readiness probe, in case this has something to do with an exec probe
* all-like-virt-launcher-no-privileges does not request privileges like NET_ADMIN

So the first container restarted:
```
$ oc get pods all-like-virt-launcher-no-devices -o yaml | grep terminated -A 3
terminated:
containerID: cri-o://5831bb1e269b6d92652e713d15edba3495953be23256c06dc4585da5608170d3
exitCode: 255
finishedAt: "2020-06-10T14:06:37Z"
```
I can see that pretty much exactly at this time this is in the logs:
```
Jun 10 14:06:38 host-172-16-0-30 systemd[1]: crio-d6030614e3464603e9511574447bdbaeb0c0fa8cceaa06cef592a5778d14ec51.scope: Failed to set cpuset.cpus: No space left on device
```
So they are connected, however,
```
$ oc get pods all-like-virt-launcher-no-devices -o yaml | grep containerStatuses -A 1
f:containerStatuses: {}
f:hostIP: {}
--
containerStatuses:
- containerID: cri-o://d6030614e3464603e9511574447bdbaeb0c0fa8cceaa06cef592a5778d14ec51
```
shows that this message seems to be related to the newly created container and not to the old one. This new container is still running.
---

Can we try removing cpu limits from the pods and using requests instead? That way we can rule out CFS (which has known issues).

---

(In reply to Mrunal Patel from comment #27)
> Can we try removing cpu limits from the pods and using requests instead? That way
> we can rule out CFS (which has known issues).

There are no cpu limits on the pods, just requests.

---

As an update here: we have determined that there are cases where runc does not write the state.json file atomically. This, paired with CRI-O calling `runc state` for each exec sync request (all probes), leads to spurious failures, causing the container to be labeled as Exited. Because the virt-launcher pods have RestartPolicy: Never, it was more obvious when this happened.

The solution is threefold (a sketch of the first item appears at the end of this report):

1. update runc to write the state file atomically
2. update cri-o to retry `runc state` calls on error (just in case a similar issue pops up later)
3. update cri-o to not use `runc state` for exec sync calls (reducing the likelihood of these failures happening)

Linked is the PR for the second item. The first will likely make it into 4.5, but we may want to wait on the third until 4.6 to give it more time to soak.

---

Good to see that there is an understanding of the (likely) root cause. However, you say there is a fix for (2), and (3) has to wait. But what about (1)? And in general: how much will the fixes that are planned for 4.5 improve the situation?

---

I am verifying the fix for (1) now, and it seems likely that it is working. If so, it will make 4.5.0 as well.

---

Tested on the latest OCP build; the issue can not be reproduced.

Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
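For reference, here is a minimal sketch of the atomic-write idea described in the threefold fix above: write the state to a temporary file in the same directory, then rename it over the target, so a concurrent reader never sees a partially written file. The function name, file paths, and permissions below are assumptions for illustration, not the actual runc patch.

```go
// Hypothetical sketch of writing state.json atomically via temp file + rename.
// Names and paths are illustrative, not runc's actual code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data so readers never observe a partially written
// file: rename(2) within the same filesystem replaces the target atomically.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	dir := filepath.Dir(path)
	tmp, err := os.CreateTemp(dir, ".state-*.tmp")
	if err != nil {
		return err
	}
	// Best-effort cleanup; fails harmlessly after a successful rename.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush contents before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmp.Name(), perm); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	// Illustrative payload and path only.
	state := []byte(`{"id":"example","status":"running","pid":12345}`)
	if err := writeFileAtomic("/tmp/state.json", state, 0o600); err != nil {
		fmt.Println("write failed:", err)
	}
}
```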