Bug 1843470
| Summary: | [CRI-O] CRI-O reports container as exited although the process is still running |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Node |
| Node sub component: | CRI-O |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | unspecified |
| Version: | 4.4 |
| Target Milestone: | --- |
| Target Release: | 4.5.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Roman Mohr <roman> |
| Assignee: | Peter Hunt <pehunt> |
| QA Contact: | MinLi <minmli> |
| CC: | aos-bugs, cpaquin, danken, dvossel, dwalsh, fdeutsch, jokerman, lbednar, mpatel, nagrawal, ncredi, pehunt, pelauter, rgarcia, rmohr, tsweeney, vromanso, xtian, yadu, ycui |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| Clones: | 1850168 |
| Last Closed: | 2020-07-13 17:42:56 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| Category: | --- |
| oVirt Team: | --- |
| Cloudforms Team: | --- |
| Bug Blocks: | 1837182, 1850168 |
Description (Roman Mohr, 2020-06-03 11:30:04 UTC)
Note that sometimes we also see only that the processes exited with 255, with no processes left over. I am not sure whether they would still have been around if only one container of the pod had failed, because once the whole pod is torn down, the namespaces get deleted and with them all processes are killed.

---

would you be able to give me access to a node on which this is happening?

---

(In reply to Peter Hunt from comment #2)
> would you be able to give me access to a node on which this is happening?

Will provide them to you.

---

sadly, the cluster on which this occurred was torn down, making debugging difficult. Since this was originally reported against 4.4, I would not consider this a 4.5 issue. I will not close it, because it certainly seems like a bug, but I won't be able to make progress unless I (or someone else) can reproduce it.

---

The issue also appears on the new clusters. There is still a chance that this is just an issue introduced by CNV, but if cri-o has issues with small limits on longer-running processes, this may deserve wider attention. Something like this may also slip through the periodic e2e tests, since they typically don't stay up for days. Proposing a blocker.

```
$ oc version
Client Version: 4.5.0-0.nightly-2020-05-27-032756
Server Version: 4.5.0-0.nightly-2020-05-27-032756
Kubernetes Version: v1.18.2+9835af5
```

```
# crictl version
Version:            0.1.0
RuntimeName:        cri-o
RuntimeVersion:     1.18.1-1.dev.rhaos4.5.git60ac541.el8
RuntimeApiVersion:  v1alpha1
```

---

We (CNV) would like to escalate this to urgent severity to raise attention, also considering that it is a blocker. In CNV, Bug 1837182 depends on this bug: the VMI virt-launcher pod reaches the Error state after running for 10-24 hours. This severely impacts mission-critical CNV customers, since it may mean that VMIs are not functioning, and right now no procedural workaround exists.

---

In doing a bit of debugging, I have figured out this is because of a faulty `runc state`, where the call fails for a reason we do not currently know, and the state is incorrectly set in cri-o. I have a PR up that adds additional debugging in this scenario to see if we should retry the `runc state` call in this instance.
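As a hedged illustration of the retry idea (not the actual CRI-O change), a state query with retries could look roughly like the sketch below. The helper name, retry count, and back-off are assumptions for illustration; it only relies on `runc state <id>` printing the container status as JSON.

```go
// Hypothetical sketch: retry "runc state" a few times before trusting a failure.
// Names, retry count, and back-off are illustrative, not CRI-O's implementation.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// runcState mirrors the fields of interest from `runc state <id>` JSON output.
type runcState struct {
	ID     string `json:"id"`
	Status string `json:"status"`
	Pid    int    `json:"pid"`
}

// queryStateWithRetry runs `runc state` up to `attempts` times and returns the
// first successfully parsed result instead of failing on a transient error.
func queryStateWithRetry(containerID string, attempts int) (*runcState, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		out, err := exec.Command("runc", "state", containerID).Output()
		if err == nil {
			var st runcState
			if err = json.Unmarshal(out, &st); err == nil {
				return &st, nil
			}
		}
		lastErr = err
		// A partially written state.json may become readable on the next attempt.
		time.Sleep(100 * time.Millisecond)
	}
	return nil, fmt.Errorf("runc state %s failed after %d attempts: %w", containerID, attempts, lastErr)
}

func main() {
	// Hypothetical container ID, for illustration only.
	st, err := queryStateWithRetry("15e886c54b54", 3)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Printf("container %s is %s (pid %d)\n", st.ID, st.Status, st.Pid)
}
```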
---

Hi Peter, I assume they were all running well after you replaced the crio binary, but one pod is in Error status now.

```
$ oc get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vma-tnn2v   2/2     Running   0          16h   10.129.2.10   host-172-16-0-12   <none>           <none>
virt-launcher-vmb-f5kb5   1/2     Error     0          13h   10.129.2.13   host-172-16-0-12   <none>           <none>
```

And I also created a new vm (vmc) on the node host-172-16-0-12:

```
$ oc get vmi vmc
NAME   AGE     PHASE     IP               NODENAME
vmc    3m31s   Running   10.129.2.25/23   host-172-16-0-12

$ oc get pod virt-launcher-vmc-dv7j2
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vmc-dv7j2   2/2     Running   0          3m39s
```

---

All pods are in Error status now, even the newly created one.

```
$ oc get pod -o wide
NAME                      READY   STATUS   RESTARTS   AGE     IP            NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vma-tnn2v   1/2     Error    0          21h     10.129.2.10   host-172-16-0-12   <none>           <none>
virt-launcher-vmb-f5kb5   1/2     Error    0          18h     10.129.2.13   host-172-16-0-12   <none>           <none>
virt-launcher-vmc-dv7j2   1/2     Error    0          4h32m   10.129.2.25   host-172-16-0-12   <none>           <none>
```

---

(In reply to Yan Du from comment #13)
> All pods are in Error status now, even the newly created one.

@Yan Du, thanks. That is what we were waiting for. Peter can hopefully now extract more useful information from the logs.

Peter, I hope this helps you a little:

```
Jun 09 02:42:59 host-172-16-0-12 crio[3200244]: time="2020-06-09 02:42:59.518150789Z" level=error msg="Failed to update container state for 15e886c54b54bf3234a24684daf020cda21a082cfeeef137e6c5e8c535a9c686: exit status 1"
Jun 09 04:13:30 host-172-16-0-12 crio[3200244]: time="2020-06-09 04:13:30.520530875Z" level=error msg="Failed to update container state for f6f9e8480bd75c12615071ad085365e31f58ba70a94025d71be7944214e595d6: exit status 1"
Jun 09 05:42:18 host-172-16-0-12 crio[3200244]: time="2020-06-09 05:42:18.831563200Z" level=error msg="Failed to update container state for 33c27e0c0e7fa5b193a517caffad510c385c7143b872aca868b4ea8fe075c179: exit status 1"
Jun 09 06:45:06 host-172-16-0-12 crio[3200244]: time="2020-06-09 06:45:06.459904099Z" level=error msg="Failed to update container state for b54fb87d96bc7abf5a116c724fca168a443b5c39bba75cd780c29495760943f4: exit status 1"
```

I was hoping for something more intuitive :)

---

ah dang, I too was hoping for a more informative error. Let me put up a new version with more informative output.

---

Since we did not make any progress with trying to get useful output, I am now trying to narrow down under which conditions this happens by starting pods with different specs:

```
all-like-virt-launcher                     1/1   Running   0   3m46s
all-like-virt-launcher-huge-cpu            1/1   Running   0   3m46s
all-like-virt-launcher-no-devices          1/1   Running   0   3m46s
all-like-virt-launcher-no-exec-readiness   1/1   Running   0   3m46s
all-like-virt-launcher-no-privileges       1/1   Running   0   3m46s
```

To describe the differences:

* all-like-virt-launcher looks exactly like our VMI pods, but without running a VMI
* all-like-virt-launcher-huge-cpu requests a lot of cpu, so that we can see if it is connected to https://bugzilla.redhat.com/show_bug.cgi?id=1819908
* all-like-virt-launcher-no-devices does not request devices like kvm, vhost-net and so forth
* all-like-virt-launcher-no-exec-readiness does not use an exec readiness probe, in case this has something to do with an exec probe
* all-like-virt-launcher-no-privileges does not request privileges like NET_ADMIN

So the first container restarted:
```
$ oc get pods all-like-virt-launcher-no-devices -o yaml | grep terminated -A 3
terminated:
containerID: cri-o://5831bb1e269b6d92652e713d15edba3495953be23256c06dc4585da5608170d3
exitCode: 255
finishedAt: "2020-06-10T14:06:37Z"
```
I can see that pretty much exactly at this time this is in the logs:
```
Jun 10 14:06:38 host-172-16-0-30 systemd[1]: crio-d6030614e3464603e9511574447bdbaeb0c0fa8cceaa06cef592a5778d14ec51.scope: Failed to set cpuset.cpus: No space left on device
```
So they are connected, however,
```
$ oc get pods all-like-virt-launcher-no-devices -o yaml | grep containerStatuses -A 1
f:containerStatuses: {}
f:hostIP: {}
--
containerStatuses:
- containerID: cri-o://d6030614e3464603e9511574447bdbaeb0c0fa8cceaa06cef592a5778d14ec51
```
shows that this message seems to be related to the newly created container and not to the old one. This new container is still running.
---

Can we try removing cpu limits from the pods and using requests instead? That way we can rule out CFS (which has known issues).

---

(In reply to Mrunal Patel from comment #27)
> Can we try removing cpu limits from the pods and using requests instead? That way
> we can rule out CFS (which has known issues).

There are no cpu limits on the pods, just requests.

---

As an update here: we have determined that there are cases where runc does not write the state.json file atomically. This, paired with CRI-O calling `runc state` for each exec sync request (all probes), leads to spurious failures, causing the container to be labeled as Exited. Because the virt-launcher pods have RestartPolicy: Never, it was more obvious when this happened.

The solution is threefold (a sketch of the first item appears at the end of this report):

1. update runc to write the state file atomically
2. update cri-o to retry `runc state` calls on error (just in case a similar issue pops up later)
3. update cri-o to not use `runc state` for exec sync calls (reducing the likelihood of these failures happening)

Linked is the PR for the second item. The first will likely make it into 4.5, but we may want to wait on the third until 4.6 to give it more time to soak.

---

Good to see that there is an understanding of the (likely) root cause. However, you say there is a fix for (2), and (3) has to wait. But what about (1)? And in general: how much will the fixes that are planned for 4.5 improve the situation?

---

I am verifying the fix for (1) now, and it seems likely that it is working. If so, it will make 4.5.0 as well.

---

Tested on the latest OCP build; the issue can not be reproduced.

Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
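For reference, here is a minimal sketch of the atomic-write idea described in the threefold fix above: write the state to a temporary file in the same directory, then rename it over the target, so a concurrent reader never sees a partially written file. The function name, file paths, and permissions below are assumptions for illustration, not the actual runc patch.

```go
// Hypothetical sketch of writing state.json atomically via temp file + rename.
// Names and paths are illustrative, not runc's actual code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data so readers never observe a partially written
// file: rename(2) within the same filesystem replaces the target atomically.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	dir := filepath.Dir(path)
	tmp, err := os.CreateTemp(dir, ".state-*.tmp")
	if err != nil {
		return err
	}
	// Best-effort cleanup; fails harmlessly after a successful rename.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush contents before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmp.Name(), perm); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	// Illustrative payload and path only.
	state := []byte(`{"id":"example","status":"running","pid":12345}`)
	if err := writeFileAtomic("/tmp/state.json", state, 0o600); err != nil {
		fmt.Println("write failed:", err)
	}
}
```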