Bug 1837182 - VMI virt-launcher reaches Error state after running for 10-24 hours
Summary: VMI virt-launcher reaches Error state after running for 10-24 hours
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.4.0
Assignee: sgott
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On: 1843470 1850168
Blocks:
 
Reported: 2020-05-19 02:49 UTC by Yan Du
Modified: 2020-07-28 19:10 UTC
CC: 9 users

Fixed In Version: v2.3.0-388
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-28 19:10:07 UTC
Target Upstream Version:
Embargoed:


Attachments
describe (10.16 KB, text/plain) - 2020-05-19 02:50 UTC, Yan Du
launcher pod log (234.75 KB, text/plain) - 2020-05-19 02:51 UTC, Yan Du
vm spec (1.06 KB, text/plain) - 2020-05-19 02:52 UTC, Yan Du
compute log (235.51 KB, text/plain) - 2020-05-26 02:05 UTC, Yan Du
Weird crash loops on a lot of pods (49.07 KB, text/plain) - 2020-06-10 09:01 UTC, Roman Mohr


Links
Red Hat Bugzilla 1843470 (CLOSED): [CRI-O] CRI-O reports container as exited although the process is still running - last updated 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2020:3194 - last updated 2020-07-28 19:10:18 UTC

Internal Links: 1843470

Description Yan Du 2020-05-19 02:49:33 UTC
Description of problem:
The virt-launcher pod of an SR-IOV VMI reaches the Error state after running for a long time.


Version-Release number of selected component (if applicable):
Client Version: 4.4.3
Server Version: 4.4.0-rc.13
Kubernetes Version: v1.17.1
CNV2.3


How reproducible:
Always


Steps to Reproduce:
1. Create SriovNetworkNodePolicy and SriovNetwork
2. Create VMI with SR-IOV network
3. Keep the VMI running for a long time (10-24 hours; a minimal sketch of the VMI from step 2 is shown below)
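
For reference, a minimal sketch of the VMI from step 2, assuming the SriovNetwork from step 1 exposes a NetworkAttachmentDefinition named sriov-network in the VMI's namespace; resource names and the container disk image are illustrative, not the exact spec attached to this bug:

```
cat <<'EOF' | oc apply -f -
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vmi-centos
spec:
  domain:
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
      interfaces:
      - name: default
        masquerade: {}             # default pod network
      - name: sriov-net
        sriov: {}                  # SR-IOV binding for the secondary interface
    resources:
      requests:
        memory: 1Gi
  networks:
  - name: default
    pod: {}
  - name: sriov-net
    multus:
      networkName: sriov-network   # NetworkAttachmentDefinition created by the SriovNetwork CR
  volumes:
  - name: containerdisk
    containerDisk:
      image: quay.io/kubevirt/fedora-cloud-container-disk-demo
EOF
```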


Actual results:
$ oc get pod
NAME                                READY   STATUS    RESTARTS   AGE
virt-launcher-vmi-centos-98qf2      1/2     Error     0          24h

$ oc get vmi
NAME            AGE   PHASE     IP            NODENAME
vmi-centos      24h   Running   10.0.2.2/24   cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com


Expected results:
The launcher pod should stay in the Running state for as long as the VMI is running.


Additional info:
1. The VM spec is attached
2. The describe output and the virt-launcher pod log are attached

Comment 1 Yan Du 2020-05-19 02:50:43 UTC
Created attachment 1689732 [details]
describe

Comment 2 Yan Du 2020-05-19 02:51:15 UTC
Created attachment 1689733 [details]
launcher pod log

Comment 3 Yan Du 2020-05-19 02:52:46 UTC
Created attachment 1689734 [details]
vm spec

Comment 4 Alona Kaplan 2020-05-24 19:12:06 UTC
(In reply to Yan Du from comment #2)
> Created attachment 1689733 [details]
> launcher pod log

The launcher has two containers: compute and volumecontainerdisk. Can you please attach a separate log for each of them?

Comment 5 Yan Du 2020-05-26 02:04:16 UTC
@Alona
I'm afraid the volumecontainerdisk log is empty; only the compute container log exists. I can preserve the cluster for you if needed.

Comment 6 Yan Du 2020-05-26 02:05:20 UTC
Created attachment 1692079 [details]
compute log

Comment 7 Alona Kaplan 2020-05-26 07:12:34 UTC
Yan, can you please attach the journalctl -u kubelet output?

Comment 8 Yan Du 2020-05-26 07:40:15 UTC
(In reply to Alona Kaplan from comment #7)
> Yan, can you please attach the journalctl -u kubelet output?

The kubelet log is too big to attach in Bugzilla; may I share the cluster with you?
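
For reference, one way to export and compress the kubelet journal before sharing it; the time window is illustrative:

```
# Dump the kubelet journal for the window around the failure and compress it
journalctl -u kubelet --since "2020-05-18 00:00:00" --until "2020-05-19 00:00:00" \
  --no-pager | gzip > kubelet-$(hostname).log.gz
```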

Comment 9 Dan Kenigsberg 2020-05-26 13:44:41 UTC
Yan, in case you did not provide cluster credentials to Alona yet, please be aware that you can create a huge file in Google Drive and share the URL here.

Comment 10 Dan Kenigsberg 2020-05-26 14:45:53 UTC
Yan, how long is "long time"? Or Mergi reports an SR-IOV VM running well for over a week.

Comment 11 Yan Du 2020-05-27 02:15:24 UTC
(In reply to Dan Kenigsberg from comment #9)
> Yan, in case you did not provide cluster credentials to Alona yet, please be
> aware that you can create a huge file in Google Drive and share the URL here.

ok, I will do that. thanks

(In reply to Dan Kenigsberg from comment #10)
> Yan, how long is "long time"? Or Mergi reports an SR-IOV VM running well for
> over a week.

I can reproduce the issue after running the vmi for about 10-24 hours.

Comment 13 Dan Kenigsberg 2020-05-27 05:31:58 UTC
(In reply to Yan Du from comment #11)
> > Yan, how long is "long time"? Or Mergi reports an SR-IOV VM running well for
> > over a week.
> 
> I can reproduce the issue after running the vmi for about 10-24 hours.

Would you please try to reproduce it on cnv.engineering? Please start two VMs, one with an SR-IOV device and one without it?

Comment 14 Yan Du 2020-05-27 07:08:25 UTC
(In reply to Dan Kenigsberg from comment #13)
> (In reply to Yan Du from comment #11)
> > > Yan, how long is "long time"? Or Mergi reports an SR-IOV VM running well for
> > > over a week.
> > 
> > I can reproduce the issue after running the vmi for about 10-24 hours.
> 
> Would you please try to reproduce it on cnv.engineering? Please start two
> VMs, one with an SR-IOV device and one without it?

OK, I will prepare the VMs.

Comment 17 David Vossel 2020-06-02 19:59:36 UTC
(In reply to Yan Du from comment #11)
> (In reply to Dan Kenigsberg from comment #9)
> > Yan, in case you did not provide cluster credentials to Alona yet, please be
> > aware that you can create a huge file in Google Drive and share the URL here.
> 
> ok, I will do that. thanks
> 
> (In reply to Dan Kenigsberg from comment #10)
> > Yan, how long is "long time"? Or Mergi reports an SR-IOV VM running well for
> > over a week.
> 
> I can reproduce the issue after running the vmi for about 10-24 hours.

FYI, we can see exactly when the containerdisk container started and when it stopped by looking at the pod's status. From the logs, we have the following:

      Started:      Mon, 18 May 2020 02:03:47 +0000
      Finished:     Mon, 18 May 2020 06:44:57 +0000
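
The same information can be pulled straight from the pod status; a minimal sketch, using the pod name from the report above:

```
# Print the termination details (exit code, started/finished times) of each container
oc get pod virt-launcher-vmi-centos-98qf2 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state.terminated}{"\n"}{end}'
```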

Comment 20 Roman Mohr 2020-06-03 07:30:15 UTC
Looking at

```
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Mon, 18 May 2020 02:03:47 +0000
      Finished:     Mon, 18 May 2020 06:44:57 +0000
    Ready:          False
```

it looks like this is not OOM related, since that would be explicitly reflected in the status. Furthermore, it is interesting that it is exiting with `255`. We have in the code only `exit(0)` and `exit(1)`, but no `exit(255)` or `exit(-1)`. This indicates that the cluster terminated the container for some reason. Let's check the kubelet logs.

@Yan Du, maybe I missed it, but the kubelet logs seem to start on May 25th and don't include May 18th.

Comment 21 Yan Du 2020-06-03 08:14:39 UTC
Hi, Roman

Yes, it seems that the old kubelet logs are cleared after a certain period of time. I checked the logs today and they start at 2020-06-01:
-- Logs begin at Mon 2020-06-01 23:02:33 UTC, end at Wed 2020-06-03 08:05:49 UTC. --

I will try to reproduce the issue and get the log again.

Comment 23 Roman Mohr 2020-06-03 10:20:41 UTC
So here is the crazy thing. I am on the cluster. We are talking about pod `virt-launcher-centos3-glljf` for the VMI `centos3` with UID `11717a0e-d695-4fa0-a15a-4003b64dae22`. 



When looking at ps aux, I see that the container-disk binary is still running:

```
root     2918543  0.0  0.0   1112     4 ?        Ss   06:00   0:01 /usr/bin/container-disk --copy-path /var/run/kubevirt-ephemeral-disks/container-disk-data/11717a0e-d695-4fa0-a15a-4003b64dae22/disk_0
```

It matches the UID of the VMI, and it was started at 06:00, like the VMI.

Further, the pod status says that the cri-o container ID is `containerID: cri-o://796e0736e1c1e02f7fde386a3977c2caafacb4dde1eaba64ddd4e3e5d2ccf0f3`.

When doing

```
# crictl ps -a | grep 796
796e0736e1c1e       quay.io/yadu/kubevirt-centos7@sha256:549ef79f0bb73f6385855db2384b8a21873b2f9a243031f4b40a70bbbee50ab7                                                                                 4 hours ago         Exited              volumerootdisk                    0                   a2a770cf8dfa6
```

we see that it claims that the container exited.

Further we can ask cri-o for the PID of the spawned process:

```
# crictl inspect 796e0736e1c1e
{
[...]
  "pid": 2918543,
  "sandboxId": "a2a770cf8dfa65e3c91f6718e08a2c6434a5068aeabfea35bb34fe700e608538"
}

```

There we see that the process with PID `2918543`, which we saw healthy and alive above with `ps`, is indeed the one inside the "dead" container.

This may be a cri-o bug.
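
For completeness, the PID-to-container mapping can also be cross-checked from the process side; a small sketch, with the PID and container ID taken from the output above:

```
# The cgroup of the "exited" container's PID should still reference its crio-<id>.scope
grep 796e0736e1c1e /proc/2918543/cgroup
# Show the parent process (for cri-o containers this is normally a conmon instance)
ps -o pid,ppid,cmd -p 2918543
```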

Comment 24 Alona Kaplan 2020-06-03 10:37:33 UTC
BTW, I tried to debug Yan's cluster about a week ago. It had 3 VMs in the Error state. All of them had failing containers (with exit code 255) in the virt-launcher pod.
However, one of them failed on the container-disk container, another one on compute, and the third on both.

Yan, do you see this behaviour only on VMs with an SR-IOV interface?
Also, please let us know whether the bug is reproducible on cnv.engineering.

Comment 25 Roman Mohr 2020-06-03 11:34:10 UTC
Created https://bugzilla.redhat.com/show_bug.cgi?id=1843470 to get more input from cri-o.

Comment 27 Alona Kaplan 2020-06-03 22:07:59 UTC
@Yan, looking at cnv.engineering, it seems vmi-fedora failed migration and has a different error than the one described in this bug.
Couldn't find centos3.

@Roman, I rechecked Yan's cluster (the original one, not cnv.engineering). The 'centos3' VMI fails on the volumecontainerdisk container, and indeed its process is still running, as you mentioned.
But vmi-perf fails on both compute and volumecontainerdisk, and neither of the processes exists.

Since vmi-perf has the same 255 error described in the bug, but the processes of its containers are not running, I don't think the kubelet bug is the root cause of the problem.
However, I still don't understand what causes the containers to crash/stop responding and why the kubelet marks running (unhealthy) containers as terminated.



The logs (reversed) -
'centos3' volumecontainerdisk error -
Jun 03 08:02:07 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: I0603 08:02:07.993577    2855 policy_static.go:249] [cpumanager] static policy: RemoveContainer (container id: 796e0736e1c1e02f7fde386a3977c2caafacb4dde1eaba64ddd4e3e5d2ccf0f3)
Jun 03 08:02:07 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: I0603 08:02:07.992142    2855 kubelet.go:1958] SyncLoop (PLEG): "virt-launcher-centos3-glljf_default(c4266cfb-8e30-4968-987d-a890998fc91f)", event: &pleg.PodLifecycleEvent{ID:"c4266cfb-8e30-4968-987d-a890998fc91f", Type:"ContainerDied", Data:"796e0736e1c1e02f7fde386a3977c2caafacb4dde1eaba64ddd4e3e5d2ccf0f3"}
Jun 03 08:02:07 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: E0603 08:02:07.665991    2855 remote_runtime.go:351] ExecSync 796e0736e1c1e02f7fde386a3977c2caafacb4dde1eaba64ddd4e3e5d2ccf0f3 '/usr/bin/container-disk --health-check' from runtime service failed: rpc error: code = Unknown desc = container is not created or running
Jun 03 08:02:07 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: E0603 08:02:07.665532    2855 remote_runtime.go:351] ExecSync 796e0736e1c1e02f7fde386a3977c2caafacb4dde1eaba64ddd4e3e5d2ccf0f3 '/usr/bin/container-disk --health-check' from runtime service failed: rpc error: code = Unknown desc = container is not created or running
Jun 03 08:02:07 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: E0603 08:02:07.645136    2855 remote_runtime.go:351] ExecSync 796e0736e1c1e02f7fde386a3977c2caafacb4dde1eaba64ddd4e3e5d2ccf0f3 '/usr/bin/container-disk --health-check' from runtime service failed: rpc error: code = Unknown desc = command error: time="2020-06-03T08:02:07Z" level=error msg="exec failed: EOF"


vmi-perf compute error -
Jun 03 14:42:08 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: I0603 14:42:08.596611    2855 manager.go:1007] Destroyed container: "/kubepods.slice/kubepods-pod4fd150b9_72b8_4510_a729_63d400600f35.slice/crio-4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa.scope" (aliases: [k8s_compute_virt-launcher-vmi-perf-ln9cb_default_4fd150b9-72b8-4510-a729-63d400600f35_0 4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa], namespace: "crio")
Jun 03 14:42:08 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: I0603 14:42:08.299309    2855 manager.go:1007] Destroyed container: "/kubepods.slice/kubepods-pod4fd150b9_72b8_4510_a729_63d400600f35.slice/crio-conmon-4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa.scope" (aliases: [], namespace: "")
Jun 03 14:42:06 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: I0603 14:42:06.559448    2855 status_manager.go:577] Status for pod "virt-launcher-vmi-perf-ln9cb_default(4fd150b9-72b8-4510-a729-63d400600f35)" updated successfully: (6, {Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-06-03 07:55:54 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-06-03 11:52:30 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [compute volumecontainerdisk]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-06-03 11:52:30 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [compute volumecontainerdisk]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-06-03 07:55:54 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.9.96.51 PodIP:10.129.3.22 PodIPs:[{IP:10.129.3.22}] StartTime:2020-06-03 07:55:54 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:compute State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Error,Message:,StartedAt:2020-06-03 07:55:58 +0000 UTC,FinishedAt:2020-06-03 14:42:05 +0000 UTC,ContainerID:cri-o://4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:registry.stage.redhat.io/container-native-virtualization/virt-launcher@sha256:1f7247af45ace7207601666be3ee7949a84d8bec3067aac2567f1c4d76a5f7fd ImageID:registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-virt-launcher@sha256:1f7247af45ace7207601666be3ee7949a84d8bec3067aac2567f1c4d76a5f7fd ContainerID:cri-o://4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa Started:0xc005b6801a} {Name:volumecontainerdisk State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Error,Message:,StartedAt:2020-06-03 07:56:10 +0000 UTC,FinishedAt:2020-06-03 11:52:29 +0000 UTC,ContainerID:cri-o://a61264c111d6408530a842fdb047e9994fe4b180e18c8a8c179acef68b620202,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/zshi/kubevirt-centos7:latest ImageID:quay.io/yadu/kubevirt-centos7@sha256:549ef79f0bb73f6385855db2384b8a21873b2f9a243031f4b40a70bbbee50ab7 ContainerID:cri-o://a61264c111d6408530a842fdb047e9994fe4b180e18c8a8c179acef68b620202 Started:0xc005b6801b}] QOSClass:Guaranteed EphemeralContainerStatuses:[]})
Jun 03 14:42:06 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: I0603 14:42:06.559312    2855 status_manager.go:569] Patch status for pod "virt-launcher-vmi-perf-ln9cb_default(4fd150b9-72b8-4510-a729-63d400600f35)" with "{\"metadata\":{\"uid\":\"4fd150b9-72b8-4510-a729-63d400600f35\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"message\":\"containers with unready status: [compute volumecontainerdisk]\",\"type\":\"Ready\"},{\"message\":\"containers with unready status: [compute volumecontainerdisk]\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa\",\"image\":\"registry.stage.redhat.io/container-native-virtualization/virt-launcher@sha256:1f7247af45ace7207601666be3ee7949a84d8bec3067aac2567f1c4d76a5f7fd\",\"imageID\":\"registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-virt-launcher@sha256:1f7247af45ace7207601666be3ee7949a84d8bec3067aac2567f1c4d76a5f7fd\",\"lastState\":{},\"name\":\"compute\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"cri-o://4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa\",\"exitCode\":255,\"finishedAt\":\"2020-06-03T14:42:05Z\",\"reason\":\"Error\",\"startedAt\":\"2020-06-03T07:55:58Z\"}}},{\"containerID\":\"cri-o://a61264c111d6408530a842fdb047e9994fe4b180e18c8a8c179acef68b620202\",\"image\":\"quay.io/zshi/kubevirt-centos7:latest\",\"imageID\":\"quay.io/yadu/kubevirt-centos7@sha256:549ef79f0bb73f6385855db2384b8a21873b2f9a243031f4b40a70bbbee50ab7\",\"lastState\":{},\"name\":\"volumecontainerdisk\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"cri-o://a61264c111d6408530a842fdb047e9994fe4b180e18c8a8c179acef68b620202\",\"exitCode\":255,\"finishedAt\":\"2020-06-03T11:52:29Z\",\"reason\":\"Error\",\"startedAt\":\"2020-06-03T07:56:10Z\"}}}],\"phase\":\"Failed\"}}"
Jun 03 14:42:06 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: I0603 14:42:06.543761    2855 kubelet.go:1958] SyncLoop (PLEG): "virt-launcher-vmi-perf-ln9cb_default(4fd150b9-72b8-4510-a729-63d400600f35)", event: &pleg.PodLifecycleEvent{ID:"4fd150b9-72b8-4510-a729-63d400600f35", Type:"ContainerDied", Data:"4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa"}
Jun 03 14:42:06 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: E0603 14:42:06.519691    2855 remote_runtime.go:351] ExecSync 4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa 'cat /var/run/kubevirt-infra/healthy' from runtime service failed: rpc error: code = Unknown desc = container is not created or running
Jun 03 14:42:06 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: E0603 14:42:06.519024    2855 remote_runtime.go:351] ExecSync 4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa 'cat /var/run/kubevirt-infra/healthy' from runtime service failed: rpc error: code = Unknown desc = container is not created or running
Jun 03 14:42:06 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: E0603 14:42:06.518562    2855 remote_runtime.go:351] ExecSync 4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa 'cat /var/run/kubevirt-infra/healthy' from runtime service failed: rpc error: code = Unknown desc = container is not created or running
Jun 03 14:42:05 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: E0603 14:42:05.537268    2855 remote_runtime.go:351] ExecSync 4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa 'cat /var/run/kubevirt-infra/healthy' from runtime service failed: rpc error: code = Unknown desc = container is not created or running
Jun 03 14:42:05 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: E0603 14:42:05.536863    2855 remote_runtime.go:351] ExecSync 4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa 'cat /var/run/kubevirt-infra/healthy' from runtime service failed: rpc error: code = Unknown desc = container is not created or running
Jun 03 14:42:05 cnv-qe-12.cnvqe.lab.eng.rdu2.redhat.com hyperkube[2855]: E0603 14:42:05.536539    2855 remote_runtime.go:351] ExecSync 4eaf4a5823e2c06a76b11d6cc391efc2233fd15c415822e57c6b70d043d1fbaa 'cat /var/run/kubevirt-infra/healthy' from runtime service failed: rpc error: code = Unknown desc = container is not created or running
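
To separate "the probe command fails" from "cri-o no longer considers the container alive", the same health checks the kubelet runs can be re-executed directly through crictl; a small sketch, with the container IDs taken from the logs above:

```
# volumecontainerdisk of centos3: same command the kubelet runs via ExecSync
crictl exec 796e0736e1c1e /usr/bin/container-disk --health-check
# compute of vmi-perf
crictl exec 4eaf4a5823e2c cat /var/run/kubevirt-infra/healthy
# If crictl itself answers "container is not created or running", the problem is
# cri-o's view of the container, not the probe command.
```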

Comment 28 Roman Mohr 2020-06-04 08:41:42 UTC
> @roman I rechecked Yan's cluster (the orignal one, not cnv.engineering). 'centos3' vmi fails on the volumecontainerdisk container and indeed it's proccess is still running as you mentioned.
> But vmi-perf fails on both compute and volumecontainerdisk and both of the processes do not exist.

> Since vmi-perf has the same 255 error described in the bug, but the process of the containters is not running, I don't think the kubelet bug is the root cause of the problem.
> However I still don't understand what causes the containers to crash/stop responding and why kubelet marks running (unhealthy) containers as terminated.

@Alona, as far as I could see, cri-o at some point just reports the container as down. At that point it no longer allows the readiness probe to execute inside the container. I don't think the failing readiness probes in the kubelet logs are related to the event where it marks the container as stopped; I think the kubelet simply followed cri-o here, which told it that the container is stopped.

We can and should continue the investigation, but it can very well be that the issue in cri-o manifests itself differently: sometimes the processes go down with this strange container stop, sometimes they stay. We should still treat https://bugzilla.redhat.com/show_bug.cgi?id=1843470 as very important from my perspective.

What do you think?

Comment 29 Alona Kaplan 2020-06-04 10:35:14 UTC
(In reply to Roman Mohr from comment #28)
> > @roman I rechecked Yan's cluster (the orignal one, not cnv.engineering). 'centos3' vmi fails on the volumecontainerdisk container and indeed it's proccess is still running as you mentioned.
> > But vmi-perf fails on both compute and volumecontainerdisk and both of the processes do not exist.
> 
> > Since vmi-perf has the same 255 error described in the bug, but the process of the containters is not running, I don't think the kubelet bug is the root cause of the problem.
> > However I still don't understand what causes the containers to crash/stop responding and why kubelet marks running (unhealthy) containers as terminated.
> 
> @Alone, as far as I could see, cri-o at some point just reports the
> container as down. At that point it will not allow the readiness probe to
> execute inside  the container anymore. I don't think that the failing
> readiness probe in the kubelet logs are related to the event where it marks
> the container as stopped. I think the kubelet simply followed cri-o here,
> which told it that it is stopped.
> 
> We can/should continue the investigation, but it can very well be that the
> issue in cri-o manifest itself different. Sometimes the processes go down
> this strange container stop, sometimes they stay. We should still treat
> https://bugzilla.redhat.com/show_bug.cgi?id=1843470 as very important from
> my perspective.
> 
> What do you think?

@Roman, I think https://bugzilla.redhat.com/1843470 is indeed very important. But it is about "CRI-O reports container as exited although the process is still running", and that is the case for just 1 out of the 4 VMs with the 255 error (the process still running). So even if bug 1843470 is solved, it probably won't solve our problem.

Do you think cri-o causes the container processes to go down?

@Nelly, since I could see the issue only on one cluster (for SR-IOV and non-SR-IOV VMs), do you know if it was reproduced on other clusters as well?

Comment 30 Roman Mohr 2020-06-04 10:41:29 UTC
(In reply to Alona Kaplan from comment #29)
> (In reply to Roman Mohr from comment #28)
> > > @roman I rechecked Yan's cluster (the orignal one, not cnv.engineering). 'centos3' vmi fails on the volumecontainerdisk container and indeed it's proccess is still running as you mentioned.
> > > But vmi-perf fails on both compute and volumecontainerdisk and both of the processes do not exist.
> > 
> > > Since vmi-perf has the same 255 error described in the bug, but the process of the containters is not running, I don't think the kubelet bug is the root cause of the problem.
> > > However I still don't understand what causes the containers to crash/stop responding and why kubelet marks running (unhealthy) containers as terminated.
> > 
> > @Alone, as far as I could see, cri-o at some point just reports the
> > container as down. At that point it will not allow the readiness probe to
> > execute inside  the container anymore. I don't think that the failing
> > readiness probe in the kubelet logs are related to the event where it marks
> > the container as stopped. I think the kubelet simply followed cri-o here,
> > which told it that it is stopped.
> > 
> > We can/should continue the investigation, but it can very well be that the
> > issue in cri-o manifest itself different. Sometimes the processes go down
> > this strange container stop, sometimes they stay. We should still treat
> > https://bugzilla.redhat.com/show_bug.cgi?id=1843470 as very important from
> > my perspective.
> > 
> > What do you think?
> 
> @Roman I think https://bugzilla.redhat.com/1843470 is indeed very important.
> But it is about "CRI-O reports container as exited although the process is
> still running".
> Since just in 1 out of the 4 vms with the 255 error it is the case (the
> process is still running). Even if bug 1843470 will be solved, it will
> probably won't solve our problem.
> 
> Do you think crio causes the container processes to go down?
> 

Yes, I think that something is wrong in cri-o. Maybe a shim crashes, or something like that, and sometimes it can properly clean up and sometimes it can't.
Do you think we should open a second issue just for the 255 case? I did not see it as disconnected.

> @Nelly, since I could see the issue only on one cluster (for sriov and
> non-sriov vms). Do you know if it was reproduced on other clusters as well?

Comment 31 Nelly Credi 2020-06-04 12:06:00 UTC
(In reply to Roman Mohr from comment #30)
> (In reply to Alona Kaplan from comment #29)
> > (In reply to Roman Mohr from comment #28)
> > > > @roman I rechecked Yan's cluster (the orignal one, not cnv.engineering). 'centos3' vmi fails on the volumecontainerdisk container and indeed it's proccess is still running as you mentioned.
> > > > But vmi-perf fails on both compute and volumecontainerdisk and both of the processes do not exist.
> > > 
> > > > Since vmi-perf has the same 255 error described in the bug, but the process of the containters is not running, I don't think the kubelet bug is the root cause of the problem.
> > > > However I still don't understand what causes the containers to crash/stop responding and why kubelet marks running (unhealthy) containers as terminated.
> > > 
> > > @Alone, as far as I could see, cri-o at some point just reports the
> > > container as down. At that point it will not allow the readiness probe to
> > > execute inside  the container anymore. I don't think that the failing
> > > readiness probe in the kubelet logs are related to the event where it marks
> > > the container as stopped. I think the kubelet simply followed cri-o here,
> > > which told it that it is stopped.
> > > 
> > > We can/should continue the investigation, but it can very well be that the
> > > issue in cri-o manifest itself different. Sometimes the processes go down
> > > this strange container stop, sometimes they stay. We should still treat
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1843470 as very important from
> > > my perspective.
> > > 
> > > What do you think?
> > 
> > @Roman I think https://bugzilla.redhat.com/1843470 is indeed very important.
> > But it is about "CRI-O reports container as exited although the process is
> > still running".
> > Since just in 1 out of the 4 vms with the 255 error it is the case (the
> > process is still running). Even if bug 1843470 will be solved, it will
> > probably won't solve our problem.
> > 
> > Do you think crio causes the container processes to go down?
> > 
> 
> Yes, I think that something is wrong in cri-o. Maybe a shim crashes, or
> something like that and sometimees it can properly clean up and sometimes it
> can't.
> Do you think we should open a second issue just for the 255 case? I did not
> see that as disconnected.
> 
> > @Nelly, since I could see the issue only on one cluster (for sriov and
> > non-sriov vms). Do you know if it was reproduced on other clusters as well?

No, not as far as I know.

Comment 32 Alona Kaplan 2020-06-08 05:54:04 UTC
(In reply to Roman Mohr from comment #30)
> (In reply to Alona Kaplan from comment #29)
> > (In reply to Roman Mohr from comment #28)
> > > > @roman I rechecked Yan's cluster (the orignal one, not cnv.engineering). 'centos3' vmi fails on the volumecontainerdisk container and indeed it's proccess is still running as you mentioned.
> > > > But vmi-perf fails on both compute and volumecontainerdisk and both of the processes do not exist.
> > > 
> > > > Since vmi-perf has the same 255 error described in the bug, but the process of the containters is not running, I don't think the kubelet bug is the root cause of the problem.
> > > > However I still don't understand what causes the containers to crash/stop responding and why kubelet marks running (unhealthy) containers as terminated.
> > > 
> > > @Alone, as far as I could see, cri-o at some point just reports the
> > > container as down. At that point it will not allow the readiness probe to
> > > execute inside  the container anymore. I don't think that the failing
> > > readiness probe in the kubelet logs are related to the event where it marks
> > > the container as stopped. I think the kubelet simply followed cri-o here,
> > > which told it that it is stopped.
> > > 
> > > We can/should continue the investigation, but it can very well be that the
> > > issue in cri-o manifest itself different. Sometimes the processes go down
> > > this strange container stop, sometimes they stay. We should still treat
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1843470 as very important from
> > > my perspective.
> > > 
> > > What do you think?
> > 
> > @Roman I think https://bugzilla.redhat.com/1843470 is indeed very important.
> > But it is about "CRI-O reports container as exited although the process is
> > still running".
> > Since just in 1 out of the 4 vms with the 255 error it is the case (the
> > process is still running). Even if bug 1843470 will be solved, it will
> > probably won't solve our problem.
> > 
> > Do you think crio causes the container processes to go down?
> > 
> 
> Yes, I think that something is wrong in cri-o. Maybe a shim crashes, or
> something like that and sometimees it can properly clean up and sometimes it
> can't.
> Do you think we should open a second issue just for the 255 case? I did not
> see that as disconnected.

Currently I have no idea what caused the 255 error, or whether it is cri-o or not.
But if it is cri-o, then yes, I think it deserves a second issue, since BZ 1843470 is only about the cleanup, not the root cause of the crash.

Let's wait for the results from the new clusters; maybe it was an issue specific to that one cluster.

> 
> > @Nelly, since I could see the issue only on one cluster (for sriov and
> > non-sriov vms). Do you know if it was reproduced on other clusters as well?

Comment 34 Roman Mohr 2020-06-10 09:01:09 UTC
Created attachment 1696467 [details]
Weird crash loops on a lot of pods

@Yan Du, the cluster looks pretty interesting in general. Do we track all the crash-looping pods in this system?

Comment 35 Yan Du 2020-06-10 10:01:02 UTC
Hi, Roman

Checking the cluster, I think all the pods that are crashed or stuck in ContainerCreating are located on the node host-172-16-0-12. As far as I know, Peter used a new cri-o binary on this host.

[cnv-qe-jenkins@cnv-executor-yadu24 ~]$ oc get pod -A -o wide| grep CrashLoopBackOff
openshift-cnv                                      ovs-cni-amd64-6hmmv                                               1/2     CrashLoopBackOff    79         7d5h    172.16.0.12    host-172-16-0-12   <none>           <none>
openshift-dns                                      dns-default-ww6sd                                                 2/3     CrashLoopBackOff    19         7d6h    10.129.2.49    host-172-16-0-12   <none>           <none>
openshift-sdn                                      ovs-5mksv                                                         0/1     CrashLoopBackOff    76         7d6h    172.16.0.12    host-172-16-0-12   <none>           <none>
openshift-sdn                                      sdn-ztd7q                                                         0/1     CrashLoopBackOff    56         7d6h    172.16.0.12    host-172-16-0-12   <none>           <none>
[cnv-qe-jenkins@cnv-executor-yadu24 ~]$ oc get pod -A -o wide| grep ContainerCreating
local-storage                                      local-block-local-diskmaker-x2crv                                 0/1     ContainerCreating   0          15m     <none>         host-172-16-0-12   <none>           <none>
local-storage                                      local-block-local-provisioner-94zvn                               0/1     ContainerCreating   0          15m     <none>         host-172-16-0-12   <none>           <none>
openshift-cnv                                      kubemacpool-mac-controller-manager-d46b4bf84-547rs                0/1     ContainerCreating   0          15m     <none>         host-172-16-0-12   <none>           <none>
openshift-cnv                                      virt-handler-vxszx                                                0/1     ContainerCreating   0          10m     <none>         host-172-16-0-12   <none>           <none>
openshift-monitoring                               prometheus-adapter-785d7fcd57-6vnzb                               0/1     ContainerCreating   0          46m     <none>         host-172-16-0-12   <none>           <none>
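
A quicker way to confirm that everything unhealthy sits on that one node; a small sketch, with the node name taken from the output above:

```
# List every non-healthy pod scheduled on the suspect node, across all namespaces
oc get pods -A -o wide --field-selector spec.nodeName=host-172-16-0-12 \
  | grep -Ev 'Running|Completed'
```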

Comment 36 sgott 2020-06-23 21:17:22 UTC
Moving this to ON_QE to reflect the state of https://bugzilla.redhat.com/show_bug.cgi?id=1843470

Comment 37 Yan Du 2020-06-24 05:00:17 UTC
Tested on the latest build; the issue cannot be reproduced.

Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd
container-native-virtualization-hco-bundle-registry:v2.3.0-388

$ oc get pod
NAME                          READY   STATUS    RESTARTS   AGE
virt-launcher-centos1-lb5hq   2/2     Running   0          26h
virt-launcher-centos2-nxmnd   2/2     Running   0          26h
virt-launcher-vma-ztghp       2/2     Running   0          26h
virt-launcher-vmb-dpqql       2/2     Running   0          26h
$ oc get vmi
NAME      AGE   PHASE     IP            NODENAME
centos1   26h   Running   10.131.0.45   verify-24-26lzt-worker-gbt5p
centos2   26h   Running   10.129.2.52   verify-24-26lzt-worker-nt9gs
vma       26h   Running   10.128.2.78   verify-24-26lzt-worker-7cdqc
vmb       26h   Running   10.128.2.79   verify-24-26lzt-worker-7cdqc

Comment 40 errata-xmlrpc 2020-07-28 19:10:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194

