Description of problem:

The kubelet log is full of `container with ID starting with 1a5ff587bc9ec0987e3dcc471611aede0840d329725c6a6d32bbd1ddd1a22860 not found: ID does not exist` errors. This is a high-churn cluster; not sure if that is relevant.

Total log lines:

$ cat /tmp/node.log | wc -l
2877604

Occurrences of the error:

$ grep 'not found: ID does not exist' /tmp/node.log | wc -l
57034

Version-Release number of selected component (if applicable):

ip-10-0-144-174.ec2.internal   Ready   worker   15d   v1.16.2   <<redacted>>   <none>   Red Hat Enterprise Linux CoreOS 43.81.202003230848.0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.16.3-28.dev.rhaos4.3.git9aad8e4.el8

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Searching for crio-related errors revealed:

Mar 27 00:17:45.936725 ip-10-0-131-192 crio[1338]: 2020-03-27T00:17:45Z [error] setPodNetworkAnnotation: status update failed for pod /: resource name may not be empty
Mar 27 00:17:45.936725 ip-10-0-131-192 crio[1338]: 2020-03-27T00:17:45Z [error] SetNetworkStatus: failed to update the pod knative-serving-test-autoscale-build in out of cluster comm: setPodNetworkAnnotation: status update failed for pod /: resource name may not be empty
From the log there seemed to be some crio-related errors:

Mar 27 00:17:45.936817 ip-10-0-131-192 crio[1338]: 2020-03-27T00:17:45Z [error] Multus: error unsetting the networks status: SetNetworkStatus: failed to update the pod knative-serving-test-autoscale-build in out of cluster comm: setPodNetworkAnnotation: status update failed for pod /: resource name may not be empty
Mar 27 10:51:17.800061 ip-10-0-131-192 crio[1338]: 2020-03-27T10:51:17Z [error] Multus: error unsetting the networks status: SetNetworkStatus: failed to query the pod rpm-repo-89d95fc74-9x7m2 in out of cluster comm: pods "rpm-repo-89d95fc74-9x7m2" not found
Mar 27 11:25:42.681096 ip-10-0-131-192 crio[1338]: time="2020-03-27T11:25:42Z" level=error msg="container not running"
Mar 27 11:25:42.785522 ip-10-0-131-192 crio[1338]: 2020-03-27T11:25:42Z [error] SetNetworkStatus: failed to query the pod rpm-repo-fbc4774d9-4vc6c in out of cluster comm: pods "rpm-repo-fbc4774d9-4vc6c" not found
Mar 31 09:16:34.873955 ip-10-0-131-192 crio[1338]: time="2020-03-31 09:16:34.873911661Z" level=error msg="failed to create stopped file in container state. Restore may fail: open /var/lib/containers/storage/overlay-containers/cfb1383f1ed726080479240f0aa31f6bf98be539036aeee6712fa5eac7c921ba/userdata/stopped: no such file or directory"

However, I searched the cri-o codebase but didn't find setPodNetworkAnnotation or SetNetworkStatus. Peter, mind taking a look to see if you have any ideas? Thanks.
I noticed that the container ID 1a5ff, mentioned on Mar 27, appeared again toward the end of the log:

Mar 31 09:46:30.990427 ip-10-0-131-192 hyperkube[1618]: I0331 09:46:30.990384 1618 kubelet_pods.go:909] Pod "openshift-acme-exposer-build_ci-op-i7xcfyv9(e7195011-55aa-4cc2-90eb-4e57cf931b37)" is terminated, Error getting runtimeStatus from the podCache: rpc error: code = Unknown desc = container with ID starting with 1a5ff587bc9ec0987e3dcc471611aede0840d329725c6a6d32bbd1ddd1a22860 not found: ID does not exist

Judging from the pod name, openshift-acme-exposer-build_ci-op, this cluster seems to be involved in CI builds. I haven't found a clue for container ID 1a5ff yet; I am going to look at the other missing container IDs.
> I observed panic from TestUnconfiguredManifests.
> Was the cluster used to run the test suite?

The cluster is used to host the test pods created by all of OpenShift CI.
For container ID 390f6f9d11a70928994798f2b9611e9860017a3516dab52548b936b7b5e73449, the log shows its container exited normally - yet ContainerStatus() still complained that the container ID was missing. Looking at the callers of ContainerStatus(), some are runtime API queries (a sketch of such a query is below). Was there any failed test involving a terminated container? Thanks.
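For reference, such a runtime API query can be reproduced directly against the CRI-O socket. A minimal sketch, assuming the default RHCOS socket path and the CRI v1alpha2 gRPC API this kubelet version speaks; the container ID is just the one from the error above:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	const sock = "/var/run/crio/crio.sock" // default CRI-O socket on RHCOS (assumption)
	conn, err := grpc.Dial(sock,
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", sock)
		}))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Ask the runtime for the status of a container that may already be gone.
	client := runtimeapi.NewRuntimeServiceClient(conn)
	resp, err := client.ContainerStatus(ctx, &runtimeapi.ContainerStatusRequest{
		ContainerId: "390f6f9d11a70928994798f2b9611e9860017a3516dab52548b936b7b5e73449",
	})
	if err != nil {
		// For a removed container CRI-O answers with
		// "... not found: ID does not exist", as seen in the kubelet log.
		fmt.Println("ContainerStatus error:", err)
		return
	}
	fmt.Printf("state: %s, exit code: %d\n", resp.Status.State, resp.Status.ExitCode)
}
```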
I would be interested to know whether similar errors are observed on an OCP build which contains the backport of the following:

Merge pull request #88440 from smarterclayton/container_success_fix

Ensure Kubelet always reports terminating pod container status
> Was there any failed test involving a terminated container?

Sorry, I don't understand the question, can you elaborate?

> I would be interested to know whether similar errors are observed on an OCP build which contains the backport of the following:
>
> Merge pull request #88440 from smarterclayton/container_success_fix
>
> Ensure Kubelet always reports terminating pod container status

Is there any way for us to get such a build?
I checked the origin/release-4.4 and origin/release-4.5 branches, which don't have the backport. I will let you know when the 1.18 rebase lands.

> Was there any failed test involving a terminated container?

I would expect the missing container to have some effect on test suite execution. Please keep an eye out and post here if you see a test failure that is potentially related.

I have checked some of the container IDs appearing in the 'container with ID starting with' error message (a small helper for extracting the distinct IDs is below). From what I observed, most of the containers exited normally. I will keep searching for a container ID which didn't go through the normal termination steps.
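For anyone repeating that search, a minimal helper, assuming the same /tmp/node.log capture as in the description; it tallies the distinct container IDs behind the 'not found: ID does not exist' errors:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	f, err := os.Open("/tmp/node.log") // log path from the bug description
	if err != nil {
		panic(err)
	}
	defer f.Close()

	re := regexp.MustCompile(`container with ID starting with ([0-9a-f]{64}) not found`)
	counts := map[string]int{}

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // journal lines can be long
	for scanner.Scan() {
		if m := re.FindStringSubmatch(scanner.Text()); m != nil {
			counts[m[1]]++
		}
	}
	// Print each distinct container ID with its error count.
	for id, n := range counts {
		fmt.Printf("%s %d\n", id, n)
	}
}
```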
With the following PR, we should have more clues about crio-related issues: https://github.com/openshift/machine-config-operator/pull/1613
Can we have that setting without needing a new MCO version, or is it hardcoded?
It is hardcoded in crio.yaml for the container runtime.
Looking at upstream Kubernetes issues, we may see this kind of log when a pod is stuck in the "Terminating" state. Please keep an eye out for stuck pods during test runs (a small sketch for spotting them is below). Thanks.
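A minimal client-go sketch for spotting such pods, i.e. pods whose deletion was requested a while ago but which are still around; the kubeconfig path and the five-minute threshold are assumptions:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // assumed path
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List pods in all namespaces. (On client-go >= 0.18, List takes a
	// context.Context as its first argument.)
	pods, err := clientset.CoreV1().Pods("").List(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// A pod with DeletionTimestamp set is terminating; if the
		// timestamp is old, the pod is likely stuck.
		if p.DeletionTimestamp != nil && time.Since(p.DeletionTimestamp.Time) > 5*time.Minute {
			fmt.Printf("%s/%s terminating since %s\n",
				p.Namespace, p.Name, p.DeletionTimestamp.Time)
		}
	}
}
```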
Here were the container IDs and their pod names:

1a5ff587bc9ec0987e3dcc471611aede0840d329725c6a6d32bbd1ddd1a22860   openshift-acme-exposer-build_ci-op-i7xcfyv9
390f6f9d11a70928994798f2b9611e9860017a3516dab52548b936b7b5e73449   rpm-repo-545bf4b666-n9ghf_ci-op-mx19lnb6
25425991358f401242168ecf117b9f63b8d85b22c5268462aa485e86e21641f6   a155836f-7030-11ea-9bb8-0a58ac106d04_ci

We can see that various CI tests exhibited this behavior. They do not seem to be inside the origin repo.
Sorry for the super late reply; I'm going through my needinfos very late... It seems SetNetworkStatus and setPodNetworkAnnotation come from Multus, the networking plugin, not CRI-O.
This is where the log message came from, in Kubelet#PodResourcesAreReclaimed:

pkg/kubelet/kubelet_pods.go:
klog.V(3).Infof("Pod %q is terminated, Error getting runtimeStatus from the podCache: %s", format.Pod(pod), err)

Looking closer at the log attached to this BZ, I also saw:

Mar 27 13:42:01.063858 ip-10-0-131-192 hyperkube[1618]: I0327 13:42:00.992750 1618 status_manager.go:641] Pod status is inconsistent with cached status for pod "src-build_ci-op-fkystnk8(47434544-abbc-43a0-abb0-c2ff2e319a5f)", a reconciliation should be triggered:
Mar 27 13:42:01.063858 ip-10-0-131-192 hyperkube[1618]: interface{}(

The above has been fixed in 1.19 upstream. It would be helpful if we can collect logs from a more recent OpenShift build. A condensed sketch of the PodResourcesAreReclaimed check follows. Thanks.
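For context, a condensed sketch of that check, based on the 1.16-era pkg/kubelet/kubelet_pods.go (not the verbatim source; identifiers as in upstream):

```go
// Condensed from upstream pkg/kubelet/kubelet_pods.go (1.16-era); not verbatim.
// PodResourcesAreReclaimed reports whether a terminated pod's node-level
// resources have been reclaimed, so the pod object can be deleted.
func (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {
	if !notRunning(status.ContainerStatuses) {
		// Pods with running containers must not be deleted yet.
		klog.V(3).Infof("Pod %q is terminated, but some containers are still running", format.Pod(pod))
		return false
	}
	// The quoted error is logged here: the kubelet asks the pod cache
	// (populated from the runtime via the PLEG) for runtime status, and
	// CRI-O answers "ID does not exist" for the already-removed container.
	runtimeStatus, err := kl.podCache.Get(pod.UID)
	if err != nil {
		klog.V(3).Infof("Pod %q is terminated, Error getting runtimeStatus from the podCache: %s", format.Pod(pod), err)
		return false
	}
	if len(runtimeStatus.ContainerStatuses) > 0 {
		// The runtime has not finished cleaning up the containers.
		return false
	}
	// ... volume and pod-cgroup checks elided ...
	return true
}
```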
https://github.com/openshift/origin/pull/24841 backport of 88440 to 4.3
https://github.com/openshift/origin/pull/24649 backport of 88440 to 4.4 (merged)
Thanks! Do you know if it's possible to figure out whether a particular release contains a particular PR, so we can check if we have it? Also, I guess even after we have a release with the patch we will have to rotate the nodes, right?
> we will have to rotate the nodes

Did you mean applying the release on the respective nodes? Yes.

> release contains a particular PR

I still need to find out whether the commit is contained in a z-stream (4.3.z) release. Looking at origin, I only see release branches such as origin/release-4.4.
W.r.t. tracking a specific commit: at the start of the log we should see something similar to:

Apr 09 06:44:09.993346 localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-e39f93a64996750e818eafa63638b8b9ac8233fc24d3bc1cc1a552d3bd882036/vmlinuz-4.18.0-147.8.1.el8_1.x86_64 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/e39f93a64996750e818eafa63638b8b9ac8233fc24d3bc1cc1a552d3bd882036/0 ignition.platform.id=gcp

The hash from rhcos can be correlated with an origin commit.
The following 4.3 release contains the fix: registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2020-04-13-101419
*** Bug 1822268 has been marked as a duplicate of this bug. ***
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
*** Bug 1826498 has been marked as a duplicate of this bug. ***
*** Bug 1822566 has been marked as a duplicate of this bug. ***
4.4 received many crio fixes related to the `ID does not exist` and `context deadline exceeded` issues.