Bug 1819906 - Kubelet log full of container with ID starting with id <<some-id>> not found: ID does not exist errors
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ted Yu
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1822566 1826498 (view as bug list)
Depends On:
Blocks: 1822773 1827325
Reported: 2020-04-01 20:27 UTC by aaleman
Modified: 2020-05-11 22:36 UTC (History)
11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-11 22:36:20 UTC
Target Upstream Version:
Embargoed:



Description aaleman 2020-04-01 20:27:57 UTC
Description of problem:


The kubelet log is full of `container with ID starting with 1a5ff587bc9ec0987e3dcc471611aede0840d329725c6a6d32bbd1ddd1a22860 not found: ID does not exist` errors. This is a high-churn cluster; not sure if that is relevant.

Total log lines:

$ cat /tmp/node.log | wc -l
2877604

Occurrences of the error:

$ grep 'not found: ID does not exist' /tmp/node.log | wc -l
57034
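Raw occurrence counts overstate how many containers are affected, since each missing container is reported repeatedly. A small sketch to count the distinct container IDs instead, assuming the log path and the message format shown above:

```shell
# Count distinct 64-hex container IDs mentioned in the
# "ID does not exist" errors (path and message format as shown above).
grep -o 'container with ID starting with [0-9a-f]\{64\}' /tmp/node.log \
  | awk '{print $NF}' | sort -u | wc -l
```

Comparing that number to the 57034 raw hits would show how often each missing container is re-reported.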


Version-Release number of selected component (if applicable):

ip-10-0-144-174.ec2.internal   Ready    worker   15d   v1.16.2   <<redacted>>   <none>        Red Hat Enterprise Linux CoreOS 43.81.202003230848.0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.16.3-28.dev.rhaos4.3.git9aad8e4.el8


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Ted Yu 2020-04-01 20:51:31 UTC
Searching for crio-related errors revealed:

Mar 27 00:17:45.936725 ip-10-0-131-192 crio[1338]: 2020-03-27T00:17:45Z [error] setPodNetworkAnnotation: status update failed for pod /: resource name may not be empty
Mar 27 00:17:45.936725 ip-10-0-131-192 crio[1338]: 2020-03-27T00:17:45Z [error] SetNetworkStatus: failed to update the pod knative-serving-test-autoscale-build in out of cluster comm: setPodNetworkAnnotation: status update failed for pod /: resource name may not be empty

Comment 3 Ted Yu 2020-04-02 17:31:13 UTC
From the log there seemed to be some crio related errors:

Mar 27 00:17:45.936817 ip-10-0-131-192 crio[1338]: 2020-03-27T00:17:45Z [error] Multus: error unsetting the networks status: SetNetworkStatus: failed to update the pod knative-serving-test-autoscale-build in out of cluster comm: setPodNetworkAnnotation: status update failed for pod /: resource name may not be empty

Mar 27 10:51:17.800061 ip-10-0-131-192 crio[1338]: 2020-03-27T10:51:17Z [error] Multus: error unsetting the networks status: SetNetworkStatus: failed to query the pod rpm-repo-89d95fc74-9x7m2 in out of cluster comm: pods "rpm-repo-89d95fc74-9x7m2" not found

Mar 27 11:25:42.681096 ip-10-0-131-192 crio[1338]: time="2020-03-27T11:25:42Z" level=error msg="container not running"
Mar 27 11:25:42.785522 ip-10-0-131-192 crio[1338]: 2020-03-27T11:25:42Z [error] SetNetworkStatus: failed to query the pod rpm-repo-fbc4774d9-4vc6c in out of cluster comm: pods "rpm-repo-fbc4774d9-4vc6c" not found

Mar 31 09:16:34.873955 ip-10-0-131-192 crio[1338]: time="2020-03-31 09:16:34.873911661Z" level=error msg="failed to create stopped file in container state. Restore may fail: open /var/lib/containers/storage/overlay-containers/cfb1383f1ed726080479240f0aa31f6bf98be539036aeee6712fa5eac7c921ba/userdata/stopped: no such file or directory"

However, I searched the cri-o codebase and didn't find setPodNetworkAnnotation or SetNetworkStatus.

Peter, would you mind taking a look to see if you have any ideas?

thanks

Comment 4 Ted Yu 2020-04-02 17:47:44 UTC
I noticed that the container ID 1a5ff, mentioned on Mar 27, appeared toward the end of the log:

Mar 31 09:46:30.990427 ip-10-0-131-192 hyperkube[1618]: I0331 09:46:30.990384    1618 kubelet_pods.go:909] Pod "openshift-acme-exposer-build_ci-op-i7xcfyv9(e7195011-55aa-4cc2-90eb-4e57cf931b37)" is terminated, Error getting runtimeStatus from the podCache: rpc error: code = Unknown desc = container with ID starting with 1a5ff587bc9ec0987e3dcc471611aede0840d329725c6a6d32bbd1ddd1a22860 not found: ID does not exist

Looking at the operator name, openshift-acme-exposer-build_ci-op, this cluster seems to be involved in CI builds.

I haven't found a clue for container ID 1a5ff yet.
I am going to look at the other missing container IDs.

Comment 6 aaleman 2020-04-02 18:16:11 UTC
> I observed panic from TestUnconfiguredManifests.
> Was the cluster used to run test suite ?

The cluster is used to host the test pods created by all OpenShift CI.

Comment 7 Ted Yu 2020-04-02 18:31:27 UTC
For container ID 390f6f9d11a70928994798f2b9611e9860017a3516dab52548b936b7b5e73449, I saw in the log that its container exited normally - yet ContainerStatus() still complained that the container ID was missing.

Looking at the calls to ContainerStatus(), some come from runtime API queries.

Was there any failed test involving a terminated container?


Thanks

Comment 8 Ted Yu 2020-04-02 18:43:54 UTC
I would be interested to know whether similar errors are observed on an OCP build that contains a backport of the following:

    Merge pull request #88440 from smarterclayton/container_success_fix

    Ensure Kubelet always reports terminating pod container status

Comment 9 aaleman 2020-04-02 19:13:22 UTC
> Was there any failed test involving a terminated container?

Sorry, I don't understand the question, can you elaborate?

> I would be interested to know whether similar errors are observed on an OCP build that contains a backport of the following:
>
>   Merge pull request #88440 from smarterclayton/container_success_fix
>
>    Ensure Kubelet always reports terminating pod container status

Is there any way for us to get such a build?

Comment 10 Ted Yu 2020-04-02 19:39:21 UTC
I checked the origin/release-4.4 and origin/release-4.5 branches; they don't have the backport.
I will let you know when the 1.18 rebase lands.

> Was there any failed test involving a terminated container?

I would expect the missing container to have some effect on test suite execution. Please keep an eye out and post here if you see a potentially related test failure.

I have checked some of the container IDs appearing in the 'container with ID starting with' error message.
From what I observed, most of the containers exited normally.

I will keep searching for container IDs that didn't go through the normal termination steps.

Comment 11 Ted Yu 2020-04-03 18:19:07 UTC
With the following PR, we should have more clues about crio-related issues:

https://github.com/openshift/machine-config-operator/pull/1613

Comment 12 aaleman 2020-04-03 18:35:04 UTC
Can we have that setting without needing a new MCO version, or is it hardcoded?

Comment 13 Ted Yu 2020-04-03 18:39:00 UTC
It is hardcoded in crio.yaml for the container runtime.

Comment 14 Ted Yu 2020-04-06 14:51:06 UTC
Looking at Kubernetes issues, if a Pod is stuck in the "Terminating" state, we may see such log messages.

Please keep an eye out for stuck pods during test runs.

Thanks

Comment 15 Ted Yu 2020-04-06 15:22:59 UTC
Here are the container IDs and their pod names:

1a5ff587bc9ec0987e3dcc471611aede0840d329725c6a6d32bbd1ddd1a22860  openshift-acme-exposer-build_ci-op-i7xcfyv9
390f6f9d11a70928994798f2b9611e9860017a3516dab52548b936b7b5e73449  rpm-repo-545bf4b666-n9ghf_ci-op-mx19lnb6
25425991358f401242168ecf117b9f63b8d85b22c5268462aa485e86e21641f6  a155836f-7030-11ea-9bb8-0a58ac106d04_ci

We can see that various CI tests exhibited this behavior.
They do not seem to be inside the origin repo.
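The ID-to-pod pairs above were matched by hand; they could also be pulled out of the kubelet log mechanically. A sketch, assuming the message format of the `Error getting runtimeStatus from the podCache` line quoted in comment 4:

```shell
# Emit "container-id  pod-name" pairs from the kubelet error lines.
# Message format assumed from the excerpt in comment 4; /tmp/node.log
# is the collected node log.
sed -n 's/.*Pod "\([^"]*\)".*container with ID starting with \([0-9a-f]\{64\}\).*/\2  \1/p' \
  /tmp/node.log | sort -u
```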

Comment 16 Peter Hunt 2020-04-06 18:13:27 UTC
sorry for the super late reply, going through my needinfos very late...

it seems SetNetworkStatus and setPodNetworkAnnotation come from Multus, the networking plugin, not CRI-O

Comment 17 Ted Yu 2020-04-06 20:08:09 UTC
This is where the log message comes from, in Kubelet#PodResourcesAreReclaimed:

pkg/kubelet/kubelet_pods.go:
            klog.V(3).Infof("Pod %q is terminated, Error getting runtimeStatus from the podCache: %s", format.Pod(pod), err)

Looking closer at the log attached to this BZ I also saw:

Mar 27 13:42:01.063858 ip-10-0-131-192 hyperkube[1618]: I0327 13:42:00.992750    1618 status_manager.go:641] Pod status is inconsistent with cached status for pod "src-build_ci-op-fkystnk8(47434544-abbc-43a0-abb0-c2ff2e319a5f)", a reconciliation should be triggered:
Mar 27 13:42:01.063858 ip-10-0-131-192 hyperkube[1618]:    interface{}(

The above has been fixed in 1.19 upstream.
It would be helpful if we could collect logs from a more recent OpenShift build.

Thanks

Comment 18 Ted Yu 2020-04-08 15:27:33 UTC
https://github.com/openshift/origin/pull/24841

backport of #88440 to 4.3

https://github.com/openshift/origin/pull/24649

backport of #88440 to 4.4 (merged)

Comment 19 aaleman 2020-04-08 15:51:51 UTC
Thanks! Do you know if it's possible to figure out whether a particular release contains a particular PR, so we can check if we have it?

Also, I guess even after we have a release with the patch we will have to rotate the nodes, right?

Comment 20 Ted Yu 2020-04-08 16:09:57 UTC
> we will have to rotate the nodes

Did you mean applying the release on the respective nodes? Yes.

> release contains a particular PR

I still need to find out whether the commit is contained in the z-stream (4.3.z) release.
Looking at origin, I only see release branches such as origin/release-4.4.

Comment 22 Ted Yu 2020-04-09 20:41:59 UTC
W.r.t. tracking the specific commit: at the start of the log we should see something similar to:

Apr 09 06:44:09.993346 localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-e39f93a64996750e818eafa63638b8b9ac8233fc24d3bc1cc1a552d3bd882036/vmlinuz-4.18.0-147.8.1.el8_1.x86_64 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/e39f93a64996750e818eafa63638b8b9ac8233fc24d3bc1cc1a552d3bd882036/0 ignition.platform.id=gcp

The rhcos hash in that path can be correlated with an origin commit.
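A sketch of that extraction step, assuming the journal line format shown above; this only pulls out the ostree deployment hash, and mapping it to an origin commit still has to be done against the release metadata:

```shell
# Extract the 64-hex ostree deployment hash from the boot-time
# "Command line:" journal entry (format as in the sample above).
sed -n 's/.*ostree=\/ostree\/boot\.[0-9]*\/rhcos\/\([0-9a-f]\{64\}\).*/\1/p' /tmp/node.log | head -1
```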

Comment 24 Ryan Phillips 2020-04-15 18:34:46 UTC
The 4.3 release image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2020-04-13-101419 contains the fix.

Comment 25 Ryan Phillips 2020-04-15 18:43:49 UTC
*** Bug 1822268 has been marked as a duplicate of this bug. ***

Comment 30 Stephen Cuppett 2020-04-20 12:24:06 UTC
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 32 Ryan Phillips 2020-05-11 15:36:30 UTC
*** Bug 1826498 has been marked as a duplicate of this bug. ***

Comment 33 Ryan Phillips 2020-05-11 22:35:10 UTC
*** Bug 1822566 has been marked as a duplicate of this bug. ***

Comment 34 Ryan Phillips 2020-05-11 22:36:20 UTC
4.4 received many crio fixes related to `ID does not exist` and `context deadline exceeded` issues.

