Bug 1827325

Summary: [4.4] hyperkube: container with ID starting with ... not found: ID does not exist
Product: OpenShift Container Platform Reporter: Ryan Phillips <rphillips>
Component: Node    Assignee: Ryan Phillips <rphillips>
Status: CLOSED ERRATA QA Contact: Weinan Liu <weinliu>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.4    CC: aos-bugs, bparees, gmontero, jminter, jokerman, kgarriso, mpatel, pehunt, schoudha, scuppett, sttts, vlaad, weinliu, wking, wzheng, zyu
Target Milestone: ---    Keywords: Upgrades
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1826498 Environment:
Last Closed: 2020-05-04 11:50:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1819906, 1826498    
Bug Blocks: 1826329    

Description Ryan Phillips 2020-04-23 16:21:34 UTC
+++ This bug was initially created as a clone of Bug #1826498 +++

Description of problem:

 We get into this state when CNI times out on GCP. There are a couple of other CNI-related fixes going in.

 Apr 21 05:10:24.645254 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal crio[1325]: time="2020-04-21 05:10:24.644532275Z" level=info msg="Removed container 2dc97bea64ed280d94189461528946dfa0366d5cfae5ff5f3aa5ca65bf54b835: openshift-apiserver/apiserver-5b9677479-z84j7/fix-audit-permissions" id=24bde45f-e983-4b0a-9009-e7d29e25f782
 Apr 21 05:10:24.647175 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal crio[1325]: time="2020-04-21 05:10:24.647072805Z" level=info msg="attempting to run pod sandbox with infra container: openshift-apiserver/apiserver-5b9677479-z84j7/POD" id=01e8cbfe-7ec5-4385-9186-f0f9831ab297
 Apr 21 05:10:24.913845 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal hyperkube[1424]: I0421 05:10:24.913201    1424 manager.go:950] Added container: "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8d047e85_f52d_42e9_bd7d_4d2626bd52c8.slice/crio-aeb8e6443733adfa7bf944bfe5b777915a1cbb86d290c71704f3beac5c2f84a9.scope" (aliases: [k8s_POD_apiserver-5b9677479-z84j7_openshift-apiserver_8d047e85-f52d-42e9-bd7d-4d2626bd52c8_0 aeb8e6443733adfa7bf944bfe5b777915a1cbb86d290c71704f3beac5c2f84a9], namespace: "crio")
 Apr 21 05:10:25.410752 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal hyperkube[1424]: E0421 05:10:25.410720    1424 kuberuntime_manager.go:955] getPodContainerStatuses for pod "apiserver-5b9677479-z84j7_openshift-apiserver(8d047e85-f52d-42e9-bd7d-4d2626bd52c8)" failed: rpc error: code = Unknown desc = container with ID starting with 2dc97bea64ed280d94189461528946dfa0366d5cfae5ff5f3aa5ca65bf54b835 not found: ID does not exist
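
A quick way to confirm the mismatch on an affected node (a rough diagnostic sketch; the container ID is the one from the log excerpt above):

# the kubelet keeps asking cri-o about a container cri-o has already removed
CID=2dc97bea64ed280d94189461528946dfa0366d5cfae5ff5f3aa5ca65bf54b835
sudo crictl inspect "$CID" || echo "container is gone from cri-o"
# the journal shows the removal, then the kubelet's failed status lookup
sudo journalctl -u crio | grep "Removed container $CID"
sudo journalctl -u kubelet | grep "container with ID starting with $CID"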


Version-Release number of selected component (if applicable):
4.5

How reproducible:
related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1826329
CI job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4/285

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Jim Minter on 2020-04-23 16:19:20 UTC ---

I'm seeing an issue where crictl shows that a pod has exited but the kubelet hasn't realised it, and OpenShift still thinks it's running. Will attach logs. The pod in question is oauth-openshift-7fd76bfc9-vv47t. It lived around 2020/04/23 04:43:26 - 04:43:34 UTC. At 04:43:27 the kubelet shows SyncLoop (PLEG) ContainerStarted *and* ContainerDied messages. Is there some error in cri-o that causes it to signal ContainerDied too early?

Apr 23 04:43:27 aro-master-1 hyperkube[3632442]: I0423 04:43:27.659181 3632442 kubelet.go:1953] SyncLoop (PLEG): "oauth-openshift-7fd76bfc9-vv47t_openshift-authentication(f9b3ca89-25c9-4852-af2b-4f7abf890a04)", event: &pleg.PodLifecycleEvent{ID:"f9b3ca89-25c9-4852-af2b-4f7abf890a04", Type:"ContainerDied", Data:"cf93a4b04ab3796b028640cca6c39baaf294152387f1bf75de13995b969e8906"}
Apr 23 04:43:27 aro-master-1 hyperkube[3632442]: I0423 04:43:27.659472 3632442 kubelet.go:1953] SyncLoop (PLEG): "oauth-openshift-7fd76bfc9-vv47t_openshift-authentication(f9b3ca89-25c9-4852-af2b-4f7abf890a04)", event: &pleg.PodLifecycleEvent{ID:"f9b3ca89-25c9-4852-af2b-4f7abf890a04", Type:"ContainerStarted", Data:"2d805ae593298f6ab7af88544db5fa1b35bf6aba05f3d6d415203ca9b866ef6c"}


crictl inspect 2d805ae593298 | grep finishedAt
    "finishedAt": "2020-04-23T04:43:33.401990767Z",


cri-o logs don't seem to show much.
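
One way to cross-check the kubelet's view against cri-o's on a node in this state (a rough sketch; the pod name, container ID, and pod UID are taken from the log lines above):

# cri-o's view of the pod sandbox and the container's state
sudo crictl pods | grep oauth-openshift-7fd76bfc9-vv47t
sudo crictl inspect 2d805ae593298 | grep -E '"state"|finishedAt'
# the kubelet's PLEG events for the same pod UID
sudo journalctl -u kubelet | grep 'SyncLoop (PLEG).*f9b3ca89-25c9-4852-af2b-4f7abf890a04'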

--- Additional comment from Jim Minter on 2020-04-23 16:19:37 UTC ---

^ the above on 4.3.10.

Comment 1 Ben Parees 2020-04-23 18:45:04 UTC
Moving to 4.4.0 based on a Slack discussion where we are now seeing this on 4.4.0 install (not upgrade) jobs.

https://coreos.slack.com/archives/CK1AE4ZCK/p1587663762462100

Comment 3 Ryan Phillips 2020-04-24 17:07:04 UTC
*** Bug 1824353 has been marked as a duplicate of this bug. ***

Comment 4 Peter Hunt 2020-04-24 17:28:20 UTC
The linked PR caps the time allowed for the CNI request at half of the timeout on the request that was sent. That should fix this issue.

Comment 5 Kirsten Garrison 2020-04-25 00:17:51 UTC
https://github.com/cri-o/cri-o/pull/3659 has merged. Moving to MODIFIED.

Comment 6 Kirsten Garrison 2020-04-25 00:41:10 UTC
QE should hold off until RHCOS is bumped. This work is being done at the same time as the builds for: https://bugzilla.redhat.com/show_bug.cgi?id=1826896

Comment 9 Kirsten Garrison 2020-04-25 01:13:44 UTC
Moving back to Post until updated image available for QE (which is in progress now).

Comment 12 Ryan Phillips 2020-04-27 15:54:20 UTC
Confirmed. We should verify this BZ and make sure we don't see the original symptom.

It does look like 1826329 is a followup.

Comment 13 Weinan Liu 2020-04-27 16:07:37 UTC
Based on comment 11 and checks across multiple upgrade tests, the fix is working; the issue was not reproduced when upgrading to 4.4.0-rc.12 (4.4.0-0.nightly-2020-04-26-205915).
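
For reference, a sketch of the kind of post-upgrade check (not the exact QE procedure; the grep string is the error from the original report):

# confirm the cluster is on the fixed build
oc get clusterversion
# scan kubelet logs on the masters for the original symptom
oc adm node-logs --role=master -u kubelet | grep 'ID does not exist' || echo 'symptom not seen'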

Comment 14 Gabe Montero 2020-04-27 16:32:11 UTC
As part of looking into the samples operator delays noted in comment 11, I tried to download http://10.73.131.57:9000/minio/openshift-must-gather/2020-04-26-11-12-42/must-gather.local.6992914659817673956.tar.gz, but it appears to no longer be there. I was able to use the access/secret keys, and I see a lot of similar directories at http://10.73.131.57:9000/minio/openshift-must-gather/, but not 2020-04-26-11-12-42.
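
For reference, the same listing and download can be done with the MinIO client (mc); the alias name and object path below are placeholders, assuming the shared access/secret keys (older mc releases use "mc config host add" instead of "mc alias set"):

mc alias set qe-minio http://10.73.131.57:9000 "$ACCESS_KEY" "$SECRET_KEY"
mc ls qe-minio/openshift-must-gather/
mc cp qe-minio/openshift-must-gather/<timestamp-dir>/must-gather.local.<id>.tar.gz .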

Is there perhaps another must-gather in one of the other subdirectories I can look at?

Or can http://10.73.131.57:9000/minio/openshift-must-gather/2020-04-26-11-12-42/must-gather.local.6992914659817673956.tar.gz be reposted?

thanks

Comment 15 Gabe Montero 2020-04-27 16:42:49 UTC
OK, Weinan Liu and I corresponded on Slack and got the MinIO trick sorted out.

Comment 17 errata-xmlrpc 2020-05-04 11:50:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581