+++ This bug was initially created as a clone of Bug #1826498 +++

Description of problem:

We get into this state when CNI times out on GCP. There are a couple of other CNI-related fixes going in.

238806:Apr 21 05:10:24.645254 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal crio[1325]: time="2020-04-21 05:10:24.644532275Z" level=info msg="Removed container 2dc97bea64ed280d94189461528946dfa0366d5cfae5ff5f3aa5ca65bf54b835: openshift-apiserver/apiserver-5b9677479-z84j7/fix-audit-permissions" id=24bde45f-e983-4b0a-9009-e7d29e25f782
238807:Apr 21 05:10:24.647175 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal crio[1325]: time="2020-04-21 05:10:24.647072805Z" level=info msg="attempting to run pod sandbox with infra container: openshift-apiserver/apiserver-5b9677479-z84j7/POD" id=01e8cbfe-7ec5-4385-9186-f0f9831ab297
238828:Apr 21 05:10:24.913845 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal hyperkube[1424]: I0421 05:10:24.913201 1424 manager.go:950] Added container: "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8d047e85_f52d_42e9_bd7d_4d2626bd52c8.slice/crio-aeb8e6443733adfa7bf944bfe5b777915a1cbb86d290c71704f3beac5c2f84a9.scope" (aliases: [k8s_POD_apiserver-5b9677479-z84j7_openshift-apiserver_8d047e85-f52d-42e9-bd7d-4d2626bd52c8_0 aeb8e6443733adfa7bf944bfe5b777915a1cbb86d290c71704f3beac5c2f84a9], namespace: "crio")
238898:Apr 21 05:10:25.410752 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal hyperkube[1424]: E0421 05:10:25.410720 1424 kuberuntime_manager.go:955] getPodContainerStatuses for pod "apiserver-5b9677479-z84j7_openshift-apiserver(8d047e85-f52d-42e9-bd7d-4d2626bd52c8)" failed: rpc error: code = Unknown desc = container with ID starting with 2dc97bea64ed280d94189461528946dfa0366d5cfae5ff5f3aa5ca65bf54b835 not found: ID does not exist

Version-Release number of selected component (if applicable):
4.5

How reproducible:
related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1826329
CI job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4/285

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Jim Minter on 2020-04-23 16:19:20 UTC ---

I'm seeing an issue where crictl shows that a pod has exited but the kubelet hasn't realised and OpenShift still thinks it's running. Will attach logs.

The pod in question is oauth-openshift-7fd76bfc9-vv47t. It lived around 2020/04/23 04:43:26 - 04:43:34 UTC.

At 04:43:27 the kubelet shows SyncLoop (PLEG) ContainerStarted *and* ContainerDied messages. Is some error in cri-o causing it to signal ContainerDied too early?
Apr 23 04:43:27 aro-master-1 hyperkube[3632442]: I0423 04:43:27.659181 3632442 kubelet.go:1953] SyncLoop (PLEG): "oauth-openshift-7fd76bfc9-vv47t_openshift-authentication(f9b3ca89-25c9-4852-af2b-4f7abf890a04)", event: &pleg.PodLifecycleEvent{ID:"f9b3ca89-25c9-4852-af2b-4f7abf890a04", Type:"ContainerDied", Data:"cf93a4b04ab3796b028640cca6c39baaf294152387f1bf75de13995b969e8906"}
Apr 23 04:43:27 aro-master-1 hyperkube[3632442]: I0423 04:43:27.659472 3632442 kubelet.go:1953] SyncLoop (PLEG): "oauth-openshift-7fd76bfc9-vv47t_openshift-authentication(f9b3ca89-25c9-4852-af2b-4f7abf890a04)", event: &pleg.PodLifecycleEvent{ID:"f9b3ca89-25c9-4852-af2b-4f7abf890a04", Type:"ContainerStarted", Data:"2d805ae593298f6ab7af88544db5fa1b35bf6aba05f3d6d415203ca9b866ef6c"}

crictl inspect 2d805ae593298 | grep finishedAt
    "finishedAt": "2020-04-23T04:43:33.401990767Z",

cri-o logs don't seem to show much.

--- Additional comment from Jim Minter on 2020-04-23 16:19:37 UTC ---

^ the above is on 4.3.10.

--- Additional comment from Jim Minter on 2020-04-23 16:20:11 UTC ---

--- Additional comment from Jim Minter on 2020-04-23 16:20:27 UTC ---

--- Additional comment from Jim Minter on 2020-04-23 16:20:44 UTC ---
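In case it helps anyone reproducing the crictl/kubelet mismatch above: a minimal sketch (mine, not from the bug) that asks the runtime directly for its view of a container, to compare against what the kubelet believes. It assumes cri-o's default socket path and the v1alpha2 CRI API these releases ship.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial cri-o's CRI socket (default path; adjust if your node differs).
	conn, err := grpc.DialContext(ctx, "unix:///var/run/crio/crio.sock", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	// Full container ID from the kubelet PLEG event above (the raw RPC
	// doesn't do the prefix matching that crictl does client-side).
	resp, err := client.ContainerStatus(ctx, &runtimeapi.ContainerStatusRequest{
		ContainerId: "2d805ae593298f6ab7af88544db5fa1b35bf6aba05f3d6d415203ca9b866ef6c",
	})
	if err != nil {
		// Same RPC path that produced "ID does not exist" in the
		// description once the container had already been removed.
		log.Fatal(err)
	}

	st := resp.GetStatus()
	fmt.Printf("state=%s finishedAt=%s exitCode=%d\n",
		st.GetState(),
		time.Unix(0, st.GetFinishedAt()).UTC().Format(time.RFC3339Nano),
		st.GetExitCode())
}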
Moving to 4.4.0 based on the Slack discussion; we are now seeing this on 4.4.0 install (not upgrade) jobs. https://coreos.slack.com/archives/CK1AE4ZCK/p1587663762462100
*** Bug 1824353 has been marked as a duplicate of this bug. ***
The linked PR caps the time allowed for the CNI request at half of the timeout of the incoming request. That should fix this issue.
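For context, a minimal sketch of what such a cap looks like (illustrative only, not the actual cri-o change; cniTimeout is a hypothetical helper name): derive the CNI operation's context from the incoming request's context, but give it only half of the remaining deadline, so cri-o can still tear down the half-created sandbox and return a real error before the caller's RPC deadline expires.

package main

import (
	"context"
	"fmt"
	"time"
)

// cniTimeout derives the CNI operation's deadline from the incoming
// request's context, capped at half of whatever time that request has
// left, leaving the other half for cleanup and error reporting.
func cniTimeout(ctx context.Context) (context.Context, context.CancelFunc) {
	deadline, ok := ctx.Deadline()
	if !ok {
		// No deadline on the incoming request; fall back to a fixed cap.
		return context.WithTimeout(ctx, 2*time.Minute)
	}
	return context.WithTimeout(ctx, time.Until(deadline)/2)
}

func main() {
	// Simulate a RunPodSandbox request arriving with a 4-minute deadline.
	parent, cancel := context.WithTimeout(context.Background(), 4*time.Minute)
	defer cancel()

	cniCtx, cniCancel := cniTimeout(parent)
	defer cniCancel()

	d, _ := cniCtx.Deadline()
	fmt.Printf("CNI operation budget: %s (half of the request's remaining time)\n",
		time.Until(d).Round(time.Second))
}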
https://github.com/cri-o/cri-o/pull/3659 merged. Moving to MODIFIED.
QE should hold off until RHCOS is bumped. This work is being done at the same time as the builds for: https://bugzilla.redhat.com/show_bug.cgi?id=1826896
Moving back to POST until an updated image is available for QE (which is in progress now).
Confirmed. We should verify this BZ and make sure we don't see the original symptom. It does look like bug 1826329 is a follow-up.
Based on comment 11 and checks in multiple upgrade tests, the fix is working; the issue was not reproduced when upgrading to 4.4.0-rc.12 (4.4.0-0.nightly-2020-04-26-205915).
As part of looking into the sample operator delays noted in comment 11, I tried to download http://10.73.131.57:9000/minio/openshift-must-gather/2020-04-26-11-12-42/must-gather.local.6992914659817673956.tar.gz but it appears to no longer be there.

I was able to use the access/secret keys, and I see a lot of similar directories at http://10.73.131.57:9000/minio/openshift-must-gather/ but not 2020-04-26-11-12-42.

Is there perhaps another must-gather in one of the other subdirectories I can look at, or can http://10.73.131.57:9000/minio/openshift-must-gather/2020-04-26-11-12-42/must-gather.local.6992914659817673956.tar.gz be reposted?

Thanks
OK, Wein Liu and I corresponded on Slack and got the minio trick sorted out.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581