+++ This bug was initially created as a clone of Bug #1948066 +++

This is the new bug for:

--- Additional comment from W. Trevor King on 2021-06-16 04:26:08 UTC ---

A reasonable number of those from recent releases seem to look like [1]:

ns/openshift-multus pod/network-metrics-daemon-zx2pz node/ci-op-bwcbtfmb-25656-9n58p-master-1 - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-zx2pz_openshift-multus_968336b3-1fef-4098-8e2d-f37b3cbee8f7_0(6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac): Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized
ns/openshift-multus pod/network-metrics-daemon-rjhfv node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus3-8ssnx - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-rjhfv_openshift-multus_7342244f-1581-4bd3-b6f1-25d013cc4e34_0(fb86f0f60c86921d1dda1dc977336fbc6a93eec6c03da3e3ee59c6c4a2a991a5): Multus: [openshift-multus/network-metrics-daemon-rjhfv]: error getting pod: Unauthorized

Searching:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=network-metrics-daemon.*never+deleted.*reason/FailedCreatePodSandBox.*failed+to+create+pod+network+sandbox.*error+getting+pod:+Unauthorized' | grep 'failures match' | grep -v 'pull-ci-\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 13 runs, 69% failed, 22% of failures match = 15% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 14 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 13 runs, 92% failed, 58% of failures match = 54% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 4 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

But I guess that's not 4.7 -> 4.8, so I'll spin it off into a new bug.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560
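For reference, the "impact" column in these search results is the matched-failure count over the total run count (failed% times match%). A minimal shell sanity check of the arithmetic for one row; the run counts 13/12/7 are inferred from the reported percentages, not read from the job data:

```shell
# Rounded integer percentage: pct <numerator> <denominator>
pct() { echo $(( (100 * $1 + $2 / 2) / $2 )); }

# Inferred from "13 runs, 92% failed, 58% of failures match = 54% impact"
runs=13; failed=12; matched=7
failed_pct=$(pct "$failed" "$runs")     # 92
match_pct=$(pct "$matched" "$failed")   # 58
impact=$(pct "$matched" "$runs")        # 54
echo "${runs} runs, ${failed_pct}% failed, ${match_pct}% of failures match = ${impact}% impact"
```

So "impact" folds the failure rate and the match rate together, which is why a job with many runs and few matches shows a low impact even when every match was fatal.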
That "error getting pod: Unauthorized" line for FailedCreatePodSandBox is also mentioned in bug 1963218 (for arti-test) and bug 1972167 (for openshift-kube-apiserver). Not sure if they share the same root cause or not.
Relevant chunk from the node logs for the job in comment 0:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/artifacts/e2e-azure-upgrade/gather-extra/artifacts/nodes/ci-op-bwcbtfmb-25656-9n58p-master-1/journal | gunzip | grep -A10 'Started libcontainer container 4b2b63b11209712e239f2b29173d89447b9478c0de56f6a1e66061de38460bc8'
Jun 15 04:37:19.339586 ci-op-bwcbtfmb-25656-9n58p-master-1 systemd[1]: Started libcontainer container 4b2b63b11209712e239f2b29173d89447b9478c0de56f6a1e66061de38460bc8.
Jun 15 04:37:19.343521 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.343468730Z" level=info msg="Got pod network &{Name:network-metrics-daemon-zx2pz Namespace:openshift-multus ID:6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac NetNS:/var/run/netns/fb902f1d-5e3f-49c6-86ef-cbf5a11a331a Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Jun 15 04:37:19.343521 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.343516530Z" level=info msg="About to add CNI network multus-cni-network (type=multus)"
Jun 15 04:37:19.376344 ci-op-bwcbtfmb-25656-9n58p-master-1 systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
Jun 15 04:37:19.457624 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.457568814Z" level=info msg="Ran pod sandbox 4b2b63b11209712e239f2b29173d89447b9478c0de56f6a1e66061de38460bc8 with infra container: openshift-multus/multus-plz5b/POD" id=4669f6ea-4376-4825-9e53-76ea7638f7bb name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 15 04:37:19.459055 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.458906325Z" level=info msg="Checking image status: registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471" id=234711ec-af4a-4632-b29b-6cecbd79eb99 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 15 04:37:19.459363 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.459319928Z" level=info msg="Image registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471 not found" id=234711ec-af4a-4632-b29b-6cecbd79eb99 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 15 04:37:19.460213 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.460149734Z" level=info msg="Pulling image: registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471" id=4f1bde04-5d1f-4d57-bc3e-11b188a73754 name=/runtime.v1alpha2.ImageService/PullImage
Jun 15 04:37:19.462669 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.462583153Z" level=info msg="Trying to access \"registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471\""
Jun 15 04:37:19.476316 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.476271659Z" level=error msg="Error adding network: Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized"
Jun 15 04:37:19.476392 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.476320659Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized"

Is that pulling an image as part of adding the pod to the CNI network? Or is the image-pull an orthogonal thing?
The image still seems to be there:

$ oc image info -o json registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471 | jq -r '.config.config.Labels | .["io.openshift.build.source-location"] + "/commit/" + .["io.openshift.build.commit.id"]'
https://github.com/openshift/multus-cni/commit/f749e15a6896a96ce943b6257f1eef6b5bd7c029
Relevant CRI-O code for the comment 2 log snippet seems to be [1], wrapping a call to cni.AddNetworkList.

[1]: https://github.com/cri-o/cri-o/blob/4e1d564b29b4d491a4527556198160461942cf73/vendor/github.com/cri-o/ocicni/pkg/ocicni/ocicni.go#L706-L709
Relevant multus code for the comment 2 log snippet seems to be [1], responding to a GetPod in CmdAdd.

[1]: https://github.com/openshift/multus-cni/blob/f749e15a6896a96ce943b6257f1eef6b5bd7c029/pkg/multus/multus.go#L536-L553
What was going on around 04:37:19.[34]Z in the comment 0 job whose logs are in comment 2?

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/build-log.txt | grep 04:37:19
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/Created
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/Pulled image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:82fd9e97d3a878d6d5524bfc06f8ee92010f79542c9c75626bfbcdb0e6994588
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/Started
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-plz5b node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/kube-multus reason/Pulled duration/0.432s image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-plz5b node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/kube-multus reason/Pulling image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/Created
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/Pulled image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/Started
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/sdn reason/Started
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/whereabouts-cni-bincopy reason/Created
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/whereabouts-cni-bincopy reason/Pulled duration/0.334s image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:506cdb07bdc6a610a09b85e4daa1229b37c998f90ce0398d8829972b2beb170d
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/whereabouts-cni-bincopy reason/Pulling image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:506cdb07bdc6a610a09b85e4daa1229b37c998f90ce0398d8829972b2beb170d
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/whereabouts-cni-bincopy reason/Started
Jun 15 04:37:19.000 W ns/openshift-multus pod/network-metrics-daemon-zx2pz node/ci-op-bwcbtfmb-25656-9n58p-master-1 reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-zx2pz_openshift-multus_968336b3-1fef-4098-8e2d-f37b3cbee8f7_0(6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac): Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized
Jun 15 04:37:19.189 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/routeoverride-cni reason/ContainerExit code/0 cause/Completed
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/ContainerStart duration/2.00s
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/sdn reason/ContainerStart duration/2.00s
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/ContainerStart duration/2.00s
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/Ready
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/Ready

So right after whereabouts-cni-bincopy Started. I don't really know what that does, but "Unauthorized" sounds RBAC-y and "bincopy" sounds like local binaries, so I'm not sure how they'd be related.

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/build-log.txt | grep 'clusteroperator/network'
Jun 15 04:37:01.766 W clusteroperator/network condition/Progressing status/True reason/Deploying changed: DaemonSet "openshift-multus/multus" update is being processed (generation 2, observed generation 1)
Jun 15 04:37:01.766 - 282s  W clusteroperator/network condition/Progressing status/True reason/DaemonSet "openshift-multus/multus" update is being processed (generation 2, observed generation 1)
Jun 15 04:39:01.228 I clusteroperator/network versions: operator 4.8.0-0.ci-2021-06-13-060313 -> 4.8.0-0.ci-2021-06-14-090014
Jun 15 04:41:44.483 W clusteroperator/network condition/Progressing status/False changed:
Jun 15 04:50:45.340 W clusteroperator/network condition/Progressing status/True reason/Deploying changed: DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
...

Hmm, so at 4:37:19, the multus DaemonSet was rolling out.
Events around the time of the error:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/artifacts/e2e-azure-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-multus") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort | grep -2 2021-06-15T04:37:19Z
2021-06-15T04:37:18Z network-metrics-daemon-w6vbc Started: Started container network-metrics-daemon
2021-06-15T04:37:18Z network-metrics-daemon-zx2pz Scheduled: Successfully assigned openshift-multus/network-metrics-daemon-zx2pz to ci-op-bwcbtfmb-25656-9n58p-master-1
2021-06-15T04:37:19Z multus-additional-cni-plugins-js86n Created: Created container whereabouts-cni-bincopy
2021-06-15T04:37:19Z multus-additional-cni-plugins-js86n Pulled: Successfully pulled image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:506cdb07bdc6a610a09b85e4daa1229b37c998f90ce0398d8829972b2beb170d" in 333.530384ms
2021-06-15T04:37:19Z multus-additional-cni-plugins-js86n Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:506cdb07bdc6a610a09b85e4daa1229b37c998f90ce0398d8829972b2beb170d"
2021-06-15T04:37:19Z multus-additional-cni-plugins-js86n Started: Started container whereabouts-cni-bincopy
2021-06-15T04:37:19Z multus-plz5b Pulled: Successfully pulled image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471" in 431.559545ms
2021-06-15T04:37:19Z multus-plz5b Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471"
2021-06-15T04:37:19Z network-metrics-daemon-zx2pz FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-zx2pz_openshift-multus_968336b3-1fef-4098-8e2d-f37b3cbee8f7_0(6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac): Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized
2021-06-15T04:37:20Z multus SuccessfulDelete: Deleted pod: multus-bg755
2021-06-15T04:37:20Z multus-additional-cni-plugins-js86n Created: Created container whereabouts-cni

And events around the impacted pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/artifacts/e2e-azure-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-multus") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort | grep network-metrics-daemon-zx2pz
2021-06-15T04:37:18Z network-metrics-daemon SuccessfulCreate: Created pod: network-metrics-daemon-zx2pz
2021-06-15T04:37:18Z network-metrics-daemon-zx2pz Scheduled: Successfully assigned openshift-multus/network-metrics-daemon-zx2pz to ci-op-bwcbtfmb-25656-9n58p-master-1
2021-06-15T04:37:19Z network-metrics-daemon-zx2pz FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-zx2pz_openshift-multus_968336b3-1fef-4098-8e2d-f37b3cbee8f7_0(6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac): Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized
2021-06-15T04:37:43Z network-metrics-daemon-zx2pz AddedInterface: Add eth0 [10.129.0.72/23] from openshift-sdn
2021-06-15T04:37:43Z network-metrics-daemon-zx2pz Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:58de0fe7769789748227f186f51aa7bffb3571dc9030f930ff39ff3539e5127d"
2021-06-15T04:37:45Z network-metrics-daemon-zx2pz Created: Created container kube-rbac-proxy
...

So 25 seconds later, at 4:37:45Z, there was another attempt that went smoothly.
I've floated [1] to make this non-fatal in CI while we figure out and fix the root cause.

[1]: https://github.com/openshift/origin/pull/26235
This issue is currently preventing PRs from merging for 4.8 and master in the cluster-kube-apiserver-operator repo. Is someone working on it?
Yes, I believe this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1972167
We currently believe this is a dupe of 1972167, and should be verified as such, or reopened if determined otherwise. *** This bug has been marked as a duplicate of bug 1972167 ***
Bug 1972167 has been MODIFIED or later for at least four days now, and today in an unrelated origin master PR I saw [1]:

: [sig-network] pods should successfully create sandboxes by writing network status	0s
1 failures to create the sandbox
ns/openshift-multus pod/network-metrics-daemon-gqwwm node/ci-op-t2lww3dt-db044-6dlnx-master-2 - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-gqwwm_openshift-multus_dd7a77c9-c4e3-46ae-89bc-9e4685ef765c_0(adf45dda585ae0b65221fdf1a4e8ff914a88bd44e15a217cc05f84c165897e3c): Multus: [openshift-multus/network-metrics-daemon-gqwwm]: error setting the networks status, pod was already deleted: SetNetworkStatus: failed to query the pod network-metrics-daemon-gqwwm in out of cluster comm: Unauthorized

That doesn't match the 'error getting pod: Unauthorized' carve-out from [2], so it failed the job. I'm not sure whether there's still a missing piece of cleanly handling token rotation, or whether this is just the origin suite tightening its strictness faster than the Multus tooling can keep up. [3] is up with a broader origin softening, based on the fact that these hiccups seem to recover automatically.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26262/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1407415987656986624
[2]: https://github.com/openshift/origin/pull/26235/commits/8b6ec68cf75b3f90116dd16acbde3a900e601fba
[3]: https://github.com/openshift/origin/pull/26208
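To illustrate why this run failed while earlier ones were excused: the carve-out from [2] keys on the 'error getting pod: Unauthorized' substring (modeled here as a plain substring check, which is an assumption; the actual origin matcher may differ), and the new SetNetworkStatus flavor of the message does not contain it:

```shell
# Hypothetical stand-in for the origin carve-out in [2]: excuse only
# messages containing the exact 'error getting pod: Unauthorized' text.
is_excused() {
  case "$1" in
    *'error getting pod: Unauthorized'*) echo excused ;;
    *) echo fatal ;;
  esac
}

old_msg='Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized'
new_msg='Multus: [openshift-multus/network-metrics-daemon-gqwwm]: error setting the networks status, pod was already deleted: SetNetworkStatus: failed to query the pod network-metrics-daemon-gqwwm in out of cluster comm: Unauthorized'

is_excused "$old_msg"   # excused
is_excused "$new_msg"   # fatal
```

Both messages end in "Unauthorized", but only the old one carries the exact excused substring, which is why a narrowly scoped exception misses the new variant.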
I'm setting this blocker-, as I currently understand this to be incorrect but non-fatal. The condition does seem to recover: the "Unauthorized" response from the API server causes Multus to fail for a given sandbox attempt, and the impacted pod goes into a crash loop before recovering. Not ideal, but while it seems to occur with some regularity in CI, it impacts only a small number of pods. I'm still working towards an overall fix.
*** Bug 1948066 has been marked as a duplicate of this bug. ***
I've been taking a look at the CI results for this bug, and I'm curious whether the threshold could be set at two failures. Each time I see this error, it's a single occurrence. Given the limit on how quickly the Multus daemonset can update the authentication information, I believe intermittent single failures are to be expected; while incorrect, they resolve on subsequent attempts. This is because the Multus CNI binary runs from disk, and the kubeconfig it uses is generated by the entrypoint script, which updates it each second. Seeing more than one instance of this error is less likely, and may be indicative of a more impactful error condition.

I believe this authentication issue will be resolved holistically by an upcoming refactor that runs Multus as a "thick plugin" -- a CNI shim plus a daemonset that stays resident in memory. In that mode, the resident daemon will use authentication from client-go, which should handle authentication in the same fashion as other components running as part of the OpenShift infrastructure, e.g. operators and controllers.
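A minimal sketch of the kubeconfig-regeneration pattern described above, under stated assumptions: the paths, variable names, and kubeconfig layout here are illustrative, not the real multus entrypoint's. The entrypoint rereads the serviceaccount token and atomically swaps the on-disk kubeconfig into place, so the CNI binary can race a token rotation by at most one regeneration interval:

```shell
# Hypothetical per-second kubeconfig refresh; paths are illustrative.
SA_DIR="${SA_DIR:-/var/run/secrets/kubernetes.io/serviceaccount}"
KUBECONFIG_PATH="${KUBECONFIG_PATH:-/etc/cni/net.d/multus.d/multus.kubeconfig}"

write_kubeconfig() {
  # Reread the (possibly rotated) serviceaccount token each pass.
  token=$(cat "$SA_DIR/token")
  cat > "$KUBECONFIG_PATH.new" <<EOF
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://${KUBERNETES_SERVICE_HOST:-kubernetes.default.svc}:${KUBERNETES_SERVICE_PORT:-443}
    certificate-authority: $SA_DIR/ca.crt
users:
- name: multus
  user:
    token: $token
contexts:
- name: multus-context
  context:
    cluster: local
    user: multus
current-context: multus-context
EOF
  # Atomic replace, so the CNI binary never reads a half-written file.
  mv "$KUBECONFIG_PATH.new" "$KUBECONFIG_PATH"
}

# In the daemonset entrypoint this would loop forever:
#   while true; do write_kubeconfig; sleep 1; done
```

The one-second window between rotations is exactly where the occasional "Unauthorized" CmdAdd lands; a resident daemon using client-go would refresh credentials in-process instead of through this file handoff.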