Bug 1972490 - 4.8 and later update CI failing: pods should successfully create sandboxes by other: error getting pod: Unauthorized
Summary: 4.8 and later update CI failing: pods should successfully create sandboxes by...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Douglas Smith
QA Contact: Weibin Liang
URL:
Whiteboard:
Duplicates: 1948066
Depends On: 1948066 1972167
Blocks:
 
Reported: 2021-06-16 04:28 UTC by W. Trevor King
Modified: 2022-05-02 18:02 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1948066
Environment:
Last Closed: 2022-05-02 18:02:06 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2021-06-16 04:28:37 UTC
+++ This bug was initially created as a clone of Bug #1948066 +++

This is the new bug for:

--- Additional comment from W. Trevor King on 2021-06-16 04:26:08 UTC ---

A reasonable fraction of these failures from recent releases look like [1]:

  ns/openshift-multus pod/network-metrics-daemon-zx2pz node/ci-op-bwcbtfmb-25656-9n58p-master-1 - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-zx2pz_openshift-multus_968336b3-1fef-4098-8e2d-f37b3cbee8f7_0(6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac): Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized
  ns/openshift-multus pod/network-metrics-daemon-rjhfv node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus3-8ssnx - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-rjhfv_openshift-multus_7342244f-1581-4bd3-b6f1-25d013cc4e34_0(fb86f0f60c86921d1dda1dc977336fbc6a93eec6c03da3e3ee59c6c4a2a991a5): Multus: [openshift-multus/network-metrics-daemon-rjhfv]: error getting pod: Unauthorized

Searching:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=network-metrics-daemon.*never+deleted.*reason/FailedCreatePodSandBox.*failed+to+create+pod+network+sandbox.*error+getting+pod:+Unauthorized' | grep 'failures match' | grep -v 'pull-ci-\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 13 runs, 69% failed, 22% of failures match = 15% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 14 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 13 runs, 92% failed, 58% of failures match = 54% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 4 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

But I guess that's not 4.7 -> 4.8, so I'll spin it off into a new bug.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560

Comment 1 W. Trevor King 2021-06-16 04:35:09 UTC
That "error getting pod: Unauthorized" line for FailedCreatePodSandBox is also mentioned in bug 1963218 (for arti-test) and bug 1972167 (for openshift-kube-apiserver).  Not sure if they share the same root cause or not.

Comment 2 W. Trevor King 2021-06-16 04:48:10 UTC
Relevant chunk from the node logs for the job in comment 0:

 $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/artifacts/e2e-azure-upgrade/gather-extra/artifacts/nodes/ci-op-bwcbtfmb-25656-9n58p-master-1/journal | gunzip | grep -A10 'Started libcontainer container 4b2b63b11209712e239f2b29173d89447b9478c0de56f6a1e66061de38460bc8'
Jun 15 04:37:19.339586 ci-op-bwcbtfmb-25656-9n58p-master-1 systemd[1]: Started libcontainer container 4b2b63b11209712e239f2b29173d89447b9478c0de56f6a1e66061de38460bc8.
Jun 15 04:37:19.343521 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.343468730Z" level=info msg="Got pod network &{Name:network-metrics-daemon-zx2pz Namespace:openshift-multus ID:6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac NetNS:/var/run/netns/fb902f1d-5e3f-49c6-86ef-cbf5a11a331a Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Jun 15 04:37:19.343521 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.343516530Z" level=info msg="About to add CNI network multus-cni-network (type=multus)"
Jun 15 04:37:19.376344 ci-op-bwcbtfmb-25656-9n58p-master-1 systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
Jun 15 04:37:19.457624 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.457568814Z" level=info msg="Ran pod sandbox 4b2b63b11209712e239f2b29173d89447b9478c0de56f6a1e66061de38460bc8 with infra container: openshift-multus/multus-plz5b/POD" id=4669f6ea-4376-4825-9e53-76ea7638f7bb name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 15 04:37:19.459055 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.458906325Z" level=info msg="Checking image status: registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471" id=234711ec-af4a-4632-b29b-6cecbd79eb99 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 15 04:37:19.459363 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.459319928Z" level=info msg="Image registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471 not found" id=234711ec-af4a-4632-b29b-6cecbd79eb99 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 15 04:37:19.460213 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.460149734Z" level=info msg="Pulling image: registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471" id=4f1bde04-5d1f-4d57-bc3e-11b188a73754 name=/runtime.v1alpha2.ImageService/PullImage
Jun 15 04:37:19.462669 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.462583153Z" level=info msg="Trying to access \"registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471\""
Jun 15 04:37:19.476316 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.476271659Z" level=error msg="Error adding network: Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized"
Jun 15 04:37:19.476392 ci-op-bwcbtfmb-25656-9n58p-master-1 crio[1484]: time="2021-06-15 04:37:19.476320659Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized"

Is that pulling an image as part of adding the pod to the CNI network?  Or is the image-pull an orthogonal thing?  The image still seems to be there:

$ oc image info -o json registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471 | jq -r '.config.config.Labels | .["io.openshift.build.source-location"] + "/commit/" + .["io.openshift.build.commit.id"]'
https://github.com/openshift/multus-cni/commit/f749e15a6896a96ce943b6257f1eef6b5bd7c029

Comment 3 W. Trevor King 2021-06-16 05:00:10 UTC
Relevant CRI-O code for the comment 2 log snippet seems to be [1], wrapping a call to cni.AddNetworkList.

[1]: https://github.com/cri-o/cri-o/blob/4e1d564b29b4d491a4527556198160461942cf73/vendor/github.com/cri-o/ocicni/pkg/ocicni/ocicni.go#L706-L709
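
For orientation, here is a minimal Go sketch of that wrapping pattern (an illustration, not the actual ocicni source; the function name addPodToNetwork is made up). The point is that Multus's own error string comes back unchanged from AddNetworkList, and CRI-O only logs and re-wraps it, which is why the exact "error getting pod: Unauthorized" text appears verbatim in the crio journal lines in comment 2:

package main

import (
    "context"
    "fmt"

    "github.com/containernetworking/cni/libcni"
)

// addPodToNetwork mirrors the shape of the ocicni call site in [1]: exec the
// CNI plugin (Multus) and surface whatever error string it returned.
func addPodToNetwork(ctx context.Context, cni *libcni.CNIConfig, netName string,
    conf *libcni.NetworkConfigList, rt *libcni.RuntimeConf) error {
    result, err := cni.AddNetworkList(ctx, conf, rt)
    if err != nil {
        // These correspond to the two crio error lines in comment 2.
        fmt.Printf("Error adding network: %v\n", err)
        return fmt.Errorf("error while adding pod to CNI network %q: %w", netName, err)
    }
    _ = result
    return nil
}

func main() {}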

Comment 4 W. Trevor King 2021-06-16 05:04:34 UTC
Relevant multus code for the comment 2 log snippet seems to be [1], responding to a GetPod in CmdAdd.

[1]: https://github.com/openshift/multus-cni/blob/f749e15a6896a96ce943b6257f1eef6b5bd7c029/pkg/multus/multus.go#L536-L553
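
Again for orientation, a minimal Go sketch (not the actual multus source; getPodForCmdAdd is a made-up name) of what that GetPod step amounts to. If the service-account token in the kubeconfig Multus uses has been invalidated, the apiserver answers 401, client-go reports it as "Unauthorized", and Multus wraps that into the message we see in the events:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// getPodForCmdAdd sketches the pod lookup Multus performs during CmdAdd [1].
func getPodForCmdAdd(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
    pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        // Matches the message format from the CI logs: an Unauthorized
        // response from the apiserver ends up here.
        return fmt.Errorf("Multus: [%s/%s]: error getting pod: %v", namespace, name, err)
    }
    _ = pod
    return nil
}

func main() {}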

Comment 5 W. Trevor King 2021-06-16 05:24:58 UTC
What was going on around 04:37:19.[34]Z in the comment 0 job whose logs are in comment 2?

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/build-log.txt | grep 04:37:19 
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/Created
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/Pulled image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:82fd9e97d3a878d6d5524bfc06f8ee92010f79542c9c75626bfbcdb0e6994588
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/Started
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-plz5b node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/kube-multus reason/Pulled duration/0.432s image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-plz5b node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/kube-multus reason/Pulling image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/Created
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/Pulled image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/Started
Jun 15 04:37:19.000 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/sdn reason/Started
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/whereabouts-cni-bincopy reason/Created
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/whereabouts-cni-bincopy reason/Pulled duration/0.334s image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:506cdb07bdc6a610a09b85e4daa1229b37c998f90ce0398d8829972b2beb170d
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/whereabouts-cni-bincopy reason/Pulling image/registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:506cdb07bdc6a610a09b85e4daa1229b37c998f90ce0398d8829972b2beb170d
Jun 15 04:37:19.000 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/whereabouts-cni-bincopy reason/Started
Jun 15 04:37:19.000 W ns/openshift-multus pod/network-metrics-daemon-zx2pz node/ci-op-bwcbtfmb-25656-9n58p-master-1 reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-zx2pz_openshift-multus_968336b3-1fef-4098-8e2d-f37b3cbee8f7_0(6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac): Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized
Jun 15 04:37:19.189 I ns/openshift-multus pod/multus-additional-cni-plugins-js86n node/ci-op-bwcbtfmb-25656-9n58p-master-1 container/routeoverride-cni reason/ContainerExit code/0 cause/Completed 
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/ContainerStart duration/2.00s
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/sdn reason/ContainerStart duration/2.00s
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/ContainerStart duration/2.00s
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/kube-rbac-proxy reason/Ready
Jun 15 04:37:19.777 I ns/openshift-sdn pod/sdn-ltv96 node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus1-mhzfp container/drop-icmp reason/Ready

So right after whereabouts-cni-bincopy Started.  I don't really know what that does, but "Unauthorized" sounds RBAC-y and "bincopy" sounds like local binaries, so I'm not sure how they'd be related.

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/build-log.txt | grep 'clusteroperator/network'
Jun 15 04:37:01.766 W clusteroperator/network condition/Progressing status/True reason/Deploying changed: DaemonSet "openshift-multus/multus" update is being processed (generation 2, observed generation 1)
Jun 15 04:37:01.766 - 282s  W clusteroperator/network condition/Progressing status/True reason/DaemonSet "openshift-multus/multus" update is being processed (generation 2, observed generation 1)
Jun 15 04:39:01.228 I clusteroperator/network versions: operator 4.8.0-0.ci-2021-06-13-060313 -> 4.8.0-0.ci-2021-06-14-090014
Jun 15 04:41:44.483 W clusteroperator/network condition/Progressing status/False changed: 
Jun 15 04:50:45.340 W clusteroperator/network condition/Progressing status/True reason/Deploying changed: DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
...

Hmm, so at 4:37:19, multus DaemonSet was rolling out.  Events around the time of the error:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/artifacts/e2e-azure-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-multus") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort | grep -2 2021-06-15T04:37:19Z
2021-06-15T04:37:18Z network-metrics-daemon-w6vbc Started: Started container network-metrics-daemon
2021-06-15T04:37:18Z network-metrics-daemon-zx2pz Scheduled: Successfully assigned openshift-multus/network-metrics-daemon-zx2pz to ci-op-bwcbtfmb-25656-9n58p-master-1
2021-06-15T04:37:19Z multus-additional-cni-plugins-js86n Created: Created container whereabouts-cni-bincopy
2021-06-15T04:37:19Z multus-additional-cni-plugins-js86n Pulled: Successfully pulled image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:506cdb07bdc6a610a09b85e4daa1229b37c998f90ce0398d8829972b2beb170d" in 333.530384ms
2021-06-15T04:37:19Z multus-additional-cni-plugins-js86n Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:506cdb07bdc6a610a09b85e4daa1229b37c998f90ce0398d8829972b2beb170d"
2021-06-15T04:37:19Z multus-additional-cni-plugins-js86n Started: Started container whereabouts-cni-bincopy
2021-06-15T04:37:19Z multus-plz5b Pulled: Successfully pulled image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471" in 431.559545ms
2021-06-15T04:37:19Z multus-plz5b Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:2c881d3a4bdeafee4a06a1345d613cfd0b18d376e47c14bd03263d06ff9cc471"
2021-06-15T04:37:19Z network-metrics-daemon-zx2pz FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-zx2pz_openshift-multus_968336b3-1fef-4098-8e2d-f37b3cbee8f7_0(6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac): Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized
2021-06-15T04:37:20Z multus SuccessfulDelete: Deleted pod: multus-bg755
2021-06-15T04:37:20Z multus-additional-cni-plugins-js86n Created: Created container whereabouts-cni

And events around the impacted pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560/artifacts/e2e-azure-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-multus") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort | grep network-metrics-daemon-zx2pz
2021-06-15T04:37:18Z network-metrics-daemon SuccessfulCreate: Created pod: network-metrics-daemon-zx2pz
2021-06-15T04:37:18Z network-metrics-daemon-zx2pz Scheduled: Successfully assigned openshift-multus/network-metrics-daemon-zx2pz to ci-op-bwcbtfmb-25656-9n58p-master-1
2021-06-15T04:37:19Z network-metrics-daemon-zx2pz FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-zx2pz_openshift-multus_968336b3-1fef-4098-8e2d-f37b3cbee8f7_0(6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac): Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized
2021-06-15T04:37:43Z network-metrics-daemon-zx2pz AddedInterface: Add eth0 [10.129.0.72/23] from openshift-sdn
2021-06-15T04:37:43Z network-metrics-daemon-zx2pz Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:58de0fe7769789748227f186f51aa7bffb3571dc9030f930ff39ff3539e5127d"
2021-06-15T04:37:45Z network-metrics-daemon-zx2pz Created: Created container kube-rbac-proxy
...

So 25 seconds later at 4:37:45Z, there was another attempt that went smoothly.

Comment 6 W. Trevor King 2021-06-16 05:38:09 UTC
I've floated [1] to make this non-fatal in CI while we figure out and fix the root cause.

[1]: https://github.com/openshift/origin/pull/26235

Comment 7 Lukasz Szaszkiewicz 2021-06-17 08:07:15 UTC
This issue is currently preventing PRs from merging for 4.8 and master in the cluster-kube-apiserver-operator repo.
Is someone working on it?

Comment 8 Douglas Smith 2021-06-17 20:05:42 UTC
Yes, I believe this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1972167

Comment 9 Douglas Smith 2021-06-18 13:42:12 UTC
We currently believe this is a dupe of bug 1972167; it should be verified as such, or reopened if determined otherwise.

*** This bug has been marked as a duplicate of bug 1972167 ***

Comment 10 W. Trevor King 2021-06-23 00:02:11 UTC
Bug 1972167 has been MODIFIED or later for at least four days now, and today in an unrelated origin master PR I saw [1]:


  : [sig-network] pods should successfully create sandboxes by writing network status	0s
    1 failures to create the sandbox

    ns/openshift-multus pod/network-metrics-daemon-gqwwm node/ci-op-t2lww3dt-db044-6dlnx-master-2 - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-gqwwm_openshift-multus_dd7a77c9-c4e3-46ae-89bc-9e4685ef765c_0(adf45dda585ae0b65221fdf1a4e8ff914a88bd44e15a217cc05f84c165897e3c): Multus: [openshift-multus/network-metrics-daemon-gqwwm]: error setting the networks status, pod was already deleted: SetNetworkStatus: failed to query the pod network-metrics-daemon-gqwwm in out of cluster comm: Unauthorized

That doesn't match the 'error getting pod: Unauthorized' carve-out from [2], so it failed the job.  I'm not sure if there's still a missing part of cleanly handling token rotation, or if this is just the origin suite ratcheting down on strictness faster than the Multus tooling can keep up.  [3] is up with a broader origin softening based on the fact that these hiccups seem to automatically recover.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26262/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1407415987656986624
[2]: https://github.com/openshift/origin/pull/26235/commits/8b6ec68cf75b3f90116dd16acbde3a900e601fba
[3]: https://github.com/openshift/origin/pull/26208
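
To make the distinction concrete, a hypothetical Go sketch of how a message carve-out like [2] behaves (this is not the actual origin test code; the pattern and function name are made up for illustration). Messages matching the tolerated pattern are forgiven, while the new "error setting the networks status ... Unauthorized" message falls through and fails the job:

package main

import (
    "fmt"
    "regexp"
)

// tolerated is an assumed allow-list pattern, standing in for the carve-out in [2].
var tolerated = regexp.MustCompile(`error getting pod: Unauthorized`)

// sandboxFailureIsFatal reports whether a FailedCreatePodSandBox message
// should still fail the job.
func sandboxFailureIsFatal(message string) bool {
    return !tolerated.MatchString(message)
}

func main() {
    fmt.Println(sandboxFailureIsFatal("Multus: [ns/pod]: error getting pod: Unauthorized"))           // false: tolerated
    fmt.Println(sandboxFailureIsFatal("Multus: [ns/pod]: error setting the networks status: Unauthorized")) // true: fails the job
}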

Comment 11 Douglas Smith 2021-06-23 15:29:19 UTC
I'm setting this as blocker- because, as I currently understand it, the behavior is incorrect but non-fatal. The condition does recover: the "Unauthorized" response from the API server causes Multus to fail for that attempt, but the impacted pod goes into a crash loop and recovers. Not ideal, but while this seems to occur with some regularity in CI, it impacts only a small number of pods.

I'm still working towards an overall fix.

Comment 12 W. Trevor King 2021-07-02 04:57:26 UTC
*** Bug 1948066 has been marked as a duplicate of this bug. ***

Comment 13 Douglas Smith 2021-09-24 20:01:20 UTC
I've been looking at the CI results for this bug, and I'm curious whether the failure threshold could be set at two. Each time I see this error, it's a single occurrence. Because of the limits on how quickly the Multus daemonset can update its authentication information, I believe intermittent failures are to be expected; while incorrect, they resolve on subsequent attempts. This is because the Multus CNI binary runs from disk, and the kubeconfig it uses is generated by the entrypoint script, which refreshes it every second. Seeing more than one instance of this error would be less likely, and would be more indicative of a genuinely impactful error condition.

I believe this authentication issue will be resolved holistically by an upcoming refactor that moves Multus to a "thick plugin" mode -- a thin CNI shim plus a daemonset that stays resident in memory. In that mode, the resident daemon will authenticate via client-go, which should handle authentication in the same fashion as other components running as part of the OpenShift infrastructure, e.g. operators and controllers.
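
For illustration, a minimal Go sketch of what "authentication from client-go" buys here (hypothetical, not the thick-plugin code itself): rest.InClusterConfig() points client-go at the mounted service-account token file, and the client should keep re-reading that file as the kubelet rotates it, so a long-running daemon does not depend on a kubeconfig that an entrypoint script regenerates once a second:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // Reads the token mounted at /var/run/secrets/kubernetes.io/serviceaccount;
    // client-go refreshes the bearer token from that file as it changes.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    pods, err := client.CoreV1().Pods("openshift-multus").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Println("pods in openshift-multus:", len(pods.Items))
}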

