Bug 1972167 - Several operators degraded because Failed to create pod sandbox when installing an sts cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Douglas Smith
QA Contact: Weibin Liang
URL:
Whiteboard:
Duplicates: 1972693
Depends On: 1973423
Blocks: 1972490
 
Reported: 2021-06-15 11:40 UTC by wang lin
Modified: 2021-07-27 23:13 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Short certificate rotation intervals caused Multus to fail to authenticate to the Kubernetes API server. Because Multus is a CNI plugin that runs on the host, it keeps its credentials in a kubeconfig on disk. Consequence: Pods failed to be created across the cluster. Fix: Multus now watches the secret in its daemonset pod and updates the on-disk kubeconfig so that it catches a certificate rotation. Result: Pods are created successfully.
Clone Of:
Clones: 1973423
Environment:
Last Closed: 2021-07-27 23:12:54 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
must-gather (11.89 MB, application/gzip)
2021-06-15 11:40 UTC, wang lin


Links
System ID Private Priority Status Summary Last Updated
Github openshift multus-cni pull 108 0 None closed Bug 1972167: Updates entrypoint to rebuild kubeconfig when service account token or ca changes 2021-06-22 23:07:37 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:13:35 UTC

Description wang lin 2021-06-15 11:40:20 UTC
Created attachment 1791239 [details]
must-gather

Description of problem:
We can't install an STS cluster using the latest nightly build; several operators are in a degraded state.

What is STS?
https://github.com/openshift/cloud-credential-operator/blob/master/docs/sts.md

What's the issue?
Several operators are degraded:
$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-fc.9   False       True          True       3h19m
baremetal                                  4.8.0-fc.9   True        False         False      3h17m
cloud-credential                           4.8.0-fc.9   True        False         False      3h17m
cluster-autoscaler                         4.8.0-fc.9   True        False         False      3h18m
config-operator                            4.8.0-fc.9   True        False         False      3h19m
console                                                                                      
csi-snapshot-controller                    4.8.0-fc.9   True        False         False      3h19m
dns                                        4.8.0-fc.9   True        False         False      3h18m
etcd                                       4.8.0-fc.9   True        True          True       3h17m
image-registry                             4.8.0-fc.9   True        False         False      146m
ingress                                    4.8.0-fc.9   True        False         False      144m
insights                                   4.8.0-fc.9   True        False         False      146m
kube-apiserver                             4.8.0-fc.9   False       True          True       3h19m
kube-controller-manager                    4.8.0-fc.9   True        True          True       3h17m
kube-scheduler                             4.8.0-fc.9   True        True          True       3h16m
kube-storage-version-migrator              4.8.0-fc.9   True        False         False      3h19m
machine-api                                4.8.0-fc.9   True        False         False      143m
machine-approver                           4.8.0-fc.9   True        False         False      3h18m
machine-config                             4.8.0-fc.9   True        False         False      3h17m
marketplace                                4.8.0-fc.9   True        False         False      3h17m
monitoring                                 4.8.0-fc.9   True        False         False      142m
network                                    4.8.0-fc.9   True        False         False      3h20m
node-tuning                                4.8.0-fc.9   True        False         False      3h18m
openshift-apiserver                        4.8.0-fc.9   True        True          False      147m
openshift-controller-manager               4.8.0-fc.9   False       True          False      147m
openshift-samples                                                                            
operator-lifecycle-manager                 4.8.0-fc.9   True        False         False      3h18m
operator-lifecycle-manager-catalog         4.8.0-fc.9   True        False         False      3h18m
operator-lifecycle-manager-packageserver   4.8.0-fc.9   True        False         False      145m
service-ca                                 4.8.0-fc.9   True        False         False      3h19m
storage                                    4.8.0-fc.9   True        False         False      3h18m

###
We can see the following error directly:
Warning  FailedCreatePodSandBox  2m19s (x475 over 105m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-6-ip-10-0-189-27.us-east-2.compute.internal_openshift-kube-apiserver_c3736355-5551-4c72-b2e7-3a35a9827cc9_0(b55d34075c839e4e10fc83e030d78f0fd7ce94b19e064d14a739a941ccf31fc4): Multus: [openshift-kube-apiserver/installer-6-ip-10-0-189-27.us-east-2.compute.internal]: error getting pod: Unauthorized
###

You can check the attached must-gather logs for detailed error messages.
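
For example, a quick way to surface the failure from the attached must-gather (assuming the archive unpacks with tar; adjust the directory name to match):

```
# Unpack the attachment and grep for the sandbox failure seen in the events.
tar xzf must-gather.tar.gz
grep -ri "error getting pod: Unauthorized" must-gather*/ | head
grep -ril "FailedCreatePodSandBox" must-gather*/ | head
```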

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-14-145150
4.8.0-fc.9-x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install a basic STS cluster; the steps are in the doc below (a rough sketch of the flow follows the link):
https://deploy-preview-32722--osdocs.netlify.app/openshift-enterprise/latest/authentication/managing_cloud_provider_credentials/cco-mode-sts.html
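
A rough sketch of that flow, per the linked doc; the ccoctl-generated manifests step is abbreviated and exact commands may differ by release:

```
# Rough sketch of the STS (manual credentials mode) install flow; not the full procedure.
openshift-install create install-config --dir ./sts-cluster
echo 'credentialsMode: Manual' >> ./sts-cluster/install-config.yaml   # STS requires manual credentials mode
openshift-install create manifests --dir ./sts-cluster
# ...copy the ccoctl-generated credential Secrets and the Authentication CR
#    (with the S3-hosted serviceAccountIssuer) into ./sts-cluster/manifests/...
openshift-install create cluster --dir ./sts-cluster
```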

Actual results:
The cluster install fails.

Expected results:
The cluster should install successfully.

Additional info:

Comment 1 Douglas Smith 2021-06-15 13:29:02 UTC
Hello! This is the first time I've become aware of the STS functionality.

To clarify how Multus generates its secret: first, it's important to realize there are essentially two parts to Multus in the context of OCP.

* Multus CNI binary on disk (which produces the error as shown)
* A daemonset which sets the configuration for this binary.

The Multus binary which runs on disk uses a kubeconfig file. 

That kubeconfig file is generated by the second part -- the daemonset which configures Multus. The pods in this daemonset use the /var/run/secrets/kubernetes.io/serviceaccount/token file and template a kubeconfig.

The kubeconfig is generated here in the entrypoint script @ https://github.com/openshift/multus-cni/blame/master/images/entrypoint.sh#L229-L251
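
For reference, here is a minimal sketch of that kind of templating, assuming the standard service account mount and the kubeconfig path that appears in the generated Multus config later in this bug; the variable names and layout are illustrative, not the script's actual code:

```
# Illustrative sketch only -- not the actual entrypoint.sh logic.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount
SA_TOKEN=$(cat "$SA_DIR/token")
SA_CA=$(base64 -w0 "$SA_DIR/ca.crt")
# Path assumes the daemonset writes through its /host mount to the host filesystem.
cat > /host/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig <<EOF
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}
    certificate-authority-data: ${SA_CA}
users:
- name: multus
  user:
    token: "${SA_TOKEN}"
contexts:
- name: multus-context
  context:
    cluster: local
    user: multus
current-context: multus-context
EOF
```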

Regarding this being a regression -- This functionality has not changed materially in 4.8.0. There was a change to make the creation of this file atomic (in this commit: https://github.com/openshift/multus-cni/commit/6abe8ee06b82ee1dbfd71e1992257238a05191ff ).

Regarding Needinfo:

1. Which team is responsible for the STS feature so that I may consult with them?
2. Is this a release blocker?

Comment 2 Abhinav Dahiya 2021-06-15 17:17:24 UTC
The STS install flow updates:

1. the authentications.config.openshift.io "cluster" object's spec.serviceAccountIssuer to an S3 bucket URL hosting the information about the keys used for signing service accounts.
2. the keys used to sign service account tokens.
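
For reference, the non-default issuer set by the STS flow can be checked with (assuming cluster-admin access):

```
oc get authentication.config.openshift.io cluster -o jsonpath='{.spec.serviceAccountIssuer}{"\n"}'
```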

> That kubeconfig file is generated generated from the second part -- a daemonset which configures Multus. This pods in this daemonset use the /var/run/secrets/kubernetes.io/serviceaccount/token and template a kubeconfig. 

Now, since the daemonset uses the service account token mounted into its pod and transfers it to the CNI binary for authenticating with the k8s API, maybe something has changed in how service account tokens behave when a non-default issuer is set in 4.8?

Comment 3 Douglas Smith 2021-06-15 18:51:08 UTC
Basically, what it boils down to in this case is that Multus CNI generates its kubeconfig from the service account token. Multus is deployed on every OCP cluster, and for pods to come up with networking, Multus needs to be able to talk to the API server properly.

What I need some clarification on here is: Does the kubeconfig for Multus need to be updated on a periodic basis, e.g. when the serviceaccount token changes?

> maybe something has changed in how service account tokens behave when non-default issuer is set in 4.8 ??

Is this change something that may have happened recently, and in which components? I'm unsure whether this means a fix is necessary in another component, or in Multus CNI / the Multus CNI kubeconfig generation process.

----

Additionally, it may be worth noting that Clayton flagged a deficiency in cert rotation for this process, but that was based on a months/years timeline for cert rotation, so it had been de-prioritized: https://bugzilla.redhat.com/show_bug.cgi?id=1947165

Comment 4 wang lin 2021-06-16 03:10:05 UTC
(In reply to Douglas Smith from comment #1)
> Hello! This is the first time I've became aware of the STS functionality.
> 
> To clarify how Multus generates its secret. First, it's important realize
> there's kind of two parts of Multus in the context of OCP. 
> 
> * Multus CNI binary on disk (which produces the error as shown)
> * A daemonset which sets the configuration for this binary.
> 
> The Multus binary which runs on disk uses a kubeconfig file. 
> 
> That kubeconfig file is generated generated from the second part -- a
> daemonset which configures Multus. This pods in this daemonset use the
> /var/run/secrets/kubernetes.io/serviceaccount/token and template a
> kubeconfig. 
> 
> The kubeconfig is generated here in the entrypoint script @
> https://github.com/openshift/multus-cni/blame/master/images/entrypoint.
> sh#L229-L251
> 
> Regarding this being a regression -- This functionality has not changed
> materially in 4.8.0. There was a change to make the creation of this file
> atomic (in this commit:
> https://github.com/openshift/multus-cni/commit/
> 6abe8ee06b82ee1dbfd71e1992257238a05191ff ).
> 
> Regarding Needinfo:
> 
> 1. Which team is responsible for the STS feature so that I may consult with
> them?
I saw you have pinged the right team.

> 2. Is this a release blocker?
It's definitely a release blocker; we need to support STS in 4.8 GA, and right now all STS installations are affected.

Comment 5 wang lin 2021-06-16 04:50:52 UTC
The installer also prints some error messages; pasting them here in the hope they help.

###logs from installer output
INFO Waiting up to 40m0s for the cluster at https://api.lwanstserr0616.qe.devcluster.openshift.com:6443 to initialize... 
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress: 
DEBUG * Could not update oauthclient "console" (413 of 676): the server does not recognize this resource, check extension API servers 
DEBUG * Could not update role "openshift-console-operator/prometheus-k8s" (611 of 676): resource may have been deleted 
DEBUG * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (378 of 676): resource may have been deleted 
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthServerConfigObservation_Error::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::RouterCerts_NoRouterCertSecret: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server 
ERROR OAuthServerConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found 
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.161.119:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready 
ERROR RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found 
INFO Cluster operator authentication Available is False with APIServices_Error::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes: APIServicesAvailable: "oauth.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO APIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.161.119:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
INFO OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found 
INFO ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods). 
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform 
ERROR Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready 
INFO Cluster operator ingress Available is Unknown with IngressDoesNotHaveAvailableCondition: The "default" ingress controller is not reporting an Available status condition. 
INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available. 
ERROR Cluster operator ingress Degraded is Unknown with IngressDoesNotHaveDegradedCondition: The "default" ingress controller is not reporting a Degraded status condition. 
ERROR Cluster operator kube-apiserver Degraded is True with StaticPods_Error: StaticPodsDegraded: pods "kube-apiserver-ip-10-0-147-186.us-east-2.compute.internal" not found 
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 5 
INFO Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 5 
INFO Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 3 
ERROR Cluster operator kube-scheduler Degraded is True with NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 3: 
ERROR NodeInstallerDegraded: installer: map"       
ERROR NodeInstallerDegraded:  },                   
ERROR NodeInstallerDegraded:  CertSecretNames: ([]string) (len=1 cap=1) { 
ERROR NodeInstallerDegraded:   (string) (len=30) "kube-scheduler-client-cert-key" 
ERROR NodeInstallerDegraded:  OptionalCertSecretNamePrefixes: ([]string) <nil>, 
ERROR NodeInstallerDegraded:  CertConfigMapNamePrefixes: ([]string) <nil>, 
ERROR NodeInstallerDegraded:  OptionalCertConfigMapNamePrefixes: ([]string) <nil>, 
ERROR NodeInstallerDegraded:  CertDir: (string) (len=57) "/etc/kubernetes/static-pod-resources/kube-scheduler-certs", 
ERROR NodeInstallerDegraded:  ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources", 
ERROR NodeInstallerDegraded:  PodManifestDir: (string) (len=25) "/etc/kubernetes/manifests", 
ERROR NodeInstallerDegraded:  Timeout: (time.Duration) 2m0s, 
ERROR NodeInstallerDegraded:  PodMutationFns: ([]installerpod.PodMutationFunc) <nil> 
ERROR NodeInstallerDegraded: })                    
ERROR NodeInstallerDegraded: W0616 03:51:44.852191       1 cmd.go:389] unable to get owner reference (falling back to namespace): Unauthorized 
ERROR NodeInstallerDegraded: I0616 03:51:44.852215       1 cmd.go:253] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-scheduler-pod-3" ... 
ERROR NodeInstallerDegraded: I0616 03:51:44.852252       1 cmd.go:181] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-scheduler-pod-3" ... 
ERROR NodeInstallerDegraded: I0616 03:51:44.852264       1 cmd.go:189] Getting secrets ... 
ERROR NodeInstallerDegraded: I0616 03:51:44.853691       1 copy.go:24] Failed to get secret openshift-kube-scheduler/localhost-recovery-client-token-3: Unauthorized 
ERROR NodeInstallerDegraded: W0616 03:51:44.854702       1 recorder.go:198] Error creating event &Event{ObjectMeta:{openshift-kube-scheduler.1688f3992b93a504  openshift-kube-scheduler    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Namespace,Namespace:openshift-kube-scheduler,Name:openshift-kube-scheduler,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:StaticPodInstallerFailed,Message:Installing revision 3: Unauthorized,Source:EventSource{Component:static-pod-installer,Host:,},FirstTimestamp:2021-06-16 03:51:44.853705988 +0000 UTC m=+0.505659592,LastTimestamp:2021-06-16 03:51:44.853705988 +0000 UTC m=+0.505659592,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}: Unauthorized 
ERROR NodeInstallerDegraded: F0616 03:51:44.854805       1 cmd.go:92] failed to copy: Unauthorized 
ERROR NodeInstallerDegraded:                       
ERROR StaticPodsDegraded: pods "openshift-kube-scheduler-ip-10-0-201-98.us-east-2.compute.internal" not found 
ERROR StaticPodsDegraded: pods "openshift-kube-scheduler-ip-10-0-147-186.us-east-2.compute.internal" not found 
INFO Cluster operator kube-scheduler Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 3 
INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.8.0-0.nightly-2021-06-14-145150 
INFO Cluster operator machine-api Available is False with Initializing: Operator is initializing 
INFO Cluster operator monitoring Available is Unknown with :  
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack. 
ERROR Cluster operator monitoring Degraded is Unknown with :  
INFO Cluster operator network ManagementStateDegraded is False with :  
INFO Cluster operator network Progressing is True with Deploying: Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes) 
INFO Cluster operator openshift-apiserver Available is False with APIServices_Error: APIServicesAvailable: "apps.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO APIServicesAvailable: "security.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO APIServicesAvailable: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) 
INFO Cluster operator operator-lifecycle-manager Progressing is True with : Deployed 0.17.0 
INFO Cluster operator operator-lifecycle-manager-catalog Progressing is True with : Deployed 0.17.0 
INFO Cluster operator operator-lifecycle-manager-packageserver Available is False with :  
INFO Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.17.0 
ERROR Cluster initialization failed because one or more operators are not functioning properly. 
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below, 
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation 
FATAL failed to initialize the cluster: Multiple errors are preventing progress: 
FATAL * Could not update oauthclient "console" (413 of 676): the server does not recognize this resource, check extension API servers 
FATAL * Could not update role "openshift-console-operator/prometheus-k8s" (611 of 676): resource may have been deleted 
FATAL * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (378 of 676): resource may have been deleted

Comment 6 zhaozhanqi 2021-06-16 09:56:48 UTC
I checked the stuck cluster yesterday. After recreating the multus pods on the master and worker nodes, the 'error getting pod: Unauthorized' error is resolved and pods can run.
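
The workaround amounts to something like the following (the pod label is an assumption about the openshift-multus daemonset; adjust as needed):

```
# Recreate the multus daemonset pods so the entrypoint re-reads the current
# service account token and rewrites the on-disk kubeconfig.
oc -n openshift-multus delete pod -l app=multus
```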

Comment 7 Sergiusz Urbaniak 2021-06-16 11:17:04 UTC
reading https://bugzilla.redhat.com/show_bug.cgi?id=1972167#c1:

> That kubeconfig file is generated generated from the second part -- a daemonset which configures Multus. This pods in this daemonset use the /var/run/secrets/kubernetes.io/serviceaccount/token and template a kubeconfig.

Just to make sure: the token is read inside the pod, not in the operator? Note that with bound service account tokens there are no pregenerated secrets any more: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume

Instead, a projected volume for the token is used.
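
One way to confirm this on a running cluster is to look for the projected serviceAccountToken source (with its expirationSeconds) in the multus pod spec; the namespace and label here are assumptions:

```
oc -n openshift-multus get pod -l app=multus -o yaml | grep -B2 -A3 serviceAccountToken
```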

Comment 10 Douglas Smith 2021-06-16 15:53:54 UTC
(In reply to Sergiusz Urbaniak from comment #7)
> reading https://bugzilla.redhat.com/show_bug.cgi?id=1972167#c1:
> 
> > That kubeconfig file is generated generated from the second part -- a daemonset which configures Multus. This pods in this daemonset use the /var/run/secrets/kubernetes.io/serviceaccount/token and template a kubeconfig.
> 
> Just to make sure: the token is read inside the pod, not in the operator?

Yes, that's correct. A daemonset is spun up, and the file at /var/run/secrets/kubernetes.io/serviceaccount/token is used.

> Note that with Bound service account tokens there are no pregenerated
> secrets any more:
> https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-
> admin/#bound-service-account-token-volume
> 
> Instead, a projected volume for token secrets is being used.

This is new to me. The thing I find confusing, however, is that it does still work -- sometimes? It's not as though the file is gone; in fact, we check for the presence of that file and error out otherwise @ https://github.com/openshift/multus-cni/blob/master/images/entrypoint.sh#L210

That is, see how comment #6 notes that:

> After recreate the multus pods on master worker. this error 'error getting pod: Unauthorized' can be fixed and pods can be running

Comment 11 Douglas Smith 2021-06-16 20:44:39 UTC
I've performed another analysis of the must-gather given the new information about the Bound service account tokens. 

However, it seems in all of the logs where we create the multus configuration, the service account token file is found.

Find the files in question with:

```
find . | grep -i kube-multus | grep -v additional | grep -i current.log
```

A typical log looks like:


```
2021-06-15T07:35:42.405676700Z 2021-06-15T07:35:42+00:00 Entering watch loop...
2021-06-15T08:28:48.454013698Z Successfully copied files in /usr/src/multus-cni/rhel8/bin/ to /host/opt/cni/bin/
2021-06-15T08:28:48.638201099Z 2021-06-15T08:28:48+00:00 WARN: {unknown parameter "-"}
2021-06-15T08:28:48.649025701Z 2021-06-15T08:28:48+00:00 Entrypoint skipped copying Multus binary.
2021-06-15T08:28:48.678883899Z 2021-06-15T08:28:48+00:00 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2021-06-15T08:28:48.690306190Z 2021-06-15T08:28:48+00:00 Attempting to find master plugin configuration, attempt 0
2021-06-15T08:28:50.981961336Z 2021-06-15T08:28:50+00:00 Nested capabilities string: 
2021-06-15T08:28:50.987954270Z 2021-06-15T08:28:50+00:00 Using /host/var/run/multus/cni/net.d/80-openshift-network.conf as a source to generate the Multus configuration
2021-06-15T08:28:51.000950295Z 2021-06-15T08:28:51+00:00 Config file created @ /host/etc/cni/net.d/00-multus.conf
2021-06-15T08:28:51.001073723Z { "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator", "logLevel": "verbose", "binDir": "/opt/multus/bin", "readinessindicatorfile": "/var/run/multus/cni/net.d/80-openshift-network.conf", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ { "cniVersion": "0.3.1", "name": "openshift-sdn", "type": "openshift-sdn" } ] }
```

It's notable that it does not contain a message about the service account token file path, as it's able to find it, here @ https://github.com/openshift/multus-cni/blob/master/images/entrypoint.sh#L210

Should this not work under this new model? How would I get the service account token in "the new fashion"?

Comment 12 Standa Laznicka 2021-06-17 11:12:42 UTC
The "new" tokens still appear in the same location, however they are tied to a given workload and are being rotated about every hour. Is your approach ready to refresh the kubeconfig when the token changes? Can we get a confirmation that multus can survive two hours without a restart?

Comment 13 Sergiusz Urbaniak 2021-06-17 11:43:31 UTC
Rotation might indeed be the culprit here, upstream also gives a nice overview of how they work: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume

Comment 14 Douglas Smith 2021-06-17 13:44:36 UTC
I've consulted with my team and we've settled on a path forward. I'm planning on having a PR in flight this afternoon (US Eastern), and hopefully merged as well.
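
The merged PR (openshift/multus-cni#108, linked above) is titled "Updates entrypoint to rebuild kubeconfig when service account token or ca changes"; a minimal polling sketch of that idea, not the actual implementation, would be:

```
# Illustrative sketch only: rebuild the kubeconfig whenever the projected token
# or CA bundle changes on disk.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount
prev=""
while true; do
  cur=$(cat "$SA_DIR/token" "$SA_DIR/ca.crt" | md5sum)
  if [ "$cur" != "$prev" ]; then
    generate_kubeconfig   # placeholder for the templating step sketched in comment 1
    prev="$cur"
  fi
  sleep 1
done
```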

Comment 15 Douglas Smith 2021-06-18 13:42:11 UTC
*** Bug 1972490 has been marked as a duplicate of this bug. ***

Comment 16 Douglas Smith 2021-06-18 13:42:29 UTC
*** Bug 1948066 has been marked as a duplicate of this bug. ***

Comment 18 zhaozhanqi 2021-06-21 08:29:08 UTC
@lwang 

I triggered one STS cluster install and found that the job failed; however, the cluster is working well.

https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/25812/console


$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h12m
baremetal                                  4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
cloud-credential                           4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
cluster-autoscaler                         4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h34m
config-operator                            4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h35m
console                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h25m
csi-snapshot-controller                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h41m
dns                                        4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
etcd                                       4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h31m
image-registry                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h42m
ingress                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h32m
insights                                   4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h41m
kube-apiserver                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h30m
kube-controller-manager                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
kube-scheduler                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h32m
kube-storage-version-migrator              4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h35m
machine-api                                4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h30m
machine-approver                           4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
machine-config                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
marketplace                                4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
monitoring                                 4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h32m
network                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h35m
node-tuning                                4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
openshift-apiserver                        4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h42m
openshift-controller-manager               4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
openshift-samples                          4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h41m
operator-lifecycle-manager                 4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h42m
service-ca                                 4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h35m
storage                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h35m

I think this multus issue is fixed.

Comment 19 wang lin 2021-06-22 03:33:31 UTC
Yes, I no longer see the `Failed to create pod sandbox` error. The installation failure may be because other components hit the same Unauthorized issue as Multus due to token rotation.

Comment 20 zhaozhanqi 2021-06-22 03:35:23 UTC
OK, thank you Wang Lin. I'm moving this bug to VERIFIED.

Comment 21 W. Trevor King 2021-06-22 03:42:02 UTC
$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&name=4.9&search=reason/FailedCreatePodSandBox' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*://;s/\(pods\?\) "[^"]*" not found/\1 "..." not found/' | sort | uniq -c | sort -n | tail
      1  failed to find plugin "openshift-sdn" in path [/opt/multus/bin /var/lib/cni/bin /usr/libexec/cni]
      1  file does not exist
      1  missing content-type field
      2  request timed out
      3  EOF
      3  pods "..." not found
      3  timed out while waiting for OVS port binding
      6 
     12  i/o timeout
     41  Unauthorized
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&name=4.9&search=reason/FailedCreatePodSandBox.*Unauthorized' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 15 runs, 87% failed, 54% of failures match = 47% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 17 runs, 76% failed, 100% of failures match = 76% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 14 runs, 29% failed, 225% of failures match = 64% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 6 runs, 50% failed, 167% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-metal-ipi-upgrade (all) - 6 runs, 83% failed, 80% of failures match = 67% impact
pull-ci-openshift-etcd-openshift-4.9-e2e-aws-upgrade (all) - 7 runs, 86% failed, 17% of failures match = 14% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

So not entirely gone in 4.9 CI yet, but maybe once a few more 4.9 releases have been accepted and the fix has been backported to 4.8, the remaining jobs will green up too.

Comment 22 Sebastian Łaskawiec 2021-06-22 06:21:49 UTC
*** Bug 1972693 has been marked as a duplicate of this bug. ***

Comment 25 errata-xmlrpc 2021-07-27 23:12:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

