Created attachment 1791239 [details]
must-gather

Description of problem:

We can't install an STS cluster using the latest nightly build; several operators are in Degraded status.

What is STS? https://github.com/openshift/cloud-credential-operator/blob/master/docs/sts.md

What's the issue? Several operators are degraded:

$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-fc.9   False       True          True       3h19m
baremetal                                  4.8.0-fc.9   True        False         False      3h17m
cloud-credential                           4.8.0-fc.9   True        False         False      3h17m
cluster-autoscaler                         4.8.0-fc.9   True        False         False      3h18m
config-operator                            4.8.0-fc.9   True        False         False      3h19m
console
csi-snapshot-controller                    4.8.0-fc.9   True        False         False      3h19m
dns                                        4.8.0-fc.9   True        False         False      3h18m
etcd                                       4.8.0-fc.9   True        True          True       3h17m
image-registry                             4.8.0-fc.9   True        False         False      146m
ingress                                    4.8.0-fc.9   True        False         False      144m
insights                                   4.8.0-fc.9   True        False         False      146m
kube-apiserver                             4.8.0-fc.9   False       True          True       3h19m
kube-controller-manager                    4.8.0-fc.9   True        True          True       3h17m
kube-scheduler                             4.8.0-fc.9   True        True          True       3h16m
kube-storage-version-migrator              4.8.0-fc.9   True        False         False      3h19m
machine-api                                4.8.0-fc.9   True        False         False      143m
machine-approver                           4.8.0-fc.9   True        False         False      3h18m
machine-config                             4.8.0-fc.9   True        False         False      3h17m
marketplace                                4.8.0-fc.9   True        False         False      3h17m
monitoring                                 4.8.0-fc.9   True        False         False      142m
network                                    4.8.0-fc.9   True        False         False      3h20m
node-tuning                                4.8.0-fc.9   True        False         False      3h18m
openshift-apiserver                        4.8.0-fc.9   True        True          False      147m
openshift-controller-manager               4.8.0-fc.9   False       True          False      147m
openshift-samples
operator-lifecycle-manager                 4.8.0-fc.9   True        False         False      3h18m
operator-lifecycle-manager-catalog         4.8.0-fc.9   True        False         False      3h18m
operator-lifecycle-manager-packageserver   4.8.0-fc.9   True        False         False      145m
service-ca                                 4.8.0-fc.9   True        False         False      3h19m
storage                                    4.8.0-fc.9   True        False         False      3h18m

### We can see the below error directly

Warning  FailedCreatePodSandBox  2m19s (x475 over 105m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-6-ip-10-0-189-27.us-east-2.compute.internal_openshift-kube-apiserver_c3736355-5551-4c72-b2e7-3a35a9827cc9_0(b55d34075c839e4e10fc83e030d78f0fd7ce94b19e064d14a739a941ccf31fc4): Multus: [openshift-kube-apiserver/installer-6-ip-10-0-189-27.us-east-2.compute.internal]: error getting pod: Unauthorized

### See the must-gather logs for detailed error messages

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-14-145150
4.8.0-fc.9-x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install a basic STS cluster; the steps are in the doc https://deploy-preview-32722--osdocs.netlify.app/openshift-enterprise/latest/authentication/managing_cloud_provider_credentials/cco-mode-sts.html

Actual results:
The cluster fails to install.

Expected results:
The cluster should install successfully.

Additional info:
Hello! This is the first time I've become aware of the STS functionality.

To clarify how Multus generates its kubeconfig: first, it's important to realize there are two parts of Multus in the context of OCP:

* The Multus CNI binary on disk (which produces the error as shown)
* A daemonset which sets the configuration for this binary

The Multus binary which runs on disk uses a kubeconfig file.

That kubeconfig file is generated by the second part -- the daemonset which configures Multus. The pods in this daemonset read /var/run/secrets/kubernetes.io/serviceaccount/token and template a kubeconfig from it (a simplified sketch of that templating follows below).

The kubeconfig is generated here in the entrypoint script @ https://github.com/openshift/multus-cni/blame/master/images/entrypoint.sh#L229-L251

Regarding this being a regression -- this functionality has not changed materially in 4.8.0. There was a change to make the creation of this file atomic (in this commit: https://github.com/openshift/multus-cni/commit/6abe8ee06b82ee1dbfd71e1992257238a05191ff ).

Regarding the needinfo:

1. Which team is responsible for the STS feature, so that I may consult with them?
2. Is this a release blocker?
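For anyone following along who doesn't want to read the entrypoint script: the templating works roughly like the sketch below. This is a simplified illustration, not the actual entrypoint.sh (the real script, linked above, also handles atomic file creation and error checking), and the output path here is just a placeholder:

```
#!/bin/bash
# Simplified sketch of templating a kubeconfig from the pod's service account
# token (illustrative only; the output path is an example, not the real one).
SERVICE_ACCOUNT_PATH=/var/run/secrets/kubernetes.io/serviceaccount
KUBE_CA_FILE=$SERVICE_ACCOUNT_PATH/ca.crt
SERVICEACCOUNT_TOKEN=$(cat "$SERVICE_ACCOUNT_PATH/token")

cat > /host/etc/cni/net.d/multus.d/multus.kubeconfig <<EOF
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}
    certificate-authority-data: $(base64 -w0 < "$KUBE_CA_FILE")
users:
- name: multus
  user:
    token: "${SERVICEACCOUNT_TOKEN}"
contexts:
- name: multus-context
  context:
    cluster: local
    user: multus
current-context: multus-context
EOF
```

Note that the token value is embedded into the kubeconfig at the time it is generated.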
The STS install flow updates:

1. the authentications.config.openshift.io "cluster" object's spec.serviceAccountIssuer, setting it to an S3 bucket URL that hosts the information about the keys used for signing service account tokens, and
2. the keys used to sign the service account tokens.

> That kubeconfig file is generated by the second part -- the daemonset which configures Multus. The pods in this daemonset read /var/run/secrets/kubernetes.io/serviceaccount/token and template a kubeconfig.

Now, since the daemonset is using the service account token mounted into its pod and transferring it to the CNI binary for authenticating with the Kubernetes API, maybe something has changed in how service account tokens behave when a non-default issuer is set in 4.8?
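As a quick sanity check (my own suggestion, not part of the flow above), the non-default issuer described in point 1 can be read straight off the cluster authentication object:

```
# On an STS/manual-mode cluster this should print the S3 bucket URL rather
# than being empty (empty means the default in-cluster issuer is used)
oc get authentication cluster -o jsonpath='{.spec.serviceAccountIssuer}{"\n"}'
```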
Basically, what it boils down to in this case is that Multus CNI generates its kubeconfig from the service account token. Multus is deployed on every OCP cluster, and in order for pods to come up with networking, Multus needs to be able to talk to the API server.

What I need some clarification on here is: does the kubeconfig for Multus need to be updated on a periodic basis, e.g. when the service account token changes?

> maybe something has changed in how service account tokens behave when non-default issuer is set in 4.8 ??

Is this change something that may have happened recently, and in which components? I'm unsure whether this means a fix is necessary in another component, or in Multus CNI / the Multus CNI kubeconfig generation process.

----

Additionally, it may be worth noting that Clayton pointed out a deficiency in cert rotation for this process, but that was based on a months/years timeline for cert rotation, so it had been de-prioritized: https://bugzilla.redhat.com/show_bug.cgi?id=1947165
(In reply to Douglas Smith from comment #1)
> Hello! This is the first time I've become aware of the STS functionality.
>
> To clarify how Multus generates its kubeconfig: first, it's important to
> realize there are two parts of Multus in the context of OCP:
>
> * The Multus CNI binary on disk (which produces the error as shown)
> * A daemonset which sets the configuration for this binary
>
> The Multus binary which runs on disk uses a kubeconfig file.
>
> That kubeconfig file is generated by the second part -- the daemonset which
> configures Multus. The pods in this daemonset read
> /var/run/secrets/kubernetes.io/serviceaccount/token and template a
> kubeconfig.
>
> The kubeconfig is generated here in the entrypoint script @
> https://github.com/openshift/multus-cni/blame/master/images/entrypoint.sh#L229-L251
>
> Regarding this being a regression -- this functionality has not changed
> materially in 4.8.0. There was a change to make the creation of this file
> atomic (in this commit:
> https://github.com/openshift/multus-cni/commit/6abe8ee06b82ee1dbfd71e1992257238a05191ff ).
>
> Regarding the needinfo:
>
> 1. Which team is responsible for the STS feature, so that I may consult with
> them?

I see you have already pinged the right team.

> 2. Is this a release blocker?

It's definitely a release blocker; we need to support STS in 4.8 GA, and right now all STS installations are affected.
The installer also prints some error messages; pasting them here in the hope that they help.

### Logs from installer output

INFO Waiting up to 40m0s for the cluster at https://api.lwanstserr0616.qe.devcluster.openshift.com:6443 to initialize...
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress:
DEBUG * Could not update oauthclient "console" (413 of 676): the server does not recognize this resource, check extension API servers
DEBUG * Could not update role "openshift-console-operator/prometheus-k8s" (611 of 676): resource may have been deleted
DEBUG * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (378 of 676): resource may have been deleted
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthServerConfigObservation_Error::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::RouterCerts_NoRouterCertSecret: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
ERROR OAuthServerConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.161.119:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
ERROR RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found
INFO Cluster operator authentication Available is False with APIServices_Error::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes: APIServicesAvailable: "oauth.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO APIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.161.119:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
INFO OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
INFO ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
ERROR Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready
INFO Cluster operator ingress Available is Unknown with IngressDoesNotHaveAvailableCondition: The "default" ingress controller is not reporting an Available status condition.
INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
ERROR Cluster operator ingress Degraded is Unknown with IngressDoesNotHaveDegradedCondition: The "default" ingress controller is not reporting a Degraded status condition.
ERROR Cluster operator kube-apiserver Degraded is True with StaticPods_Error: StaticPodsDegraded: pods "kube-apiserver-ip-10-0-147-186.us-east-2.compute.internal" not found
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 5
INFO Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 5
INFO Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 3
ERROR Cluster operator kube-scheduler Degraded is True with NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 3:
ERROR NodeInstallerDegraded: installer: map"
ERROR NodeInstallerDegraded: },
ERROR NodeInstallerDegraded: CertSecretNames: ([]string) (len=1 cap=1) {
ERROR NodeInstallerDegraded: (string) (len=30) "kube-scheduler-client-cert-key"
ERROR NodeInstallerDegraded: OptionalCertSecretNamePrefixes: ([]string) <nil>,
ERROR NodeInstallerDegraded: CertConfigMapNamePrefixes: ([]string) <nil>,
ERROR NodeInstallerDegraded: OptionalCertConfigMapNamePrefixes: ([]string) <nil>,
ERROR NodeInstallerDegraded: CertDir: (string) (len=57) "/etc/kubernetes/static-pod-resources/kube-scheduler-certs",
ERROR NodeInstallerDegraded: ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources",
ERROR NodeInstallerDegraded: PodManifestDir: (string) (len=25) "/etc/kubernetes/manifests",
ERROR NodeInstallerDegraded: Timeout: (time.Duration) 2m0s,
ERROR NodeInstallerDegraded: PodMutationFns: ([]installerpod.PodMutationFunc) <nil>
ERROR NodeInstallerDegraded: })
ERROR NodeInstallerDegraded: W0616 03:51:44.852191 1 cmd.go:389] unable to get owner reference (falling back to namespace): Unauthorized
ERROR NodeInstallerDegraded: I0616 03:51:44.852215 1 cmd.go:253] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-scheduler-pod-3" ...
ERROR NodeInstallerDegraded: I0616 03:51:44.852252 1 cmd.go:181] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-scheduler-pod-3" ...
ERROR NodeInstallerDegraded: I0616 03:51:44.852264 1 cmd.go:189] Getting secrets ...
ERROR NodeInstallerDegraded: I0616 03:51:44.853691 1 copy.go:24] Failed to get secret openshift-kube-scheduler/localhost-recovery-client-token-3: Unauthorized
ERROR NodeInstallerDegraded: W0616 03:51:44.854702 1 recorder.go:198] Error creating event &Event{ObjectMeta:{openshift-kube-scheduler.1688f3992b93a504 openshift-kube-scheduler 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Namespace,Namespace:openshift-kube-scheduler,Name:openshift-kube-scheduler,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:StaticPodInstallerFailed,Message:Installing revision 3: Unauthorized,Source:EventSource{Component:static-pod-installer,Host:,},FirstTimestamp:2021-06-16 03:51:44.853705988 +0000 UTC m=+0.505659592,LastTimestamp:2021-06-16 03:51:44.853705988 +0000 UTC m=+0.505659592,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}: Unauthorized
ERROR NodeInstallerDegraded: F0616 03:51:44.854805 1 cmd.go:92] failed to copy: Unauthorized
ERROR NodeInstallerDegraded:
ERROR StaticPodsDegraded: pods "openshift-kube-scheduler-ip-10-0-201-98.us-east-2.compute.internal" not found
ERROR StaticPodsDegraded: pods "openshift-kube-scheduler-ip-10-0-147-186.us-east-2.compute.internal" not found
INFO Cluster operator kube-scheduler Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 3
INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.8.0-0.nightly-2021-06-14-145150
INFO Cluster operator machine-api Available is False with Initializing: Operator is initializing
INFO Cluster operator monitoring Available is Unknown with :
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
ERROR Cluster operator monitoring Degraded is Unknown with :
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
INFO Cluster operator openshift-apiserver Available is False with APIServices_Error: APIServicesAvailable: "apps.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO APIServicesAvailable: "security.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO APIServicesAvailable: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO Cluster operator operator-lifecycle-manager Progressing is True with : Deployed 0.17.0
INFO Cluster operator operator-lifecycle-manager-catalog Progressing is True with : Deployed 0.17.0
INFO Cluster operator operator-lifecycle-manager-packageserver Available is False with :
INFO Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.17.0
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Multiple errors are preventing progress:
FATAL * Could not update oauthclient "console" (413 of 676): the server does not recognize this resource, check extension API servers
FATAL * Could not update role "openshift-console-operator/prometheus-k8s" (611 of 676): resource may have been deleted
FATAL * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (378 of 676): resource may have been deleted
I checked the stuck cluster yesterday. After recreating the multus pods on the master and worker nodes, the 'error getting pod: Unauthorized' error goes away and the pods are able to run.
Reading https://bugzilla.redhat.com/show_bug.cgi?id=1972167#c1:

> That kubeconfig file is generated by the second part -- the daemonset which configures Multus. The pods in this daemonset read /var/run/secrets/kubernetes.io/serviceaccount/token and template a kubeconfig.

Just to make sure: the token is read inside the pod, not in the operator?

Note that with bound service account tokens there are no pre-generated secrets any more: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume

Instead, a projected volume for the token is used.
(In reply to Sergiusz Urbaniak from comment #7)
> > That kubeconfig file is generated by the second part -- the daemonset which configures Multus. The pods in this daemonset read /var/run/secrets/kubernetes.io/serviceaccount/token and template a kubeconfig.
>
> Just to make sure: the token is read inside the pod, not in the operator?

Yes, that's correct. A daemonset is spun up, and the file at /var/run/secrets/kubernetes.io/serviceaccount/token is used.

> Note that with bound service account tokens there are no pre-generated
> secrets any more:
> https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
>
> Instead, a projected volume for the token is used.

This is new to me. What I find confusing, however, is that it does still work -- sometimes? It's not like the file is gone; in fact, we check for the presence of that file and error out otherwise @ https://github.com/openshift/multus-cni/blob/master/images/entrypoint.sh#L210

That is, see how comment #6 notes that:

> After recreating the multus pods on the master and worker nodes, the 'error getting pod: Unauthorized' error goes away and the pods are able to run.
I've performed another analysis of the must-gather given the new information about bound service account tokens. However, it seems that in all of the logs where we create the Multus configuration, the service account token file is found.

Find the files in question with:

```
find . | grep -i kube-multus | grep -v additional | grep -i current.log
```

A typical log looks like:

```
2021-06-15T07:35:42.405676700Z 2021-06-15T07:35:42+00:00 Entering watch loop...
2021-06-15T08:28:48.454013698Z Successfully copied files in /usr/src/multus-cni/rhel8/bin/ to /host/opt/cni/bin/
2021-06-15T08:28:48.638201099Z 2021-06-15T08:28:48+00:00 WARN: {unknown parameter "-"}
2021-06-15T08:28:48.649025701Z 2021-06-15T08:28:48+00:00 Entrypoint skipped copying Multus binary.
2021-06-15T08:28:48.678883899Z 2021-06-15T08:28:48+00:00 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2021-06-15T08:28:48.690306190Z 2021-06-15T08:28:48+00:00 Attempting to find master plugin configuration, attempt 0
2021-06-15T08:28:50.981961336Z 2021-06-15T08:28:50+00:00 Nested capabilities string:
2021-06-15T08:28:50.987954270Z 2021-06-15T08:28:50+00:00 Using /host/var/run/multus/cni/net.d/80-openshift-network.conf as a source to generate the Multus configuration
2021-06-15T08:28:51.000950295Z 2021-06-15T08:28:51+00:00 Config file created @ /host/etc/cni/net.d/00-multus.conf
2021-06-15T08:28:51.001073723Z { "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator", "logLevel": "verbose", "binDir": "/opt/multus/bin", "readinessindicatorfile": "/var/run/multus/cni/net.d/80-openshift-network.conf", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ { "cniVersion": "0.3.1", "name": "openshift-sdn", "type": "openshift-sdn" } ] }
```

Notably, the log does not contain a message about a missing service account token file path -- that is, the script is able to find the token, per the check here @ https://github.com/openshift/multus-cni/blob/master/images/entrypoint.sh#L210

Should this not work under this new model? How would I get the service account token in "the new fashion"?
The "new" tokens still appear in the same location, however they are tied to a given workload and are being rotated about every hour. Is your approach ready to refresh the kubeconfig when the token changes? Can we get a confirmation that multus can survive two hours without a restart?
Rotation might indeed be the culprit here; upstream also gives a nice overview of how bound tokens work: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
I've consulted with my team and we've settled on a path forward. I'm planning to have a PR in flight this afternoon (US Eastern), and hopefully merged as well.
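Purely as an illustration of the general direction being discussed (a sketch of one possible approach, not necessarily what the PR does; generate_kubeconfig is a hypothetical stand-in for the templating step sketched earlier in this bug), a refresh could look like a loop that re-templates the kubeconfig whenever the projected token file changes:

```
# Illustrative sketch only -- not the actual fix.
TOKEN_FILE=/var/run/secrets/kubernetes.io/serviceaccount/token
last_checksum=""

while true; do
  current_checksum=$(md5sum "$TOKEN_FILE" | cut -d' ' -f1)
  if [ "$current_checksum" != "$last_checksum" ]; then
    # The kubelet rotated the projected token; re-template the kubeconfig
    # so the CNI binary on disk picks up the fresh credential.
    generate_kubeconfig   # hypothetical helper, see the earlier sketch
    last_checksum="$current_checksum"
  fi
  sleep 60
done
```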
*** Bug 1972490 has been marked as a duplicate of this bug. ***
*** Bug 1948066 has been marked as a duplicate of this bug. ***
@lwang I triggered an STS cluster install. The Jenkins job failed, but the cluster itself is working well:
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/25812/console

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h12m
baremetal                                  4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
cloud-credential                           4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
cluster-autoscaler                         4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h34m
config-operator                            4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h35m
console                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h25m
csi-snapshot-controller                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h41m
dns                                        4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
etcd                                       4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h31m
image-registry                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h42m
ingress                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h32m
insights                                   4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h41m
kube-apiserver                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h30m
kube-controller-manager                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
kube-scheduler                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h32m
kube-storage-version-migrator              4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h35m
machine-api                                4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h30m
machine-approver                           4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
machine-config                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
marketplace                                4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
monitoring                                 4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h32m
network                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h35m
node-tuning                                4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
openshift-apiserver                        4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h42m
openshift-controller-manager               4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
openshift-samples                          4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h41m
operator-lifecycle-manager                 4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h33m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h42m
service-ca                                 4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h35m
storage                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h35m

I think this Multus issue is fixed now.
Yes, I no longer see the `Failed to create pod sandbox` error. The installation failure may be because other components hit the same Unauthorized issue as Multus due to the token rotation.
OK, thank you Wang Lin. I have moved this bug to VERIFIED.
$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&name=4.9&search=reason/FailedCreatePodSandBox' | jq -r 'to_entries [].value | to_entries[].value[].context[]' | sed 's/.*://;s/\(pods\?\) "[^"]*" not found/\1 "..." not found/' | sort | uniq -c | sort -n | tail
      1 failed to find plugin "openshift-sdn" in path [/opt/multus/bin /var/lib/cni/bin /usr/libexec/cni]
      1 file does not exist
      1 missing content-type field
      2 request timed out
      3 EOF
      3 pods "..." not found
      3 timed out while waiting for OVS port binding
      6 
     12 i/o timeout
     41 Unauthorized

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&name=4.9&search=reason/FailedCreatePodSandBox.*Unauthorized' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 15 runs, 87% failed, 54% of failures match = 47% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 17 runs, 76% failed, 100% of failures match = 76% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 14 runs, 29% failed, 225% of failures match = 64% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 6 runs, 50% failed, 167% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-metal-ipi-upgrade (all) - 6 runs, 83% failed, 80% of failures match = 67% impact
pull-ci-openshift-etcd-openshift-4.9-e2e-aws-upgrade (all) - 7 runs, 86% failed, 17% of failures match = 14% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

So not entirely gone in 4.9 CI yet, but maybe once a few more 4.9 releases have been accepted and the fix has been backported to 4.8, the remaining jobs will green up too.
*** Bug 1972693 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438