Description of problem:
The vpc-node-label-updater init container fails to reach the API server because of a transient network error ("connection refused") and never adds the required node labels. As a result the ibm-vpc-block-csi-node pod on that worker ends up in CrashLoopBackOff and co/storage stays degraded. Enhance the vpc-node-label-updater to avoid that network issue.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-23-153012

How reproducible:
30%

Steps to Reproduce:
1. oc get co
NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE  MESSAGE
authentication  4.10.0-0.nightly-2021-12-23-153012  False  False  True  67m  OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
baremetal  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
cloud-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  69m
cloud-credential  4.10.0-0.nightly-2021-12-23-153012  True  False  False  64m
cluster-autoscaler  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
config-operator  4.10.0-0.nightly-2021-12-23-153012  True  False  False  67m
console  4.10.0-0.nightly-2021-12-23-153012  False  True  False  51m  RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com": dial tcp: lookup console-openshift-console.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com on 172.30.0.10:53: no such host
csi-snapshot-controller  4.10.0-0.nightly-2021-12-23-153012  True  False  False  66m
dns  4.10.0-0.nightly-2021-12-23-153012  True  False  False  64m
etcd  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
image-registry  4.10.0-0.nightly-2021-12-23-153012  True  False  False  51m
ingress  4.10.0-0.nightly-2021-12-23-153012  True  False  True  51m  The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
insights  4.10.0-0.nightly-2021-12-23-153012  True  False  False  60m
kube-apiserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  54m
kube-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
kube-scheduler  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
kube-storage-version-migrator  4.10.0-0.nightly-2021-12-23-153012  True  False  False  66m
machine-api  4.10.0-0.nightly-2021-12-23-153012  True  False  False  61m
machine-approver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
machine-config  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
marketplace  4.10.0-0.nightly-2021-12-23-153012  True  False  False  64m
monitoring  4.10.0-0.nightly-2021-12-23-153012  True  False  False  48m
network  4.10.0-0.nightly-2021-12-23-153012  True  False  False  67m
node-tuning  4.10.0-0.nightly-2021-12-23-153012  True  False  False  53m
openshift-apiserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  55m
openshift-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  66m
openshift-samples  4.10.0-0.nightly-2021-12-23-153012  True  False  False  59m
operator-lifecycle-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
operator-lifecycle-manager-catalog  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
operator-lifecycle-manager-packageserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  60m
service-ca  4.10.0-0.nightly-2021-12-23-153012  True  False  False  66m
storage  4.10.0-0.nightly-2021-12-23-153012  True  True  False  50m  IBMVPCBlockCSIDriverOperatorCRProgressing: IBMBlockDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods

2. oc get co
NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE  MESSAGE
authentication  4.10.0-0.nightly-2021-12-23-153012  True  False  False  24m
baremetal  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
cloud-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  103m
cloud-credential  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
cluster-autoscaler  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
config-operator  4.10.0-0.nightly-2021-12-23-153012  True  False  False  100m
console  4.10.0-0.nightly-2021-12-23-153012  True  False  False  24m
csi-snapshot-controller  4.10.0-0.nightly-2021-12-23-153012  True  False  False  100m
dns  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
etcd  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
image-registry  4.10.0-0.nightly-2021-12-23-153012  True  False  False  85m
ingress  4.10.0-0.nightly-2021-12-23-153012  True  False  False  84m
insights  4.10.0-0.nightly-2021-12-23-153012  True  False  False  94m
kube-apiserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  87m
kube-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
kube-scheduler  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
kube-storage-version-migrator  4.10.0-0.nightly-2021-12-23-153012  True  False  False  100m
machine-api  4.10.0-0.nightly-2021-12-23-153012  True  False  False  95m
machine-approver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
machine-config  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
marketplace  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
monitoring  4.10.0-0.nightly-2021-12-23-153012  True  False  False  82m
network  4.10.0-0.nightly-2021-12-23-153012  True  False  False  101m
node-tuning  4.10.0-0.nightly-2021-12-23-153012  True  False  False  86m
openshift-apiserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  89m
openshift-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  99m
openshift-samples  4.10.0-0.nightly-2021-12-23-153012  True  False  False  92m
operator-lifecycle-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  99m
operator-lifecycle-manager-catalog  4.10.0-0.nightly-2021-12-23-153012  True  False  False  99m
operator-lifecycle-manager-packageserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  93m
service-ca  4.10.0-0.nightly-2021-12-23-153012  True  False  False  100m
storage  4.10.0-0.nightly-2021-12-23-153012  True  True  False  84m  IBMVPCBlockCSIDriverOperatorCRProgressing: IBMBlockDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods

3. oc get pods
NAME  READY  STATUS  RESTARTS  AGE
ibm-vpc-block-csi-controller-5dc949cf6-tsls4  5/5  Running  0  105m
ibm-vpc-block-csi-driver-operator-5bfd7d58b8-9dgpk  1/1  Running  0  105m
ibm-vpc-block-csi-node-b8475  3/3  Running  0  105m
ibm-vpc-block-csi-node-dkvq8  3/3  Running  0  95m
ibm-vpc-block-csi-node-gg782  3/3  Running  0  105m
ibm-vpc-block-csi-node-w2w9b  3/3  Running  0  95m
ibm-vpc-block-csi-node-w4qss  3/3  Running  0  105m
ibm-vpc-block-csi-node-x9xn2  2/3  CrashLoopBackOff  22 (4m2s ago)  95m

4. oc logs pods/ibm-vpc-block-csi-node-x9xn2 -c csi-driver-registrar
I0105 07:55:53.225272 1 main.go:164] Version: v4.10.0-202112140948.p0.gbb0bd82.assembly.stream-0-gc11b47a-dirty
I0105 07:55:53.225377 1 main.go:165] Running node-driver-registrar in mode=registration
I0105 07:55:53.246458 1 main.go:189] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0105 07:55:53.246487 1 connection.go:154] Connecting to unix:///csi/csi.sock
I0105 07:55:53.247055 1 main.go:196] Calling CSI driver to discover driver name
I0105 07:55:53.247092 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0105 07:55:53.247098 1 connection.go:184] GRPC request: {}
I0105 07:55:53.250284 1 connection.go:186] GRPC response: {"name":"vpc.block.csi.ibm.io","vendor_version":"vpcBlockDriver-279a88c2795e0ff8ee88784c0140cf7f0492a176"}
I0105 07:55:53.250348 1 connection.go:187] GRPC error: <nil>
I0105 07:55:53.250357 1 main.go:206] CSI driver name: "vpc.block.csi.ibm.io"
I0105 07:55:53.250418 1 node_register.go:52] Starting Registration Server at: /registration/vpc.block.csi.ibm.io-reg.sock
I0105 07:55:53.250582 1 node_register.go:61] Registration Server started at: /registration/vpc.block.csi.ibm.io-reg.sock
I0105 07:55:53.250658 1 node_register.go:91] Skipping healthz server because HTTP endpoint is set to: ""
I0105 07:55:54.642287 1 main.go:100] Received GetInfo call: &InfoRequest{}
I0105 07:55:54.642607 1 main.go:107] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/vpc.block.csi.ibm.io/registration"
I0105 07:55:54.697390 1 main.go:118] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = NotFound desc = {RequestID: 06ebd314-01e2-4c65-9a76-ca6341d74aa9 , Code: NodeMetadataInitFailed, Description: Failed to initialize node metadata, BackendError: One or few required node label(s) is/are missing [ibm-cloud.kubernetes.io/worker-id, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone]. Node Labels Found = [#map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:bx2d-4x16 beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:us-south failure-domain.beta.kubernetes.io/zone:us-south-1 kubernetes.io/arch:amd64 kubernetes.io/hostname:qe-chaoibm10-cxqj6-worker-1-7zq9m kubernetes.io/os:linux node-role.kubernetes.io/worker: node.kubernetes.io/instance-type:bx2d-4x16 node.openshift.io/os_id:rhcos topology.kubernetes.io/region:us-south topology.kubernetes.io/zone:us-south-1]], Action: Please check the node labels as per BackendError, accordingly you may add the labels manually},}
E0105 07:55:54.697436 1 main.go:120] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = NotFound desc = {RequestID: 06ebd314-01e2-4c65-9a76-ca6341d74aa9 , Code: NodeMetadataInitFailed, Description: Failed to initialize node metadata, BackendError: One or few required node label(s) is/are missing [ibm-cloud.kubernetes.io/worker-id, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone]. Node Labels Found = [#map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:bx2d-4x16 beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:us-south failure-domain.beta.kubernetes.io/zone:us-south-1 kubernetes.io/arch:amd64 kubernetes.io/hostname:qe-chaoibm10-cxqj6-worker-1-7zq9m kubernetes.io/os:linux node-role.kubernetes.io/worker: node.kubernetes.io/instance-type:bx2d-4x16 node.openshift.io/os_id:rhcos topology.kubernetes.io/region:us-south topology.kubernetes.io/zone:us-south-1]], Action: Please check the node labels as per BackendError, accordingly you may add the labels manually}, restarting registration container.

5. oc logs ibm-vpc-block-csi-node-x9xn2 -c vpc-node-label-updater
{"level":"info","timestamp":"2022-01-05T06:25:05.847Z","caller":"cmd/main.go:105","msg":"Starting controller for adding node labels","watcher-name":"vpc-node-label-updater"}
{"level":"info","timestamp":"2022-01-05T06:25:16.259Z","caller":"cmd/main.go:113","msg":"Error retrieving the Node from the index for a given node. Error :","watcher-name":"vpc-node-label-updater","error":"Get \"https://172.30.0.1:443/api/v1/nodes/qe-chaoibm10-cxqj6-worker-1-7zq9m\": dial tcp 172.30.0.1:443: connect: connection refused"}

6. Manually add the missing labels to the worker (example command under Additional info below); co/storage then becomes normal:
vpc-block-csi-driver-labels: "true"
ibm-cloud.kubernetes.io/worker-id: 0727_1c399df3-e42e-4d5e-97ed-522af821d682

Actual results:
CO/storage is DEGRADED

Expected results:
Enhance the vpc-node-label-updater to avoid that network issue.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
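For reference, one possible way to apply the manual workaround from step 6 (example invocation only; the node name is the affected worker from the logs above, and the worker-id value must be that worker's VPC instance ID, here the value from step 6):

oc label node qe-chaoibm10-cxqj6-worker-1-7zq9m "ibm-cloud.kubernetes.io/worker-id=0727_1c399df3-e42e-4d5e-97ed-522af821d682" "vpc-block-csi-driver-labels=true"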
It seems to me that the node labeller should retry a few times before giving up (with exponential backoff?). In addition, Kubernetes should not even start the driver container until all init containers succeed - does the labeller return a proper exit code?
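To illustrate the suggestion, a minimal sketch (not the actual vpc-node-label-updater code) of retrying the node lookup with client-go's exponential backoff and exiting non-zero on failure; it assumes the node name is injected through a NODE_NAME environment variable:

package main

import (
	"context"
	"os"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

func main() {
	nodeName := os.Getenv("NODE_NAME") // assumption: injected via the downward API

	config, err := rest.InClusterConfig()
	if err != nil {
		klog.Errorf("building in-cluster config: %v", err)
		os.Exit(1)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		klog.Errorf("creating clientset: %v", err)
		os.Exit(1)
	}

	// Retry the node lookup with exponential backoff instead of failing on the
	// first transient "connection refused" from the API server.
	backoff := wait.Backoff{Duration: 2 * time.Second, Factor: 2.0, Steps: 8, Cap: 2 * time.Minute}
	var node *v1.Node
	err = wait.ExponentialBackoff(backoff, func() (bool, error) {
		node, err = client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			klog.Warningf("retrieving node %q failed, will retry: %v", nodeName, err)
			return false, nil // treat as transient, try again
		}
		return true, nil
	})
	if err != nil {
		// Exit non-zero: for an init container, a failed exit keeps the kubelet
		// restarting it and prevents the driver container from starting.
		klog.Errorf("giving up retrieving node %q: %v", nodeName, err)
		os.Exit(1)
	}

	klog.Infof("node %s found, proceeding to add the IBM VPC labels", node.Name)
	// ... label-update logic would follow here ...
}

With that behavior the kubelet keeps restarting the failed init container with its own backoff and never starts the driver container until the labels are actually in place.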
*** Bug 2034886 has been marked as a duplicate of this bug. ***
Sure, will check. It would be good if we could get the cluster for debugging; that would make it easier for the developer.
Looks like the following things need to be done:
1- The vpc-node-label-updater init container should fail (on any unexpected error), which will stop the other containers in the same pod from running.
2- Let Kubernetes/OpenShift retry the init container until it succeeds.
That is the fix we will put in.
> 2- Let Kubernetes/OpenShift retry the init container until it succeeds.

Maybe it should exit after a few minutes of trying; waiting forever looks scary. But both 1. and 2. look good.
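As a sketch of "exit after a few minutes of trying" (a hypothetical helper reusing the imports from the earlier sketch, not the code that was actually merged):

// waitForNode retries the node lookup every 10s for at most 5 minutes, then
// gives up with an error so main() can os.Exit(1). The kubelet then restarts
// the init container with its own backoff instead of the process blocking
// forever inside a single container run.
func waitForNode(client kubernetes.Interface, nodeName string) (*v1.Node, error) {
	var node *v1.Node
	err := wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		n, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			klog.Warningf("node %q not reachable yet: %v", nodeName, err)
			return false, nil // keep polling until the timeout
		}
		node = n
		return true, nil
	})
	return node, err
}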
We have made the code changes and created a release tag here: https://github.com/IBM/vpc-node-label-updater/releases/tag/v4.1.1
Moving the ball to Red Hat to merge the upstream fix.
Marking as blocker; this can cause a whole cluster installation to fail.
Installed the cluster several times and all installs succeeded with 4.10.0-0.nightly-2022-01-25-023600.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056