Description of problem:
The vpc-node-label-updater init container fails to reach the API server because of a transient network error ("connection refused") and never adds the required node labels. As a result the ibm-vpc-block-csi-node pod on that worker ends up in CrashLoopBackOff and co/storage stays degraded. Enhance the vpc-node-label-updater to avoid that network issue.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-23-153012

How reproducible:
30%

Steps to Reproduce:
1. oc get co
NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE  MESSAGE
authentication  4.10.0-0.nightly-2021-12-23-153012  False  False  True  67m  OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
baremetal  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
cloud-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  69m
cloud-credential  4.10.0-0.nightly-2021-12-23-153012  True  False  False  64m
cluster-autoscaler  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
config-operator  4.10.0-0.nightly-2021-12-23-153012  True  False  False  67m
console  4.10.0-0.nightly-2021-12-23-153012  False  True  False  51m  RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com": dial tcp: lookup console-openshift-console.apps.qe-chaoibm10.ibmcloud.qe.devcluster.openshift.com on 172.30.0.10:53: no such host
csi-snapshot-controller  4.10.0-0.nightly-2021-12-23-153012  True  False  False  66m
dns  4.10.0-0.nightly-2021-12-23-153012  True  False  False  64m
etcd  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
image-registry  4.10.0-0.nightly-2021-12-23-153012  True  False  False  51m
ingress  4.10.0-0.nightly-2021-12-23-153012  True  False  True  51m  The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
insights  4.10.0-0.nightly-2021-12-23-153012  True  False  False  60m
kube-apiserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  54m
kube-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
kube-scheduler  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
kube-storage-version-migrator  4.10.0-0.nightly-2021-12-23-153012  True  False  False  66m
machine-api  4.10.0-0.nightly-2021-12-23-153012  True  False  False  61m
machine-approver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
machine-config  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
marketplace  4.10.0-0.nightly-2021-12-23-153012  True  False  False  64m
monitoring  4.10.0-0.nightly-2021-12-23-153012  True  False  False  48m
network  4.10.0-0.nightly-2021-12-23-153012  True  False  False  67m
node-tuning  4.10.0-0.nightly-2021-12-23-153012  True  False  False  53m
openshift-apiserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  55m
openshift-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  66m
openshift-samples  4.10.0-0.nightly-2021-12-23-153012  True  False  False  59m
operator-lifecycle-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
operator-lifecycle-manager-catalog  4.10.0-0.nightly-2021-12-23-153012  True  False  False  65m
operator-lifecycle-manager-packageserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  60m
service-ca  4.10.0-0.nightly-2021-12-23-153012  True  False  False  66m
storage  4.10.0-0.nightly-2021-12-23-153012  True  True  False  50m  IBMVPCBlockCSIDriverOperatorCRProgressing: IBMBlockDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods

2. oc get co
NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE  MESSAGE
authentication  4.10.0-0.nightly-2021-12-23-153012  True  False  False  24m
baremetal  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
cloud-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  103m
cloud-credential  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
cluster-autoscaler  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
config-operator  4.10.0-0.nightly-2021-12-23-153012  True  False  False  100m
console  4.10.0-0.nightly-2021-12-23-153012  True  False  False  24m
csi-snapshot-controller  4.10.0-0.nightly-2021-12-23-153012  True  False  False  100m
dns  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
etcd  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
image-registry  4.10.0-0.nightly-2021-12-23-153012  True  False  False  85m
ingress  4.10.0-0.nightly-2021-12-23-153012  True  False  False  84m
insights  4.10.0-0.nightly-2021-12-23-153012  True  False  False  94m
kube-apiserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  87m
kube-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
kube-scheduler  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
kube-storage-version-migrator  4.10.0-0.nightly-2021-12-23-153012  True  False  False  100m
machine-api  4.10.0-0.nightly-2021-12-23-153012  True  False  False  95m
machine-approver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
machine-config  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
marketplace  4.10.0-0.nightly-2021-12-23-153012  True  False  False  98m
monitoring  4.10.0-0.nightly-2021-12-23-153012  True  False  False  82m
network  4.10.0-0.nightly-2021-12-23-153012  True  False  False  101m
node-tuning  4.10.0-0.nightly-2021-12-23-153012  True  False  False  86m
openshift-apiserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  89m
openshift-controller-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  99m
openshift-samples  4.10.0-0.nightly-2021-12-23-153012  True  False  False  92m
operator-lifecycle-manager  4.10.0-0.nightly-2021-12-23-153012  True  False  False  99m
operator-lifecycle-manager-catalog  4.10.0-0.nightly-2021-12-23-153012  True  False  False  99m
operator-lifecycle-manager-packageserver  4.10.0-0.nightly-2021-12-23-153012  True  False  False  93m
service-ca  4.10.0-0.nightly-2021-12-23-153012  True  False  False  100m
storage  4.10.0-0.nightly-2021-12-23-153012  True  True  False  84m  IBMVPCBlockCSIDriverOperatorCRProgressing: IBMBlockDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods

3. oc get pods
NAME  READY  STATUS  RESTARTS  AGE
ibm-vpc-block-csi-controller-5dc949cf6-tsls4  5/5  Running  0  105m
ibm-vpc-block-csi-driver-operator-5bfd7d58b8-9dgpk  1/1  Running  0  105m
ibm-vpc-block-csi-node-b8475  3/3  Running  0  105m
ibm-vpc-block-csi-node-dkvq8  3/3  Running  0  95m
ibm-vpc-block-csi-node-gg782  3/3  Running  0  105m
ibm-vpc-block-csi-node-w2w9b  3/3  Running  0  95m
ibm-vpc-block-csi-node-w4qss  3/3  Running  0  105m
ibm-vpc-block-csi-node-x9xn2  2/3  CrashLoopBackOff  22 (4m2s ago)  95m

4. oc logs pods/ibm-vpc-block-csi-node-x9xn2 -c csi-driver-registrar
I0105 07:55:53.225272 1 main.go:164] Version: v4.10.0-202112140948.p0.gbb0bd82.assembly.stream-0-gc11b47a-dirty
I0105 07:55:53.225377 1 main.go:165] Running node-driver-registrar in mode=registration
I0105 07:55:53.246458 1 main.go:189] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0105 07:55:53.246487 1 connection.go:154] Connecting to unix:///csi/csi.sock
I0105 07:55:53.247055 1 main.go:196] Calling CSI driver to discover driver name
I0105 07:55:53.247092 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0105 07:55:53.247098 1 connection.go:184] GRPC request: {}
I0105 07:55:53.250284 1 connection.go:186] GRPC response: {"name":"vpc.block.csi.ibm.io","vendor_version":"vpcBlockDriver-279a88c2795e0ff8ee88784c0140cf7f0492a176"}
I0105 07:55:53.250348 1 connection.go:187] GRPC error: <nil>
I0105 07:55:53.250357 1 main.go:206] CSI driver name: "vpc.block.csi.ibm.io"
I0105 07:55:53.250418 1 node_register.go:52] Starting Registration Server at: /registration/vpc.block.csi.ibm.io-reg.sock
I0105 07:55:53.250582 1 node_register.go:61] Registration Server started at: /registration/vpc.block.csi.ibm.io-reg.sock
I0105 07:55:53.250658 1 node_register.go:91] Skipping healthz server because HTTP endpoint is set to: ""
I0105 07:55:54.642287 1 main.go:100] Received GetInfo call: &InfoRequest{}
I0105 07:55:54.642607 1 main.go:107] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/vpc.block.csi.ibm.io/registration"
I0105 07:55:54.697390 1 main.go:118] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = NotFound desc = {RequestID: 06ebd314-01e2-4c65-9a76-ca6341d74aa9 , Code: NodeMetadataInitFailed, Description: Failed to initialize node metadata, BackendError: One or few required node label(s) is/are missing [ibm-cloud.kubernetes.io/worker-id, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone]. Node Labels Found = [#map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:bx2d-4x16 beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:us-south failure-domain.beta.kubernetes.io/zone:us-south-1 kubernetes.io/arch:amd64 kubernetes.io/hostname:qe-chaoibm10-cxqj6-worker-1-7zq9m kubernetes.io/os:linux node-role.kubernetes.io/worker: node.kubernetes.io/instance-type:bx2d-4x16 node.openshift.io/os_id:rhcos topology.kubernetes.io/region:us-south topology.kubernetes.io/zone:us-south-1]], Action: Please check the node labels as per BackendError, accordingly you may add the labels manually},}
E0105 07:55:54.697436 1 main.go:120] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = NotFound desc = {RequestID: 06ebd314-01e2-4c65-9a76-ca6341d74aa9 , Code: NodeMetadataInitFailed, Description: Failed to initialize node metadata, BackendError: One or few required node label(s) is/are missing [ibm-cloud.kubernetes.io/worker-id, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone]. Node Labels Found = [#map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:bx2d-4x16 beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:us-south failure-domain.beta.kubernetes.io/zone:us-south-1 kubernetes.io/arch:amd64 kubernetes.io/hostname:qe-chaoibm10-cxqj6-worker-1-7zq9m kubernetes.io/os:linux node-role.kubernetes.io/worker: node.kubernetes.io/instance-type:bx2d-4x16 node.openshift.io/os_id:rhcos topology.kubernetes.io/region:us-south topology.kubernetes.io/zone:us-south-1]], Action: Please check the node labels as per BackendError, accordingly you may add the labels manually}, restarting registration container.

5. oc logs ibm-vpc-block-csi-node-x9xn2 -c vpc-node-label-updater
{"level":"info","timestamp":"2022-01-05T06:25:05.847Z","caller":"cmd/main.go:105","msg":"Starting controller for adding node labels","watcher-name":"vpc-node-label-updater"}
{"level":"info","timestamp":"2022-01-05T06:25:16.259Z","caller":"cmd/main.go:113","msg":"Error retrieving the Node from the index for a given node. Error :","watcher-name":"vpc-node-label-updater","error":"Get \"https://172.30.0.1:443/api/v1/nodes/qe-chaoibm10-cxqj6-worker-1-7zq9m\": dial tcp 172.30.0.1:443: connect: connection refused"}

6. Manually add the missing labels to the worker (example command under Additional info below); co/storage then becomes normal:
vpc-block-csi-driver-labels: "true"
ibm-cloud.kubernetes.io/worker-id: 0727_1c399df3-e42e-4d5e-97ed-522af821d682

Actual results:
CO/storage is DEGRADED

Expected results:
Enhance the vpc-node-label-updater to avoid that network issue.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
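For reference, one possible way to apply the manual workaround from step 6 (example invocation only; the node name is the affected worker from the logs above, and the worker-id value must be that worker's VPC instance ID, here the value from step 6):

oc label node qe-chaoibm10-cxqj6-worker-1-7zq9m "ibm-cloud.kubernetes.io/worker-id=0727_1c399df3-e42e-4d5e-97ed-522af821d682" "vpc-block-csi-driver-labels=true"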
It seems to me that the node labeller should retry a few times before giving up (with exponential backoff?). In addition, Kubernetes should not even start the driver container until all init containers succeed - does the labeller return a proper exit code?
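To illustrate the suggestion, a minimal sketch (not the actual vpc-node-label-updater code) of retrying the node lookup with client-go's exponential backoff and exiting non-zero on failure; it assumes the node name is injected through a NODE_NAME environment variable:

package main

import (
	"context"
	"os"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

func main() {
	nodeName := os.Getenv("NODE_NAME") // assumption: injected via the downward API

	config, err := rest.InClusterConfig()
	if err != nil {
		klog.Errorf("building in-cluster config: %v", err)
		os.Exit(1)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		klog.Errorf("creating clientset: %v", err)
		os.Exit(1)
	}

	// Retry the node lookup with exponential backoff instead of failing on the
	// first transient "connection refused" from the API server.
	backoff := wait.Backoff{Duration: 2 * time.Second, Factor: 2.0, Steps: 8, Cap: 2 * time.Minute}
	var node *v1.Node
	err = wait.ExponentialBackoff(backoff, func() (bool, error) {
		node, err = client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			klog.Warningf("retrieving node %q failed, will retry: %v", nodeName, err)
			return false, nil // treat as transient, try again
		}
		return true, nil
	})
	if err != nil {
		// Exit non-zero: for an init container, a failed exit keeps the kubelet
		// restarting it and prevents the driver container from starting.
		klog.Errorf("giving up retrieving node %q: %v", nodeName, err)
		os.Exit(1)
	}

	klog.Infof("node %s found, proceeding to add the IBM VPC labels", node.Name)
	// ... label-update logic would follow here ...
}

With that behavior the kubelet keeps restarting the failed init container with its own backoff and never starts the driver container until the labels are actually in place.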
*** Bug 2034886 has been marked as a duplicate of this bug. ***
Sure, will check. It would be good if we could get the cluster for debugging; that would make it easier for the developer.
Looks like the following things need to be done:
1- The vpc-node-label-updater init container should fail (on any unexpected error), which will stop the other containers in the same pod from running.
2- Let Kubernetes/OpenShift retry the init container until it succeeds.
That is the fix we will put in.
> 2- Let Kubernetes/OpenShift retry the init container until it succeeds.

Maybe it should exit after a few minutes of trying; waiting forever looks scary. But both 1. and 2. look good.
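As a sketch of "exit after a few minutes of trying" (a hypothetical helper reusing the imports from the earlier sketch, not the code that was actually merged):

// waitForNode retries the node lookup every 10s for at most 5 minutes, then
// gives up with an error so main() can os.Exit(1). The kubelet then restarts
// the init container with its own backoff instead of the process blocking
// forever inside a single container run.
func waitForNode(client kubernetes.Interface, nodeName string) (*v1.Node, error) {
	var node *v1.Node
	err := wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		n, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			klog.Warningf("node %q not reachable yet: %v", nodeName, err)
			return false, nil // keep polling until the timeout
		}
		node = n
		return true, nil
	})
	return node, err
}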
We have made the code changes and created a release tag here: https://github.com/IBM/vpc-node-label-updater/releases/tag/v4.1.1
Moving the ball to Red Hat to merge the upstream fix.
Marking as blocker; this can cause a whole cluster installation to fail.
Installed the cluster several times and all installs succeeded with 4.10.0-0.nightly-2022-01-25-023600.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056