The IBM VPC block CSI driver is using the following zone and region labels:
https://github.com/IBM/ibm-csi-common/blob/ce654faf168d6c4d9f90c5d8bc99ee4b2bd33ea2/pkg/utils/constants.go#L55-L59

    // NodeZoneLabel Zone Label attached to node
    NodeZoneLabel = "failure-domain.beta.kubernetes.io/zone"

    // NodeRegionLabel Region Label attached to node
    NodeRegionLabel = "failure-domain.beta.kubernetes.io/region"

Those labels are used by ibm-vpc-block-csi-driver and ibm-vpc-node-label-updater:
https://github.com/openshift/ibm-vpc-block-csi-driver/blob/d54e3706bb8b38447800aa91632a946eb6c990ec/pkg/ibmcsidriver/controller_helper.go#L440-L441
https://github.com/openshift/ibm-vpc-node-label-updater/blob/8e220983b2c2efdfb67eabe4d74cb35bbeca552e/pkg/nodeupdater/utils.go#L41-L42

And as a result, the e2e test manifest for the operator references the same label:
https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/32c4cb8817becc47133d3b0a556cdd706117412a/test/e2e/manifest.yaml#L17

But those are deprecated in-tree labels:
https://github.com/kubernetes/kubernetes/blob/ce9219688ff46b59cc210d880c4cd3af15516c73/staging/src/k8s.io/cloud-provider/cloud.go#L306

They should be updated to provider-specific labels for IBM cloud, similar to what AWS and GCP do:
https://github.com/openshift/aws-ebs-csi-driver-operator/blob/ef20d086a1efcbe9f6b1b716de83c1cc734b6519/test/e2e/manifest.yaml#L17
https://github.com/openshift/gcp-pd-csi-driver-operator/blob/10a76a928fe316537bd86e208e79a302e2095a5d/test/e2e/manifest.yaml#L17
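To illustrate what moving off the deprecated labels could look like on the node-label side, here is a minimal sketch. It is not the driver's actual code: the helper names (lookupLabel, nodeZone, nodeRegion) are hypothetical, and because this report does not name the provider-specific key for IBM Cloud, the sketch only prefers the upstream well-known topology.kubernetes.io labels and falls back to the deprecated beta ones.

```go
package main

import "fmt"

const (
	// Deprecated beta labels, as quoted from ibm-csi-common in this report.
	deprecatedZoneLabel   = "failure-domain.beta.kubernetes.io/zone"
	deprecatedRegionLabel = "failure-domain.beta.kubernetes.io/region"

	// Upstream well-known node labels that superseded the beta labels.
	stableZoneLabel   = "topology.kubernetes.io/zone"
	stableRegionLabel = "topology.kubernetes.io/region"
)

// lookupLabel (hypothetical helper) returns the non-empty value stored under
// the preferred key, or else the non-empty value under the fallback key.
func lookupLabel(labels map[string]string, preferred, fallback string) (string, bool) {
	if v, ok := labels[preferred]; ok && v != "" {
		return v, true
	}
	if v, ok := labels[fallback]; ok && v != "" {
		return v, true
	}
	return "", false
}

// nodeZone reads the zone from a node's labels, preferring the stable label.
func nodeZone(labels map[string]string) (string, bool) {
	return lookupLabel(labels, stableZoneLabel, deprecatedZoneLabel)
}

// nodeRegion reads the region from a node's labels, preferring the stable label.
func nodeRegion(labels map[string]string) (string, bool) {
	return lookupLabel(labels, stableRegionLabel, deprecatedRegionLabel)
}

func main() {
	// Example label set: zone only under the deprecated key, region under the stable key.
	labels := map[string]string{
		deprecatedZoneLabel: "eu-gb-2",
		stableRegionLabel:   "eu-gb",
	}
	zone, _ := nodeZone(labels)
	region, _ := nodeRegion(labels)
	fmt.Println(zone, region)
}
```

Reading both keys during a transition keeps nodes that only carry the old label working, while new code paths stop depending on it.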
Trying to get a passing test run on https://github.com/openshift/release/pull/24720 and the topology tests in e2e-ibmcloud-csi are failing each time. I think it's related to this bug?

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/24720/rehearse-24720-pull-ci-openshift-ibm-vpc-block-csi-driver-operator-master-e2e-ibmcloud-csi/1508442934234583040

: External Storage [Driver: vpc.block.csi.ibm.io] [Testpattern: Dynamic PV (delayed binding)] topology should provision a volume and schedule a pod with AllowedTopologies 6m36s
{  fail [k8s.io/kubernetes.0/test/e2e/storage/testsuites/topology.go:180]: Unexpected error:
    <*errors.errorString | 0xc000300c60>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred}

: External Storage [Driver: vpc.block.csi.ibm.io] [Testpattern: Dynamic PV (immediate binding)] topology should provision a volume and schedule a pod with AllowedTopologies 6m26s
{  fail [k8s.io/kubernetes.0/test/e2e/storage/testsuites/topology.go:180]: Unexpected error:
    <*errors.errorString | 0xc000344c70>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred}

We find the failure-domain.beta.kubernetes.io/zone:eu-gb-2 domain and use it in the AllowedTopologies of the StorageClass created for the PVC:
blob:https://prow.ci.openshift.org/119bde49-651c-46be-9c8f-4c50cf188191

Mar 28 15:06:48.855: INFO: found topology map[failure-domain.beta.kubernetes.io/zone:eu-gb-1]
Mar 28 15:06:48.855: INFO: found topology map[failure-domain.beta.kubernetes.io/zone:eu-gb-2]
Mar 28 15:06:48.953: INFO: Creating storage class object and pvc object for driver - sc: &StorageClass{ObjectMeta:{e2e-topology-8510-e2e-scj6dv6 1f53768b-fdf4-4ca6-a729-183a411e314f 0 2022-03-28 14:30:04 +0000 UTC <nil> <nil> map[addonmanager.kubernetes.io/mode:Reconcile app:ibm-vpc-block-csi-driver razee/force-apply:true] map[storageclass.kubernetes.io/is-default-class:true] [] [] [{ibm-vpc-block-csi-driver-operator Update storage.k8s.io/v1 2022-03-28 14:30:04 +0000 UTC FieldsV1 {"f:allowVolumeExpansion":{},"f:metadata":{"f:annotations":{".":{},"f:storageclass.kubernetes.io/is-default-class":{}},"f:labels":{".":{},"f:addonmanager.kubernetes.io/mode":{},"f:app":{},"f:razee/force-apply":{}}},"f:parameters":{".":{},"f:csi.storage.k8s.io/fstype":{},"f:encrypted":{},"f:encryptionKey":{},"f:profile":{},"f:region":{},"f:resourceGroup":{},"f:tags":{},"f:zone":{}},"f:provisioner":{},"f:reclaimPolicy":{},"f:volumeBindingMode":{}} }]},Provisioner:vpc.block.csi.ibm.io,Parameters:map[string]string{csi.storage.k8s.io/fstype: ext4,encrypted: false,encryptionKey: ,profile: 10iops-tier,region: ,resourceGroup: ,tags: ,zone: ,},ReclaimPolicy:*Delete,MountOptions:[],AllowVolumeExpansion:*true,VolumeBindingMode:*WaitForFirstConsumer,AllowedTopologies:[]TopologySelectorTerm{{[{failure-domain.beta.kubernetes.io/zone [eu-gb-2]}]},},}, pvc: &PersistentVolumeClaim{ObjectMeta:{ pvc- e2e-topology-8510 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:PersistentVolumeClaimSpec{AccessModes:[ReadWriteOnce],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{storage: {{10737418240 0} {<nil>} 10Gi BinarySI},},},VolumeName:,Selector:nil,StorageClassName:*e2e-topology-8510-e2e-scj6dv6,VolumeMode:nil,DataSource:nil,DataSourceRef:nil,},Status:PersistentVolumeClaimStatus{Phase:,AccessModes:[],Capacity:ResourceList{},Conditions:[]PersistentVolumeClaimCondition{},AllocatedResources:ResourceList{},ResizeStatus:nil,},}

But the pod fails to start with "no matching NodeSelectorTerms":

STEP: Collecting events from namespace "e2e-topology-8510".
STEP: Found 8 events.
Mar 28 15:13:22.131: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-b2efbccd-6e14-48f7-8c79-5238e5764a53: { } FailedScheduling: running PreBind plugin "VolumeBinding": binding volumes: pv "pvc-9a1ebaea-cf98-4647-91b3-76db18ba2c58" node affinity doesn't match node "ci-op-vpizx7dl-baab4-d7chf-worker-2-tgfwq": no matching NodeSelectorTerms
Mar 28 15:13:22.131: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-b2efbccd-6e14-48f7-8c79-5238e5764a53: { } FailedScheduling: 0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had volume node affinity conflict.
Mar 28 15:13:22.131: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-b2efbccd-6e14-48f7-8c79-5238e5764a53: { } FailedScheduling: 0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had volume node affinity conflict.
Mar 28 15:13:22.131: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-b2efbccd-6e14-48f7-8c79-5238e5764a53: { } FailedScheduling: skip schedule deleting pod: e2e-topology-8510/pod-b2efbccd-6e14-48f7-8c79-5238e5764a53
Mar 28 15:13:22.131: INFO: At 2022-03-28 15:06:49 +0000 UTC - event for pvc-t4bhj: {persistentvolume-controller } WaitForFirstConsumer: waiting for first consumer to be created before binding
Mar 28 15:13:22.131: INFO: At 2022-03-28 15:06:49 +0000 UTC - event for pvc-t4bhj: {vpc.block.csi.ibm.io_ibm-vpc-block-csi-controller-6496fb9dff-sqhfn_b57afa79-91de-4dfc-a1aa-0a82953825c6 } Provisioning: External provisioner is provisioning volume for claim "e2e-topology-8510/pvc-t4bhj"
Mar 28 15:13:22.131: INFO: At 2022-03-28 15:06:49 +0000 UTC - event for pvc-t4bhj: {persistentvolume-controller } ExternalProvisioning: waiting for a volume to be created, either by external provisioner "vpc.block.csi.ibm.io" or manually created by system administrator
Mar 28 15:13:22.131: INFO: At 2022-03-28 15:07:14 +0000 UTC - event for pvc-t4bhj: {vpc.block.csi.ibm.io_ibm-vpc-block-csi-controller-6496fb9dff-sqhfn_b57afa79-91de-4dfc-a1aa-0a82953825c6 } ProvisioningSucceeded: Successfully provisioned volume pvc-9a1ebaea-cf98-4647-91b3-76db18ba2c58
Mar 28 15:13:22.226: INFO: POD NODE PHASE GRACE CONDITIONS
Mar 28 15:13:22.226: INFO:
Mar 28 15:13:22.416: INFO: skipping dumping cluster info - cluster too large
STEP: Destroying namespace "e2e-topology-8510" for this suite.
fail [k8s.io/kubernetes.0/test/e2e/storage/testsuites/topology.go:180]: Unexpected error:
    <*errors.errorString | 0xc000300c60>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
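For readers unfamiliar with the "no matching NodeSelectorTerms" message in the events above, the following is a simplified sketch of how a PV's required node affinity is compared against a node's labels. It is an illustration only, not the scheduler's code: the types requirement and term are hypothetical stand-ins, only the In operator is modeled, and the label keys and values in main are made-up examples, since the actual PV affinity and node labels are not shown in this log.

```go
package main

import "fmt"

// requirement is a simplified stand-in for a node selector requirement
// using the In operator: the node's value for Key must be one of Values.
type requirement struct {
	Key    string
	Values []string
}

// term is a simplified node selector term: every requirement must match (AND).
type term []requirement

// matchesTerm reports whether nodeLabels satisfy all requirements of t.
// An empty term matches nothing, mirroring Kubernetes semantics.
func matchesTerm(t term, nodeLabels map[string]string) bool {
	if len(t) == 0 {
		return false
	}
	for _, r := range t {
		got, ok := nodeLabels[r.Key]
		if !ok {
			return false // node does not carry the label at all
		}
		found := false
		for _, want := range r.Values {
			if want == got {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// matchesAnyTerm reports whether at least one term matches (terms are ORed).
// A false result corresponds to "no matching NodeSelectorTerms".
func matchesAnyTerm(terms []term, nodeLabels map[string]string) bool {
	for _, t := range terms {
		if matchesTerm(t, nodeLabels) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical PV affinity: requires a zone label with value eu-gb-2.
	pvAffinity := []term{{
		{Key: "failure-domain.beta.kubernetes.io/zone", Values: []string{"eu-gb-2"}},
	}}
	// Hypothetical node that carries the key with a different value.
	node := map[string]string{"failure-domain.beta.kubernetes.io/zone": "eu-gb-1"}
	fmt.Println(matchesAnyTerm(pvAffinity, node))
}
```

A mismatch can come from a missing key as well as from a differing value, which is why checking the exact label keys on both the PV and the node is a useful first debugging step.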
(In reply to Jonathan Dobson from comment #1)
> Trying to get a passing test run on
> https://github.com/openshift/release/pull/24720 and the topology tests in
> e2e-ibmcloud-csi are failing each time. I think it's related to this bug?

Nope, this turned out to be a completely separate issue. See https://bugzilla.redhat.com/show_bug.cgi?id=2073617 for details.
Leaving this bug (2035027) open to address the original issue of using deprecated failure-domain.beta labels.