Description of problem:
Operator storage PROGRESSING and DEGRADED is true during fresh install.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-02-10-031822

How reproducible:
Always

Steps to Reproduce:
1. Check the storage cluster operator:

storage   4.11.0-0.nightly-2022-02-10-031822   False   True   True   91m
IBMVPCBlockCSIDriverOperatorCRAvailable: IBMBlockDriverControllerServiceControllerAvailable: Waiting for Deployment...

2. Inspect the operator status conditions:

Status:
  Conditions:
    Last Transition Time:  2022-02-10T10:36:51Z
    Message:               IBMVPCBlockCSIDriverOperatorCRDegraded: SecretSyncDegraded: Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials
    Reason:                IBMVPCBlockCSIDriverOperatorCR_SecretSync_SyncError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2022-02-10T10:33:16Z
    Message:               IBMVPCBlockCSIDriverOperatorCRProgressing: IBMBlockDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
                           IBMVPCBlockCSIDriverOperatorCRProgressing: IBMBlockDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
    Reason:                IBMVPCBlockCSIDriverOperatorCR_IBMBlockDriverControllerServiceController_Deploying::IBMBlockDriverNodeServiceController_Deploying
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2022-02-10T10:33:16Z
    Message:               IBMVPCBlockCSIDriverOperatorCRAvailable: IBMBlockDriverControllerServiceControllerAvailable: Waiting for Deployment
                           IBMVPCBlockCSIDriverOperatorCRAvailable: IBMBlockDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service
    Reason:                IBMVPCBlockCSIDriverOperatorCR_IBMBlockDriverControllerServiceController_Deploying::IBMBlockDriverNodeServiceController_Deploying
    Status:                False
    Type:                  Available
    Last Transition Time:  2022-02-10T10:33:13Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>

3. Describe the stuck controller pod:
oc describe pod/ibm-vpc-block-csi-controller-6cfd5cf586-klt8g

Name:                 ibm-vpc-block-csi-controller-6cfd5cf586-klt8g
Namespace:            openshift-cluster-csi-drivers
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 chaoyang-ibm11-lmtx8-worker-1-c8zdw/10.242.1.4
Start Time:           Thu, 10 Feb 2022 18:47:18 +0800
Labels:               app=ibm-vpc-block-csi-driver
                      pod-template-hash=6cfd5cf586
Annotations:          openshift.io/scc: restricted
                      operator.openshift.io/dep-93dacac787a37f90e5e6ae24c39bc451a2c11: WAXdSw==
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        ReplicaSet/ibm-vpc-block-csi-controller-6cfd5cf586
Containers:
  csi-resizer:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a74abfe8485120fb61988202f62e053b86a63a01af67129185809894a5d2ce36
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=2
      --csi-address=/csi/csi.sock
      --timeout=900s
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        20m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
  csi-provisioner:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cedf5579d83be9041d0ebe9026133a5d4406ce9e5b9acc5b3e55018a9336d608
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=2
      --csi-address=$(ADDRESS)
      --timeout=600s
      --feature-gates=Topology=true
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
  csi-attacher:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d9ecce8826b5b6d50ca9aefe59e5e10c62fea1a6f51de6800dbc3724e20175e4
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=2
      --csi-address=/csi/csi.sock
      --timeout=900s
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
  liveness-probe:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2bb08c795978c50bde9da9b012aff287055d8e6b90b0014306d5278473acc5df
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=/csi/csi.sock
      --v=2
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        5m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
  iks-vpc-block-driver:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba07c2a25367149757b855909fd89a7713a901083bf3afd7a9f196e0589e345a
    Image ID:
    Port:          9808/TCP
    Host Port:     0/TCP
    Args:
      --v=2
      --endpoint=$(CSI_ENDPOINT)
      --lock_enabled=false
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     50m
      memory:  100Mi
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
    Environment Variables from:
      ibm-vpc-block-csi-configmap  ConfigMap  Optional: false
    Environment:
      POD_NAME:       ibm-vpc-block-csi-controller-6cfd5cf586-klt8g (v1:metadata.name)
      POD_NAMESPACE:  openshift-cluster-csi-drivers (v1:metadata.namespace)
    Mounts:
      /csi from socket-dir (rw)
      /etc/storage_ibmc from customer-auth (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  customer-auth:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  storage-secret-store
    Optional:    false
  non-standard-root-system-trust-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ibm-vpc-block-csi-driver-trusted-ca-bundle
    Optional:  false
  kube-api-access-7jmd6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  101m                 default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  98m                  default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  96m                  default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  91m                  default-scheduler  0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.
  Normal   Scheduled         89m                  default-scheduler  Successfully assigned openshift-cluster-csi-drivers/ibm-vpc-block-csi-controller-6cfd5cf586-klt8g to chaoyang-ibm11-lmtx8-worker-1-c8zdw
  Warning  FailedScheduling  102m (x2 over 103m)  default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedMount       44m (x6 over 76m)     kubelet  Unable to attach or mount volumes: unmounted volumes=[customer-auth], unattached volumes=[kube-api-access-7jmd6 customer-auth socket-dir]: timed out waiting for the condition
  Warning  FailedMount       19m (x6 over 87m)     kubelet  Unable to attach or mount volumes: unmounted volumes=[customer-auth], unattached volumes=[customer-auth socket-dir kube-api-access-7jmd6]: timed out waiting for the condition
  Warning  FailedMount       8m25s (x19 over 85m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[customer-auth], unattached volumes=[socket-dir kube-api-access-7jmd6 customer-auth]: timed out waiting for the condition
  Warning  FailedMount       4m19s (x50 over 89m)  kubelet  MountVolume.SetUp failed for volume "customer-auth" : secret "storage-secret-store" not found

Actual results:
The storage cluster operator stays Progressing=True and Degraded=True; the CSI controller pod is stuck in Pending because the storage-secret-store secret is never created.

Expected results:
The storage cluster operator becomes Available=True with Progressing=False and Degraded=False after a fresh install.

Master Log:
Node Log (of failed PODs):
PV Dump:
PVC Dump:
StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
From the installation log:

02-10 19:28:43.833 level=info msg=Cluster operator insights SCANotAvailable is True with NotFound: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 404: {"id":"7","kind":"Error","href":"/api/accounts_mgmt/v1/errors/7","code":"ACCT-MGMT-7","reason":"The organization (id= 1V6IJrh1cNmDxgNlAAWZRfupr3B) does not have any certificate of type sca. Enable SCA at https://access.redhat.com/management.","operation_id":"1accab1c-011e-48f4-bc49-0140b87e41c8"}
02-10 19:28:43.833 level=info msg=Cluster operator network ManagementStateDegraded is False with :
02-10 19:28:43.834 level=error msg=Cluster operator storage Degraded is True with IBMVPCBlockCSIDriverOperatorCR_SecretSync_SyncError: IBMVPCBlockCSIDriverOperatorCRDegraded: SecretSyncDegraded: Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials
Created attachment 1860361 [details]
CSI driver operator logs

I can see a lot of errors like this in the operator log:

E0210 10:45:20.151308       1 secretsync.go:117] "Error while extracting data from secret/cm" err="Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials"
E0210 10:45:20.151484       1 base_controller.go:272] SecretSync reconciliation failed: Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials

The first successful sync is 2 hours later (!!!):

I0210 12:45:20.209751       1 secretsync.go:125] storage-secret-store secret created successfully

"Error while extracting data from secret/cm" is printed here:
https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/7e195cbe598f7911e7689faff1f6027c46ece837/pkg/controller/secret/secretsync.go#L115-L118

It looks like translateSecret() failed with the error "Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials". I lack the IBM-specific knowledge to say what that means and who is expected to create the resource or make it available.
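To illustrate the failure mode described above, here is a minimal, self-contained sketch (the type and function names are hypothetical, simplified from the operator's translateSecret() path): the operator scans the full resource-group listing for the cluster's infrastructure name, so if the API response happens to omit that group, the sync fails even though the group exists.

```go
package main

import "fmt"

// resourceGroup mirrors the shape of one entry returned by the IBM
// Resource Manager listing call (fields simplified for this sketch).
type resourceGroup struct {
	Name string
	ID   string
}

// findResourceGroupID scans the listed groups for the wanted name.
// When the listing omits the group, the lookup fails with the same
// "Resource ... not found for given g2Credentials" error seen in the
// operator logs above.
func findResourceGroupID(groups []resourceGroup, name string) (string, error) {
	for _, g := range groups {
		if g.Name == name {
			return g.ID, nil
		}
	}
	return "", fmt.Errorf("Resource %s not found for given g2Credentials", name)
}

func main() {
	// The listing returned by the API lacks the cluster's group.
	groups := []resourceGroup{{Name: "Default", ID: "rg-1"}}
	_, err := findResourceGroupID(groups, "chaoyang-ibm11-lmtx8")
	fmt.Println(err)
}
```

This is why the error clears on its own two hours later: once the listing starts including the group again, the same lookup succeeds without any code change.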
I have tested this on a different cluster and was able to reproduce the error. The error appears for some time and then disappears. I listed the resource groups while the errors were occurring and could not find the particular resource group. Below is the list of resource groups at that time:

abutcher-4-10-9w7lg
Default
rvanderp-dev-rg
rvanderp-dev-g9hr9
abutcher-test-dgjzb
abutcher-test-qm426
master-volume-5-cmr7v
ipi-dev-storage-15-2l5vd
20220208-dusty-resource-group
ipi-dev-storage-13-9znl9
ipi-dev-test-70-9xdj6
ipi-dev-storage-17-prvmc
rvanderp-dev-5k9fw
jdiaztestrg
master-volume-1-kt4zq
ipi-dev-network-3-lb944
master-volume-3-gjxh9

After the error disappeared, I was able to see the missing resource group in the listing: lisowski-etcd-2-7dxs8.
I had a discussion with Gayathri.M and ambiknai.com about this issue. Per the analysis and logs, it looks like

  resourceGroupList, _, err := serviceClient.ListResourceGroups(listResourceGroupsOptions)

is not returning the expected resource group in the list. We will make the following improvements in https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/7e195cbe598f7911e7689faff1f6027c46ece837/pkg/controller/secret/secretsync.go#L185:

1. Instead of fetching the full resource group list, pass the resource group name and retrieve only the details of the resource group we need.
2. Improve the error message and the action to be taken by the user. Along with this, we will add more checks before dereferencing pointer variables (which is not the cause of the current issue).

@jsafrane Please let me know if you have any other suggestions.
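A minimal sketch of the two proposed improvements, assuming a stand-in lookup function (the real IBM SDK call and options type differ): query by the resource group name rather than listing every group, and nil-check the result before dereferencing, returning an actionable error.

```go
package main

import "fmt"

// group mirrors one Resource Manager entry; the real SDK type uses
// pointer fields, which is why the nil guards below matter.
type group struct {
	ID *string
}

// lookupFn stands in for a name-filtered Resource Manager query
// (hypothetical; the SDK's actual signature differs).
type lookupFn func(name string) (*group, error)

// resourceGroupID fetches only the wanted group and guards every
// pointer before dereferencing it, per improvements (1) and (2).
func resourceGroupID(lookup lookupFn, name string) (string, error) {
	g, err := lookup(name)
	if err != nil {
		return "", fmt.Errorf("querying resource group %q: %w", name, err)
	}
	// Nil-check the result and its ID before dereferencing, and give
	// the user a concrete action instead of a bare "not found".
	if g == nil || g.ID == nil {
		return "", fmt.Errorf("resource group %q was not returned by the Resource Manager API; verify the g2Credentials and that the group exists", name)
	}
	return *g.ID, nil
}
```

The error path here is what the user would see in the operator conditions, so it spells out both the missing group and the credential to check.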
> instead of getting full resource group list, we will pass the resource group name and get only resource group detail which we need Gayathri wrote in comment #3: > I have listed the resource groups while getting errors and I could not find the particular resource group. If the resource group is not in ListResourceGroups(), how will GetResourceGroup() help? Are some resource groups hidden in ListResourceGroups()? What hides them and why? This may be OK in IBM cloud, but it looks very odd from perspective of someone who just looks at the API function names. The operator logs show that ListResourceGroups() is able to find the resource group for few minutes and then it disappears. Do you know why? IMO that's the root of the problem. E0210 12:48:24.671011 1 base_controller.go:272] SecretSync reconciliation failed: Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials I0210 12:49:06.459743 1 secretsync.go:125] storage-secret-store secret created successfully I0210 12:49:06.897443 1 secretsync.go:125] storage-secret-store secret created successfully E0210 12:53:10.009110 1 secretsync.go:117] "Error while extracting data from secret/cm" err="Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials"
Gayathri.M is working on this issue. Can you please provide the details?
I am working on improvements to fetch the resource group ID by the resource group name instead of listing all groups, along with improvements to error handling and log messages. I am also adding a check that storage-secret-store is present before deploying the CSI driver.
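The last point can be sketched as a simple gate (a hypothetical helper; the real check would go through a client-go Secrets lister): the controller Deployment is only created once the synced secret exists, so the driver pods never enter the FailedMount loop seen in the pod events above.

```go
package main

import "fmt"

// secretExists stands in for a Kubernetes API lookup of a Secret by
// namespace and name; a plain function keeps the sketch self-contained.
type secretExists func(namespace, name string) bool

// canDeployDriver gates the CSI driver rollout on the synced secret
// being present, reporting a wait message while it is missing.
func canDeployDriver(exists secretExists) (bool, string) {
	const ns, secret = "openshift-cluster-csi-drivers", "storage-secret-store"
	if !exists(ns, secret) {
		return false, fmt.Sprintf("waiting for secret %s/%s to be synced", ns, secret)
	}
	return true, ""
}

func main() {
	// Simulate the secret not having been synced yet.
	missing := func(ns, name string) bool { return false }
	ok, msg := canDeployDriver(missing)
	fmt.Println(ok, msg)
}
```

With this gate the operator reports Progressing with a clear reason instead of deploying pods that fail to mount the customer-auth volume.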
For the record, it seems this was reproducible only around Feb 9-10, at least in our CI:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-ovn-kubernetes-release-4.11-e2e-ibmcloud-ipi-ovn-periodic/1491562876593246208
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-ovn-kubernetes-release-4.11-e2e-ibmcloud-ipi-ovn-periodic/1491200522177220608

I did not find anything more recent.
Installation passed with 4.11.0-0.nightly-2022-02-18-121223, and since Feb 14 I have not hit this issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069