Bug 2053006 - [ibm]Operator storage PROGRESSING and DEGRADED is true during fresh install for ocp4.11
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Gayathri M
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks: 2055689
 
Reported: 2022-02-10 12:46 UTC by Chao Yang
Modified: 2023-01-21 22:16 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:49:18 UTC
Target Upstream Version:


Attachments
CSI driver operator logs (106.62 KB, text/plain)
2022-02-10 14:29 UTC, Jan Safranek


Links
  Github openshift/ibm-vpc-block-csi-driver-operator pull 26 (Merged): Bug 2053006: Resource id fetch optimised (last updated 2022-02-17 22:04:19 UTC)
  Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:49:37 UTC)

Description Chao Yang 2022-02-10 12:46:24 UTC
Description of problem:
Operator storage PROGRESSING and DEGRADED is true during fresh install.
Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-02-10-031822

How reproducible:
Always

Steps to Reproduce:
1. After a fresh install, check the storage cluster operator; it is not Available and reports Progressing and Degraded:
storage                                    4.11.0-0.nightly-2022-02-10-031822   False       True          True       91m     IBMVPCBlockCSIDriverOperatorCRAvailable: IBMBlockDriverControllerServiceControllerAvailable: Waiting for Deployment...

2. Inspect the ClusterCSIDriver status conditions:
Status:
  Conditions:
    Last Transition Time:  2022-02-10T10:36:51Z
    Message:               IBMVPCBlockCSIDriverOperatorCRDegraded: SecretSyncDegraded: Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials
    Reason:                IBMVPCBlockCSIDriverOperatorCR_SecretSync_SyncError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2022-02-10T10:33:16Z
    Message:               IBMVPCBlockCSIDriverOperatorCRProgressing: IBMBlockDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
IBMVPCBlockCSIDriverOperatorCRProgressing: IBMBlockDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
    Reason:                IBMVPCBlockCSIDriverOperatorCR_IBMBlockDriverControllerServiceController_Deploying::IBMBlockDriverNodeServiceController_Deploying
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2022-02-10T10:33:16Z
    Message:               IBMVPCBlockCSIDriverOperatorCRAvailable: IBMBlockDriverControllerServiceControllerAvailable: Waiting for Deployment
IBMVPCBlockCSIDriverOperatorCRAvailable: IBMBlockDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service
    Reason:                IBMVPCBlockCSIDriverOperatorCR_IBMBlockDriverControllerServiceController_Deploying::IBMBlockDriverNodeServiceController_Deploying
    Status:                False
    Type:                  Available
    Last Transition Time:  2022-02-10T10:33:13Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>

3. Describe the stuck CSI controller pod:
oc describe pod/ibm-vpc-block-csi-controller-6cfd5cf586-klt8g
Name:                 ibm-vpc-block-csi-controller-6cfd5cf586-klt8g
Namespace:            openshift-cluster-csi-drivers
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 chaoyang-ibm11-lmtx8-worker-1-c8zdw/10.242.1.4
Start Time:           Thu, 10 Feb 2022 18:47:18 +0800
Labels:               app=ibm-vpc-block-csi-driver
                      pod-template-hash=6cfd5cf586
Annotations:          openshift.io/scc: restricted
                      operator.openshift.io/dep-93dacac787a37f90e5e6ae24c39bc451a2c11: WAXdSw==
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/ibm-vpc-block-csi-controller-6cfd5cf586
Containers:
  csi-resizer:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a74abfe8485120fb61988202f62e053b86a63a01af67129185809894a5d2ce36
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=2
      --csi-address=/csi/csi.sock
      --timeout=900s
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        20m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
  csi-provisioner:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cedf5579d83be9041d0ebe9026133a5d4406ce9e5b9acc5b3e55018a9336d608
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=2
      --csi-address=$(ADDRESS)
      --timeout=600s
      --feature-gates=Topology=true
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
  csi-attacher:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d9ecce8826b5b6d50ca9aefe59e5e10c62fea1a6f51de6800dbc3724e20175e4
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=2
      --csi-address=/csi/csi.sock
      --timeout=900s
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
  liveness-probe:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2bb08c795978c50bde9da9b012aff287055d8e6b90b0014306d5278473acc5df
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=/csi/csi.sock
      --v=2
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        5m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
  iks-vpc-block-driver:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba07c2a25367149757b855909fd89a7713a901083bf3afd7a9f196e0589e345a
    Image ID:      
    Port:          9808/TCP
    Host Port:     0/TCP
    Args:
      --v=2
      --endpoint=$(CSI_ENDPOINT)
      --lock_enabled=false
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     50m
      memory:  100Mi
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
    Environment Variables from:
      ibm-vpc-block-csi-configmap  ConfigMap  Optional: false
    Environment:
      POD_NAME:       ibm-vpc-block-csi-controller-6cfd5cf586-klt8g (v1:metadata.name)
      POD_NAMESPACE:  openshift-cluster-csi-drivers (v1:metadata.namespace)
    Mounts:
      /csi from socket-dir (rw)
      /etc/storage_ibmc from customer-auth (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7jmd6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  customer-auth:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  storage-secret-store
    Optional:    false
  non-standard-root-system-trust-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ibm-vpc-block-csi-driver-trusted-ca-bundle
    Optional:  false
  kube-api-access-7jmd6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  101m                  default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  98m                   default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  96m                   default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  91m                   default-scheduler  0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.
  Normal   Scheduled         89m                   default-scheduler  Successfully assigned openshift-cluster-csi-drivers/ibm-vpc-block-csi-controller-6cfd5cf586-klt8g to chaoyang-ibm11-lmtx8-worker-1-c8zdw
  Warning  FailedScheduling  102m (x2 over 103m)   default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedMount       44m (x6 over 76m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[customer-auth], unattached volumes=[kube-api-access-7jmd6 customer-auth socket-dir]: timed out waiting for the condition
  Warning  FailedMount       19m (x6 over 87m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[customer-auth], unattached volumes=[customer-auth socket-dir kube-api-access-7jmd6]: timed out waiting for the condition
  Warning  FailedMount       8m25s (x19 over 85m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[customer-auth], unattached volumes=[socket-dir kube-api-access-7jmd6 customer-auth]: timed out waiting for the condition
  Warning  FailedMount       4m19s (x50 over 89m)  kubelet            MountVolume.SetUp failed for volume "customer-auth" : secret "storage-secret-store" not found

Actual results:
The storage cluster operator stays Degraded=True and Progressing=True, and the ibm-vpc-block-csi-controller pod is stuck Pending because the storage-secret-store secret does not exist.

Expected results:
The storage cluster operator becomes Available=True with Degraded=False shortly after a fresh install.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
From the installation log:
02-10 19:28:43.833  level=info msg=Cluster operator insights SCANotAvailable is True with NotFound: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 404: {"id":"7","kind":"Error","href":"/api/accounts_mgmt/v1/errors/7","code":"ACCT-MGMT-7","reason":"The organization (id= 1V6IJrh1cNmDxgNlAAWZRfupr3B) does not have any certificate of type sca. Enable SCA at https://access.redhat.com/management.","operation_id":"1accab1c-011e-48f4-bc49-0140b87e41c8"}
02-10 19:28:43.833  level=info msg=Cluster operator network ManagementStateDegraded is False with : 
02-10 19:28:43.834  level=error msg=Cluster operator storage Degraded is True with IBMVPCBlockCSIDriverOperatorCR_SecretSync_SyncError: IBMVPCBlockCSIDriverOperatorCRDegraded: SecretSyncDegraded: Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials

Comment 2 Jan Safranek 2022-02-10 14:29:44 UTC
Created attachment 1860361 [details]
CSI driver operator logs

I can see a lot of errors like this in the operator log:

E0210 10:45:20.151308       1 secretsync.go:117] "Error while extracting data from secret/cm" err="Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials"
E0210 10:45:20.151484       1 base_controller.go:272] SecretSync reconciliation failed: Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials

The first successful sync is 2 hours later (!!!):

I0210 12:45:20.209751       1 secretsync.go:125] storage-secret-store secret created successfully

"Error while extracting data from secret/cm" is printed here: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/7e195cbe598f7911e7689faff1f6027c46ece837/pkg/controller/secret/secretsync.go#L115-L118

It looks like translateSecret() failed with the error "Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials". I lack the IBM Cloud knowledge to say what that means, who should create this resource, or how to make it available.
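
For readers without the operator source handy: the failing path boils down to "list every resource group the credentials can see, then scan for the one whose name matches the cluster's infrastructure ID". A minimal sketch of that pattern, assuming the IBM platform-services Go SDK; findResourceGroupID and its wiring are illustrative, not the operator's actual identifiers:

import (
	"fmt"

	"github.com/IBM/platform-services-go-sdk/resourcemanagerv2"
)

// findResourceGroupID lists all resource groups visible to the given
// credentials and scans for the one matching the cluster infrastructure ID.
func findResourceGroupID(client *resourcemanagerv2.ResourceManagerV2, name string) (string, error) {
	list, _, err := client.ListResourceGroups(client.NewListResourceGroupsOptions())
	if err != nil {
		return "", err
	}
	for _, rg := range list.Resources {
		if rg.Name != nil && *rg.Name == name {
			return *rg.ID, nil
		}
	}
	// This branch yields the "Resource ... not found for given g2Credentials"
	// error from the logs whenever the listing omits the group.
	return "", fmt.Errorf("Resource %s not found for given g2Credentials", name)
}

If the listing intermittently omits the group, this scan fails even though the group exists in the account.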

Comment 3 Gayathri M 2022-02-11 06:08:18 UTC
I have tested this on a different cluster and was able to reproduce the error. The error appeared for some time and then disappeared.
I listed the resource groups while the errors were occurring and could not find the expected resource group. Below is the list of resource groups at that time:
printint
abutcher-4-10-9w7lg
Default
rvanderp-dev-rg
rvanderp-dev-g9hr9
abutcher-test-dgjzb
abutcher-test-qm426
master-volume-5-cmr7v
ipi-dev-storage-15-2l5vd
20220208-dusty-resource-group
ipi-dev-storage-13-9znl9
ipi-dev-test-70-9xdj6
ipi-dev-storage-17-prvmc
rvanderp-dev-5k9fw
jdiaztestrg
master-volume-1-kt4zq
ipi-dev-network-3-lb944
master-volume-3-gjxh9

After the error disappeared, I was able to list the resource group lisowski-etcd-2-7dxs8.

Comment 4 Arashad Ahamad 2022-02-11 12:50:39 UTC
I had a discussion with Gayathri.M and ambiknai.com about this issue. Per our analysis and the logs, it looks like

resourceGroupList, _, err := serviceClient.ListResourceGroups(listResourceGroupsOptions)

is not returning the expected resource group in the list.


We will make the following improvements to the https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/7e195cbe598f7911e7689faff1f6027c46ece837/pkg/controller/secret/secretsync.go#L185 method:

1. Instead of fetching the full resource group list, pass the resource group name and fetch only the details of the resource group we need (see the sketch after this list).

2. Improve the error message and the action to be taken by the user. Along with this, we will add more checks before dereferencing pointer variables (not the cause of this issue as of now).
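
A minimal sketch of improvement 1, assuming the IBM platform-services Go SDK (SetName is the generated setter for the API's name filter; the function itself is illustrative):

import (
	"fmt"

	"github.com/IBM/platform-services-go-sdk/resourcemanagerv2"
)

// getResourceGroupID asks the Resource Manager API for just the named group
// instead of paging through the whole account.
func getResourceGroupID(client *resourcemanagerv2.ResourceManagerV2, name string) (string, error) {
	opts := client.NewListResourceGroupsOptions()
	opts.SetName(name) // server-side filter: only the matching group is returned
	list, _, err := client.ListResourceGroups(opts)
	if err != nil {
		return "", err
	}
	if len(list.Resources) == 0 {
		// Improvement 2: tell the user what to check instead of a bare "not found".
		return "", fmt.Errorf("resource group %q not found; verify the account behind g2Credentials can access it", name)
	}
	return *list.Resources[0].ID, nil
}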


@jsafrane
Please let me know if you have any other suggestions.

Comment 5 Jan Safranek 2022-02-11 13:21:42 UTC
> instead of getting full resource group list, we will pass the resource group name and get only resource group detail which we need 

Gayathri wrote in comment #3:

> I have listed the resource groups while getting errors and I could not find the particular resource group.

If the resource group is not in ListResourceGroups(), how will GetResourceGroup() help? Are some resource groups hidden in ListResourceGroups()? What hides them and why?
This may be OK in IBM Cloud, but it looks very odd from the perspective of someone who just looks at the API function names.

The operator logs show that ListResourceGroups() is able to find the resource group for a few minutes and then it disappears. Do you know why? IMO that's the root of the problem.

E0210 12:48:24.671011       1 base_controller.go:272] SecretSync reconciliation failed: Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials
I0210 12:49:06.459743       1 secretsync.go:125] storage-secret-store secret created successfully
I0210 12:49:06.897443       1 secretsync.go:125] storage-secret-store secret created successfully
E0210 12:53:10.009110       1 secretsync.go:117] "Error while extracting data from secret/cm" err="Resource chaoyang-ibm11-lmtx8 not found for given g2Credentials"

Comment 7 Arashad Ahamad 2022-02-14 13:28:57 UTC
Gayathri.M is working on this issue.

Can you please provide the details?

Comment 8 Gayathri M 2022-02-15 10:21:57 UTC
I am working on improvements to fetch the resource ID by resource group name instead of listing all groups, plus improvements to error handling and log messages.
I am also adding a check that storage-secret-store is present before deploying the CSI driver; a sketch of such a check follows.
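
A minimal sketch of such a pre-deploy check, assuming standard client-go; secretExists and its wiring are illustrative, not the operator's actual code:

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// secretExists reports whether the driver's credentials secret has been
// synced into the driver namespace yet.
func secretExists(ctx context.Context, client kubernetes.Interface) (bool, error) {
	_, err := client.CoreV1().Secrets("openshift-cluster-csi-drivers").
		Get(ctx, "storage-secret-store", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil // not synced yet: hold off creating the Deployment/DaemonSet
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

Deploying only after this returns true avoids the Pending pod and the FailedMount events shown in the pod description above.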

Comment 12 Chao Yang 2022-02-21 08:12:07 UTC
Installation passed with 4.11.0-0.nightly-2022-02-18-121223.
And since 02-14, I have not hit this issue.

Comment 14 errata-xmlrpc 2022-08-10 10:49:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

