Bug 1967877 - [IBM][ROKS] ocs-operator pod in CrashLoopBackOff a week after successful installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: N Balachandran
QA Contact: akarsha
URL:
Whiteboard:
Depends On:
Blocks: 1973684
 
Reported: 2021-06-04 09:43 UTC by Shirisha S Rao
Modified: 2023-09-15 01:09 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1973684
Environment:
Last Closed: 2021-08-03 18:16:41 UTC
Embargoed:


Attachments
ocs-operator pod logs (9.87 KB, text/plain)
2021-06-04 09:44 UTC, Shirisha S Rao


Links
Github openshift/ocs-operator pull 1211 (open): topology: fix crash in determinePlacementRack (last updated 2021-06-06 15:24:54 UTC)
Github openshift/ocs-operator pull 1213 (open): Bug 1967877: [release-4.8] topology: fix crash in determinePlacementRack (last updated 2021-06-08 05:59:10 UTC)
Red Hat Product Errata RHBA-2021:3003 (last updated 2021-08-03 18:16:50 UTC)

Description Shirisha S Rao 2021-06-04 09:43:22 UTC
User-Agent:       Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15
Build Identifier: 

I installed OCS successfully a week ago, but now the ocs-operator-xxx pod is in CrashLoopBackOff, and the logs show a panic in a goroutine.

However, OCS seems to be functioning fine.

Reproducible: Didn't try

Steps to Reproduce:
1. Install OCS 4.6
2. Check the ocs-operator pod
Actual Results:  
ocs-operator pod in CrashLoopBackOff

Expected Results:  
ocs-operator pod in Running state

Comment 2 Shirisha S Rao 2021-06-04 09:44:39 UTC
Created attachment 1788952 [details]
ocs-operator pod logs

These are the ocs-operator pod logs.

Comment 3 Shirisha S Rao 2021-06-04 10:09:59 UTC
Two of the worker nodes were replaced because they were showing as unreachable on the OCP console. Apart from that, no other changes were made.

Comment 4 N Balachandran 2021-06-04 10:43:45 UTC
Can I get access to the cluster?

Comment 5 N Balachandran 2021-06-04 12:07:44 UTC
I need the following information:

1. The StorageCluster information:
oc get storagecluster -o yaml

2. The list of nodes in the cluster and the labels set on them:
oc get nodes --show-labels

3. The names of the nodes that were replaced and which AZs they were in.

Comment 6 N Balachandran 2021-06-04 13:56:09 UTC
This is what I think is happening:

1. The original workers belonged to a particular set of zones.
2. The failure domain for this cluster is rack, so the original workers belonged to fewer than 3 zones.
3. Each rack contains at least one node.
4. From the operator log, the operator has been restarted.
5. A node belonging to a new zone was found:
{"level":"info","ts":"2021-06-04T09:07:43.074Z","logger":"controller_storagecluster","msg":"Adding topology label from node","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"10.241.65.56","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-2"}
6. As the failure domain is rack, OCS attempts to find an appropriate rack for this node.
7. However, as all the other nodes are in a different zone and, assuming that none of the racks have 0 nodes, validRack stays false in determinePlacementRack and rackList is empty.

Trying to access the elements of rackList causes the crash.


I need the information requested in the earlier comment to confirm this.
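
For illustration, a minimal, hypothetical Go sketch of the suspected failure mode (simplified names and structure; not the actual determinePlacementRack implementation):

    package main

    import "fmt"

    // determinePlacementRack picks a rack for a new node based on its zone.
    // Simplified model: each rack is associated with a single zone.
    func determinePlacementRack(nodeZone string, rackZones map[string]string) string {
        rackList := []string{}
        for rack, zone := range rackZones {
            // Only racks whose nodes share the new node's zone are valid.
            if zone == nodeZone {
                rackList = append(rackList, rack)
            }
        }
        // If the node is in a zone no rack has seen (us-east-2 here),
        // rackList stays empty and this index panics:
        // "index out of range [0] with length 0".
        return rackList[0]
    }

    func main() {
        racks := map[string]string{
            "rack0": "us-east-1",
            "rack1": "us-east-1",
            "rack2": "us-east-3",
        }
        fmt.Println(determinePlacementRack("us-east-2", racks))
    }

Running this panics with an index-out-of-range error, consistent with the goroutine panic in the attached operator log.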

Comment 7 Shirisha S Rao 2021-06-04 15:21:16 UTC
% oc get storagecluster -n openshift-storage -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2021-05-24T13:10:22Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 2
    managedFields:
    - apiVersion: ocs.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:encryption: {}
          f:externalStorage: {}
          f:monPVCTemplate:
            .: {}
            f:metadata:
              .: {}
              f:creationTimestamp: {}
            f:spec:
              .: {}
              f:accessModes: {}
              f:resources:
                .: {}
                f:requests:
                  .: {}
                  f:storage: {}
              f:storageClassName: {}
              f:volumeMode: {}
            f:status: {}
          f:storageDeviceSets: {}
      manager: manager
      operation: Update
      time: "2021-05-24T13:10:22Z"
    - apiVersion: ocs.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:uninstall.ocs.openshift.io/cleanup-policy: {}
            f:uninstall.ocs.openshift.io/mode: {}
          f:finalizers: {}
        f:spec:
          f:managedResources:
            .: {}
            f:cephBlockPools: {}
            f:cephFilesystems: {}
            f:cephObjectStoreUsers: {}
            f:cephObjectStores: {}
          f:version: {}
        f:status:
          .: {}
          f:conditions: {}
          f:failureDomain: {}
          f:nodeTopologies:
            .: {}
            f:labels:
              .: {}
              f:failure-domain.beta.kubernetes.io/region: {}
              f:failure-domain.beta.kubernetes.io/zone: {}
              f:topology.rook.io/rack: {}
          f:phase: {}
          f:relatedObjects: {}
      manager: ocs-operator
      operation: Update
      time: "2021-05-26T11:26:06Z"
    name: ocs-storagecluster
    namespace: openshift-storage
    resourceVersion: "2048394"
    selfLink: /apis/ocs.openshift.io/v1/namespaces/openshift-storage/storageclusters/ocs-storagecluster
    uid: a45e29fb-94f0-4f3d-903d-750c4f0fedeb
  spec:
    encryption: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephFilesystems: {}
      cephObjectStoreUsers: {}
      cephObjectStores: {}
    monPVCTemplate:
      metadata:
        creationTimestamp: null
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: ibmc-vpc-block-metro-general-purpose
        volumeMode: Filesystem
      status: {}
    storageDeviceSets:
    - config: {}
      count: 1
      dataPVCTemplate:
        metadata:
          creationTimestamp: null
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1000Gi
          storageClassName: ibmc-vpc-block-metro-general-purpose
          volumeMode: Block
        status: {}
      name: ocs-deviceset
      placement: {}
      portable: true
      replica: 3
      resources: {}
    version: 4.6.0
  status:
    conditions:
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:23:25Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:25:29Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Available
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:25:29Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:11:32Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:25:29Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Upgradeable
    failureDomain: rack
    nodeTopologies:
      labels:
        failure-domain.beta.kubernetes.io/region:
        - us-east
        failure-domain.beta.kubernetes.io/zone:
        - us-east-1
        - us-east-3
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack2
    phase: Ready
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "2048391"
      uid: c52869f6-9356-4383-8df0-a821c3817501
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "2048167"
      uid: 6e961786-cacc-46bf-8be1-d98651d452fc
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 8 Shirisha S Rao 2021-06-04 15:22:47 UTC
% oc get nodes --show-labels
NAME            STATUS   ROLES           AGE   VERSION           LABELS
10.241.1.19     Ready    master,worker   11d   v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-1,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0757_256a1670-fcf8-47df-9970-2480f6891401,ibm-cloud.kubernetes.io/internal-ip=10.241.1.19,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0757-5d99e472-86d0-4445-86b9-3ca87bf9b9e7,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-00000213,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.27_1542_openshift,ibm-cloud.kubernetes.io/zone=us-east-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.1.19,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-1,topology.rook.io/rack=rack0
10.241.1.20     Ready    master,worker   11d   v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-1,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0757_f0f47544-00af-4095-bd48-42e7caab9ed4,ibm-cloud.kubernetes.io/internal-ip=10.241.1.20,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0757-5d99e472-86d0-4445-86b9-3ca87bf9b9e7,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-0000010e,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.27_1542_openshift,ibm-cloud.kubernetes.io/zone=us-east-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.1.20,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-1,topology.rook.io/rack=rack1
10.241.129.16   Ready    master,worker   11d   v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-3,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0777_e7984ecd-4805-4a23-a426-3014ca8bbc55,ibm-cloud.kubernetes.io/internal-ip=10.241.129.16,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0777-7e4a53d7-39f6-4a14-a901-5e4a6fc8ed95,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-000004ed,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.27_1542_openshift,ibm-cloud.kubernetes.io/zone=us-east-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.129.16,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-3,topology.rook.io/rack=rack2
10.241.129.17   Ready    master,worker   11d   v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-3,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0777_5b023ff0-3d29-4866-9d29-96f65c2b1c2f,ibm-cloud.kubernetes.io/internal-ip=10.241.129.17,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0777-7e4a53d7-39f6-4a14-a901-5e4a6fc8ed95,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-00000326,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.27_1542_openshift,ibm-cloud.kubernetes.io/zone=us-east-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.129.17,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-3,topology.rook.io/rack=rack2
10.241.65.56    Ready    master,worker   9d    v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-2,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0767_caa642b9-1586-43b3-87a3-871aac5dcbc3,ibm-cloud.kubernetes.io/internal-ip=10.241.65.56,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0767-706430c2-7ec8-4665-b9fa-31bfc1905989,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-00000af6,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.29_1544_openshift,ibm-cloud.kubernetes.io/zone=us-east-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.65.56,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-2
10.241.65.57    Ready    master,worker   9d    v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-2,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0767_bb9ad539-61c1-4e44-a0fd-ff62877afa22,ibm-cloud.kubernetes.io/internal-ip=10.241.65.57,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0767-706430c2-7ec8-4665-b9fa-31bfc1905989,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-00000b91,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.29_1544_openshift,ibm-cloud.kubernetes.io/zone=us-east-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.65.57,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-2

Comment 9 N Balachandran 2021-06-04 15:49:28 UTC
This confirms the hypothesis. The newer nodes without the rack labels are in zone=us-east-2, while the original nodes were in zone=us-east-1 and zone=us-east-3.


I can reproduce the crash with the latest upstream master code.

Comment 12 N Balachandran 2021-06-07 07:41:03 UTC
The current solution is to create a new rack in such a case. This needs to be evaluated to ensure that arbiter mode will work as expected.
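
For illustration, a hedged Go sketch of that fallback (hypothetical names and a simplified model; the actual change is in the linked pull requests):

    package main

    import "fmt"

    // placeNodeInRack reuses an existing rack in the node's zone when one
    // exists; otherwise it creates a new rack instead of indexing into an
    // empty slice, avoiding the panic.
    func placeNodeInRack(nodeZone string, rackZones map[string]string) string {
        for rack, zone := range rackZones {
            if zone == nodeZone {
                return rack // reuse an existing rack in the same zone
            }
        }
        // No valid rack exists for this zone: create a new one.
        newRack := fmt.Sprintf("rack%d", len(rackZones))
        rackZones[newRack] = nodeZone
        return newRack
    }

    func main() {
        racks := map[string]string{
            "rack0": "us-east-1",
            "rack1": "us-east-1",
            "rack2": "us-east-3",
        }
        fmt.Println(placeNodeInRack("us-east-2", racks)) // prints "rack3"
    }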

Comment 16 N Balachandran 2021-06-09 06:16:11 UTC
@Shirisha and Mudit
Does this need to be backported to OCS 4.7 or 4.6?

Comment 17 Mudit Agarwal 2021-06-09 08:17:05 UTC
Not in 4.6.5 unless there is no workaround; we can take it in the next 4.7.1 z-stream. I will create a clone.

Sahina, as this was found on ROKS, WDYT?

Comment 18 Sahina Bose 2021-06-18 12:35:25 UTC
(In reply to Mudit Agarwal from comment #17)
> Not in 4.6.5 unless there is no workaround; we can take it in the next
> 4.7.1 z-stream. I will create a clone.
> 
> Sahina as this was found on ROKS, WDYT?

Yes, 4.7 z-stream is fine. Thanks!

Comment 23 N Balachandran 2021-07-08 06:34:58 UTC
This seems fine. Without the fix, the ocs-operator pod would have gone into CrashLoopBackOff.

Comment 24 akarsha 2021-07-08 12:38:02 UTC
Based on comment 21 and comment 23, moving to the verified state.

Comment 26 errata-xmlrpc 2021-08-03 18:16:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003

Comment 27 Red Hat Bugzilla 2023-09-15 01:09:00 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.

