Build Identifier: 

I installed OCS successfully a week ago, but now the ocs-operator-xxx pod is in CrashLoopBackOff, and the logs show a panic in a goroutine. However, OCS itself seems to be functioning fine.

Reproducible: Didn't try

Steps to Reproduce:
1. Install OCS 4.6
2. Check the ocs-operator pod

Actual Results:
ocs-operator pod in CrashLoopBackOff

Expected Results:
ocs-operator pod in Running state
Created attachment 1788952 [details]
ocs-operator pod logs

These are the ocs-operator pod logs.
Two of the worker nodes were replaced because they were showing as unreachable on the OCP console. Apart from that, no other changes were made.
Can I get access to the cluster?
I need the following information:
1. The storagecluster information: oc get storagecluster -o yaml
2. The list of nodes in the cluster and the labels set on them: oc get nodes --show-labels
3. The names of the nodes that were replaced and which AZ they were in.
This is what I think is happening:
1. The original workers belonged to a particular set of zones.
2. The failure domain for this cluster is rack, so the original workers belonged to fewer than 3 zones.
3. Each rack contains at least one node.
4. From the operator log, the operator has been restarted.
5. A node belonging to a new zone was found:
{"level":"info","ts":"2021-06-04T09:07:43.074Z","logger":"controller_storagecluster","msg":"Adding topology label from node","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"10.241.65.56","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-2"}
6. As the failure domain is rack, OCS attempts to find an appropriate rack for this node.
7. However, since all the other nodes are in different zones and (assuming) none of the racks have 0 nodes, validRack stays false in determinePlacementRack and rackList remains empty. Trying to index into the empty rackList causes the crash (see the sketch below).

I need the information requested in the earlier comment to confirm this.
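To make step 7 concrete, here is a minimal, self-contained Go sketch of the suspected failure mode. The names (determinePlacementRack, rackList, validRack) follow the analysis above, but the types and control flow are simplified stand-ins, not the actual ocs-operator source:

package main

import "fmt"

// node is a simplified stand-in for a Kubernetes node with its zone and rack
// labels. Illustrative only; the real operator works with corev1.Node objects.
type node struct {
	zone string
	rack string
}

// determinePlacementRack mimics the suspected logic: collect the racks that
// already hold a node in the new node's zone, then pick the first one.
func determinePlacementRack(existing []node, newNode node, racks []string) string {
	rackList := []string{}
	for _, rack := range racks {
		validRack := false
		for _, n := range existing {
			if n.rack == rack && n.zone == newNode.zone {
				validRack = true
				break
			}
		}
		if validRack {
			rackList = append(rackList, rack)
		}
	}
	// If every existing node is in a different zone, validRack never becomes
	// true, rackList stays empty, and this line panics:
	// "index out of range [0] with length 0".
	return rackList[0]
}

func main() {
	existing := []node{
		{zone: "us-east-1", rack: "rack0"},
		{zone: "us-east-1", rack: "rack1"},
		{zone: "us-east-3", rack: "rack2"},
	}
	// The replaced worker came back in us-east-2, a zone no rack has seen.
	newNode := node{zone: "us-east-2"}
	fmt.Println(determinePlacementRack(existing, newNode, []string{"rack0", "rack1", "rack2"}))
}

Running this sketch panics with "index out of range [0] with length 0", the same class of goroutine panic described in the attached operator logs.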
% oc get storagecluster -n openshift-storage -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2021-05-24T13:10:22Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 2
    managedFields:
    - apiVersion: ocs.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:encryption: {}
          f:externalStorage: {}
          f:monPVCTemplate:
            .: {}
            f:metadata:
              .: {}
              f:creationTimestamp: {}
            f:spec:
              .: {}
              f:accessModes: {}
              f:resources:
                .: {}
                f:requests:
                  .: {}
                  f:storage: {}
              f:storageClassName: {}
              f:volumeMode: {}
            f:status: {}
          f:storageDeviceSets: {}
      manager: manager
      operation: Update
      time: "2021-05-24T13:10:22Z"
    - apiVersion: ocs.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:uninstall.ocs.openshift.io/cleanup-policy: {}
            f:uninstall.ocs.openshift.io/mode: {}
          f:finalizers: {}
        f:spec:
          f:managedResources:
            .: {}
            f:cephBlockPools: {}
            f:cephFilesystems: {}
            f:cephObjectStoreUsers: {}
            f:cephObjectStores: {}
          f:version: {}
        f:status:
          .: {}
          f:conditions: {}
          f:failureDomain: {}
          f:nodeTopologies:
            .: {}
            f:labels:
              .: {}
              f:failure-domain.beta.kubernetes.io/region: {}
              f:failure-domain.beta.kubernetes.io/zone: {}
              f:topology.rook.io/rack: {}
          f:phase: {}
          f:relatedObjects: {}
      manager: ocs-operator
      operation: Update
      time: "2021-05-26T11:26:06Z"
    name: ocs-storagecluster
    namespace: openshift-storage
    resourceVersion: "2048394"
    selfLink: /apis/ocs.openshift.io/v1/namespaces/openshift-storage/storageclusters/ocs-storagecluster
    uid: a45e29fb-94f0-4f3d-903d-750c4f0fedeb
  spec:
    encryption: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephFilesystems: {}
      cephObjectStoreUsers: {}
      cephObjectStores: {}
    monPVCTemplate:
      metadata:
        creationTimestamp: null
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: ibmc-vpc-block-metro-general-purpose
        volumeMode: Filesystem
      status: {}
    storageDeviceSets:
    - config: {}
      count: 1
      dataPVCTemplate:
        metadata:
          creationTimestamp: null
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1000Gi
          storageClassName: ibmc-vpc-block-metro-general-purpose
          volumeMode: Block
        status: {}
      name: ocs-deviceset
      placement: {}
      portable: true
      replica: 3
      resources: {}
    version: 4.6.0
  status:
    conditions:
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:23:25Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:25:29Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Available
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:25:29Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:11:32Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2021-05-26T11:26:06Z"
      lastTransitionTime: "2021-05-24T13:25:29Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Upgradeable
    failureDomain: rack
    nodeTopologies:
      labels:
        failure-domain.beta.kubernetes.io/region:
        - us-east
        failure-domain.beta.kubernetes.io/zone:
        - us-east-1
        - us-east-3
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack2
    phase: Ready
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "2048391"
      uid: c52869f6-9356-4383-8df0-a821c3817501
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "2048167"
      uid: 6e961786-cacc-46bf-8be1-d98651d452fc
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
% oc get nodes --show-labels
NAME            STATUS   ROLES           AGE   VERSION           LABELS
10.241.1.19     Ready    master,worker   11d   v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-1,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0757_256a1670-fcf8-47df-9970-2480f6891401,ibm-cloud.kubernetes.io/internal-ip=10.241.1.19,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0757-5d99e472-86d0-4445-86b9-3ca87bf9b9e7,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-00000213,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.27_1542_openshift,ibm-cloud.kubernetes.io/zone=us-east-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.1.19,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-1,topology.rook.io/rack=rack0
10.241.1.20     Ready    master,worker   11d   v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-1,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0757_f0f47544-00af-4095-bd48-42e7caab9ed4,ibm-cloud.kubernetes.io/internal-ip=10.241.1.20,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0757-5d99e472-86d0-4445-86b9-3ca87bf9b9e7,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-0000010e,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.27_1542_openshift,ibm-cloud.kubernetes.io/zone=us-east-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.1.20,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-1,topology.rook.io/rack=rack1
10.241.129.16   Ready    master,worker   11d   v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-3,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0777_e7984ecd-4805-4a23-a426-3014ca8bbc55,ibm-cloud.kubernetes.io/internal-ip=10.241.129.16,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0777-7e4a53d7-39f6-4a14-a901-5e4a6fc8ed95,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-000004ed,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.27_1542_openshift,ibm-cloud.kubernetes.io/zone=us-east-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.129.16,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-3,topology.rook.io/rack=rack2
10.241.129.17   Ready    master,worker   11d   v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-3,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0777_5b023ff0-3d29-4866-9d29-96f65c2b1c2f,ibm-cloud.kubernetes.io/internal-ip=10.241.129.17,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0777-7e4a53d7-39f6-4a14-a901-5e4a6fc8ed95,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-00000326,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.27_1542_openshift,ibm-cloud.kubernetes.io/zone=us-east-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.129.17,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-3,topology.rook.io/rack=rack2
10.241.65.56    Ready    master,worker   9d    v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-2,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0767_caa642b9-1586-43b3-87a3-871aac5dcbc3,ibm-cloud.kubernetes.io/internal-ip=10.241.65.56,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0767-706430c2-7ec8-4665-b9fa-31bfc1905989,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-00000af6,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.29_1544_openshift,ibm-cloud.kubernetes.io/zone=us-east-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.65.56,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-2
10.241.65.57    Ready    master,worker   9d    v1.19.0+d856161   arch=amd64,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2.16x64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-2,ibm-cloud.kubernetes.io/iaas-provider=g2,ibm-cloud.kubernetes.io/instance-id=0767_bb9ad539-61c1-4e44-a0fd-ff62877afa22,ibm-cloud.kubernetes.io/internal-ip=10.241.65.57,ibm-cloud.kubernetes.io/machine-type=bx2.16x64,ibm-cloud.kubernetes.io/os=REDHAT_7_64,ibm-cloud.kubernetes.io/region=us-east,ibm-cloud.kubernetes.io/sgx-enabled=false,ibm-cloud.kubernetes.io/subnet-id=0767-706430c2-7ec8-4665-b9fa-31bfc1905989,ibm-cloud.kubernetes.io/worker-id=kube-c2lmbsgw0ku6jd2h0g70-cherryocsus-default-00000b91,ibm-cloud.kubernetes.io/worker-pool-id=c2lmbsgw0ku6jd2h0g70-cf4e109,ibm-cloud.kubernetes.io/worker-pool-name=default,ibm-cloud.kubernetes.io/worker-version=4.6.29_1544_openshift,ibm-cloud.kubernetes.io/zone=us-east-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=10.241.65.57,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2.16x64,node.openshift.io/os_id=rhel,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-2
This confirms the hypothesis. The newer nodes without the rack labels are in zone=us-east-2 while the original nodes were in zone=us-east-1 and zone=us-east-3. I can reproduce the crash with the latest upstream master code.
The fix currently being considered is to create a new rack in such a case (a hedged sketch follows below). This still needs to be evaluated to ensure arbiter mode works as expected.
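For illustration only, here is a sketch of that approach, reusing the node type and fmt import from the earlier sketch in this bug. This is an assumption about the shape of the fix, not the actual patch:

// determinePlacementRackSafe is a hypothetical variant of the earlier sketch:
// when no existing rack holds a node in the new node's zone, it returns a
// freshly numbered rack instead of indexing into an empty slice.
func determinePlacementRackSafe(existing []node, newNode node, racks []string) string {
	rackList := []string{}
	for _, rack := range racks {
		for _, n := range existing {
			if n.rack == rack && n.zone == newNode.zone {
				rackList = append(rackList, rack)
				break
			}
		}
	}
	if len(rackList) == 0 {
		// No rack exists in this zone yet: create a new one (rack<N>)
		// rather than crash the operator.
		return fmt.Sprintf("rack%d", len(racks))
	}
	return rackList[0]
}

The key design point is the empty-slice guard: placement falls back to creating a new rack for the new zone instead of assuming at least one existing rack is valid.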
@Shirisha and Mudit: does this need to be backported to OCS 4.7 or 4.6?
Not 4.6.5, unless we don't have a workaround; we can take it in the next 4.7.1 z-stream. I will create a clone.

Sahina, as this was found on ROKS, WDYT?
(In reply to Mudit Agarwal from comment #17)
> Not 4.6.5, unless we don't have a workaround; we can take it in the next
> 4.7.1 z-stream. I will create a clone.
>
> Sahina, as this was found on ROKS, WDYT?

Yes, 4.7 z-stream is fine. Thanks!
This looks fine. Without the fix, the ocs-operator pod would have gone into CrashLoopBackOff.
Based on comment 21 and comment 23, moving to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3003
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.