Description of problem (please be as detailed as possible and provide log snippets):
A replica 1 cluster is being installed when using the UI.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Not sure.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproducible?
Yes, always in a 2-zone OCP cluster.

Can this issue reproduce from the UI?
Yes, installation via the UI reproduces it 100% of the time.

If this is a regression, please provide more details to justify this:
I think so, but in fact I am not sure; I have not tried a 2-zone deployment before. I believe that with 2 zones the failure domain should fall back to racks instead of deploying replica 1.

Steps to Reproduce:
1. Create an OCP 4.9 two-zone cluster
2. Install the ODF 4.9 operator
3. Create a Storage System

Actual results:
The Storage System/cluster is created with replica 1.

Expected results:
The Storage System/cluster should be created with replica 3.

Additional info:
A few things I missed adding in the description, adding them here:

Versions:
OCP version: 4.9.0-0.nightly-2021-09-01-193941
OCS version: 4.9.0-120.ci
$ oc get pvc -n openshift-storage
NAME                              STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
db-noobaa-db-pg-0                 Pending                                                                        ocs-storagecluster-ceph-rbd   115m
ocs-deviceset-gp2-0-data-057qxb   Bound     pvc-57ae009f-00c6-49c6-b1f7-7225b4eea84f   512Gi      RWO            gp2                           117m
rook-ceph-mon-a                   Bound     pvc-df1edead-13ce-42d3-97d2-44d94a7b424c   50Gi       RWO            gp2                           119m
rook-ceph-mon-b                   Bound     pvc-dea4df9c-1b1c-4e5a-afcb-4af857d064e2   50Gi       RWO            gp2                           119m
rook-ceph-mon-c                   Bound     pvc-a74b12f6-b7c9-419e-9328-8c7b62e3e828   50Gi       RWO            gp2                           119m

I see replica 1 being set in the storagecluster yaml:
=======================================
  storageDeviceSets:
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 512Gi
        storageClassName: gp2
        volumeMode: Block
      status: {}
    name: ocs-deviceset-gp2
    placement: {}
    preparePlacement: {}
    replica: 1
    resources: {}
  version: 4.9.0
=======================================
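For a quick check without dumping the whole CR, the relevant fields can be read directly with jsonpath (a sketch, assuming the default StorageCluster name ocs-storagecluster, which is the name shown later in this bug):

$ oc get storagecluster ocs-storagecluster -n openshift-storage \
    -o jsonpath='{.spec.storageDeviceSets[0].replica} {.spec.storageDeviceSets[0].count}{"\n"}'

On the cluster above this would print "1 1".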
*** Bug 2000711 has been marked as a duplicate of this bug. ***
There is a workaround (W/A): change the storagecluster count from 1 to 3. Moving the severity back to high.
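For reference, a sketch of how that workaround could be applied with oc patch (assumes the default StorageCluster name from this report; the path targets the first deviceset shown above):

$ oc patch storagecluster ocs-storagecluster -n openshift-storage --type json \
    -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 3}]'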
From the storagecluster.yaml:

spec:
  arbiter: {}
  encryption:
    enable: true
    kms: {}
  externalStorage: {}
  flexibleScaling: true   <--- this is why the replica is set to 1
  managedResources:
  ...
  failureDomain: host
  failureDomainKey: kubernetes.io/hostname
  failureDomainValues:
  - ip-10-0-137-89.us-west-1.compute.internal
  - ip-10-0-144-4.us-west-1.compute.internal
  - ip-10-0-253-129.us-west-1.compute.internal

FlexibleScaling should only be enabled for internal-attached clusters, in which case it will set the count to the number of OSDs and the replica to 1 (this replica is unrelated to the data replication factor of the pool). If this was not an internal-attached cluster, the storagecluster should have used racks.
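As a side note, the pool's actual replication factor can be checked independently of the deviceset replica, e.g. from the rook-ceph toolbox (a sketch; it assumes the toolbox deployment is enabled and the default block pool name ocs-storagecluster-cephblockpool, neither of which is shown in this report):

$ oc rsh -n openshift-storage deploy/rook-ceph-tools \
    ceph osd pool get ocs-storagecluster-cephblockpool size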
(In reply to N Balachandran from comment #10)
> From the storagecluster.yaml:
>
> spec:
>   arbiter: {}
>   encryption:
>     enable: true
>     kms: {}
>   externalStorage: {}
>   flexibleScaling: true   <--- this is why the replica is set to 1
>   managedResources:
>   ...
>   failureDomain: host
>   failureDomainKey: kubernetes.io/hostname
>   failureDomainValues:
>   - ip-10-0-137-89.us-west-1.compute.internal
>   - ip-10-0-144-4.us-west-1.compute.internal
>   - ip-10-0-253-129.us-west-1.compute.internal

Interesting; we need to find out why this is being set.

> FlexibleScaling should only be enabled for internal-attached clusters, in
> which case it will set the count to the number of OSDs and the replica to 1
> (this replica is unrelated to the data replication factor of the pool).

Agreed.

> If this was not an internal-attached cluster, the storagecluster should
> have used racks.

Certainly, this was not internal-attached or using local devices. The option used was "Use an existing storage class"; I guess this is the same as Internal mode.
Reproducible with an OCP 4.9 cluster (using cluster-bot) and the ODF operator. Since the StorageCluster CR is created by the console, moving this to the console component.
Summary of issues:

1. The UI should not enable flexibleScaling for Internal mode StorageClusters. Only Internal-Attached StorageClusters should be configured to enable flexibleScaling, and only when the storage nodes are in fewer than 3 zones.

2. 3 nodes were selected, but StorageCluster.spec.storageDeviceSets[0].count was set to 1. If flexibleScaling is enabled, the count should be set to the number of OSDs, since StorageCluster.spec.storageDeviceSets[0].replica is set to 1. (See the sketch below.)
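For illustration, a minimal sketch of the two expected spec shapes described above, assuming the deviceset name used in this report; unrelated fields are omitted:

# Internal mode (no flexibleScaling): one deviceset replica per failure domain
spec:
  storageDeviceSets:
  - name: ocs-deviceset-gp2
    count: 1      # devices per replica
    replica: 3    # one per failure domain (zone or rack)

# Internal-Attached mode with nodes in fewer than 3 zones
spec:
  flexibleScaling: true
  storageDeviceSets:
  - name: ocs-deviceset-gp2
    count: 3      # should equal the number of OSDs (one per selected node)
    replica: 1    # unrelated to the pool's data replication factor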
Verified in version:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-21-014208   True        False         78m     Cluster version is 4.10.0-0.nightly-2021-10-21-014208

$ oc get csv odf-operator.v4.9.0 -o yaml | grep full_version
    full_version: 4.9.0-195.ci

Tested on AWS in internal mode, using 2 availability zones. ODF installation and storage system creation were done from the GUI. The replica value is 3, as expected.

$ oc get storagecluster -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      storagesystem.odf.openshift.io/watched-by: ocs-storagecluster-storagesystem
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2021-10-21T14:05:27Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 3
    name: ocs-storagecluster
    namespace: openshift-storage
    resourceVersion: "60478"
    uid: b3a83935-79cf-4ef8-bc1c-0aa023051c21
  spec:
    arbiter: {}
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephConfig: {}
      cephDashboard: {}
      cephFilesystems: {}
      cephObjectStoreUsers: {}
      cephObjectStores: {}
    nodeTopologies: {}
    storageDeviceSets:
    - config: {}
      count: 1
      dataPVCTemplate:
        metadata: {}
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: gp2
          volumeMode: Block
        status: {}
      name: ocs-deviceset-gp2
      placement: {}
      portable: true
      preparePlacement: {}
      replica: 3
      resources: {}
    version: 4.9.0
  status:
    conditions:
    - lastHeartbeatTime: "2021-10-21T14:19:01Z"
      lastTransitionTime: "2021-10-21T14:10:32Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete
    - lastHeartbeatTime: "2021-10-21T14:19:01Z"
      lastTransitionTime: "2021-10-21T14:11:50Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Available
    - lastHeartbeatTime: "2021-10-21T14:19:01Z"
      lastTransitionTime: "2021-10-21T14:11:50Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2021-10-21T14:19:01Z"
      lastTransitionTime: "2021-10-21T14:05:28Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2021-10-21T14:19:01Z"
      lastTransitionTime: "2021-10-21T14:11:50Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: Upgradeable
    failureDomain: rack
    failureDomainKey: topology.rook.io/rack
    failureDomainValues:
    - rack0
    - rack1
    - rack2
    images:
      ceph:
        actualImage: quay.io/rhceph-dev/rhceph@sha256:b5ff930b8b35b4ac002f0f34b4be112b3a433b5615f2ea65402a54a84b6edadb
        desiredImage: quay.io/rhceph-dev/rhceph@sha256:b5ff930b8b35b4ac002f0f34b4be112b3a433b5615f2ea65402a54a84b6edadb
      noobaaCore:
        actualImage: quay.io/rhceph-dev/mcg-core@sha256:f60e2a6a87c1e49be237740d16f74f95578d24213f6a3b85bba4185313278672
        desiredImage: quay.io/rhceph-dev/mcg-core@sha256:f60e2a6a87c1e49be237740d16f74f95578d24213f6a3b85bba4185313278672
      noobaaDB:
        actualImage: registry.redhat.io/rhel8/postgresql-12@sha256:1b91c9946f4351bd3688bc538d498e6738cd8a5285af998be6e8dfe218dca6fa
        desiredImage: registry.redhat.io/rhel8/postgresql-12@sha256:1b91c9946f4351bd3688bc538d498e6738cd8a5285af998be6e8dfe218dca6fa
    nodeTopologies:
      labels:
        failure-domain.beta.kubernetes.io/region:
        - us-east-2
        failure-domain.beta.kubernetes.io/zone:
        - us-east-2a
        - us-east-2b
        kubernetes.io/hostname:
        - ip-10-0-200-3.us-east-2.compute.internal
        - ip-10-0-142-161.us-east-2.compute.internal
        - ip-10-0-181-8.us-east-2.compute.internal
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack2
    phase: Ready
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "60289"
      uid: 36284fff-5b83-4f5a-b475-0e4b5baffbc6
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "60477"
      uid: 83c8145b-6d22-424e-9754-94d76d92a9e7
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ oc get pvc
NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
db-noobaa-db-pg-0                 Bound    pvc-50a57630-1f3b-449b-9aaf-7c18698b2196   50Gi       RWO            ocs-storagecluster-ceph-rbd   9m7s
ocs-deviceset-gp2-0-data-0ww4v6   Bound    pvc-8e3544b2-2e41-42c8-914d-941abd2b96d5   512Gi      RWO            gp2                           10m
ocs-deviceset-gp2-1-data-0xswnv   Bound    pvc-fc5d3a5b-7263-47a0-9069-af9c2d59f412   512Gi      RWO            gp2                           10m
ocs-deviceset-gp2-2-data-0pdtkp   Bound    pvc-49de13cc-b26b-4f61-96d6-3cf995505009   512Gi      RWO            gp2                           10m
rook-ceph-mon-a                   Bound    pvc-0398b6e1-cba3-43c4-8e44-0da317dace61   50Gi       RWO            gp2                           13m
rook-ceph-mon-b                   Bound    pvc-59ba210a-0977-4247-9514-e9925b8c0eb1   50Gi       RWO            gp2                           13m
rook-ceph-mon-c                   Bound    pvc-23d83615-c55f-4bfe-971a-5e9208aad590   50Gi       RWO            gp2                           13m

$ oc get pods -o wide -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-7fd67cb559-g62rp   2/2     Running   0          10m   10.129.2.20   ip-10-0-200-3.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-1-7f96cb594f-jvbvp   2/2     Running   0          10m   10.131.0.60   ip-10-0-142-161.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-2-69c6755569-9jrlv   2/2     Running   0          10m   10.128.2.25   ip-10-0-181-8.us-east-2.compute.internal     <none>           <none>

$ oc get nodes -o wide --show-labels
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME   LABELS
ip-10-0-138-73.us-east-2.compute.internal   Ready   master   104m   v1.22.1+d767194   10.0.138.73   <none>   Red Hat Enterprise Linux CoreOS 410.84.202110191922-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-138-73.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m4.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-142-161.us-east-2.compute.internal   Ready   worker   97m   v1.22.1+d767194   10.0.142.161   <none>   Red Hat Enterprise Linux CoreOS 410.84.202110191922-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-142-161.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a,topology.rook.io/rack=rack1
ip-10-0-181-8.us-east-2.compute.internal   Ready   worker   97m   v1.22.1+d767194   10.0.181.8   <none>   Red Hat Enterprise Linux CoreOS 410.84.202110191922-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-181-8.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a,topology.rook.io/rack=rack2
ip-10-0-190-213.us-east-2.compute.internal   Ready   master   104m   v1.22.1+d767194   10.0.190.213   <none>   Red Hat Enterprise Linux CoreOS 410.84.202110191922-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-190-213.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m4.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-198-192.us-east-2.compute.internal   Ready   master   104m   v1.22.1+d767194   10.0.198.192   <none>   Red Hat Enterprise Linux CoreOS 410.84.202110191922-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-198-192.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m4.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-200-3.us-east-2.compute.internal   Ready   worker   96m   v1.22.1+d767194   10.0.200.3   <none>   Red Hat Enterprise Linux CoreOS 410.84.202110191922-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-200-3.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b,topology.rook.io/rack=rack0

$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                                          STATUS  REWEIGHT  PRI-AFF
 -1         1.50000  root default
 -6         1.50000      region us-east-2
 -5         1.00000          zone us-east-2a
 -4         0.50000              rack rack1
 -3         0.50000                  host ocs-deviceset-gp2-2-data-0pdtkp
  1    ssd  0.50000                      osd.1                          up     1.00000   1.00000
-18         0.50000              rack rack2
-17         0.50000                  host ocs-deviceset-gp2-1-data-0xswnv
  2    ssd  0.50000                      osd.2                          up     1.00000   1.00000
-13         0.50000          zone us-east-2b
-12         0.50000              rack rack0
-11         0.50000                  host ocs-deviceset-gp2-0-data-0ww4v6
  0    ssd  0.50000                      osd.0                          up     1.00000   1.00000
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days