Description of problem (please be as detailed as possible and provide log snippets):

If flexible scaling is enabled, the failure domain is expected to be set to host. However, if flexible scaling is enabled but the OCS hosts are distributed across 3 zones, the failure domain is still set to zone.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Set up an OCP cluster with 3 zones
2. Create a storagecluster using a YAML with flexibleScaling set to true
3. Check the failureDomain in storagecluster.status

Actual results:
failureDomain is set to zone

Expected results:
failureDomain should be set to host

Additional info:
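The expected selection behavior can be summarized in a short sketch. This is illustrative Python, not the actual ocs-operator code; the "3 or more zones means zone failure domain" rule is an assumption inferred from this report, not taken from the operator source.

```python
# Sketch of the failure-domain selection this bug expects
# (illustrative only, NOT the real ocs-operator implementation).
def expected_failure_domain(flexible_scaling: bool, zone_count: int) -> str:
    if flexible_scaling:
        # Flexible scaling places OSDs per host, so the failure
        # domain must be "host" regardless of zone topology.
        return "host"
    # Assumed simplification: without flexible scaling, 3+ zones
    # yield a zone failure domain, otherwise host.
    return "zone" if zone_count >= 3 else "host"

# The bug reported here: a 3-zone cluster with flexible scaling
# enabled still ends up with failureDomain "zone" instead of "host".
```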
Discussed in today's OCS Operator triage meeting; it looks like a serious problem. Giving QA ack. The reproducer in the description is clear.
Proposing to cover this use case via automated test case(s).
Which platforms did you check? Does the bug affect the vSphere and AWS platforms?
(In reply to Itzhak from comment #7)
> Which platforms did you check? The bug is for vSphere and AWS platforms?

It will hold for any platform - this was hit when creating the storage cluster using the CLI. It cannot be reproduced using the UI.
I checked the bug with an AWS 4.7 cluster with 3 availability zones.

Steps I did to reproduce the bug:
1. Deploy an AWS cluster with OCP 4.7 and 3 availability zones, using the conf file "conf/deployment/aws/ipi_3az_rhcos_lso_3m_3w.yaml", and skip the OCS deployment.
2. Install the OCS 4.7 operator and label the 3 worker nodes with the OCS label.
3. Install the Local Storage 4.7 operator.
4. Use an ocs-storagecluster YAML file with "flexibleScaling: true".
5. Check that all the pods in the openshift-storage namespace were created successfully and that Ceph health is OK.
6. Check the failureDomain param in the "ocs-storagecluster" and verify that it is "host":

$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep failureDomain:
      f:failureDomain: {}
  failureDomain: host

Additional info about the cluster versions:

OCP version:
Client Version: 4.7.0-0.nightly-2021-04-21-211002
Server Version: 4.7.0-0.nightly-2021-04-23-222925
Kubernetes Version: v1.20.0+7d0a2b2

OCS version:
ocs-operator.v4.7.0-353.ci   OpenShift Container Storage   4.7.0-353.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-04-23-222925   True        False         70m     Cluster version is 4.7.0-0.nightly-2021-04-23-222925

Rook version:
rook: 4.7-132.80f8b1112.release_4.7
go: go1.15.7

Ceph version:
ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

Jenkins URL: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/2233/
Created attachment 1775186 [details] ocs-storagecluster with flexible scaling enabled
I have just one thing I am not sure about. The Ceph osd tree output is:

ID  CLASS WEIGHT  TYPE NAME                      STATUS REWEIGHT PRI-AFF
 -1       6.82109 root default
 -5       6.82109     region us-east-2
-10       2.27370         zone us-east-2a
 -9       2.27370             host ip-10-0-150-125
  0   ssd 2.27370                 osd.0          up     1.00000  1.00000
 -4       2.27370         zone us-east-2b
 -3       2.27370             host ip-10-0-171-3
  1   ssd 2.27370                 osd.1          up     1.00000  1.00000
-14       2.27370         zone us-east-2c
-13       2.27370             host ip-10-0-216-86
  2   ssd 2.27370                 osd.2          up     1.00000  1.00000

Is it the output we expect in such a case?
Yes, this looks fine. Itzhak, can you confirm that portable was set to false and the count and replica values were changed as well?
The count is 3 and the replica is 1. You can see this in the ocs-storagecluster file I uploaded in comment 10: https://bugzilla.redhat.com/show_bug.cgi?id=1939472#c10. However, I don't find the word "portable" in the file, so I am not sure about that.
That is fine. If portable is missing, it is false.
Okay, great. So I am moving the bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041