Bug 1939472
Summary: Failure domain set incorrectly to zone if flexible scaling is enabled but there are >= 3 zones

Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: ocs-operator
Version: 4.7
Target Release: OCS 4.7.0
Hardware: Unspecified
OS: Unspecified
Reporter: N Balachandran <nibalach>
Assignee: N Balachandran <nibalach>
QA Contact: Itzhak <ikave>
CC: ebenahar, ikave, madam, mbukatov, muagarwa, ocs-bugs, sostapov
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Fixed In Version: 4.7.0-318.ci
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-05-19 09:20:45 UTC
Description (N Balachandran, 2021-03-16 13:12:22 UTC)
Discussed on today's OCS Operator triage meeting, and it looks like a serious problem. Giving QA ack. The reproducer in the description is clear. Proposing to cover this use case via automated test case(s).

---

Which platforms did you check? Is the bug for the vSphere and AWS platforms?

---

(In reply to Itzhak from comment #7)
> Which platforms did you check? Is the bug for the vSphere and AWS platforms?

It will hold for any platform - this was hit when creating the storage cluster using the CLI. It cannot be reproduced using the UI.

---

I checked the bug with an AWS 4.7 cluster with 3 availability zones.

Steps I did to reproduce the bug:
1. Deploy an AWS cluster with OCP 4.7 and 3 availability zones, using the conf file "conf/deployment/aws/ipi_3az_rhcos_lso_3m_3w.yaml", and skip the OCS deploy.
2. Install the OCS 4.7 operator, and label the 3 worker nodes with the ocs label.
3. Install the Local Storage 4.7 operator.
4. Create the storage cluster from an ocs-storagecluster yaml file with "flexibleScaling: true".
5. Check that all the pods in the openshift-storage namespace were created successfully and that Ceph health is OK.
6. Check the failureDomain param in the "ocs-storagecluster" and verify that it is "host":

```
$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep failureDomain:
      f:failureDomain: {}
  failureDomain: host
```

Additional info about the cluster versions:

OCP version:
```
Client Version: 4.7.0-0.nightly-2021-04-21-211002
Server Version: 4.7.0-0.nightly-2021-04-23-222925
Kubernetes Version: v1.20.0+7d0a2b2
```

OCS version:
```
ocs-operator.v4.7.0-353.ci   OpenShift Container Storage   4.7.0-353.ci   Succeeded
```

Cluster version:
```
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-04-23-222925   True        False         70m     Cluster version is 4.7.0-0.nightly-2021-04-23-222925
```

Rook version:
```
rook: 4.7-132.80f8b1112.release_4.7
go: go1.15.7
```

Ceph version:
```
ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)
```

Jenkins URL: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/2233/

Created attachment 1775186 [details]
ocs-storagecluster with flexible scaling enabled
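The verified behaviour above is that flexibleScaling takes precedence over the zone count when the operator picks a failure domain: before the fix, three or more zones won out and failureDomain became "zone"; after the fix it is "host", so each OSD host is its own failure domain. A minimal sketch of that decision in Python, with hypothetical function and parameter names (the real ocs-operator logic is Go code that considers more inputs, such as rack labels):

```python
def select_failure_domain(flexible_scaling: bool, zone_count: int) -> str:
    """Hypothetical simplification of the ocs-operator failure-domain choice.

    Flexible scaling must be checked first: the bug reported here was that
    a cluster spanning >= 3 zones got failureDomain "zone" even when
    flexibleScaling was requested.
    """
    if flexible_scaling:
        return "host"
    if zone_count >= 3:
        return "zone"
    # Fallback shown for illustration only; the real operator has
    # additional cases here.
    return "rack"


# The scenario from this bug: flexibleScaling=true on an AWS cluster with
# 3 availability zones must still yield "host", not "zone".
print(select_failure_domain(True, 3))   # host
print(select_failure_domain(False, 3))  # zone
```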
I have just one thing I am not sure about. The Ceph osd tree output is:

```
ID  CLASS WEIGHT  TYPE NAME                        STATUS REWEIGHT PRI-AFF
 -1       6.82109 root default
 -5       6.82109     region us-east-2
-10       2.27370         zone us-east-2a
 -9       2.27370             host ip-10-0-150-125
  0   ssd 2.27370                 osd.0            up     1.00000  1.00000
 -4       2.27370         zone us-east-2b
 -3       2.27370             host ip-10-0-171-3
  1   ssd 2.27370                 osd.1            up     1.00000  1.00000
-14       2.27370         zone us-east-2c
-13       2.27370             host ip-10-0-216-86
  2   ssd 2.27370                 osd.2            up     1.00000  1.00000
```

Is this the output we expect in such a case?

---

Yes, this looks fine. Itzhak, can you confirm that portable was set to false and that the count and replica values were changed as well?

---

The count is 3, and the replica is 1. You can see it also in the ocs-storagecluster file I uploaded in comment 10: https://bugzilla.redhat.com/show_bug.cgi?id=1939472#c10. But I don't find the word "portable" in the file, so I am not sure about that.

---

That is fine. If portable is missing, it is false.

---

Okay, great. So I am moving the bug to Verified.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041
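The verification exchange above boils down to three checks on the storage device set when flexible scaling is on: count is scaled out (3 here), replica is 1, and portable is either absent or false, since a missing portable key defaults to false. A small sketch of that check, using hypothetical dictionary keys that mirror the StorageCluster YAML field names:

```python
def verify_flexible_scaling(device_set: dict) -> bool:
    """Check the values discussed in the verification comments above.

    portable defaults to False when the key is missing, which is why the
    uploaded ocs-storagecluster file was acceptable even though the word
    "portable" does not appear in it.
    """
    return (
        device_set.get("replica") == 1
        and device_set.get("count", 0) >= 3
        and not device_set.get("portable", False)
    )


# Values from the verified cluster: count=3, replica=1, portable omitted.
print(verify_flexible_scaling({"count": 3, "replica": 1}))  # True
```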