Bug 1939472 - Failure domain set incorrectly to zone if flexible scaling is enabled but there are >= 3 zones
Summary: Failure domain set incorrectly to zone if flexible scaling is enabled but there are >= 3 zones
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: N Balachandran
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-16 13:12 UTC by N Balachandran
Modified: 2021-05-19 09:21 UTC
CC: 7 users

Fixed In Version: 4.7.0-318.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:20:45 UTC
Embargoed:


Attachments
ocs-storagecluster with flexible scaling enabled (6.25 KB, application/octet-stream), posted 2021-04-25 13:52 UTC by Itzhak


Links
- Github openshift ocs-operator pull 1118 (open): Set failure domain to host if flexibleScaling is enabled (last updated 2021-03-22 09:09:26 UTC)
- Github openshift ocs-operator pull 1126 (open): Bug 1939472: [release-4.7] Set failure domain to host if flexibleScaling is enabled (last updated 2021-03-22 16:36:16 UTC)
- Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:21:13 UTC)

Description N Balachandran 2021-03-16 13:12:22 UTC
Description of problem (please be as detailed as possible and provide log snippets):

If flexible scaling is enabled, the failure domain is expected to be set to host.

However, if flexible scaling is enabled but the OCS hosts are distributed across 3 (or more) zones, the failure domain is still set to zone.
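
To make the intended precedence explicit, here is a minimal Go sketch; the function and field names (determineFailureDomain, FlexibleScaling, zoneCount) are assumptions for illustration only and are not taken from the actual ocs-operator code. The point is simply that the flexibleScaling check should be evaluated before the zone count:

package main

import "fmt"

// Minimal stand-in for the StorageCluster spec field discussed in this bug.
type storageClusterSpec struct {
    FlexibleScaling bool
}

// determineFailureDomain sketches the intended ordering: flexible scaling
// always forces "host", regardless of how many zones the OCS nodes span;
// only otherwise does the zone count decide.
func determineFailureDomain(spec storageClusterSpec, zoneCount int) string {
    if spec.FlexibleScaling {
        return "host"
    }
    if zoneCount >= 3 {
        return "zone"
    }
    return "host"
}

func main() {
    fmt.Println(determineFailureDomain(storageClusterSpec{FlexibleScaling: true}, 3))  // host (expected)
    fmt.Println(determineFailureDomain(storageClusterSpec{FlexibleScaling: false}, 3)) // zone
}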

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Set up an OCP cluster with 3 zones
2. Create a storagecluster using a yaml with flexibleScaling set to true
3. Check the failureDomain in storagecluster.status (see the sketch below)
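
For step 3, a minimal Go sketch of the check, assuming hypothetical struct shapes that only mirror the two fields discussed in this report (spec.flexibleScaling and status.failureDomain) rather than the real ocs-operator API types:

package main

import (
    "encoding/json"
    "fmt"
)

// Minimal stand-ins for the fields checked above; shapes are assumptions.
type storageCluster struct {
    Spec struct {
        FlexibleScaling bool `json:"flexibleScaling"`
    } `json:"spec"`
    Status struct {
        FailureDomain string `json:"failureDomain"`
    } `json:"status"`
}

func main() {
    // Stand-in for the output of `oc get storagecluster ocs-storagecluster -o json`
    // on a cluster hitting this bug: flexible scaling on, 3 zones.
    raw := []byte(`{"spec": {"flexibleScaling": true}, "status": {"failureDomain": "zone"}}`)

    var sc storageCluster
    if err := json.Unmarshal(raw, &sc); err != nil {
        panic(err)
    }

    // With flexibleScaling enabled the failure domain must be "host";
    // anything else reproduces the bug.
    if sc.Spec.FlexibleScaling && sc.Status.FailureDomain != "host" {
        fmt.Printf("bug reproduced: failureDomain is %q, expected \"host\"\n", sc.Status.FailureDomain)
    }
}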


Actual results:
failureDomain is set to zone

Expected results:
failureDomain should be set to host

Additional info:

Comment 2 Martin Bukatovic 2021-03-16 16:02:34 UTC
Discussed in today's OCS Operator triage meeting, and it looks like a serious problem. Giving QA ack. The reproducer in the description is clear.

Comment 5 Martin Bukatovic 2021-03-16 16:17:08 UTC
Proposing to cover this use case via automated test case(s).

Comment 7 Itzhak 2021-04-05 15:11:44 UTC
Which platforms did you check? Does the bug apply to both the vSphere and AWS platforms?

Comment 8 N Balachandran 2021-04-05 15:32:50 UTC
(In reply to Itzhak from comment #7)
> Which platforms did you check? Does the bug apply to both the vSphere and AWS platforms?

It will hold for any platform - this was hit when creating the storage cluster using the CLI. It cannot be reproduced using the UI.

Comment 9 Itzhak 2021-04-25 13:49:53 UTC
I checked the bug with an AWS 4.7 cluster with 3 availability zones. 

Steps I followed to reproduce the bug:

1. Deploy an AWS cluster with OCP 4.7 and 3 availability zones, using the conf file "conf/deployment/aws/ipi_3az_rhcos_lso_3m_3w.yaml", and skip the OCS deployment.

2. Install the OCS 4.7 operator, and label the 3 worker nodes with the OCS label.
3. Install the Local Storage 4.7 operator.
4. Use an ocs-storagecluster yaml file with "flexibleScaling: true". 

5. Check that all the pods in the openshift-storage namespace were created successfully and that Ceph health is OK.
6. Check the failureDomain parameter in the "ocs-storagecluster" and verify that it is "host":
$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep failureDomain:
        f:failureDomain: {}
  failureDomain: host


Additional info about the cluster versions:

OCP version:
Client Version: 4.7.0-0.nightly-2021-04-21-211002
Server Version: 4.7.0-0.nightly-2021-04-23-222925
Kubernetes Version: v1.20.0+7d0a2b2

OCS version:
ocs-operator.v4.7.0-353.ci   OpenShift Container Storage   4.7.0-353.ci              Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-04-23-222925   True        False         70m     Cluster version is 4.7.0-0.nightly-2021-04-23-222925

Rook version
rook: 4.7-132.80f8b1112.release_4.7
go: go1.15.7

Ceph version
ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

Jenkins URL: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/2233/

Comment 10 Itzhak 2021-04-25 13:52:27 UTC
Created attachment 1775186 [details]
ocs-storagecluster with flexible scaling enabled

Comment 11 Itzhak 2021-04-25 13:58:48 UTC
I have just one thing I am not sure about. 
The Ceph osd tree output is:

ID  CLASS WEIGHT  TYPE NAME                        STATUS REWEIGHT PRI-AFF 
 -1       6.82109 root default                                             
 -5       6.82109     region us-east-2                                     
-10       2.27370         zone us-east-2a                                  
 -9       2.27370             host ip-10-0-150-125                         
  0   ssd 2.27370                 osd.0                up  1.00000 1.00000 
 -4       2.27370         zone us-east-2b                                  
 -3       2.27370             host ip-10-0-171-3                           
  1   ssd 2.27370                 osd.1                up  1.00000 1.00000 
-14       2.27370         zone us-east-2c                                  
-13       2.27370             host ip-10-0-216-86                          
  2   ssd 2.27370                 osd.2                up  1.00000 1.00000 


Is this the output we expect in such a case?

Comment 12 N Balachandran 2021-04-27 04:29:13 UTC
Yes, this looks fine. Itzhak, can you confirm that portable was set to false and the count and replica values were changed as well?

Comment 13 Itzhak 2021-04-27 13:12:12 UTC
The count is 3, and the replica is 1. You can also see it in the ocs-storagecluster file I uploaded in comment 10 (https://bugzilla.redhat.com/show_bug.cgi?id=1939472#c10). But I can't find the word "portable" in the file, so I am not sure about that.

Comment 14 N Balachandran 2021-04-28 10:11:35 UTC
That is fine. If portable is missing, it is false.
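
(Side note on why a missing field reads as false: Kubernetes-style Go API types usually declare such booleans with an omitempty JSON tag, so an absent field simply keeps Go's zero value. A minimal sketch, with an assumed deviceSet/Portable shape rather than the real API type:)

package main

import (
    "encoding/json"
    "fmt"
)

// Assumed shape for illustration only: a boolean declared with omitempty,
// as Kubernetes-style API types typically are.
type deviceSet struct {
    Portable bool `json:"portable,omitempty"`
}

func main() {
    var ds deviceSet
    // "portable" is absent from the manifest, as in the attached
    // ocs-storagecluster yaml; other fields are ignored here.
    if err := json.Unmarshal([]byte(`{"count": 3, "replica": 1}`), &ds); err != nil {
        panic(err)
    }
    fmt.Println(ds.Portable) // false: the missing field keeps Go's zero value
}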

Comment 15 Itzhak 2021-04-28 11:38:41 UTC
Okay, great. So I am moving the bug to Verified.

Comment 17 errata-xmlrpc 2021-05-19 09:20:45 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

