Bug 1939472 - Failure domain set incorrectly to zone if flexible scaling is enabled but there are >= 3 zones
Summary: Failure domain set incorrectly to zone if flexible scaling is enabled but there are >= 3 zones
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: N Balachandran
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-16 13:12 UTC by N Balachandran
Modified: 2021-05-19 09:21 UTC
CC: 7 users

Fixed In Version: 4.7.0-318.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:20:45 UTC
Embargoed:


Attachments
ocs-storagecluster with flexible scaling enabled (6.25 KB, application/octet-stream), posted 2021-04-25 13:52 UTC by Itzhak


Links
- Github openshift ocs-operator pull 1118 (open): Set failure domain to host if flexibleScaling is enabled (last updated 2021-03-22 09:09:26 UTC)
- Github openshift ocs-operator pull 1126 (open): Bug 1939472: [release-4.7] Set failure domain to host if flexibleScaling is enabled (last updated 2021-03-22 16:36:16 UTC)
- Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:21:13 UTC)

Description N Balachandran 2021-03-16 13:12:22 UTC
Description of problem (please be as detailed as possible and provide log snippets):

If flexible scaling is enabled, the failure domain is expected to be set to host.

However, if flexible scaling is enabled but the OCS hosts are distributed across 3 (or more) zones, the failure domain is still set to zone.
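
To make the intended precedence explicit, here is a minimal Go sketch; the function and field names (determineFailureDomain, FlexibleScaling, zoneCount) are assumptions for illustration only and are not taken from the actual ocs-operator code. The point is simply that the flexibleScaling check should be evaluated before the zone count:

package main

import "fmt"

// Minimal stand-in for the StorageCluster spec field discussed in this bug.
type storageClusterSpec struct {
    FlexibleScaling bool
}

// determineFailureDomain sketches the intended ordering: flexible scaling
// always forces "host", regardless of how many zones the OCS nodes span;
// only otherwise does the zone count decide.
func determineFailureDomain(spec storageClusterSpec, zoneCount int) string {
    if spec.FlexibleScaling {
        return "host"
    }
    if zoneCount >= 3 {
        return "zone"
    }
    return "host"
}

func main() {
    fmt.Println(determineFailureDomain(storageClusterSpec{FlexibleScaling: true}, 3))  // host (expected)
    fmt.Println(determineFailureDomain(storageClusterSpec{FlexibleScaling: false}, 3)) // zone
}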

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Set up an OCP cluster with 3 zones
2. Create a storagecluster using a yaml with flexibleScaling set to true
3. Check the failureDomain in storagecluster.status (see the sketch below)
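
For step 3, a minimal Go sketch of the check, assuming hypothetical struct shapes that only mirror the two fields discussed in this report (spec.flexibleScaling and status.failureDomain) rather than the real ocs-operator API types:

package main

import (
    "encoding/json"
    "fmt"
)

// Minimal stand-ins for the fields checked above; shapes are assumptions.
type storageCluster struct {
    Spec struct {
        FlexibleScaling bool `json:"flexibleScaling"`
    } `json:"spec"`
    Status struct {
        FailureDomain string `json:"failureDomain"`
    } `json:"status"`
}

func main() {
    // Stand-in for the output of `oc get storagecluster ocs-storagecluster -o json`
    // on a cluster hitting this bug: flexible scaling on, 3 zones.
    raw := []byte(`{"spec": {"flexibleScaling": true}, "status": {"failureDomain": "zone"}}`)

    var sc storageCluster
    if err := json.Unmarshal(raw, &sc); err != nil {
        panic(err)
    }

    // With flexibleScaling enabled the failure domain must be "host";
    // anything else reproduces the bug.
    if sc.Spec.FlexibleScaling && sc.Status.FailureDomain != "host" {
        fmt.Printf("bug reproduced: failureDomain is %q, expected \"host\"\n", sc.Status.FailureDomain)
    }
}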


Actual results:
failureDomain is set to zone

Expected results:
failureDomain should be set to host

Additional info:

Comment 2 Martin Bukatovic 2021-03-16 16:02:34 UTC
Discussed in today's OCS Operator triage meeting, and it looks like a serious problem. Giving QA ack. The reproducer in the description is clear.

Comment 5 Martin Bukatovic 2021-03-16 16:17:08 UTC
Proposing to cover this use case via automated test case(s).

Comment 7 Itzhak 2021-04-05 15:11:44 UTC
Which platforms did you check? Does the bug apply to both the vSphere and AWS platforms?

Comment 8 N Balachandran 2021-04-05 15:32:50 UTC
(In reply to Itzhak from comment #7)
> Which platforms did you check? Does the bug apply to both the vSphere and AWS platforms?

It will hold for any platform - this was hit when creating the storage cluster using the CLI. It cannot be reproduced using the UI.

Comment 9 Itzhak 2021-04-25 13:49:53 UTC
I checked the bug with an AWS 4.7 cluster with 3 availability zones. 

Steps I followed to reproduce the bug:

1. Deploy an AWS cluster with OCP 4.7 and 3 availability zones, using the conf file "conf/deployment/aws/ipi_3az_rhcos_lso_3m_3w.yaml", and skip the OCS deployment.

2. Install the OCS 4.7 operator, and label the 3 worker nodes with the OCS label.
3. Install the Local Storage 4.7 operator.
4. Use an ocs-storagecluster yaml file with "flexibleScaling: true". 

5. Check that all the pods in the openshift-storage namespace were created successfully and that Ceph health is OK.
6. Check the failureDomain parameter in the "ocs-storagecluster" and verify that it is "host":
$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep failureDomain:
        f:failureDomain: {}
  failureDomain: host


Additional info about the cluster versions:

OCP version:
Client Version: 4.7.0-0.nightly-2021-04-21-211002
Server Version: 4.7.0-0.nightly-2021-04-23-222925
Kubernetes Version: v1.20.0+7d0a2b2

OCS version:
ocs-operator.v4.7.0-353.ci   OpenShift Container Storage   4.7.0-353.ci              Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-04-23-222925   True        False         70m     Cluster version is 4.7.0-0.nightly-2021-04-23-222925

Rook version
rook: 4.7-132.80f8b1112.release_4.7
go: go1.15.7

Ceph version
ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

Jenkins URL: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/2233/

Comment 10 Itzhak 2021-04-25 13:52:27 UTC
Created attachment 1775186 [details]
ocs-storagecluster with flexible scaling enabled

Comment 11 Itzhak 2021-04-25 13:58:48 UTC
I have just one thing I am not sure about. 
The Ceph osd tree output is:

ID  CLASS WEIGHT  TYPE NAME                        STATUS REWEIGHT PRI-AFF 
 -1       6.82109 root default                                             
 -5       6.82109     region us-east-2                                     
-10       2.27370         zone us-east-2a                                  
 -9       2.27370             host ip-10-0-150-125                         
  0   ssd 2.27370                 osd.0                up  1.00000 1.00000 
 -4       2.27370         zone us-east-2b                                  
 -3       2.27370             host ip-10-0-171-3                           
  1   ssd 2.27370                 osd.1                up  1.00000 1.00000 
-14       2.27370         zone us-east-2c                                  
-13       2.27370             host ip-10-0-216-86                          
  2   ssd 2.27370                 osd.2                up  1.00000 1.00000 


Is this the output we expect in such a case?

Comment 12 N Balachandran 2021-04-27 04:29:13 UTC
Yes, this looks fine. Itzhak, can you confirm that portable was set to false and the count and replica values were changed as well?

Comment 13 Itzhak 2021-04-27 13:12:12 UTC
The count is 3, and the replica is 1. You can also see it in the ocs-storagecluster file I uploaded in comment 10 (https://bugzilla.redhat.com/show_bug.cgi?id=1939472#c10). But I can't find the word "portable" in the file, so I am not sure about that.

Comment 14 N Balachandran 2021-04-28 10:11:35 UTC
That is fine. If portable is missing, it is false.
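
(Side note on why a missing field reads as false: Kubernetes-style Go API types usually declare such booleans with an omitempty JSON tag, so an absent field simply keeps Go's zero value. A minimal sketch, with an assumed deviceSet/Portable shape rather than the real API type:)

package main

import (
    "encoding/json"
    "fmt"
)

// Assumed shape for illustration only: a boolean declared with omitempty,
// as Kubernetes-style API types typically are.
type deviceSet struct {
    Portable bool `json:"portable,omitempty"`
}

func main() {
    var ds deviceSet
    // "portable" is absent from the manifest, as in the attached
    // ocs-storagecluster yaml; other fields are ignored here.
    if err := json.Unmarshal([]byte(`{"count": 3, "replica": 1}`), &ds); err != nil {
        panic(err)
    }
    fmt.Println(ds.Portable) // false: the missing field keeps Go's zero value
}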

Comment 15 Itzhak 2021-04-28 11:38:41 UTC
Okay, great. So I am moving the bug to Verified.

Comment 17 errata-xmlrpc 2021-05-19 09:20:45 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

