Bug 1928063

Summary: For FD: rack: actual osd pod distribution and OSD placement in rack under ceph osd tree output do not match
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Neha Berry <nberry>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED ERRATA
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.7
CC: madam, muagarwa, ocs-bugs, owasserm, sapillai, tnielsen
Target Milestone: ---
Keywords: AutomationBackLog, Regression
Target Release: OCS 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.7.0-273.ci
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-19 09:20:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Neha Berry 2021-02-12 09:53:59 UTC
Description of problem (please be as detailed as possible and provide log snippets):
======================================================================
Up to OCS 4.6, for failure domain rack, the placement of OSDs within each rack in the ceph osd tree output matched the OSD pod distribution across the compute nodes in the same rack.

But in an OCS 4.7 deployment, the ceph osd tree output does not match the actual placement of the OSD pods on the rack-labelled OCS nodes. Does this have something to do with changes in the OSD placement rules?

e.g., for the following cluster info:
---------------------------

Node          | Rack  | OSD pod                           | PVC
--------------|-------|-----------------------------------|----------------------------------
compute-0     | rack0 | rook-ceph-osd-0-f84fd66c4-xd68f   | ocs-deviceset-thin-0-data-0jw2jt
compute-1     | rack1 | rook-ceph-osd-1-7d79b4bdf4-8wh7x  | ocs-deviceset-thin-2-data-05p4jm
compute-2     | rack2 | rook-ceph-osd-2-7655c968d4-7vbv4  | ocs-deviceset-thin-1-data-0n8ksh


Fri Feb 12 08:40:42 UTC 2021
======ceph osd tree ===
ID  CLASS WEIGHT  TYPE NAME                                     STATUS REWEIGHT PRI-AFF 
 -1       1.50000 root default                                                          
 -4       0.50000     rack rack0                                                        
 -3       0.50000         host ocs-deviceset-thin-0-data-0jw2jt                         
  0   hdd 0.50000             osd.0                                 up  1.00000 1.00000   <--- correct
-12       0.50000     rack rack1                                                        
-11       0.50000         host ocs-deviceset-thin-1-data-0n8ksh                         
  2   hdd 0.50000             osd.2                                 up  1.00000 1.00000   <--- incorrect
 -8       0.50000     rack rack2                                                        
 -7       0.50000         host ocs-deviceset-thin-2-data-05p4jm                         
  1   hdd 0.50000             osd.1                                 up  1.00000 1.00000   <--- incorrect


How does ceph osd tree decide the placement of OSDs under a particular rack? Isn't it using the same rack label that the OCS operator added to the hosting OCS nodes?
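
For reference, the CRUSH location that Ceph has recorded for a single OSD can be queried directly from the rook-ceph toolbox pod (osd.1 below is only an example id):

  $ ceph osd find 1
  # the JSON output includes a "crush_location" section showing the
  # root/rack/host the OSD registered under at startup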

OSD placement rule from one OSD pod
======================================
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage
            operator: Exists
          - key: cluster.ocs.openshift.io/openshift-storage
            operator: Exists
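
For completeness, the same affinity block can be read straight from a running OSD pod, e.g. using the pod name from the table above (adjust the pod name for your cluster):

  $ oc get pod rook-ceph-osd-0-f84fd66c4-xd68f -n openshift-storage \
      -o jsonpath='{.spec.affinity.nodeAffinity}'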



Version of all relevant components (if applicable):
=========================================================
OCP = 4.7.0-0.nightly-2021-02-09-224509
OCS = ocs-operator.v4.7.0-260.ci
ceph = 14.2.11-112.el8cp

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
==================================================================
Yes. The OSD placement shown in ceph osd tree should match the actual pod placement.
Even if an OSD pod drains to another node, the information stays incorrect.

Is there any workaround available to the best of your knowledge?
====================================================================
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
=============================================================================
3

Is this issue reproducible?
===============================
Yes

Can this issue be reproduced from the UI?
========================================
OCS was installed via the UI.

If this is a regression, please provide more details to justify this:
==================================================================
Yes. Up to OCS 4.6 it was working as expected.

Steps to Reproduce:
==========================
1. Install OCS 4.7 on a VMware cluster in dynamic mode, where the OCS operator labels the nodes with rack labels.
2. Check the following outputs and compare whether the OSDs really show up under the correct rack in the ceph osd tree output (a combined check is sketched after these commands):
 >> Verify compute node labels (which node belongs to which rack):
  $ oc get nodes --show-labels|grep rack

>> Verify the OSD pod placement
  $ oc get pods -o wide -n openshift-storage|grep osd 

>> Check the OSD pod and PVC name relation
  $ for i in `oc get pvc -n openshift-storage | grep deviceset | awk '{print $1}'`; do echo $i; echo ++++++; oc describe pvc $i -n openshift-storage | grep -i 'used by' -A3; done
 
>> Check the ceph osd tree output (run from the rook-ceph toolbox pod)
  $ ceph osd tree
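
A combined check along these lines (a rough sketch, assuming the default openshift-storage namespace and the usual app=rook-ceph-osd label on the OSD pods and deployments) prints the node each OSD pod actually runs on next to the rack that Rook injected via --crush-location:

  # Node each OSD pod is actually scheduled on
  oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide

  # Rack each OSD deployment was told to register in CRUSH
  for d in $(oc -n openshift-storage get deploy -l app=rook-ceph-osd -o name); do
    echo "== $d"
    oc -n openshift-storage get "$d" -o yaml | grep -m1 -- '--crush-location'
  done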



Actual results:
=====================
The OSDs are not placed under the correct rack in the ceph osd tree output (the output does not match the pod placement on the nodes in that rack).

Expected results:
======================
The OSDs should show up under the correct rack in the ceph osd tree output.

Comment 5 Santosh Pillai 2021-02-12 11:01:32 UTC
OSD 1 is actually running on compute-1, which is in rack1.

  rook-ceph-osd-1-7d79b4bdf4-8wh7x                                  2/2     Running     0          103m   10.131.0.108   compute-1   <none>           <none>

But the ceph osd tree output shows OSD 1 under rack2.

The rook-ceph-osd-1-7d79b4bdf4-8wh7x deployment shows:
--crush-location=root=default host=ocs-deviceset-thin-2-data-05p4jm rack=rack2
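
For reference, the flag can be checked directly on the OSD deployment (assuming the openshift-storage namespace; rook-ceph-osd-1 is the deployment behind the pod above):

  $ oc -n openshift-storage get deployment rook-ceph-osd-1 -o yaml | grep -- --crush-location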


I suspect this has to do with the topology spread constraints and has the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1924682.

Comment 8 Travis Nielsen 2021-02-17 22:47:22 UTC
Rook now enforces that the topology of the OSD daemons is the same as the topology of the OSD prepare jobs.
The OCS operator PR is not needed since the fix is now in Rook: https://github.com/rook/rook/pull/7256

Comment 12 errata-xmlrpc 2021-05-19 09:20:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041