Bug 2296991

Summary:	PG state is not active+clean in arbiter deployment
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	Vijay Avuthu <vavuthu>
Component:	rook	Assignee:	Travis Nielsen <tnielsen>
Status:	CLOSED ERRATA	QA Contact:	Vijay Avuthu <vavuthu>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.16	CC:	bniver, ebenahar, muagarwa, nojha, odf-bz-bot, pdhiran, sostapov, tnielsen
Target Milestone:	---	Keywords:	Automation, Regression
Target Release:	ODF 4.16.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.16.0-136	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-07-17 13:25:33 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Vijay Avuthu 2024-07-10 02:29:53 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

[vSphere]: On a fresh Arbiter deployment (3M + 6W), the PG state is not active+clean, which blocks creation of the default OSD PDB.

Version of all relevant components (if applicable):

ocs-registry:4.16.0-135

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
3/3

Can this issue reproduce from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Install an Arbiter deployment.
2. Check that all PDBs are created.


Actual results:

$ oc get pdb
NAME                                               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem    1               N/A               1                     7h54m
rook-ceph-mgr-pdb                                  N/A             1                 1                     7h52m
rook-ceph-mon-pdb                                  N/A             2                 2                     7h52m
rook-ceph-osd-zone-data-2                          N/A             0                 0                     7h51m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore   1               N/A               1                     7h54m



Expected results:

The default OSD PDB should be "rook-ceph-osd".
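
For automation, the comparison between the actual and expected results can be sketched roughly as below. This is only an illustrative check, not the test used in the job: the helper names (parse_pdb_names, osd_pdb_status) are hypothetical, the sample table is copied from the output above, and the "rook-ceph-osd-" prefix for temporary blocking PDBs is an assumption based on the name seen in the operator log.

```python
# Sample copied from the `oc get pdb` output in this report.
SAMPLE = """\
NAME                                               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem    1               N/A               1                     7h54m
rook-ceph-mgr-pdb                                  N/A             1                 1                     7h52m
rook-ceph-mon-pdb                                  N/A             2                 2                     7h52m
rook-ceph-osd-zone-data-2                          N/A             0                 0                     7h51m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore   1               N/A               1                     7h54m
"""

DEFAULT_OSD_PDB = "rook-ceph-osd"
# Assumption: temporary blocking PDBs share this prefix, as in
# "rook-ceph-osd-zone-data-2" from the operator log.
BLOCKING_PREFIX = "rook-ceph-osd-"


def parse_pdb_names(table: str) -> list[str]:
    """Return the NAME column of a plain-text `oc get pdb` table."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    return [ln.split()[0] for ln in lines[1:]]  # lines[0] is the header row


def osd_pdb_status(table: str) -> tuple[bool, list[str]]:
    """Report whether the default OSD PDB exists and list any blocking PDBs."""
    names = parse_pdb_names(table)
    has_default = DEFAULT_OSD_PDB in names
    blocking = [n for n in names if n.startswith(BLOCKING_PREFIX)]
    return has_default, blocking


has_default, blocking = osd_pdb_status(SAMPLE)
print("default OSD PDB present:", has_default)
print("blocking OSD PDBs:", blocking)
```

On the output pasted above, this reports that the default PDB is missing and that one blocking PDB, rook-ceph-osd-zone-data-2, is still present.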


Additional info:

Sometimes two PDBs are created for the OSDs.

> rook ceph operator log

2024-07-09 17:59:41.547696 I | clusterdisruption-controller: osd is down in failure domain "data-2". pg health: "cluster has no PGs"
2024-07-09 17:59:41.547749 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
.
.
.
2024-07-09 17:59:48.866945 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-09 17:59:49.241674 I | clusterdisruption-controller: osd is down in failure domain "data-1". pg health: "cluster has no PGs"
2024-07-09 17:59:49.241775 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-zone-data-2" with maxUnavailable=0 for "zone" failure domain "data-2"


> All OSDs are up
$ oc get pods | grep -i osd | egrep -v "Running|Completed"
$ 

> 
sh-5.1$ ceph -s
  cluster:
    id:     230697f0-ea45-4ceb-9d8a-ee9341ec19c5
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum a,b,c,d,e (age 8h)
    mgr: a(active, since 8h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 8h), 12 in (since 8h); 3 remapped pgs
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 61 pgs
    objects: 2.34k objects, 6.9 GiB
    usage:   28 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     3846/9356 objects misplaced (41.107%)
             58 active+clean
             3  active+clean+remapped
 
  io:
    client:   1023 B/s rd, 263 KiB/s wr, 1 op/s rd, 2 op/s wr
 
sh-5.1$ 
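
A rough way to pull the PG states out of the `ceph -s` text above and flag anything other than plain active+clean is sketched below. It is an illustration only and does not reproduce how the Rook clusterdisruption-controller evaluates PG health; the helper names (pg_state_counts, not_clean) are hypothetical and the sample is the data section copied from this report.

```python
import re

# "data" section copied from the `ceph -s` output in this report.
SAMPLE = """\
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 61 pgs
    objects: 2.34k objects, 6.9 GiB
    usage:   28 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     3846/9356 objects misplaced (41.107%)
             58 active+clean
             3  active+clean+remapped
"""

# Matches lines such as "58 active+clean", optionally prefixed by "pgs:".
PG_LINE = re.compile(r"^\s*(?:pgs:)?\s*(\d+)\s+([a-z_+]+)\s*$")


def pg_state_counts(status_text: str) -> dict[str, int]:
    """Map each PG state string to the number of PGs in that state."""
    counts: dict[str, int] = {}
    for line in status_text.splitlines():
        m = PG_LINE.match(line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts


def not_clean(counts: dict[str, int]) -> dict[str, int]:
    """Keep only the states that are not exactly 'active+clean'."""
    return {state: n for state, n in counts.items() if state != "active+clean"}


counts = pg_state_counts(SAMPLE)
print("all PG states:", counts)
print("not active+clean:", not_clean(counts))
```

For the status shown above, the only state that is not exactly active+clean is active+clean+remapped, with 3 PGs.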


> 
sh-5.1$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
 -1         6.00000  root default                                     
-10         3.00000      zone data-1                                  
-13         1.00000          host compute-0                           
  3    ssd  0.50000              osd.3           up   1.00000  1.00000
  9    ssd  0.50000              osd.9           up   1.00000  1.00000
 -9         1.00000          host compute-2                           
  2    ssd  0.50000              osd.2           up   1.00000  1.00000
  7    ssd  0.50000              osd.7           up   1.00000  1.00000
-17         1.00000          host compute-4                           
 10    ssd  0.50000              osd.10          up   1.00000  1.00000
 11    ssd  0.50000              osd.11          up   1.00000  1.00000
 -4         3.00000      zone data-2                                  
 -3         1.00000          host compute-1                           
  0    ssd  0.50000              osd.0           up   1.00000  1.00000
  6    ssd  0.50000              osd.6           up   1.00000  1.00000
 -7         1.00000          host compute-3                           
  1    ssd  0.50000              osd.1           up   1.00000  1.00000
  8    ssd  0.50000              osd.8           up   1.00000  1.00000
-15         1.00000          host compute-5                           
  4    ssd  0.50000              osd.4           up   1.00000  1.00000
  5    ssd  0.50000              osd.5           up   1.00000  1.00000
sh-5.1$ 

> 

job: https://url.corp.redhat.com/e4d0617
must gather: https://url.corp.redhat.com/7c182c1

> I tested the same scenario on an older build (4.16.0-120); it works fine there and the issue is not seen.

Comment 15 errata-xmlrpc 2024-07-17 13:25:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591