Bug 2296991 - PG state is not active+clean in arbiter deployment
Summary: PG state is not active+clean in arbiter deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Travis Nielsen
QA Contact: Vijay Avuthu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-07-10 02:29 UTC by Vijay Avuthu
Modified: 2024-07-17 13:25 UTC
CC: 8 users

Fixed In Version: 4.16.0-136
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:25:33 UTC
Embargoed:




Links
Github red-hat-storage/rook pull 677 (open): Bug 2296991: pool: Skip updating crush rules for stretch clusters (last updated 2024-07-10 17:53:46 UTC)
Github rook/rook pull 14447 (open): pool: Skip updating crush rules for stretch clusters (last updated 2024-07-10 17:12:51 UTC)
Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:25:37 UTC)

Description Vijay Avuthu 2024-07-10 02:29:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

[vSphere]: On a fresh arbiter deployment (3M + 6W), the PG state is not active+clean, which blocks creation of the default OSD PDB.
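A quick way to confirm the reported PG state (a sketch; it assumes the standard openshift-storage namespace and that the rook-ceph-tools toolbox deployment is enabled):

$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph pg stat
# On a healthy fresh deployment all PGs should be active+clean; here some
# PGs stay active+clean+remapped (see the ceph -s output below).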

Version of all relevant components (if applicable):

ocs-registry:4.16.0-135

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes, reproduced 3/3 times

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Install an arbiter deployment.
2. Check that all PDBs are created (see the command sketch after this list).
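A sketch of the check in step 2 (assuming the openshift-storage namespace):

$ oc get pdb -n openshift-storage
# On a healthy cluster the default OSD PDB "rook-ceph-osd" should be listed,
# and temporary per-zone blocking PDBs such as "rook-ceph-osd-zone-data-2"
# should not persist.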


Actual results:

$ oc get pdb
NAME                                               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem    1               N/A               1                     7h54m
rook-ceph-mgr-pdb                                  N/A             1                 1                     7h52m
rook-ceph-mon-pdb                                  N/A             2                 2                     7h52m
rook-ceph-osd-zone-data-2                          N/A             0                 0                     7h51m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore   1               N/A               1                     7h54m



Expected results:

The default OSD PDB "rook-ceph-osd" should be created.
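The expected state can be verified directly (a sketch; the PDB name and maxUnavailable value come from the operator log below):

$ oc get pdb rook-ceph-osd -n openshift-storage
# Expected: a single default PDB with maxUnavailable=1 covering all OSDs.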


Additional info:

Sometimes we see two PDBs created for the OSDs.

> rook ceph operator log

2024-07-09 17:59:41.547696 I | clusterdisruption-controller: osd is down in failure domain "data-2". pg health: "cluster has no PGs"
2024-07-09 17:59:41.547749 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
.
.
.
2024-07-09 17:59:48.866945 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-09 17:59:49.241674 I | clusterdisruption-controller: osd is down in failure domain "data-1". pg health: "cluster has no PGs"
2024-07-09 17:59:49.241775 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-zone-data-2" with maxUnavailable=0 for "zone" failure domain "data-2"
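The controller is acting on its PG-health check ("cluster has no PGs") even though ceph itself reports HEALTH_OK below. That check can be approximated from the toolbox (illustrative only; jq is assumed to be available):

$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status -f json | jq '.pgmap.pgs_by_state'
# PG states other than active+clean (here, active+clean+remapped) appear to
# keep the clusterdisruption-controller from restoring the default
# "rook-ceph-osd" PDB.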


> All OSDs are up
$ oc get pods | grep -i osd | egrep -v "Running|Completed"
$ 

> ceph status
sh-5.1$ ceph -s
  cluster:
    id:     230697f0-ea45-4ceb-9d8a-ee9341ec19c5
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum a,b,c,d,e (age 8h)
    mgr: a(active, since 8h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 8h), 12 in (since 8h); 3 remapped pgs
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 61 pgs
    objects: 2.34k objects, 6.9 GiB
    usage:   28 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     3846/9356 objects misplaced (41.107%)
             58 active+clean
             3  active+clean+remapped
 
  io:
    client:   1023 B/s rd, 263 KiB/s wr, 1 op/s rd, 2 op/s wr
 
sh-5.1$ 
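To identify which PGs are stuck remapped (illustrative; run from the toolbox):

$ ceph pg dump pgs_brief | grep remapped
# Lists the 3 active+clean+remapped PGs with their up and acting OSD sets.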


> ceph osd tree
sh-5.1$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
 -1         6.00000  root default                                     
-10         3.00000      zone data-1                                  
-13         1.00000          host compute-0                           
  3    ssd  0.50000              osd.3           up   1.00000  1.00000
  9    ssd  0.50000              osd.9           up   1.00000  1.00000
 -9         1.00000          host compute-2                           
  2    ssd  0.50000              osd.2           up   1.00000  1.00000
  7    ssd  0.50000              osd.7           up   1.00000  1.00000
-17         1.00000          host compute-4                           
 10    ssd  0.50000              osd.10          up   1.00000  1.00000
 11    ssd  0.50000              osd.11          up   1.00000  1.00000
 -4         3.00000      zone data-2                                  
 -3         1.00000          host compute-1                           
  0    ssd  0.50000              osd.0           up   1.00000  1.00000
  6    ssd  0.50000              osd.6           up   1.00000  1.00000
 -7         1.00000          host compute-3                           
  1    ssd  0.50000              osd.1           up   1.00000  1.00000
  8    ssd  0.50000              osd.8           up   1.00000  1.00000
-15         1.00000          host compute-5                           
  4    ssd  0.50000              osd.4           up   1.00000  1.00000
  5    ssd  0.50000              osd.5           up   1.00000  1.00000
sh-5.1$ 

> Job and must-gather links
job: https://url.corp.redhat.com/e4d0617
must gather: https://url.corp.redhat.com/7c182c1

> I tested the same scenario on an older build (4.16.0-120); it works fine there and the issue is not seen, so this is a regression.

Comment 15 errata-xmlrpc 2024-07-17 13:25:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

