Bug 2309444

Summary: [RDR] When the cluster was upgraded from 4.16 to 4.17, OSDs migrated to new Ceph OSD IDs
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Pratik Surve <prsurve>
Component: ceph
ceph sub component: RADOS
Assignee: Guillaume Abrioux <gabrioux>
QA Contact: Elad <ebenahar>
Status: POST
Docs Contact:
Severity: urgent    
Priority: unspecified
CC: aramteke, bniver, gabrioux, muagarwa, nojha, odf-bz-bot, sostapov, tnielsen
Version: 4.18
Keywords: TestBlocker
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2309719 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2309719    

Description Pratik Surve 2024-09-03 13:45:24 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

[RDR] When the cluster was upgraded from 4.16 to 4.17, OSDs migrated to new Ceph OSD IDs.


Version of all relevant components (if applicable):

OCP version:- 4.17.0-0.nightly-2024-09-02-044025
ODF version:- 4.17.0-90
CEPH version:- ceph version 19.1.0-42.el9cp (03ae7f7ffec5e7796d2808064c4766b35c4b5ffb) squid (rc)
ACM version:- 2.11.2
SUBMARINER version:- v0.18.0
VOLSYNC version:- volsync-product.v0.10.0
OADP version:- 1.4.0
VOLSYNC method:- destinationCopyMethod: Direct
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy RDR 4.16.
2. Upgrade the RDR cluster to 4.17.
3. Check the Ceph status.
4. OSD migration to new IDs starts automatically (example commands to observe this are shown after these steps).
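
A minimal sketch of commands to observe the behavior, assuming the rook-ceph-tools toolbox deployment is enabled in the openshift-storage namespace (the toolbox deployment name and the app=rook-ceph-osd label are assumptions based on standard Rook/ODF defaults; adjust to your environment):

# List OSD pods; the OSD ID is part of each deployment/pod name
oc -n openshift-storage get pods -l app=rook-ceph-osd

# Check overall cluster health and the OSD tree from the toolbox
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd tree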


Actual results:

rook-ceph-osd-3-ff888c5f5-wqjfx                                   2/2     Running     0             89m
rook-ceph-osd-4-8c7886d68-n9mbj                                   2/2     Running     0             64m
rook-ceph-osd-5-696975c56-sl4mj                                   2/2     Running     0             35m


Expected results:
OSD IDs should not change across the upgrade; the original OSD pods should remain, as in the listing below (a way to check the IDs is sketched after it):
openshift-storage                                  rook-ceph-osd-1-7f59c986c7-hv55p                                  2/2     Running     0               6h42m
openshift-storage                                  rook-ceph-osd-2-866d4787bf-trnqp                                  2/2     Running     0               6h48m
openshift-storage                                  rook-ceph-osd-3-ff888c5f5-wqjfx                                   2/2     Running     0               11m
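
One way to check whether the original OSD IDs survive the upgrade (a sketch, assuming the same toolbox access as above; the file names are arbitrary) is to record the ID list before and after and compare:

# Run before the upgrade
ceph osd ls > osd-ids-before.txt
# Run after the upgrade, then compare
ceph osd ls > osd-ids-after.txt
diff osd-ids-before.txt osd-ids-after.txt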

Additional info:

ceph status:
  cluster:
    id:     d378191d-7fe9-469c-9385-7cb128679367
    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 79462/564801 objects degraded (14.069%), 73 pgs degraded, 80 pgs undersized

  services:
    mon:        3 daemons, quorum d,e,f (age 92m)
    mgr:        a(active, since 9m), standbys: b
    mds:        1/1 daemons up, 1 hot standby
    osd:        6 osds: 3 up (since 36m), 4 in (since 37m)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 188.27k objects, 67 GiB
    usage:   163 GiB used, 5.8 TiB / 6 TiB avail
    pgs:     79462/564801 objects degraded (14.069%)
             89 active+clean
             73 active+undersized+degraded
             7  active+undersized

  io:
    client:   20 MiB/s rd, 6.1 MiB/s wr, 1.73k op/s rd, 28 op/s wr