
Bug 1923729

Summary: Performing Ceph Filestore to bluestore migration too quickly caused data loss
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Target Release: 16.1 (Train on RHEL 8.2)
Target Milestone: z6
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Reporter: Yogev Rabl <yrabl>
Assignee: Giulio Fidente <gfidente>
QA Contact: Yogev Rabl <yrabl>
CC: fpantano, gfidente, johfulto, mburns, slinaber, yrabl
Severity: high
Priority: high
Keywords: Triaged
Fixed In Version: openstack-tripleo-validations-11.3.2-1.20210330153514.4db92ba.el8ost, openstack-tripleo-heat-templates-11.3.2-1.20210330183517.29a02c1.el8ost
Type: Bug
Last Closed: 2021-05-26 13:50:41 UTC

Description Yogev Rabl 2021-02-01 17:58:46 UTC
Description of problem:
After running the FFWD (fast forward upgrade) process to upgrade the undercloud and overcloud nodes, the Ceph cluster was running Ceph 4.2 with FileStore as the OSD object store.
Following the instructions in the Red Hat documentation [1], the Ceph storage nodes were migrated one after the other, without waiting for each migrated node to finish rebalancing its data.

Due to the fast pace of the migration, the cluster lost some placement groups:

podman exec ceph-mon-controller-0 ceph -s    (controller-0: Mon Feb  1 17:51:43 2021)

  cluster:
    id:     909a5b90-61db-11eb-8509-525400a1e12d
    health: HEALTH_ERR
            134/3798 objects unfound (3.528%)
            Reduced data availability: 2 pgs inactive, 2 pgs incomplete
            Possible data damage: 2 pgs recovery_unfound
            Degraded data redundancy: 402/11394 objects degraded (3.528%), 2 pgs degraded, 2 pgs undersized
            219 slow ops, oldest one blocked for 7386 sec, daemons [osd.2,osd.4] have slow ops.

  services:
    mon: 3 daemons, quorum controller-1,controller-2,controller-0 (age 3d)
    mgr: controller-0(active, since 3d), standbys: controller-1, controller-2
    osd: 15 osds: 15 up (since 2h), 15 in (since 2h); 2 remapped pgs

  data:
    pools:   5 pools, 160 pgs
    objects: 3.80k objects, 3.2 GiB
    usage:   26 GiB used, 154 GiB / 180 GiB avail
    pgs:     1.250% pgs not active
             402/11394 objects degraded (3.528%)
             134/3798 objects unfound (3.528%)
             156 active+clean
             2   incomplete
             2   active+recovery_unfound+undersized+degraded+remapped

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/framework_for_upgrades_13_to_16.1/osd-migration-from-filestore-to-bluestore  
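
For reference, the missing wait between nodes boils down to something like the following (a minimal sketch only, not taken from the linked procedure; the mon container name is the one used above):

# Sketch (assumption, not from the documented procedure): after migrating one
# storage node, block until the cluster reports HEALTH_OK before starting the
# next node.
until podman exec ceph-mon-controller-0 ceph health | grep -q '^HEALTH_OK'; do
    echo "cluster still rebalancing, waiting..."
    sleep 60
done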

Version-Release number of selected component (if applicable):
ceph-ansible-4.0.41-1.el8cp.noarch

How reproducible:
unknown

Steps to Reproduce:
1. Run an FFWD process on the overcloud
2. Migrate the Ceph OSDs from FileStore to BlueStore, one node immediately after the other


Actual results:
Placement groups go missing while all of the OSDs are rebalancing, resulting in data loss

Expected results:
All of the placement groups have at least one copy available throughout the process.

Additional info:

Comment 2 Giulio Fidente 2021-02-01 18:14:43 UTC
> not waiting for the migrating node to
> finish rebalancing the data.

this is the cause; we'll add a check in the templates to block operations unless the cluster is healthy
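
A rough sketch of the kind of gate meant here (illustration only, not the change that actually shipped in the openstack-tripleo-validations and openstack-tripleo-heat-templates builds listed in Fixed In Version):

# Sketch of the health gate (assumption, not the shipped implementation):
# refuse to start the next node's migration while the cluster is unhealthy.
if ! podman exec ceph-mon-controller-0 ceph health | grep -q '^HEALTH_OK'; then
    echo "Ceph cluster is not healthy; refusing to continue the migration" >&2
    exit 1
fi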

Comment 11 Yogev Rabl 2021-04-26 17:58:24 UTC
verified

Comment 17 errata-xmlrpc 2021-05-26 13:50:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.6 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2097