Bug 1923729
| Summary: | Performing Ceph FileStore to BlueStore migration too quickly caused data loss | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Yogev Rabl <yrabl> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Giulio Fidente <gfidente> |
| Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.1 (Train) | CC: | fpantano, gfidente, johfulto, mburns, slinaber, yrabl |
| Target Milestone: | z6 | Keywords: | Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-validations-11.3.2-1.20210330153514.4db92ba.el8ost, openstack-tripleo-heat-templates-11.3.2-1.20210330183517.29a02c1.el8ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-26 13:50:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
> not waiting for the migrating node to
> finish rebalancing the data.
This is the cause; we'll add a check in the templates to block operations unless the cluster is healthy.
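For illustration only (this is not the actual change shipped in the Fixed In Version packages), a pre-migration gate of the kind described above could look roughly like the following shell sketch, which refuses to continue unless the monitor reports HEALTH_OK. The container name, retry count, and sleep interval are assumptions based on the output later in this report.

```bash
#!/bin/bash
# Hypothetical pre-check sketch: block the next OSD migration until the Ceph
# cluster reports HEALTH_OK. Container name and timing values are assumptions.
set -euo pipefail

MON_CONTAINER="ceph-mon-controller-0"   # assumed, matches the ceph -s output below
RETRIES=60                              # assumed: 60 attempts
SLEEP=30                                # assumed: 30s between attempts (~30 minutes total)

for ((i = 1; i <= RETRIES; i++)); do
    # First line of `ceph health detail` is HEALTH_OK / HEALTH_WARN / HEALTH_ERR
    status=$(podman exec "${MON_CONTAINER}" ceph health detail | head -n1)
    if [[ "${status}" == HEALTH_OK* ]]; then
        echo "Cluster healthy, safe to migrate the next node."
        exit 0
    fi
    echo "Attempt ${i}/${RETRIES}: cluster reports ${status}; waiting ${SLEEP}s..."
    sleep "${SLEEP}"
done

echo "Cluster did not reach HEALTH_OK in time; aborting the migration." >&2
exit 1
```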
Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.6 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2097
Description of problem:
After running the FFWD process to upgrade the undercloud and overcloud nodes, the Ceph cluster ran Ceph 4.2 with FileStore as the OSDs' object store. Following the instructions in the Red Hat documentation [1], the Ceph storage nodes were migrated one after the other, without waiting for the migrating node to finish rebalancing the data. Due to the fast pace of the migration, the cluster lost some placement groups:

```
podman exec ceph-mon-controller-0 ceph -s
controller-0: Mon Feb 1 17:51:43 2021
  cluster:
    id:     909a5b90-61db-11eb-8509-525400a1e12d
    health: HEALTH_ERR
            134/3798 objects unfound (3.528%)
            Reduced data availability: 2 pgs inactive, 2 pgs incomplete
            Possible data damage: 2 pgs recovery_unfound
            Degraded data redundancy: 402/11394 objects degraded (3.528%), 2 pgs degraded, 2 pgs undersized
            219 slow ops, oldest one blocked for 7386 sec, daemons [osd.2,osd.4] have slow ops.

  services:
    mon: 3 daemons, quorum controller-1,controller-2,controller-0 (age 3d)
    mgr: controller-0(active, since 3d), standbys: controller-1, controller-2
    osd: 15 osds: 15 up (since 2h), 15 in (since 2h); 2 remapped pgs

  data:
    pools:   5 pools, 160 pgs
    objects: 3.80k objects, 3.2 GiB
    usage:   26 GiB used, 154 GiB / 180 GiB avail
    pgs:     1.250% pgs not active
             402/11394 objects degraded (3.528%)
             134/3798 objects unfound (3.528%)
             156 active+clean
             2   incomplete
             2   active+recovery_unfound+undersized+degraded+remapped
```

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/framework_for_upgrades_13_to_16.1/osd-migration-from-filestore-to-bluestore

Version-Release number of selected component (if applicable):
ceph-ansible-4.0.41-1.el8cp.noarch

How reproducible:
Unknown

Steps to Reproduce:
1. Run a FFWD process on the overcloud.
2. Run the migration of the Ceph OSDs from FileStore to BlueStore on one node immediately after the other.

Actual results:
Placement groups are missing while all of the OSDs are rebalancing, i.e. some data loss.

Expected results:
All of the placement groups have at least one copy available during the process.

Additional info:
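For reference, a rough sketch of the pacing the Expected results imply, i.e. waiting for recovery to finish after each node before touching the next, might look like the following. The node list, container name, and grep-based health test are assumptions for illustration, not the documented procedure.

```bash
#!/bin/bash
# Illustrative pacing sketch (not the documented procedure): after migrating
# each Ceph storage node, wait until recovery/rebalance has finished before
# starting the next one. Node and container names are assumptions.
set -euo pipefail

MON_CONTAINER="ceph-mon-controller-0"
NODES=(ceph-storage-0 ceph-storage-1 ceph-storage-2)   # hypothetical node list

wait_for_clean() {
    # Keep polling until the cluster is HEALTH_OK and the `ceph -s` summary
    # no longer mentions degraded, misplaced, unfound, or incomplete PGs.
    while true; do
        local status
        status=$(podman exec "${MON_CONTAINER}" ceph -s)
        if echo "${status}" | grep -q "HEALTH_OK" && \
           ! echo "${status}" | grep -Eq "degraded|misplaced|unfound|incomplete"; then
            return 0
        fi
        echo "Recovery still in progress; checking again in 60s..."
        sleep 60
    done
}

for node in "${NODES[@]}"; do
    echo "Migrating ${node} from FileStore to BlueStore (see the documented procedure [1])..."
    # ... run the documented per-node migration step here ...
    wait_for_clean
    echo "${node} done and cluster is clean; moving on to the next node."
done
```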