Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2063316

Summary: Failure to continue ceph external upgrade after disk zapping failure
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Ceph-Ansible
Version: 4.2
Hardware: x86_64
OS: All
Reporter: David Hill <dhill>
Assignee: Guillaume Abrioux <gabrioux>
QA Contact: Ameena Suhani S H <amsyedha>
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: unspecified
Target Release: 4.3z1
CC: aschoen, ceph-eng-bugs, ceph-qe-bugs, dhill, fpantano, gabrioux, gfidente, gmeno, mburns, nthomas, ramishra, tonay, ykaul
Last Closed: 2022-05-03 14:57:19 UTC
Type: Bug

Description David Hill 2022-03-11 18:52:30 UTC
[Issue]
Customer is performing an FFU upgrade from OSP 13 to 16. They have only 3 OSD nodes.
During the FileStore-to-BlueStore (FS to BS) migration, the playbook failed on the first OSD node after the disks were zapped.
Rerunning the external-upgrade command now fails in ansible because the PGs are active+undersized.

#openstack overcloud external-upgrade run --tags ceph_fstobs -e ceph_ansible_limit=<NODE_NAME> | tee oc-fstobs.log
where node_name=150001o4030


[Analysis]

+ There is a validation of Ceph cluster health in the playbook:
~~~
cat ./external_upgrade_steps_tasks.yaml

8< /////
- name: ensure ceph health is OK before proceeding
  tags:
  - ceph_health
  vars:
    fail_on_ceph_health_err: true
    fail_on_ceph_health_warn: true
///// >8
~~~

+ Since there are only 3 OSD nodes, the PGs are in active+undersized state: with only 2 of the 3 nodes up, the cluster cannot find a placement for the third replica.
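The undersized state can be confirmed from a monitor node before retrying; a quick read-only diagnostic sketch:

```shell
# Read-only diagnostics; run on a Ceph monitor node.
ceph health detail              # shows why the cluster is HEALTH_WARN
ceph pg dump_stuck undersized   # lists the PGs missing a replica
ceph osd tree                   # confirms which OSD host/OSDs are down
```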

+ Customer mentions an old case, 02984413, where they say something similar happened.
However, rerunning the command no longer works in OpenStack 16.1.7.


[Action items]
+ Normally, in a standalone Ceph cluster, we could simply rerun the site.yml playbook to add the OSDs back.
  But since this is in the midst of an FFU upgrade, I am not sure about the repercussions it could have.
+ Is it possible to rerun stack deployment at this stage?
 
+ Otherwise this would probably need a BZ against the external-upgrade command/playbook.
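If the health gate is the only blocker, one possible (untested) workaround sketch is to override the playbook variable shown above when rerunning the command. Whether tripleo actually propagates this extra var down to the ceph-ansible health-check task is an assumption that would need verification first:

```shell
# Untested sketch: retry the FS-to-BS step while tolerating HEALTH_WARN.
# fail_on_ceph_health_warn=false is ASSUMED to reach the health-check task;
# verify on a test stack before using on a production cluster.
openstack overcloud external-upgrade run \
  --tags ceph_fstobs \
  -e ceph_ansible_limit=<NODE_NAME> \
  -e fail_on_ceph_health_warn=false | tee oc-fstobs.log
```

Note that overriding the check does not fix the underlying undersized PGs; it only lets the playbook proceed while the third replica is missing.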