Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2063316

Summary: Failure to continue ceph external upgrade after disk zapping failure
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Ceph-Ansible
Version: 4.2
Hardware: x86_64
OS: All
Reporter: David Hill <dhill>
Assignee: Guillaume Abrioux <gabrioux>
QA Contact: Ameena Suhani S H <amsyedha>
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: unspecified
Target Release: 4.3z1
CC: aschoen, ceph-eng-bugs, ceph-qe-bugs, dhill, fpantano, gabrioux, gfidente, gmeno, mburns, nthomas, ramishra, tonay, ykaul
Last Closed: 2022-05-03 14:57:19 UTC
Type: Bug

Description David Hill 2022-03-11 18:52:30 UTC
[Issue]
Customer is performing an FFU upgrade from OSP 13 to 16. They have only 3 OSD nodes.
During the FileStore-to-BlueStore (FS to BS) migration, the playbook failed on the first OSD node after the disks were zapped.
Rerunning the external-upgrade command now fails in ansible because the PGs are active+undersized.

#openstack overcloud external-upgrade run --tags ceph_fstobs -e ceph_ansible_limit=<NODE_NAME> | tee oc-fstobs.log
where node_name=150001o4030


[Analysis]

+ There is a validation of Ceph cluster health in the playbook:
~~~
cat ./external_upgrade_steps_tasks.yaml

8< /////
- name: ensure ceph health is OK before proceeding
  tags:
  - ceph_health
  vars:
    fail_on_ceph_health_err: true
    fail_on_ceph_health_warn: true
///// >8
~~~

+ Since there are only 3 OSD nodes, the PGs are in active+undersized state: with only 2 of the 3 nodes up, the cluster cannot find a placement for the third replica.
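The undersized state can be confirmed from a monitor node before retrying; a quick read-only diagnostic sketch:

```shell
# Read-only diagnostics; run on a Ceph monitor node.
ceph health detail              # shows why the cluster is HEALTH_WARN
ceph pg dump_stuck undersized   # lists the PGs missing a replica
ceph osd tree                   # confirms which OSD host/OSDs are down
```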

+ Customer mentions an old case, 02984413, where they say something similar happened.
However, rerunning the command no longer works in OpenStack 16.1.7.


[Action items]
+ Normally, in a standalone Ceph cluster, we could simply rerun the site.yml playbook to add the OSDs back.
  But since this is in the midst of an FFU upgrade, I am not sure about the repercussions it could have.
+ Is it possible to rerun stack deployment at this stage?
 
+ Otherwise this would probably need a BZ against the external-upgrade command/playbook.
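If the health gate is the only blocker, one possible (untested) workaround sketch is to override the playbook variable shown above when rerunning the command. Whether tripleo actually propagates this extra var down to the ceph-ansible health-check task is an assumption that would need verification first:

```shell
# Untested sketch: retry the FS-to-BS step while tolerating HEALTH_WARN.
# fail_on_ceph_health_warn=false is ASSUMED to reach the health-check task;
# verify on a test stack before using on a production cluster.
openstack overcloud external-upgrade run \
  --tags ceph_fstobs \
  -e ceph_ansible_limit=<NODE_NAME> \
  -e fail_on_ceph_health_warn=false | tee oc-fstobs.log
```

Note that overriding the check does not fix the underlying undersized PGs; it only lets the playbook proceed while the third replica is missing.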