[Issue]
Customer is doing an FFU upgrade from OSP 13 to 16. They have only 3 OSD nodes.
During the FileStore to BlueStore (FS to BS) migration, the playbook failed on the first OSD node after its disks were zapped.
Now, when rerunning the external-upgrade command, ansible fails because the PGs are active+undersized.
#openstack overcloud external-upgrade run --tags ceph_fstobs -e ceph_ansible_limit=<NODE_NAME> | tee oc-fstobs.log
where NODE_NAME=150001o4030
[Analysis]
+ The playbook validates the Ceph health status before proceeding:
~~~
cat ./external_upgrade_steps_tasks.yaml
8< /////
name: ensure ceph health is OK before proceeding
tags:
- ceph_health
vars:
fail_on_ceph_health_err: true
fail_on_ceph_health_warn: true
///// >8
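The vars above suggest the gate fails on HEALTH_ERR always and on HEALTH_WARN when fail_on_ceph_health_warn is true. A minimal sketch of that gating logic (hypothetical helper, not the actual playbook code; variable names taken from the snippet above):

```python
def health_gate(status: str,
                fail_on_ceph_health_err: bool = True,
                fail_on_ceph_health_warn: bool = True) -> bool:
    """Return True if the upgrade may proceed for the given 'ceph health' status.

    Sketch of the assumed semantics of the "ensure ceph health is OK
    before proceeding" task; defaults mirror the vars in the snippet.
    """
    if status == "HEALTH_ERR":
        return not fail_on_ceph_health_err
    if status == "HEALTH_WARN":
        return not fail_on_ceph_health_warn
    return status == "HEALTH_OK"

# active+undersized PGs put the cluster in HEALTH_WARN, so with the
# defaults above the task fails and the rerun aborts:
print(health_gate("HEALTH_WARN"))  # False
print(health_gate("HEALTH_OK"))    # True
```

This is why the rerun cannot get past the health check while the zapped node's OSDs are still out.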
+ Since there are only 3 OSD nodes, the PGs are in active+undersized state: the cluster cannot find a placement for the 3rd replica because only 2 of the 3 nodes are up now.
+ Customer mentions an old case 02984413 where they say something similar happened, but rerunning the command no longer works in OpenStack 16.1.7.
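The undersized state described above can be read out of `ceph -s --format json`. A small sketch of counting the affected PGs from that output (the JSON below is fabricated sample data mirroring a 3-node, size=3 pool with one node down, not taken from this cluster):

```python
import json

# Fabricated sample of the "pgmap" section of 'ceph -s --format json',
# illustrating a cluster where the 3rd replica cannot be placed.
sample = json.loads("""
{
  "pgmap": {
    "pgs_by_state": [
      {"state_name": "active+undersized", "count": 96},
      {"state_name": "active+clean", "count": 32}
    ],
    "num_pgs": 128
  }
}
""")

# Sum every state that includes "undersized" (it can appear combined
# with other flags, e.g. "active+undersized+degraded").
undersized = sum(e["count"] for e in sample["pgmap"]["pgs_by_state"]
                 if "undersized" in e["state_name"])
print(f"{undersized}/{sample['pgmap']['num_pgs']} PGs undersized")
```

Any non-zero undersized count keeps the cluster in HEALTH_WARN, which trips the playbook's health validation.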
[Action items]
+ Normally, on a standalone Ceph cluster, we could have just rerun the site.yml playbook to add the OSDs back.
But since this is in the midst of an FFU upgrade, I am not sure what repercussions that could have.
+ Is it possible to rerun stack deployment at this stage?
+ Otherwise, this would probably need a BZ against the external-upgrade command/playbook.