Description of problem:
Major upgrade from OSP 11 to OSP 12 fails while checking OSD health

Version-Release number of selected component (if applicable):
RH OSP 12

How reproducible:
Always

Steps to Reproduce:
1. Major upgrade with ceph (5 nodes, 4 OSDs/node)
2. Deployment fails while "waiting for clean pgs"

Actual results:
Deployment fails

Expected results:
Deployment should succeed

Additional info:
Looks like the issue is at ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml at task "container - waiting for clean pgs..."

After increasing health_osd_check_delay and health_osd_check_retries from 15 sec and 5 to 40 sec and 30 respectively, deployment succeeded. The values are referred from ceph-ansible/infrastructure-playbooks/rolling_update.yml
This does not look like a bug to me. Depending on the infra the timings should be adjusted. We might need to include this in the documentation.
Can't those values be configured via variables rather than by changing the default Ansible playbooks?
You can run ansible with an extra var like -e health_mon_check_retries=200 and this will work without editing the playbook file.
Changing the component from ceph-ansible to tripleo-common so that the same change can be made in the Mistral workbook, if that is where it should be set.
Please update chapters 4.6 and 4.7 of the overcloud upgrade document [1] to include the following note.

"""
During the migration of Ceph to containers, each Ceph monitor and OSD is brought down sequentially, and the migration does not continue until the service that was brought down is successfully brought back up. Ansible waits 15 seconds (the delay) and rechecks 5 times (the retries) for the service to come back. If the service does not come back, the migration stops so that the operator can intervene.

Depending on the size of your Ceph cluster, the retry or delay values may need to be increased. The exact names of these parameters and their defaults are as follows:

    health_mon_check_retries: 5
    health_mon_check_delay: 15
    health_osd_check_retries: 5
    health_osd_check_delay: 15

For example, to have the cluster recheck 30 times and wait 40 seconds between each check, pass the following parameters in a YAML file with -e to the 'openstack overcloud deploy' command:

    parameter_defaults:
      CephAnsibleExtraConfig:
        health_osd_check_delay: 40
        health_osd_check_retries: 30
"""

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/upgrading_red_hat_openstack_platform/assembly-preparing_for_overcloud_upgrade#preparing_for_ceph_storage_node_upgrades
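For a rough sense of what these values mean in practice, here is a minimal sketch of the arithmetic, assuming the total wait per service is approximately retries multiplied by delay. It ignores the time each health check takes to run and the exact way Ansible counts attempts, so treat the numbers as estimates only. The helper name approx_wait_seconds is hypothetical and is not part of ceph-ansible.

```python
def approx_wait_seconds(retries: int, delay: int) -> int:
    """Estimate how long the migration waits for one service to recover.

    Assumption: the window is roughly retries * delay seconds. The run time
    of each check and Ansible's exact attempt counting are ignored.
    """
    return retries * delay


# Defaults quoted in the note: 5 retries, 15 seconds apart.
default_window = approx_wait_seconds(retries=5, delay=15)

# Values that let the deployment in this report succeed: 30 retries, 40 seconds apart.
tuned_window = approx_wait_seconds(retries=30, delay=40)

print(f"default window: ~{default_window} s")
print(f"tuned window:   ~{tuned_window} s (~{tuned_window // 60} min)")
```

Under this assumption the defaults give each OSD a little over a minute to return to clean PGs, while the tuned values allow about 20 minutes, which may explain why the larger cluster only passed after the change.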
Doc bug. Nothing for QE to test or automate with regard to the closed-loop process.