[Documentation] [Update/Upgrade] [OSP8/9/10]: what is the safe way to run a reboot cycle of the Ceph nodes post update/upgrade?

On a client call, we were asked what the safe way is to run a reboot cycle of the Ceph nodes after an update/upgrade. The Ceph engineers recommended setting the following flags and proceeding in this order:

(1) ceph osd set noout
(2) ceph osd set norebalance
(3) reboot the Ceph nodes one by one
(4) wait between node reboots until the PGs are back to normal

This was not documented in OSP8: https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/upgrading-red-hat-openstack-platform/
Also remember to re-enable those after all nodes are back and the PGs are back to normal, with:

ceph osd unset noout
ceph osd unset norebalance
We should also wait for the PGs to return to normal (all active+clean) after each storage node comes back up, not only at the end. The list could therefore be changed to:

(1) ceph osd set noout
(2) ceph osd set norebalance
(3) reboot one ceph-storage node
(4) after the reboot, monitor the Ceph cluster status until the PGs are back to normal
... repeat steps 3 and 4 for each remaining node
(5) ceph osd unset noout
(6) ceph osd unset norebalance
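The sequence above could be sketched as a shell loop. This is a minimal illustration, not the documented procedure: the node names (ceph-0, ceph-1, ceph-2), the heat-admin SSH user, and the exact format of the `ceph pg stat` output are all assumptions, so adapt before use.

```shell
#!/bin/bash
# Sketch of the reboot cycle above. Node names, SSH user, and the
# `ceph pg stat` output format are assumptions, not from the docs.

# Returns success when a `ceph pg stat` line reports every PG as active+clean,
# e.g. "512 pgs: 512 active+clean; 1.1 GiB data, 3.3 GiB used, ..."
pgs_clean() {
  local total clean
  total=$(echo "$1" | grep -oE '[0-9]+ pgs' | grep -oE '[0-9]+')
  clean=$(echo "$1" | grep -oE '[0-9]+ active\+clean' | head -n1 | grep -oE '[0-9]+')
  [ -n "$total" ] && [ "$total" = "$clean" ]
}

# Guarded so the sketch can be read or sourced without touching a cluster.
if [ "${RUN_REBOOT_CYCLE:-0}" = 1 ]; then
  ceph osd set noout
  ceph osd set norebalance
  for node in ceph-0 ceph-1 ceph-2; do            # placeholder node names
    ssh "heat-admin@$node" 'sudo reboot' || true  # connection drops on reboot
    # Wait for the PGs to settle before moving on to the next node.
    until pgs_clean "$(ceph pg stat)"; do
      sleep 30
    done
  done
  ceph osd unset noout
  ceph osd unset norebalance
fi
```

Note that the helper is deliberately strict: it only passes when the total PG count equals the active+clean count, so any degraded, undersized, or peering PGs keep the loop waiting.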
Implemented this content in the following guides:

Director Guide: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/director_installation_and_usage/sect-rebooting_the_overcloud#sect-Rebooting-Ceph
Upgrade Guide: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Upgrading_the_Overcloud-Ceph
Ceph Storage Guide: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/red_hat_ceph_storage_for_the_overcloud/creation#rebooting_the_environment

@Omri and Giulio -- do you have any suggestions for improving this content?
Hi Dan, the instructions look good to me, thanks! One comment on step 1) of the reboot process, which says: "Select the first Ceph Storage node to reboot and log into it." While the above *will* work fine, for better compatibility with future releases we might prefer to tell people to log in to one of the *controller* nodes to issue the "ceph osd set ..." and "ceph osd unset ..." commands, instead of the first storage node. In the future the storage nodes might not have the necessary permissions to run commands that affect the entire cluster; the controllers (or better, the nodes running ceph-mon) always will. Not sure if there is time/resources to change that?
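The suggestion above could look something like the following. A minimal sketch only: the controller hostname (overcloud-controller-0) and the heat-admin SSH user are assumptions, and the reboot loop between the set and unset calls is elided.

```shell
#!/bin/bash
# Sketch: issue the cluster-wide flag commands from a controller (a node
# running ceph-mon) instead of a storage node. The controller hostname and
# SSH user below are assumptions.

CONTROLLER="heat-admin@overcloud-controller-0"  # placeholder hostname

# Builds the remote ceph command for setting/unsetting an OSD flag.
osd_flag_cmd() {
  echo "sudo ceph osd $1 $2"  # $1: set|unset, $2: noout|norebalance
}

# Guarded so the sketch can be read or sourced without touching a cluster.
if [ "${RUN_ON_CLUSTER:-0}" = 1 ]; then
  ssh "$CONTROLLER" "$(osd_flag_cmd set noout)"
  ssh "$CONTROLLER" "$(osd_flag_cmd set norebalance)"
  # ... reboot the storage nodes one at a time here ...
  ssh "$CONTROLLER" "$(osd_flag_cmd unset noout)"
  ssh "$CONTROLLER" "$(osd_flag_cmd unset norebalance)"
fi
```

Keeping the flag commands on a mon-capable node means the procedure keeps working even if storage nodes lose the admin keyring in a later release.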
Hi Giulio, I have implemented the suggestion from comment #4: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/director_installation_and_usage/#sect-Rebooting-Ceph How does it look now?
Perfect, thanks for the update!