Description of problem:

When performing an overcloud update of the Ceph storage nodes, the containers are stopped and the OSDs go offline, which puts the cluster into a degraded state. The cluster then has to rebalance. Even though updates are done in serial, there is a risk that the cluster might still be degraded by the time the first, second and third storage nodes are being updated. This could cause data loss, or cause Ceph to stop serving I/O until the pools meet min_size.

Perhaps the update process should first ensure the Ceph cluster is in a healthy state before proceeding with each node update. If it is not, wait for some time and check again. This would mitigate the risk of data loss.

Version-Release number of selected component (if applicable):
RHOSP 13.11

How reproducible:
Always

Steps to Reproduce:
1. Prepare the update: 'openstack overcloud update prepare'
2. Update the first Ceph storage node: 'openstack overcloud update run --nodes ceph-storage-0'
3. Watch the cluster with 'ceph -s'

Actual results:
The node instantly proceeds with the update and the cluster goes into a degraded state.

Expected results:
The update should check that the cluster is healthy before proceeding with each node.

Additional info:
FYI, I did something like this to make sure it was healthy before moving on:

source ~/stackrc
for node in $(openstack server list -f value -c Name | grep ceph-storage | sort -V); do
    while [[ ! "$(ssh -q controller-0 'sudo ceph -s | grep health:')" =~ "HEALTH_OK" ]]; do
        echo "cluster not healthy, sleeping before updating ${node}"
        sleep 5
    done
    echo "cluster healthy, updating ${node}"
    openstack overcloud update run --nodes "${node}" || { echo "failed to update ${node}, exiting"; exit 1 ;}
    echo "updated ${node} successfully"
done
Even when doing a redeploy of RHOSP over the top (no update), it restarts all of the OSD containers and takes each OSD out, which causes backfilling and recovery. With a container restart for every single OSD in the cluster, data has to be shuffled around until all PGs are active+clean again, which makes a simple redeploy take several hours longer than it should...

I might try setting noout, norecover, norebalance and nobackfill to stop this from happening while the deploy is being run. As the containers are restarted quickly I'm hoping this won't be a problem, but I'm not sure what ceph-ansible will be looking for (hopefully just active+clean PGs, not HEALTH_OK, as setting those flags will put the cluster in HEALTH_WARN).
Setting the noout, norecover, norebalance and nobackfill flags before a deploy resulted in the expected behaviour. I'm not quite sure why a redeploy with no Ceph config changes results in taking down each OSD, but it doesn't seem right...
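For reference, a minimal sketch of the manual flag approach described above, assuming passwordless SSH to controller-0 and that the admin keyring is available there (the node name is illustrative):

# set the flags before the redeploy so OSD container restarts do not trigger data movement
ssh -q controller-0 'for f in noout norecover norebalance nobackfill; do sudo ceph osd set "$f"; done'

# ... run 'openstack overcloud deploy ...' here ...

# unset the flags afterwards so the cluster can recover and rebalance normally again
ssh -q controller-0 'for f in noout norecover norebalance nobackfill; do sudo ceph osd unset "$f"; done'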
Please see my replies in-line below.

(In reply to Chris Smart from comment #0)
> When performing an overcloud update of the Ceph storage nodes, the
> containers are stopped and the OSDs go offline, which puts the cluster
> into a degraded state. The cluster then has to rebalance.

So you're following "4.6. Updating all Ceph Storage nodes" from "Keeping Red Hat OpenStack Platform Updated":

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index#updating_all_ceph_storage_nodes

When you run `openstack overcloud ceph-upgrade run ...` it triggers the ceph-ansible playbook rolling_update.yml. I don't know what version of ceph-ansible you're using, but I expect it's the latest, since the previous section of the doc has you register the undercloud to the rhceph-3-tools-rpms repo and do a yum upgrade. The latest we ship at this time is 3.2.38:

https://access.redhat.com/downloads/content/ceph-ansible/3.2.38-1.el7cp/noarch/fd431d51/package

so I'll refer to portions of the code from that version:

https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml

> Even though updates are done in serial, there is a risk that the cluster
> might still be degraded by the time the first, second and third storage
> nodes are being updated. This could cause data loss, or cause Ceph to stop
> serving I/O until the pools meet min_size.
>
> Perhaps the update process should first ensure the Ceph cluster is in a
> healthy state before proceeding with each node update. If it is not, wait
> for some time and check again. This would mitigate the risk of data loss.
>
> Actual results:
> The node instantly proceeds with the update and the cluster goes into a
> degraded state.
>
> Expected results:
> The update should check that the cluster is healthy before proceeding with
> each node.

The playbook already waits for clean PGs:

https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L405

It doesn't proceed to the next node until the PGs are clean. The playbook must stop OSDs in order to upgrade them, so any OSD running a certain container version must be taken offline and then restarted running a newer container version. This is done for each OSD, provided that the PGs are clean as per the loop above. If you take an OSD offline the system will enter a degraded state; however, Ceph is designed to be able to handle this.

(In reply to Chris Smart from comment #1)
> FYI, I did something like this to make sure it was healthy before moving on:
>
> source ~/stackrc
> for node in $(openstack server list -f value -c Name | grep ceph-storage | sort -V); do
>     while [[ ! "$(ssh -q controller-0 'sudo ceph -s | grep health:')" =~ "HEALTH_OK" ]]; do
>         echo "cluster not healthy, sleeping before updating ${node}"
>         sleep 5
>     done
>     echo "cluster healthy, updating ${node}"
>     openstack overcloud update run --nodes "${node}" || { echo "failed to update ${node}, exiting"; exit 1 ;}
>     echo "updated ${node} successfully"
> done

The above is looking for HEALTH_OK, while ceph-ansible looks for active+clean:

https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L411

Getting to active+clean should be sufficient. It's too easy to get into HEALTH_WARN (e.g. an untagged pool triggers a warning even if you're not relying on pool tags), while active+clean is more specific.
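If you want a manual pre-check that mirrors what the playbook does, a rough sketch is below. It assumes passwordless SSH to controller-0, that python is available locally, and that the JSON output of `ceph -s` exposes pgmap.num_pgs and pgmap.pgs_by_state as used by the ceph-ansible task linked above; adjust names as needed.

# rough manual equivalent of the ceph-ansible active+clean check (illustrative only)
ssh -q controller-0 'sudo ceph -s --format json' | python -c '
import json, sys
pgmap = json.load(sys.stdin)["pgmap"]
clean = sum(s["count"] for s in pgmap["pgs_by_state"]
            if s["state_name"].startswith("active+clean"))
sys.exit(0 if clean == pgmap["num_pgs"] else 1)
' && echo "all PGs active+clean" || echo "PGs not yet clean"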
(In reply to Chris Smart from comment #2)
> Even when doing a redeploy of RHOSP over the top (no update), it restarts
> all of the OSD containers and takes each OSD out, which causes backfilling
> and recovery.

We're now talking about `openstack overcloud deploy ...` and not `openstack overcloud ceph-upgrade run ...`. Thus, a different playbook is triggered:

https://github.com/ceph/ceph-ansible/blob/v3.2.38/site-docker.yml.sample

> With a container restart for every single OSD in the cluster, data has to
> be shuffled around until all PGs are active+clean again, which makes a
> simple redeploy take several hours longer than it should...

If you are confident that no change is required for Ceph during the stack update, then you can have the stack update skip changes to Ceph. How to do that is described in this article:

https://access.redhat.com/solutions/4939291

You may apply a variation of the above where you noop not only the Ceph clients but the other Ceph services as well:

resource_registry:
  OS::TripleO::Services::CephClient: OS::Heat::None
  OS::TripleO::Services::CephMds: OS::Heat::None
  OS::TripleO::Services::CephMgr: OS::Heat::None
  OS::TripleO::Services::CephMon: OS::Heat::None
  OS::TripleO::Services::CephRbdMirror: OS::Heat::None
  OS::TripleO::Services::CephRgw: OS::Heat::None
  OS::TripleO::Services::CephOSD: OS::Heat::None

Please read https://access.redhat.com/solutions/4939291 carefully before dropping the above Heat changes in, though, to understand that you don't always want these overrides in place, only in certain cases.

> I might try setting noout, norecover, norebalance and nobackfill to stop
> this from happening while the deploy is being run. As the containers are
> restarted quickly I'm hoping this won't be a problem, but I'm not sure what
> ceph-ansible will be looking for (hopefully just active+clean PGs, not
> HEALTH_OK, as setting those flags will put the cluster in HEALTH_WARN).

(In reply to Chris Smart from comment #3)
> Setting the noout, norecover, norebalance and nobackfill flags before a
> deploy resulted in the expected behaviour.

ceph-ansible sets some of those flags for you; e.g. the rolling_update playbook sets noout+norebalance:

https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L205-L212

takes the OSD offline to upgrade it and then waits for the OSD to be clean:

https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L411

and then unsets noout+norebalance:

https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L437-L442

Setting these values does not prevent the OSD from getting into the clean state (the playbook does it itself). A future update will replace "norebalance" with "nodeep-scrub" because of this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1740463

Because the playbooks are tested and revised to do these types of things for you, I don't know that you need to set these flags yourself.

> I'm not quite sure why a redeploy with no Ceph config changes results in
> taking down each OSD, but it doesn't seem right...

Running a stack update (e.g. `openstack overcloud deploy ...`) reasserts the configuration. If you make an update to the configuration definition in the Heat environment files, e.g. change the Nova cpu_allocation_ratio, and then run a stack update, the configuration is reasserted, and this includes any changes in the updated configuration definition.
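As a usage sketch (the file name is illustrative, not from the article): save the overrides above into an environment file and pass it at the end of your usual deploy command so its mappings take precedence, for example:

# ceph-noop.yaml is a hypothetical file name containing the resource_registry block above
openstack overcloud deploy --templates \
  -e <your existing environment files> \
  -e ceph-noop.yaml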
This includes re-running the ceph-ansible playbooks, which ensure the system configuration is as it is defined. In order to be certain that a configuration change in an OSD has been applied, the OSD must be restarted; thus, ceph-ansible has handlers to do this. A properly redundant Ceph cluster is designed to be able to lose a subset of OSDs and continue servicing requests. If you are certain that you do not wish to reassert the Ceph configuration during a stack update, then you may update your configuration to "noop" the Ceph services managed by director, as described in the variation of https://access.redhat.com/solutions/4939291 above.
Rather than follow a variation of https://access.redhat.com/solutions/4939291, I have documented what's proposed here in a separate article: https://access.redhat.com/solutions/4963041
I see this bug report is connected to a support case. If there are further questions or concerns about Red Hat OpenStack director usage with Ceph, then let's please manage them through the support case, as what I've read in this bug report indicates that the software is working as designed.