Bug 1816482 - Ceph cluster degraded when updating Storage nodes
Summary: Ceph cluster degraded when updating Storage nodes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: John Fulton
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-24 05:20 UTC by Chris Smart
Modified: 2020-04-06 12:38 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-06 12:38:13 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4963041 0 None None None 2020-04-03 22:51:50 UTC

Description Chris Smart 2020-03-24 05:20:06 UTC
Description of problem:

When performing an overcloud update of the Ceph storage nodes, the containers are stopped and OSDs go offline, which causes the cluster to go into a degraded state. The cluster then has to rebalance.

Even though updates are done in serial, there is potentially a risk that the cluster might still be in a degraded state by the time the first, second, and third storage nodes are being updated. This could cause data loss, or cause Ceph to stop serving I/O until the affected placement groups meet min_size.

Perhaps the update process should first ensure the Ceph cluster is in a healthy state before proceeding with each node update. If it is not, wait for some time and check again. This way we can mitigate the risk of data loss.

Version-Release number of selected component (if applicable):

RHOSP 13.11

How reproducible:
Always

Steps to Reproduce:
1. Prepare update 'openstack overcloud update prepare'
2. Update first ceph storage node 'openstack overcloud update run --nodes ceph-storage-0'
3. Watch cluster with 'ceph -s'

Actual results:
Node instantly proceeds with update and cluster goes into degraded state.

Expected results:
Update should check that the cluster is healthy before proceeding.

Additional info:

Comment 1 Chris Smart 2020-03-24 05:29:20 UTC
FYI, did something like this to make sure it was healthy before moving on.

source ~/stackrc
for node in $(openstack server list -f value -c Name | grep ceph-storage | sort -V); do
  while [[ ! "$(ssh -q controller-0 'sudo ceph -s | grep health:')" =~ "HEALTH_OK" ]] ; do
    echo "cluster not healthy, sleeping before updating ${node}"
    sleep 5
  done
  echo "cluster healthy, updating ${node}"
  openstack overcloud update run --nodes "${node}" || { echo "failed to update ${node}, exiting"; exit 1 ;}
  echo "updated ${node} successfully"
done

Comment 2 Chris Smart 2020-03-26 11:46:32 UTC
Even when doing a redeploy of RHOSP over the top (no update), it restarts all OSD containers and takes each OSD out, which causes backfilling and recovery.

With a container restart for every single OSD in the cluster, data has to shuffle around until all PGs are active+clean again, which makes a simple redeploy take several hours longer than it should....

I might try with noout, norecover, norebalance and nobackfill set to stop this from happening while the deploy is being run. As containers are restarted quickly I'm hoping this won't be a problem, but I'm not sure what ceph-ansible will be looking for (hopefully just active+clean pgs, not HEALTH_OK as setting those flags will put cluster in HEALTH_WARN).

Comment 3 Chris Smart 2020-03-26 23:48:36 UTC
Setting noout, norecover, norebalance and nobackfill flags before a deploy resulted in expected behaviour.

I'm not quite sure why a redeploy with no Ceph config changes results in taking down each OSD, but it doesn't seem right...
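
For reference, the flag handling described above can be sketched as a pair of small shell helpers. This is a sketch only: CEPH_CMD is a placeholder for however you reach the ceph CLI (e.g. via ssh to a controller), not a documented interface.

```shell
# Sketch: set/unset maintenance flags around a stack update.
# CEPH_CMD is an assumption -- point it at your actual ceph invocation,
# e.g. CEPH_CMD="ssh -q controller-0 sudo ceph".
CEPH_CMD=${CEPH_CMD:-"ssh -q controller-0 sudo ceph"}

set_osd_flags() {
  # Note: each of these flags puts the cluster in HEALTH_WARN while set.
  for flag in noout norecover norebalance nobackfill; do
    $CEPH_CMD osd set "${flag}"
  done
}

unset_osd_flags() {
  for flag in noout norecover norebalance nobackfill; do
    $CEPH_CMD osd unset "${flag}"
  done
}
```

Call set_osd_flags before running the deploy and unset_osd_flags once it completes.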

Comment 4 John Fulton 2020-04-03 19:36:16 UTC
Please see my replies in-line below.

(In reply to Chris Smart from comment #0)
> Description of problem:
> 
> When performing overcloud update of ceph storage nodes, the containers are
> stopped and OSDs go offline which causes the cluster to go into degraded
> state. The cluster then has to rebalance.

So you're following "4.6. Updating all Ceph Storage nodes" from
"Keeping Red Hat OpenStack Platform Updated":

 https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index#updating_all_ceph_storage_nodes

When you run `openstack overcloud ceph-upgrade run ...` it triggers
the ceph-ansible playbook rolling_update.yml.

I don't know what version of ceph-ansible you're using, but I expect
it's the latest, since the previous section of the doc has you register
the undercloud to the rhceph-3-tools-rpms repo and do a yum upgrade.
The latest we ship at this time is 3.2.38:

 https://access.redhat.com/downloads/content/ceph-ansible/3.2.38-1.el7cp/noarch/fd431d51/package

so I'll refer to portions of the code from that version.

 https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml

> Even though updates are done in serial, there is potentially a risk here
> that the cluster might still be in degraded state by the time the first,
> second and third storage nodes are being updated. This might cause data loss
> or cause the ceph to stop serving until it meets min size.
> 
> Perhaps the update process should first ensure the ceph cluster is in a
> healthy state before proceeding with the each node update. If not, wait for
> some time and check again. This way we can mitigate the risk of data loss.
>
> Actual results:
> Node instantly proceeds with update and cluster goes into degraded state.
>
> Expected results:
> Update should check that the cluster is healthy before proceeding.

The playbook already waits for clean PGs:

 https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L405

It doesn't proceed to the next node until the PGs are clean.

The playbook must stop OSDs in order to upgrade them. So any OSD
running a certain container version must be taken offline and then
restarted running a newer container version. This is done for each OSD,
provided that the PGs are clean, as per the loop above.

If you take an OSD offline, the cluster will enter a degraded state;
however, Ceph is designed to be able to handle this.

(In reply to Chris Smart from comment #1)
> FYI, did something like this to make sure it was healthy before moving on.
> 
> source ~/stackrc
> for node in $(openstack server list -f value -c Name |grep ceph-storage
> |sort -V); do
>   while [[ ! "$(ssh -q controller-0 'sudo ceph -s |grep health:')" =~
> "HEALTH_OK" ]] ; do
>     echo 'cluster not healthy, sleeping before updating ${node}'
>     sleep 5
>   done
>   echo 'cluster healthy, updating ${node}'
>   openstack overcloud update run --nodes "${node}" || { echo 'failed to
> update ${node}, exiting'; exit 1 ;}
>   echo 'updated ${node} successfully'
> done

The above is looking for HEALTH_OK while ceph-ansible looks for active+clean:

 https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L411

Getting to active+clean should be sufficient. It's too easy to get
into HEALTH_WARN (e.g. a pool without an application tag triggers a
warning even if you're not relying on pool tags), while active+clean
is more specific.
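
As a hedged sketch of what gating on active+clean (rather than HEALTH_OK) could look like in the shell loop from comment 1: CEPH_PG_STAT_CMD below is a placeholder for however you run `ceph pg stat`, and the one-line output shape ("<N> pgs: <N> active+clean; ...") is an assumption about your Ceph version.

```shell
# Sketch: wait until every PG reports active+clean instead of waiting
# for HEALTH_OK. CEPH_PG_STAT_CMD is an assumption; point it at however
# you run 'ceph pg stat' (e.g. via ssh to a controller).
CEPH_PG_STAT_CMD=${CEPH_PG_STAT_CMD:-"ssh -q controller-0 sudo ceph pg stat"}

wait_for_clean_pgs() {
  while true; do
    # Assumed output shape: "<total> pgs: <clean> active+clean; ..."
    out=$($CEPH_PG_STAT_CMD)
    total=$(echo "$out" | awk '{print $1}')
    # Only succeed when the active+clean count equals the total PG count.
    if echo "$out" | grep -q "^${total} pgs: ${total} active+clean[;,]"; then
      return 0
    fi
    echo "PGs not all active+clean yet, sleeping"
    sleep 10
  done
}
```

This check is stricter in the right way: the cluster can sit in HEALTH_WARN (e.g. with noout set) while all PGs are already active+clean, so grepping for HEALTH_OK would block unnecessarily.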

(In reply to Chris Smart from comment #2)
> Even when doing a redeploy of RHOSP over the top (no update), it's
> restarting all OSD containers and taking each OSD out, which is causing
> backfilling and recovering.

We're now talking about `openstack overcloud deploy ...` and not 
`openstack overcloud ceph-upgrade run ...`. Thus, a different playbook
is triggered.

 https://github.com/ceph/ceph-ansible/blob/v3.2.38/site-docker.yml.sample

> Else with container restart for every single OSD in the cluster it's having
> to shuffle data around until all pgs are active+clean again, which is making
> a simple redeploy take several hours longer than it should....

If you are confident that no change is required for Ceph during the
stack update, then you can have the stack update skip changes to
Ceph. How to do that is described in this article:

 https://access.redhat.com/solutions/4939291

You may also apply a variation of the above in which you noop not only
the Ceph clients but the other Ceph services as well:

resource_registry:
  OS::TripleO::Services::CephClient: OS::Heat::None
  OS::TripleO::Services::CephMds: OS::Heat::None
  OS::TripleO::Services::CephMgr: OS::Heat::None
  OS::TripleO::Services::CephMon: OS::Heat::None
  OS::TripleO::Services::CephRbdMirror: OS::Heat::None
  OS::TripleO::Services::CephRgw: OS::Heat::None
  OS::TripleO::Services::CephOSD: OS::Heat::None

Please read https://access.redhat.com/solutions/4939291 carefully
before dropping in the above Heat changes, though, to understand that
you don't always want the above overrides, only in certain cases.

> I might try with noout, norecover, norebalance and nobackfill set to stop
> this from happening while the deploy is being run. As containers are
> restarted quickly I'm hoping this won't be a problem, but I'm not sure what
> ceph-ansible will be looking for (hopefully just active+clean pgs, not
> HEALTH_OK as setting those flags will put cluster in HEALTH_WARN).

(In reply to Chris Smart from comment #3)
> Setting noout, norecover, norebalance and nobackfill flags before a deploy
> resulted in expected behaviour.

ceph-ansible sets some of those flags for you; e.g. the rolling_update playbook sets noout+norebalance:

 https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L205-L212

takes the OSD offline to upgrade it and then waits for the OSD to be clean:

 https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L411

and then unsets noout+norebalance:

 https://github.com/ceph/ceph-ansible/blob/v3.2.38/infrastructure-playbooks/rolling_update.yml#L437-L442

Setting these values does not prevent the OSD from getting into the clean state (the playbook does it itself).

A future update will replace "norebalance" with "nodeep-scrub" because of this bug:

 https://bugzilla.redhat.com/show_bug.cgi?id=1740463

Because the playbooks are tested and revised to do these types of things for you, I don't know that you need to set these flags yourself.

> I'm not quite sure why with a redeploy with no ceph config changes is
> resulting in taking down each OSD but it doesn't seem right...

Running a stack update (e.g. openstack overcloud deploy ....)
reasserts the configuration. If you make an update to the
configuration definition in the Heat environment files, e.g. change
the Nova cpu_allocation_ratio, and then run a stack update, the
configuration is reasserted and this includes any changes in the
updated configuration definition. This includes re-running
ceph-ansible playbooks which ensure the system configuration is as it
is defined. In order to be certain that a configuration change in an
OSD has been applied, the OSD must be restarted. Thus, ceph-ansible
has handlers to do this. A properly redundant Ceph cluster is designed
to be able to lose a subset of OSDs and continue servicing requests.

If you are certain that you do not wish to reassert the Ceph
configuration during a stack update, then you may update your
configuration to "noop" the Ceph services managed by director, as in
the variation of https://access.redhat.com/solutions/4939291 described
above.

Comment 5 John Fulton 2020-04-03 22:51:50 UTC
Rather than follow a variation of https://access.redhat.com/solutions/4939291, I have documented what's proposed here in a separate article: https://access.redhat.com/solutions/4963041

Comment 6 John Fulton 2020-04-06 12:38:13 UTC
I see this bug report is connected to a support case. If there are further questions or concerns about Red Hat OpenStack director usage with Ceph, then please let's manage them through the support case, as what I've read in this bug report indicates that the software is working as designed.

