Bug 1910842 - [Bug] [RCA] During ceph upgrade all the OSDs (and other ceph services) went down
Summary: [Bug] [RCA] During ceph upgrade all the OSDs (and other ceph services) went down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z15
Target Release: 13.0 (Queens)
Assignee: John Fulton
QA Contact: Yogev Rabl
URL:
Whiteboard:
Duplicates: 1911620 1926821
Depends On:
Blocks:
 
Reported: 2020-12-25 06:52 UTC by vivek koul
Modified: 2024-06-13 23:51 UTC (History)
CC List: 14 users

Fixed In Version: openstack-tripleo-heat-templates-8.4.1-75.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 13:09:23 UTC
Target Upstream Version:
Embargoed:
yrabl: automate_bug+


Links:
- Launchpad 1910124 (last updated 2021-01-04 17:45:53 UTC)
- OpenStack gerrit 769169, MERGED: "[Queens] Handle ceph service restart during update." (last updated 2021-02-15 16:25:18 UTC)
- Red Hat Bugzilla 1882724, CLOSED: "containerized daemons die on dockerd restarts" (last updated 2024-10-01 16:54:45 UTC)
- Red Hat Issue Tracker OSP-32301 (last updated 2024-06-13 23:51:19 UTC)
- Red Hat Knowledge Base (Solution) 5674631 (last updated 2021-01-07 14:01:09 UTC)
- Red Hat Knowledge Base (Solution) 5679791 (last updated 2021-01-04 18:47:14 UTC)
- Red Hat Product Errata RHBA-2021:0932 (last updated 2021-03-18 13:10:34 UTC)

Internal Links: 1882724

Description vivek koul 2020-12-25 06:52:52 UTC
Description of problem:

During the Ceph upgrade, all the OSDs went down on 4 nodes.

Version-Release number of selected component (if applicable):

less installed-rpms | grep tripleo
ansible-tripleo-ipsec-8.1.1-0.20190513184007.7eb892c.el7ost.noarch  Wed Oct  7 07:14:55 2020
openstack-tripleo-common-8.7.1-20.el7ost.noarch             Wed Oct  7 07:16:47 2020
openstack-tripleo-common-containers-8.7.1-20.el7ost.noarch  Wed Oct  7 07:14:32 2020
openstack-tripleo-heat-templates-8.4.1-58.1.el7ost.noarch   Wed Oct  7 07:16:48 2020
openstack-tripleo-image-elements-8.0.3-1.el7ost.noarch      Wed Oct  7 07:14:43 2020
openstack-tripleo-puppet-elements-8.1.1-2.el7ost.noarch     Wed Oct  7 07:14:35 2020
openstack-tripleo-ui-8.3.2-3.el7ost.noarch                  Wed Oct  7 07:22:42 2020
openstack-tripleo-validations-8.5.0-4.el7ost.noarch         Wed Oct  7 07:14:55 2020
puppet-tripleo-8.5.1-14.el7ost.noarch                       Wed Oct  7 07:14:30 2020
python-tripleoclient-9.3.1-7.el7ost.noarch                  Wed Oct  7 07:16:52 2020

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Ceph upgrade failed

Expected results:
Ceph upgrade should not fail

Additional info:
After the storage nodes were rebooted, the OSDs came back up.

Comment 7 Sofer Athlan-Guyot 2021-01-04 14:52:09 UTC
Hi @ykarel, Sofer from DFG:Upgrades,

If I understand this correctly, during the update_tasks of ceph-osd we had the docker restart that happens because [1] was true, i.e. docker needed to be updated.

This, in turn, caused the ceph-osd services to restart and start rebalancing. That takes time and eventually led to all OSDs being down as we progressed through the update[2].

If we add those commands[3] during the update (in the file mentioned in [3], but under "update_tasks") and in step_1, so that they run before the docker restart in step_2, then we would avoid this kind of issue, right?

I wonder whether we need to check for a docker update or whether we could just do that all the time, as those commands should be harmless during that process, provided we put the flags back in step_4, for instance.

WDYT?

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/puppet/services/docker.yaml#L161-L162
[2] By "update" I mean the command "openstack overcloud upgrade run --nodes CephStorage"
[3] https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/puppet/services/ceph-osd.yaml#L92-L99
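
For illustration, a minimal sketch of the kind of update_tasks change I have in mind (container name, step numbers and task names here are only examples, not an actual change):

# Illustrative sketch only: set the flags in step_1, before docker is
# restarted in step_2, and clear them again in step_4. Assumes the tasks
# run on a node that has a ceph-mon-<hostname> container available.
- name: set noout and norebalance before docker is touched
  command: "docker exec ceph-mon-{{ ansible_hostname }} ceph osd set {{ item }}"
  loop:
    - noout
    - norebalance
  when: step|int == 1

- name: unset noout and norebalance once the docker update is done
  command: "docker exec ceph-mon-{{ ansible_hostname }} ceph osd unset {{ item }}"
  loop:
    - noout
    - norebalance
  when: step|int == 4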

Comment 10 Yatin Karel 2021-01-05 08:34:13 UTC
(In reply to Sofer Athlan-Guyot from comment #7)
> Hi @ykarel, Sofer from DFG:Upgrades,
> 
> If I understand this correctly, during the update_tasks of ceph-osd we had
> the docker restart that happens because [1] was true, i.e. docker needed to
> be updated.
> 
Actually it is not a restart but a stop of docker, an upgrade of docker, and then a start of docker. If it had been
a restart, systemd would have restarted the ceph OSDs as part of the docker restart. That does not happen when
docker is stopped and started separately.
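
Roughly, the sequence being run looks like this (an illustrative reconstruction only; the real tasks are in docker.yaml, see [1] below, and "docker_needs_update" is just a placeholder for the actual check):

# Illustrative reconstruction of the docker update step; this is why the
# dependent ceph-* units go down with docker but are never started again.
- name: stop docker before upgrading it
  service:
    name: docker
    state: stopped
  when: (step|int == 2) and (docker_needs_update|default(false)|bool)

- name: upgrade the docker package
  package:
    name: docker
    state: latest
  when: (step|int == 2) and (docker_needs_update|default(false)|bool)

- name: start docker again (the ceph-* units that Require docker stay stopped)
  service:
    name: docker
    state: started
  when: (step|int == 2) and (docker_needs_update|default(false)|bool)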
> This, in turn, caused the ceph-osd services to restart and start rebalancing.
> That takes time and eventually led to all OSDs being down as we progressed
> through the update[2].
> 
No, not a restart: stopping docker stops all the ceph OSDs as well, because the ceph OSD daemons have
Requires: docker.service (added as part of bz 1846830). Docker then gets started again as part of [1],
leaving the OSDs in a stopped state.
Since the OSDs are down, rebalancing of pgs is triggered, and as the OSDs go down one by one, ceph ends
up in an unhealthy state that will not recover until the OSDs are up again.
As part of "ceph-upgrade run" the OSDs get started/restarted[2][3] on one ceph storage node (as it runs
serially), and because the pgs are not in the "active+clean" state, the "waiting for clean pgs..." task[4]
fails and aborts the upgrade. To clear this, the OSDs need to be started again and the rebalance allowed
to complete so that the pgs get back to "active+clean" (see the sketch after the reference list below).
The workaround should avoid the rebalancing, leaving the pgs active+clean, and help "ceph-upgrade run"
succeed, as the "waiting for clean pgs..." task will then not fail and the upgrade continues to all the
storage nodes.

Also, it has nothing to do with time: the OSDs are stopped as part of "openstack overcloud upgrade run --nodes CephStorage",
and they are only started again as part of "openstack overcloud ceph-upgrade run".




[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/puppet/services/docker.yaml#L188-L195
[2] https://github.com/ceph/ceph-ansible/blob/v3.2.52/roles/ceph-osd/tasks/start_osds.yml#L117-L123
[3] https://github.com/ceph/ceph-ansible/blob/v3.2.52/infrastructure-playbooks/rolling_update.yml#L372-L378
[4] https://github.com/ceph/ceph-ansible/blob/v3.2.52/infrastructure-playbooks/rolling_update.yml#L410-L421
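
For reference, recovering from that state boils down to something like the following sketch (illustrative only, not the exact workaround we shared; "osd_ids" is a placeholder, and the mon container name and JSON field names follow what ceph-ansible 3.x typically uses, so adjust for your environment):

# Start the OSD units again on each storage node, then wait for the cluster
# to report every pg as active+clean before resuming the upgrade.
- name: start the ceph-osd units that were left stopped by the docker stop/start
  service:
    name: "ceph-osd@{{ item }}"
    state: started
  loop: "{{ osd_ids | default([]) }}"  # osd_ids: list of local OSD ids (placeholder)

- name: wait until all pgs are back to active+clean (run on a mon host)
  command: "docker exec ceph-mon-{{ ansible_hostname }} ceph -s --format json"
  register: ceph_status
  changed_when: false
  retries: 40
  delay: 30
  until: >
    ((ceph_status.stdout | from_json).pgmap.pgs_by_state
     | selectattr('state_name', 'equalto', 'active+clean')
     | map(attribute='count') | list | sum)
    == (ceph_status.stdout | from_json).pgmap.num_pgs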
> If we add those commands[3] during the update (in the file mentioned in [3],
> but under "update_tasks") and in step_1, so that they run before the docker
> restart in step_2, then we would avoid this kind of issue, right?
> 
No, that alone shouldn't help much, as the flags would be set and unset without the OSDs ever getting started.

 
> I wonder whether we need to check for a docker update or whether we could
> just do that all the time, as those commands should be harmless during that
> process, provided we put the flags back in step_4, for instance.
> 
> WDYT?

With respect to the fix, I think it needs to be done as part of the docker upgrade step, as that is where the
OSDs get stopped but not started. Something like: detect all ceph OSD services that were in the started state,
and if docker gets upgraded, then after docker starts, start all the ceph OSDs that had been running. The "flags"
can be set before docker stops and unset after the OSDs are started. Someone from ceph should confirm/comment
on this theory and suggest the best way to handle start/restarts of ceph services.
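
In update_tasks terms, the shape of what I am describing would roughly be the following (illustrative only; the unit pattern, variable names and step numbers are just examples, the actual change is the one linked from this bug):

# step_1: remember which ceph units are running before docker is stopped.
- name: record which ceph systemd services are active before docker is stopped
  shell: systemctl list-units 'ceph-*' --type=service --state=active --no-legend | awk '{print $1}'
  register: _active_ceph_units
  changed_when: false
  when: step|int == 1

- name: keep that list for the later steps
  set_fact:
    ceph_units_to_restart: "{{ _active_ceph_units.stdout_lines }}"
  when: step|int == 1

# step_2: docker stop / package update / docker start happens here.

# step_4: bring back whatever was running before.
- name: start the recorded ceph services again after docker is back up
  service:
    name: "{{ item }}"
    state: started
  loop: "{{ ceph_units_to_restart | default([]) }}"
  when: step|int == 4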

This also applies to the other ceph services (mon/mgr etc.), so those need to be fixed as well; I see the customer also faced issues during the monitor upgrade (a monitor went down and a manual restart was done). I didn't dig into that initially, as I focused only on the OSDs going down for this RCA; my bad, I should have considered the other services too instead of just the OSDs that were asked about in the RCA.
As part of the OSDs-down RCA we shared a workaround for the upgrade, but not for the other ceph services (mon/mgr), which will also hit the issue (service stopped and not started) due to the docker stop/start. So it would be good to update the customer on this, as they are planning more upgrade activity on another environment.

> 
> [1] https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/puppet/services/docker.yaml#L161-L162
> [2] By "update" I mean the command "openstack overcloud upgrade run --nodes CephStorage"
> [3] https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/puppet/services/ceph-osd.yaml#L92-L99

Comment 11 Sofer Athlan-Guyot 2021-01-05 09:48:49 UTC
Hi,

John, thanks for clearing up the sequential issue; it must have been one particular env on my side.

Yatin, thanks for the clarification, I think I get the full picture now.  You are right then: the problem is
not specific to the ceph-osd container but affects any docker container that is stopped and not restarted
when docker is stopped.

It seems to affect all ceph containers, but maybe there are others.

The solution would then be:

 1. get the list of services that will be stopped if docker is stopped (the dependent services);
 2. stop and start docker (with the upgrade in between);
 3. restart each dependent service.

Relative to the review, that definitely means this has to go in the docker service file.

The first point may prove complicated as, in general, "calculated" actions are fragile,
so maybe we go with a static list of containers to restart; see the sketch below for both options.
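
To make point 1 concrete, the two options would look roughly like this (illustrative sketch only; unit names and the grep pattern are just examples):

# option (a): calculate the dependent units at run time
- name: list the units that will go down together with docker.service
  shell: systemctl list-dependencies --reverse --plain docker.service | grep 'ceph-' | awk '{print $1}'
  register: docker_dependent_units
  changed_when: false

# option (b): hard-code the list of units/containers to restart
- name: restart a fixed list of ceph units once docker is back
  service:
    name: "{{ item }}"
    state: restarted
  loop:
    - "ceph-mon@{{ ansible_hostname }}"
    - "ceph-mgr@{{ ansible_hostname }}"
    # plus the ceph-osd@<id> units on the storage nodes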

Comment 13 John Fulton 2021-01-05 20:43:18 UTC
(In reply to Yatin Karel from comment #10)
> With respect to the fix, I think it needs to be done as part of the docker
> upgrade step, as that is where the OSDs get stopped but not started.

Thanks, Sofer, for this patch, which runs after the docker update:

 https://review.opendev.org/c/openstack/tripleo-heat-templates/+/769393/2/puppet/services/docker.yaml

> Something like: detect all ceph OSD services that were in the started state,
> and if docker gets upgraded, then after docker starts, start all the ceph
> OSDs that had been running. The "flags" can be set before docker stops and
> unset after the OSDs are started. Someone from ceph should confirm/comment
> on this theory and suggest the best way to handle start/restarts of ceph
> services.
> 
> This also applies to the other ceph services (mon/mgr etc.), so those need
> to be fixed as well; I see the customer also faced issues during the monitor
> upgrade (a monitor went down and a manual restart was done). I didn't dig
> into that initially, as I focused only on the OSDs going down for this RCA;
> my bad, I should have considered the other services too instead of just the
> OSDs that were asked about in the RCA.
> As part of the OSDs-down RCA we shared a workaround for the upgrade, but not
> for the other ceph services (mon/mgr), which will also hit the issue (service
> stopped and not started) due to the docker stop/start. So it would be good
> to update the customer on this, as they are planning more upgrade activity
> on another environment.

Updated https://access.redhat.com/solutions/5679791 accordingly to cover all ceph services

Comment 14 John Fulton 2021-01-07 14:01:09 UTC
*** Bug 1911620 has been marked as a duplicate of this bug. ***

Comment 15 Sofer Athlan-Guyot 2021-01-11 17:19:25 UTC
Hi, I have started the downport of the patch.  By the way, thanks Bogdan, we ended up using that command to get a nice list of services.

Comment 20 Sofer Athlan-Guyot 2021-02-11 18:55:23 UTC
*** Bug 1926821 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2021-03-18 13:09:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0932

