Bug 1782494

Summary: ceph osds restart is taking too much time in the converge step of a minor update
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: David Hill <dhill>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED DUPLICATE
QA Contact: Vasishta <vashastr>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.2
CC: aschoen, ceph-eng-bugs, dsavinea, gmeno, nthomas, ykaul
Target Milestone: rc
Target Release: 5.*
Hardware: x86_64
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-12-16 16:31:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Hill 2019-12-11 17:42:51 UTC
Description of problem:
Restarting the Ceph OSDs takes too much time, so the converge step of a minor update will most likely time out. In this cluster we have 16 OSDs per node and 10 nodes, which means 160 OSDs to update.

In the output below, you can see that the first node took ~1 hour to restart all of its OSDs and the second took 27 minutes.


2019-12-11 09:54:32,393 p=12692 u=mistral |  RUNNING HANDLER [ceph-handler : restart ceph osds daemon(s) - container] *******
2019-12-11 09:54:32,394 p=12692 u=mistral |  Wednesday 11 December 2019  09:54:32 -0600 (0:00:01.308)       0:17:22.095 **** 
2019-12-11 11:00:15,743 p=12692 u=mistral |  changed: [10.10.10.1 -> 10.10.10.2] => (item=10.10.10.2)
2019-12-11 11:27:27,920 p=12692 u=mistral |  changed: [10.10.10.1 -> 10.10.10.3] => (item=10.10.10.3)
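
From the timestamps above, a rough per-OSD estimate (a back-of-the-envelope sketch in Python; it assumes the handler restarts the 16 OSDs on each node strictly serially):

# Per-OSD restart time derived from the handler timestamps above
# (assumption: the 16 OSDs on a node are restarted strictly serially).
from datetime import datetime

fmt = "%H:%M:%S"
node1 = (datetime.strptime("11:00:15", fmt) - datetime.strptime("09:54:32", fmt)).total_seconds()
node2 = (datetime.strptime("11:27:27", fmt) - datetime.strptime("11:00:15", fmt)).total_seconds()

per_osd_slow = node1 / 16   # ~246 s (~4 min) per OSD on the first node
per_osd_fast = node2 / 16   # ~102 s (~1.7 min) per OSD on the second node

# Extrapolated to the whole cluster (10 nodes x 16 OSDs = 160 OSDs):
low  = 160 * per_osd_fast / 3600   # ~4.5 hours
high = 160 * per_osd_slow / 3600   # ~11 hours
print(f"per OSD: {per_osd_fast:.0f}-{per_osd_slow:.0f} s, full cluster: {low:.1f}-{high:.1f} h")

Either way the total is measured in hours, which is consistent with the converge step timing out.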



Version-Release number of selected component (if applicable):
Latest ceph-ansible 3.2.30.1-1

How reproducible:
Converge

Steps to Reproduce:
1. Do a minor update with lots of OSDs

Actual results:
The converge step times out.

Expected results:
The converge step completes.

Additional info:
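For context, a minimal sketch of the serial restart-and-wait pattern behind the "restart ceph osds daemon(s) - container" handler timings above. This is an assumption about its behaviour, not code lifted from ceph-ansible; the unit names and the PG check are illustrative:

# Illustrative only: restart one OSD unit at a time and wait for PGs to settle
# before touching the next one. Each wait can take minutes on a busy cluster,
# which would account for the ~2-4 min per OSD seen in the log above.
import subprocess
import time

OSDS_PER_NODE = 16  # per the cluster described in this report

def pgs_settled() -> bool:
    out = subprocess.run(["ceph", "pg", "stat"], capture_output=True, text=True).stdout
    return not any(s in out for s in ("degraded", "peering", "recovering"))

for osd_id in range(OSDS_PER_NODE):
    subprocess.run(["systemctl", "restart", f"ceph-osd@{osd_id}"], check=True)
    while not pgs_settled():
        time.sleep(10)

Serializing that wait across all 160 OSDs is what pushes the converge step past its timeout.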

Comment 11 Dimitri Savineau 2019-12-16 16:31:25 UTC

*** This bug has been marked as a duplicate of bug 1784047 ***