Description of problem:
Restarting the Ceph OSDs takes too long, so the converge step of a minor update will most likely time out. This cluster has 10 nodes with 16 OSDs each, which means 160 OSDs to update. In the output below, the first node took roughly 1 hour to restart all its OSDs and the second took 27 minutes.

2019-12-11 09:54:32,393 p=12692 u=mistral | RUNNING HANDLER [ceph-handler : restart ceph osds daemon(s) - container] *******
2019-12-11 09:54:32,394 p=12692 u=mistral | Wednesday 11 December 2019 09:54:32 -0600 (0:00:01.308) 0:17:22.095 ****
2019-12-11 11:00:15,743 p=12692 u=mistral | changed: [10.10.10.1 -> 10.10.10.2] => (item=10.10.10.2)
2019-12-11 11:27:27,920 p=12692 u=mistral | changed: [10.10.10.1 -> 10.10.10.3] => (item=10.10.10.3)

Version-Release number of selected component (if applicable):
Latest ceph-ansible 3.2.30.1-1

How reproducible:
Converge

Steps to Reproduce:
1. Do a minor update on a cluster with many OSDs

Actual results:
The converge step breaks.

Expected results:
The converge step completes.

Additional info:
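For reference, the per-node restart times quoted above can be recomputed from the log timestamps. The following is a minimal Python sketch, not part of ceph-ansible: the helper name `per_node_durations` is hypothetical, and the final extrapolation assumes the remaining eight nodes would restart at the average pace of the first two, which the log does not confirm.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the log excerpt (comma before the milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Values copied from the log excerpt in this report.
handler_start = datetime.strptime("2019-12-11 09:54:32,393", FMT)
node_done = {
    "10.10.10.2": datetime.strptime("2019-12-11 11:00:15,743", FMT),
    "10.10.10.3": datetime.strptime("2019-12-11 11:27:27,920", FMT),
}


def per_node_durations(start, done):
    """Hypothetical helper: time spent on each node, assuming the handler
    processes nodes one after another in the order they are logged."""
    durations = {}
    previous = start
    for node, finished in done.items():
        durations[node] = finished - previous
        previous = finished
    return durations


durations = per_node_durations(handler_start, node_done)
for node, elapsed in durations.items():
    print(f"{node}: {elapsed}")

# Rough estimate only: assumes all 10 nodes behave like the average of
# the two nodes observed so far.
TOTAL_NODES = 10
average = sum(durations.values(), timedelta(0)) / len(durations)
estimate = average * TOTAL_NODES
print(f"estimated total for {TOTAL_NODES} nodes: {estimate}")
```

Under that assumption the handler alone would run for several hours, which gives a sense of why the converge step is likely to hit its timeout.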
*** This bug has been marked as a duplicate of bug 1784047 ***