Description of problem:
========================
If we reboot more than one MON host node (say 2 out of 3, or all 3), then the docker instances on 2 MONs (out of 3 MONs) keep restarting.

Version-Release number of selected component (if applicable):
=============================================================
ceph-2-rhel-7-docker-candidate-20170516172622
ceph-ansible-2.2.7-1.el7scon.noarch
ansible-2.2.3.0-1.el7.noarch

How reproducible:
=================
2/2
case 1: reboot 2 MONs out of 3 MONs
case 2: shut down the entire cluster (all MONs and OSDs)

Steps to Reproduce:
===================
case 1:
1. Create a containerized cluster with 3 MON nodes and 3 OSD nodes (plus 1 rbd-mirror node).
2. Create some rbd images and set up 2-way mirroring with another cluster.
3. Reboot 2 MON nodes at the same time.

case 2:
1. Create a containerized cluster with 3 MON nodes and 3 OSD nodes (plus 1 rbd-mirror node).
2. Create some rbd images and set up 2-way mirroring with another cluster.
3. Reboot all MONs and OSDs of the cluster at the same time.

Actual results:
===============
The MON docker instances on 2 MON nodes keep restarting and never join quorum.

Expected results:

Additional info:
Proposed fix upstream: https://github.com/ceph/ceph-docker/pull/654
I just pushed a new commit downstream, which should trigger a new image build.
The error log is invalid then. As figured out, there is a conflict between the 2 mons trying to start. Can we have the logs from the initial failure? Thanks.
I think I've found the issue, I'm working on a fix.
New commit, please re-test:

remote: *** Checking commit aee9726f6457dcbef8ef633c21c704111f3d1dfc
remote: *** Resolves:
remote: *** Approved:
remote: ***   rhbz#1455357 (pm_ack+)
remote: *** Commit aee9726f6457dcbef8ef633c21c704111f3d1dfc allowed
Verified with version ceph-2-rhel-7-docker-candidate-96406-20170601145625.
Used a minimal setup (no rgw or rbd I/O going on) and rebooted all MON nodes in one run and 2 MON nodes out of 3 in another. In both cases the cluster was able to reach a healthy state, hence moving to verified.
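For anyone re-testing, a minimal sketch of how the "all MONs rejoined quorum" check could be scripted. This is not part of the fix; the helper names (mons_out_of_quorum, fetch_quorum_status) are hypothetical, and it assumes the ceph CLI is reachable from where the script runs (on a containerized MON that may mean running it inside the MON container). It relies on the "quorum_names" and "monmap" fields of the `ceph quorum_status` JSON output. Note that the command can block while the cluster has no quorum at all.

```python
import json
import subprocess


def mons_out_of_quorum(quorum_status):
    """Return the names of monitors listed in the monmap but not in quorum.

    `quorum_status` is the parsed JSON from `ceph quorum_status`.
    """
    known = [mon["name"] for mon in quorum_status["monmap"]["mons"]]
    in_quorum = set(quorum_status["quorum_names"])
    return [name for name in known if name not in in_quorum]


def fetch_quorum_status():
    """Run `ceph quorum_status` and parse its JSON output.

    Assumption: the `ceph` binary and an admin keyring are available here.
    """
    out = subprocess.check_output(
        ["ceph", "quorum_status", "--format", "json"]
    )
    return json.loads(out)
```

After rebooting the MON nodes, mons_out_of_quorum(fetch_quorum_status()) should return an empty list once every monitor has rejoined; a non-empty list names the monitors still missing.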
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1498