Description of problem: Restarting an osd twice in quick succession can result in it starting up and sending an MOSDBoot message whose epoch is the same as its up_from epoch (the epoch in which it was last marked up). The mon then ignores the boot message and leaves the osd in limbo, since the osd won't resend the boot. I'm not leaving reproduction steps for this. There are two other ansible bugs blocked on this one which reliably reproduce it. Please verify those and consider this one verified if they are fixed.
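For readers unfamiliar with the sequence, here is a toy model of the race described above. It is only a sketch and is not the Ceph monitor code: the class names, the `mark_down` helper, and the `boot_epoch <= up_from` comparison are all assumptions made for illustration, chosen so that a boot carrying the same epoch as up_from is dropped, as the description says.

```python
from dataclasses import dataclass


@dataclass
class OsdInfo:
    up_from: int = 0   # epoch in which the osd was last marked up (0 = never)
    up: bool = False


class ToyMonitor:
    """Hypothetical stand-in for the mon; not the real OSDMonitor logic."""

    def __init__(self):
        self.epoch = 1
        self.osds = {}

    def mark_down(self, osd_id):
        # Assumed behaviour: marking an osd down publishes a new map epoch.
        self.epoch += 1
        self.osds[osd_id].up = False

    def handle_boot(self, osd_id, boot_epoch):
        info = self.osds.setdefault(osd_id, OsdInfo())
        # Assumed check: a boot that is not newer than up_from is treated
        # as stale and silently ignored.
        if info.up_from and boot_epoch <= info.up_from:
            return False
        self.epoch += 1
        info.up_from = self.epoch
        info.up = True
        return True


mon = ToyMonitor()
first = mon.handle_boot(0, boot_epoch=1)    # first start: accepted, up_from becomes 2
mon.mark_down(0)                            # osd is restarted again right away
second = mon.handle_boot(0, boot_epoch=2)   # boot epoch equals up_from: ignored
stuck = not mon.osds[0].up                  # osd never resends, so it stays down
```

In this model the second boot is dropped because its epoch equals `up_from`, and nothing ever marks the osd up again, which is the "limbo" state from the report.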
Harish and Sam, We got past this issue by changing ceph-ansible to not double start OSDs. https://bugzilla.redhat.com/show_bug.cgi?id=1394929 This fix is nice to have and as such I'm going to target it to 2.2 and fix the dependencies for the other BZs. cheers, G
Sam, Is this present in the latest RHCeph 2.2 build? i.e. can we move the state to ON_QA?
Verified this as part of the rolling_update tests in ceph 2.2, in build 10.2.5-26 on RHEL and 10.2.5-17 on Ubuntu. This issue is not seen anymore. Moving to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0514.html