This is an issue only related to this cluster, so moving out of 4.8 for investigation... A few observations:

1. It is unexpected that there are extra replicasets for mon-p and mon-q, but it shouldn't affect the health since they have no pods.

2. From the toolbox, what result do you get from removing mon.t manually? Since mons p and q still show in quorum, they should respond to toolbox updates. Something like this (I forget if it's "t" or "mon.t" for the name):

   ceph mon remove t

3. The rook operator log shows the following message, so it appears the new mon pod is not able to start as expected:

   op-mon: failed to failover mon "t". failed to place new mon on a node: assignmon: error scheduling monitor: sched-mon: canary pod scheduling failed retries

4. If you delete the rook-ceph-mon-t deployment, remove mon-t from the rook-ceph-mon-endpoints configmap, and restart the rook operator, I would expect that a new mon such as mon-y will be able to be scheduled on the node where t had been running (sketched below).
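A rough sketch of that cleanup, assuming the cluster runs in the openshift-storage namespace and the operator pod carries the usual app=rook-ceph-operator label; adjust names to your cluster:

   # Delete the stale mon deployment
   oc -n openshift-storage delete deployment rook-ceph-mon-t

   # Remove the t=<ip>:6789 entry from the "data" key of the endpoints configmap
   oc -n openshift-storage edit configmap rook-ceph-mon-endpoints

   # Restart the operator so it reconciles and schedules a replacement mon
   oc -n openshift-storage delete pod -l app=rook-ceph-operator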
1. What mons are in the rook-ceph-mon-endpoints configmap now?
2. What does "ceph mon dump" show in the toolbox?
3. Please share the latest rook operator log.
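If it helps, something like the following should collect all three, again assuming the openshift-storage namespace and the default toolbox deployment name and operator label:

   # 1. Mons currently listed in the endpoints configmap
   oc -n openshift-storage get configmap rook-ceph-mon-endpoints -o yaml

   # 2. Monmap as ceph sees it, via the toolbox
   oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph mon dump

   # 3. Latest operator log
   oc -n openshift-storage logs -l app=rook-ceph-operator --tail=-1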
Created attachment 1791872 [details] mon.ah verbose log
Manjunatha, yes, good point. It appears the clusterIP for mon-p in rook's configmap does not match the one expected in the ceph monmap. Please do try those other workaround steps to see if that fixes the situation.
Greg, what does this mean: "handle_auth_request failed to assign global_id"?

This BZ contains the same error: https://bugzilla.redhat.com/show_bug.cgi?id=1942142

Not sure what the root cause is, but this seems to be the issue... Thanks
(In reply to Sébastien Han from comment #17)
> Greg, what does this mean "handle_auth_request failed to assign global_id"?
> This BZ contains the same error
> https://bugzilla.redhat.com/show_bug.cgi?id=1942142
> Not sure what the root cause is but this seems to be the issue...

This log line is not a cause of the issue you're seeing, nor is the BZ referenced. handle_auth_request() is called when trying to authenticate connections, but until a monitor joins the cluster, the only connections it is able to authenticate are with other monitors using their shared secret. "failed to assign global id" is a generic message for any kind of failure; common ones include the monitor not having joined the cluster, or being out of quorum for too long. My guess is there's some monitoring process trying to poke at each individual mon triggering those lines, though I'm not certain.

I'd look to see if the in-quorum monitors are reacting at all to the probe messages, since there aren't any replies in the log snippets of #c18. It sounds like we've manually edited data which is used as input to the new mon, but not adjusted the live monmap, so my first guess would be that something there is still mismatched...
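One way to check whether the in-quorum mons are seeing the probes at all, just a sketch using mon "p" as an example of an in-quorum mon, is to temporarily bump its debug levels from the toolbox and watch its pod log for probe traffic:

   # Temporarily raise monitor and messenger debug levels on mon.p
   ceph tell mon.p injectargs '--debug_mon 10 --debug_ms 1'

   # Watch the mon-p pod log for incoming probe messages from the new mon
   oc -n openshift-storage logs -f deploy/rook-ceph-mon-p

   # Restore the defaults afterwards
   ceph tell mon.p injectargs '--debug_mon 1/5 --debug_ms 0'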
From the monmap, the expected mon-p ip is 172.30.104.245, but the service and the configmap for mon-p both point to 172.30.223.111. To fix this, it looks like you need to update the service and the configmap to correctly point to 172.30.104.245; otherwise, the new mon won't find the correct mons in quorum to connect to. Have you updated this already?
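Roughly, the repair could look like this (a sketch, assuming the openshift-storage namespace and the rook-ceph-mon-p service name). Note that a service's clusterIP is immutable, so the mon-p service has to be deleted and recreated with the old ip pinned, rather than patched in place:

   # In the "data" key, change p=172.30.223.111:6789 to p=172.30.104.245:6789
   oc -n openshift-storage edit configmap rook-ceph-mon-endpoints

   # Save the current mon-p service, then edit the copy:
   # set spec.clusterIP to 172.30.104.245 and drop resourceVersion/uid
   oc -n openshift-storage get service rook-ceph-mon-p -o yaml > mon-p-svc.yaml

   # Recreate the service with the pinned clusterIP
   oc -n openshift-storage delete service rook-ceph-mon-p
   oc -n openshift-storage create -f mon-p-svc.yaml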
In an older release, perhaps 4.5, there was a bug where, if the mon service was deleted, the operator would create a new service with a new clusterIP instead of reusing the same ip. This would cause issues with mon quorum since the wrong mon ip would be used. Now that you've fixed the mon ip, the quorum is able to become healthy again. The fix (I believe in 4.6) was that the service would be created again with the same clusterIP.
mon-p must have been originally assigned 172.30.104.245, and the service would have been initially created with that ip. It's difficult to say exactly what happened after that, but it seems Rook hit the bug where the service did not exist and re-created it with a different clusterIP, which then caused the quorum issues.
The fixed BZ was https://bugzilla.redhat.com/show_bug.cgi?id=1897029. It looks like the fix actually only landed in 4.7, so this would still repro in 4.6.