The reason we have the drain lock is explained here: https://bugzilla.redhat.com/show_bug.cgi?id=1960103 If you still managed to get the cluster into a bad state (e.g. by rebooting at some intermediate state), I presume this is some kind of race condition, so we need a reproducer that specifies exactly how to get into that state. Only then can we try to make the code more robust for that edge case.
@balazs I believe John shared the steps for reproducing earlier. Can you please give an update on next steps, or your thoughts on what the issue here might be?
@miltimilti a) We currently do not know _why_ we get into this state. If we can avoid it in the first place, the problem never arises. b) I agree that there is a way to get into this state, and once we are in it, we are stuck. We should be able to patch this up. My next step will be to provide a custom image that fixes b).