I think the first and simplest thing to try here is to manually apply the changes in https://github.com/openshift/machine-config-operator/pull/2741. Based on everything I've seen about this bug, I strongly suspect that it's triggered by ungraceful keepalived stops. Eliminating those stops won't fix the underlying problem, but it will prevent the bug from occurring under normal circumstances.