Maybe there is a race between creating the machinesets and the CVO creating the machineset CRD? What creates the machineset manifest 99_openshift-cluster-api_worker-machineset.yaml, and when? From what I have seen, it is always the first machineset in the list (zone *-a) that gets dropped.
I confirmed that there is no race between creating the machinesets and the CVO creating the machineset CRD. The machinesets are created from the bootstrap node by the "openshift" service, whose logic keeps retrying forever until success: https://github.com/openshift/installer/blame/master/data/data/bootstrap/files/usr/local/bin/openshift.sh#L22
I even confirmed this on a local cluster.
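For reference, a minimal sketch of that retry-forever shape (the function name and comments are mine, not the actual openshift.sh code):

```shell
#!/bin/bash
# Minimal sketch (not the actual openshift.sh code): keep re-running a
# command until it exits 0 -- the same "retry forever until success"
# shape as the manifest-apply loop on the bootstrap node.
apply_with_retry() {
    until "$@"; do
        echo "apply failed, retrying in 5s..." >&2
        sleep 5
    done
}

# On the bootstrap node this would be invoked roughly as:
#   apply_with_retry oc apply -f 99_openshift-cluster-api_worker-machineset.yaml
# Either the loop eventually applies all three machinesets, or it spins
# forever on the error; it never silently applies a subset and moves on.
```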
To summarize:
1. The machineset manifest, 99_openshift-cluster-api_worker-machineset.yaml, contains the correct 3 machinesets, even in the logs from runs where the failure occurred.
2. This manifest is applied on the bootstrap node after bootkube finishes. The apply logic either gets stuck retrying on an error or creates all three machinesets.
3. I could not find anything suspicious in the machineset controller. Once the machineset objects exist in the apiserver, the controller will definitely create the worker machines. There are also no controller logs suggesting that a machineset was deleted.
Then I checked this issue's current severity and impact in CI using Trevor King's scripts. I could not find any occurrence of this issue in the last 24 hours.
If anyone notices this issue again, please report it here. And if the environment is still available for debugging, that would be even better. For now, I don't have any idea what else to look for.
This is causing the test "[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Suite:openshift/conformance/parallel] [Suite:k8s]" to flake: the test sees 3 zones (counting both worker and master nodes), but it can only schedule pods onto the worker nodes, so it ends up missing one zone.
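To illustrate the zone accounting (the node names, zone names, and roles below are hypothetical sample data, not taken from the actual test): if zone *-a ends up with only a master node because its worker machineset is missing, the test counts 3 zones total, but worker pods can land in only 2 of them.

```shell
#!/bin/bash
# Hypothetical sample: zone us-east-1a has only a master node, so a
# replication controller's pods (which schedule only onto workers)
# can never cover it.
nodes="master-0 us-east-1a master
master-1 us-east-1b master
master-2 us-east-1c master
worker-1 us-east-1b worker
worker-2 us-east-1c worker"

declare -A all_zones worker_zones
while read -r name zone role; do
  all_zones["$zone"]=1
  if [ "$role" = worker ]; then
    worker_zones["$zone"]=1
  fi
done <<< "$nodes"

# The flake boils down to these two counts disagreeing: 3 zones seen
# across all nodes vs. 2 zones that worker pods can actually reach.
echo "zones on nodes: ${#all_zones[@]}"
echo "zones with workers: ${#worker_zones[@]}"
```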