+++ This bug was initially created as a clone of Bug #1690588 +++

Description of problem:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5872 came up with only 2 of the expected three compute nodes:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5872/artifacts/e2e-aws/nodes.json | jq '.items[] | {creationTimestamp: .metadata.creationTimestamp, name: .metadata.name, taints: .spec.taints}'
  {
    "creationTimestamp": "2019-03-19T14:59:18Z",
    "name": "ip-10-0-134-240.ec2.internal",
    "taints": [
      {
        "effect": "NoSchedule",
        "key": "node-role.kubernetes.io/master"
      }
    ]
  }
  {
    "creationTimestamp": "2019-03-19T15:05:26Z",
    "name": "ip-10-0-144-115.ec2.internal",
    "taints": null
  }
  {
    "creationTimestamp": "2019-03-19T14:59:39Z",
    "name": "ip-10-0-149-195.ec2.internal",
    "taints": [
      {
        "effect": "NoSchedule",
        "key": "node-role.kubernetes.io/master"
      }
    ]
  }
  {
    "creationTimestamp": "2019-03-19T14:59:53Z",
    "name": "ip-10-0-160-190.ec2.internal",
    "taints": [
      {
        "effect": "NoSchedule",
        "key": "node-role.kubernetes.io/master"
      }
    ]
  }
  {
    "creationTimestamp": "2019-03-19T15:05:24Z",
    "name": "ip-10-0-164-0.ec2.internal",
    "taints": null
  }

which surfaced as bug 1690588:

  [sig-storage] Dynamic Provisioning DynamicProvisioner should provision storage with different parameters [Suite:openshift/conformance/parallel] [Suite:k8s]

Looks like only two compute nodes show up in the manager's logs:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5872/artifacts/e2e-aws/pods/openshift-machine-api_clusterapi-manager-controllers-7b5f97bf7c-4gjkn_controller-manager.log.gz | gunzip | grep worker
  I0319 15:00:44.294956 1 controller.go:239] Too few replicas for machine.openshift.io/v1beta1, Kind=MachineSet openshift-machine-api/ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1b, need 1, creating 1
  I0319 15:00:44.313234 1 controller.go:239] Too few replicas for machine.openshift.io/v1beta1, Kind=MachineSet openshift-machine-api/ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1c, need 1, creating 1
  E0319 15:00:44.319071 1 controller.go:356] Machine.machine.openshift.io "ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1c-5w2dr" not found
  I0319 15:05:24.916312 1 node.go:81] Successfully linked machine ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1c-5w2dr to node ip-10-0-164-0.ec2.internal
  I0319 15:05:26.416656 1 node.go:81] Successfully linked machine ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1b-h5zl2 to node ip-10-0-144-115.ec2.internal
  I0319 15:05:55.250158 1 node.go:81] Successfully linked machine ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1c-5w2dr to node ip-10-0-164-0.ec2.internal
  I0319 15:05:56.453866 1 node.go:81] Successfully linked machine ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1b-h5zl2 to node ip-10-0-144-115.ec2.internal

The MachineSet YAML generated by the installer is in [1] (base64-encoded, shortly after /opt/openshift/openshift/99_openshift-cluster-api_worker-machineset.yaml), and it opens with:

  apiVersion: v1
  items:
  - apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: ci-op-y744jt0f-5a633-zfmjl
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
      name: ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1a
      namespace: openshift-machine-api
    spec:
      replicas: 1
  ...

so we do expect a compute node in us-east-1a. I don't see it here though:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5872/artifacts/e2e-aws/pods/openshift-machine-api_clusterapi-manager-controllers-7b5f97bf7c-4gjkn_machine-controller.log.gz | gunzip | grep 'idempotent create'
  I0319 15:02:19.309996 1 controller.go:258] Reconciling machine object ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1b-h5zl2 triggers idempotent create.
  I0319 15:02:21.266609 1 controller.go:258] Reconciling machine object ci-op-y744jt0f-5a633-zfmjl-worker-us-east-1c-5w2dr triggers idempotent create.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5872/artifacts/e2e-aws/installer/.openshift_install.log

*** Bug 1691892 has been marked as a duplicate of this bug. ***
wking, it seems this is just being consistent with the value set in the worker pool: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5872/artifacts/e2e-aws/machineconfigpools.json
(In reply to Alberto from comment #5) > wking it seems this is just being consistent with the value set in the > worker pool Who sets that up? The MCO called from the bootstrap machine [1]? [1]: https://github.com/openshift/installer/blob/48dbde17da9ecb28dcfaca284a25bc4d2b2f0302/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L168-L198
Maybe there is a race between creating the machineset and CVO creating the machineset CRD? What creates the machineset 99_openshift-cluster-api_worker-machineset.yaml and when? From what I have seen, it is always the first node in the list (zone *-a) that gets dropped.
I confirmed that there is no chance of a race condition between creating the machineset and the CVO creating the machineset CRD. The machineset gets created from the bootstrap node by the "openshift" service. This logic keeps retrying forever until it succeeds: https://github.com/openshift/installer/blame/master/data/data/bootstrap/files/usr/local/bin/openshift.sh#L22 I also confirmed this on a local cluster.

To summarize:

1. The machineset manifest, 99_openshift-cluster-api_worker-machineset.yaml, contains the correct three machinesets, even in the logs from failing runs.
2. This manifest is applied on the bootstrap node after bootkube is done. The logic for applying the manifest is such that it will either get stuck on an error or create all three machinesets.
3. I could not find anything suspicious in the machineset controller. Once the machineset objects exist in the apiserver, the controller will create the worker machines for sure. There are also no controller logs suggesting that a machineset was deleted.

I then checked this issue's current severity and impact on CI using Trevor King's scripts. In the last 24 hours, I could not find any occurrence of this issue. If anyone notices this issue again, please report it here. If the environment is also still available for debugging, that would be even better. For now, I don't have any idea what else to look for.
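
For readers who want to picture the "retry forever until success" behavior described in point 2, here is a minimal sketch. It is not the installer's actual openshift.sh; the function names retry_until_success and apply_manifests, the RETRY_DELAY variable, and the *.yaml glob are all hypothetical and only illustrate why the outcome is either "stuck on one manifest" or "every manifest applied".

```sh
#!/bin/sh
# Hypothetical sketch, NOT the real openshift.sh.

# retry_until_success CMD [ARGS...]
# Runs the command repeatedly until it exits 0. There is no retry limit,
# so a permanently failing command blocks here forever.
retry_until_success() {
  until "$@"; do
    echo "command failed, retrying: $*" >&2
    sleep "${RETRY_DELAY:-5}"
  done
}

# apply_manifests DIR CMD [ARGS...]
# Calls CMD once per *.yaml file in DIR, in glob (sorted) order, and does
# not move on to the next file until the current one has succeeded.
apply_manifests() {
  dir=$1
  shift
  for manifest in "${dir}"/*.yaml; do
    [ -e "${manifest}" ] || continue   # directory had no *.yaml files
    retry_until_success "$@" "${manifest}"
  done
}
```

Under these assumptions, something like `apply_manifests /opt/openshift/openshift oc create -f` would never skip a manifest: a multi-item list file such as the worker machineset manifest is handed to the command as a whole, and the loop only advances once that command reports success.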
This is causing the test "[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Suite:openshift/conformance/parallel] [Suite:k8s]" to flake, because the test sees that there are 3 zones (counting both worker and master nodes), but it can only manage to schedule pods onto the worker nodes, so it ends up missing one zone.
In case it helps, here are runs from the past 24 hours which don't have the expected six nodes:

  $ for NODES in $(find . -name nodes.json); do COUNT=$(jq '.items | length' "${NODES}" 2>/dev/null); if test -n "${COUNT}" && test "${COUNT}" -ne 6; then echo "${COUNT} $(jq -r .url "${NODES/nodes.json/job.json}")"; fi; done | sort
  0 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/11521/pull-ci-openshift-openshift-ansible-release-3.11-e2e-aws/1255
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22609/pull-ci-openshift-origin-master-e2e-aws/7497
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22609/pull-ci-openshift-origin-master-e2e-aws-serial/4976
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/407/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2044
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/407/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2046
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/407/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2049
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/407/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-operator/543
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/148/pull-ci-openshift-cluster-network-operator-master-e2e-aws/957
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/290/pull-ci-openshift-machine-api-operator-master-e2e-aws/939
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/290/pull-ci-openshift-machine-api-operator-master-e2e-aws/941
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/290/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/848
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/290/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/849
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/290/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/850
  3 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_release/3475/rehearse-3475-pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1
  5 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22602/pull-ci-openshift-origin-master-e2e-aws/7465
  5 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/7466
  5 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/57/pull-ci-openshift-cloud-credential-operator-master-e2e-aws/268
  5 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/257/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws/1299
  5 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-openshift-apiserver-operator/186/pull-ci-openshift-cluster-openshift-apiserver-operator-master-e2e-aws/857
  5 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_jenkins-client-plugin/259/pull-ci-openshift-jenkins-client-plugin-master-e2e-aws-jenkins/8
  5 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/151/pull-ci-operator-framework-operator-marketplace-master-e2e-aws/777
  5 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/151/pull-ci-operator-framework-operator-marketplace-master-e2e-aws-operator/495
  7 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/192/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/536
  7 https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/419/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2045

I'd expect the 5s are this issue.
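
To tell whether a five-node run is really "three masters plus two compute nodes" rather than some other shortfall, the total can be split by role. This is a rough sketch that identifies masters by the node-role.kubernetes.io/master taint key, as in the jq output in the description; the function name node_counts is hypothetical.

```sh
#!/bin/sh
# node_counts NODES_JSON
# Prints "masters=N compute=M", where a master is any node whose taints
# include the key node-role.kubernetes.io/master and every other node is
# counted as compute.
node_counts() {
  jq -r '
    [.items[]
     | (.spec.taints // [])
     | any(.key == "node-role.kubernetes.io/master")]
    | "masters=\(map(select(.)) | length) compute=\(map(select(not)) | length)"
  ' "$1"
}
```

A run affected by this bug would be expected to report masters=3 compute=2.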
We have not seen this issue for the last 4 months. If it reappears, please reopen this bug.