Maybe there is a race between creating the machinesets and the CVO creating the machineset CRD? What creates the machineset manifest 99_openshift-cluster-api_worker-machineset.yaml, and when? From what I have seen, it is always the first machineset in the list (zone *-a) that gets dropped.
I confirmed that there is no race between creating the machinesets and the CVO creating the machineset CRD. The machinesets are created from the bootstrap node by the "openshift" service, whose logic keeps retrying forever until success: https://github.com/openshift/installer/blame/master/data/data/bootstrap/files/usr/local/bin/openshift.sh#L22
I even confirmed this on a local cluster.
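For reference, a minimal sketch of that retry-forever shape (the function name and comments are mine, not the actual openshift.sh code):

```shell
#!/bin/bash
# Minimal sketch (not the actual openshift.sh code): keep re-running a
# command until it exits 0 -- the same "retry forever until success"
# shape as the manifest-apply loop on the bootstrap node.
apply_with_retry() {
    until "$@"; do
        echo "apply failed, retrying in 5s..." >&2
        sleep 5
    done
}

# On the bootstrap node this would be invoked roughly as:
#   apply_with_retry oc apply -f 99_openshift-cluster-api_worker-machineset.yaml
# Either the loop eventually applies all three machinesets, or it spins
# forever on the error; it never silently applies a subset and moves on.
```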
To summarize:
1. The machineset manifest, 99_openshift-cluster-api_worker-machineset.yaml, contains the correct 3 machinesets, even in the logs from runs where the failure occurred.
2. This manifest is applied on the bootstrap node after bootkube finishes. The apply logic either gets stuck retrying on an error or creates all three machinesets.
3. I could not find anything suspicious in the machineset controller. Once the machineset objects exist in the apiserver, the controller will definitely create the worker machines. There are also no controller logs suggesting that a machineset was deleted.
Then I checked this issue's current severity and impact in CI using Trevor King's scripts. I could not find any occurrence of this issue in the last 24 hours.
If anyone notices this issue again, please report it here. And if the environment is still available for debugging, that would be even better. For now, I don't have any idea what else to look for.
This is causing the test "[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Suite:openshift/conformance/parallel] [Suite:k8s]" to flake: the test sees 3 zones (counting both worker and master nodes), but it can only schedule pods onto the worker nodes, so it ends up missing one zone.
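To illustrate the zone accounting (the node names, zone names, and roles below are hypothetical sample data, not taken from the actual test): if zone *-a ends up with only a master node because its worker machineset is missing, the test counts 3 zones total, but worker pods can land in only 2 of them.

```shell
#!/bin/bash
# Hypothetical sample: zone us-east-1a has only a master node, so a
# replication controller's pods (which schedule only onto workers)
# can never cover it.
nodes="master-0 us-east-1a master
master-1 us-east-1b master
master-2 us-east-1c master
worker-1 us-east-1b worker
worker-2 us-east-1c worker"

declare -A all_zones worker_zones
while read -r name zone role; do
  all_zones["$zone"]=1
  if [ "$role" = worker ]; then
    worker_zones["$zone"]=1
  fi
done <<< "$nodes"

# The flake boils down to these two counts disagreeing: 3 zones seen
# across all nodes vs. 2 zones that worker pods can actually reach.
echo "zones on nodes: ${#all_zones[@]}"
echo "zones with workers: ${#worker_zones[@]}"
```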