Bug 1701804

Summary: node does not join OCP cluster after scaleup playbook finished successfully
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-ansible
Version: 4.1.0
Target Release: 4.1.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Assignee: Russell Teague <rteague>
Reporter: Weihua Meng <wmeng>
QA Contact: Weihua Meng <wmeng>
CC: gpei
Doc Type: No Doc Update
Type: Bug
Last Closed: 2019-06-04 10:47:50 UTC

Description Weihua Meng 2019-04-22 02:23:22 UTC
Description of problem:
node does not join OCP cluster after scaleup playbook finished successfully

Version-Release number of the following components:
openshift-ansible-4.1.0-201904201251.git.148.6de1227.el7.noarch

$ ansible --version
ansible 2.7.9
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/wmeng/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.6 (default, Mar 29 2019, 00:03:27) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

How reproducible:
Always

Steps to Reproduce:
1. Run the scaleup playbook to add a new RHEL node to the existing OCP4 cluster (a sketch of the inventory format is shown after these steps):
$ ansible-playbook -vvv -i ~/hosts playbooks/scaleup.yml
2. Check the cluster status after the playbook finishes successfully.
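For reference, the inventory passed to the scaleup playbook is roughly of this form; the hostname, remote user, and kubeconfig path below are placeholders, since the actual ~/hosts file was not attached to this bug:

# sketch only - values are placeholders
[all:vars]
ansible_user=ec2-user
ansible_become=True
openshift_kubeconfig_path="~/.kube/config"

# new RHEL hosts to add to the cluster go in this group
[new_workers]
new-rhel-node.example.com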

Actual results:
###  before scaleup:
$ oc get nodes
NAME                                                STATUS   ROLES    AGE   VERSION
ip-172-31-128-63.ap-northeast-1.compute.internal    Ready    worker   87m   v1.13.4+da48e8391
ip-172-31-130-205.ap-northeast-1.compute.internal   Ready    master   94m   v1.13.4+da48e8391
ip-172-31-149-11.ap-northeast-1.compute.internal    Ready    master   94m   v1.13.4+da48e8391
ip-172-31-154-138.ap-northeast-1.compute.internal   Ready    worker   87m   v1.13.4+da48e8391
ip-172-31-164-59.ap-northeast-1.compute.internal    Ready    worker   87m   v1.13.4+da48e8391
ip-172-31-169-48.ap-northeast-1.compute.internal    Ready    master   94m   v1.13.4+da48e8391

###  after scaleup
$ oc get nodes
NAME                                                STATUS   ROLES    AGE    VERSION
ip-172-31-128-63.ap-northeast-1.compute.internal    Ready    worker   178m   v1.13.4+da48e8391
ip-172-31-130-205.ap-northeast-1.compute.internal   Ready    master   3h5m   v1.13.4+da48e8391
ip-172-31-149-11.ap-northeast-1.compute.internal    Ready    master   3h5m   v1.13.4+da48e8391
ip-172-31-154-138.ap-northeast-1.compute.internal   Ready    worker   178m   v1.13.4+da48e8391
ip-172-31-164-59.ap-northeast-1.compute.internal    Ready    worker   178m   v1.13.4+da48e8391
ip-172-31-169-48.ap-northeast-1.compute.internal    Ready    master   3h5m   v1.13.4+da48e8391

Expected results:
The new RHEL node should show up in the cluster in the output of oc get nodes.

Comment 2 Russell Teague 2019-04-22 15:48:07 UTC
The kubelet is failing to start due to an AWS issue.


Apr 22 11:20:08 ip-172-31-12-224.ap-northeast-1.compute.internal hyperkube[20023]: F0422 11:20:08.216862   20023 server.go:264] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0f03fe4975e10d1fa: "error listing AWS instances: \"NoCredentialProviders: no valid providers in chain. Deprecated.\\n\\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors\""

Is this a valid instance in the region?  It could also be an IAM permissions issue in the region.
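One quick check (a suggestion, not something from the original report): on the node itself, the EC2 instance metadata service lists any attached instance profile, so an empty or 404 response here usually means no IAM role is attached, which would match the NoCredentialProviders error above:

$ curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
# prints the attached IAM role name if an instance profile is present;
# returns nothing / 404 if the instance has no IAM role attached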

Could you try again with a fresh node?  This one will not bootstrap due to expired certs.

Comment 3 Weihua Meng 2019-04-23 06:47:27 UTC
What kind of instance is a valid instance?

I tried again with another fresh node and got the same result.


Steps to create the host:

choose a RHEL 7 AMI in the same region as the existing cluster

launch it in the same VPC as the cluster, in a public subnet

use the same Security Group as the other worker nodes, and open the SSH port for Ansible

Comment 5 Russell Teague 2019-04-23 18:35:55 UTC
In AWS, the RHEL7 node being provisioned must have an IAM role attached with a policy that allows describing all EC2 instances.  A new role can be created, or you can use the role created by the installer, which will have the format <cluster_name>-xxxxx-worker-role.  This role allows the kubelet to query the AWS API.
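For illustration, a minimal sketch of the kind of policy this describes and how it could be attached with the AWS CLI; the file name, role name, and policy name are placeholders, not values from this cluster:

# describe-instances-policy.json - minimal permission for the kubelet's
# AWS cloud provider to look up its own instance
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:DescribeInstances"],
      "Resource": "*"
    }
  ]
}

$ aws iam put-role-policy --role-name <worker-role> --policy-name describe-instances --policy-document file://describe-instances-policy.json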

Comment 6 Weihua Meng 2019-04-24 04:25:27 UTC
The RHEL7 host joined the cluster successfully after the IAM role was assigned and the TAG was added.

$ oc get node -o wide
NAME                                                STATUS   ROLES    AGE     VERSION             INTERNAL-IP      EXTERNAL-IP     OS-IMAGE                                                   KERNEL-VERSION              CONTAINER-RUNTIME
ip-172-31-135-71.ap-northeast-1.compute.internal    Ready    worker   3h30m   v1.13.4+da48e8391   172.31.135.71    <none>          Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64        cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-172-31-143-235.ap-northeast-1.compute.internal   Ready    master   3h38m   v1.13.4+da48e8391   172.31.143.235   <none>          Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64        cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-172-31-147-96.ap-northeast-1.compute.internal    Ready    worker   3h30m   v1.13.4+da48e8391   172.31.147.96    <none>          Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64        cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-172-31-151-240.ap-northeast-1.compute.internal   Ready    master   3h38m   v1.13.4+da48e8391   172.31.151.240   <none>          Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64        cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-172-31-169-106.ap-northeast-1.compute.internal   Ready    worker   3h30m   v1.13.4+da48e8391   172.31.169.106   <none>          Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64        cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-172-31-175-155.ap-northeast-1.compute.internal   Ready    master   3h38m   v1.13.4+da48e8391   172.31.175.155   <none>          Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64        cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-172-31-29-93.ap-northeast-1.compute.internal     Ready    worker   52m     v1.13.4+8730f3882   172.31.29.93     46.51.238.198   Red Hat Enterprise Linux Server 7.6 (Maipo)                3.10.0-957.1.3.el7.x86_64   cri-o://1.13.6-1.dev.rhaos4.1.gitee2e748.el7-dev
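For reference (not part of the original comment), the "IAM assigned and TAG added" steps above correspond roughly to AWS CLI calls like these; the instance ID, the instance profile name, and the cluster infrastructure ID in the tag key are placeholders:

$ aws ec2 associate-iam-instance-profile --instance-id <instance-id> --iam-instance-profile Name=<cluster_name>-xxxxx-worker-profile
$ aws ec2 create-tags --resources <instance-id> --tags Key=kubernetes.io/cluster/<cluster_name>-xxxxx,Value=owned
# the tag key must match the cluster's kubernetes.io/cluster/<infra-id> tag;
# the value is "owned" or "shared" depending on who manages the instance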

Comment 8 errata-xmlrpc 2019-06-04 10:47:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758