Description of problem:
Upgrade of a RHEL node failed due to an incorrect nodename when the host's public DNS name is different from the OCP cluster nodename.

Version-Release number of the following components:
openshift-ansible-4.1.0-201904231432.git.150.9f73bcc.el7.noarch
ansible 2.7.9
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/wmeng/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.6 (default, Mar 29 2019, 00:03:27) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

How reproducible:
Always

Steps to Reproduce:
1. Join RHEL7 workers to the OCP4 cluster
2. Upgrade the RHEL7 nodes:
   $ ansible-playbook -i inventory/hosts playbooks/upgrade.yml

Actual results:
Failed.

TASK [Cordon node prior to upgrade] ********************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/upgrade.yml:23
Wednesday 24 April 2019  04:35:23 -0400 (0:00:00.048)       0:00:04.989 *******
Using module file /usr/local/lib/python3.6/site-packages/ansible/modules/commands/command.py
<localhost> ESTABLISH LOCAL CONNECTION FOR USER: wmeng
<localhost> EXEC /bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-eyzosaopjogyqgpnxgmjcnzsrnehtrum; /usr/bin/python3.6'"'"' && sleep 0'
failed: [ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com -> localhost] (item=ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com) => {
    "changed": true,
    "cmd": [
        "oc",
        "adm",
        "cordon",
        "ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com",
        "--config=/home/wmeng/.kube/config"
    ],
    "delta": "0:00:01.843968",
    "end": "2019-04-24 04:35:25.688872",
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm cordon ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com --config=/home/wmeng/.kube/config",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "item": "ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com",
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2019-04-24 04:35:23.844904",
    "stderr": "Error from server (NotFound): nodes \"ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com\" not found",
    "stderr_lines": [
        "Error from server (NotFound): nodes \"ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com\" not found"
    ],
    "stdout": "",
    "stdout_lines": []
}

PLAY RECAP *********************************************************************
ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com : ok=1    changed=0    unreachable=0    failed=1
localhost                                              : ok=0    changed=0    unreachable=0    failed=0

Expected results:
Upgrade succeeds.

Additional info:
[wmeng@preserve-slave-wmengbuilder1 ~]$ oc get node
NAME                                                STATUS   ROLES    AGE     VERSION
ip-172-31-135-71.ap-northeast-1.compute.internal    Ready    worker   8h      v1.13.4+da48e8391
ip-172-31-143-235.ap-northeast-1.compute.internal   Ready    master   8h      v1.13.4+da48e8391
ip-172-31-147-96.ap-northeast-1.compute.internal    Ready    worker   8h      v1.13.4+da48e8391
ip-172-31-151-240.ap-northeast-1.compute.internal   Ready    master   8h      v1.13.4+da48e8391
ip-172-31-169-106.ap-northeast-1.compute.internal   Ready    worker   8h      v1.13.4+da48e8391
ip-172-31-175-155.ap-northeast-1.compute.internal   Ready    master   8h      v1.13.4+da48e8391
ip-172-31-29-93.ap-northeast-1.compute.internal     Ready    worker   5h31m   v1.13.4+8730f3882

$ oc adm cordon ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com --config=/home/wmeng/.kube/config
Error from server (NotFound): nodes "ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com" not found

$ oc adm cordon ip-172-31-29-93.ap-northeast-1.compute.internal --config=/home/wmeng/.kube/config
node/ip-172-31-29-93.ap-northeast-1.compute.internal cordoned

[wmeng@preserve-slave-wmengbuilder1 openshift-ansible]$ oc get node
NAME                                                STATUS                     ROLES    AGE    VERSION
ip-172-31-135-71.ap-northeast-1.compute.internal    Ready                      worker   8h     v1.13.4+da48e8391
ip-172-31-143-235.ap-northeast-1.compute.internal   Ready                      master   8h     v1.13.4+da48e8391
ip-172-31-147-96.ap-northeast-1.compute.internal    Ready                      worker   8h     v1.13.4+da48e8391
ip-172-31-151-240.ap-northeast-1.compute.internal   Ready                      master   8h     v1.13.4+da48e8391
ip-172-31-169-106.ap-northeast-1.compute.internal   Ready                      worker   8h     v1.13.4+da48e8391
ip-172-31-175-155.ap-northeast-1.compute.internal   Ready                      master   8h     v1.13.4+da48e8391
ip-172-31-29-93.ap-northeast-1.compute.internal     Ready,SchedulingDisabled   worker   6h5m   v1.13.4+8730f3882
The Ansible inventory should be created with host names that the cluster knows about. Do not use public DNS host names for Ansible host names.
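For illustration, a minimal inventory sketch (the group name and variables here are illustrative, not the exact layout from this report; the key point is that each host entry matches the node name the cluster reports in oc get node):

[workers]
ip-172-31-29-93.ap-northeast-1.compute.internal

[all:vars]
ansible_user=root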
If the inventory does not use the public DNS host names, the Ansible control host may not be able to connect to the nodes:

fatal: [ip-172-31-29-93.ap-northeast-1.compute.internal]: UNREACHABLE! => {
    "changed": false,
    "msg": "SSH Error: data could not be sent to remote host \"ip-172-31-29-93.ap-northeast-1.compute.internal\". Make sure this host can be reached over ssh",
    "unreachable": true
}
Ensure hosts are resolvable/reachable from the Ansible control machine. One option to ensure access is to set up an ssh bastion host; an example can be found here: https://github.com/eparis/ssh-bastion. Ansible can be configured to use the ssh bastion host by setting this var in host_vars or group_vars:

---
ansible_ssh_common_args: "-o ProxyCommand=\"ssh -o IdentityFile='/path/to/libra.pem' -o StrictHostKeyChecking=no -W %h:%p -q <username>@<ssh_bastion_hostname>\""
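As a quick connectivity check before re-running the upgrade playbook (a sketch; the inventory path and group name are placeholders for whatever the environment actually uses):

# assumes ansible_ssh_common_args is set in host_vars/ or group_vars/ as above
$ ansible -i inventory/hosts workers -m ping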
The bastion works well for ansible-playbook in this case. With ansible_ssh_common_args set in the Ansible inventory file, playbooks/upgrade.yml finished successfully, and the RHEL workers are working well after the upgrade.

ansible_ssh_common_args="-o ProxyCommand=\"ssh -o IdentityFile='/path/to/libra.pem' -o StrictHostKeyChecking=no -W %h:%p -q core@<bastion_hostname>\""

One more thing I want to confirm: when we want to manually upgrade a 4.1 cluster with RHEL 7.6 and RHCOS workers to a newer version, should we run the playbooks/upgrade.yml playbook against the RHEL workers before or after the cluster upgrade (oc adm upgrade)? Thanks.
RHEL workers should be upgraded after the cluster is upgraded. The RHEL upgrade playbook installs the latest available package versions of cri-o, openshift-clients, and openshift-hyperkube, but pulls images based on the cluster version.
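For example, the overall order would look like this (a sketch; the target version is a placeholder):

# 1. Upgrade the cluster itself first
$ oc adm upgrade --to=<target-version>
$ oc get clusterversion        # wait until the new version is reported

# 2. Then upgrade the RHEL workers
$ ansible-playbook -i inventory/hosts playbooks/upgrade.yml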
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758