Bug 1702623

Summary: Upgrade RHEL node failed due to incorrect nodename
Product: OpenShift Container Platform
Component: Installer
Sub component: openshift-ansible
Version: 4.1.0
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Weihua Meng <wmeng>
Assignee: Russell Teague <rteague>
QA Contact: Weihua Meng <wmeng>
CC: gpei, rteague
Doc Type: No Doc Update
Type: Bug
Last Closed: 2019-06-04 10:47:56 UTC

Description Weihua Meng 2019-04-24 09:48:25 UTC
Description of problem:
Upgrading a RHEL node fails due to an incorrect node name
when the host's public DNS name is different from the OCP cluster node name.

Version-Release number of the following components:
openshift-ansible-4.1.0-201904231432.git.150.9f73bcc.el7.noarch

ansible 2.7.9
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/wmeng/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.6 (default, Mar 29 2019, 00:03:27) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

How reproducible:
Always

Steps to Reproduce:
1. Join RHEL7 workers to an OCP4 cluster (an example inventory sketch follows below)
2. Upgrade the RHEL7 nodes:
$ ansible-playbook -i inventory/hosts playbooks/upgrade.yml
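
For context, a minimal sketch of the inventory shape that reproduces this. The [workers] group name and the openshift_kubeconfig_path variable follow the openshift-ansible 4.x RHEL worker documentation and are assumptions here; the host name and kubeconfig path are taken from the failing task output below.

[all:vars]
# kubeconfig path matches the --config value used by the cordon task
openshift_kubeconfig_path=/home/wmeng/.kube/config

[workers]
# public EC2 DNS name used as the Ansible host name; this is what later
# gets passed to 'oc adm cordon' and does not match the cluster node name
ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com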

Actual results:
failed.

TASK [Cordon node prior to upgrade] *******************************************************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/upgrade.yml:23
Wednesday 24 April 2019  04:35:23 -0400 (0:00:00.048)       0:00:04.989 ******* 
Using module file /usr/local/lib/python3.6/site-packages/ansible/modules/commands/command.py
<localhost> ESTABLISH LOCAL CONNECTION FOR USER: wmeng
<localhost> EXEC /bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-eyzosaopjogyqgpnxgmjcnzsrnehtrum; /usr/bin/python3.6'"'"' && sleep 0'
failed: [ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com -> localhost] (item=ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com) => {
    "changed": true,
    "cmd": [
        "oc",
        "adm",
        "cordon",
        "ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com",
        "--config=/home/wmeng/.kube/config"
    ],
    "delta": "0:00:01.843968",
    "end": "2019-04-24 04:35:25.688872",
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm cordon ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com --config=/home/wmeng/.kube/config",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "item": "ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com",
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2019-04-24 04:35:23.844904",
    "stderr": "Error from server (NotFound): nodes \"ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com\" not found",
    "stderr_lines": [
        "Error from server (NotFound): nodes \"ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com\" not found"
    ],
    "stdout": "",
    "stdout_lines": []
}

PLAY RECAP ********************************************************************************************************************************************************************************************************
ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com : ok=1    changed=0    unreachable=0    failed=1   
localhost                  : ok=0    changed=0    unreachable=0    failed=0   


Expected results:
upgrade success

Additional info:
[wmeng@preserve-slave-wmengbuilder1 ~]$ oc get node
NAME                                                STATUS   ROLES    AGE     VERSION
ip-172-31-135-71.ap-northeast-1.compute.internal    Ready    worker   8h      v1.13.4+da48e8391
ip-172-31-143-235.ap-northeast-1.compute.internal   Ready    master   8h      v1.13.4+da48e8391
ip-172-31-147-96.ap-northeast-1.compute.internal    Ready    worker   8h      v1.13.4+da48e8391
ip-172-31-151-240.ap-northeast-1.compute.internal   Ready    master   8h      v1.13.4+da48e8391
ip-172-31-169-106.ap-northeast-1.compute.internal   Ready    worker   8h      v1.13.4+da48e8391
ip-172-31-175-155.ap-northeast-1.compute.internal   Ready    master   8h      v1.13.4+da48e8391
ip-172-31-29-93.ap-northeast-1.compute.internal     Ready    worker   5h31m   v1.13.4+8730f3882

$ oc adm cordon ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com --config=/home/wmeng/.kube/config
Error from server (NotFound): nodes "ec2-46-51-238-198.ap-northeast-1.compute.amazonaws.com" not found

$ oc adm cordon ip-172-31-29-93.ap-northeast-1.compute.internal --config=/home/wmeng/.kube/config
node/ip-172-31-29-93.ap-northeast-1.compute.internal cordoned
[wmeng@preserve-slave-wmengbuilder1 openshift-ansible]$ oc get node
NAME                                                STATUS                     ROLES    AGE    VERSION
ip-172-31-135-71.ap-northeast-1.compute.internal    Ready                      worker   8h     v1.13.4+da48e8391
ip-172-31-143-235.ap-northeast-1.compute.internal   Ready                      master   8h     v1.13.4+da48e8391
ip-172-31-147-96.ap-northeast-1.compute.internal    Ready                      worker   8h     v1.13.4+da48e8391
ip-172-31-151-240.ap-northeast-1.compute.internal   Ready                      master   8h     v1.13.4+da48e8391
ip-172-31-169-106.ap-northeast-1.compute.internal   Ready                      worker   8h     v1.13.4+da48e8391
ip-172-31-175-155.ap-northeast-1.compute.internal   Ready                      master   8h     v1.13.4+da48e8391
ip-172-31-29-93.ap-northeast-1.compute.internal     Ready,SchedulingDisabled   worker   6h5m   v1.13.4+8730f3882

Comment 1 Russell Teague 2019-04-24 12:54:15 UTC
The Ansible inventory should be created with host names that the cluster knows about.  Do not use public DNS host names for Ansible host names.
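
For illustration, the same inventory sketched with the Ansible host name matching what the cluster reports in 'oc get node' (group name as above, still an assumption):

[workers]
# node name the API server actually knows about (see 'oc get node' output)
ip-172-31-29-93.ap-northeast-1.compute.internal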

Comment 2 Weihua Meng 2019-04-25 01:24:41 UTC
If public DNS host names are not used, the Ansible control host may not be able to connect to the nodes.

fatal: [ip-172-31-29-93.ap-northeast-1.compute.internal]: UNREACHABLE! => {
    "changed": false,
    "msg": "SSH Error: data could not be sent to remote host \"ip-172-31-29-93.ap-northeast-1.compute.internal\". Make sure this host can be reached over ssh",
    "unreachable": true

Comment 3 Russell Teague 2019-04-25 12:24:52 UTC
Ensure hosts are resolvable/reachable from the Ansible control machine.  One option to ensure access is to set up an ssh bastion host.  An example can be found here: https://github.com/eparis/ssh-bastion.

Ansible can be configured to use the ssh bastion host by setting this var in host_vars or group_vars:

---
ansible_ssh_common_args: "-o ProxyCommand=\"ssh -o IdentityFile='/path/to/libra.pem' -o StrictHostKeyChecking=no -W %h:%p -q <username>@<ssh_bastion_hostname>\""

Comment 4 Gaoyun Pei 2019-04-30 03:20:59 UTC
The bastion works well for ansible-playbook in this case.

With ansible_ssh_common_args set in the Ansible inventory file, playbooks/upgrade.yml finished successfully and the RHEL workers are working well after the upgrade.
ansible_ssh_common_args="-o ProxyCommand=\"ssh -o IdentityFile='/path/to/libra.pem' -o StrictHostKeyChecking=no -W %h:%p -q core@<bastion_hostname>\""


One more thing to confirm:
When upgrading a 4.1 cluster with RHEL 7.6 and RHCOS workers to a newer version manually, should we run the "playbooks/upgrade.yml" playbook against the RHEL workers before or after the cluster upgrade (oc adm upgrade)? Thanks.

Comment 5 Russell Teague 2019-04-30 12:17:07 UTC
RHEL workers should be upgraded after the cluster is upgraded.  The RHEL upgrade playbook installs the latest available package versions of cri-o, openshift-clients, and openshift-hyperkube, but pulls images based on the cluster version.
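
In other words, the manual flow is roughly the following; the exact 'oc adm upgrade' flags depend on the target version and are only sketched here:

# 1. Upgrade the cluster first (RHCOS nodes are updated by the cluster)
$ oc adm upgrade --to=<target-version>

# 2. After the cluster upgrade completes, upgrade the RHEL workers
$ ansible-playbook -i inventory/hosts playbooks/upgrade.yml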

Comment 7 errata-xmlrpc 2019-06-04 10:47:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758