Bug 1419465 - upgrade fails when the OpenStack internal name differs from openshift_public_hostname
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Jan Chaloupka
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-06 09:10 UTC by Anping Li
Modified: 2017-08-24 20:49 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-24 20:49:36 UTC
Target Upstream Version:
Embargoed:



Description Anping Li 2017-02-06 09:10:30 UTC
Description of problem:
When the OpenStack instance internal name is different from openshift_public_hostname, the upgrade fails at the task 'Determine if node is currently scheduleable'.
The upgrade playbook uses the OpenStack internal name (you can find it via 'curl 169.254.169.254/2009-04-04/meta-data//hostname'), while OpenShift is using openshift_public_hostname.
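A minimal diagnostic sketch of the mismatch, assuming the standard [nodes] inventory group and the EC2-compatible metadata path mentioned above (illustrative only, not part of the upgrade playbook):

```yaml
# Compare the name reported by the metadata service with the
# openshift_public_hostname set in the inventory.
- hosts: nodes
  tasks:
    - name: Read the instance name from the metadata service
      uri:
        url: http://169.254.169.254/2009-04-04/meta-data/hostname
        return_content: yes
      register: metadata_hostname

    - name: Show both names side by side
      debug:
        msg: >-
          metadata reports '{{ metadata_hostname.content | trim }}',
          openshift_public_hostname is '{{ openshift_public_hostname | default('unset') }}'
```

When the two values differ, the upgrade fails as shown below: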

TASK [Determine if node is currently scheduleable] *****************************
fatal: [openshift-228.example.com -> openshift-228.example.com]: FAILED! => {
    "changed": false,
    "cmd": [
        "/usr/local/bin/oc",
        "get",
        "node",
        "qe-11329master-1",
        "-o",
        "json"
    ],
    "delta": "0:00:00.410114",
    "end": "2017-02-06 02:24:57.030521",
    "failed": true,
    "rc": 1,
    "start": "2017-02-06 02:24:56.620407",
    "warnings": []
}

STDERR:

Error from server (NotFound): nodes "qe-11329master-1" not found

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.retry



Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.5.3-1.git.0.80c2436.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Launch instances with the names 'qe-11329master-1', 'qe-11329etcd-1', and 'qe-11329node-registry-router-1'.

2. Reset the instance names to openshift-202.example.com, openshift-211.example.com, and openshift-228.example.com (to work around BZ#1367201).

3. Install OpenShift v3.4 with the cloud provider enabled and openshift_public_hostname specified (see the inventory sketch after these steps).
4. Upgrade OpenShift to v3.5.
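For step 3, the relevant inventory settings look roughly like the sketch below (the variable names are the standard openshift-ansible ones; the values are illustrative, taken from the hostnames in this report rather than from the actual inventory):

```yaml
# Illustrative inventory variables, not the reporter's actual inventory.
openshift_cloudprovider_kind: openstack               # enable the OpenStack cloud provider
openshift_public_hostname: openshift-228.example.com  # differs from the OpenStack instance name (qe-11329master-1)
```

The upgrade in step 4 then fails with the same error as in the description: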

TASK [Determine if node is currently scheduleable] *****************************
fatal: [openshift-228.example.com -> openshift-228.example.com]: FAILED! => {
    "changed": false,
    "cmd": [
        "/usr/local/bin/oc",
        "get",
        "node",
        "qe-11329master-1",
        "-o",
        "json"
    ],
    "delta": "0:00:00.410114",
    "end": "2017-02-06 02:24:57.030521",
    "failed": true,
    "rc": 1,
    "start": "2017-02-06 02:24:56.620407",
    "warnings": []
}

STDERR:

Error from server (NotFound): nodes "qe-11329master-1" not found

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.retry

PLAY RECAP *********************************************************************
localhost                  : ok=35   changed=0    unreachable=0    failed=0
openshift-202.example.com : ok=109  changed=6    unreachable=0    failed=0
openshift-211.example.com : ok=63   changed=2    unreachable=0    failed=0
openshift-228.example.com : ok=171  changed=15   unreachable=0    failed=1

Expected results:
1. The upgrade playbook should use the same node name that OpenShift uses when determining whether the node is currently schedulable.

Comment 1 Jason DeTiberus 2017-02-07 19:23:35 UTC
Instead of querying the metadata directly, I would expect this to work using openshift.common.hostname or openshift.node.nodename.

Comment 2 Jan Chaloupka 2017-02-10 12:26:29 UTC
Looking at it now

Comment 3 Anping Li 2017-02-15 03:06:58 UTC
This will block testing on OpenStack when the cloud provider is in use.

Comment 4 Jan Chaloupka 2017-02-16 13:45:27 UTC
Currently, the affected task is in the following two files:
- playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml
- playbooks/common/openshift-cluster/upgrades/upgrade_control_plane.yml 

as:

```yaml
  - name: Mark node unschedulable
    oadm_manage_node:
      node: "{{ openshift.node.nodename | lower }}"
      schedulable: False
    delegate_to: "{{ groups.oo_first_master.0 }}"
    retries: 10
    delay: 5
    register: node_unschedulable
    until: node_unschedulable|succeeded
```

``openshift.node.nodename`` is used in multiple places throughout these files.

Comment 13 Jan Chaloupka 2017-02-22 17:22:27 UTC
The problem here is that http://169.254.169.254/openstack/latest/meta_data.json no longer provides a valid hostname for a VM once the hostname is changed.

I would suggest changing openshift.node.nodename to openshift.common.hostname once the "Set hostname" task in roles/openshift_common/tasks/main.yml has run, in the case where openshift[_public]_hostname is set.

I am not sure which of the openshift[_public]_hostname variables is used to set openshift.common.hostname, but that should not be hard to determine from the Ansible code.
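As an illustration of that suggestion (a sketch only, not a merged patch), the task quoted in comment 4 would change roughly like this:

```yaml
  - name: Mark node unschedulable
    oadm_manage_node:
      node: "{{ openshift.common.hostname | lower }}"   # was: openshift.node.nodename | lower
      schedulable: False
    delegate_to: "{{ groups.oo_first_master.0 }}"
    retries: 10
    delay: 5
    register: node_unschedulable
    until: node_unschedulable|succeeded
```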

Comment 14 Anping Li 2017-02-24 10:32:25 UTC
I found that if we modify the hostname, enable the cloud provider, and then run the upgrade playbook, the upgrade passes without error, so I am removing the test blocker.

Comment 15 Jan Chaloupka 2017-02-24 12:39:13 UTC
In short, what is happening here:
---
When a VM is provisioned in OpenStack, it is given a name.
At the same time, the VM's hostname is set according to that name (e.g. hostname-test is translated into hostname-test.localdomain).
The VM can access both the name and the hostname via OpenStack's metadata at http://169.254.169.254/openstack/latest/meta_data.json.

When the VM's name is changed, the hostname is not affected. The new name is updated in OpenStack's metadata and is available through the same link.

When the VM's hostname is changed inside the VM (e.g. by executing ``hostnamectl set-hostname``), the hostname is changed but the VM's metadata is not affected. When the VM is restarted, the hostname is reset back to its original value (the one set in the VM's metadata). The reset is done by cloud-init. In order to disable the hostname reset on each reboot, one has to update cloud-init's configuration, e.g. by dropping a file under the /etc/cloud/cloud.cfg.d directory:
# cat /etc/cloud/cloud.cfg.d/99_hostname.cfg
preserve_hostname: true

Then the hostname is preserved across reboots. Still, the hostname in the VM's metadata is not affected and keeps the value it had when the VM was created.
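A minimal sketch of doing the same thing from Ansible, assuming you want to drop that file on each node (the path and content are the ones shown above):

```yaml
- name: Preserve the manually set hostname across reboots
  copy:
    dest: /etc/cloud/cloud.cfg.d/99_hostname.cfg
    content: "preserve_hostname: true\n"
```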
---

> I found that if we modify the hostname, enable the cloud provider, and then run the upgrade playbook, the upgrade passes without error, so I am removing the test blocker.

Given that, I am decreasing the severity to low.

Anping, can you elaborate on what you did? E.g. provide the updated inventory file and the steps to reproduce your result.

