Bug 1649074 - Node label `type=upgrade` is ignored when upgrading OCP
Summary: Node label `type=upgrade` is ignored when upgrading OCP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: aos-install
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Duplicates: 1651224
Depends On:
Blocks: 1655674
 
Reported: 2018-11-12 21:06 UTC by Greg Rodriguez II
Modified: 2024-03-25 15:09 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1655674
Environment:
Last Closed: 2019-01-10 09:04:12 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:0024 (last updated 2019-01-10 09:05:49 UTC)

Internal Links: 1860906

Description Greg Rodriguez II 2018-11-12 21:06:03 UTC
Description of problem:
Customer reports that the upgrade playbooks ignore the node label `type=upgrade` applied to an infra node when upgrading from 3.10.14 to 3.10.45, and again when upgrading from 3.10.45 to 3.11, even though the upgrade is run with `-e openshift_upgrade_nodes_label="type=upgrade"`.  All nodes are upgraded, not only the `type=upgrade` nodes as expected.

Version-Release number of the following components:
$ ansible --version
ansible 2.6.7
  config file = /home/ocpdeploy/openshift-ansible-unix/ansible.cfg
  configured module search path = [u'/home/ocpdeploy/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ocpdeploy/virtualenv/ansible-2.6.7/lib/python2.7/site-packages/ansible
  executable location = /home/ocpdeploy/virtualenv/ansible-2.6.7/bin/ansible
  python version = 2.7.5 (default, May 31 2018, 09:41:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:
Customer verified the issue occurs when upgrading from 3.10.14 to 3.10.45 and from 3.10.45 to 3.11; they are hesitant to upgrade to 3.11 because of it.

Steps to Reproduce:
1.  Add the `type=upgrade` label to the node(s) to be upgraded
2.  Run the upgrade playbook with `-e openshift_upgrade_nodes_label="type=upgrade"`
3.  All nodes get upgraded, not just the nodes labelled `type=upgrade`

Actual results:
Customer was not able to provide an Ansible log for this issue, as it was not recorded at the time of the upgrade.  Since this is a GlusterFS site, running the upgrade again just to obtain logs was not advised.

Expected results:
Only the nodes labelled `type=upgrade` should have been upgraded; instead, all nodes were upgraded.

Comment 3 Greg Rodriguez II 2018-11-12 21:08:12 UTC
Added inventory file and output of the following to private comments: 

$ ansible --version

$ oc get nodes --show-labels

Comment 4 Scott Dodson 2018-11-13 21:31:04 UTC
Are you certain that the label is actually applied before the upgrade starts? The output above does not show it. Also, based on their inventory, setting a label of 'type=upgrade' seems ill-advised, as it will override existing labels defined in their node groups.

Comment 5 Greg Rodriguez II 2018-11-14 20:27:42 UTC
Scott, here is the response I got back from the customer regarding verification of the above:

~~~

The labels changed since that time. I overrode the label of type=physical with type=upgrade while I performed the upgrade. Since it didn't work and all my nodes were upgraded, I changed the label back. 

I did follow the upgrade instructions you posted. 

Regarding overriding existing labels: I agree, but currently we are not leveraging the type=physical label so overriding didn't make a material difference. The label should be arbitrary so I should be able to use foo=bar if I wanted to, correct?

~~~

The upgrade instructions being referred to are the docs [1].

Please let me know if we need anything else.

[1]  https://docs.openshift.com/container-platform/3.11/upgrading/automated_upgrades.html#special-considerations-for-glusterfs

Comment 6 mforbush 2018-11-16 18:45:38 UTC
Spoke with the customer. They are asking if we have any workarounds in the interim. They said they're happy to try different approaches if we have suggestions, but if we believe this is simply a bug that will be fixed later, additional information about it would be appreciated. This is a blocker to their upgrade.

Comment 7 Scott Dodson 2018-11-19 13:26:19 UTC
*** Bug 1651224 has been marked as a duplicate of this bug. ***

Comment 8 Abhishek 2018-11-19 13:35:18 UTC
While upgrading an OCP cluster from 3.10 to 3.11 based on labels, the playbook skipped the label match and upgraded all the nodes. The output of the `/usr/bin/oc get node --selector=<key>=<value> -o json -n default` command gives short hostnames, so the match against the FQDNs mentioned in the inventory file fails.

Do we have any workaround for this issue?
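
(For context, the retrieval step quoted above is roughly of the following shape. This is a simplified sketch reconstructed from the command in this comment, not the verbatim playbook task; the register name nodes_to_upgrade and the exact task structure are assumptions.)

- name: Retrieve list of openshift nodes matching upgrade label
  # The metadata.name values in this JSON are the registered node names,
  # which in the reported environments are short hostnames rather than FQDNs.
  command: >
    /usr/bin/oc get node
    --selector={{ openshift_upgrade_nodes_label }}
    -o json -n default
  register: nodes_to_upgrade
  changed_when: false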

Comment 9 Greg Rodriguez II 2018-11-19 20:01:55 UTC
Customer provided the following update and workaround for Engineering review:

~~~

Looking at this a bit deeper it appears to be happening in the task "Map labelled nodes to inventory hosts" in playbooks/common/openshift-cluster/upgrades/initialize_nodes_to_upgrade.yml. That task uses the variable hostvars[item].openshift.common.hostname which gets set in the openshift_facts module as the output of "hostname -f". Unfortunately nodes are not listed in 'oc get nodes' using their FQDN so the match never succeeds. Even when the server hostnames matched those in the inventory exactly, it would always skip all hosts, resulting in each node being upgraded:

TASK [Map labelled nodes to inventory hosts] ******************************************************************************************************************
skipping: [master02] => (item=node02)
skipping: [master02] => (item=node03)
skipping: [master02] => (item=node01)
skipping: [master02] => (item=master01)
skipping: [master02] => (item=master02)
skipping: [master02] => (item=master03)

I changed that task to use the variable hostvars[item].openshift.common.raw_hostname instead, which is set in openshift_facts from the output of command "hostname", and that finally selected the single node during this task, and resulted in only that node being upgraded:

TASK [Map labelled nodes to inventory hosts] ******************************************************************************************************************
skipping: [master02] => (item=node02)
skipping: [master02] => (item=node03)
ok: [master02] => (item=node01)
skipping: [master02] => (item=master01)
skipping: [master02] => (item=master02)
skipping: [master02] => (item=master03)

In my test environment I am now able to step through these node upgrades (in the output below, only node01 has been upgraded thus far):
$ oc get nodes
NAME       STATUS    ROLES     AGE       VERSION
master01   Ready     master    6d        v1.11.0+d4cacc0
master02   Ready     master    6d        v1.11.0+d4cacc0
master03   Ready     master    6d        v1.11.0+d4cacc0
node01     Ready     infra     6d        v1.11.0+d4cacc0
node02     Ready     infra     6d        v1.10.0+b81c8f8
node03     Ready     compute   6d        v1.10.0+b81c8f8

~~~
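
(For reference, the mapping task described above is roughly of the following shape. This is a simplified sketch based on the customer's description, not the verbatim contents of initialize_nodes_to_upgrade.yml; the nodes_to_upgrade register is carried over from the retrieval sketch above, and the oo_* group names are assumptions.)

- name: Map labelled nodes to inventory hosts
  # Adds an inventory host to the group of nodes to upgrade only when the
  # gathered FQDN fact matches one of the node names returned by `oc get node`.
  # When the cluster registers nodes by short name, this test never succeeds,
  # so no hosts are mapped and, as reported above, all nodes end up upgraded.
  add_host:
    name: "{{ item }}"
    groups: oo_nodes_to_upgrade
  changed_when: false
  with_items: "{{ groups.oo_nodes_to_config | default([]) }}"
  when: >-
    hostvars[item].openshift.common.hostname in
    ((nodes_to_upgrade.stdout | from_json)['items']
     | map(attribute='metadata.name') | list)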

Comment 10 Michael Gugino 2018-11-30 13:47:20 UTC
The correct value should be openshift.node.nodename. Will get a patch out for this in 3.11 and most likely backport it to 3.10.
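
(Under the sketch in the previous comment, the change described here amounts to comparing the registered node name fact instead of the FQDN fact, roughly:)

  # openshift.node.nodename holds the name the node is registered with in the
  # API, so it lines up with the metadata.name values from `oc get node`.
  when: >-
    hostvars[item].openshift.node.nodename in
    ((nodes_to_upgrade.stdout | from_json)['items']
     | map(attribute='metadata.name') | list)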

Comment 11 Michael Gugino 2018-12-03 16:36:27 UTC
PR created in 3.11: https://github.com/openshift/openshift-ansible/pull/10809

Comment 12 Scott Dodson 2018-12-12 19:22:15 UTC
In openshift-ansible-3.11.55-1

Comment 13 Gaoyun Pei 2018-12-19 07:09:51 UTC
Verified this bug with openshift-ansible-3.11.58-1.git.0.ce7e387.el7.noarch.

[root@qe-gpei-3101node-2 ~]# hostname -f
qe-gpei-3101node-2.int.1219-s4p.qe.rhcloud.com

Add label "type=upgrade" to node 'qe-gpei-3101node-2'
[root@qe-gpei-3101master-etcd-1 ~]# oc label node qe-gpei-3101node-2 type=upgrade
node "qe-gpei-3101node-2" labeled


ansible-playbook -i 310 /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.yml -e openshift_upgrade_nodes_label="type=upgrade"

<-snip->
TASK [Retrieve list of openshift nodes matching upgrade label] **************************************************************************************************************
ok: [host-8-251-116.host.centralci.eng.rdu2.redhat.com]

TASK [Fail if no nodes match openshift_upgrade_nodes_label] *****************************************************************************************************************
skipping: [host-8-251-116.host.centralci.eng.rdu2.redhat.com]

TASK [Map labelled nodes to inventory hosts] ********************************************************************************************************************************
skipping: [host-8-251-116.host.centralci.eng.rdu2.redhat.com] => (item=host-8-251-116.host.centralci.eng.rdu2.redhat.com)
skipping: [host-8-251-116.host.centralci.eng.rdu2.redhat.com] => (item=host-8-252-248.host.centralci.eng.rdu2.redhat.com)
skipping: [host-8-251-116.host.centralci.eng.rdu2.redhat.com] => (item=host-8-249-250.host.centralci.eng.rdu2.redhat.com)
ok: [host-8-251-116.host.centralci.eng.rdu2.redhat.com] => (item=host-8-250-227.host.centralci.eng.rdu2.redhat.com)
<-snip->


[root@qe-gpei-3101master-etcd-1 ~]# oc get node 
NAME                                 STATUS    ROLES     AGE       VERSION
qe-gpei-3101master-etcd-1            Ready     master    1h        v1.11.0+d4cacc0
qe-gpei-3101node-1                   Ready     compute   1h        v1.10.0+b81c8f8
qe-gpei-3101node-2                   Ready     compute   1h        v1.11.0+d4cacc0
qe-gpei-3101node-registry-router-1   Ready     <none>    1h        v1.10.0+b81c8f8

Only the node qe-gpei-3101node-2 got upgraded.

Comment 15 errata-xmlrpc 2019-01-10 09:04:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024

