Description of problem:

Customer reports that node upgrades ignore the node label selector when running with `-e openshift_upgrade_nodes_label="type=upgrade"` after labeling an infra node with `type=upgrade`. This happened when upgrading from 3.10.14 to 3.10.45, and again when upgrading from 3.10.45 to 3.11. The upgrade runs on all nodes rather than only the `type=upgrade`-labeled nodes as expected.

Version-Release number of the following components:

$ ansible --version
ansible 2.6.7
  config file = /home/ocpdeploy/openshift-ansible-unix/ansible.cfg
  configured module search path = [u'/home/ocpdeploy/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ocpdeploy/virtualenv/ansible-2.6.7/lib/python2.7/site-packages/ansible
  executable location = /home/ocpdeploy/virtualenv/ansible-2.6.7/bin/ansible
  python version = 2.7.5 (default, May 31 2018, 09:41:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:

Customer verified this occurs when upgrading from 3.10.14 to 3.10.45 and from 3.10.45 to 3.11; they are hesitant to upgrade to 3.11 because of it.

Steps to Reproduce:
1. Add the `type=upgrade` label to a node
2. Run the upgrade playbook with `-e openshift_upgrade_nodes_label="type=upgrade"`
3. All nodes get upgraded, not just the 'upgrade'-labeled nodes

Actual results:

All nodes were upgraded, not only the `type=upgrade`-labeled nodes. Customer was not able to produce an ansible log for this issue, as it was not recorded at the time of the upgrade. Since this is a GlusterFS site, it was not advisable to have them run the upgrade again just to obtain logs.

Expected results:

Only the `upgrade`-labeled nodes are upgraded.
Added inventory file and output of the following to private comments:

$ ansible --version
$ oc get nodes --show-labels
Are you certain that the label is actually applied before the upgrade starts? The output above does not show it. Also, based on their inventory, setting a label of 'type=upgrade' seems ill-advised, as it will override existing labels defined in their node groups.
Scott, here is the response I got back from the customer regarding verification of the above:

~~~
The labels changed since that time. I overrode the label of type=physical with type=upgrade while I performed the upgrade. Since it didn't work and all my nodes were upgraded, I changed the label back. I did follow the upgrade instructions you posted.

Regarding overriding existing labels: I agree, but currently we are not leveraging the type=physical label so overriding didn't make a material difference. The label should be arbitrary so I should be able to use foo=bar if I wanted to, correct?
~~~

The upgrade instructions being referred to are the docs [1]. Please let me know if we need anything else.

[1] https://docs.openshift.com/container-platform/3.11/upgrading/automated_upgrades.html#special-considerations-for-glusterfs
Spoke with the customer. They are asking if we have any workarounds in the interim? They said they're happy to try different things if we have something but if we do believe that it's just a bug that will get fixed later, then additional information around it would be appreciated. This is a blocker to their upgrade.
*** Bug 1651224 has been marked as a duplicate of this bug. ***
Upgrading an OCP cluster from 3.10 to 3.11 based on labels skipped the label match and upgraded all the nodes. The `/usr/bin/oc get node --selector=<key>=<value> -o json -n default` command output gives short hostnames, so matching against the FQDNs mentioned in the inventory file failed. Do we have any workaround for this issue?
Customer provided the following update and workaround for Engineering review:

~~~
Looking at this a bit deeper, it appears to be happening in the task "Map labelled nodes to inventory hosts" in playbooks/common/openshift-cluster/upgrades/initialize_nodes_to_upgrade.yml. That task uses the variable hostvars[item].openshift.common.hostname, which gets set in the openshift_facts module as the output of "hostname -f". Unfortunately, nodes are not listed in 'oc get nodes' using their FQDN, so the match never succeeds. Even when the server hostnames matched those in the inventory exactly, it would always skip all hosts, resulting in each node being upgraded:

TASK [Map labelled nodes to inventory hosts] ******************************************************************************************************************
skipping: [master02] => (item=node02)
skipping: [master02] => (item=node03)
skipping: [master02] => (item=node01)
skipping: [master02] => (item=master01)
skipping: [master02] => (item=master02)
skipping: [master02] => (item=master03)

I changed that task to use the variable hostvars[item].openshift.common.raw_hostname instead, which is set in openshift_facts from the output of the command "hostname", and that finally selected the single node during this task, resulting in only that node being upgraded:

TASK [Map labelled nodes to inventory hosts] ******************************************************************************************************************
skipping: [master02] => (item=node02)
skipping: [master02] => (item=node03)
ok: [master02] => (item=node01)
skipping: [master02] => (item=master01)
skipping: [master02] => (item=master02)
skipping: [master02] => (item=master03)

In my test environment I am now able to step through these node upgrades (in the output below, only node01 has been upgraded thus far):

$ oc get nodes
NAME       STATUS    ROLES     AGE       VERSION
master01   Ready     master    6d        v1.11.0+d4cacc0
master02   Ready     master    6d        v1.11.0+d4cacc0
master03   Ready     master    6d        v1.11.0+d4cacc0
node01     Ready     infra     6d        v1.11.0+d4cacc0
node02     Ready     infra     6d        v1.10.0+b81c8f8
node03     Ready     compute   6d        v1.10.0+b81c8f8
~~~
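The failure mode described above can be illustrated with a small sketch. This is not the actual openshift-ansible code; the function and variable names here are hypothetical, and only the matching idea (set membership of a hostname in the labelled-node list) mirrors the task's behavior:

```python
# Hypothetical sketch of the "Map labelled nodes to inventory hosts" matching.
# `oc get node` reports short node names, while openshift.common.hostname
# holds the FQDN from `hostname -f`, so the intersection comes up empty.

def map_labelled_nodes(labelled_node_names, candidate_hostnames):
    """Return the candidate hostnames that appear in the labelled-node list."""
    return [h for h in candidate_hostnames if h in labelled_node_names]

# Names as reported by `oc get node` for the labelled node:
labelled = {"node01"}

# Broken: matching against FQDNs -> no hosts selected, so the playbook
# ends up upgrading every node instead of the labelled subset.
fqdns = ["node01.example.com", "node02.example.com", "node03.example.com"]
print(map_labelled_nodes(labelled, fqdns))       # []

# Fixed: matching against the short node name selects only node01.
short_names = ["node01", "node02", "node03"]
print(map_labelled_nodes(labelled, short_names))  # ['node01']
```

The fix is therefore to compare against whatever name the API server actually registered for the node, rather than the FQDN from the host's own facts.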
Correct value should be openshift.node.nodename. Will get a patch out for this in 3.11 and most likely backport it to 3.10.
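For reference, a hedged approximation of what the corrected task would look like; this is not the verbatim task from initialize_nodes_to_upgrade.yml, and the group and variable names (`oo_nodes_to_config`, `oo_nodes_to_upgrade`, `node_names_matching_label`) are illustrative assumptions:

```yaml
# Illustrative only: match inventory hosts against the node name the API
# server registered (openshift.node.nodename), not the FQDN from
# openshift.common.hostname, so `oc get node` output lines up.
- name: Map labelled nodes to inventory hosts
  add_host:
    name: "{{ item }}"
    groups: oo_nodes_to_upgrade
  when: hostvars[item].openshift.node.nodename in node_names_matching_label
  with_items: "{{ groups['oo_nodes_to_config'] | default([]) }}"
```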
PR created in 3.11: https://github.com/openshift/openshift-ansible/pull/10809
In openshift-ansible-3.11.55-1
Verified this bug with openshift-ansible-3.11.58-1.git.0.ce7e387.el7.noarch.

[root@qe-gpei-3101node-2 ~]# hostname -f
qe-gpei-3101node-2.int.1219-s4p.qe.rhcloud.com

Add label "type=upgrade" to node 'qe-gpei-3101node-2':

[root@qe-gpei-3101master-etcd-1 ~]# oc label node qe-gpei-3101node-2 type=upgrade
node "qe-gpei-3101node-2" labeled

ansible-playbook -i 310 /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.yml -e openshift_upgrade_nodes_label="type=upgrade"

<-snip->
TASK [Retrieve list of openshift nodes matching upgrade label] **************************************************************************************************************
ok: [host-8-251-116.host.centralci.eng.rdu2.redhat.com]

TASK [Fail if no nodes match openshift_upgrade_nodes_label] *****************************************************************************************************************
skipping: [host-8-251-116.host.centralci.eng.rdu2.redhat.com]

TASK [Map labelled nodes to inventory hosts] ********************************************************************************************************************************
skipping: [host-8-251-116.host.centralci.eng.rdu2.redhat.com] => (item=host-8-251-116.host.centralci.eng.rdu2.redhat.com)
skipping: [host-8-251-116.host.centralci.eng.rdu2.redhat.com] => (item=host-8-252-248.host.centralci.eng.rdu2.redhat.com)
skipping: [host-8-251-116.host.centralci.eng.rdu2.redhat.com] => (item=host-8-249-250.host.centralci.eng.rdu2.redhat.com)
ok: [host-8-251-116.host.centralci.eng.rdu2.redhat.com] => (item=host-8-250-227.host.centralci.eng.rdu2.redhat.com)
<-snip->

[root@qe-gpei-3101master-etcd-1 ~]# oc get node
NAME                                 STATUS    ROLES     AGE       VERSION
qe-gpei-3101master-etcd-1            Ready     master    1h        v1.11.0+d4cacc0
qe-gpei-3101node-1                   Ready     compute   1h        v1.10.0+b81c8f8
qe-gpei-3101node-2                   Ready     compute   1h        v1.11.0+d4cacc0
qe-gpei-3101node-registry-router-1   Ready     <none>    1h        v1.10.0+b81c8f8

Only the node qe-gpei-3101node-2 got upgraded.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0024