Bug 1440167 - Control Plane Upgrade Fails if Nodes Do Not Have Access To Latest Excluder
Summary: Control Plane Upgrade Fails if Nodes Do Not Have Access To Latest Excluder
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jan Chaloupka
QA Contact: liujia
URL:
Whiteboard:
Depends On: 1456093
Blocks: 1436348
 
Reported: 2017-04-07 12:54 UTC by Devan Goodwin
Modified: 2020-09-10 10:26 UTC
CC List: 6 users

Fixed In Version: openshift-ansible-3.5.78-1
Doc Type: Bug Fix
Doc Text:
During the control plane upgrade, a subset of pre-check and verification tasks is run. Unfortunately, these tasks were also run on non-control-plane nodes. Some of the tasks require the excluders to be disabled in order to work properly. Because the excluders are disabled on control plane hosts only, the tasks run on the remaining nodes caused a failure. With this fix, all of the pre-check and verification tasks are run on control plane nodes only.
Clone Of:
Environment:
Last Closed: 2017-06-29 13:33:14 UTC
Target Upstream Version:
Embargoed:




Links:
  System ID:    Red Hat Product Errata RHBA-2017:1666
  Priority:     normal
  Status:       SHIPPED_LIVE
  Summary:      OpenShift Container Platform atomic-openshift-utils bug fix and enhancement
  Last Updated: 2017-06-29 17:32:39 UTC

Description Devan Goodwin 2017-04-07 12:54:17 UTC
Description of problem:

It appears that the control plane upgrade attempts to perform excluder tasks on all nodes in the cluster. If those nodes do not have the new repo enabled (in this case, for 3.5), the tasks fail; the masters are still upgraded roughly as expected, but the overall Ansible run reports a failure much later due to the problems on those nodes.

This is a particular problem for blue-green upgrades: old nodes should not be modifying their repos to access 3.5 (the new version), as they will never run it. I suspect that if the repos were available, the upgrade would also try to update those nodes to 3.5 packages.


Version-Release number of selected component (if applicable):

openshift-ansible 3.5.48


How reproducible:

I believe 100%.

Steps to Reproduce:
1. Ensure only the masters have the 3.5 repos enabled; the nodes should not, as they will be replaced by new nodes. (A sketch of this setup follows the steps.)
2. Run control plane upgrade.
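
For reference, a minimal Ansible sketch of the repo setup from step 1, assuming the conventional "masters" and "nodes" inventory groups and a repo id of rhel-7-server-ose-3.5-rpms (both are assumptions; adjust to your environment):

# Hypothetical setup play: enable the 3.5 repo on the masters only and keep
# it disabled on the nodes that will later be replaced.
- hosts: masters
  tasks:
    - name: Enable the OCP 3.5 repo on masters
      command: yum-config-manager --enable rhel-7-server-ose-3.5-rpms

- hosts: nodes
  tasks:
    - name: Keep the OCP 3.5 repo disabled on nodes
      command: yum-config-manager --disable rhel-7-server-ose-3.5-rpms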


Actual results:

2017-04-07 08:07:25,582 p=19738 u=dgoodwin |  TASK [openshift_excluder : Evalute if docker excluder is to be enabled] ********
2017-04-07 08:07:25,610 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-ec552]
2017-04-07 08:07:25,621 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-cbc60]
2017-04-07 08:07:25,643 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-28671]
2017-04-07 08:07:25,655 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-a39c2]
2017-04-07 08:07:25,667 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-f5ad4]
2017-04-07 08:07:25,667 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-037a7]
2017-04-07 08:07:25,667 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-bfd99]
2017-04-07 08:07:25,676 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-infra-95fd7]
2017-04-07 08:07:25,687 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-infra-01651]
2017-04-07 08:07:25,692 p=19738 u=dgoodwin |  TASK [openshift_excluder : debug] **********************************************
2017-04-07 08:07:25,720 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-ec552] => {
    "docker_excluder_on": true
}
2017-04-07 08:07:25,735 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-cbc60] => {
    "docker_excluder_on": true
}
2017-04-07 08:07:25,743 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-bfd99] => {
    "docker_excluder_on": true
}
2017-04-07 08:07:25,753 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-28671] => {
    "docker_excluder_on": true
}
2017-04-07 08:07:25,764 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-f5ad4] => {
    "docker_excluder_on": true
}
2017-04-07 08:07:25,776 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-a39c2] => {
    "docker_excluder_on": true
}
2017-04-07 08:07:25,777 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-037a7] => {
    "docker_excluder_on": true
}
2017-04-07 08:07:25,779 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-infra-95fd7] => {
    "docker_excluder_on": true
}
2017-04-07 08:07:25,794 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-infra-01651] => {
    "docker_excluder_on": true
}
2017-04-07 08:07:25,799 p=19738 u=dgoodwin |  TASK [openshift_excluder : Evalute if openshift excluder is to be enabled] *****
2017-04-07 08:07:25,827 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-ec552]
2017-04-07 08:07:25,838 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-cbc60]
2017-04-07 08:07:25,859 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-28671]
2017-04-07 08:07:25,870 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-a39c2]
2017-04-07 08:07:25,882 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-f5ad4]
2017-04-07 08:07:25,883 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-bfd99]
2017-04-07 08:07:25,884 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-037a7]
2017-04-07 08:07:25,891 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-infra-95fd7]
2017-04-07 08:07:25,903 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-infra-01651]
2017-04-07 08:07:25,907 p=19738 u=dgoodwin |  TASK [openshift_excluder : debug] **********************************************
2017-04-07 08:07:25,943 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-cbc60] => {
    "openshift_excluder_on": true
}
2017-04-07 08:07:25,955 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-ec552] => {
    "openshift_excluder_on": true
}
2017-04-07 08:07:25,964 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-a39c2] => {
    "openshift_excluder_on": true
}
2017-04-07 08:07:25,973 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-bfd99] => {
    "openshift_excluder_on": true
}
2017-04-07 08:07:25,985 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-037a7] => {
    "openshift_excluder_on": true
}
2017-04-07 08:07:25,985 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-compute-f5ad4] => {
    "openshift_excluder_on": true
}
2017-04-07 08:07:25,986 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-28671] => {
    "openshift_excluder_on": true
}
2017-04-07 08:07:25,993 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-infra-95fd7] => {
    "openshift_excluder_on": true
}
2017-04-07 08:07:26,001 p=19738 u=dgoodwin |  ok: [ded-stage-aws-node-infra-01651] => {
    "openshift_excluder_on": true
}
2017-04-07 08:07:26,005 p=19738 u=dgoodwin |  TASK [openshift_excluder : Install docker excluder] ****************************
2017-04-07 08:07:35,580 p=19738 u=dgoodwin |  fatal: [ded-stage-aws-node-compute-f5ad4]: FAILED! => {"changed": false, "failed": true, "msg": "No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated", "rc": 126, "results": ["No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated"]}
2017-04-07 08:07:35,670 p=19738 u=dgoodwin |  fatal: [ded-stage-aws-node-compute-bfd99]: FAILED! => {"changed": false, "failed": true, "msg": "No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated", "rc": 126, "results": ["No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated"]}
2017-04-07 08:07:35,773 p=19738 u=dgoodwin |  fatal: [ded-stage-aws-node-compute-037a7]: FAILED! => {"changed": false, "failed": true, "msg": "No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated", "rc": 126, "results": ["No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated"]}
2017-04-07 08:07:35,889 p=19738 u=dgoodwin |  fatal: [ded-stage-aws-node-compute-a39c2]: FAILED! => {"changed": false, "failed": true, "msg": "No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated", "rc": 126, "results": ["No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated"]}
2017-04-07 08:07:36,149 p=19738 u=dgoodwin |  fatal: [ded-stage-aws-node-infra-01651]: FAILED! => {"changed": false, "failed": true, "msg": "No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated", "rc": 126, "results": ["No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated"]}
2017-04-07 08:07:36,478 p=19738 u=dgoodwin |  fatal: [ded-stage-aws-node-infra-95fd7]: FAILED! => {"changed": false, "failed": true, "msg": "No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated", "rc": 126, "results": ["No package matching 'atomic-openshift-docker-excluder-3.5.5.3*' found available, installed or updated"]}
2017-04-07 08:07:44,225 p=19738 u=dgoodwin |  changed: [ded-stage-aws-master-ec552]
2017-04-07 08:07:45,199 p=19738 u=dgoodwin |  changed: [ded-stage-aws-master-cbc60]
2017-04-07 08:07:45,491 p=19738 u=dgoodwin |  changed: [ded-stage-aws-master-28671]
2017-04-07 08:07:45,496 p=19738 u=dgoodwin |  TASK [openshift_excluder : Install openshift excluder] *************************
2017-04-07 08:08:06,447 p=19738 u=dgoodwin |  changed: [ded-stage-aws-master-ec552]
2017-04-07 08:08:06,549 p=19738 u=dgoodwin |  changed: [ded-stage-aws-master-28671]
2017-04-07 08:08:06,553 p=19738 u=dgoodwin |  changed: [ded-stage-aws-master-cbc60]
2017-04-07 08:08:06,559 p=19738 u=dgoodwin |  TASK [openshift_excluder : Check for docker-excluder] **************************
2017-04-07 08:08:06,730 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-cbc60]
2017-04-07 08:08:06,738 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-ec552]
2017-04-07 08:08:06,742 p=19738 u=dgoodwin |  ok: [ded-stage-aws-master-28671]
2017-04-07 08:08:06,747 p=19738 u=dgoodwin |  TASK [openshift_excluder : Enable docker excluder] *****************************
2017-04-07 08:08:06,944 p=19738 u=dgoodwin |  changed: [ded-stage-aws-master-cbc60]
2017-04-07 08:08:06,951 p=19738 u=dgoodwin |  changed: [ded-stage-aws-master-28671]
2017-04-07 08:08:06,962 p=19738 u=dgoodwin |  changed: [ded-stage-aws-master-ec552]


Much later the end result is:

2017-04-07 08:29:32,125 p=19738 u=dgoodwin |  PLAY RECAP *********************************************************************
2017-04-07 08:29:32,125 p=19738 u=dgoodwin |  ded-stage-aws-master-28671 : ok=308  changed=31   unreachable=0    failed=0   
2017-04-07 08:29:32,125 p=19738 u=dgoodwin |  ded-stage-aws-master-cbc60 : ok=308  changed=31   unreachable=0    failed=0   
2017-04-07 08:29:32,125 p=19738 u=dgoodwin |  ded-stage-aws-master-ec552 : ok=447  changed=66   unreachable=0    failed=0   
2017-04-07 08:29:32,126 p=19738 u=dgoodwin |  ded-stage-aws-node-compute-037a7 : ok=48   changed=2    unreachable=0    failed=1   
2017-04-07 08:29:32,126 p=19738 u=dgoodwin |  ded-stage-aws-node-compute-a39c2 : ok=48   changed=2    unreachable=0    failed=1   
2017-04-07 08:29:32,126 p=19738 u=dgoodwin |  ded-stage-aws-node-compute-bfd99 : ok=48   changed=2    unreachable=0    failed=1   
2017-04-07 08:29:32,126 p=19738 u=dgoodwin |  ded-stage-aws-node-compute-f5ad4 : ok=48   changed=2    unreachable=0    failed=1   
2017-04-07 08:29:32,126 p=19738 u=dgoodwin |  ded-stage-aws-node-infra-01651 : ok=48   changed=2    unreachable=0    failed=1   
2017-04-07 08:29:32,126 p=19738 u=dgoodwin |  ded-stage-aws-node-infra-95fd7 : ok=48   changed=2    unreachable=0    failed=1   
2017-04-07 08:29:32,126 p=19738 u=dgoodwin |  localhost                  : ok=35   changed=0    unreachable=0    failed=0   



Expected results:

Nodes should not be touched during a control plane upgrade, and should not require access to the latest excluder RPMs.


Additional info:

Comment 1 Devan Goodwin 2017-04-07 13:20:37 UTC
If I enable a 3.5 repo on these old nodes and then re-try, the excluder is upgraded to 3.5, but no other packages are affected.

However, the excluder is then *disabled* even on the old nodes, which should not be getting upgraded:

[root@ded-stage-aws-node-compute-a39c2 ~]# atomic-openshift-excluder status
unexclude -- At least one package not excluded

Control plane upgrade now succeeds.

This may be an acceptable workaround for now, provided use of the 3.5 excluder does not cause problems on a 3.4 system.
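
If the old nodes are left with their excluders disabled by this workaround, a minimal sketch to turn them back on (hedged: "nodes" is an assumed inventory group for the old nodes, and "exclude" is the subcommand that re-adds the package excludes reported by "status" above):

# Hypothetical cleanup after the workaround: re-enable both excluders on the
# old nodes, since the control plane upgrade left them disabled there.
- hosts: nodes
  tasks:
    - name: Re-enable the openshift excluder on old nodes
      command: atomic-openshift-excluder exclude
    - name: Re-enable the docker excluder on old nodes
      command: atomic-openshift-docker-excluder exclude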

However, long term:

- old nodes should not need access to new OpenShift repos for a control plane upgrade
- old nodes should not have RPMs updated during a control plane upgrade
- old nodes should not get their excluder disabled during a control plane upgrade

Comment 2 Jan Chaloupka 2017-04-07 14:24:17 UTC
Possible fix in upstream PR: https://github.com/openshift/openshift-ansible/pull/3879.

Comment 3 Jan Chaloupka 2017-06-05 10:46:08 UTC
With https://github.com/openshift/openshift-ansible/pull/4321 merged, the control plane upgrade and the node upgrade now have separate pre-verification tasks.
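
In outline, the separation looks something like the sketch below (an illustration only, not the actual plays; oo_masters_to_config and oo_nodes_to_config are the host groups openshift-ansible builds internally, and the real verification involves more than the excluder role):

# Sketch: the control plane upgrade playbook runs its pre-verification,
# including excluder handling, against control plane hosts only...
- name: Pre-upgrade verification for the control plane upgrade
  hosts: oo_masters_to_config
  roles:
    - openshift_excluder

# ...while the node upgrade playbook carries its own verification, so nodes
# are no longer touched during a control-plane-only upgrade.
- name: Pre-upgrade verification for the node upgrade
  hosts: oo_nodes_to_config
  roles:
    - openshift_excluder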

Comment 11 liujia 2017-06-15 07:34:53 UTC
Version:
atomic-openshift-utils-3.5.82-1.git.0.e3e25f6.el7.noarch

Steps:
1. Install OCP 3.4 (one master/node + one node).
2. Ensure atomic-openshift-excluder and atomic-openshift-docker-excluder are installed and enabled on all hosts.
3. Enable the 3.5 repo on the master only.
4. Run upgrade_control_plane.yml to upgrade the masters first:
# ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade_control_plane.yml

Result:
The master upgrade succeeded with no failures. Excluders were upgraded only on the master host.

Comment 13 errata-xmlrpc 2017-06-29 13:33:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1666

