Bug 1382380 - Upgrade from 3.2 to 3.3 fails with could not get EgressNetworkPolicies
Summary: Upgrade from 3.2 to 3.3 fails with could not get EgressNetworkPolicies
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.3.1
Assignee: Devan Goodwin
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-06 14:08 UTC by Steven Walter
Modified: 2020-04-15 14:43 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The node service was incorrectly restarted after the master RPM packages were upgraded.
Consequence: In some environments this triggered a version mismatch between the restarted node service and the not-yet-restarted master service, causing the upgrade to fail.
Fix: The incorrect node restart was removed, and the logic was reordered to ensure masters are upgraded and restarted before node upgrade/restart proceeds.
Result: The upgrade now completes successfully.
Clone Of:
Environment:
Last Closed: 2016-10-27 16:13:51 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:2122 0 normal SHIPPED_LIVE OpenShift Container Platform atomic-openshift-utils bug fix update 2016-10-27 20:11:30 UTC

Description Steven Walter 2016-10-06 14:08:52 UTC
Description of problem:
Upgrade from 3.2 to 3.3 fails during node restart

Version-Release number of selected component (if applicable):

RHEL 7.2 - openshift rpms 3.3.22 installed.

Another customer:
openshift-ansible-lookup-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-sdn-ovs-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-docs-3.3.28-1.git.0.762256b.el7.noarch
tuned-profiles-atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-filter-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-utils-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
atomic-openshift-clients-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-roles-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-master-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-playbooks-3.3.28-1.git.0.762256b.el7.noarch

Another customer:
openshift-ansible-playbooks-3.3.22-1.git.0.6c888c2.el7.noarch

How reproducible:
Not yet reproduced

Steps to Reproduce:
1. Install 3.2
2. Upgrade to 3.3
3. Observe the failure; it seems to occur only when masters are set to be schedulable

Actual results:

atomic-openshift-node[79829]: F1005 12:04:26.820427   79829 node.go:343] error: SDN node startup failed: could not get EgressNetworkPolicies: the server could not find the requested resource
atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
systemd[1]: Failed to start Atomic OpenShift Node.

Expected results:

Upgrade completes successfully.
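The failure above is a version skew: the node service came up on 3.3 bits and asked the still-running 3.2 API for EgressNetworkPolicies, a resource the 3.2 master does not serve. A minimal sketch of the skew check, with hard-coded sample versions standing in for what `openshift version` would report on each host (the values and the `minor` helper are illustrative, not part of the product):

```shell
# Sample versions standing in for the master and node hosts in this bug.
master_version="3.2.1"   # API not yet restarted, still serving 3.2
node_version="3.3.0"     # node service already restarted onto 3.3 bits

# Extract the major.minor portion of a version string.
minor() { echo "$1" | cut -d. -f1-2; }

skew=""
if [ "$(minor "$master_version")" != "$(minor "$node_version")" ]; then
  skew="version skew: node $node_version vs master $master_version"
  echo "$skew"
fi
```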

I'm working to get full ansible output if possible.

Comment 1 Steven Walter 2016-10-06 14:14:09 UTC
Additionally, restarting the master services seems to resolve the issue. I am still working to verify that the install playbook can be re-run successfully (i.e. the upgrade actually completes)

Comment 2 Steven Walter 2016-10-10 13:48:46 UTC
Re-running the install works after restarting the services. Customer has provided ansible output showing the install complete, so the workaround is more or less confirmed.

systemctl restart atomic-openshift-api
systemctl restart atomic-openshift-controllers
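The workaround mirrors the eventual fix: masters must be upgraded and restarted before any node service comes up on 3.3 bits. A toy sketch of that ordering (host names and function bodies are illustrative; the real change lives in openshift-ansible, and here the functions only echo the phase they represent):

```shell
# Record the order in which hosts are handled, to show every master is
# finished before any node is touched.
order=""
upgrade_master() { order="$order M:$1"; echo "upgrade+restart master $1"; }
upgrade_node()   { order="$order N:$1"; echo "upgrade+restart node $1"; }

# All masters first, then all nodes.
for m in master1 master2; do upgrade_master "$m"; done
for n in node1 node2;     do upgrade_node   "$n"; done
```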

Comment 4 Devan Goodwin 2016-10-12 18:10:10 UTC
I was unable to reproduce but with the logfile Steven provided I found a likely fix: https://github.com/openshift/openshift-ansible/pull/2593

Comment 6 Anping Li 2016-10-14 06:04:22 UTC
After the upgrade, the PID of the atomic-openshift-node service is the same as before, so the service was never restarted. The service should be restarted.

Comment 9 Devan Goodwin 2016-10-14 14:20:05 UTC
Easy enough to reproduce on both masters and nodes. This was apparently the only node restart being done during the upgrade, and it only fires if something changed in /etc/sysconfig/atomic-openshift-node. (There is nothing version-specific in that file, so often nothing will change and the node is never restarted.)
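Comment 9 describes a restart gated on a config-file change: if the rendered sysconfig file is byte-identical before and after the upgrade, the restart never happens. A small illustration of that gating, using throwaway temp files rather than the live /etc/sysconfig/atomic-openshift-node (file contents here are made-up sample values):

```shell
# Simulate the before/after sysconfig files with identical sample content,
# since nothing version-specific lives in that file.
workdir=$(mktemp -d)
old="$workdir/sysconfig.before"
new="$workdir/sysconfig.after"
printf 'OPTIONS=--loglevel=2\n' > "$old"
printf 'OPTIONS=--loglevel=2\n' > "$new"

# A change-gated handler would restart only if the file content differs.
restarted="no"
if ! cmp -s "$old" "$new"; then
  restarted="yes"
fi
echo "node restarted: $restarted"
rm -rf "$workdir"
```

Because the contents are unchanged, the sketch prints that no restart occurred, which is exactly the gap the second PR closes by restarting nodes explicitly.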

Comment 10 Devan Goodwin 2016-10-14 15:09:24 UTC
This was a good catch, thanks Anping.

https://github.com/openshift/openshift-ansible/pull/2604

Comment 13 errata-xmlrpc 2016-10-27 16:13:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2122

