Bug 1382380

Summary: Upgrade from 3.2 to 3.3 fails with could not get EgressNetworkPolicies
Product: OpenShift Container Platform Reporter: Steven Walter <stwalter>
Component: Cluster Version Operator Assignee: Devan Goodwin <dgoodwin>
Status: CLOSED ERRATA QA Contact: Anping Li <anli>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.3.0 CC: anli, aos-bugs, dgoodwin, jialiu, jokerman, mmccomas
Target Milestone: ---   
Target Release: 3.3.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The node service was incorrectly being restarted after upgrading the master RPM packages. Consequence: In some environments, a version mismatch could occur between the node service and the not-yet-restarted master service, causing the upgrade to fail. Fix: The incorrect node restart was removed, and the logic was reordered to ensure masters are upgraded and restarted before proceeding to node upgrade/restart. Result: The upgrade now completes successfully.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-27 16:13:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Steven Walter 2016-10-06 14:08:52 UTC
Description of problem:
Upgrade from 3.2 to 3.3 fails during node restart

Version-Release number of selected component (if applicable):

RHEL 7.2 - openshift rpms 3.3.22 installed.

Another customer:
openshift-ansible-lookup-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-sdn-ovs-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-docs-3.3.28-1.git.0.762256b.el7.noarch
tuned-profiles-atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-filter-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-utils-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
atomic-openshift-clients-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-roles-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-master-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-playbooks-3.3.28-1.git.0.762256b.el7.noarch

Another customer:
openshift-ansible-playbooks-3.3.22-1.git.0.6c888c2.el7.noarch.

How reproducible:
Not yet reproduced

Steps to Reproduce:
1. Install 3.2
2. Upgrade to 3.3
3. It seems this might only occur when masters are supposed to be schedulable

Actual results:
atomic-openshift-node[79829]: F1005 12:04:26.820427   79829 node.go:343] error: SDN node startup failed: could not get EgressNetworkPolicies: the server could not find the requested resource
atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
systemd[1]: Failed to start Atomic OpenShift Node.

Expected results:

I'm working to get full ansible output if possible
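
For triage, the failure signature in the node log lines above can be matched mechanically. The following is a minimal Python sketch, not part of any OpenShift tooling; the function name `find_missing_resource` and the regular expression are assumptions built only from the log line quoted in this report.

```python
import re

# Pattern built from the fatal line quoted in this report (an assumption:
# other releases may word the message differently).
SDN_FAILURE = re.compile(
    r"SDN node startup failed: could not get (?P<resource>\w+): "
    r"the server could not find the requested resource"
)


def find_missing_resource(journal_lines):
    """Return the resource named in the first matching line, or None."""
    for line in journal_lines:
        match = SDN_FAILURE.search(line)
        if match:
            return match.group("resource")
    return None
```

A match returning `EgressNetworkPolicies` is consistent with the version mismatch described in the Doc Text: an upgraded node service talking to a master service that has not yet been restarted.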

Comment 1 Steven Walter 2016-10-06 14:14:09 UTC
Additionally, restarting the master services seems to resolve the issue. I am still working to verify that the install playbook can be re-run successfully (i.e. the upgrade actually completes)

Comment 2 Steven Walter 2016-10-10 13:48:46 UTC
Re-running the install works after restarting the services. Customer has provided ansible output showing the install complete, so the workaround is more or less confirmed.

systemctl restart atomic-openshift-api
systemctl restart atomic-openshift-controllers

Comment 4 Devan Goodwin 2016-10-12 18:10:10 UTC
I was unable to reproduce but with the logfile Steven provided I found a likely fix: https://github.com/openshift/openshift-ansible/pull/2593
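
Per the Doc Text on this bug, the fix ensures masters are upgraded and restarted before any node upgrade or restart. A minimal Python sketch of that ordering follows. The function `upgrade_plan` and the step names are hypothetical and are not taken from the openshift-ansible playbooks; the sketch only illustrates the ordering constraint.

```python
def upgrade_plan(masters, nodes):
    """Return (action, host) steps in execution order.

    Every master is upgraded and restarted before any node step runs.
    A host listed in both groups (a schedulable master) therefore has
    its node service restarted only after all masters are done.
    """
    plan = []
    for host in masters:
        plan.append(("upgrade_master", host))
        plan.append(("restart_master", host))
    for host in nodes:
        plan.append(("upgrade_node", host))
        plan.append(("restart_node", host))
    return plan
```

With this ordering, a node service never starts against a master that is still running the previous version, which is the mismatch this bug describes.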

Comment 6 Anping Li 2016-10-14 06:04:22 UTC
After the upgrade, the atomic-openshift-node service PID is the same as before. The service should have been restarted.
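
One way to perform this PID check is to capture the output of `systemctl show -p MainPID atomic-openshift-node` before and after the upgrade and compare the values. The Python sketch below assumes the output has the form `MainPID=<number>`; the helper names are hypothetical.

```python
def parse_main_pid(show_output):
    """Extract the integer from a 'MainPID=<n>' line; None if absent."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "MainPID":
            return int(value.strip())
    return None


def was_restarted(before_output, after_output):
    """True if the service has a running main process with a new PID."""
    before = parse_main_pid(before_output)
    after = parse_main_pid(after_output)
    # MainPID=0 means systemd reports no running main process.
    return before is not None and after not in (None, 0) and before != after
```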

Comment 9 Devan Goodwin 2016-10-14 14:20:05 UTC
Easy enough to reproduce on both masters and nodes. This was apparently the only node restart being done during upgrade if nothing changed in /etc/sysconfig/atomic-openshift-node. (There is nothing version-specific in there, so often nothing will change.)
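
The gap described here can be pictured as a restart decision that depends only on a config-file change. The Python sketch below is illustrative and is not the playbook logic from the linked pull requests; the function names and parameters are hypothetical.

```python
def restart_on_config_change_only(config_changed):
    """Restart rule keyed to the sysconfig file alone."""
    return config_changed


def restart_on_config_or_package_change(config_changed, version_before, version_after):
    """Restart rule that also fires when the installed package version changes."""
    return config_changed or version_before != version_after
```

With an unchanged /etc/sysconfig/atomic-openshift-node, the first rule never restarts the node even though the package moved from 3.2 to 3.3, which matches the unchanged PID reported in comment 6.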

Comment 10 Devan Goodwin 2016-10-14 15:09:24 UTC
This was a good catch, thanks Anping.

https://github.com/openshift/openshift-ansible/pull/2604

Comment 13 errata-xmlrpc 2016-10-27 16:13:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2122