Description of problem:
Upgrade from 3.2 to 3.3 fails during node restart

Version-Release number of selected component (if applicable):
RHEL 7.2 - openshift rpms 3.3.22 installed.

Another customer:
openshift-ansible-lookup-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-sdn-ovs-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-docs-3.3.28-1.git.0.762256b.el7.noarch
tuned-profiles-atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-filter-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-utils-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
atomic-openshift-clients-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-roles-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-master-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-playbooks-3.3.28-1.git.0.762256b.el7.noarch

Another customer:
openshift-ansible-playbooks-3.3.22-1.git.0.6c888c2.el7.noarch.

How reproducible:
Not yet reproduced

Steps to Reproduce:
1. Install 3.2
2. Upgrade to 3.3
3. It seems this might only occur when masters are supposed to be schedulable

Actual results:
atomic-openshift-node[79829]: F1005 12:04:26.820427 79829 node.go:343] error: SDN node startup failed: could not get EgressNetworkPolicies: the server could not find the requested resource
atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
systemd[1]: Failed to start Atomic OpenShift Node.

Expected results:

I'm working to get full ansible output if possible
Additionally, restarting the master services seems to resolve the issue. I am still working to verify that the install playbook can be re-run successfully (i.e. the upgrade actually completes)
Re-running the install works after restarting the services. Customer has provided ansible output showing the install complete, so the workaround is more or less confirmed.

    systemctl restart atomic-openshift-api
    systemctl restart atomic-openshift-controllers
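
For anyone who wants to script the workaround, here is a rough sketch of the same two restarts wrapped in a function. The function name restart_master_services and the SYSTEMCTL override are hypothetical, added only so the sequence can be dry-run; the service names are the ones given above. It assumes it is run as root on each master.

```shell
#!/bin/sh
# Sketch of the workaround from this report: restart the master services
# so the upgraded API is serving before the node is started again.
# SYSTEMCTL is a hypothetical override for dry runs (e.g. SYSTEMCTL=echo);
# it defaults to the real systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restart_master_services() {
    # Stop at the first failure so a broken API restart is not masked.
    "$SYSTEMCTL" restart atomic-openshift-api || return 1
    "$SYSTEMCTL" restart atomic-openshift-controllers || return 1
}
```

Calling restart_master_services on each master and then re-running the install playbook matches what the customer did; with SYSTEMCTL=echo it only prints the commands it would run.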
I was unable to reproduce but with the logfile Steven provided I found a likely fix: https://github.com/openshift/openshift-ansible/pull/2593
After the upgrade, the atomic-openshift-node service still has the same PID as before, so it was never restarted. The service should be restarted.
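
A quick way to check this is to compare the service's main PID before and after the upgrade. The sketch below is only an illustration: the helper names main_pid and was_restarted are made up here, and it assumes systemd's `systemctl show -p MainPID`, which prints a line of the form MainPID=<pid> (0 when the service is not running).

```shell
#!/bin/sh
# Print the main PID of a systemd unit, e.g. "79829".
# `systemctl show -p MainPID UNIT` prints "MainPID=79829".
main_pid() {
    systemctl show -p MainPID "$1" | cut -d= -f2
}

# Succeed (exit 0) only if the unit is running under a different PID
# than before, i.e. it was actually restarted.
was_restarted() {
    before="$1"
    after="$2"
    [ -n "$after" ] && [ "$after" != "0" ] && [ "$before" != "$after" ]
}
```

Usage would be along the lines of: record before=$(main_pid atomic-openshift-node), run the upgrade, record after=$(main_pid atomic-openshift-node), then run was_restarted "$before" "$after" and check the exit status.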
Easy enough to reproduce on both masters and nodes. Apparently the only node restart done during upgrade was the one triggered by a change in /etc/sysconfig/atomic-openshift-node, so if nothing changed in that file the node was never restarted. (There is nothing version-specific in there, so often nothing will change.)
This was a good catch, thanks Anping. https://github.com/openshift/openshift-ansible/pull/2604
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2122