Could you grab 'journalctl -b -u atomic-openshift-node' from the host when this fails? Ideally the node process is also being run with --loglevel=5, but we might be able to diagnose with normal log level too.
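Something along these lines should capture it (a rough sketch; I'm assuming the loglevel is set via OPTIONS in /etc/sysconfig/atomic-openshift-node on this host, adjust if it lives elsewhere):

  # bump node verbosity to 5 (assumes OPTIONS=--loglevel=N in the sysconfig file)
  sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/atomic-openshift-node
  systemctl restart atomic-openshift-node
  # after the failure reproduces, grab the node logs for the current boot
  journalctl -b -u atomic-openshift-node > atomic-openshift-node.log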
Some brief log overview.

Packages updated:

Nov 20 21:04:34 openshift-214 yum[77721]: Updated: atomic-openshift-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:36 openshift-214 yum[77721]: Updated: tuned-profiles-atomic-openshift-node-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:36 openshift-214 yum[77721]: Updated: atomic-openshift-node-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:37 openshift-214 yum[77721]: Updated: atomic-openshift-sdn-ovs-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:37 openshift-214 yum[77721]: Updated: atomic-openshift-master-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:37 openshift-214 systemd: Reloading.

Master is restarted:

Nov 20 21:04:56 openshift-214 systemd: Starting Atomic OpenShift Master...
Nov 20 21:05:22 openshift-214 systemd: Starting Atomic OpenShift Master...

Docker is restarted, and things go sideways after it comes back up. Not sure who did this, perhaps ansible?

Nov 20 21:06:30 openshift-214 systemd: Stopping Docker Application Container Engine...
Nov 20 21:07:38 openshift-214 systemd: Starting Docker Application Container Engine...
Nov 20 21:07:43 openshift-214 ovs-ofctl: ovs|00001|ofp_util|WARN|Negative value -1 is not a valid port number.

Eventually the node is restarted as a 3.4 process and things right themselves:

Nov 20 22:15:03 openshift-214 systemd: Starting Atomic OpenShift Node...
What I'm pretty sure is happening here is that openshift-sdn has been updated on disk, but the process hasn't been restarted. So it's attempting to use a /usr/bin/openshift-sdn-ovs script that expects to be called by the new version, not the old one. Better warning for this case here: https://github.com/openshift/origin/pull/11990. Not sure what else needs to happen, but whenever the openshift-sdn RPM is updated, openshift-node should probably be restarted at the same time. This is the classic RPM update problem people have hit for years with things like Firefox.
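In case someone hits this by hand before the playbooks handle it, the manual version is roughly (unit and package names taken from the log above; the exact ordering is just a sketch):

  yum update atomic-openshift-node atomic-openshift-sdn-ovs
  # restart the node so the running process matches the new /usr/bin/openshift-sdn-ovs script
  systemctl restart atomic-openshift-node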
An additional change that might help here: the failures all look to be related to updating the registry/router, which is done by running deployer pods. At the time this bug was filed that happened *between* the control plane upgrade and the node upgrade/restart. As fallout from https://bugzilla.redhat.com/show_bug.cgi?id=1395081 I reverted this so that those deployer pods only run after the nodes are fully upgraded. That should help a lot with the above, since we're no longer launching new pods ourselves in the middle of the upgrade while the node service and OVS are in this weird state.
Reducing priority because masters should not be schedulable outside of POC environments. Workaround is to ensure that nodes are upgraded at the same time or that the masters do not have any pods scheduled.
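For the "no pods scheduled on masters" part, marking the master unschedulable before the upgrade and flipping it back afterwards should do it (sketch; <master-node> is a placeholder for the actual node name):

  oadm manage-node <master-node> --schedulable=false
  # ... run the upgrade ...
  oadm manage-node <master-node> --schedulable=true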
I'm increasing the priority since we're planning to deploy a diagnostic pod to the masters soon.