Bug 1396919

Summary: A node colocated with the master does not work after the control plane is upgraded
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Cluster Version Operator
Assignee: Devan Goodwin <dgoodwin>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Anping Li <anli>
Severity: medium
Docs Contact:
Priority: low
Version: 3.4.0
CC: anli, aos-bugs, bbennett, bleanhar, dcbw, dgoodwin, jokerman, mmccomas
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-09 14:13:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Comment 1 Dan Williams 2016-11-21 17:11:38 UTC
Could you grab 'journalctl -b -u atomic-openshift-node' from the host when this fails? Ideally the node process would also be running with --loglevel=5, but we might be able to diagnose at the normal log level too.
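
For reference, a minimal sketch of how to bump verbosity and capture that journal, assuming the stock RPM layout where the node's flags live in /etc/sysconfig/atomic-openshift-node:

# Raise node verbosity to 5 (assumes the default OPTIONS=--loglevel=N entry in the sysconfig file)
sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/atomic-openshift-node
systemctl restart atomic-openshift-node

# After reproducing the failure, collect the node journal for the current boot
journalctl -b -u atomic-openshift-node > atomic-openshift-node.log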

Comment 3 Scott Dodson 2016-11-21 18:50:09 UTC
A brief log overview:

Packages updated

Nov 20 21:04:34 openshift-214 yum[77721]: Updated: atomic-openshift-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:36 openshift-214 yum[77721]: Updated: tuned-profiles-atomic-openshift-node-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:36 openshift-214 yum[77721]: Updated: atomic-openshift-node-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:37 openshift-214 yum[77721]: Updated: atomic-openshift-sdn-ovs-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:37 openshift-214 yum[77721]: Updated: atomic-openshift-master-3.4.0.28-1.git.0.dfe3a66.el7.x86_64
Nov 20 21:04:37 openshift-214 systemd: Reloading.

The master is restarted:

Nov 20 21:04:56 openshift-214 systemd: Starting Atomic OpenShift Master...
Nov 20 21:05:22 openshift-214 systemd: Starting Atomic OpenShift Master...

Docker is restarted, and things go sideways after it comes back up. Not sure who triggered this; perhaps Ansible?

Nov 20 21:06:30 openshift-214 systemd: Stopping Docker Application Container Engine...
Nov 20 21:07:38 openshift-214 systemd: Starting Docker Application Container Engine...
Nov 20 21:07:43 openshift-214 ovs-ofctl: ovs|00001|ofp_util|WARN|Negative value -1 is not a valid port number.

Eventually the node is restarted as a 3.4 process and things right themselves:

Nov 20 22:15:03 openshift-214 systemd: Starting Atomic OpenShift Node...

Comment 4 Dan Williams 2016-11-21 23:23:36 UTC
What I'm pretty sure is happening here is that openshift-sdn has been updated on disk, but the node process hasn't been restarted. So the old process is attempting to use a /usr/bin/openshift-sdn-ovs script that expects to be called by the new version, not the old one.

Better warning here: https://github.com/openshift/origin/pull/11990

Not sure what else needs to happen, but whenever you update the openshift-sdn RPM, you should probably restart openshift-node while you're at it. It's the classic RPM-update problem people have hit for years, e.g. with Firefox.
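
As a rough illustration (not part of the official upgrade playbooks), one way to spot and clear this mismatch on an affected host:

# Compare when the node/SDN RPMs were installed against when the node process started;
# if the RPMs are newer, the running node is still the old binary.
rpm -q --last atomic-openshift-node atomic-openshift-sdn-ovs
systemctl show -p ExecMainStartTimestamp atomic-openshift-node

# Restart the node so it matches the updated /usr/bin/openshift-sdn-ovs helper on disk
systemctl restart atomic-openshift-node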

Comment 5 Devan Goodwin 2016-11-22 14:16:05 UTC
An additional change that might help here: the failures all look related to updating the registry/router, which is done by running deployer pods. At the time this bug was filed, that step ran *between* the control plane upgrade and the node upgrade/restart.

As fallout from https://bugzilla.redhat.com/show_bug.cgi?id=1395081 I reverted this so that those upgrade deployer pods only run after nodes are fully upgraded. That might help a lot with the above, since we're no longer trying to run new pods ourselves in the middle of the upgrade while the node service and OVS are in this inconsistent state.

Comment 6 Scott Dodson 2016-11-22 16:32:34 UTC
Reducing priority because masters should not be schedulable outside of POC environments.

The workaround is to ensure that nodes are upgraded at the same time, or that the masters do not have any pods scheduled on them.
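
For example (hypothetical hostname; exact evacuation flags vary across 3.x releases), the colocated master can be taken out of scheduling before the control plane upgrade:

# Stop new pods from being scheduled onto the master
oadm manage-node openshift-214.example.com --schedulable=false

# Optionally move existing pods (those backed by a replication controller) off the master first
oadm manage-node openshift-214.example.com --evacuate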

Comment 7 Brenton Leanhardt 2017-08-24 19:20:38 UTC
I'm increasing the priority since we're planning to deploy a diagnostic pod to the masters soon.