Bug 1366722 - SDN node startup failed for those upgraded nodes during upgrade
Summary: SDN node startup failed for those upgraded nodes during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Devan Goodwin
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-12 15:59 UTC by Anping Li
Modified: 2016-10-11 15:15 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-11 15:15:31 UTC
Target Upstream Version:


Attachments
Upgrade logs for this case (365.13 KB, text/plain)
2016-08-17 15:03 UTC, Anping Li


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1933 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.3 Release Advisory 2016-09-27 13:24:36 UTC

Description Anping Li 2016-08-12 15:59:52 UTC
Description of problem:
A new policy type was added recently; I believe it is named EgressNetworkPolicies. Because the reconcile task is executed after the node upgrade, the upgraded node cannot start and fails with 'node.go:339] error: SDN node startup failed: Could not get EgressNetworkPolicies:'.
The old nodes could be started once the policy was reconciled, so we may need to adjust the sequence so the policy reconcile runs earlier.
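For reference, on OpenShift 3.x the policy reconcile is normally run from a master with the `oadm policy` commands shown below. This is a minimal workaround sketch for a node stuck in this state, assuming the master has already been upgraded; it is not part of the reported playbook run.

```shell
# Run on an upgraded master: bring cluster roles and bindings up to date
# so nodes are authorized to list the new EgressNetworkPolicies resource.
oadm policy reconcile-cluster-roles --additive-only=true --confirm
oadm policy reconcile-cluster-role-bindings --confirm

# Then restart the node service that failed to start:
systemctl restart atomic-openshift-node
```

These commands require a live cluster, so treat them as an ops sketch rather than something to run blindly; `--additive-only=true` avoids removing customized role rules.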


Version-Release number of selected component (if applicable):
openshift-ansible:master

How reproducible:
always

Steps to Reproduce:
1. install v3.2
2. upgrade to v3.3
3. Check the node status before the upgrade finishes.


Actual results:
Aug 12 15:14:26 upgrade-share-master-1.novalocal systemd[1]: Starting atomic-openshift-node.service...
Aug 12 15:14:26 upgrade-share-master-1.novalocal atomic-openshift-node[16701]: Failed to remove container (atomic-openshift-node): Error response from daemon: No such container: atomic-openshift-node
-bash-4.2# systemctl status atomic-openshift-node
● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: activating (start-post) since Fri 2016-08-12 15:14:26 UTC; 9s ago
  Process: 16691 ExecStop=/usr/bin/docker stop atomic-openshift-node (code=exited, status=1/FAILURE)
  Process: 16701 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-node (code=exited, status=1/FAILURE)
 Main PID: 16709 (docker-current); Control PID: 16710 (sleep)
   Memory: 8.0M
   CGroup: /system.slice/atomic-openshift-node.service
           ├─16709 /usr/bin/docker-current run --name atomic-openshift-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/atomic-openshift-node -v /:/rootfs:ro -e CONFIG_FILE=/etc/origin/n...
           └─control
             └─16710 /usr/bin/sleep 10

Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497236   16753 vnids.go:114] Associate netid 16 to namespace "ruby22"
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497443   16753 reflector.go:202] Starting reflector *api.NetNamespace (30m0s) from github.com/openshift...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497475   16753 reflector.go:253] Listing and watching *api.NetNamespace from github.com/openshift/origi...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497639   16753 reflector.go:202] Starting reflector *api.Service (30m0s) from github.com/openshift/orig...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497665   16753 reflector.go:253] Listing and watching *api.Service from github.com/openshift/origin/pkg...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.508068   16753 fs.go:139] Filesystem partitions: map[/dev/mapper/atomicos-root:{mountpoint:/rootfs majo...ker/devicem
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.517296   16753 subnets.go:225] Watch MODIFIED event for HostSubnet "openshift-116.lab.sjc.redhat.com"
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.517387   16753 controller.go:489] AddHostSubnetRules for openshift-116.lab.sjc.redhat.com (host: "opens....1.2.0/24")
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.517439   16753 ovs.go:37] Executing: /usr/bin/ovs-ofctl -O OpenFlow13 add-flow br0 table=1, priority=10...oto_table:5
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: F0812 15:14:34.521810   16753 node.go:339] error: SDN node startup failed: Could not get EgressNetworkPolicies: User "...the cluster
Hint: Some lines were ellipsized, use -l to show in full.


Expected results:
The node returns to service immediately after it is upgraded.
Additional info:

Comment 1 Devan Goodwin 2016-08-16 19:22:20 UTC
Sounds like a fix is incoming in core OpenShift to stop shutting down the node when this happens, but we can also avoid the need for a node restart after the reconcile by moving the reconcile step between the master and node upgrades.

https://github.com/openshift/openshift-ansible/pull/2310
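The ordering change described above can be sketched roughly as follows. This is an illustrative sequence only, not the actual task list from the linked PR:

```shell
# Illustrative upgrade ordering; the real implementation lives in the
# openshift-ansible playbooks referenced above.

# 1. Upgrade the master components first.

# 2. Reconcile cluster policy while masters are new but nodes are still old,
#    so new resource types (e.g. EgressNetworkPolicies) become readable:
oadm policy reconcile-cluster-roles --additive-only=true --confirm

# 3. Only then upgrade and restart the nodes; their SDN startup can now
#    list EgressNetworkPolicies without being forbidden.
```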

Comment 2 Anping Li 2016-08-17 15:03:59 UTC
Created attachment 1191660 [details]
Upgrade logs for this case

Moving the reconcile between the master and node upgrades alone is not enough; I still hit this error.

Comment 6 Anping Li 2016-08-24 09:17:13 UTC
Verified and passed on atomic-openshift-utils-3.3.14.

Comment 8 errata-xmlrpc 2016-09-27 09:43:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933

Comment 9 Steven Walter 2016-10-05 20:31:16 UTC
Sorry to reopen but we have customers still seeing this behavior. One is reporting:
RHEL 7.2 - openshift rpms 3.3.22 installed.


Another:
openshift-ansible-lookup-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-sdn-ovs-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-docs-3.3.28-1.git.0.762256b.el7.noarch
tuned-profiles-atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-filter-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-utils-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
atomic-openshift-clients-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-roles-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-master-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-playbooks-3.3.28-1.git.0.762256b.el7.noarch

Another:
openshift-ansible-playbooks-3.3.22-1.git.0.6c888c2.el7.noarch

Another was able to resolve the issue by simply restarting the master services.

Attaching cases now; the output in the logs is nearly identical to that which started this case. I can open a separate bz instead if you want, although this seems very much the same issue.

Comment 16 Scott Dodson 2016-10-11 15:15:31 UTC
Re-closing; we'll follow up in the new bug that was filed.

