Bug 1366722 - SDN node startup failed for those upgraded nodes during upgrade
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.3.0
Hardware: Unspecified OS: Unspecified
Priority: high Severity: high
Assigned To: Devan Goodwin
QA Contact: Anping Li
Keywords: Reopened
Depends On:
Blocks:
Reported: 2016-08-12 11:59 EDT by Anping Li
Modified: 2016-10-11 11:15 EDT (History)
CC List: 8 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-11 11:15:31 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Upgrade logs for this case (365.13 KB, text/plain)
2016-08-17 11:03 EDT, Anping Li


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1933 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.3 Release Advisory 2016-09-27 09:24:36 EDT

Description Anping Li 2016-08-12 11:59:52 EDT
Description of problem:
There was a recent policy change; I guess the name is EgressNetworkPolicies. Because the reconcile task is executed after the node upgrade, the upgraded node can't start, failing with 'node.go:339] error: SDN node startup failed: Could not get EgressNetworkPolicies:'.
The old node could be started once the policy was reconciled, so we may need to adjust the sequence so that the policy is reconciled first.
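Based on the error above, a manual recovery on an affected cluster would be to reconcile the cluster roles first and then restart the node service. A minimal sketch, assuming an upgraded 3.x master with the `oadm` CLI and cluster-admin credentials (this is not the playbook fix from this bug, just the equivalent by hand):

```shell
# On an upgraded master: bring cluster roles and bindings up to date so the
# node's credentials are allowed to read EgressNetworkPolicies.
oadm policy reconcile-cluster-roles                # dry run: show the diff
oadm policy reconcile-cluster-roles --confirm      # apply the updated roles
oadm policy reconcile-cluster-role-bindings --confirm

# On the failing node: restart so SDN startup retries with the new policy.
systemctl restart atomic-openshift-node
```

Running the reconcile without `--confirm` first lets you review the role changes before applying them.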


Version-Release number of selected component (if applicable):
openshift-ansible:master

How reproducible:
always

Steps to Reproduce:
1. Install v3.2.
2. Upgrade to v3.3.
3. Check the node status before the upgrade finishes.
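The steps above can be sketched as follows; the inventory path and the upgrade playbook location are assumptions based on the openshift-ansible 3.x layout, not taken from this report:

```shell
# 2. Run the 3.2 -> 3.3 upgrade playbook against an existing v3.2 cluster
#    (playbook path is an assumption for openshift-ansible 3.x):
ansible-playbook -i /path/to/inventory \
    playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml

# 3. While the upgrade is still running, watch node status and the node log:
oc get nodes -w
journalctl -u atomic-openshift-node -f
```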


Actual results:
Aug 12 15:14:26 upgrade-share-master-1.novalocal systemd[1]: Starting atomic-openshift-node.service...
Aug 12 15:14:26 upgrade-share-master-1.novalocal atomic-openshift-node[16701]: Failed to remove container (atomic-openshift-node): Error response from daemon: No such container: atomic-openshift-node
-bash-4.2# systemctl status atomic-openshift-node
● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: activating (start-post) since Fri 2016-08-12 15:14:26 UTC; 9s ago
  Process: 16691 ExecStop=/usr/bin/docker stop atomic-openshift-node (code=exited, status=1/FAILURE)
  Process: 16701 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-node (code=exited, status=1/FAILURE)
 Main PID: 16709 (docker-current); Control PID: 16710 (sleep)
   Memory: 8.0M
   CGroup: /system.slice/atomic-openshift-node.service
           ├─16709 /usr/bin/docker-current run --name atomic-openshift-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/atomic-openshift-node -v /:/rootfs:ro -e CONFIG_FILE=/etc/origin/n...
           └─control
             └─16710 /usr/bin/sleep 10

Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497236   16753 vnids.go:114] Associate netid 16 to namespace "ruby22"
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497443   16753 reflector.go:202] Starting reflector *api.NetNamespace (30m0s) from github.com/openshift...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497475   16753 reflector.go:253] Listing and watching *api.NetNamespace from github.com/openshift/origi...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497639   16753 reflector.go:202] Starting reflector *api.Service (30m0s) from github.com/openshift/orig...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497665   16753 reflector.go:253] Listing and watching *api.Service from github.com/openshift/origin/pkg...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.508068   16753 fs.go:139] Filesystem partitions: map[/dev/mapper/atomicos-root:{mountpoint:/rootfs majo...ker/devicem
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.517296   16753 subnets.go:225] Watch MODIFIED event for HostSubnet "openshift-116.lab.sjc.redhat.com"
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.517387   16753 controller.go:489] AddHostSubnetRules for openshift-116.lab.sjc.redhat.com (host: "opens....1.2.0/24")
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.517439   16753 ovs.go:37] Executing: /usr/bin/ovs-ofctl -O OpenFlow13 add-flow br0 table=1, priority=10...oto_table:5
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: F0812 15:14:34.521810   16753 node.go:339] error: SDN node startup failed: Could not get EgressNetworkPolicies: User "...the cluster
Hint: Some lines were ellipsized, use -l to show in full.


Expected results:
The node is back in service immediately after it is upgraded.
Additional info:
Comment 1 Devan Goodwin 2016-08-16 15:22:20 EDT
Sounds like a fix is incoming in core OpenShift to stop shutting down the node when this happens, but we can also avoid the need for a node restart after the reconcile by moving the reconcile between the master and node upgrades.

https://github.com/openshift/openshift-ansible/pull/2310
Comment 2 Anping Li 2016-08-17 11:03 EDT
Created attachment 1191660
Upgrade logs for this case

Moving the reconcile between the master and node upgrades alone is not enough; I still hit this error.
Comment 6 Anping Li 2016-08-24 05:17:13 EDT
Verified and passed on atomic-openshift-utils-3.3.14.
Comment 8 errata-xmlrpc 2016-09-27 05:43:59 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933
Comment 9 Steven Walter 2016-10-05 16:31:16 EDT
Sorry to reopen but we have customers still seeing this behavior. One is reporting:
RHEL 7.2 - openshift rpms 3.3.22 installed.


Another:
openshift-ansible-lookup-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-sdn-ovs-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-docs-3.3.28-1.git.0.762256b.el7.noarch
tuned-profiles-atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-filter-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-utils-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
atomic-openshift-clients-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-roles-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-master-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-playbooks-3.3.28-1.git.0.762256b.el7.noarch

Another:
openshift-ansible-playbooks-3.3.22-1.git.0.6c888c2.el7.noarch.

Another was able to resolve the issue by simply restarting the master services.
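That restart-based workaround would look roughly like this; the service names are from OCP 3.x, and the split api/controllers units apply only to HA masters (an assumption about the affected setups):

```shell
# Single master:
systemctl restart atomic-openshift-master
# HA masters run split units instead:
#   systemctl restart atomic-openshift-master-api
#   systemctl restart atomic-openshift-master-controllers

# Then restart the node that failed SDN startup:
systemctl restart atomic-openshift-node
```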

Attaching cases now; the output in the logs is nearly identical to that which started this case. I can open a separate bz instead if you want, although this seems very much the same issue.
Comment 16 Scott Dodson 2016-10-11 11:15:31 EDT
Re-closing; we'll follow up in the new bug that was filed.
