Description of problem:
There was a recent policy change; I guess the name is EgressNetworkPolicies. Because the reconcile task is executed after the node upgrade, the upgraded node can't be started and fails with 'node.go:339] error: SDN node startup failed: Could not get EgressNetworkPolicies:'. The affected node can be started after the policy has been reconciled, so maybe we need to adjust where the policy reconcile runs in the upgrade sequence.

Version-Release number of selected component (if applicable):
openshift-ansible:master

How reproducible:
always

Steps to Reproduce:
1. install v3.2
2. upgrade to v3.3
3. check the node status before the upgrade has finished.

Actual results:
Aug 12 15:14:26 upgrade-share-master-1.novalocal systemd[1]: Starting atomic-openshift-node.service...
Aug 12 15:14:26 upgrade-share-master-1.novalocal atomic-openshift-node[16701]: Failed to remove container (atomic-openshift-node): Error response from daemon: No such container: atomic-openshift-node

-bash-4.2# systemctl status atomic-openshift-node
● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: activating (start-post) since Fri 2016-08-12 15:14:26 UTC; 9s ago
  Process: 16691 ExecStop=/usr/bin/docker stop atomic-openshift-node (code=exited, status=1/FAILURE)
  Process: 16701 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-node (code=exited, status=1/FAILURE)
 Main PID: 16709 (docker-current); : 16710 (sleep)
   Memory: 8.0M
   CGroup: /system.slice/atomic-openshift-node.service
           ├─16709 /usr/bin/docker-current run --name atomic-openshift-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/atomic-openshift-node -v /:/rootfs:ro -e CONFIG_FILE=/etc/origin/n...
           └─control
             └─16710 /usr/bin/sleep 10

Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497236 16753 vnids.go:114] Associate netid 16 to namespace "ruby22"
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497443 16753 reflector.go:202] Starting reflector *api.NetNamespace (30m0s) from github.com/openshift...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497475 16753 reflector.go:253] Listing and watching *api.NetNamespace from github.com/openshift/origi...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497639 16753 reflector.go:202] Starting reflector *api.Service (30m0s) from github.com/openshift/orig...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.497665 16753 reflector.go:253] Listing and watching *api.Service from github.com/openshift/origin/pkg...stry.go:306
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.508068 16753 fs.go:139] Filesystem partitions: map[/dev/mapper/atomicos-root:{mountpoint:/rootfs majo...ker/devicem
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.517296 16753 subnets.go:225] Watch MODIFIED event for HostSubnet "openshift-116.lab.sjc.redhat.com"
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.517387 16753 controller.go:489] AddHostSubnetRules for openshift-116.lab.sjc.redhat.com (host: "opens....1.2.0/24")
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: I0812 15:14:34.517439 16753 ovs.go:37] Executing: /usr/bin/ovs-ofctl -O OpenFlow13 add-flow br0 table=1, priority=10...oto_table:5
Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: F0812 15:14:34.521810 16753 node.go:339] error: SDN node startup failed: Could not get EgressNetworkPolicies: User "...the cluster
Hint: Some lines were ellipsized, use -l to show in full.

Expected results:
The node is in service immediately after it is upgraded.

Additional info:
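For anyone triaging similar logs, here is a minimal sketch of pulling only the fatal entries out of journal output like the above. It assumes the node writes glog-style headers (severity letter, mmdd, time, thread id, file:line), which matches the lines in this report; the helper name fatal_entries and the regex are illustrative and not part of any OpenShift tooling.

```python
import re

# glog header: severity letter, mmdd, time, thread id, file:line, then the message.
GLOG_RE = re.compile(
    r"(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<src>[^\s:]+:\d+)\] (?P<msg>.*)"
)


def fatal_entries(lines):
    """Return (source, message) for every fatal (F-level) glog entry."""
    found = []
    for line in lines:
        match = GLOG_RE.search(line)
        if match and match.group("sev") == "F":
            found.append((match.group("src"), match.group("msg")))
    return found


# Two lines copied from the report; the second is truncated exactly as journald showed it.
sample = [
    'Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: '
    'I0812 15:14:34.497236 16753 vnids.go:114] Associate netid 16 to namespace "ruby22"',
    'Aug 12 15:14:34 upgrade-share-master-1.novalocal atomic-openshift-node[16709]: '
    'F0812 15:14:34.521810 16753 node.go:339] error: SDN node startup failed: '
    'Could not get EgressNetworkPolicies: User "...the cluster',
]

for source, message in fatal_entries(sample):
    print(source, "->", message)
```

Feeding it untruncated output (as the hint says, use -l) would show the full permission error instead of the ellipsized one.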
Sounds like a fix is incoming in core OpenShift to stop shutting down the node when this happens, but we can also avoid the need for a node restart after the reconcile by moving the reconcile between the master and node upgrades.

https://github.com/openshift/openshift-ansible/pull/2310
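To make the proposed ordering concrete, here is a rough sketch of the change as a list reorder. The phase names are hypothetical placeholders and do not correspond to actual playbook or task names in openshift-ansible; the only thing taken from this report is that the reconcile currently runs after the node upgrade.

```python
# Hypothetical phase names; the real playbooks are organised differently.
ORIGINAL_ORDER = ["upgrade_masters", "upgrade_nodes", "reconcile_policy"]


def move_before(phases, phase, anchor):
    """Return a copy of phases with `phase` placed immediately before `anchor`."""
    remaining = [p for p in phases if p != phase]
    index = remaining.index(anchor)
    return remaining[:index] + [phase] + remaining[index:]


# Reconcile now runs after the masters are upgraded and before any node restarts.
PROPOSED_ORDER = move_before(ORIGINAL_ORDER, "reconcile_policy", "upgrade_nodes")
print(PROPOSED_ORDER)
```

The point is only that the upgraded node process should not start until the new roles exist on the master.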
Created attachment 1191660
Upgrade logs for this case

Moving the reconcile between the master and node upgrades is not enough on its own; I still hit this error.
Verified; passes on atomic-openshift-utils-3.3.14.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1933
Sorry to reopen but we have customers still seeing this behavior.

One is reporting: RHEL 7.2 - openshift rpms 3.3.22 installed.

Another:
openshift-ansible-lookup-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-sdn-ovs-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-docs-3.3.28-1.git.0.762256b.el7.noarch
tuned-profiles-atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-filter-plugins-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-utils-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-node-3.3.0.34-1.git.0.83f306f.el7.x86_64
atomic-openshift-clients-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-roles-3.3.28-1.git.0.762256b.el7.noarch
atomic-openshift-master-3.3.0.34-1.git.0.83f306f.el7.x86_64
openshift-ansible-playbooks-3.3.28-1.git.0.762256b.el7.noarch

Another: openshift-ansible-playbooks-3.3.22-1.git.0.6c888c2.el7.noarch.

Another was able to resolve the issue by simply restarting the master services.

Attaching cases now; the output in the logs is nearly identical to that which started this case. I can open a separate bz instead if you want, although this seems very much the same issue.
Re-closing; we'll follow up in the new bug that was filed.