Description of problem:

This is similar to https://github.com/openshift/origin/issues/2266. In short, on RHEL-based operating systems an admin can normally pass the `-w` flag to iptables to have it wait until it can acquire the xtables lock. Without that, the default behavior is simply to fail, which breaks SDN initialization if another application happens to hold the lock (such as the kube-proxy or anything else running on the system). The failure can be observed in the following logs:

-- Logs begin at mar. 2015-09-29 18:57:05 CEST, end at mar. 2015-09-29 22:13:33 CEST. --
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: F0929 18:57:20.355627 3392 flatsdn.go:47] SDN Node failed: exit status 1
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: Error: exit status 1
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: + iptables -t nat -A POSTROUTING -s 10.1.0.0/16 '!' -d 10.1.0.0/16 -j MASQUERADE
sept. 29 18:57:20 ose3-master.example.com systemd[1]: Unit openshift-node.service entered failed state.
sept. 29 18:57:20 ose3-master.example.com systemd[1]: openshift-node.service: main process exited, code=exited, status=255/n/a
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: + true
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: + iptables -t nat -D POSTROUTING -s 10.1.0.0/16 '!' -d 10.1.0.0/16 -j MASQUERADE
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: + ip route add 10.1.0.0/16 dev tun0 proto kernel scope link

In most cases the workaround is to manually restart the OpenShift node.

Version-Release number of selected component (if applicable):
openshift-3.0.2.0-0.git.5.ee06ab6.el7ose

How reproducible:
To reproduce, follow the steps mentioned here and then try launching the SDN:
https://github.com/kubernetes/kubernetes/issues/7370#issuecomment-97475070

Additional info:
In the case of the kube-proxy, Debian needed to be supported, so retry logic was added for when initialization fails. Origin added its own retry logic here, since the startup of the proxy is slightly different:
https://github.com/openshift/origin/blob/master/pkg/cmd/server/kubernetes/node.go#L173

If openshift-sdn is only intended to work on RHEL-based operating systems, this could likely be resolved simply by adding the `-w` flag to all the iptables commands in
https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/openshift/openshift-sdn/pkg/ovssubnet/controller/kube/bin/openshift-sdn-kube-subnet-setup.sh

I didn't see any direct calls to iptables in the multitenant plugin. It looks like the multitenant setup script sets `net.bridge.bridge-nf-call-iptables=0`, which means packets won't flow through iptables at all.
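To make the proposed change concrete, here is a minimal sketch of what adding `-w` to the MASQUERADE rules from the log above could look like. This is not the actual patch to openshift-sdn-kube-subnet-setup.sh; the variable name is made up and the real script derives its values from its arguments.

```bash
#!/bin/bash
# Sketch only: cluster_network_cidr is a placeholder, not the script's real variable.
cluster_network_cidr="10.1.0.0/16"

# Before: fails immediately if another process holds the xtables lock.
# iptables -t nat -A POSTROUTING -s "$cluster_network_cidr" '!' -d "$cluster_network_cidr" -j MASQUERADE

# After: wait for the lock instead of failing. Needs an iptables new enough to
# support -w, which is why the kube-proxy (having to support Debian) went the
# retry route instead.
iptables -w -t nat -A POSTROUTING -s "$cluster_network_cidr" '!' -d "$cluster_network_cidr" -j MASQUERADE
```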
This is fixed with the latest git origin.
Dan, would you mind linking a PR in this bug? I looked in origin and openshift-sdn and it wasn't obvious to me where the fix was. It helps us know when we've built the change and when to let QE know they can actually test it in OSE. I'm going to move it to MODIFIED for now just to let QE know it's not yet built.
This got fixed on the kubernetes side by https://github.com/kubernetes/kubernetes/pull/13386, which then got pulled into origin via https://github.com/kubernetes/kubernetes/pull/13386. However, it occurs to me now that openshift-sdn's scripts aren't using -w yet, so this isn't completely fixed.
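For anyone who wants to double-check that, a quick (and admittedly crude) way to spot iptables invocations in the vendored setup script that still lack `-w` might be:

```bash
# Path taken from the bug description; adjust to your origin checkout.
script=Godeps/_workspace/src/github.com/openshift/openshift-sdn/pkg/ovssubnet/controller/kube/bin/openshift-sdn-kube-subnet-setup.sh

# List iptables calls, then filter out the ones that already pass -w somewhere on the line.
grep -n 'iptables ' "$script" | grep -v -e '-w'
```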
(In reply to Dan Winship from comment #4)
> This got fixed on the kubernetes side by
> https://github.com/kubernetes/kubernetes/pull/13386 which then got pulled
> into origin via https://github.com/kubernetes/kubernetes/pull/13386.

Er, second link should have been https://github.com/openshift/origin/pull/4663
Fixed via https://github.com/openshift/openshift-sdn/pull/173. Note that this is not yet merged into origin, so you'd have to run sync-to-origin.sh from openshift-sdn git master to test it.
Because this bug is opened against the Enterprise component, per the verification workflow QE has to wait until the fix PR is merged into OSE, a new rpm package is built, and that rpm is included in a new puddle.
*** Bug 1269454 has been marked as a duplicate of this bug. ***
Checked with puddle 2015-10-17.1, using the steps in https://github.com/kubernetes/kubernetes/issues/7370#issuecomment-97475070. There is no such xtables error in the openshift-node log. Moving the bug to VERIFIED.
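For reference, the check boils down to creating xtables lock contention while the node starts and then looking for the error in the journal. A rough sketch follows; it is not the exact steps from the linked kubernetes comment, and the add/delete loop is just one arbitrary way to keep the lock busy (run as root):

```bash
# Hammer the xtables lock from another shell by repeatedly adding and
# removing a harmless dummy rule; failed attempts are ignored.
while true; do
    iptables -t nat -A POSTROUTING -s 192.0.2.0/24 -j RETURN 2>/dev/null
    iptables -t nat -D POSTROUTING -s 192.0.2.0/24 -j RETURN 2>/dev/null
done &
hammer_pid=$!

systemctl restart openshift-node

# With the fix in place this should come back empty.
journalctl -u openshift-node -b --no-pager | grep 'holding the xtables lock'

kill "$hammer_pid"
```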
This fix is available in OpenShift Enterprise 3.1.