Bug 1267670 - Applications holding the xtables lock can block SDN initialization
Summary: Applications holding the xtables lock can block SDN initialization
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.0.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Dan Winship
QA Contact: Meng Bo
URL:
Whiteboard:
Duplicates: 1269454 (view as bug list)
Depends On:
Blocks: 1267746
 
Reported: 2015-09-30 15:40 UTC by Brenton Leanhardt
Modified: 2019-08-15 05:34 UTC
CC List: 6 users

Fixed In Version: atomic-openshift-3.0.2.901-0.git.61.568adb6.el7aos.x86_64
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-23 14:24:59 UTC
Target Upstream Version:


Attachments


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) 1977453 | 0 | None | None | None | Never

Description Brenton Leanhardt 2015-09-30 15:40:45 UTC
Description of problem:

This is similar to https://github.com/openshift/origin/issues/2266.  In short, on RHEL-based operating systems an admin can normally pass the `-w` flag to iptables to have it wait until it can acquire the xtables lock.  Without it, the default behavior is simply to fail, which breaks SDN initialization if another application happens to hold the lock (such as kube-proxy or anything else running on the system).
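As an illustration (not from the bug report itself): iptables serializes concurrent runs on a file lock, and without `-w` a contender exits immediately instead of waiting. The behavior can be sketched with flock(1) on a scratch file; the real lock file path varies by iptables version, so this only simulates the contention:

```shell
#!/bin/sh
# Simulate xtables lock contention with flock(1) on a scratch file.
lockfile=$(mktemp)

# "Another app" grabs the lock and holds it for two seconds.
flock "$lockfile" sleep 2 &
holder=$!
sleep 0.2   # give the background holder time to acquire the lock

# Non-blocking attempt: fails immediately, like iptables without -w.
flock -n "$lockfile" true; nowait_rc=$?

# Blocking attempt: waits for the holder to exit, like iptables -w.
flock "$lockfile" true; wait_rc=$?

wait "$holder"
echo "without -w: exit $nowait_rc; with -w: exit $wait_rc"
rm -f "$lockfile"
```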

The failure can be observed by the following logs:

-- Logs begin at mar. 2015-09-29 18:57:05 CEST, end at mar. 2015-09-29 22:13:33 CEST. --
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: F0929 18:57:20.355627    3392 flatsdn.go:47] SDN Node failed: exit status 1
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: Error: exit status 1
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: + iptables -t nat -A POSTROUTING -s 10.1.0.0/16 '!' -d 10.1.0.0/16 -j MASQUERADE
sept. 29 18:57:20 ose3-master.example.com systemd[1]: Unit openshift-node.service entered failed state.
sept. 29 18:57:20 ose3-master.example.com systemd[1]: openshift-node.service: main process exited, code=exited, status=255/n/a
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: + true
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: + iptables -t nat -D POSTROUTING -s 10.1.0.0/16 '!' -d 10.1.0.0/16 -j MASQUERADE
sept. 29 18:57:20 ose3-master.example.com openshift-node[3392]: + ip route add 10.1.0.0/16 dev tun0 proto kernel scope link

In most cases the workaround is to manually restart the OpenShift Node.

Version-Release number of selected component (if applicable):
openshift-3.0.2.0-0.git.5.ee06ab6.el7ose

How reproducible:

To reproduce this you could follow the steps mentioned here and then try launching the SDN:

https://github.com/kubernetes/kubernetes/issues/7370#issuecomment-97475070

Additional info:

In the case of kube-proxy, Debian needed to be supported, so retry logic was added for when initialization fails.  Origin added its own retry logic here, since the startup of the proxy is slightly different: https://github.com/openshift/origin/blob/master/pkg/cmd/server/kubernetes/node.go#L173
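The referenced fixes are retry logic in Go; a minimal shell sketch of the same idea (a hypothetical `retry` helper, not the actual Origin code) looks like this:

```shell
#!/bin/sh
# retry ATTEMPTS CMD [ARGS...] -- rerun CMD until it succeeds or ATTEMPTS
# tries are exhausted, sleeping briefly between tries.
retry() {
    attempts=$1; shift
    try=1
    until "$@"; do
        [ "$try" -ge "$attempts" ] && return 1
        try=$((try + 1))
        sleep 1
    done
    return 0
}

# Example: keep retrying an iptables rule until the lock is free.
# retry 5 iptables -t nat -A POSTROUTING -s 10.1.0.0/16 ! -d 10.1.0.0/16 -j MASQUERADE
```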

If openshift-sdn is only intended to work on RHEL-based operating systems, this could likely be resolved simply by adding the `-w` flag to all the iptables commands in https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/openshift/openshift-sdn/pkg/ovssubnet/controller/kube/bin/openshift-sdn-kube-subnet-setup.sh
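If that route were taken, one quick (hypothetical) way to patch the script would be a sed pass that rewrites each iptables invocation to include `-w`. Sketched here against a stand-in file rather than the real setup script:

```shell
#!/bin/sh
# Hypothetical: prepend -w to every iptables call in a setup script.
# Demonstrated on a stand-in file; point it at the real script to use it.
script=$(mktemp)
cat > "$script" <<'EOF'
iptables -t nat -A POSTROUTING -s 10.1.0.0/16 ! -d 10.1.0.0/16 -j MASQUERADE
echo other lines are left alone
EOF

# Insert -w right after the command name, skipping lines that already have it.
sed -i -e '/^iptables -w/!s/^iptables /iptables -w /' "$script"
head -n 1 "$script"
```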

I didn't see any direct calls to iptables in the multitenant plugin.  It looks like the multitenant setup script sets `net.bridge.bridge-nf-call-iptables=0` which means packets won't flow through iptables at all.
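For reference, the sysctl mentioned above can be inspected and set as follows (config fragment; requires root, and the knob only exists while the kernel bridge module is loaded):

```shell
# Show whether bridged traffic is passed to iptables (1) or bypasses it (0).
sysctl net.bridge.bridge-nf-call-iptables

# Disable it, as the multitenant setup script reportedly does:
sysctl -w net.bridge.bridge-nf-call-iptables=0
```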

Comment 2 Dan Winship 2015-10-02 13:42:00 UTC
This is fixed with the latest git origin.

Comment 3 Brenton Leanhardt 2015-10-02 14:32:27 UTC
Dan, would you mind linking a PR in this bug?  I looked in origin and openshift-sdn and it wasn't obvious to me where the fix was.  It helps us know when we've built the change and when to let QE know they can actually test it in OSE.

I'm going to move it to MODIFIED for now just to let QE know it's not yet built.

Comment 4 Dan Winship 2015-10-02 15:06:24 UTC
This got fixed on the kubernetes side by https://github.com/kubernetes/kubernetes/pull/13386 which then got pulled into origin via https://github.com/kubernetes/kubernetes/pull/13386. However it occurs to me now that openshift-sdn's scripts aren't using -w yet, so this isn't completely fixed.

Comment 5 Dan Winship 2015-10-02 15:07:14 UTC
(In reply to Dan Winship from comment #4)
> This got fixed on the kubernetes side by
> https://github.com/kubernetes/kubernetes/pull/13386 which then got pulled
> into origin via https://github.com/kubernetes/kubernetes/pull/13386.

Er, second link should have been https://github.com/openshift/origin/pull/4663

Comment 6 Dan Winship 2015-10-07 14:51:42 UTC
Fixed via https://github.com/openshift/openshift-sdn/pull/173. Note that this is not yet merged into origin, so you'd have to run sync-to-origin.sh from openshift-sdn git master to test it.

Comment 7 Johnny Liu 2015-10-13 05:16:16 UTC
Because this bug is filed against an Enterprise component, according to the verification workflow QE has to wait until the fix PR is merged into OSE, a new rpm package is built, and that rpm is included in a new puddle.

Comment 8 Josep 'Pep' Turro Mauri 2015-10-14 11:31:48 UTC
*** Bug 1269454 has been marked as a duplicate of this bug. ***

Comment 10 Meng Bo 2015-10-20 11:06:37 UTC
Checked with puddle 2015-10-17.1

With the steps in https://github.com/kubernetes/kubernetes/issues/7370#issuecomment-97475070

there is no such xtables error in the openshift-node log.

Moving the bug to verified.

Comment 11 Brenton Leanhardt 2015-11-23 14:24:59 UTC
This fix is available in OpenShift Enterprise 3.1.

