Bug 1417234
Summary: | Proxier and Docker Daemon - iptables: Resource temporarily unavailable. (exit status 4) | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Matthew Robson <mrobson> |
Component: | Networking | Assignee: | Dan Williams <dcbw> |
Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
Severity: | urgent | Docs Contact: | |
Priority: | medium | ||
Version: | 3.2.1 | CC: | aconole, aloughla, aos-bugs, atragler, bbennett, byount, crempel, dcbw, decarr, eparis, erich, hongli, jkaur, jokerman, jotodd, mchappel, mleitner, mmccomas, mrobson, nbhatt, pdwyer, rromerom, sreber, stwalter, sukulkar, wgordon |
Target Milestone: | --- | ||
Target Release: | 3.6.z | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: The iptables proxy was not properly locking its use of iptables.
Consequence: The iptables proxy could conflict with docker and the openshift-node process and cause failures to start containers.
Fix: The iptables proxy now locks its use of iptables.
Result: Pod creation failures due to improper locking of iptables should no longer occur.
|
Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2017-10-25 13:00:48 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1438597 | ||
Bug Blocks: |
Description
Matthew Robson
2017-01-27 16:04:30 UTC
What version of iptables do they have?

One is at: iptables-1.4.21-16.el7.x86_64
The other is at: iptables-1.4.21-17.el7.x86_64

Matt

Possibly linked to:
https://github.com/docker/docker/issues/30379
https://github.com/docker/docker/issues/12547

Possibly resolved through docker 1.12+, but nothing concrete...

Are they using a containerized install? https://github.com/docker/docker/issues/30379 suggests the bug is caused by running iptables inside containers using host networking.

No, these are standard OCP installs on RHEL 7.2/7.3. Per the logs, this is even coming from containers like fluentd. Networking is the default in these containers.

From what I have seen, this always coincided with an endpoint being set up via proxier.go. Ex:

Jan 26 12:59:28 atomic-openshift-node[31055]: I0126 12:59:28.524396 31055 proxier.go:484] Setting endpoints for "a-b-c/a-b-c-d:" to [10.1.2.32:9090]

You can see the KUBE-SERVICES chain in proxier.go:
https://github.com/kubernetes/kubernetes/blob/v1.3.0/pkg/proxy/iptables/proxier.go#L59

Ex:

Chain KUBE-SERVICES (1 references)
 pkts bytes target  prot opt in  out source     destination
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.36.114   /* has no endpoints */ tcp dpt:8081 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.10.124   /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.191.57   /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.235.127  /* has no endpoints */ tcp dpt:9090 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.106.65   /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.84.187   /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.110.32   /* openshift-infra/logging-kibana-ops: has no endpoints */ tcp dpt:443 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.188.191  /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.111.42   /* has no endpoints */ tcp dpt:80 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.182.109  /* has no endpoints */ tcp dpt:9090 reject-with icmp-port-unreachable
    0     0 REJECT  tcp  --  *   *   0.0.0.0/0  172.30.157.157  /* has no endpoints */ tcp dpt:9090 reject-with icmp-port-unreachable

dcbw: Can you please take a look at this and see if you have any ideas?

Can you please grab the output from iptables-save on both versions?

Would you be willing and able to turn on auditd on one of the nodes that can reproduce the problem (if it is not already running)? After you reproduce with auditd running, will you attach /var/log/audit/audit.log, or more specifically the output of `ausearch -i -m NETFILTER_CFG`?

Can you make sure that you do not have a default 'do not audit anything' rule in /etc/audit/audit.rules? A minimal rules file that would collect the best we can about who and how iptables is getting called would include:

-D
-a always,exit -b64 -S kill -F pid=1

Can you also give us a `ps -ef` listing from around the time of the reproduction while the audit rules were running (so we can match pids/ppids in the records to your daemons)? Hopefully it can help us track down whether some non-kube/docker process is adding/removing iptables rules and causing the issue...
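For context on where that exit status surfaces: the per-service rules above come from a check-then-write pattern against the iptables binary, where `iptables -C` normally exits 0 (rule exists) or 1 (rule missing), and anything else, such as exit status 4 ("Resource temporarily unavailable"), is an error. A minimal Go sketch of that pattern, illustrative only and not the actual pkg/util/iptables code; the chain and rule values below are hypothetical:

```go
// Sketch of the check-then-append pattern the proxier relies on.
package main

import (
	"fmt"
	"os/exec"
)

// ensureRule appends ruleSpec to chain in the nat table unless it already
// exists. iptables -C exits 0 if the rule exists, 1 if it does not, and
// with other codes (e.g. 4, "Resource temporarily unavailable") on errors
// such as failing to serialize against another iptables user.
func ensureRule(chain string, ruleSpec ...string) error {
	checkArgs := append([]string{"-t", "nat", "-C", chain}, ruleSpec...)
	if err := exec.Command("iptables", checkArgs...).Run(); err == nil {
		return nil // rule already present
	} else if ee, ok := err.(*exec.ExitError); ok && ee.ExitCode() != 1 {
		// Exit status 4 lands here: the check itself failed rather than
		// simply reporting "rule missing".
		return fmt.Errorf("checking rule: %w", err)
	}
	appendArgs := append([]string{"-t", "nat", "-A", chain}, ruleSpec...)
	return exec.Command("iptables", appendArgs...).Run()
}

func main() {
	// Hypothetical service rule similar to the KUBE-SERVICES entries above.
	err := ensureRule("KUBE-SERVICES",
		"-d", "172.30.36.114/32", "-p", "tcp", "--dport", "8081",
		"-m", "comment", "--comment", "has no endpoints",
		"-j", "REJECT")
	fmt.Println(err)
}
```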
Also, would the customer be willing to install an instrumented iptables RPM on the problem nodes so we can figure out where the EAGAIN is coming from? I've spent a bunch of time looking at the code and can't figure out how that error is showing up from a simple "check this rule" (-C) call. But I'm happy to build a testing RPM for them. Also, any chance you can get the docker RPM version in each of the cases?

There's a debugging RPM here:
http://people.redhat.com/dcbw/iptables-1.4.21-17.1.el7_3.x86_64.rpm

Install that on an affected node with:

rpm -Uhv iptables-1.4.21-17.1.el7_3.x86_64.rpm --force

and then run origin and reproduce the problem. Unfortunately, due to a docker coding decision (it assumes failure if iptables prints any output to stdout/stderr, even debugging output), this modified iptables will write debug output to /tmp/iptables-debug.txt. So when the problem has been reproduced, please get:

1) /tmp/iptables-debug.txt
2) journalctl -b -u atomic-openshift-node

and we'll see how much further we can push this. Thanks!

Thanks. I've analyzed the iptables save dump and found what I believe to be a bug in iptables that causes rule checks to fail with the above error. It appears that for jump targets that are kernel modules, the operation will sometimes be marked "changed", and even though it's a simple check, iptables will run through an entire table replacement operation, which is causing the EAGAIN/"Resource temporarily unavailable". I've started a query with our iptables experts to figure out if this is actually a bug or intended behavior.

To be clear, this doesn't solve the root cause of *why* there are iptables usage conflicts, but it does reduce the issues related to some of them.

*** Bug 1425205 has been marked as a duplicate of this bug. ***

Assuming the problem was duplicated during the audit, there are a couple of things I can glean. There seem to be 3 places that call into the kernel's net/netfilter/x_tables.c::xt_replace_table() function:

1. Lots of calls from node -> iptables-restore
2. Fewer but many calls from node -> iptables
3. Docker calls to clone(2) with CLONE_NEWNET also hit here

I also notice that I wouldn't get a NETFILTER_CFG record for something that got EAGAIN, so that kinda stinks. I could give some different rules to try to capture those calls as well, but I'm not certain we need it.

I'm wondering (and this is WILD speculation) if the race is between docker clone(2) calls and the iptables usage. Docker wouldn't be grabbing any file locks. It's not like docker is explicitly doing anything with iptables...

Aaron, any chance you can tell us what you think about my wild speculation? I haven't found the code path between clone() and xt_replace_table(), but the audit log tells me such a thing exists (well, or clone() hits do_replace_finish() in the ebtables code, but that seems much less likely). Maybe if I've got some time I'll try to use systemtap to force a stack trace when clone hits the iptables update function in the kernel to figure out that kernel code path...

Looking at the audit log and the netfilter_cfg log, none of them show anomalous "exit=XX" values, including no syscall returns for EAGAIN, leading me to wonder if the audit logs really do capture the error scenario. Will, any chance you can capture some audit logs that definitely have the error?

The rules won't capture the EAGAIN case. The NETFILTER_CFG audit message is generated AFTER the EAGAIN check. blech. But they do prove there were only 3 things that ever successfully got into that function.
The node got in there via two paths, and docker calling clone gets in there too (although I wildly guess that is in the new net namespace, not the host net namespace). To get the failures as well we could probably add:

-a exit,always -F arch=b32 -S setsockopt -F exit=EAGAIN
-a exit,always -F arch=b64 -S setsockopt -F exit=EAGAIN

If those generate way too much data (possible) I think we could limit it via -F a2=64, but I'd rather get more than less...

So it turns out iptables-restore is not properly locked against the 'iptables' binary, and the two can run at the same time, both modifying rules. Since in openshift with many services and endpoints iptables-restore takes a long time, we may run into conflicts from that angle.

Please ask the customer to install this package to test:
http://people.redhat.com/dcbw/iptables-1.4.21-17.2.el7_3.x86_64.rpm
with --force and --nodeps (if required). It does *not* contain the previous debugging patch; it only adds code to ensure that iptables-restore also grabs the lock. Let me know if this reduces or eliminates the problem on the nodes.

RPM updated with the "check" fix as well; same filename and location. Whenever you're able to test it out, let me know the results. Thanks!

***** NOTE: install these packages with --noscripts to ensure that the installation does not restart the iptables system service, which would overwrite OpenShift iptables rules and potentially screw up node iptables.

Any update on this? Anyone able to install the RPMs (with --noscripts!) and determine if the issue is still happening on that node?

I've filed a backport request for the relevant patches to the iptables package for RHEL 7 in https://bugzilla.redhat.com/show_bug.cgi?id=1438597. I've also filed a Kubernetes PR upstream to detect and use the new option: https://github.com/kubernetes/kubernetes/pull/43575. If/when that lands we'll need to backport to OpenShift, which we'll keep this bug open for.

Marking upcoming release since we need changes to iptables to merge into RHEL.

What would be the impact of doing our own file-based locking, so we at least won't fight with ourselves? A small amount of code on our part?

(In reply to Eric Paris from comment #57)
> What would be the impact of doing our own file-based locking, so we at
> least won't fight with ourselves? A small amount of code on our part?

It's not so much about fighting with ourselves. It's about openshift running iptables-restore, and then *other* stuff (docker, etc.) failing that we don't directly control. We can obviously serialize our own iptables-restore/iptables usage with a Big Giant Lock.

A variation of this idea is to duplicate the iptables locking logic in pkg/util/iptables right before calling iptables-restore. Unfortunately, there are two ways to do the logic: one for newer iptables (Fedora) and one for older iptables (RHEL). In no way is this ever, ever going to get upstream. It's going to have to be OpenShift + RHEL specific forever. Until Google hits it, and then of course somebody will add this hack in 4 hours directly to git :(

For a pretty ugly workaround in OpenShift, see: https://github.com/openshift/origin/pull/13845

Moved the interim iptables workaround over to upstream Kubernetes instead of OpenShift; if/when it gets merged into Kubernetes we'll backport to OpenShift. https://github.com/kubernetes/kubernetes/pull/44895

Not a regression, and this would only affect containerized openshift; the changes Online needs are in Origin already.
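For the record, the interim workaround discussed above amounts to grabbing the same locks the iptables binary takes before exec'ing iptables-restore. A rough Go sketch under two assumptions: the conventional lock locations (flock on /run/xtables.lock for newer iptables, a bind on the abstract "@xtables" socket for older builds) and an iptables-restore that does not yet take the lock itself. The authoritative implementation is the upstream Kubernetes change linked above.

```go
// Sketch of serializing iptables-restore against other iptables users by
// taking the locks the iptables binary itself uses. Lock locations here are
// the conventional ones and are assumptions for illustration.
package main

import (
	"fmt"
	"net"
	"os"
	"os/exec"
	"strings"
	"syscall"
)

// grabXtablesLocks takes the flock-based lock used by newer iptables builds
// and the abstract-socket "lock" used by older ones, returning a release func.
func grabXtablesLocks() (func(), error) {
	f, err := os.OpenFile("/run/xtables.lock", os.O_CREATE, 0600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	// Older iptables serializes callers by binding an abstract AF_UNIX
	// socket named "xtables" instead of using flock(2).
	l, err := net.ListenUnix("unix", &net.UnixAddr{Name: "@xtables", Net: "unix"})
	if err != nil {
		f.Close() // closing the fd also drops the flock
		return nil, err
	}
	return func() { l.Close(); f.Close() }, nil
}

func main() {
	release, err := grabXtablesLocks()
	if err != nil {
		fmt.Println("could not grab xtables locks:", err)
		return
	}
	defer release()

	// With both locks held, run iptables-restore. This only makes sense for
	// iptables-restore builds that do not take the lock themselves (the
	// pre-fix situation described in this bug). The input below is a no-op.
	cmd := exec.Command("iptables-restore", "--noflush")
	cmd.Stdin = strings.NewReader("*filter\nCOMMIT\n")
	out, err := cmd.CombinedOutput()
	fmt.Println(string(out), err)
}
```

A real implementation would retry the abstract-socket bind (and bound the flock wait) instead of failing immediately when another iptables invocation holds the lock.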
Upstream fix was merged to origin as https://github.com/openshift/origin/pull/14186

Outstanding work is to ensure that /run/xtables.lock is mapped into the node container when running containerized openshift-node.

Tested on OCP build v3.6.102 and iptables v1.4.21.

Use an infinite loop to keep creating services which have the same endpoint from the master side. On the node side, check the process which is opening the /run/xtables.lock file:

# while true ; do lsof +c0 /run/xtables.lock ; done

The output will contain lots of iptables-restore commands:

COMMAND         PID   USER FD  TYPE DEVICE SIZE/OFF NODE  NAME
iptables-restor 55274 root 3rW REG  0,18   0        53266 /run/xtables.lock
COMMAND         PID   USER FD  TYPE DEVICE SIZE/OFF NODE  NAME
iptables-restor 55274 root 3rW REG  0,18   0        53266 /run/xtables.lock
COMMAND         PID   USER FD  TYPE DEVICE SIZE/OFF NODE  NAME
iptables-restor 55298 root 3rW REG  0,18   0        53266 /run/xtables.lock
COMMAND         PID   USER FD  TYPE DEVICE SIZE/OFF NODE  NAME
iptables-restor 55298 root 3rW REG  0,18   0        53266 /run/xtables.lock
COMMAND         PID   USER FD  TYPE DEVICE SIZE/OFF NODE  NAME
iptables-restor 55320 root 3rW REG  0,18   0        53266 /run/xtables.lock
COMMAND         PID   USER FD  TYPE DEVICE SIZE/OFF NODE  NAME
iptables-restor 55340 root 3r  REG  0,18   0        53266 /run/xtables.lock

I am going to mark this bug as verified.

@dcbw Could you help check if my method is correct?

(In reply to Meng Bo from comment #72)
> I am going to mark this bug as verified.
>
> @dcbw Could you help check if my method is correct?

Yes. Also, we should verify in the openshift-node logs that we do not see any messages about "iptables: Resource temporarily unavailable (exit status 4)". Those would show up if the lock is not taken in some cases.

Verified in OCP 3.6.109 and iptables v1.4.21 using the same steps as in https://bugzilla.redhat.com/show_bug.cgi?id=1417234#c72; didn't see any node logs like "Resource temporarily unavailable (exit status 4)".

I have seen a few cases appearing in the past week or so that seem to be giving the same messages as this bug: "using network plugins cni: CNI request failed with status 400: Failed to ensure that nat chain POSTROUTING jumps to MASQUERADE . . . iptables: Resource temporarily unavailable". However, the impact seems higher than what is reported in this bz. Actually, in the most recently attached case it seems to be responsible for a pod restarting thousands of times over a couple of weeks!
Is this potentially a different issue that happens to have a similar symptom? This seems to be occurring on 3.4.

Updating severity to "urgent" to match multiple customer cases; the issue seems to cause pods to be unable to start up, or to restart very frequently.

mrobson and others hitting this issue: I got some collaboration from Robbb; it looks like the issues customers have been hitting in 3.4 or 3.5 may be related to the -w flag for waiting on the iptables lock. [1] shows when we made the decision back in 2015 to use the -w flag. [2] should be a backport fix for [3], which is where this is solved; some issue with the wait flag. It looks like iptables-1.4.21-18 *should* fix the issue. Can OCP engineering comment on whether there is anything that needs to be patched in OCP? Is this the same issue as originally reported in this bug, or are we clogging up the pipes? :)

[1] https://github.com/kubernetes/kubernetes/issues/7370#issuecomment-97512684
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1438597
[3] https://github.com/kubernetes/kubernetes/pull/43575

OK, the issues we've been hitting are different from the original issue in the bug. The issue in the bug is primarily logspam. The "new" issue is fixed in 3.6 with https://github.com/kubernetes/kubernetes/pull/43575 and I have opened https://bugzilla.redhat.com/show_bug.cgi?id=1481782 for 3.5. Sincere apologies for the bugspam!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
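For reference, the general idea behind the wait-flag handling discussed above (probe what the installed binary supports before appending the flag, as in the upstream PR) looks roughly like the Go sketch below, using plain `iptables -w` as the example. The version cutoff and the POSTROUTING/MASQUERADE check are illustrative assumptions only, since distro backports such as iptables-1.4.21-18 carry the flag at different version numbers.

```go
// Sketch of "detect and use the wait flag": probe the iptables version and
// only append -w when the binary is assumed to support it.
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
	"strings"
)

// iptablesVersion parses `iptables --version` output such as "iptables v1.4.21".
func iptablesVersion() ([3]int, error) {
	var v [3]int
	out, err := exec.Command("iptables", "--version").Output()
	if err != nil {
		return v, err
	}
	m := regexp.MustCompile(`v(\d+)\.(\d+)\.(\d+)`).FindStringSubmatch(string(out))
	if m == nil {
		return v, fmt.Errorf("unrecognized version output: %q", out)
	}
	for i := 0; i < 3; i++ {
		v[i], _ = strconv.Atoi(m[i+1])
	}
	return v, nil
}

func main() {
	v, err := iptablesVersion()
	if err != nil {
		fmt.Println(err)
		return
	}
	// Hypothetical check echoing the CNI error quoted in the comments above.
	args := []string{"-t", "nat", "-C", "POSTROUTING", "-j", "MASQUERADE"}
	// Assumed cutoff for illustration: -w (wait for the xtables lock)
	// appeared around iptables 1.4.20; distro backports differ.
	if v[0] > 1 || (v[0] == 1 && (v[1] > 4 || (v[1] == 4 && v[2] >= 20))) {
		args = append([]string{"-w"}, args...)
	}
	fmt.Println("iptables", strings.Join(args, " "))
}
```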