Bug 1417234 - Proxier and Docker Daemon - iptables: Resource temporarily unavailable. (exit status 4) [NEEDINFO]
Summary: Proxier and Docker Daemon - iptables: Resource temporarily unavailable. (exit...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.2.1
Hardware: All
OS: All
medium
urgent
Target Milestone: ---
: 3.6.z
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard:
: 1425205 (view as bug list)
Depends On: 1438597
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-01-27 16:04 UTC by Matthew Robson
Modified: 2017-10-25 13:00 UTC (History)
26 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the iptables proxy was not properly locking its use of iptables. Consequence: the iptables proxy could conflict with docker and the openshift-node process and cause a failure to start containers. Fix: the iptables proxy now locks its use of iptables. Result: pod creation failures due to improper locking of iptables should no longer occur
Clone Of:
Environment:
Last Closed: 2017-10-25 13:00:48 UTC
Target Upstream Version:
erich: needinfo? (jotodd)


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Origin (Github) 14186 0 None None None 2017-07-11 15:04:11 UTC
Red Hat Knowledge Base (Solution) 3059761 0 None None None 2017-05-30 08:59:43 UTC
Red Hat Product Errata RHBA-2017:3049 0 normal SHIPPED_LIVE OpenShift Container Platform 3.6, 3.5, and 3.4 bug fix and enhancement update 2017-10-25 15:57:15 UTC

Description Matthew Robson 2017-01-27 16:04:30 UTC
Description of problem:
Unsure if this is kubernetes side or docker side, but figured proxier was a good starting point..  It looks like concurrent access to iptables can lead to container failures as well as docker daemon failures during a restart with a 'Resource temporarily unavailable. (exit status 4)'

Jan 26 12:59:28 atomic-openshift-node[31055]: I0126 12:59:28.524396   31055 proxier.go:484] Setting endpoints for "a-b-c/a-b-c-d:" to [10.1.2.32:9090]
Jan 26 12:59:28 atomic-openshift-node[31055]: I0126 12:59:28.530055   31055 kubelet.go:2411] SyncLoop (UPDATE, "api"): "a-deployment-tjc-1-4apy8_my-project(344544e2-e196-11e6-aecc-005056b548f4)"
Jan 26 12:59:28 atomic-openshift-node[31055]: I0126 12:59:28.530134   31055 kubelet.go:2411] SyncLoop (UPDATE, "api"): "logging-fluentd-1-t5rfn_openshift-infra(d69cdcd3-a86a-11e6-aecc-005056b548f4)"
Jan 26 12:59:28 atomic-openshift-node[31055]: I0126 12:59:28.530161   31055 kubelet.go:2411] SyncLoop (UPDATE, "api"): "frontend-1-deploy_a-b-app(8652f333-dceb-11e6-aecc-005056b548f4)"
Jan 26 12:59:28 atomic-openshift-node[31055]: I0126 12:59:28.530287   31055 kubelet.go:2430] SyncLoop (PLEG): "a-dc-g-1-g3mh6_q-b-c-d(e4aed719-dd9c-11e6-aecc-005056b548f4)", event: &pleg.PodLifecycleEvent{ID:"e4aed719-dd9c-11e6-aecc-005056b548f4", Type:"ContainerDied", Data:"c3edfc24cd05cf4d8c386919be054a0669012ea2566f0f1ceecd322667fd0457"}
Jan 26 12:59:28 forward-journal[73810]: time="2017-01-26T12:59:28.560663530-05:00" level=info msg="{Action=create, LoginUID=4294967295, PID=31055}"
Jan 26 12:59:28 forward-journal[73810]: time="2017-01-26T12:59:28.707170112-05:00" level=warning msg="Failed to allocate and map port 1095-1095: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 1095 -j DNAT --to-destination 10.1.0.2:1095 ! -i lbr0: iptables: Resource temporarily unavailable.\n (exit status 4)"
Jan 26 12:59:28 forward-journal[73810]: time="2017-01-26T12:59:28.871004265-05:00" level=error msg="Handler for POST /containers/802a445d8f17921b8a2eac9113951d470859f774e70a1d245e62c9557c26d92e/start returned error: failed to create endpoint k8s_POD.6ec9fdbc_logging-fluentd-1-t5rfn_openshift-infra_d69cdcd3-a86a-11e6-aecc-005056b548f4_ef915621 on network bridge: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 1095 -j DNAT --to-destination 10.1.0.2:1095 ! -i lbr0: iptables: Resource temporarily unavailable.\n (exit status 4)"
Jan 26 12:59:28 forward-journal[73810]: time="2017-01-26T12:59:28.871078909-05:00" level=error msg="Handler for POST /containers/802a445d8f17921b8a2eac9113951d470859f774e70a1d245e62c9557c26d92e/start returned error: failed to create endpoint k8s_POD.6ec9fdbc_logging-fluentd-1-t5rfn_openshift-infra_d69cdcd3-a86a-11e6-aecc-005056b548f4_ef915621 on network bridge: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 1095 -j DNAT --to-destination 10.1.0.2:1095 ! -i lbr0: iptables: Resource temporarily unavailable.\n (exit status 4)"
Jan 26 12:59:28 atomic-openshift-node[31055]: E0126 12:59:28.872026   31055 manager.go:1858] Failed to create pod infra container: RunContainerError; Skipping pod "logging-fluentd-1-t5rfn_openshift-infra(d69cdcd3-a86a-11e6-aecc-005056b548f4)"
Jan 26 12:59:28 atomic-openshift-node[31055]: E0126 12:59:28.872114   31055 pod_workers.go:138] Error syncing pod d69cdcd3-a86a-11e6-aecc-005056b548f4, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): failed to create endpoint k8s_POD.6ec9fdbc_logging-fluentd-1-t5rfn_openshift-infra_d69cdcd3-a86a-11e6-aecc-005056b548f4_ef915621 on network bridge: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 1095 -j DNAT --to-destination 10.1.0.2:1095 ! -i lbr0: iptables: Resource temporarily unavailable.\n (exit status 4)\n"

Two instances of docker daemon failing to start.  In this case, the docker daemon was restarted by the ansible scaleup in progress.

Jan 26 12:59:28 atomic-openshift-node[26076]: I0126 12:59:28.525083   26076 proxier.go:484] Setting endpoints for "a-b-c/a-b-c-d:" to [10.1.2.32:9090]
Jan 26 12:59:28 forward-journal[41380]: time="2017-01-26T12:59:28.707006438-05:00" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: Failed to Setup IP tables: Unable to enable SKIP DNAT rule:  (iptables failed: iptables --wait -t nat -I DOCKER -i lbr0 -j RETURN: iptables: Resource temporarily unavailable.\n (exit status 4))"

Jan 26 12:39:23 atomic-openshift-node[31055]: I0126 12:39:23.532096   31055 proxier.go:484] Setting endpoints for "a-b/a-b-c-d:" to [10.1.2.10:9090]
Jan 26 12:39:23 forward-journal[64458]: time="2017-01-26T12:39:23.723613811-05:00" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: Failed to program NAT chain: Failed to inject docker in PREROUTING chain: iptables failed: iptables --wait -t nat -A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER: iptables: Resource temporarily unavailable.\n (exit status 4)"


Version-Release number of selected component (if applicable):
3.2.1.9

How reproducible:
Random

Steps to Reproduce:
1.
2.
3.

Actual results:
Container Fails
Docker Daemon Fails to Start

Expected results:
Everything starts or retires.

Additional info:

Comment 3 Ben Bennett 2017-01-31 19:48:08 UTC
What version of iptables do they have?

Comment 4 Matthew Robson 2017-01-31 19:51:44 UTC
One is at: Version: iptables-1.4.21-16.el7.x86_64

Other is at: Version: iptables-1.4.21-17.el7.x86_64

Matt

Comment 5 Matthew Robson 2017-01-31 19:55:36 UTC
Possibly linked to:

https://github.com/docker/docker/issues/30379

https://github.com/docker/docker/issues/12547

Possibly resolved through docker 1.12+, but nothing concrete...

Comment 6 Dan Winship 2017-02-01 15:26:14 UTC
Are they using a containerized install? https://github.com/docker/docker/issues/30379 suggests the bug is caused by running iptables inside --host=network containers.

Comment 7 Matthew Robson 2017-02-01 20:37:44 UTC
No, these are standard OCP installs on RHEL 7.2/7.3.  Per the logs, this is even coming from containers like fluentd. Network is default in these containers.

From what I have seen, this always coincided when an endpoint being setup via proxier.go..

Ex: Jan 26 12:59:28 atomic-openshift-node[31055]: I0126 12:59:28.524396   31055 proxier.go:484] Setting endpoints for "a-b-c/a-b-c-d:" to [10.1.2.32:9090]

You can see the kube-services chain services;

https://github.com/kubernetes/kubernetes/blob/v1.3.0/pkg/proxy/iptables/proxier.go#L59

Ex:

Chain KUBE-SERVICES (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.36.114        /* has no endpoints */ tcp dpt:8081 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.10.124        /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.191.57        /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.235.127       /* has no endpoints */ tcp dpt:9090 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.106.65        /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.84.187        /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.110.32        /* openshift-infra/logging-kibana-ops: has no endpoints */ tcp dpt:443 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.188.191       /* has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.111.42        /* has no endpoints */ tcp dpt:80 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.182.109       /* has no endpoints */ tcp dpt:9090 reject-with icmp-port-unreachable
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.157.157       /* has no endpoints */ tcp dpt:9090 reject-with icmp-port-unreachable

Comment 11 Ben Bennett 2017-02-09 12:15:19 UTC
dcbw: Can you please take a look at this and see if you have any ideas?

Comment 12 Ben Bennett 2017-02-09 20:27:49 UTC
Can you please grab the output from iptables-save on both versions please?

Comment 14 Eric Paris 2017-02-10 23:37:07 UTC
Would you be willing and able to turn on auditd on one of the nodes able to reproduce the problem (if it is not already running). After you reproduce with auditd running will you attach the /var/log/audit/audit.log? Or more specifically `ausearch -i -m NETFILTER_CFG`


Can you make sure that you do no have a default 'do not audit anything' rule in /etc/audit/audit.rules? A minimal rules file that would collect the best we can about who and how iptables is getting called would include:

-D
-a always,exit -b64 -S kill -F pid=1

Can you also give us a `ps -ef` listing from around the time of the reproduction while the audit rules were running (so we can match pids/ppids in the records to your daemons?)

Hopefully it can help us track down if some non-kube/docker process is adding/removing iptables rules and causing the issue...

Comment 15 Dan Williams 2017-02-13 04:55:43 UTC
Also, would the customer be willing to install an instrumented iptables RPM on the problem nodes so we can figure out where the EAGAIN is coming from?  I've spent a bunch of time looking at the code and can't figure out how that error is showing up from a simple "check this rule -C" call.  But I'm happy to build a testing RPM for them.

Comment 19 Dan Williams 2017-02-13 19:29:41 UTC
Also, any chance you can get the docker RPM version in each of the cases?

Comment 21 Dan Williams 2017-02-15 05:26:23 UTC
There's a debugging RPM here:

http://people.redhat.com/dcbw/iptables-1.4.21-17.1.el7_3.x86_64.rpm

install that on an affected node with:

rpm -Uhv iptables-1.4.21-17.1.el7_3.x86_64.rpm --force

and then run origin and reproduce the problem.  Unfortunately, due to a docker coding decision (eg, it assumes failure if iptables prints any output to stdout/stderr, even debugging output), this modified iptables will write debug output to /tmp/iptables-debug.txt.

So when the problem has been reproduced, please get:

1) /tmp/iptables-debug.txt
2) journalctl -b -u atomic-openshift-node

and we'll see how much further we can push this.  Thanks!

Comment 33 Dan Williams 2017-02-24 00:31:58 UTC
Thanks.  I've analyzed the iptables save dump and found what I believe to be a bug in iptables that causes rule checks to fail with the above error.

It appears that for jump targets that are kernel modules the operation will sometimes be marked "changed" and even though it's a simple check, iptables will run through an entire table replacement operation which is causing the EAGAIN/Resource temporarily unavailable.

I've started a query with our iptables experts to figure out if this is actually a bug or intended behavior.

Comment 35 Dan Williams 2017-02-26 03:40:40 UTC
To be clear, this doesn't solve the root-cause of *why* there are iptables useage conflicts, but it does reduce the issues related to some of them.

Comment 37 Eric Paris 2017-03-01 19:15:11 UTC
*** Bug 1425205 has been marked as a duplicate of this bug. ***

Comment 39 Eric Paris 2017-03-01 19:46:26 UTC
Assuming the problem was duplicated during the audit there there are a couple things I can glean.

There seem to be 3 places that call into the kernel's net/netfilter/x_tables.c::xt_replace_table() function.

1. Lots of calls from node -> iptables-restore
2. Fewer but many calls from node -> iptables
3. Docker calls to clone(2) with CLONE_NEWNET also hit here

I also notice that I wouldn't get a NETFILTER_CFG record for something that got EGAIN, so that kinda stinks. I could give some more different ruleset to try to capture those calls as well, but not certain we need it.

I'm wondering (and this is WILD speculation) if the race is between docker clone(2) calls and the iptables usage. Docker wouldn't be grabbing any file locks. Its not like docker is explicitly doing anything with iptables...

Comment 40 Eric Paris 2017-03-02 00:32:56 UTC
Aaron any chance you can see what you think about my wild speculation. I haven't found the code path between clone() and xt_replace_tables() but the audit log tells me such a thing exists (well, or clone() hits do_replace_finish() in the eb_tables code, but that seems much less likely).

Maybe if I've got some time I'll try to use systemtap to force a stack trace when clone hits the iptables update function in the kernel to figure out that kernel code path...

Comment 41 Dan Williams 2017-03-06 20:52:13 UTC
Looking at the audit log and the netfilter_cfg log, none of them show anomalous "exit=XX" values including no syscall returns for EAGAIN, leading me to wonder if the audit logs really do capture the error scenario.

Will, any chance you can capture some audit logs that definitely have the error?

Comment 44 Eric Paris 2017-03-09 19:47:03 UTC
The rules won't capture the EAGAIN case. The NETFILTER_CFG audit message is generated AFTER the EAGAIN check. blech.

But they do prove there were only 3 things that ever successfully got into that function. The node got in there via two paths and docker calling clone gets in there too (although i wildly guess that is in the new net namespace, not the host net namespace)

To get the failures as well we could probably add:

-a exit,always -F arch=b32 -S setsockopt -F exit=EAGAIN
-a exit,always -F arch=b64 -S setsockopt -F exit=EAGAIN

If those generate way too much data (possible) I think we could limit it via -F a2=64, but I'd rather get more than less...

Comment 45 Dan Williams 2017-03-15 20:40:31 UTC
So it turns out iptables-restore is not properly locked against the 'iptables' binary and the two can run at the same time, both modifying rules.  Since in openshift with many services and endpoints, iptables-restore takes a long amount of time, we may run into conflicts from that angle.

Please ask the customer to install this package to test:

http://people.redhat.com/dcbw/iptables-1.4.21-17.2.el7_3.x86_64.rpm

with --force and --nodeps (if required). It does *not* contain the previous debugging patch, it only adds code to ensure that iptables-restore also grabs the lock.  Let me know if this reduces or eliminates the problem on the nodes.

Comment 47 Dan Williams 2017-03-23 15:53:49 UTC
RPM updated with the "check" fix as well; same filename and location.  Whenever you're able to test it out, let me know the results.  Thanks!

Comment 51 Dan Williams 2017-03-29 18:23:14 UTC
***** NOTE: install these packages with --no-scripts to ensure that it does not restart the iptables system service, which will overwrite OpenShift iptables rules and potentially screw up node iptables.

Comment 52 Dan Williams 2017-04-01 04:30:52 UTC
Any update on this?  Anyone able to install the RPMs (with --noscripts!) and determine if the issue is still happening on that node?

Comment 55 Dan Williams 2017-04-04 21:48:00 UTC
I've filed a backport request for the relevant patches to the iptables package for RHEL7 in https://bugzilla.redhat.com/show_bug.cgi?id=1438597.

I've also filed a Kubernetes PR upstream to detect and use the new option as https://github.com/kubernetes/kubernetes/pull/43575.  If/when that lands we'll need to backport to OpenShift, which we'll keep this bug open for.

Comment 56 Ben Bennett 2017-04-19 14:05:20 UTC
Marking upcoming release since we need changes to iptables to merge in to RHEL.

Comment 57 Eric Paris 2017-04-19 14:11:38 UTC
what would be our impact of doing out own file based locking? so we at least won't fight with ourselves? Small amount of code on our part?

Comment 58 Dan Williams 2017-04-19 14:53:28 UTC
(In reply to Eric Paris from comment #57)
> what would be our impact of doing out own file based locking? so we at least
> won't fight with ourselves? Small amount of code on our part?

It's not so much about fighting with ourselves.  It's about openshift running iptables-restore, and then *other* stuff (docker, etc) failing that we don't directly control.

We can obviously serialize our own iptables-restore/iptables usage with a Big Giant Lock.

A variation of this idea is to duplicate the iptables locking logic in pkg/util/iptables right before calling iptables-restore.  Unfortunately, there are two ways to do the logic; one for newer iptables (Fedora) and one for older iptables (RHEL).

In no way is this ever, ever going to get upstream.  It's going to have to be OpenShift + RHEL specific forever.  Until Google hits it, and then of course somebody will add this hack in 4 hours directly to git :(

Comment 59 Dan Williams 2017-04-20 19:57:28 UTC
For an pretty ugly workaround in OpenShift, see:

https://github.com/openshift/origin/pull/13845

Comment 62 Dan Williams 2017-04-25 03:18:19 UTC
Moved the interim iptables workaround over to upstream Kubernetes instead of OpenShift; if/when it gets merged into Kubernetes we'll backport to OpenShift.

https://github.com/kubernetes/kubernetes/pull/44895

Comment 66 Ben Bennett 2017-06-02 18:26:05 UTC
Not a regression, and would only affect containerized openshift, the changes Online needs are in Origin already.

Comment 67 Dan Williams 2017-06-02 18:32:14 UTC
Upstream fix was merged to origin as https://github.com/openshift/origin/pull/14186

Outstanding work is to ensure that /run/xtables.lock is mapped into the node container when running containerized openshift-node.

Comment 72 Meng Bo 2017-06-13 11:21:34 UTC
Tested on OCP build v3.6.102
And iptables v1.4.21

Use an infinite loop to keep creating services which have same endpoint from master side.

On the node side, check the process which is opening the /run/xtables.lock file.

# while true ; do lsof +c0 /run/xtables.lock ; done

The output will contain lots of iptables-restore command

COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 55274 root    3rW  REG   0,18        0 53266 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 55274 root    3rW  REG   0,18        0 53266 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 55298 root    3rW  REG   0,18        0 53266 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 55298 root    3rW  REG   0,18        0 53266 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 55320 root    3rW  REG   0,18        0 53266 /run/xtables.lock
COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
iptables-restor 55340 root    3r   REG   0,18        0 53266 /run/xtables.lock

I am going to mark this bug as verified.

@dcbw Could you help check if my method is correct?

Comment 73 Dan Williams 2017-06-19 21:26:22 UTC
(In reply to Meng Bo from comment #72)
> Tested on OCP build v3.6.102
> And iptables v1.4.21
> 
> Use an infinite loop to keep creating services which have same endpoint from
> master side.
> 
> On the node side, check the process which is opening the /run/xtables.lock
> file.
> 
> # while true ; do lsof +c0 /run/xtables.lock ; done
> 
> The output will contain lots of iptables-restore command
> 
> COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> iptables-restor 55274 root    3rW  REG   0,18        0 53266
> /run/xtables.lock
> COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> iptables-restor 55274 root    3rW  REG   0,18        0 53266
> /run/xtables.lock
> COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> iptables-restor 55298 root    3rW  REG   0,18        0 53266
> /run/xtables.lock
> COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> iptables-restor 55298 root    3rW  REG   0,18        0 53266
> /run/xtables.lock
> COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> iptables-restor 55320 root    3rW  REG   0,18        0 53266
> /run/xtables.lock
> COMMAND           PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> iptables-restor 55340 root    3r   REG   0,18        0 53266
> /run/xtables.lock
> 
> I am going to mark this bug as verified.
> 
> @dcbw Could you help check if my method is correct?

Yes.  Also we should verify in the openshift-node logs that we do not see any messages from about iptables resource temporarily unavailable (exit status 4).  That would show up if the lock is not taken in some cases.

Comment 74 Hongan Li 2017-06-20 09:11:56 UTC
verified in OCP 3.6.109 and iptables v1.4.21 using the same step in  https://bugzilla.redhat.com/show_bug.cgi?id=1417234#c72,

didn't see any node logs like "Resource temporarily unavailable (exit status 4)"

Comment 78 Steven Walter 2017-08-11 19:56:42 UTC
I have seen a few cases appearing in the past week or so that seem to be giving the same messages as this bug

"using network plugins cni: CNI request failed with status 400: Failed to ensure that nat chain POSTROUTING jumps to MASQUERADE . . . iptables: Resource temporarily unavilable"

However the impact seems higher than what seems to be reported in this bz. Actually in the most recently attached case it seems to be responsible for a pod restarting thousands of times over a couple weeks! Is this potentially a different issue that happens to have a similar symptom? This seems to be occurring on 3.4

Comment 80 Steven Walter 2017-08-14 22:01:07 UTC
Updating severity to "urgent" to match  multiple customer cases; issue seems to cause pods to be unable to start up, or to restart very frequently

Comment 82 Steven Walter 2017-08-15 14:04:52 UTC
mrobson and others hitting this issue, I got some collaboration from Robbb; looks like the issues customers have been hitting in 3.4 or 3.5 may be related to the -w flag for waiting on iptables rules.

[1] shows when we made decision back in 2015 to use the -w flag
[2] should be a backport fix for [3] which is when this is solved; some issue with the wait flag. It looks like iptables-1.4.21-18 *should* fix the issue.

Can OCP engineering make any comments as to if there is anything that needs to be patched in OCP? Is this the same issue as originally reported in this bug or are we clogging up the pipes? :)

[1] https://github.com/kubernetes/kubernetes/issues/7370#issuecomment-97512684
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1438597
[3] https://github.com/kubernetes/kubernetes/pull/43575

Comment 83 Steven Walter 2017-08-15 17:19:25 UTC
Ok, the issues we've been hitting are different than the original issue in the bug. The issue in the bug is primarily logspam. The "new" issue is fixed in 3.6 with https://github.com/kubernetes/kubernetes/pull/43575 and I have opened https://bugzilla.redhat.com/show_bug.cgi?id=1481782 for 3.5. Sincere apologies for the bugspam!

Comment 88 errata-xmlrpc 2017-10-25 13:00:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049


Note You need to log in before you can comment on or make changes to this bug.