Bug 1404860 - daemonset burst on a large node-count cluster can cause DoS / platform instability
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.4.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Jay Vyas
QA Contact: Mike Fiedler
Whiteboard: aos-scalability-34
Depends On:
Blocks:
Reported: 2016-12-14 16:15 EST by Jeremy Eder
Modified: 2017-08-16 15 EDT
CC List: 16 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Daemonsets would attempt to update 500 nodes at the same time. Consequence: This could cause a large impact on the registry and network as 500 nodes attempted to pull the same new image at the same time. Fix: Reduce the burst to 250 at a time. Result: Halves the networking and registry impact.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 01:17:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers:
Tracker ID: Red Hat Product Errata RHEA-2017:1716
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Container Platform 3.6 RPM Release Advisory
Last Updated: 2017-08-10 05:02:50 EDT

Comment 3 Dan Williams 2016-12-16 12:43:09 EST
What version of openvswitch RPMs do you have on this system?
Comment 4 Jeremy Eder 2016-12-16 12:46:32 EST
openvswitch-2.4.0-2.el7_2.x86_64
Comment 5 Dan Williams 2016-12-16 12:51:19 EST
Though it's also been pointed out that OpenShift is taking 15GB of RSS, so the OOM killing of ovs-vswitchd is probably a misguided attempt to free up memory.
Comment 6 Eric Paris 2016-12-16 12:53:04 EST
This isn't "OVS" or "Registry" dying. This is an OOM. The OpenShift process is consuming 16G or memory. No? Sure an OVS daemon eating 768M is a bit, maybe we can address that.

https://bugzilla.redhat.com/show_bug.cgi?id=1331590

That already makes it much, much less likely that we will pick one of the OVS daemons.


Can you explain exactly what the box in question is doing? The fact that the kernel is complaining about a SYN flood says it is getting a "metric crap ton" of new connections. Is that part of the cause of the OpenShift process eating 16G of RAM? Seriously, 16G? The machine only has 16G of RAM; of course something died.

We just shouldn't blame OVS here...
Comment 7 Eric Paris 2016-12-16 12:54:45 EST
What about the version of OpenShift? The bug I referenced as helping to make OVS more resilient is in 3.4, but clearly wasn't in place here...
Comment 8 Dan Williams 2016-12-16 12:59:43 EST
This version of OVS is pretty old and doesn't contain the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1331590 for situations like this, AFAICT.  Can you try updating to RHEL 7.3 at all?
Comment 10 Ricardo Lourenco 2016-12-19 17:07:30 EST
This was a "burst install" of logging, which means 980 nodes where labeled with 'oc label nodes --all logging-infra-fluentd=true'. This definitely isn't the best way of scaling up logging on such a large cluster but IMO it is prone to happen.


https://github.com/ewolinetz/origin-aggregated-logging/blob/b38a207c63128ec70936a5300cd751907f459be6/deployer/README.md#fluentd-1


In this case the logging deployer template will pull and schedule:
- 1 deployer image per ES instance (ES_CLUSTER_SIZE = 3)
- 1 curator deployer image
- 1 kibana deployer
- 980 fluentd pods (1 DaemonSet, 1 pod per cluster node)


While following the aggregate logging installation docs, a sysadmin with cluster-admin capability (node labeling) might inadvertently cause a similar situation, especially if they are installing or scaling up logging on a large enough cluster that is already loaded with several other projects.
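
As a rough illustration (not from the original report), the fan-out is easy to estimate up front: every schedulable node that matches the DaemonSet's node selector gets one fluentd pod, so the bulk label turns the node count directly into simultaneous image pulls.

    # how many pods (and image pulls) the bulk label will trigger at once
    oc get nodes --no-headers | wc -l
    # the single command that kicks all of them off at the same time
    oc label nodes --all logging-infra-fluentd=true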
Comment 11 Rich Megginson 2017-01-06 11:36:55 EST
This isn't a logging issue - it is an issue with a thundering herd of daemonsets.
Comment 12 Derek Carr 2017-01-18 10:02:24 EST
Tim - how would you like to proceed on this?
Comment 13 Timothy St. Clair 2017-01-18 10:55:15 EST
I'm working on a patch upstream for 1.6 cycle.
Comment 15 Andy Goldstein 2017-03-22 15:42:20 EDT
Still waiting on 1.6 rebase
Comment 16 Andy Goldstein 2017-04-17 10:17:12 EDT
Still waiting on 1.6 rebase
Comment 17 Andy Goldstein 2017-05-02 14:34:18 EDT
1.6.1 is in, moving to MODIFIED
Comment 19 Andy Goldstein 2017-05-02 15:18:44 EDT
If there is a fix for this going into kube 1.7 or 1.8 and we need to get the fix into OCP sooner, please let me know. Otherwise I'm going to lower the severity of this bug to get it off the blocker list.
Comment 20 Jay Vyas 2017-05-02 15:39:50 EDT
- The real fix = non-subversion of the scheduler.

- Requirements for non-subversion = primitives for priority/preemption/guaranteed admission.

- Those primitives? Not merged in yet, so we need to wait for 1.7 for them to merge.

- For example, we need https://github.com/kubernetes/kubernetes/issues/22212, which is labelled for next-candidate; I've pinged folks to label it for 1.7.


Conclusion: Implementing daemonsets that are gated in a sane way, without completely pruning them of their high-priority properties, requires upstream primitives to be in place. Those primitives need to merge in 1.7, hence knocking us back to 1.8 for the 'real' fix.
Comment 21 Andy Goldstein 2017-05-02 15:43:50 EDT
Ok, so we can either play UpcomingRelease ping pong on this every 3 weeks, or lower the severity. What do you want to do? And do you mind if I assign this to you, Jay?
Comment 22 Jay Vyas 2017-05-02 15:48:49 EDT
Feel free to assign this to me.

- The only workaround I can think of is to make the burst configurable.

- I say we lower priority, because I believe there should be some last-minute workarounds you could add with the pluggable admission controllers, which are pending for 1.7, i.e. have an admission controller slow down pod operations to prevent bursts. Crazy idea, but an easy workaround I think.
Comment 23 Andy Goldstein 2017-05-02 15:51:14 EDT
You either have to lower the severity to low to get this off the blocker list every 3 weeks, or you end up playing ping pong. Up to you. Reassigned, thanks!
Comment 26 Derek Carr 2017-06-02 12:06:07 EDT
The referenced patch reduced the burst from 500 to 250.

I am not inclined to carry a bug through Kubernetes 1.8 for this item. The referenced issue (https://github.com/kubernetes/kubernetes/issues/22212), which is a feature request, did not make Kubernetes 1.7, and even if it makes Kubernetes 1.8, it will still likely be alpha.

For the use case, it sounds like the safest way to roll out a DaemonSet at this scale today is to do the following:

1. write a daemonset with a targeted label selector
2. control the rate at which you add that label to existing nodes
3. once all 1000 nodes are updated, remove the label selector on the daemonset.

Is this not a fair enough workaround for the time being?
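
For reference, a minimal sketch of that slow roll using the fluentd label from comment 10 (the sleep interval is arbitrary, not from this bug):

    # the DaemonSet keeps a nodeSelector such as logging-infra-fluentd=true,
    # so pods are only created on nodes that already carry the label
    for node in $(oc get nodes -o name); do
      oc label "$node" logging-infra-fluentd=true --overwrite
      sleep 2   # throttle so only a few nodes pull the new image at a time
    done
    # once all nodes are labeled and the pods are running, the selector can be
    # removed from the DaemonSet (step 3 above) or simply left in place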
Comment 27 Jeremy Eder 2017-06-02 13:47:42 EDT
That slow-roll is what we have documented:

https://docs.openshift.com/container-platform/3.5/install_config/aggregate_logging_sizing.html#install-config-aggregate-logging-sizing-guidelines-large-cluster-installation

For completeness there is a more recent issue here which has been active lately:
https://github.com/kubernetes/kubernetes/issues/42002
Comment 28 Jeremy Eder 2017-06-07 09:01:29 EDT
Klaus Ma from IBM has put together a design document for this for Kubernetes 1.8. Posting here for completeness.

https://docs.google.com/document/d/10Ch3dhD88mnHYTq9q4jtX3e9e6gpndC78g5Ea6q4JY4/edit?usp=sharing
Comment 29 Rich Megginson 2017-06-12 13:33:13 EDT
This has been partially fixed in the openshift-ansible roles/openshift_logging_fluentd role. For each node, we label the node, then sleep for 1/2 second by default. You can change the sleep time by setting openshift_logging_fluentd_label_delay in Ansible.
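
For example, a hypothetical invocation that slows the roll-out further (only the variable name comes from this comment; the inventory and playbook paths are placeholders, and the value is assumed to be in seconds like the 0.5 default):

    # label nodes 2 seconds apart instead of the default 1/2 second
    ansible-playbook -i <inventory> <openshift-logging-playbook>.yml \
      -e openshift_logging_fluentd_label_delay=2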
Comment 30 Mike Fiedler 2017-07-06 10:04:31 EDT
Marking verified for 3.6. There are still open issues such as https://bugzilla.redhat.com/show_bug.cgi?id=1467416, but 3.6 is improved. There is the reduction in the burst limit, additional sleeps in the fluentd deployment, and the new fluentd mux service to cut out the fluentd -> master-api-server connections.
Comment 32 errata-xmlrpc 2017-08-10 01:17:28 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
