Bug 1404860 - daemonset burst on a large node-count cluster can cause DoS / platform instability
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.4.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Jay Vyas
QA Contact: Mike Fiedler
Whiteboard: aos-scalability-34
Depends On:
Blocks:
Reported: 2016-12-14 16:15 EST by Jeremy Eder
Modified: 2017-08-16 15 EDT
CC List: 16 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Daemonsets would attempt to update 500 nodes at the same time. Consequence: This could cause a large impact on the registry and network as 500 nodes attempted to pull the same new image at the same time. Fix: Reduce the burst to 250 at a time. Result: Halves the networking and registry impact.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 01:17:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers:
Tracker ID: Red Hat Product Errata RHEA-2017:1716
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Container Platform 3.6 RPM Release Advisory
Last Updated: 2017-08-10 05:02:50 EDT

Comment 3 Dan Williams 2016-12-16 12:43:09 EST
What version of openvswitch RPMs do you have on this system?
Comment 4 Jeremy Eder 2016-12-16 12:46:32 EST
openvswitch-2.4.0-2.el7_2.x86_64
Comment 5 Dan Williams 2016-12-16 12:51:19 EST
Though it's also been pointed out that OpenShift is taking 15GB of RSS, so the OOM killing of ovs-vswitchd is probably a misguided attempt to free up memory.
Comment 6 Eric Paris 2016-12-16 12:53:04 EST
This isn't "OVS" or "Registry" dying. This is an OOM. The OpenShift process is consuming 16G or memory. No? Sure an OVS daemon eating 768M is a bit, maybe we can address that.

https://bugzilla.redhat.com/show_bug.cgi?id=1331590

That already makes it much, much less likely that we will pick one of the OVS daemons.


Can you explain exactly what the box in question is doing? The fact that the kernel is complaining about a SYN flood says it is getting a "metric crap ton" of new connections. Is that part of the cause of the OpenShift process eating 16G of RAM? Seriously, 16G? The machine only has 16G of RAM; of course something died.

We just shouldn't blame OVS here...
Comment 7 Eric Paris 2016-12-16 12:54:45 EST
What about the version of OpenShift? The bug I referenced as helping to make OVS more resilient is in 3.4, but clearly wasn't in place here...
Comment 8 Dan Williams 2016-12-16 12:59:43 EST
This version of OVS is pretty old and doesn't contain the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1331590 for situations like this, AFAICT.  Can you try updating to RHEL 7.3 at all?
Comment 10 Ricardo Lourenco 2016-12-19 17:07:30 EST
This was a "burst install" of logging, which means 980 nodes where labeled with 'oc label nodes --all logging-infra-fluentd=true'. This definitely isn't the best way of scaling up logging on such a large cluster but IMO it is prone to happen.


https://github.com/ewolinetz/origin-aggregated-logging/blob/b38a207c63128ec70936a5300cd751907f459be6/deployer/README.md#fluentd-1


In this case the logging deployer template will pull and schedule:
- 1 deployer image per ES instance (ES_CLUSTER_SIZE = 3)
- 1 curator deployer image
- 1 kibana deployer
- 980 fluentd pods (1 DaemonSet, 1 pod per cluster node)


While following the aggregate logging installation docs, a sysadmin with cluster-admin capability (node labeling) might inadvertently cause a similar situation, especially if they are installing or scaling up logging on a large enough cluster that is already loaded with several other projects.
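
As a rough illustration (not from the original report), the fan-out is easy to estimate up front: every schedulable node that matches the DaemonSet's node selector gets one fluentd pod, so the bulk label turns the node count directly into simultaneous image pulls.

    # how many pods (and image pulls) the bulk label will trigger at once
    oc get nodes --no-headers | wc -l
    # the single command that kicks all of them off at the same time
    oc label nodes --all logging-infra-fluentd=true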
Comment 11 Rich Megginson 2017-01-06 11:36:55 EST
This isn't a logging issue - it is an issue with a thundering herd of daemonsets.
Comment 12 Derek Carr 2017-01-18 10:02:24 EST
Tim - how would you like to proceed on this?
Comment 13 Timothy St. Clair 2017-01-18 10:55:15 EST
I'm working on a patch upstream for 1.6 cycle.
Comment 15 Andy Goldstein 2017-03-22 15:42:20 EDT
Still waiting on 1.6 rebase
Comment 16 Andy Goldstein 2017-04-17 10:17:12 EDT
Still waiting on 1.6 rebase
Comment 17 Andy Goldstein 2017-05-02 14:34:18 EDT
1.6.1 is in, moving to MODIFIED
Comment 19 Andy Goldstein 2017-05-02 15:18:44 EDT
If there is a fix for this going into kube 1.7 or 1.8 and we need to get the fix into OCP sooner, please let me know. Otherwise I'm going to lower the severity of this bug to get it off the blocker list.
Comment 20 Jay Vyas 2017-05-02 15:39:50 EDT
- The real fix = non-subversion of the scheduler.

- Requirements for non-subversion = primitives for priority/preemption/guaranteed admission.

- Those primitives? Not merged in yet, so we need to wait for 1.7 for them to merge.

- For example, we need https://github.com/kubernetes/kubernetes/issues/22212, which is labelled for next-candidate; I've pinged folks to label it for 1.7.


Conclusion: Implementing daemonsets that are gated in a sane way, without completely pruning them of their high-priority properties, requires upstream primitives to be in place. Those primitives need to merge in 1.7, hence knocking us back to 1.8 for the 'real' fix.
Comment 21 Andy Goldstein 2017-05-02 15:43:50 EDT
Ok, so we can either play UpcomingRelease ping pong on this every 3 weeks, or lower the severity. What do you want to do? And do you mind if I assign this to you, Jay?
Comment 22 Jay Vyas 2017-05-02 15:48:49 EDT
Feel free to assign this to me.

- The only workaround I can think of is to make the burst configurable.

- I say we lower priority, because I believe there should be some last-minute workarounds you could add with the pluggable admission controllers, which are pending for 1.7, i.e. have an admission controller slow down pod operations to prevent bursts. Crazy idea, but an easy workaround I think.
Comment 23 Andy Goldstein 2017-05-02 15:51:14 EDT
You either have to lower the severity to low to get this off the blocker list every 3 weeks, or you end up playing ping pong. Up to you. Reassigned, thanks!
Comment 26 Derek Carr 2017-06-02 12:06:07 EDT
The referenced patch reduced the burst from 500 to 250.

I am not inclined to carry a bug through Kubernetes 1.8 for this item. The referenced issue (https://github.com/kubernetes/kubernetes/issues/22212), which is a feature request, did not make Kubernetes 1.7, and even if it makes Kubernetes 1.8, it will still likely be alpha.

For the use case, it sounds like the safest way to roll out a DaemonSet at this scale today is to do the following:

1. write a daemonset with a targeted label selector
2. control the rate at which you add that label to existing nodes
3. once all 1000 nodes are updated, remove the label selector on the daemonset.

Is this not a fair enough workaround for the time being?
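
For reference, a minimal sketch of that slow roll using the fluentd label from comment 10 (the sleep interval is arbitrary, not from this bug):

    # the DaemonSet keeps a nodeSelector such as logging-infra-fluentd=true,
    # so pods are only created on nodes that already carry the label
    for node in $(oc get nodes -o name); do
      oc label "$node" logging-infra-fluentd=true --overwrite
      sleep 2   # throttle so only a few nodes pull the new image at a time
    done
    # once all nodes are labeled and the pods are running, the selector can be
    # removed from the DaemonSet (step 3 above) or simply left in place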
Comment 27 Jeremy Eder 2017-06-02 13:47:42 EDT
That slow-roll is what we have documented:

https://docs.openshift.com/container-platform/3.5/install_config/aggregate_logging_sizing.html#install-config-aggregate-logging-sizing-guidelines-large-cluster-installation

For completeness there is a more recent issue here which has been active lately:
https://github.com/kubernetes/kubernetes/issues/42002
Comment 28 Jeremy Eder 2017-06-07 09:01:29 EDT
Klaus Ma from IBM has put together a design document for this for Kubernetes 1.8. Posting here for completeness.

https://docs.google.com/document/d/10Ch3dhD88mnHYTq9q4jtX3e9e6gpndC78g5Ea6q4JY4/edit?usp=sharing
Comment 29 Rich Megginson 2017-06-12 13:33:13 EDT
This has been partially fixed in the openshift-ansible roles/openshift_logging_fluentd role. For each node, we label the node, then sleep for 1/2 second by default. You can change the sleep time by setting openshift_logging_fluentd_label_delay in Ansible.
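
For example, a hypothetical invocation that slows the roll-out further (only the variable name comes from this comment; the inventory and playbook paths are placeholders, and the value is assumed to be in seconds like the 0.5 default):

    # label nodes 2 seconds apart instead of the default 1/2 second
    ansible-playbook -i <inventory> <openshift-logging-playbook>.yml \
      -e openshift_logging_fluentd_label_delay=2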
Comment 30 Mike Fiedler 2017-07-06 10:04:31 EDT
Marking verified for 3.6. There are still open issues such as https://bugzilla.redhat.com/show_bug.cgi?id=1467416, but 3.6 is improved. There is the reduction in the burst limit, additional sleeps in the fluentd deployment, and the new fluentd mux service to cut out the fluentd -> master-api-server connections.
Comment 32 errata-xmlrpc 2017-08-10 01:17:28 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
