What version of openvswitch RPMs do you have on this system?
openvswitch-2.4.0-2.el7_2.x86_64
Though it's also been pointed out that OpenShift is taking 15GB of RSS, so the OOM killing of ovs-vswitchd is probably a misguided attempt to free up memory.
This isn't "OVS" or "Registry" dying. This is an OOM. The OpenShift process is consuming 16G of memory, no? Sure, an OVS daemon eating 768M is a bit much; maybe we can address that. https://bugzilla.redhat.com/show_bug.cgi?id=1331590 already makes it much, much less likely that we will pick one of the OVS daemons. Can you explain exactly what the box in question is doing? The fact that the kernel is complaining about a SYN flood says it is getting a "metric crap ton" of new connections. Is that part of the cause of the OpenShift process eating 16G of RAM? Seriously, 16G? The machine only has 16G of RAM; of course something died. We just shouldn't blame OVS here...
How about the version of OpenShift? The bug I referenced as helping to make OVS more resilient is in 3.4, but clearly wasn't in place here...
This version of OVS is pretty old and doesn't contain the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1331590 for situations like this, AFAICT. Can you try updating to RHEL 7.3 at all?
This was a "burst install" of logging, meaning 980 nodes were labeled with 'oc label nodes --all logging-infra-fluentd=true'. This definitely isn't the best way of scaling up logging on such a large cluster, but IMO it is prone to happen. https://github.com/ewolinetz/origin-aggregated-logging/blob/b38a207c63128ec70936a5300cd751907f459be6/deployer/README.md#fluentd-1 In this case the logging deployer template will pull and schedule:
- 1 deployer image per ES instance (ES_CLUSTER_SIZE = 3)
- 1 curator deployer image
- 1 kibana deployer
- 980 fluentds (1 DaemonSet, 1 pod per cluster node)
While following the aggregated logging installation docs, a sysadmin with cluster-admin capability (node labeling) might inadvertently cause a similar situation, especially if they are installing/scaling up logging on a big enough cluster that may already be loaded with several other projects.
This isn't a logging issue - it is an issue with a thundering herd of daemonsets.
Tim - how would you like to proceed on this?
I'm working on a patch upstream for 1.6 cycle.
Still waiting on 1.6 rebase
1.6.1 is in, moving to MODIFIED.
If there is a fix for this going into kube 1.7 or 1.8 and we need to get the fix into OCP sooner, please let me know. Otherwise I'm going to lower the severity of this bug to get it off the blocker list.
- The real fix = non-subversion of the scheduler.
- Requirements for non-subversion = primitives for priority/preemption/guaranteed admission.
- Those primitives? Not merged in yet, so we need to wait for 1.7 for them to merge.
- For example, we need https://github.com/kubernetes/kubernetes/issues/22212, which is labelled next-candidate; I've pinged folks to label it for 1.7.
Conclusion: implementing daemonsets that are gated in a sane way, without completely pruning them of their high-priority properties, requires upstream primitives to be in place. Those primitives need to merge in 1.7, hence knocking the 'real' fix back to 1.8.
Ok, so we can either play UpcomingRelease ping pong on this every 3 weeks, or lower the severity. What do you want to do? And do you mind if I assign this to you, Jay?
Feel free to assign this to me.
- The only workaround I can think of is to make the burst configurable.
- I say we lower the priority, because I believe there should be some last-minute workarounds you could add with the pluggable admission controllers, which are pending for 1.7, i.e. have an admission controller slow down pod operations to prevent bursts. Crazy idea, but an easy workaround, I think.
You either have to lower the severity to low to get this off the blocker list every 3 weeks, or you end up playing ping pong. Up to you. Reassigned, thanks!
The referenced patch reduced the burst from 500 to 250. I am not inclined to carry a bug through Kubernetes 1.8 for this item. The referenced issue (https://github.com/kubernetes/kubernetes/issues/22212) is a feature request that did not make Kubernetes 1.7, and even if it makes Kubernetes 1.8, it will still likely be alpha. For the use-case, it sounds like the safest way to roll out a DaemonSet at this scale today is the following:
1. Write a DaemonSet with a targeted label selector.
2. Control the rate at which you add that label to existing nodes.
3. Once all 1000 nodes are updated, remove the label selector on the DaemonSet.
Is this not a fair enough workaround for the time being?
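That slow-roll can be sketched as a small shell loop. Everything here is illustrative: the function name `label_nodes_slowly` and the delay value are not part of any shipped tooling, just a sketch of steps 1-2 above.

```shell
#!/bin/sh
# Slow-roll sketch: label nodes one at a time with a pause between each,
# so the daemonset controller never sees all ~1000 nodes become
# schedulable at once. Function name and delay are illustrative only.
label_nodes_slowly() {
    delay="$1"; shift          # seconds to sleep between nodes
    for node in "$@"; do       # remaining args: node names (e.g. node/foo)
        oc label "$node" logging-infra-fluentd=true --overwrite
        sleep "$delay"
    done
}

# Usage against a live cluster:
#   label_nodes_slowly 0.5 $(oc get nodes -o name)
```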
That slow-roll is what we have documented: https://docs.openshift.com/container-platform/3.5/install_config/aggregate_logging_sizing.html#install-config-aggregate-logging-sizing-guidelines-large-cluster-installation For completeness there is a more recent issue here which has been active lately: https://github.com/kubernetes/kubernetes/issues/42002
Gentleman from IBM Klaus Ma has put together a design document for this for kube 1.8. Posting here for completeness. https://docs.google.com/document/d/10Ch3dhD88mnHYTq9q4jtX3e9e6gpndC78g5Ea6q4JY4/edit?usp=sharing
This has been partially fixed in the openshift-ansible roles/openshift_logging_fluentd role. For each node, we label the node, then sleep for 1/2 second by default. You can change the sleep time by setting the openshift_logging_fluentd_label_delay in ansible.
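For example, the per-node delay could be raised when running the logging playbook; the 2-second value and the playbook path below are illustrative, not prescriptive:

```shell
# Raise the per-node labeling delay from the 0.5s default to 2 seconds
# (adjust the playbook path to match your openshift-ansible checkout).
ansible-playbook -i hosts \
    -e openshift_logging_fluentd_label_delay=2 \
    playbooks/byo/openshift-cluster/openshift-logging.yml
```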
Marking verified for 3.6. There are still open issues such as https://bugzilla.redhat.com/show_bug.cgi?id=1467416, but 3.6 is improved: there is the reduction in the burst limit, additional sleeps in the fluentd deployment, and the new fluentd mux service to cut out the fluentd -> master-api-server connections.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716