Created attachment 1212140 [details]
Output from ss -tmpie

Description of problem:
While performing a 3000-pod scale test of logging over 1 hour, the cluster became unstable, probably due to lb/master port exhaustion: too many connections to the API server.

Version-Release number of selected component (if applicable):
wget https://raw.githubusercontent.com/openshift/origin-aggregated-logging/enterprise/deployment/deployer.yaml -O ${ANSIBLE_TEMPLATE}

Steps to Reproduce:
1. Install logging across 100 nodes.
2. Run 30 pods per node, each logging at 256B/s.

Possibly related to: https://bugzilla.redhat.com/show_bug.cgi?id=1384626

Additional info:
192.1.5.34 (svt-i-1): logging-es-pod1
192.1.5.35 (svt-i-2): logging-es-pod2
192.1.8.77 : logging-fluentd

# environment data
http://perf-infra.ec2.breakage.org/incoming/svt-j/bos_efk_256ll_15kbm_30ppn_100nodes_1hour/log/

# pbench data
http://perf-infra.ec2.breakage.org/incoming/svt-j/bos_efk_256ll_15kbm_30ppn_100nodes_1hour/tools-default/
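For reference, one way to confirm port exhaustion on the lb/master is to count sockets per TCP state (the attached ss -tmpie output can be summarized the same way). A minimal sketch, assuming the standard iproute2 ss and awk are available on the host:

  # summarize socket counts per TCP state; a very large TIME-WAIT or
  # ESTAB count toward the API server port suggests port exhaustion
  ss -tan | awk 'NR>1 {count[$1]++} END {for (s in count) print s, count[s]}'

  # quick per-state overview
  ss -s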
Can we run the 3000-pod scale test of logging again, but allow unlimited connections to the API server? If the problem is running out of sockets, can we do some TCP tuning, like increasing the ephemeral port range, decreasing the TIME-WAIT interval, and making TIME-WAIT sockets reusable? See the sketch below.
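A minimal sketch of that tuning (the values are illustrative assumptions, not tested recommendations; note the TIME-WAIT interval itself is not directly tunable on Linux, so tcp_fin_timeout and tcp_tw_reuse are the usual knobs):

  # widen the ephemeral port range available for outgoing connections
  sysctl -w net.ipv4.ip_local_port_range="1024 65535"
  # shorten how long sockets linger in FIN-WAIT-2 before being reclaimed
  sysctl -w net.ipv4.tcp_fin_timeout=15
  # allow TIME-WAIT sockets to be reused for new outgoing connections
  sysctl -w net.ipv4.tcp_tw_reuse=1

To persist across reboots, the same keys can go in /etc/sysctl.conf (or a file under /etc/sysctl.d/) followed by sysctl -p.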
Sure. I'll be installing logging on a new cluster today and will get back with results as soon as I have some data.
(In reply to Ricardo Lourenco from comment #2) Do you have a summary of your results?
*** This bug has been marked as a duplicate of bug 1384626 ***
I wasn't able to use the cluster to reproduce this, as other people were working on it. Any further info will be tracked in bug 1384626.