Bug 1386693

Summary: Scale test of logging breaks OCP
Product: OpenShift Container Platform Reporter: Ricardo Lourenco <rlourenc>
Component: Logging Assignee: ewolinet
Status: CLOSED DUPLICATE QA Contact: Xia Zhao <xiazhao>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.3.0 CC: aos-bugs, jcantril, jeder, lmeyer, pportant, rlourenc, rmeggins
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-27 14:33:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Output from ss -tmpie (flags: none)

Description Ricardo Lourenco 2016-10-19 13:18:04 UTC
Created attachment 1212140 [details]
Output from ss -tmpie

Description of problem:

While performing a 3000-pod scale test of logging over 1 hour, the cluster became unstable, probably due to lb/master port exhaustion: too many connections to the API server.
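
One way to gauge whether the lb/master was running out of ephemeral ports (a sketch, not commands from the original test run; the attached ss -tmpie output was gathered separately):

ss -s                                        # socket summary, including the timewait count
ss -tan state time-wait | wc -l              # TCP sockets currently stuck in TIME_WAIT
cat /proc/sys/net/ipv4/ip_local_port_range   # ephemeral port range available to clients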


Version-Release number of selected component (if applicable):
wget https://raw.githubusercontent.com/openshift/origin-aggregated-logging/enterprise/deployment/deployer.yaml -O ${ANSIBLE_TEMPLATE}


Steps to Reproduce:
1. Install logging across 100 nodes.
2. Run 30 pods per node, each logging at 256 B/s (a minimal generator sketch follows).
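
A minimal log generator along these lines (hypothetical, not the SVT tooling actually used; shown only to illustrate the per-pod rate):

# Inside each test pod: print 255 'x' characters plus a newline (256 bytes) every second
while true; do
  head -c 255 /dev/zero | tr '\0' 'x'
  echo
  sleep 1
done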

Possibly related to: 
https://bugzilla.redhat.com/show_bug.cgi?id=1384626


Additional info:

192.1.5.34 (svt-i-1): logging-es-pod1
192.1.5.35 (svt-i-2): logging-es-pod2
192.1.8.77: logging-fluentd

# environment data
http://perf-infra.ec2.breakage.org/incoming/svt-j/bos_efk_256ll_15kbm_30ppn_100nodes_1hour/log/

# pbench data
http://perf-infra.ec2.breakage.org/incoming/svt-j/bos_efk_256ll_15kbm_30ppn_100nodes_1hour/tools-default/

Comment 1 Rich Megginson 2016-10-19 16:12:07 UTC
Can we run the 3000 pod scale test of logging again, but allow unlimited connections to the API server?
If the problem is running out of sockets, can we do some TCP tuning, like increasing the ephemeral port range, decreasing the TIME_WAIT interval, and allowing TIME_WAIT sockets to be reused?
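
A sketch of that kind of tuning on the lb/master hosts (values are illustrative assumptions, not settings tested against this cluster):

# Widen the ephemeral port range available for outbound connections
sysctl -w net.ipv4.ip_local_port_range="15000 65000"
# Shorten how long orphaned connections are held in FIN_WAIT_2
sysctl -w net.ipv4.tcp_fin_timeout=15
# Allow new outbound connections to reuse sockets left in TIME_WAIT
sysctl -w net.ipv4.tcp_tw_reuse=1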

Comment 2 Ricardo Lourenco 2016-10-20 13:10:59 UTC
Sure. I'll be installing logging in a new cluster today and will get back with results as soon as I have some data.

Comment 3 Luke Meyer 2016-10-27 14:25:43 UTC
(In reply to Ricardo Lourenco from comment #2)

Did you have a summary of your results?

Comment 4 Jeff Cantrill 2016-10-27 14:33:36 UTC

*** This bug has been marked as a duplicate of bug 1384626 ***

Comment 5 Ricardo Lourenco 2017-01-24 14:19:23 UTC
I wasn't able to use the cluster to reproduce this, as other people were working on it. Any further info will be updated/tracked on bug 1384626.