Created attachment 1212140 [details]
Output from ss -tmpie

Description of problem:
While performing a 3000-pod scale test of logging over 1 hour, the cluster became unstable, probably due to lb/master port exhaustion: too many connections to the API server.

Version-Release number of selected component (if applicable):
wget https://raw.githubusercontent.com/openshift/origin-aggregated-logging/enterprise/deployment/deployer.yaml -O ${ANSIBLE_TEMPLATE}

Steps to Reproduce:
1. Install logging across 100 nodes.
2. Run 30 pods per node, each logging at 256B/s.

Possibly related to: https://bugzilla.redhat.com/show_bug.cgi?id=1384626

Additional info:
192.1.5.34 (svt-i-1): logging-es-pod1
192.1.5.35 (svt-i-2): logging-es-pod2
192.1.8.77 : logging-fluentd

# environment data
http://perf-infra.ec2.breakage.org/incoming/svt-j/bos_efk_256ll_15kbm_30ppn_100nodes_1hour/log/

# pbench data
http://perf-infra.ec2.breakage.org/incoming/svt-j/bos_efk_256ll_15kbm_30ppn_100nodes_1hour/tools-default/
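For reference, one way to confirm port exhaustion on the lb/master is to count sockets per TCP state (the attached ss -tmpie output can be summarized the same way). A minimal sketch, assuming the standard iproute2 ss and awk are available on the host:

  # summarize socket counts per TCP state; a very large TIME-WAIT or
  # ESTAB count toward the API server port suggests port exhaustion
  ss -tan | awk 'NR>1 {count[$1]++} END {for (s in count) print s, count[s]}'

  # quick per-state overview
  ss -s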
Can we run the 3000-pod scale test of logging again, but allow unlimited connections to the API server? If the problem is running out of sockets, can we do some TCP tuning, like increasing the ephemeral port range, decreasing the TIME-WAIT interval, and making TIME-WAIT sockets reusable? See the sketch below.
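A minimal sketch of that tuning (the values are illustrative assumptions, not tested recommendations; note the TIME-WAIT interval itself is not directly tunable on Linux, so tcp_fin_timeout and tcp_tw_reuse are the usual knobs):

  # widen the ephemeral port range available for outgoing connections
  sysctl -w net.ipv4.ip_local_port_range="1024 65535"
  # shorten how long sockets linger in FIN-WAIT-2 before being reclaimed
  sysctl -w net.ipv4.tcp_fin_timeout=15
  # allow TIME-WAIT sockets to be reused for new outgoing connections
  sysctl -w net.ipv4.tcp_tw_reuse=1

To persist across reboots, the same keys can go in /etc/sysctl.conf (or a file under /etc/sysctl.d/) followed by sysctl -p.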
Sure. I'll be installing logging on a new cluster today and will get back with results as soon as I have some data.
(In reply to Ricardo Lourenco from comment #2) Do you have a summary of your results?
*** This bug has been marked as a duplicate of bug 1384626 ***
I wasn't able to use the cluster to reproduce this, as other people were working on it. Any further info will be tracked in bug 1384626.