Bug 1386693

Summary: Scale test of logging breaks OCP
Product: OpenShift Container Platform Reporter: Ricardo Lourenco <rlourenc>
Component: Logging Assignee: ewolinet
Status: CLOSED DUPLICATE QA Contact: Xia Zhao <xiazhao>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.3.0 CC: aos-bugs, jcantril, jeder, lmeyer, pportant, rlourenc, rmeggins
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-27 14:33:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Output from ss -tmpie (flags: none)

Description Ricardo Lourenco 2016-10-19 13:18:04 UTC
Created attachment 1212140 [details]
Output from ss -tmpie

Description of problem:

While performing a 3000-pod scale test of logging over 1 hour, the cluster became unstable, probably due to lb/master port exhaustion: too many connections to the API server.
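
One way to gauge whether the lb/master was running out of ephemeral ports (a sketch, not commands from the original test run; the attached ss -tmpie output was gathered separately):

ss -s                                        # socket summary, including the timewait count
ss -tan state time-wait | wc -l              # TCP sockets currently stuck in TIME_WAIT
cat /proc/sys/net/ipv4/ip_local_port_range   # ephemeral port range available to clients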


Version-Release number of selected component (if applicable):
wget https://raw.githubusercontent.com/openshift/origin-aggregated-logging/enterprise/deployment/deployer.yaml -O ${ANSIBLE_TEMPLATE}


Steps to Reproduce:
1. Install logging across 100 nodes.
2. Run 30 pods per node, each logging at 256 B/s (a minimal generator sketch follows).
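
A minimal log generator along these lines (hypothetical, not the SVT tooling actually used; shown only to illustrate the per-pod rate):

# Inside each test pod: print 255 'x' characters plus a newline (256 bytes) every second
while true; do
  head -c 255 /dev/zero | tr '\0' 'x'
  echo
  sleep 1
done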

Possibly related to: 
https://bugzilla.redhat.com/show_bug.cgi?id=1384626


Additional info:

192.1.5.34 (svt-i-1): logging-es-pod1
192.1.5.35 (svt-i-2): logging-es-pod2
192.1.8.77: logging-fluentd

# environment data
http://perf-infra.ec2.breakage.org/incoming/svt-j/bos_efk_256ll_15kbm_30ppn_100nodes_1hour/log/

# pbench data
http://perf-infra.ec2.breakage.org/incoming/svt-j/bos_efk_256ll_15kbm_30ppn_100nodes_1hour/tools-default/

Comment 1 Rich Megginson 2016-10-19 16:12:07 UTC
Can we run the 3000 pod scale test of logging again, but allow unlimited connections to the API server?
If the problem is running out of sockets, can we do some TCP tuning, like increasing the ephemeral port range, decreasing the TIME_WAIT interval, and allowing TIME_WAIT sockets to be reused?
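
A sketch of that kind of tuning on the lb/master hosts (values are illustrative assumptions, not settings tested against this cluster):

# Widen the ephemeral port range available for outbound connections
sysctl -w net.ipv4.ip_local_port_range="15000 65000"
# Shorten how long orphaned connections are held in FIN_WAIT_2
sysctl -w net.ipv4.tcp_fin_timeout=15
# Allow new outbound connections to reuse sockets left in TIME_WAIT
sysctl -w net.ipv4.tcp_tw_reuse=1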

Comment 2 Ricardo Lourenco 2016-10-20 13:10:59 UTC
Sure. I'll be installing logging in a new cluster today and will get back with results as soon as I have some data.

Comment 3 Luke Meyer 2016-10-27 14:25:43 UTC
(In reply to Ricardo Lourenco from comment #2)

Did you have a summary of your results?

Comment 4 Jeff Cantrill 2016-10-27 14:33:36 UTC

*** This bug has been marked as a duplicate of bug 1384626 ***

Comment 5 Ricardo Lourenco 2017-01-24 14:19:23 UTC
I wasn't able to use the cluster to reproduce this, as other people were working on it. Any further info will be updated/tracked on bug 1384626.