Bug 1969979

Summary: TCP Port exhaustion in events handling
Product: Service Telemetry Framework
Reporter: Chris Sibbitt <csibbitt>
Component: sg-core-container
Assignee: Chris Sibbitt <csibbitt>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Docs Contact: Joanne O'Flynn <joflynn>
Priority: high
Version: 1.3
CC: lmadsen, mrunge, pleimer
Target Milestone: z2
Keywords: Triaged, ZStream
Target Release: 1.3 (STF)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: sg-core-container-4.0.3-5
Doc Type: Bug Fix
Doc Text:
Before this update, sg-core stopped delivering events after 28231 because each request to Elasticsearch consumed a new outbound TCP port, eventually exhausting all available ports. With this update, sg-core closes and consumes each HTTP response body so that the default HTTP transport can re-use persistent TCP connections. As a result, outbound TCP ports are no longer exhausted and events are delivered without limit.
Last Closed: 2021-10-04 18:21:00 UTC
Type: Bug

Description Chris Sibbitt 2021-06-09 15:01:49 UTC
Description of problem:

Existing code delivers 28231 events and then stops working because it has
consumed all available outbound TCP ports.

"It is critical to both close the response body and to consume it,
in order to re-use persistent TCP connections in the default HTTP transport."

https://github.com/elastic/go-elasticsearch/blob/master/README.md#usage
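For reference, a minimal sketch of the pattern the README describes, using the go-elasticsearch client (the index name, document body, and loop are illustrative, not the actual sg-core code). Draining and closing each response body is what lets the default HTTP transport return the connection to its pool; without it, every request opens a new outbound connection, and delivery stalls once the pod's ephemeral ports run out (the 28232 connections observed below match the size of the default Linux local port range, 32768-60999).

    package main

    import (
    	"io"
    	"log"
    	"strings"

    	elasticsearch "github.com/elastic/go-elasticsearch/v7"
    )

    func main() {
    	// NewDefaultClient reads ELASTICSEARCH_URL or falls back to localhost:9200.
    	es, err := elasticsearch.NewDefaultClient()
    	if err != nil {
    		log.Fatalf("error creating client: %s", err)
    	}

    	for i := 0; i < 100000; i++ {
    		res, err := es.Index("sglogs-test", strings.NewReader(`{"msg":"event"}`))
    		if err != nil {
    			log.Fatalf("index error: %s", err)
    		}
    		// Consuming and closing the body allows the transport to re-use
    		// this persistent TCP connection for the next request instead of
    		// leaking it and opening a new outbound port.
    		io.Copy(io.Discard, res.Body)
    		res.Body.Close()
    	}
    }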


How reproducible: Always


Steps to Reproduce:
1. Send >30,000 events (rsyslog, or presumably ceilometer or collectd)
2. Observe that only 28231 make it to Elasticsearch
3. Observe that there are 28232 TCP connections from the sg pod to the ES pod


Actual results:

$ ./esquery /sglogs-osp_cloudops_02.2021.06.07/_count
{"count":28231,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}

$ oc rsh elasticsearch-es-default-0
sh-4.4$ ss -tnp | tail -2
ESTAB 0 0 [::ffff:10.129.3.21]:9200 [::ffff:10.128.3.36]:42332 users:(("java",pid=7,fd=1863)) 
ESTAB 0 0 [::ffff:10.129.3.21]:9200 [::ffff:10.128.3.36]:59351 users:(("java",pid=7,fd=24131))
sh-4.4$ ss -tnp | grep 10.128.3.36 | wc
 28232 169392 4545352


Expected results:
* One (or a few) TCP connections, and an uninterrupted flow of events


Additional info:

* This was found during testing of an unreleased logging feature, but the code appears to apply to all events
* A patch is actively being worked on and will be tested ASAP

Comment 8 Paul Leimer 2021-09-30 17:02:18 UTC
Tested the downstream container and I no longer see TCP port exhaustion.

Comment 13 errata-xmlrpc 2021-10-04 18:21:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Service Telemetry Framework 1.3.2 - Container Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3721