Bug 1969979 - TCP Port exhaustion in events handling
Summary: TCP Port exhaustion in events handling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Service Telemetry Framework
Classification: Red Hat
Component: sg-core-container
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z2
Target Release: 1.3 (STF)
Assignee: Chris Sibbitt
QA Contact: Leonid Natapov
Docs Contact: Joanne O'Flynn
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-09 15:01 UTC by Chris Sibbitt
Modified: 2021-10-04 18:21 UTC
CC: 3 users

Fixed In Version: sg-core-container-4.0.3-5
Doc Type: Bug Fix
Doc Text:
Before this update, event delivery stopped after 28231 events because each HTTP request to Elasticsearch consumed a new outbound TCP port until none remained. With this update, the response body is closed and consumed so that persistent TCP connections in the default HTTP transport are reused. As a result, events are delivered continuously without exhausting the available outbound TCP ports.
Clone Of:
Environment:
Last Closed: 2021-10-04 18:21:00 UTC
Target Upstream Version:
Embargoed:




Links
GitHub infrawatch/sg-core pull 50 (open): Fix TCP port exhaustion (last updated 2021-08-09 19:16:29 UTC)
Red Hat Product Errata RHBA-2021:3721 (last updated 2021-10-04 18:21:02 UTC)

Description Chris Sibbitt 2021-06-09 15:01:49 UTC
Description of problem:

Existing code delivers 28231 events and then stops working because it has
consumed all available outbound TCP ports.

"It is critical to both close the response body and to consume it,
in order to re-use persistent TCP connections in the default HTTP transport."

https://github.com/elastic/go-elasticsearch/blob/master/README.md#usage
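For illustration, here is a minimal Go sketch of that pattern using the go-elasticsearch client. The index name and payload are hypothetical, and the actual sg-core patch (PR 50) may differ in detail; the point is draining and closing the response body so the default transport can reuse the connection.

package main

import (
	"io"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v7"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatalf("error creating the client: %s", err)
	}

	// Hypothetical index name and document, for illustration only.
	res, err := es.Index("sglogs-example", strings.NewReader(`{"event":"test"}`))
	if err != nil {
		log.Fatalf("error indexing document: %s", err)
	}

	// Drain and close the body so the transport returns the connection
	// to its idle pool instead of opening a new TCP port per request.
	io.Copy(io.Discard, res.Body)
	res.Body.Close()
}

The same rule applies anywhere the default net/http transport is used: a connection only becomes reusable after the response body has been fully read and closed.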


How reproducible: Always


Steps to Reproduce:
1. Send >30,000 events (rsyslog, or presumably ceilometer or collectd)
2. Observe that only 28231 make it to Elasticsearch
3. Observe that there are 28232 TCP connections from the sg pod to the ES pod


Actual results:

$ ./esquery /sglogs-osp_cloudops_02.2021.06.07/_count
{"count":28231,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}

$ oc rsh elasticsearch-es-default-0
sh-4.4$ ss -tnp | tail -2
ESTAB 0 0 [::ffff:10.129.3.21]:9200 [::ffff:10.128.3.36]:42332 users:(("java",pid=7,fd=1863)) 
ESTAB 0 0 [::ffff:10.129.3.21]:9200 [::ffff:10.128.3.36]:59351 users:(("java",pid=7,fd=24131))
sh-4.4$ ss -tnp | grep 10.128.3.36 | wc
 28232 169392 4545352


Expected results:
* One (or a few) TCP connections, and an uninterrupted flow of events


Additional info:

* This was found during testing of an unreleased logging feature, but the code appears to apply to all events
* A patch is actively being worked on and will be tested ASAP

Comment 8 Paul Leimer 2021-09-30 17:02:18 UTC
Tested the downstream container and I no longer see TCP port exhaustion.

Comment 13 errata-xmlrpc 2021-10-04 18:21:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Service Telemetry Framework 1.3.2 - Container Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3721

