Bug 1947989 - pmproxy hangs and consume 100% cpu if the redis datasource in grafana is configured with TLS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pcp
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 8.5
Assignee: Mark Goodwin
QA Contact: Jan Kurik
Docs Contact: Apurva Bhide
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-09 17:03 UTC by Michele Casaburo
Modified: 2021-11-09 21:05 UTC
CC: 5 users

Fixed In Version: pcp-5.3.1-2.el8
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-09 17:49:39 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Product Errata RHBA-2021:4171 (last updated 2021-11-09 17:49:56 UTC)

Comment 18 Mark Goodwin 2021-05-19 10:56:28 UTC
After more investigation, this is an issue with parallel https RESTAPI client requests. The following changes to qa/1457, which issue multiple parallel https RESTAPI calls (similar to what the grafana-pcp datasource issues for the "PCP Redis Host Overview" dashboard), cause the test pmproxy process to spin at 100% CPU; it has to be killed manually.

diff --git a/qa/1457 b/qa/1457
index 94969f6e0..0efe1d3b1 100755
--- a/qa/1457
+++ b/qa/1457
@@ -141,10 +141,11 @@ date >>$seq.full
 echo "=== checking TLS operation ===" | tee -a $seq.full
 # (-k) allows us to use self-signed (insecure) certificates, so for testing only
 # (-v) provides very detailed TLS connection information, for debugging only
-curl -k --get 2>$tmp.err \
-       "https://localhost:$port/pmapi/metric?name=sample.long.ten" \
-       | _filter_json
-cat $tmp.err >>$seq.full
+for i in `seq 1 8`; do
+    curl -k --get "https://localhost:$port/pmapi/metric?name=sample.long.ten" >$tmp.$i.out 2>$tmp.$i.err &
+done
+wait
+for i in `seq 1 8`; do echo == $i ==; _filter_json <$tmp.$i.out; cat $tmp.$i.err >>$seq.full; done
 date >>$seq.full


This results in the QA test pmproxy process spinning at 100% CPU, the same as reported in comment #0.

rhel84:mgoodwin@~[]$ sudo gdb -p 2600351
...
Attaching to process 2600351
[New LWP 2600354]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00005606c721e01f in flush_ssl_buffer (client=0x5606c90511b0) at secure.c:120
120	{
...
(gdb) thread apply all bt

Thread 2 (Thread 0x7f2af09c7700 (LWP 2600354)):
#0  0x00007f2af5914a41 in poll () from /lib64/libc.so.6
#1  0x00007f2af424b5ae in poll_func () from /lib64/libavahi-common.so.3
#2  0x00007f2af424b141 in avahi_simple_poll_run () from /lib64/libavahi-common.so.3
#3  0x00007f2af424b320 in avahi_simple_poll_iterate () from /lib64/libavahi-common.so.3
#4  0x00007f2af424b53b in avahi_simple_poll_loop () from /lib64/libavahi-common.so.3
#5  0x00007f2af424b61e in thread () from /lib64/libavahi-common.so.3
#6  0x00007f2af53f414a in start_thread () from /lib64/libpthread.so.0
#7  0x00007f2af591fdc3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f2af7312980 (LWP 2600351)):
#0  0x00005606c721e01f in flush_ssl_buffer (client=0x5606c90511b0) at secure.c:120
#1  0x00005606c721e151 in flush_secure_module (proxy=0x5606c901f230) at secure.c:143
#2  0x00005606c7211431 in check_proxy (arg=0x7ffdba40b5f0) at server.c:830
#3  0x00007f2af62ec4c9 in uv.run_check () from /lib64/libuv.so.1
#4  0x00007f2af62e57e4 in uv_run () from /lib64/libuv.so.1
#5  0x00005606c7211578 in main_loop (arg=0x5606c901f230) at server.c:860
#6  0x00005606c720ed2c in main (argc=10, argv=0x7ffdba40c918) at pmproxy.c:467

I'm working on a fix before committing the QA changes (they'll cause the QA run to hang).

Comment 19 Mark Goodwin 2021-06-14 11:10:36 UTC
Fixed upstream (pcp-5.3.2 devel) with PR https://github.com/performancecopilot/pcp/pull/1321, which includes the following fixes:


commit 3f5ba221842e6a02e9fb22e23c754854271c3c9a
Author: Mark Goodwin <mgoodwin>
Date:   Wed Jun 9 16:44:30 2021 +1000

    libpcp_web: add mutex to struct webgroup protecting the context dict
    
    Add a mutex to the local webgroups structure in libpcp_web and
    use it to protect multithreaded parallel updates (dictAdd,
    dictDelete) to the groups->contexts dict and the dict traversal
    in the timer driven garbage collector.
    
    Tested by qa/297 and related tests and also an updated version
    of qa/1457 (which now stress tests parallel http and https/tls
    pmproxy RESTAPI calls .. in a later commit).
    
    Related: RHBZ#1947989
    Resolves: https://github.com/performancecopilot/pcp/issues/1311

commit 2bad6aef10339f000f7cb578108db5ee80bd640c
Author: Mark Goodwin <mgoodwin>
Date:   Wed Jun 9 17:04:33 2021 +1000

    pmproxy: add mutex for client req lists, fix https/tls support, QA
    
    Add a new mutex to struct proxy and use it to protect parallel
    multithreaded updates to the proxy->first client list.
    
    Also use the same mutex to protect updates to the pending_writes
    client list and avoid the doubly linked list corruption that was
    causing parallel https/tls requests to get stuck spinning in
    flush_secure_module(), as reported in BZ#1947989.
    
    qa/1457 is extensively updated to test parallel http, https/tls
    (and combinations of http and https/tls) RESTAPI calls. Previously
    it only tested a single https/tls call.
    
    With these changes, parallel https/tls RESTAPI requests from the
    grafana-pcp datasource to pmproxy now work correctly whereas previously
    pmproxy would hang/spin.
    
    Resolves: RHBZ#1947989 - pmproxy hangs and consume 100% cpu if the
    redis datasource is configured with TLS.
    
    Related: https://github.com/performancecopilot/pcp/issues/1311

Comment 26 errata-xmlrpc 2021-11-09 17:49:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcp bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4171

