Bug 1947989

Summary: pmproxy hangs and consumes 100% CPU if the Redis datasource in Grafana is configured with TLS
Product: Red Hat Enterprise Linux 8
Reporter: Michele Casaburo <mcasabur>
Component: pcp
Assignee: Mark Goodwin <mgoodwin>
Status: CLOSED ERRATA
QA Contact: Jan Kurik <jkurik>
Severity: medium
Docs Contact: Apurva Bhide <abhide>
Priority: unspecified
Version: 8.3
CC: agerstmayr, jkurik, mgoodwin, nathans, patrickm
Target Milestone: rc
Keywords: Bugfix, Triaged
Target Release: 8.5
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pcp-5.3.1-2.el8
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-11-09 17:49:39 UTC

Comment 18 Mark Goodwin 2021-05-19 10:56:28 UTC
After more investigation, this is an issue with parallel https RESTAPI client requests. The following changes to qa/1457, which issue multiple parallel https RESTAPI calls (similar to what the grafana-pcp datasource issues for the "PCP Redis Host Overview" dashboard), cause the test pmproxy process to spin at 100% CPU, and it has to be killed manually.

diff --git a/qa/1457 b/qa/1457
index 94969f6e0..0efe1d3b1 100755
--- a/qa/1457
+++ b/qa/1457
@@ -141,10 +141,11 @@ date >>$seq.full
 echo "=== checking TLS operation ===" | tee -a $seq.full
 # (-k) allows us to use self-signed (insecure) certificates, so for testing only
 # (-v) provides very detailed TLS connection information, for debugging only
-curl -k --get 2>$tmp.err \
-       "https://localhost:$port/pmapi/metric?name=sample.long.ten" \
-       | _filter_json
-cat $tmp.err >>$seq.full
+for i in `seq 1 8`; do
+    curl -k --get "https://localhost:$port/pmapi/metric?name=sample.long.ten" >$tmp.$i.out 2>$tmp.$i.err &
+done
+wait
+for i in `seq 1 8`; do echo == $i ==; _filter_json <$tmp.$i.out; cat $tmp.$i.err >>$seq.full; done
 date >>$seq.full


This results in the QA test pmproxy process spinning at 100% CPU, the same as reported in Comment #0.

rhel84:mgoodwin@~[]$ sudo gdb -p 2600351
...
Attaching to process 2600351
[New LWP 2600354]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00005606c721e01f in flush_ssl_buffer (client=0x5606c90511b0) at secure.c:120
120	{
...
(gdb) thread apply all bt

Thread 2 (Thread 0x7f2af09c7700 (LWP 2600354)):
#0  0x00007f2af5914a41 in poll () from /lib64/libc.so.6
#1  0x00007f2af424b5ae in poll_func () from /lib64/libavahi-common.so.3
#2  0x00007f2af424b141 in avahi_simple_poll_run () from /lib64/libavahi-common.so.3
#3  0x00007f2af424b320 in avahi_simple_poll_iterate () from /lib64/libavahi-common.so.3
#4  0x00007f2af424b53b in avahi_simple_poll_loop () from /lib64/libavahi-common.so.3
#5  0x00007f2af424b61e in thread () from /lib64/libavahi-common.so.3
#6  0x00007f2af53f414a in start_thread () from /lib64/libpthread.so.0
#7  0x00007f2af591fdc3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f2af7312980 (LWP 2600351)):
#0  0x00005606c721e01f in flush_ssl_buffer (client=0x5606c90511b0) at secure.c:120
#1  0x00005606c721e151 in flush_secure_module (proxy=0x5606c901f230) at secure.c:143
#2  0x00005606c7211431 in check_proxy (arg=0x7ffdba40b5f0) at server.c:830
#3  0x00007f2af62ec4c9 in uv.run_check () from /lib64/libuv.so.1
#4  0x00007f2af62e57e4 in uv_run () from /lib64/libuv.so.1
#5  0x00005606c7211578 in main_loop (arg=0x5606c901f230) at server.c:860
#6  0x00005606c720ed2c in main (argc=10, argv=0x7ffdba40c918) at pmproxy.c:467

I'm working on a fix before committing the QA changes (they'll cause the QA run to hang).

Comment 19 Mark Goodwin 2021-06-14 11:10:36 UTC
Fixed upstream (pcp-5.3.2 devel) with PR https://github.com/performancecopilot/pcp/pull/1321, which includes the following fixes:


commit 3f5ba221842e6a02e9fb22e23c754854271c3c9a
Author: Mark Goodwin <mgoodwin>
Date:   Wed Jun 9 16:44:30 2021 +1000

    libpcp_web: add mutex to struct webgroup protecting the context dict
    
    Add a mutex to the local webgroups structure in libpcp_web and
    use it to protect multithreaded parallel updates (dictAdd,
    dictDelete) to the groups->contexts dict and the dict traversal
    in the timer driven garbage collector.
    
    Tested by qa/297 and related tests and also an updated version
    of qa/1457 (which now stress tests parallel http and https/tls
    pmproxy RESTAPI calls .. in a later commit).
    
    Related: RHBZ#1947989
    Resolves: https://github.com/performancecopilot/pcp/issues/1311
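
To illustrate the locking pattern this commit describes, here is a minimal, self-contained C sketch. The structure and function names are hypothetical stand-ins, not the actual libpcp_web code: the point is that one mutex serializes both the additions made by request threads and the timer-driven garbage-collection pass that walks the shared contexts table, so the collector never traverses a half-updated structure.

/*
 * Illustrative sketch only -- hypothetical names, not the pcp sources.
 * Pattern: one mutex guards both the writers (add) and the timer-driven
 * garbage collector that traverses the shared table.
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct context {
    char *name;
    int   inactive;            /* marked for garbage collection */
    struct context *next;
};

struct webgroups {
    struct context *contexts;  /* shared across request threads */
    pthread_mutex_t mutex;     /* protects the contexts list */
};

static void webgroups_init(struct webgroups *g)
{
    g->contexts = NULL;
    pthread_mutex_init(&g->mutex, NULL);
}

static void webgroups_add(struct webgroups *g, const char *name)
{
    struct context *cp = calloc(1, sizeof(*cp));

    cp->name = strdup(name);
    pthread_mutex_lock(&g->mutex);
    cp->next = g->contexts;    /* insert under the lock */
    g->contexts = cp;
    pthread_mutex_unlock(&g->mutex);
}

/* timer callback: reap inactive contexts while holding the same lock */
static void webgroups_gc(struct webgroups *g)
{
    struct context **cpp, *cp;

    pthread_mutex_lock(&g->mutex);
    for (cpp = &g->contexts; (cp = *cpp) != NULL; ) {
        if (cp->inactive) {
            *cpp = cp->next;
            free(cp->name);
            free(cp);
        } else {
            cpp = &cp->next;
        }
    }
    pthread_mutex_unlock(&g->mutex);
}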

commit 2bad6aef10339f000f7cb578108db5ee80bd640c
Author: Mark Goodwin <mgoodwin>
Date:   Wed Jun 9 17:04:33 2021 +1000

    pmproxy: add mutex for client req lists, fix https/tls support, QA
    
    Add a new mutex to struct proxy and use it to protect parallel
    multithreaded updates to the proxy->first client list.
    
    Also use the same mutex to protect updates to the pending_writes
    client list and avoid the doubly linked list corruption that was
    causing parallel https/tls requests to get stuck spinning in
    flush_secure_module(), as reported in BZ#1947989.
    
    qa/1457 is extensively updated to test parallel http, https/tls
    (and combinations of http and https/tls) RESTAPI calls. Previously
    it only tested a single https/tls call.
    
    With these changes, parallel https/tls RESTAPI requests from the
    grafana-pcp datasource to pmproxy now work correctly whereas previously
    pmproxy would hang/spin.
    
    Resolves: RHBZ#1947989 - pmproxy hangs and consume 100% cpu if the
    redis datasource is configured with TLS.
    
    Related: https://github.com/performancecopilot/pcp/issues/1311
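
As a rough illustration of the pattern this second commit describes (again with hypothetical names, not the pmproxy sources): the same mutex covers both queueing a client onto the pending-writes list and the flush pass that walks it, so concurrent unlinks cannot corrupt the next/prev pointers and leave the flush loop spinning the way the backtrace in Comment 18 shows.

/*
 * Illustrative sketch only -- hypothetical names, not the pmproxy sources.
 * Pattern: the mutex covers linking clients onto the pending-writes list
 * and the flush pass that walks it; without it, concurrent updates can
 * corrupt the doubly linked list and the walker never terminates.
 */
#include <pthread.h>

struct client {
    struct client *next, *prev;
    int pending;
};

struct proxy {
    struct client  *pending_writes;   /* doubly linked, shared */
    pthread_mutex_t write_mutex;      /* protects the list above */
};

static void client_queue_write(struct proxy *p, struct client *c)
{
    pthread_mutex_lock(&p->write_mutex);
    c->prev = NULL;
    c->next = p->pending_writes;
    if (c->next)
        c->next->prev = c;
    p->pending_writes = c;
    c->pending = 1;
    pthread_mutex_unlock(&p->write_mutex);
}

/* called from the event loop, analogous in spirit to flush_secure_module() */
static void flush_pending_writes(struct proxy *p, void (*flush)(struct client *))
{
    struct client *c, *next;

    pthread_mutex_lock(&p->write_mutex);
    for (c = p->pending_writes; c != NULL; c = next) {
        next = c->next;           /* safe: no concurrent unlink */
        flush(c);
        c->pending = 0;
    }
    p->pending_writes = NULL;
    pthread_mutex_unlock(&p->write_mutex);
}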

Comment 26 errata-xmlrpc 2021-11-09 17:49:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcp bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4171