
Bug 1973833

Summary: [RFE] Improve remote pmlogger over slow network connections
Product: Red Hat Enterprise Linux 8
Reporter: Parikshit Khedekar <pkhedeka>
Component: pcp
Assignee: Nathan Scott <nathans>
Status: CLOSED ERRATA
QA Contact: Jan Kurik <jkurik>
Severity: high
Docs Contact: Apurva Bhide <abhide>
Priority: unspecified
Version: 8.4
CC: agerstmayr, jkurik, nathans, peter.vreman, surpatil
Target Milestone: beta
Keywords: FutureFeature, Triaged
Target Release: 8.6
Flags: pm-rhel: mirror+
Hardware: All
OS: Linux
Fixed In Version: pcp-5.3.5-4.el8
Doc Type: No Doc Update
Last Closed: 2022-05-10 13:30:36 UTC
Type: Bug

Comment 2 Mark Goodwin 2021-07-07 04:00:08 UTC
Can you please post your pmlogger control files? I may have accidentally reproduced this issue. My /etc/pcp/pmlogger/control.d/pmcdhosts control file (for our 200 logger test cluster) was missing the following near the beginning of the file:

# === VARIABLE ASSIGNMENTS ===
#
# DO NOT REMOVE OR EDIT THE FOLLOWING LINE
$version=1.1


When I tried to start all pmloggers with "pmlogctl -c default -a start", they were taking between 5 and 10 mins each. /var/log/pcp/pmlogger/pmlogger_check.log.* contained the message "defaulting to version 1.0 control file" (or similar).

After I edited my control file and added $version=1.1 (as above) and then restarted the remote loggers, all 200 started in about 23 mins. Still slow (because the pmloggers are started serially rather than in parallel), but around 6 or 7 seconds per logger is a vast improvement - almost tolerable.

Maybe it's time pmlogger_check defaulted to version=1.1 rather than 1.0?
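For anyone else checking this: a quick way to audit a control.d directory for files missing the $version line is a small loop. This is a sketch only; the directory path shown in the comment is the RHEL 8 default and may differ on other installs.

```shell
# List pmlogger control files that lack a "$version=" header line;
# without it, pmlogger_check falls back to the slow version-1.0 path.
check_version_lines() {
    dir="$1"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        grep -q '^\$version=' "$f" || echo "$f"
    done
}

# Typical invocation on RHEL 8:
# check_version_lines /etc/pcp/pmlogger/control.d
```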

Comment 5 Peter Vreman 2021-07-12 15:37:31 UTC
Confirming I have no version header included in my local file:

~~~
[crash/Azure] hoiroot@li-lc-2635:~$ cat /etc/pcp/pmlogger/control.d/remote
#Host                          P?  S?  directory                                                  args
li-lc-1010.hag.hilti.com       n   n   PCP_LOG_DIR/pmlogger/remote/li-lc-1010.hag.hilti.com       -r -T24h10m -c /var/lib/pcp/config/pmlogger/remote/config.li-lc-1010.hag.hilti.com
li-lc-1011.hag.hilti.com       n   n   PCP_LOG_DIR/pmlogger/remote/li-lc-1011.hag.hilti.com       -r -T24h10m -c /var/lib/pcp/config/pmlogger/remote/config.li-lc-1011.hag.hilti.com
...
~~~

Comment 6 Peter Vreman 2021-07-16 07:23:30 UTC
I tested with $version=1.1 in the file and there is no difference.

After a reboot of the server, starting the 70 remote pmlogger processes still takes ~120 minutes:
~~~
[Azure] vrempet-admin@li-lc-2635 ~
$ rpm -q pcp
pcp-5.2.5-4.el8.x86_64

[Azure] vrempet-admin@li-lc-2635 ~
$ grep version /etc/pcp/pmlogger/control.d/remote
$version=1.1

[Azure] vrempet-admin@li-lc-2635 ~
$ ls  -l /etc/pcp/pmlogger/control.d/remote
-rw-r--r--. 1 root root 13589 Jul 15 13:34 /etc/pcp/pmlogger/control.d/remote

[Azure] vrempet-admin@li-lc-2635 ~
$ ps auxf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           2  0.0  0.0      0     0 ?        S    13:41   0:00 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<   13:41   0:00  \_ [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<   13:41   0:00  \_ [rcu_par_gp]
...
pcp         1687  2.8 11.8 2047644 1920452 ?     Rs   13:41   3:44 /usr/libexec/pcp/bin/pmproxy -F -A
pcp         2269  0.0  0.0 123444  7904 ?        S    13:41   0:00 /usr/libexec/pcp/bin/pmlogger -N -P -r -T24h10m -c config.default -v 100mb -m pmlogger_check %Y%m%d.
pcp         2326  0.0  0.0 122328  6872 ?        S    13:41   0:00 /usr/bin/pmie -b -F -P -l /var/log/pcp/pmie/li-lc-2635/pmie.log -c config.default
pcp         2652  0.0  0.0 125152  7756 ?        S    13:41   0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-1010.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp        14707  0.0  0.0 125212  7556 ?        S    13:42   0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-1011.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp        53156  0.0  0.0 125200  7316 ?        S    13:47   0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-1012.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp        66507  0.0  0.0 125164  7504 ?        S    13:49   0:01 /usr/libexec/pcp/bin/pmlogger -h li-lc-1013.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
...
pcp      1599300  0.0  0.0 124964  7348 ?        S    15:38   0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-2622.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp      1604929  0.0  0.0 125012  7408 ?        S    15:38   0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-2623.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp      1607151  0.0  0.0 125020  7468 ?        S    15:38   0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-2624.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp      1615879  0.0  0.0 125020  7404 ?        S    15:38   0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-2630.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
~~~
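A quick sanity check on the ps listing above: the first remote pmlogger has a START of 13:41 and the last 15:38, consistent with the ~120 minute serial startup. The timestamps below are copied from that output.

```shell
# Elapsed wall time between the first and last remote pmlogger start,
# using the START column values from the ps listing above
first_min=$((13 * 60 + 41))   # 13:41
last_min=$((15 * 60 + 38))    # 15:38
echo "elapsed: $((last_min - first_min)) minutes"   # -> elapsed: 117 minutes
```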

Comment 7 Nathan Scott 2021-10-17 23:30:22 UTC
I've been working with an upstream PCP developer (kenj), and we now believe this issue has been resolved (as of pcp-5.3.5).  This was achieved through a combination of protocol and API extensions that allow bulk metric descriptor transfers (the new pmLookupDescs(3) API), a greatly optimized PDU submission pattern, and caching for the other parts of pmlogger startup.  Kenj reports that high-latency-network pmlogger setups which previously took 5-10 minutes to start individual loggers now take around 2-3 seconds.

https://github.com/performancecopilot/pcp/pull/1432
https://github.com/performancecopilot/pcp/pull/1443

Comment 11 Peter Vreman 2021-12-08 14:30:49 UTC
Sadly, I cannot confirm it is fixed in the real world.

On both the pmlogger farm and the client (both on RHEL 8.5) I went from 5.3.1 to the (upstream-provided) 5.3.5 rpms.
There is no visible difference in the startup timing; it is still slow, and tcpdump still shows thousands of small TCP packets at startup of the pmlogger farm.

Is there any special setup (more than yum update) needed here?

Comment 12 Jan Kurik 2021-12-08 16:34:14 UTC
(In reply to Peter Vreman from comment #11)
> Sadly, I cannot confirm it is fixed in the real world.
> 
> On both the pmlogger farm and the client (both on RHEL 8.5) I went from
> 5.3.1 to the (upstream-provided) 5.3.5 rpms.
> There is no visible difference in the startup timing; it is still slow, and
> tcpdump still shows thousands of small TCP packets at startup of the
> pmlogger farm.
> 
> Is there any special setup (more than yum update) needed here?

There is no special setup needed.
During my testing I am observing about 30% to 40% shorter times when comparing pcp-5.2.5 (rhel-8.4) and pcp-5.3.5.
Is the speed-up not sufficient for you, or do you not observe any speed-up at all?

Comment 13 Peter Vreman 2021-12-08 17:51:45 UTC
It is about expectations here. The 30-40% speedup is a good step in the right direction, but it is not the speedup promised in https://bugzilla.redhat.com/show_bug.cgi?id=1973833#c7, which said to expect 2-3 seconds instead of 5-10 minutes.
As a tester I would not have put a performance improvement on Verified if I got only a 2x improvement when the promised/expected performance improvement was 100-200x.


In my setup with a high-latency VPN link it still takes 200 seconds. The tcpdump shows there are ~12000 PDUs used to initialize the logger, even with the introduced PDU types 0x7016 and 0x7017 being used in the early steps (frames 49 and 56):
~~~
Frame 49: 430 bytes on wire (3440 bits), 430 bytes captured (3440 bits)
Ethernet II, Src: Cisco_b1:70:43 (70:6b:b9:b1:70:43), Dst: VMware_97:ae:98 (00:50:56:97:ae:98)
Internet Protocol Version 4, Src: 10.208.0.165, Dst: 10.92.14.229
Transmission Control Protocol, Src Port: 52166, Dst Port: 44321, Seq: 23841, Ack: 3141, Len: 364
[3 Reassembled TCP Segments (2996 bytes): #46(1316), #47(1316), #49(364)]
Performance Co-Pilot
    PDU Length: 2996
    Type: Unknown (0x00007016)
    From: 0
    [Expert Info (Warning/Undecoded): Unimplemented Packet Type]
~~~

~~~
Frame 56: 1382 bytes on wire (11056 bits), 1382 bytes captured (11056 bits)
Ethernet II, Src: VMware_97:ae:98 (00:50:56:97:ae:98), Dst: Cisco_9f:f0:01 (00:00:0c:9f:f0:01)
Internet Protocol Version 4, Src: 10.92.14.229, Dst: 10.208.0.165
Transmission Control Protocol, Src Port: 44321, Dst Port: 52166, Seq: 16721, Ack: 24205, Len: 1316
[6 Reassembled TCP Segments (14896 bytes): #50(2716), #51(2716), #52(2716), #53(2716), #54(2716), #56(1316)]
Performance Co-Pilot
    PDU Length: 14896
    Type: Unknown (0x00007017)
    From: 0
    [Expert Info (Warning/Undecoded): Unimplemented Packet Type]
~~~


After this, each metric still uses another 10 PDUs (5 in each direction). As you can see from the timing, in my real-world case that still takes ~0.16 sec per metric:
~~~
No.	Time	Source	Destination	Protocol	Length	Info
840	13.710756	10.208.0.165	10.92.14.229	PCP	90	Client > Server [PMNS_IDS]
841	13.710805	10.92.14.229	10.208.0.165	PCP	118	Server > Client [PMNS_NAMES]
842	13.745550	10.208.0.165	10.92.14.229	PCP	86	Client > Server [LABEL_REQ]
843	13.745630	10.92.14.229	10.208.0.165	PCP	94	Server > Client [LABEL]
844	13.782138	10.208.0.165	10.92.14.229	PCP	86	Client > Server [LABEL_REQ]
845	13.782252	10.92.14.229	10.208.0.165	PCP	94	Server > Client [LABEL]
846	13.839308	10.208.0.165	10.92.14.229	PCP	86	Client > Server [TEXT_REQ]
847	13.839431	10.92.14.229	10.208.0.165	PCP	134	Server > Client [TEXT]
848	13.873962	10.208.0.165	10.92.14.229	PCP	86	Client > Server [TEXT_REQ]
849	13.874083	10.92.14.229	10.208.0.165	PCP	178	Server > Client [TEXT]
~~~
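The numbers above fit a simple latency model: 5 request/response round trips per metric over a high-latency link. A back-of-envelope check (the RTT and metric count are estimates inferred from the capture, not measured values):

```shell
# ~12000 PDUs / 10 PDUs per metric => ~1200 metrics;
# 5 round trips per metric at an assumed ~33 ms RTT
rtt_ms=33
rtts_per_metric=5
metrics=1200
per_metric_ms=$((rtt_ms * rtts_per_metric))
total_s=$((per_metric_ms * metrics / 1000))
echo "${per_metric_ms} ms per metric, ~${total_s} s total"
# -> 165 ms per metric, ~198 s total (close to the observed 0.16 s and 200 s)
```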

In case 02952167 I have attached the complete tcpdump.

Comment 14 Nathan Scott 2021-12-09 05:27:32 UTC
(In reply to Peter Vreman from comment #13)
> As a tester i would not [...]

Please choose your words more carefully if you wish to continue participating
in conversations directly with Red Hat engineering teams Peter.  It is not in
any way appropriate for you to suggest Jan is not doing his job appropriately.

In the context of a stable RHEL 8.6 release, his actions are spot on - we've
got to be very cautious about the changes we make so as not to introduce
regressions, and it's clear this is a decent initial step; he's indicated there
remains scope for further improvement in future releases.  We're not ignoring that.

It is also not appropriate for you to set NEEDINFO on bugzillas like this; it
is quite disruptive to us, as you and I have discussed before.  Consider this
a final caution - if we see it again we may have to close these BZs to public
discussion with you, as we have a lot of work to get on with, and insulting us
is no way to work with us either.  We are open by default, but you're crossing
professional boundaries and haven't heeded requests to change behaviour.

> It is about expectations here. The 30-40% speedup is a good step in the right
> direction, but it is not the speedup promised in
> https://bugzilla.redhat.com/show_bug.cgi?id=1973833#c7, which said to
> expect 2-3 seconds instead of 5-10 minutes.

There are many, many factors here.  The good thing about these measurements
is that they are completely reproducible when you start from the same place as
the original engineer's comments.  Those comments focus specifically on pmlogger
itself (the actual binary).  A TCP dump at an endpoint observing a pmlogger
*service* startup is not measuring the same thing.

> In my setup with a high latency VPN link it takes still 200 seconds. Looking
> at the tcpdump it shows there are ~12000 PDUs used to initialize logger even

This is no longer comparing to the same thing as the original measurements, so
it's not hugely surprising.  There are other processes involved with a service
start; in particular, you probably have pmlogconf doing probing.  You can set up
logger farms to not do this by removing the pmlogconf line at the start of any
generated pmlogger configuration (once you are happy with the generated config
you don't need to keep re-generating it on every service start - this just adds
needless latency in many cases).
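Nathan's suggestion above can be scripted. A sketch, assuming the generated config is marked by a leading "#pmlogconf" control line (which is what tells the service scripts to re-run pmlogconf probing) - verify the marker format on your PCP version before using this:

```shell
# Strip the leading "#pmlogconf" control line from a generated config so
# the pmlogger service start no longer re-probes it (assumed marker format)
freeze_pmlogger_config() {
    cfg="$1"
    # Only act if the first line is the pmlogconf control line
    if head -n 1 "$cfg" | grep -q '^#pmlogconf'; then
        tail -n +2 "$cfg" > "$cfg.tmp" && mv "$cfg.tmp" "$cfg"
    fi
}
```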

Over time we expect to continue to make progress in this general area - some
plans have wider impact though, and will likely only be possible in a major
version bump (e.g. using parallel systemd pmlogger services, being considered
for early in the life of RHEL 9).  Other smaller, incremental changes (e.g.
in pmlogconf) we may be able to make in 8.7.  Either way, once again, it's not
appropriate at all for you to suggest Jan has not acted appropriately here.

cheers.

Comment 15 Nathan Scott 2021-12-14 03:23:16 UTC
The following commit is merged upstream and planned for PCP v6.  It's a candidate for back-porting.

commit 6b907e90c74fce82d6b712493d8b362bdd1a1ec1
Author: Nathan Scott <nathans>
Date:   Tue Dec 14 08:54:14 2021 +1100

    pmlogconf: switch to the bulk pmLookupDescs(3) interface
    
    No functional change, all existing regression tests pass.
    
    Related to Red Hat BZ #1973833.

Comment 20 errata-xmlrpc 2022-05-10 13:30:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pcp bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1765