Bug 1973833
| Summary: | [RFE] Improve remote pmlogger over slow network connections | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Parikshit Khedekar <pkhedeka> |
| Component: | pcp | Assignee: | Nathan Scott <nathans> |
| Status: | CLOSED ERRATA | QA Contact: | Jan Kurik <jkurik> |
| Severity: | high | Docs Contact: | Apurva Bhide <abhide> |
| Priority: | unspecified | | |
| Version: | 8.4 | CC: | agerstmayr, jkurik, nathans, peter.vreman, surpatil |
| Target Milestone: | beta | Keywords: | FutureFeature, Triaged |
| Target Release: | 8.6 | Flags: | pm-rhel: mirror+ |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | pcp-5.3.5-4.el8 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-10 13:30:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 2
Mark Goodwin
2021-07-07 04:00:08 UTC
Confirming I have no version header included in my local file:

~~~
[crash/Azure] hoiroot@li-lc-2635:~$ cat /etc/pcp/pmlogger/control.d/remote
#Host                     P?  S?  directory                                             args
li-lc-1010.hag.hilti.com  n   n   PCP_LOG_DIR/pmlogger/remote/li-lc-1010.hag.hilti.com  -r -T24h10m -c /var/lib/pcp/config/pmlogger/remote/config.li-lc-1010.hag.hilti.com
li-lc-1011.hag.hilti.com  n   n   PCP_LOG_DIR/pmlogger/remote/li-lc-1011.hag.hilti.com  -r -T24h10m -c /var/lib/pcp/config/pmlogger/remote/config.li-lc-1011.hag.hilti.com
...
~~~

I tested with $version=1.1 in the file and there is no difference. After a reboot of the server, starting the 70 remote pmlogger processes still takes 120 minutes:

~~~
[Azure] vrempet-admin@li-lc-2635 ~ $ rpm -q pcp
pcp-5.2.5-4.el8.x86_64
[Azure] vrempet-admin@li-lc-2635 ~ $ grep version /etc/pcp/pmlogger/control.d/remote
$version=1.1
[Azure] vrempet-admin@li-lc-2635 ~ $ ls -l /etc/pcp/pmlogger/control.d/remote
-rw-r--r--. 1 root root 13589 Jul 15 13:34 /etc/pcp/pmlogger/control.d/remote
[Azure] vrempet-admin@li-lc-2635 ~ $ ps auxf
USER         PID %CPU %MEM     VSZ     RSS TTY STAT START TIME COMMAND
root           2  0.0  0.0       0       0 ?   S    13:41 0:00 [kthreadd]
root           3  0.0  0.0       0       0 ?   I<   13:41 0:00 \_ [rcu_gp]
root           4  0.0  0.0       0       0 ?   I<   13:41 0:00 \_ [rcu_par_gp]
...
pcp         1687  2.8 11.8 2047644 1920452 ?   Rs   13:41 3:44 /usr/libexec/pcp/bin/pmproxy -F -A
pcp         2269  0.0  0.0  123444    7904 ?   S    13:41 0:00 /usr/libexec/pcp/bin/pmlogger -N -P -r -T24h10m -c config.default -v 100mb -m pmlogger_check %Y%m%d.
pcp         2326  0.0  0.0  122328    6872 ?   S    13:41 0:00 /usr/bin/pmie -b -F -P -l /var/log/pcp/pmie/li-lc-2635/pmie.log -c config.default
pcp         2652  0.0  0.0  125152    7756 ?   S    13:41 0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-1010.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp        14707  0.0  0.0  125212    7556 ?   S    13:42 0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-1011.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp        53156  0.0  0.0  125200    7316 ?   S    13:47 0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-1012.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp        66507  0.0  0.0  125164    7504 ?   S    13:49 0:01 /usr/libexec/pcp/bin/pmlogger -h li-lc-1013.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
...
pcp      1599300  0.0  0.0  124964    7348 ?   S    15:38 0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-2622.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp      1604929  0.0  0.0  125012    7408 ?   S    15:38 0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-2623.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp      1607151  0.0  0.0  125020    7468 ?   S    15:38 0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-2624.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
pcp      1615879  0.0  0.0  125020    7404 ?   S    15:38 0:00 /usr/libexec/pcp/bin/pmlogger -h li-lc-2630.hag.hilti.com -r -T24h10m -c /var/lib/pcp/config/pmlogge
~~~

I've been working with an upstream PCP developer (kenj) and we now believe this issue has been resolved (from pcp-5.3.5 on). This was achieved through a combination of protocol and API extensions that allow bulk metric descriptor transfers (the new pmLookupDescs(3) API) and a greatly optimized PDU submission pattern and caching for the other parts of pmlogger startup. Kenj reports that high-latency-network pmlogger setups which previously took 5-10 minutes to start individual loggers now take around 2-3 seconds to start.

https://github.com/performancecopilot/pcp/pull/1432
https://github.com/performancecopilot/pcp/pull/1443

Sadly I cannot confirm it is validated in the real world. On both the pmlogger farm and the client (both on RHEL 8.5) I went from the 5.3.1 to the (upstream-provided) 5.3.5 rpms. There is no visible difference in the startup timing; it is still slow, and a tcpdump still shows thousands of small TCP packets at startup of the pmlogger farm.

Is there any special setup (more than yum update) needed here?

(In reply to Peter Vreman from comment #11)
> Sadly i cannot confirm it is validated in real-world.
>
> On both the pmlogger farm and client (both on RHEL8.5) from 5.3.1 ->
> (upstream provided) 5.3.5 rpms.
> There is no difference visible on the startup timing, it is still slow and
> the tcpdump shows still thousands of small tcp packets at startup of the
> pmlogger-farm
>
> Is there any special setup (more than yum update) needed here?

There is no special setup needed. During my testing I am observing about 30% to 40% shorter times when comparing pcp-5.2.5 (rhel-8.4) and pcp-5.3.5. Is the speedup not sufficient for you, or do you not observe any speedup at all?

It is about expectation here. The 30-40% speedup is a good step in the right direction, but it is not the speedup promised in https://bugzilla.redhat.com/show_bug.cgi?id=1973833#c7, which mentioned to expect 2-3 seconds instead of 5-10 minutes. As a tester I would not have put a performance improvement on Verified if I got only a 2x improvement when the promised/expected improvement was 100-200x. In my setup with a high-latency VPN link it still takes 200 seconds.
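For context, a back-of-envelope model (my own construction, not from this report) shows how per-metric round trips alone can account for a startup time of this magnitude; the 32 ms round-trip time is an assumed figure for a VPN link, and the metric count is derived from the ~12000 PDUs and ~10 PDUs per metric reported below:

```shell
# Assumptions: ~12000 PDUs total, ~10 PDUs (5 request/reply round
# trips) per metric, and a hypothetical 32 ms VPN round-trip time.
rtt_ms=32
trips_per_metric=5               # 5 request/reply pairs per metric
metrics=1200                     # ~12000 PDUs / 10 PDUs per metric
total_ms=$(( metrics * trips_per_metric * rtt_ms ))
echo "latency-bound startup estimate: $(( total_ms / 1000 )) seconds"
```

With these assumed figures the estimate lands near the observed 200 seconds, which is consistent with startup being dominated by round-trip latency rather than bandwidth.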
Looking at the tcpdump, it shows ~12000 PDUs are used to initialize the logger, even with the newly introduced PDU types 0x7016 and 0x7017 being used in the early steps (frames 49 and 56):

~~~
Frame 49: 430 bytes on wire (3440 bits), 430 bytes captured (3440 bits)
Ethernet II, Src: Cisco_b1:70:43 (70:6b:b9:b1:70:43), Dst: VMware_97:ae:98 (00:50:56:97:ae:98)
Internet Protocol Version 4, Src: 10.208.0.165, Dst: 10.92.14.229
Transmission Control Protocol, Src Port: 52166, Dst Port: 44321, Seq: 23841, Ack: 3141, Len: 364
[3 Reassembled TCP Segments (2996 bytes): #46(1316), #47(1316), #49(364)]
Performance Co-Pilot
    PDU Length: 2996
    Type: Unknown (0x00007016)
    From: 0
    [Expert Info (Warning/Undecoded): Unimplemented Packet Type]
~~~

~~~
Frame 56: 1382 bytes on wire (11056 bits), 1382 bytes captured (11056 bits)
Ethernet II, Src: VMware_97:ae:98 (00:50:56:97:ae:98), Dst: Cisco_9f:f0:01 (00:00:0c:9f:f0:01)
Internet Protocol Version 4, Src: 10.92.14.229, Dst: 10.208.0.165
Transmission Control Protocol, Src Port: 44321, Dst Port: 52166, Seq: 16721, Ack: 24205, Len: 1316
[6 Reassembled TCP Segments (14896 bytes): #50(2716), #51(2716), #52(2716), #53(2716), #54(2716), #56(1316)]
Performance Co-Pilot
    PDU Length: 14896
    Type: Unknown (0x00007017)
    From: 0
    [Expert Info (Warning/Undecoded): Unimplemented Packet Type]
~~~

After this, another 10 PDUs (5 in each direction) are still used per metric. As you can see from the timing in my real-world case, that still takes 0.16 sec per metric:

~~~
No.  Time       Source        Destination   Protocol Length Info
840  13.710756  10.208.0.165  10.92.14.229  PCP       90 Client > Server [PMNS_IDS]
841  13.710805  10.92.14.229  10.208.0.165  PCP      118 Server > Client [PMNS_NAMES]
842  13.745550  10.208.0.165  10.92.14.229  PCP       86 Client > Server [LABEL_REQ]
843  13.745630  10.92.14.229  10.208.0.165  PCP       94 Server > Client [LABEL]
844  13.782138  10.208.0.165  10.92.14.229  PCP       86 Client > Server [LABEL_REQ]
845  13.782252  10.92.14.229  10.208.0.165  PCP       94 Server > Client [LABEL]
846  13.839308  10.208.0.165  10.92.14.229  PCP       86 Client > Server [TEXT_REQ]
847  13.839431  10.92.14.229  10.208.0.165  PCP      134 Server > Client [TEXT]
848  13.873962  10.208.0.165  10.92.14.229  PCP       86 Client > Server [TEXT_REQ]
849  13.874083  10.92.14.229  10.208.0.165  PCP      178 Server > Client [TEXT]
~~~

In case 02952167 I have attached the complete tcpdump.

(In reply to Peter Vreman from comment #13)
> As a tester i would not [...]

Please choose your words more carefully if you wish to continue participating in conversations directly with Red Hat engineering teams, Peter. It is not in any way appropriate for you to suggest Jan is not doing his job appropriately. In the context of a stable RHEL 8.6 release, his actions are spot on: we have to be very cautious about the changes we make so as not to introduce regressions, it is clear this is a decent initial step, and he has indicated there remains scope for further improvement in future releases. We are not ignoring that.

It is also not appropriate for you to set NEEDINFO on bugzillas like this; it is quite disruptive to us, as you and I have discussed before. Consider this a final caution: if we see it again we may have to close these BZs to public discussion with you, as we have a lot of work to get on with, and insulting us is no way to work with us either. We are open by default, but you are crossing professional boundaries and have not heeded requests to change behaviour.

> It is about expectation here. The 30-40% speedup is a good step in the right
> direction. But it is not the speed up promosied in
> https://bugzilla.redhat.com/show_bug.cgi?id=1973833#c7 that mentioned to
> expect 2-3 seconds instead of 5-10 minutes.

There are many, many factors here. The good thing about these measurements is that they are completely reproducible when you start at the same place as the original engineer's comments. Those comments focus specifically on pmlogger itself (the actual binary). A TCP dump at an endpoint observing a pmlogger *service* startup is not measuring the same thing.

> In my setup with a high latency VPN link it takes still 200 seconds. Looking
> at the tcpdump it shows there are ~12000 PDUs used to initialize logger even

This is no longer comparing to the same thing as the original measurements, so it is not hugely surprising. There are other processes involved in a service start; in particular, you probably have pmlogconf doing probing. You can set up logger farms to not do this by removing the pmlogconf line at the start of any generated pmlogger configuration (once you are happy with the generated config you do not need to keep re-generating it on every service start; this just adds needless latency in many cases).

Over time we expect to continue making progress in this general area. Some plans have wider impact, though, and will likely only be possible in a major version bump (e.g. using parallel systemd pmlogger services, being considered for early in the life of RHEL 9). Other smaller, incremental changes (e.g. in pmlogconf) we may be able to make in 8.7. Either way, once again, it is not at all appropriate for you to suggest Jan has not acted appropriately here.

cheers.

The following commit is merged upstream and planned for PCP v6. It is a candidate for back-porting.
~~~
commit 6b907e90c74fce82d6b712493d8b362bdd1a1ec1
Author: Nathan Scott <nathans>
Date:   Tue Dec 14 08:54:14 2021 +1100

    pmlogconf: switch to the bulk pmLookupDescs(3) interface

    No functional change, all existing regression tests pass.
    Related to Red Hat BZ #1973833.
~~~
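Regarding the pmlogconf re-probing mentioned above: a pmlogconf-generated pmlogger config carries a marker line at the top that tells pmlogconf it may re-probe and regenerate the file; deleting that line freezes the config. The sketch below is illustrative only and works on a throwaway temp file; the `#pmlogconf 2.0` marker form and the log spec are assumptions, so check the first line of your own generated config (typically under /var/lib/pcp/config/pmlogger/) before editing it.

```shell
# Build a throwaway stand-in for a generated pmlogger config
# (marker line and log spec here are illustrative, not from this report).
config=$(mktemp)
printf '#pmlogconf 2.0\nlog mandatory on once {\n    hinv.ncpu\n}\n' > "$config"

# Remove the pmlogconf marker so the file is no longer treated as
# re-generatable at every pmlogger service start.
sed -i '/^#pmlogconf/d' "$config"

head -1 "$config"    # prints "log mandatory on once {"
rm -f "$config"
```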
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pcp bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1765