Bug 1892326 - pmlogger stops working after some time when collecting data from remote nodes
Summary: pmlogger stops working after some time when collecting data from remote nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pcp
Version: 8.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Mark Goodwin
QA Contact: Jan Kurik
URL:
Whiteboard:
Duplicates: 1919950
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-10-28 13:51 UTC by Carlos Santos
Modified: 2023-09-15 00:50 UTC (History)
9 users (show)

Fixed In Version: pcp-5.2.2-1.el8
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-18 15:19:49 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Carlos Santos 2020-10-28 13:51:08 UTC
Description of problem:

The customer has pcp installed on RHEL 8.2 (host pcpcol).

The customer uses pmlogger to collect data from several nodes, with oscar02 being
one OSP cluster and oscar05 being another (systems with the same "prefix"
were deployed in exactly the same way through OSP director):

$ cat etc/pcp/pmlogger/control.d/remote
oscar05com001	n	n	PCP_LOG_DIR/pmlogger/oscar05com001 -r -T24h10m -c config.remote
oscar05com002	n	n	PCP_LOG_DIR/pmlogger/oscar05com002 -r -T24h10m -c config.remote
oscar05com003	n	n	PCP_LOG_DIR/pmlogger/oscar05com003 -r -T24h10m -c config.remote
oscar05com004	n	n	PCP_LOG_DIR/pmlogger/oscar05com004 -r -T24h10m -c config.remote
oscar05ctr001	n	n	PCP_LOG_DIR/pmlogger/oscar05ctr001 -r -T24h10m -c config.remote
oscar05ctr002	n	n	PCP_LOG_DIR/pmlogger/oscar05ctr002 -r -T24h10m -c config.remote
oscar05ctr003	n	n	PCP_LOG_DIR/pmlogger/oscar05ctr003 -r -T24h10m -c config.remote
oscar05cep001	n	n	PCP_LOG_DIR/pmlogger/oscar05cep001 -r -T24h10m -c config.remote
oscar05cep002	n	n	PCP_LOG_DIR/pmlogger/oscar05cep002 -r -T24h10m -c config.remote
oscar05cep003	n	n	PCP_LOG_DIR/pmlogger/oscar05cep003 -r -T24h10m -c config.remote
oscar02com001	n	n	PCP_LOG_DIR/pmlogger/oscar02com001 -r -T24h10m -c config.remote
oscar02com002	n	n	PCP_LOG_DIR/pmlogger/oscar02com002 -r -T24h10m -c config.remote
oscar02com003	n	n	PCP_LOG_DIR/pmlogger/oscar02com003 -r -T24h10m -c config.remote
oscar02com004	n	n	PCP_LOG_DIR/pmlogger/oscar02com004 -r -T24h10m -c config.remote
oscar02ctr001	n	n	PCP_LOG_DIR/pmlogger/oscar02ctr001 -r -T24h10m -c config.remote
oscar02ctr002	n	n	PCP_LOG_DIR/pmlogger/oscar02ctr002 -r -T24h10m -c config.remote
oscar02ctr003	n	n	PCP_LOG_DIR/pmlogger/oscar02ctr003 -r -T24h10m -c config.remote
oscar02cep001	n	n	PCP_LOG_DIR/pmlogger/oscar02cep001 -r -T24h10m -c config.remote
oscar02cep002	n	n	PCP_LOG_DIR/pmlogger/oscar02cep002 -r -T24h10m -c config.remote
oscar02cep004	n	n	PCP_LOG_DIR/pmlogger/oscar02cep004 -r -T24h10m -c config.remote
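
For reference (this note is added for clarity and is not part of the original
report), each line of a pmlogger control file has the form
"host  primary?  socks?  log-directory  pmlogger-options", so an entry such as

    oscar05com001  n  n  PCP_LOG_DIR/pmlogger/oscar05com001 -r -T24h10m -c config.remote

means: log from remote host oscar05com001, this is not the primary logger, do
not use pmsocks, write archives under PCP_LOG_DIR/pmlogger/oscar05com001, and
pass "-r -T24h10m -c config.remote" through to pmlogger.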

The initial customer complaints are:

  - pmlogger seems to randomly stop working.

  - Even when data is there it doesn't seem to always be uploaded to redis, or
    at least the redis pcp datasource can't find it.

As an example, let's take two clients, oscar02com001 and oscar02com002:

  - oscar02com001 has all data in logs, in redis, and everything can be
    visualised using grafana-pcp with the redis datasource
  - oscar02com002 has data up to 06/03, then collection seems to have stopped.
    Even for the days prior to 06/03 nothing can be visualised in grafana

root@pcpcol:/var/log/pcp/pmlogger/oscar02com001# pmdumptext -Xlimu -t 10m -S @09:00 -T @10:00 'kernel.all.load[1]'  -a 20200605.00.11.0
[ 1] 20200605.00.11.0/kernel.all.load["1 minute"]

             Column          1
             Source     oscar0
             Metric       load
               Inst     1 minu
              Units       none
Fri Jun  5 09:00:00      0.08
Fri Jun  5 09:10:00      0.01

root@pcpcol:/var/log/pcp/pmlogger/oscar02com002# pmdumptext -Xlimu -t 10m -S @09:00 -T @10:00 'kernel.all.load[1]'  -a 20200603.0
[ 1] 20200603.0/kernel.all.load["1 minute"]

             Column          1
             Source     oscar0
             Metric       load
               Inst     1 minu
              Units       none
Wed Jun  3 09:00:00      0.13
Wed Jun  3 09:10:00      0.52

This was the last archive available when the customer reported the problem. It was
not visible using grafana-pcp; it is unclear whether it was ever uploaded to redis.

root@pcpcol:/var/log/pcp/pmlogger/oscar02com002# cat pmlogger.log
Log for pmlogger on pcpcol started Fri Jun  5 12:56:07 2020

preprocessor cmd: /usr/libexec/pcp/bin/pmcpp -rs /var/lib/pcp/config/pmlogger/config.remote -I /var/lib/pcp/config/pmlogger
Warning [/var/lib/pcp/config/pmlogger/config.remote, line 582]
Problem with lookup for metric "network.tcp" ... logging not activated
Reason: Timeout waiting for a response from PMCD
Warning [/var/lib/pcp/config/pmlogger/config.remote, line 597]
Problem with lookup for metric "network.udp6.ignoredmulti" ... logging not activated
Reason: Timeout waiting for a response from PMCD

There were also only two pmlogger processes left:

root@pcpcol:/var/log/pcp/pmlogger/oscar02com002# ps -ef|grep pml
pcp      2087325       1  0 00:11 ?        00:00:02 /usr/libexec/pcp/bin/pmlogger -h oscar02com001 -r -T24h10m -c config.remote -m pmlogger_check 20200605.00.11
pcp      2114208       1  0 00:14 ?        00:00:02 /usr/libexec/pcp/bin/pmlogger -h oscar02com003 -r -T24h10m -c config.remote -m pmlogger_check 20200605.00.14
root     2856909 2835579  0 13:25 pts/0    00:00:00 grep --color=auto pml

Initial findings were that the connections between pmlogger on host pcpcol and
pmcd on the remote hosts (such as oscar05com001) timed out. One of the hosts
(oscar02cep004) had been removed, so it was dropped from the config.
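
A quick way to reproduce such a lookup by hand (illustrative only, not part of
the original report) is to fetch one of the affected metrics directly from a
remote pmcd with an explicit request timeout, for example:

    # host name taken from the control file above; the timeout value is arbitrary
    time PMCD_REQUEST_TIMEOUT=10 pminfo -h oscar05com001 -f kernel.all.load

If this hangs or times out while qperf shows the network is fine, the problem
is on the pmcd side rather than in the transport.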

Customer observed a curious pattern on the grafana dashboard:

  - for cluster oscar02 we have host entries with the shortname (e.g. oscar02com001)
    that "work", or at least these have data since the last reboot of the collector
    server;
  - for the same cluster there are also entries with their FQDNs; these have no data
  - for cluster oscar05 we have only shortnames, with no data at all
  - looking at /var/log/pcp/pmlogger there are only subdirs with shortnames

Network bandwidth and latency between pcpcol and the nodes seem to be normal,
as tested by the customer:

    root@pcpcol:~# qperf -ip 44322 -lp 44321 oscar05com002 tcp_bw
    tcp_bw:
        bw  =  109 MB/sec

    root@pcpcol:~# qperf -ip 44322 -lp 44321 oscar05com002 tcp_lat
    tcp_lat:
        latency  =  236 us

Original customer questions:

- Why does pmlogger suddenly stop being able to connect to the remote pmcd(s)?

  This is the subject of this bug report.

- Why is no data seen in Grafana for the oscar05 cluster?

  This will be investigated in a separate bug if we don't conclude that it has
  a different cause than the previous problem.

- Why is there data only for the oscar02 cluster nodes with long names?

  This has already been solved by checking the value of kernel.uname.nodename
  in the pmlogger files. We have seen that some nodes have "short" names while
  others have "long" names:

  $ sort -u -k 5 kernel.uname.nodename.txt
  Thu Oct  8 09:36:45	"oscar02cep001.tc.lab.int.corp"
  Thu Oct  8 09:36:50	"oscar02cep002.tc.lab.int.corp"
  Thu Oct  8 09:37:06	"oscar02com001.tc.lab.int.corp"
  Thu Oct  8 09:37:12	"oscar02com002.tc.lab.int.corp"
  Thu Oct  8 09:37:17	"oscar02com003.tc.lab.int.corp"
  Thu Oct  8 09:36:23	"oscar02com004.tc.lab.int.corp"
  Thu Oct  8 09:36:29	"oscar02ctr001.tc.lab.int.corp"
  Thu Oct  8 09:36:34	"oscar02ctr002.tc.lab.int.corp"
  Thu Oct  8 09:36:39	"oscar02ctr003.tc.lab.int.corp"
  Thu Oct  8 09:36:48	"oscar05cep001"
  Thu Oct  8 09:36:54	"oscar05cep002"
  Thu Oct  8 09:37:00	"oscar05cep003"
  Thu Oct  8 09:37:04	"oscar05com001"
  Thu Oct  8 09:37:10	"oscar05com002"
  Thu Oct  8 09:37:16	"oscar05com003"
  Thu Oct  8 09:36:22	"oscar05com004"
  Thu Oct  8 09:36:28	"oscar05ctr001"
  Thu Oct  8 09:36:35	"oscar05ctr002"
  Thu Oct  8 09:36:41	"oscar05ctr003"
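
One way such a list can be gathered (the exact commands the customer used are
not in the report; this is only a sketch) is to fetch kernel.uname.nodename
from each collected host with pminfo:

    # sketch only; the host list is abbreviated
    for h in oscar02cep001 oscar02com001 oscar05cep001 oscar05com001; do
        pminfo -h "$h" -f kernel.uname.nodename
    done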

After talking to the PCP team, we recommended upgrading the PCP packages on
pcpcol to version 5.1.1-3. Soon after, the customer reported:

  at this moment I have no data at all anymore in grafana, neither new nor
  historical, for any host

  Also, I noticed there are no longer any pmlogger processes active;
  they get started every half hour but terminate immediately. Archives
  are being created but they contain no data at all.


Version-Release number of selected component (if applicable):

  On the collector host

  -grafana-pcp-1.0.5-3.el8.noarch
  -pcp-5.0.2-5.el8.x86_64
  -pcp-conf-5.0.2-5.el8.x86_64
  -pcp-doc-5.0.2-5.el8.noarch
  -pcp-gui-5.0.2-5.el8.x86_64
  -pcp-libs-5.0.2-5.el8.x86_64
  -pcp-pmda-dm-5.0.2-5.el8.x86_64
  -pcp-pmda-nfsclient-5.0.2-5.el8.x86_64
  -pcp-selinux-5.0.2-5.el8.x86_64
  -pcp-system-tools-5.0.2-5.el8.x86_64
  -pcp-zeroconf-5.0.2-5.el8.x86_64
  -python3-pcp-5.0.2-5.el8.x86_64

  On the "oscar02*" nodes

  -pcp-4.3.2-3.el7_7.x86_64
  -python-pcp-4.3.2-3.el7_7.x86_64
  -pcp-selinux-4.3.2-3.el7_7.x86_64
  -pcp-conf-4.3.2-3.el7_7.x86_64
  -pcp-system-tools-4.3.2-3.el7_7.x86_64
  -pcp-libs-4.3.2-3.el7_7.x86_64

  On the "oscar05*" nodes

  -pcp-libs-4.3.2-7.el7_8.x86_64
  -pcp-4.3.2-7.el7_8.x86_64
  -pcp-system-tools-4.3.2-7.el7_8.x86_64
  -python-pcp-4.3.2-7.el7_8.x86_64
  -pcp-conf-4.3.2-7.el7_8.x86_64
  -pcp-selinux-4.3.2-7.el7_8.x86_64

How reproducible:

Always

Steps to Reproduce:

TBD

Actual results:

pmlogger stops running after some time

Expected results:

pmlogger should keep running

Additional info:

The customer reports that pmproxy consumes a large amount of memory (around 6 GB),
which is a known problem (memory leak) in pcp-5.1.1-3.

pmproxy reports a corrupt record in one of the data files:

[Wed Sep  2 17:48:57] pmproxy(190847) Error: pmNewContext failed for /var/log/pcp/pmlogger/oscar02ctr001/RIOyzF: Corrupted record in a PCP archive log

The corrupted archive (oscar02ctr001/RIOyzF) is dated Aug 16, so it had been
causing issues for some time. We see this in older /var/log/messages files:

Aug 18 00:13:30 pcpcol systemd[1]: pmproxy.service: Main process exited, code=killed, status=11/SEGV
[...]
Aug 19 00:25:03 pcpcol systemd[1]: pmproxy.service: Main process exited, code=killed, status=11/SEGV
[...]
Sep  2 17:48:34 pcpcol systemd[1]: pmproxy.service: Main process exited, code=killed, status=11/SEGV
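
As an aside (not part of the original report), pmlogcheck can be run against
the suspect archive to confirm which volume or record is damaged, e.g.:

    pmlogcheck /var/log/pcp/pmlogger/oscar02ctr001/RIOyzF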

Comment 9 Mark Goodwin 2020-11-05 23:35:30 UTC
Resolved upstream for pcp-5.2.2

commit de4fd50a072c41a7180856ca21682afe7f553acd
Author: Mark Goodwin <mgoodwin>
Date:   Thu Nov 5 16:41:57 2020 +1100

    logutil: fix pmlc/pmlogger race causing pmlogger_daily to hang
    
    When determining the current log volume, pmlogger_daily polls the
    pmlogger control port using pmlc. If PMLOGGER_REQUEST_TIMEOUT
    isn't set, pmlc can block indefinitely (default timeout is
    TIMEOUT_NEVER) if pmlogger is starting or re-exec'ing but isn't
    yet listening on the control port.
    
    This race can cause the nightly log rotation to fail, especially
    on logger farms with many pmloggers under management (pmlc is
    more likely to win the race I guess). We then get cascading failures
    from there due to hung pmlogger_daily holding locks, log rotation
    and compression failing to complete, pmproxy discovery paused due
    to the locks .. so no new data making its way into redis, etc.
    
    After QA/soak testing on a 10 logger test farm, the simple solution
    seems to be to just set PMLOGGER_REQUEST_TIMEOUT to a default value
    (if it is not already set) prior to pmlogger_daily running pmlc.
    All the other recent logctl script updates and new utilities such
    as pmlogctl all seem to be working fine otherwise and are a huge
    improvement.
    
    Resolves: RHBZ#1892326 (apart from some pmproxy memory management
    issues, which will be dealt with separately)
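
In shell terms the change amounts to something like the following sketch (the
timeout value and the pmlc invocation here are illustrative, not the exact
upstream patch):

    # ensure pmlc cannot block forever while pmlogger_daily polls the control port
    if [ -z "$PMLOGGER_REQUEST_TIMEOUT" ]; then
        PMLOGGER_REQUEST_TIMEOUT=10
        export PMLOGGER_REQUEST_TIMEOUT
    fi
    echo status | pmlc $pid    # $pid: the pmlogger instance being polled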

Comment 12 Jan Kurik 2020-11-14 15:07:08 UTC
Verifying as SanityOnly due to a missing reliable reproducer.
On the tested build pcp-5.2.2-1.el8, pmlogger has not shown any of the issues described in this bug report.

Comment 27 Mark Goodwin 2021-02-01 22:59:35 UTC
*** Bug 1919950 has been marked as a duplicate of this bug. ***

Comment 29 errata-xmlrpc 2021-05-18 15:19:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcp bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1754

Comment 30 Red Hat Bugzilla 2023-09-15 00:50:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

