Bug 2216124

Summary: pmlogger/config.default wrongly generated when pmlogger is restarted together with pmcd
Product: Red Hat Enterprise Linux 8 Reporter: Evgeni Golov <egolov>
Component: pcpAssignee: Nathan Scott <nathans>
Status: ASSIGNED --- QA Contact: Jan Kurik <jkurik>
Severity: unspecified Docs Contact: Jacob Taylor Valdez <jvaldez>
Priority: unspecified    
Version: 8.8CC: chorn, jkurik, kenj, nathans, scox
Target Milestone: rcKeywords: Bugfix, Triaged
Target Release: 8.10   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Evgeni Golov 2023-06-20 08:28:18 UTC
Description of problem:
I have a Satellite 6.13 system with PCP installed.

To monitor a few interesting processes, I have enabled hotproc:

# cat /var/lib/pcp/pmdas/proc/hotproc.conf
#pmdahotproc
Version 1.0

(
  fname == "java" ||
  fname == "redis-server" ||
  (fname == "postgres" && psargs ~ /-D/) ||
  psargs ~ /puma/ ||
  psargs ~ /pulpcore.app.wsgi:application/ ||
  psargs ~ /pulpcore.content:server/ ||
  psargs ~ /pulpcore-worker/ ||
  psargs ~ /smart-proxy/
)

# cat /var/lib/pcp/config/pmlogconf/foreman-hotproc/summary
#pmlogconf-setup 2.0
ident   foreman hotproc metrics
probe   hotproc.control.config != "" ? include : exclude
        hotproc.psinfo.psargs
        hotproc.psinfo.cnswap
        hotproc.psinfo.nswap
        hotproc.psinfo.rss
        hotproc.psinfo.vsize
        hotproc.psinfo.cstime
        hotproc.psinfo.cutime
        hotproc.psinfo.stime
        hotproc.psinfo.utime
        hotproc.io.write_bytes
        hotproc.io.read_bytes
        hotproc.schedstat.cpu_time
        hotproc.fd.count

When I restart pmlogger on its own, it generates the following snippet in /var/lib/pcp/config/pmlogger/config.default:

#+ foreman-hotproc/summary:y:default:
## foreman hotproc metrics
log advisory on default {
	hotproc.psinfo.psargs
	hotproc.psinfo.cnswap
	hotproc.psinfo.nswap
	hotproc.psinfo.rss
	hotproc.psinfo.vsize
	hotproc.psinfo.cstime
	hotproc.psinfo.cutime
	hotproc.psinfo.stime
	hotproc.psinfo.utime
	hotproc.io.write_bytes
	hotproc.io.read_bytes
	hotproc.schedstat.cpu_time
	hotproc.fd.count
}
#----

However, when I restart both pmcd and pmlogger together (`systemctl restart pmcd pmlogger`), I get:
#+ foreman-hotproc/summary:x::
#----

Thus no data is logged by pmlogger.

I could not trivially reproduce the issue on a different machine, so I assume it is some race condition where pmcd did not yet have properly initialized hotproc.control.config and pmlogger thinks it is not set and drops the include.

Version-Release number of selected component (if applicable):
pcp-5.3.7-16.el8.x86_64

How reproducible:
100% on one machine
0% on another ;/

Steps to Reproduce:
1. have a "loaded" pmcd with hotproc
2. systemctl restart pmcd pmlogger

Actual results:
pmlogger does not include configuration as it thinks hotproc.control.config is empty

Expected results:
pmlogger includes the config correctly

Additional info:

Comment 1 Nathan Scott 2023-06-21 03:12:56 UTC
Oh, that's interesting - I've not come across this issue before.  Auditing the
pmlogger, pmcd and pmdaproc code, I haven't been able to find a way this could
be explained.  It's not possible for a client to connect to pmcd before agents
have started and the startup code for pmdaproc reads the hotproc config, so we
should always have a valid value for that metric.

There is also a systemd unit file dependency between pmlogger and pmcd, so the
two should not be able to start in parallel AFAICT.  Unless using that command
line invocation style ignored these deps?  Not likely - maybe pmcd was running
before the restart though, and maybe there's a race where pmlogger starts just
as pmcd is being stopped & then started).  But again I can't see a race within
pmcd/pmdaproc that could allow the metric to have no value.

In the meantime, there's a workaround that avoids any races completely.  When
pmlogger is started initially, and generates a configuration file you want to
use, you can then set this as the config to always be used.  That is achieved
by editing the pmlogger config file and removing the #pmlogconf comment line,
i.e. the very first line.  This prevents pmlogconf from rewriting the file in
the future.

I'll continue to look for the underlying race though as this shouldn't happen.

Comment 2 Evgeni Golov 2023-06-21 08:15:17 UTC
Okay, so the "restart together" part was a bit of a red herring.
The issue also happens when pmlogger is restarted on its own, just with a delay and that's why I did not notice it when reporting.

I've added a audit rule to watch for the config file, and it seems to be touched *twice* when I call `systemctl restart pmlogger`:

type=PROCTITLE msg=audit(06/21/2023 07:24:00.528:1012176) : proctitle=sed 1q /var/lib/pcp/config/pmlogger/config.default 
type=PATH msg=audit(06/21/2023 07:24:00.528:1012176) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135792 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:00.528:1012176) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:00.528:1012176) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffee659b7a9 a2=O_RDONLY a3=0x0 items=1 ppid=2410665 pid=2410715 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=sed exe=/usr/bin/sed subj=system_u:system_r:initrc_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:00.529:1012177) : proctitle=grep ^# Auto-generated by pmlogconf /var/lib/pcp/config/pmlogger/config.default 
type=PATH msg=audit(06/21/2023 07:24:00.529:1012177) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135792 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:00.529:1012177) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:00.529:1012177) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffd20cda7a7 a2=O_RDONLY|O_NOCTTY a3=0x0 items=1 ppid=2410665 pid=2410717 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=grep exe=/usr/bin/grep subj=system_u:system_r:initrc_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:00.531:1012178) : proctitle=cp /var/lib/pcp/config/pmlogger/config.default /var/lib/pcp/config/pmlogger/config.default.tmp 
type=PATH msg=audit(06/21/2023 07:24:00.531:1012178) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135792 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:00.531:1012178) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:00.531:1012178) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffea507877b a2=O_RDONLY a3=0x0 items=1 ppid=2410665 pid=2410718 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=cp exe=/usr/bin/cp subj=system_u:system_r:initrc_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:00.599:1012179) : proctitle=mv /var/lib/pcp/config/pmlogger/config.default.tmp /var/lib/pcp/config/pmlogger/config.default 
type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=4 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=3 name=/var/lib/pcp/config/pmlogger/config.default inode=135792 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=2 name=/var/lib/pcp/config/pmlogger/config.default.tmp inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=1 name=/var/lib/pcp/config/pmlogger/ inode=130398 dev=fd:00 mode=dir,775 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=0 name=/var/lib/pcp/config/pmlogger/ inode=130398 dev=fd:00 mode=dir,775 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:00.599:1012179) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:00.599:1012179) : arch=x86_64 syscall=rename success=yes exit=0 a0=0x7ffc4662e77b a1=0x7ffc4662e7ab a2=0x0 a3=0x100 items=5 ppid=2410665 pid=2410724 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=mv exe=/usr/bin/mv subj=system_u:system_r:initrc_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:00.624:1012180) : proctitle=/usr/libexec/pcp/bin/pmlogger -N -P -r -T24h10m -c config.default -v 100mb -m pmlogger_check %Y%m%d.%H.%M 
type=PATH msg=audit(06/21/2023 07:24:00.624:1012180) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:00.624:1012180) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:00.624:1012180) : arch=x86_64 syscall=openat success=yes exit=4 a0=AT_FDCWD a1=0x562c84acab00 a2=O_RDONLY a3=0x0 items=1 ppid=1 pid=2410730 auid=unset uid=pcp gid=pcp euid=pcp suid=pcp fsuid=pcp egid=pcp sgid=pcp fsgid=pcp tty=(none) ses=unset comm=pmlogger exe=/usr/bin/pmlogger subj=system_u:system_r:pcp_pmlogger_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:00.630:1012181) : proctitle=pmcpp -rs -I /var/lib/pcp/config/pmlogger /var/lib/pcp/config/pmlogger/config.default 
type=PATH msg=audit(06/21/2023 07:24:00.630:1012181) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:00.630:1012181) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:00.630:1012181) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffcbdb0c78a a2=O_RDONLY a3=0x0 items=1 ppid=2410730 pid=2410734 auid=unset uid=pcp gid=pcp euid=pcp suid=pcp fsuid=pcp egid=pcp sgid=pcp fsgid=pcp tty=(none) ses=unset comm=pmcpp exe=/usr/libexec/pcp/bin/pmcpp subj=system_u:system_r:pcp_pmlogger_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:14.143:1012188) : proctitle=sed 1q /var/lib/pcp/config/pmlogger/config.default 
type=PATH msg=audit(06/21/2023 07:24:14.143:1012188) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:14.143:1012188) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:14.143:1012188) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffcc2e387a9 a2=O_RDONLY a3=0x0 items=1 ppid=2411202 pid=2411252 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=sed exe=/usr/bin/sed subj=system_u:system_r:initrc_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:14.145:1012189) : proctitle=grep ^# Auto-generated by pmlogconf /var/lib/pcp/config/pmlogger/config.default 
type=PATH msg=audit(06/21/2023 07:24:14.145:1012189) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:14.145:1012189) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:14.145:1012189) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffd9b2da7a7 a2=O_RDONLY|O_NOCTTY a3=0x0 items=1 ppid=2411202 pid=2411254 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=grep exe=/usr/bin/grep subj=system_u:system_r:initrc_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:14.147:1012190) : proctitle=cp /var/lib/pcp/config/pmlogger/config.default /var/lib/pcp/config/pmlogger/config.default.tmp 
type=PATH msg=audit(06/21/2023 07:24:14.147:1012190) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:14.147:1012190) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:14.147:1012190) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffed039477b a2=O_RDONLY a3=0x0 items=1 ppid=2411202 pid=2411255 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=cp exe=/usr/bin/cp subj=system_u:system_r:initrc_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:14.171:1012191) : proctitle=mv /var/lib/pcp/config/pmlogger/config.default.tmp /var/lib/pcp/config/pmlogger/config.default 
type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=4 name=/var/lib/pcp/config/pmlogger/config.default inode=135780 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=3 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=2 name=/var/lib/pcp/config/pmlogger/config.default.tmp inode=135780 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=1 name=/var/lib/pcp/config/pmlogger/ inode=130398 dev=fd:00 mode=dir,775 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=0 name=/var/lib/pcp/config/pmlogger/ inode=130398 dev=fd:00 mode=dir,775 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:14.171:1012191) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:14.171:1012191) : arch=x86_64 syscall=rename success=yes exit=0 a0=0x7ffcbce3a77b a1=0x7ffcbce3a7ab a2=0x0 a3=0x100 items=5 ppid=2411202 pid=2411260 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=mv exe=/usr/bin/mv subj=system_u:system_r:initrc_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:14.194:1012192) : proctitle=/usr/libexec/pcp/bin/pmlogger -N -P -r -T24h10m -c config.default -v 100mb -m pmlogger_check %Y%m%d.%H.%M 
type=PATH msg=audit(06/21/2023 07:24:14.194:1012192) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135780 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:14.194:1012192) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:14.194:1012192) : arch=x86_64 syscall=openat success=yes exit=4 a0=AT_FDCWD a1=0x55be60415b00 a2=O_RDONLY a3=0x0 items=1 ppid=1 pid=2411266 auid=unset uid=pcp gid=pcp euid=pcp suid=pcp fsuid=pcp egid=pcp sgid=pcp fsgid=pcp tty=(none) ses=unset comm=pmlogger exe=/usr/bin/pmlogger subj=system_u:system_r:pcp_pmlogger_t:s0 key=pmloggerconf 
----
type=PROCTITLE msg=audit(06/21/2023 07:24:14.200:1012193) : proctitle=pmcpp -rs -I /var/lib/pcp/config/pmlogger /var/lib/pcp/config/pmlogger/config.default 
type=PATH msg=audit(06/21/2023 07:24:14.200:1012193) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135780 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(06/21/2023 07:24:14.200:1012193) : cwd=/var/log/pcp/pmlogger/sat.example.com 
type=SYSCALL msg=audit(06/21/2023 07:24:14.200:1012193) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffe0c2cc78a a2=O_RDONLY a3=0x0 items=1 ppid=2411266 pid=2411270 auid=unset uid=pcp gid=pcp euid=pcp suid=pcp fsuid=pcp egid=pcp sgid=pcp fsgid=pcp tty=(none) ses=unset comm=pmcpp exe=/usr/libexec/pcp/bin/pmcpp subj=system_u:system_r:pcp_pmlogger_t:s0 key=pmloggerconf 


On another machine (the one I said I could not repro this), the changes only happen once.

Does that make any sense to you?

Comment 3 Evgeni Golov 2023-06-21 08:33:21 UTC
It's also visible in var/log/pcp/NOTICES after enabling logging:

08:26:51.065 begin pid:2414287 /usr/libexec/pcp/lib/pmlogger args:stop-systemd [systemd(1)---pmlogger(2414287)]
08:26:51.155 stop pmlogger from /usr/libexec/pcp/lib/pmlogger
08:26:51.162 end pid:2414287 /usr/libexec/pcp/lib/pmlogger status=0
08:26:51.295 begin pid:2414353 /usr/libexec/pcp/lib/pmlogger args:start-systemd [systemd(1)---pmlogger(2414353)]
08:26:51.501 begin pid:2414415 pmlogger_check args:--quick -V --only-primary [systemd(1)---pmlogger(2414353)---pmlogger_check(2414415)]
08:26:51.809 start pmlogger from pmlogger_check for host local:
08:26:52.516 end pid:2414415 pmlogger_check status=0
08:26:52.518 begin pid:2414682 pmlogger_check args:--skip-primary --quick [systemd(1)---pmlogger_farm(2414667)---pmlogger_check(2414682)]
08:26:52.530 end pid:2414353 /usr/libexec/pcp/lib/pmlogger status=0
08:26:52.723 end pid:2414682 pmlogger_check status=0
08:27:06.073 begin pid:2414869 /usr/libexec/pcp/lib/pmlogger args:start-systemd [systemd(1)---pmlogger(2414869)]
08:27:06.285 begin pid:2414930 pmlogger_check args:--quick -V --only-primary [systemd(1)---pmlogger(2414869)---pmlogger_check(2414930)]
08:27:06.536 start pmlogger from pmlogger_check for host local:
08:27:06.851 end pid:2414930 pmlogger_check status=0
08:27:06.864 begin pid:2415143 pmlogger_check args:--skip-primary --quick [systemd(1)---pmlogger_farm(2415130)---pmlogger_check(2415143)]
08:27:06.864 end pid:2414869 /usr/libexec/pcp/lib/pmlogger status=0
08:27:07.066 end pid:2415143 pmlogger_check status=0

And systemd logs a FAILURE inbetween?!

Jun 21 08:26:50 sat.example.com systemd[1]: Stopping Performance Metrics Archive Logger...
Jun 21 08:26:51 sat.example.com systemd[1]: pmlogger.service: Succeeded.
Jun 21 08:26:51 sat.example.com systemd[1]: Stopped Performance Metrics Archive Logger.
Jun 21 08:26:51 sat.example.com systemd[1]: Starting Performance Metrics Archive Logger...
Jun 21 08:26:52 sat.example.com systemd[1]: Started Performance Metrics Archive Logger.
Jun 21 08:27:05 sat.example.com systemd[1]: pmlogger.service: Main process exited, code=exited, status=1/FAILURE
Jun 21 08:27:05 sat.example.com systemd[1]: pmlogger.service: Failed with result 'exit-code'.
Jun 21 08:27:05 sat.example.com systemd[1]: pmlogger.service: Service RestartSec=100ms expired, scheduling restart.
Jun 21 08:27:05 sat.example.com systemd[1]: pmlogger.service: Scheduled restart job, restart counter is at 1.
Jun 21 08:27:05 sat.example.com systemd[1]: Stopped Performance Metrics Archive Logger.
Jun 21 08:27:05 sat.example.com systemd[1]: Starting Performance Metrics Archive Logger...
Jun 21 08:27:06 sat.example.com systemd[1]: Started Performance Metrics Archive Logger.

Let's look at the logs of pmlogger itself…

pmlogger.log.prev:

Log for pmlogger on sat.example.com started Wed Jun 21 08:26:51 2023

preprocessor cmd: /usr/libexec/pcp/bin/pmcpp -rs -I /var/lib/pcp/config/pmlogger /var/lib/pcp/config/pmlogger/config.default
Config parsed
Starting primary logger for host "sat.example.com" via "local:"
Archive basename: 20230621.08.26

Group [81 metrics] {
…
} logged once: 14532 bytes
pmGetInDom(144.8236): Operation not permitted

Log finished Wed Jun 21 08:27:05 2023

pmlogger.log:
Log for pmlogger on sat.example.com started Wed Jun 21 08:27:06 2023

preprocessor cmd: /usr/libexec/pcp/bin/pmcpp -rs -I /var/lib/pcp/config/pmlogger /var/lib/pcp/config/pmlogger/config.default
…
Config parsed
Starting primary logger for host "sat.example.com" via "local:"
Archive basename: 20230621.08.27

Group [69 metrics] {
…
} logged once: 14208 bytes

Group [136 metrics] {
…
} logged every 60 sec: 61276 bytes or 84.15 Mbytes/day

Comment 4 Evgeni Golov 2023-06-21 09:57:59 UTC
pmGetInDom(144.8236) led me to /var/lib/pcp/config/pmda/144.8236.py which was a dump of the openmetrics config for the process exporter.
after disabling that one openmetrics endpoint (and keeping several others), pmlogger starts correctly and fetches all logs.
re-enabling process exporter kills it again.

great. I can, for now, live without process-exporter.

what data would you need from the affected system to analyze this further?

Comment 5 Nathan Scott 2023-06-22 07:43:36 UTC
| On another machine (the one I said I could not repro this), the changes only happen once.
| 
| Does that make any sense to you?

Hmm, I wonder if they might be on slightly different RHEL versions, or zstream patch levels?

I do recall a PCP-systemd interaction buglet that sounds a little like your description (pmlogger starting & getting promptly timed out / killed, then restarted) that someone else fixed a ways back - perhaps that fix is absent there.

ps: I'm away on PTO shortly, back on 3rd July.

Comment 6 Evgeni Golov 2023-06-22 08:07:29 UTC
It's pcp-5.3.7-16.el8.x86_64 (RHEL 8.8) vs pcp-5.3.7-17.el8.x86_64 (CentOS Stream 8), but the diff doesn't look any suspicious to me in that regard.

I think what happens is more:
- it tries to start
- crashes because of the openmetrics "issue" I mention in Comment 4
- systemd restarts it (pmlogger has Restart=always in /usr/lib/systemd/system/pmlogger.service)
- during this (second) start it for some reason generates and loads fewer rules and succeeds

Comment 7 Ken McDonell 2023-06-26 22:26:03 UTC
Hi Evgeni.

Your analysis about pmlogger being started twice looks sound.

The pmGetInDom failure will kill pmlogger as it cannot proceed without the instance domain details.
I have no clue what InDom 144.8236 is for the openmetrics PMDA (the 8236 looks suspiciously large), and the /var/lib/pcp/config/pmda/144.8236.py file is a mistake in the openmetrics PMDA implementation, that file should not be living in that directory ... but I don't think this would trigger the pmGetInDom failure.

But even if pmlogger is being restarted twice (that's a bug we need to fix in the openmetrics PMDA), it does not explain your original pmlogconf issue.

Around the time of your pmlogconf problem, can you check to see if pmcd was restarted?
$ pminfo -f pmcd.pid
before and after the systemctl restart pmlogger would prove this one way or the other.

This would help us narrow down the focus ... (a) did the hotproc PMDA fail to return a value, or (b) did pmcd die.  I doubt (b) but it would be good to eliminate this possibility.

After that, we're going to need more diagnostics I suspect.  Are you in a position to run a modified version of the pmlogger_check shell script on this system?

Comment 8 Evgeni Golov 2023-06-27 07:48:24 UTC
Hi Ken,

(In reply to Ken McDonell from comment #7)
> Hi Evgeni.
> 
> Your analysis about pmlogger being started twice looks sound.
> 
> The pmGetInDom failure will kill pmlogger as it cannot proceed without the
> instance domain details.
> I have no clue what InDom 144.8236 is for the openmetrics PMDA (the 8236
> looks suspiciously large), and the /var/lib/pcp/config/pmda/144.8236.py file
> is a mistake in the openmetrics PMDA implementation, that file should not be
> living in that directory ... but I don't think this would trigger the
> pmGetInDom failure.

What is the meaning of the "8236" here?
Is that the number of metrics? I can totally see there to be 8k-ish metrics coming via openmetrics here.
( if that's a problem in itself, we need to talk, but outside of this BZ ;-) )

> But even if pmlogger is being restarted twice (that's a bug we need to fix
> in the openmetrics PMDA), it does not explain your original pmlogconf issue.

Agreed.

> Around the time of your pmlogconf problem, can you check to see if pmcd was
> restarted?
> $ pminfo -f pmcd.pid
> before and after the systemctl restart pmlogger would prove this one way or
> the other.

While the "bad" openmetrics config is deactivated, the value remains stable, so no restart.
Same after I enable it back and see "pmGetInDom(144.8236): Operation not permitted" again.

So I'd say no restarts here.

> This would help us narrow down the focus ... (a) did the hotproc PMDA fail
> to return a value, or (b) did pmcd die.  I doubt (b) but it would be good to
> eliminate this possibility.
> 
> After that, we're going to need more diagnostics I suspect.  Are you in a
> position to run a modified version of the pmlogger_check shell script on
> this system?

Absolutely. Just throw either a brew build or that file as a whole at me and I'll try out.