Description of problem: I have a Satellite 6.13 system with PCP installed. To monitor a few interesting processes, I have enabled hotproc: # cat /var/lib/pcp/pmdas/proc/hotproc.conf #pmdahotproc Version 1.0 ( fname == "java" || fname == "redis-server" || (fname == "postgres" && psargs ~ /-D/) || psargs ~ /puma/ || psargs ~ /pulpcore.app.wsgi:application/ || psargs ~ /pulpcore.content:server/ || psargs ~ /pulpcore-worker/ || psargs ~ /smart-proxy/ ) # cat /var/lib/pcp/config/pmlogconf/foreman-hotproc/summary #pmlogconf-setup 2.0 ident foreman hotproc metrics probe hotproc.control.config != "" ? include : exclude hotproc.psinfo.psargs hotproc.psinfo.cnswap hotproc.psinfo.nswap hotproc.psinfo.rss hotproc.psinfo.vsize hotproc.psinfo.cstime hotproc.psinfo.cutime hotproc.psinfo.stime hotproc.psinfo.utime hotproc.io.write_bytes hotproc.io.read_bytes hotproc.schedstat.cpu_time hotproc.fd.count When I restart pmlogger on its own, it generates the following snippet in /var/lib/pcp/config/pmlogger/config.default: #+ foreman-hotproc/summary:y:default: ## foreman hotproc metrics log advisory on default { hotproc.psinfo.psargs hotproc.psinfo.cnswap hotproc.psinfo.nswap hotproc.psinfo.rss hotproc.psinfo.vsize hotproc.psinfo.cstime hotproc.psinfo.cutime hotproc.psinfo.stime hotproc.psinfo.utime hotproc.io.write_bytes hotproc.io.read_bytes hotproc.schedstat.cpu_time hotproc.fd.count } #---- However, when I restart both pmcd and pmlogger together (`systemctl restart pmcd pmlogger`), I get: #+ foreman-hotproc/summary:x:: #---- Thus no data is logged by pmlogger. I could not trivially reproduce the issue on a different machine, so I assume it is some race condition where pmcd did not yet have properly initialized hotproc.control.config and pmlogger thinks it is not set and drops the include. Version-Release number of selected component (if applicable): pcp-5.3.7-16.el8.x86_64 How reproducible: 100% on one machine 0% on another ;/ Steps to Reproduce: 1. have a "loaded" pmcd with hotproc 2. systemctl restart pmcd pmlogger Actual results: pmlogger does not include configuration as it thinks hotproc.control.config is empty Expected results: pmlogger includes the config correctly Additional info:
Oh, that's interesting - I've not come across this issue before. Auditing the pmlogger, pmcd and pmdaproc code, I haven't been able to find a way this could be explained. It's not possible for a client to connect to pmcd before agents have started and the startup code for pmdaproc reads the hotproc config, so we should always have a valid value for that metric. There is also a systemd unit file dependency between pmlogger and pmcd, so the two should not be able to start in parallel AFAICT. Unless using that command line invocation style ignored these deps? Not likely - maybe pmcd was running before the restart though, and maybe there's a race where pmlogger starts just as pmcd is being stopped & then started). But again I can't see a race within pmcd/pmdaproc that could allow the metric to have no value. In the meantime, there's a workaround that avoids any races completely. When pmlogger is started initially, and generates a configuration file you want to use, you can then set this as the config to always be used. That is achieved by editing the pmlogger config file and removing the #pmlogconf comment line, i.e. the very first line. This prevents pmlogconf from rewriting the file in the future. I'll continue to look for the underlying race though as this shouldn't happen.
Okay, so the "restart together" part was a bit of a red herring. The issue also happens when pmlogger is restarted on its own, just with a delay and that's why I did not notice it when reporting. I've added a audit rule to watch for the config file, and it seems to be touched *twice* when I call `systemctl restart pmlogger`: type=PROCTITLE msg=audit(06/21/2023 07:24:00.528:1012176) : proctitle=sed 1q /var/lib/pcp/config/pmlogger/config.default type=PATH msg=audit(06/21/2023 07:24:00.528:1012176) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135792 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:00.528:1012176) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:00.528:1012176) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffee659b7a9 a2=O_RDONLY a3=0x0 items=1 ppid=2410665 pid=2410715 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=sed exe=/usr/bin/sed subj=system_u:system_r:initrc_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:00.529:1012177) : proctitle=grep ^# Auto-generated by pmlogconf /var/lib/pcp/config/pmlogger/config.default type=PATH msg=audit(06/21/2023 07:24:00.529:1012177) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135792 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:00.529:1012177) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:00.529:1012177) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffd20cda7a7 a2=O_RDONLY|O_NOCTTY a3=0x0 items=1 ppid=2410665 pid=2410717 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=grep exe=/usr/bin/grep subj=system_u:system_r:initrc_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:00.531:1012178) : proctitle=cp /var/lib/pcp/config/pmlogger/config.default /var/lib/pcp/config/pmlogger/config.default.tmp type=PATH msg=audit(06/21/2023 07:24:00.531:1012178) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135792 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:00.531:1012178) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:00.531:1012178) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffea507877b a2=O_RDONLY a3=0x0 items=1 ppid=2410665 pid=2410718 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=cp exe=/usr/bin/cp subj=system_u:system_r:initrc_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:00.599:1012179) : proctitle=mv /var/lib/pcp/config/pmlogger/config.default.tmp /var/lib/pcp/config/pmlogger/config.default type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=4 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=3 name=/var/lib/pcp/config/pmlogger/config.default inode=135792 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=2 name=/var/lib/pcp/config/pmlogger/config.default.tmp inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=1 name=/var/lib/pcp/config/pmlogger/ inode=130398 dev=fd:00 mode=dir,775 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=PATH msg=audit(06/21/2023 07:24:00.599:1012179) : item=0 name=/var/lib/pcp/config/pmlogger/ inode=130398 dev=fd:00 mode=dir,775 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:00.599:1012179) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:00.599:1012179) : arch=x86_64 syscall=rename success=yes exit=0 a0=0x7ffc4662e77b a1=0x7ffc4662e7ab a2=0x0 a3=0x100 items=5 ppid=2410665 pid=2410724 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=mv exe=/usr/bin/mv subj=system_u:system_r:initrc_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:00.624:1012180) : proctitle=/usr/libexec/pcp/bin/pmlogger -N -P -r -T24h10m -c config.default -v 100mb -m pmlogger_check %Y%m%d.%H.%M type=PATH msg=audit(06/21/2023 07:24:00.624:1012180) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:00.624:1012180) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:00.624:1012180) : arch=x86_64 syscall=openat success=yes exit=4 a0=AT_FDCWD a1=0x562c84acab00 a2=O_RDONLY a3=0x0 items=1 ppid=1 pid=2410730 auid=unset uid=pcp gid=pcp euid=pcp suid=pcp fsuid=pcp egid=pcp sgid=pcp fsgid=pcp tty=(none) ses=unset comm=pmlogger exe=/usr/bin/pmlogger subj=system_u:system_r:pcp_pmlogger_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:00.630:1012181) : proctitle=pmcpp -rs -I /var/lib/pcp/config/pmlogger /var/lib/pcp/config/pmlogger/config.default type=PATH msg=audit(06/21/2023 07:24:00.630:1012181) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:00.630:1012181) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:00.630:1012181) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffcbdb0c78a a2=O_RDONLY a3=0x0 items=1 ppid=2410730 pid=2410734 auid=unset uid=pcp gid=pcp euid=pcp suid=pcp fsuid=pcp egid=pcp sgid=pcp fsgid=pcp tty=(none) ses=unset comm=pmcpp exe=/usr/libexec/pcp/bin/pmcpp subj=system_u:system_r:pcp_pmlogger_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:14.143:1012188) : proctitle=sed 1q /var/lib/pcp/config/pmlogger/config.default type=PATH msg=audit(06/21/2023 07:24:14.143:1012188) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:14.143:1012188) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:14.143:1012188) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffcc2e387a9 a2=O_RDONLY a3=0x0 items=1 ppid=2411202 pid=2411252 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=sed exe=/usr/bin/sed subj=system_u:system_r:initrc_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:14.145:1012189) : proctitle=grep ^# Auto-generated by pmlogconf /var/lib/pcp/config/pmlogger/config.default type=PATH msg=audit(06/21/2023 07:24:14.145:1012189) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:14.145:1012189) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:14.145:1012189) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffd9b2da7a7 a2=O_RDONLY|O_NOCTTY a3=0x0 items=1 ppid=2411202 pid=2411254 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=grep exe=/usr/bin/grep subj=system_u:system_r:initrc_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:14.147:1012190) : proctitle=cp /var/lib/pcp/config/pmlogger/config.default /var/lib/pcp/config/pmlogger/config.default.tmp type=PATH msg=audit(06/21/2023 07:24:14.147:1012190) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:14.147:1012190) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:14.147:1012190) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffed039477b a2=O_RDONLY a3=0x0 items=1 ppid=2411202 pid=2411255 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=cp exe=/usr/bin/cp subj=system_u:system_r:initrc_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:14.171:1012191) : proctitle=mv /var/lib/pcp/config/pmlogger/config.default.tmp /var/lib/pcp/config/pmlogger/config.default type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=4 name=/var/lib/pcp/config/pmlogger/config.default inode=135780 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=3 name=/var/lib/pcp/config/pmlogger/config.default inode=135767 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=2 name=/var/lib/pcp/config/pmlogger/config.default.tmp inode=135780 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=1 name=/var/lib/pcp/config/pmlogger/ inode=130398 dev=fd:00 mode=dir,775 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=PATH msg=audit(06/21/2023 07:24:14.171:1012191) : item=0 name=/var/lib/pcp/config/pmlogger/ inode=130398 dev=fd:00 mode=dir,775 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:14.171:1012191) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:14.171:1012191) : arch=x86_64 syscall=rename success=yes exit=0 a0=0x7ffcbce3a77b a1=0x7ffcbce3a7ab a2=0x0 a3=0x100 items=5 ppid=2411202 pid=2411260 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=mv exe=/usr/bin/mv subj=system_u:system_r:initrc_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:14.194:1012192) : proctitle=/usr/libexec/pcp/bin/pmlogger -N -P -r -T24h10m -c config.default -v 100mb -m pmlogger_check %Y%m%d.%H.%M type=PATH msg=audit(06/21/2023 07:24:14.194:1012192) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135780 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:14.194:1012192) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:14.194:1012192) : arch=x86_64 syscall=openat success=yes exit=4 a0=AT_FDCWD a1=0x55be60415b00 a2=O_RDONLY a3=0x0 items=1 ppid=1 pid=2411266 auid=unset uid=pcp gid=pcp euid=pcp suid=pcp fsuid=pcp egid=pcp sgid=pcp fsgid=pcp tty=(none) ses=unset comm=pmlogger exe=/usr/bin/pmlogger subj=system_u:system_r:pcp_pmlogger_t:s0 key=pmloggerconf ---- type=PROCTITLE msg=audit(06/21/2023 07:24:14.200:1012193) : proctitle=pmcpp -rs -I /var/lib/pcp/config/pmlogger /var/lib/pcp/config/pmlogger/config.default type=PATH msg=audit(06/21/2023 07:24:14.200:1012193) : item=0 name=/var/lib/pcp/config/pmlogger/config.default inode=135780 dev=fd:00 mode=file,644 ouid=pcp ogid=pcp rdev=00:00 obj=system_u:object_r:pcp_var_lib_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 type=CWD msg=audit(06/21/2023 07:24:14.200:1012193) : cwd=/var/log/pcp/pmlogger/sat.example.com type=SYSCALL msg=audit(06/21/2023 07:24:14.200:1012193) : arch=x86_64 syscall=openat success=yes exit=3 a0=AT_FDCWD a1=0x7ffe0c2cc78a a2=O_RDONLY a3=0x0 items=1 ppid=2411266 pid=2411270 auid=unset uid=pcp gid=pcp euid=pcp suid=pcp fsuid=pcp egid=pcp sgid=pcp fsgid=pcp tty=(none) ses=unset comm=pmcpp exe=/usr/libexec/pcp/bin/pmcpp subj=system_u:system_r:pcp_pmlogger_t:s0 key=pmloggerconf On another machine (the one I said I could not repro this), the changes only happen once. Does that make any sense to you?
It's also visible in var/log/pcp/NOTICES after enabling logging: 08:26:51.065 begin pid:2414287 /usr/libexec/pcp/lib/pmlogger args:stop-systemd [systemd(1)---pmlogger(2414287)] 08:26:51.155 stop pmlogger from /usr/libexec/pcp/lib/pmlogger 08:26:51.162 end pid:2414287 /usr/libexec/pcp/lib/pmlogger status=0 08:26:51.295 begin pid:2414353 /usr/libexec/pcp/lib/pmlogger args:start-systemd [systemd(1)---pmlogger(2414353)] 08:26:51.501 begin pid:2414415 pmlogger_check args:--quick -V --only-primary [systemd(1)---pmlogger(2414353)---pmlogger_check(2414415)] 08:26:51.809 start pmlogger from pmlogger_check for host local: 08:26:52.516 end pid:2414415 pmlogger_check status=0 08:26:52.518 begin pid:2414682 pmlogger_check args:--skip-primary --quick [systemd(1)---pmlogger_farm(2414667)---pmlogger_check(2414682)] 08:26:52.530 end pid:2414353 /usr/libexec/pcp/lib/pmlogger status=0 08:26:52.723 end pid:2414682 pmlogger_check status=0 08:27:06.073 begin pid:2414869 /usr/libexec/pcp/lib/pmlogger args:start-systemd [systemd(1)---pmlogger(2414869)] 08:27:06.285 begin pid:2414930 pmlogger_check args:--quick -V --only-primary [systemd(1)---pmlogger(2414869)---pmlogger_check(2414930)] 08:27:06.536 start pmlogger from pmlogger_check for host local: 08:27:06.851 end pid:2414930 pmlogger_check status=0 08:27:06.864 begin pid:2415143 pmlogger_check args:--skip-primary --quick [systemd(1)---pmlogger_farm(2415130)---pmlogger_check(2415143)] 08:27:06.864 end pid:2414869 /usr/libexec/pcp/lib/pmlogger status=0 08:27:07.066 end pid:2415143 pmlogger_check status=0 And systemd logs a FAILURE inbetween?! Jun 21 08:26:50 sat.example.com systemd[1]: Stopping Performance Metrics Archive Logger... Jun 21 08:26:51 sat.example.com systemd[1]: pmlogger.service: Succeeded. Jun 21 08:26:51 sat.example.com systemd[1]: Stopped Performance Metrics Archive Logger. Jun 21 08:26:51 sat.example.com systemd[1]: Starting Performance Metrics Archive Logger... Jun 21 08:26:52 sat.example.com systemd[1]: Started Performance Metrics Archive Logger. Jun 21 08:27:05 sat.example.com systemd[1]: pmlogger.service: Main process exited, code=exited, status=1/FAILURE Jun 21 08:27:05 sat.example.com systemd[1]: pmlogger.service: Failed with result 'exit-code'. Jun 21 08:27:05 sat.example.com systemd[1]: pmlogger.service: Service RestartSec=100ms expired, scheduling restart. Jun 21 08:27:05 sat.example.com systemd[1]: pmlogger.service: Scheduled restart job, restart counter is at 1. Jun 21 08:27:05 sat.example.com systemd[1]: Stopped Performance Metrics Archive Logger. Jun 21 08:27:05 sat.example.com systemd[1]: Starting Performance Metrics Archive Logger... Jun 21 08:27:06 sat.example.com systemd[1]: Started Performance Metrics Archive Logger. Let's look at the logs of pmlogger itself… pmlogger.log.prev: Log for pmlogger on sat.example.com started Wed Jun 21 08:26:51 2023 preprocessor cmd: /usr/libexec/pcp/bin/pmcpp -rs -I /var/lib/pcp/config/pmlogger /var/lib/pcp/config/pmlogger/config.default Config parsed Starting primary logger for host "sat.example.com" via "local:" Archive basename: 20230621.08.26 Group [81 metrics] { … } logged once: 14532 bytes pmGetInDom(144.8236): Operation not permitted Log finished Wed Jun 21 08:27:05 2023 pmlogger.log: Log for pmlogger on sat.example.com started Wed Jun 21 08:27:06 2023 preprocessor cmd: /usr/libexec/pcp/bin/pmcpp -rs -I /var/lib/pcp/config/pmlogger /var/lib/pcp/config/pmlogger/config.default … Config parsed Starting primary logger for host "sat.example.com" via "local:" Archive basename: 20230621.08.27 Group [69 metrics] { … } logged once: 14208 bytes Group [136 metrics] { … } logged every 60 sec: 61276 bytes or 84.15 Mbytes/day
pmGetInDom(144.8236) led me to /var/lib/pcp/config/pmda/144.8236.py which was a dump of the openmetrics config for the process exporter. after disabling that one openmetrics endpoint (and keeping several others), pmlogger starts correctly and fetches all logs. re-enabling process exporter kills it again. great. I can, for now, live without process-exporter. what data would you need from the affected system to analyze this further?
| On another machine (the one I said I could not repro this), the changes only happen once. | | Does that make any sense to you? Hmm, I wonder if they might be on slightly different RHEL versions, or zstream patch levels? I do recall a PCP-systemd interaction buglet that sounds a little like your description (pmlogger starting & getting promptly timed out / killed, then restarted) that someone else fixed a ways back - perhaps that fix is absent there. ps: I'm away on PTO shortly, back on 3rd July.
It's pcp-5.3.7-16.el8.x86_64 (RHEL 8.8) vs pcp-5.3.7-17.el8.x86_64 (CentOS Stream 8), but the diff doesn't look any suspicious to me in that regard. I think what happens is more: - it tries to start - crashes because of the openmetrics "issue" I mention in Comment 4 - systemd restarts it (pmlogger has Restart=always in /usr/lib/systemd/system/pmlogger.service) - during this (second) start it for some reason generates and loads fewer rules and succeeds
Hi Evgeni. Your analysis about pmlogger being started twice looks sound. The pmGetInDom failure will kill pmlogger as it cannot proceed without the instance domain details. I have no clue what InDom 144.8236 is for the openmetrics PMDA (the 8236 looks suspiciously large), and the /var/lib/pcp/config/pmda/144.8236.py file is a mistake in the openmetrics PMDA implementation, that file should not be living in that directory ... but I don't think this would trigger the pmGetInDom failure. But even if pmlogger is being restarted twice (that's a bug we need to fix in the openmetrics PMDA), it does not explain your original pmlogconf issue. Around the time of your pmlogconf problem, can you check to see if pmcd was restarted? $ pminfo -f pmcd.pid before and after the systemctl restart pmlogger would prove this one way or the other. This would help us narrow down the focus ... (a) did the hotproc PMDA fail to return a value, or (b) did pmcd die. I doubt (b) but it would be good to eliminate this possibility. After that, we're going to need more diagnostics I suspect. Are you in a position to run a modified version of the pmlogger_check shell script on this system?
Hi Ken, (In reply to Ken McDonell from comment #7) > Hi Evgeni. > > Your analysis about pmlogger being started twice looks sound. > > The pmGetInDom failure will kill pmlogger as it cannot proceed without the > instance domain details. > I have no clue what InDom 144.8236 is for the openmetrics PMDA (the 8236 > looks suspiciously large), and the /var/lib/pcp/config/pmda/144.8236.py file > is a mistake in the openmetrics PMDA implementation, that file should not be > living in that directory ... but I don't think this would trigger the > pmGetInDom failure. What is the meaning of the "8236" here? Is that the number of metrics? I can totally see there to be 8k-ish metrics coming via openmetrics here. ( if that's a problem in itself, we need to talk, but outside of this BZ ;-) ) > But even if pmlogger is being restarted twice (that's a bug we need to fix > in the openmetrics PMDA), it does not explain your original pmlogconf issue. Agreed. > Around the time of your pmlogconf problem, can you check to see if pmcd was > restarted? > $ pminfo -f pmcd.pid > before and after the systemctl restart pmlogger would prove this one way or > the other. While the "bad" openmetrics config is deactivated, the value remains stable, so no restart. Same after I enable it back and see "pmGetInDom(144.8236): Operation not permitted" again. So I'd say no restarts here. > This would help us narrow down the focus ... (a) did the hotproc PMDA fail > to return a value, or (b) did pmcd die. I doubt (b) but it would be good to > eliminate this possibility. > > After that, we're going to need more diagnostics I suspect. Are you in a > position to run a modified version of the pmlogger_check shell script on > this system? Absolutely. Just throw either a brew build or that file as a whole at me and I'll try out.