Description of problem:
pmnewlog creates a temporary directory containing a config file for a new instance of pmlogger. Unfortunately this directory is not accessible by pcp:pcp user, so the new instance of pmlogger exits immediately because it can not read the config file.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install pcp on a fresh machine with RHEL-8.2
# yum install -y pcp-zeroconf
2. Run pmnewlog script
# /usr/libexec/pcp/bin/pmnewlog -V <archive>
* An error message is printed on terminal:
"Sorry, but this is fatal. No new pmlogger instance has been started."
* File "/var/log/pcp/pmlogger/$(hostname)/pmlogger.log.prior" contains a message similar to this one:
pmlogger: Cannot open config file "/tmp/pcp.IHPHeoWXJ/config": Permission denied
* No new instance of pmlogger is started
* No error message is printed on terminal
* No error message in log files
* A new pmlogger instance is started
This is strange, however having the pmnewlog script fixed will lead to possibly more complicated situation.
The reason is that pmnewlog (triggered from cron via pmlogger_daily script) starts a new instance of pmlooger which is unfortunately outside of systemd control. As such, a successful run of pmnewlog confuses systemd which starts a new instance of pmlogger, in parallel to the one started by pmnewlog. Such behaviour leads to parallel run on two primary pmlogger instances.
Proposed fix: https://github.com/performancecopilot/pcp/pull/858
(In reply to Jan Kurik from comment #0)
> Additional info:
> This is strange, however having the pmnewlog script fixed will lead to
> possibly more complicated situation.
> The reason is that pmnewlog (triggered from cron via pmlogger_daily script)
> starts a new instance of pmlooger which is unfortunately outside of systemd
> control. As such, a successful run of pmnewlog confuses systemd which starts
> a new instance of pmlogger, in parallel to the one started by pmnewlog. Such
> behaviour leads to parallel run on two primary pmlogger instances.
Here's what actually happens - almost the same as above but in much more detail:
pmlogger_daily is started by the pmlogger_daily.timer (or the poll unit) usually at 00:10am and creates a lock file, and then runs pmnewlog to roll the logs
pmnewlog kills the existing pmlogger and then (initially) sleeps for 5 seconds
meanwhile, the pmlogger systemd service unit (which has restart=always) notices pmlogger has exited
systemd launches a new pmlogger rc script (/usr/share/pcp/lib/pmlogger), which blocks on the lock file (still held by pmlogger_daily)
pmnewlog wakes up and loops until the pmlogger pid that it killed is dead
once it's dead, pmnewlog then determines the new archive name and does a bit of other stuff and then directly runs a new pmlogger process (NOT via the rc script), which forks and daemonises.
pmnewlog then loops waiting for a pmlc connection to the new pmlogger to succeed.
the pmlogger launched by pmnewlog writes it's pidfile and enters it's main loop, which then allows pmnewlog's pmlc to connect and so pmnewlog resumes.
meanwhile, the pmlogger rc script launched by systemd times out (usually, but not always). systemd notices and reads the pidfile and kills (with SIGINT) the pmlogger that was directly started by pmnewlog (this is where systemd gets confused).
systemd then launches a new pmlogger via the rc script, which first waits until pmlogger_daily releases the lock (which happens once pmnewlog returns)
the new pmlogger runs under systemd until the next pmlogger_daily cycle ....
Incredibly, the above is actually benign - we end up with a running pmlogger service under systemd's management. And it also works on non-systemd platforms. But clearly it's ridiculously convoluted!
The fix (on systemd platforms only) will be for pmnewlog to NOT launch a rogue pmlogger after killing the existing process. It will be launched by systemd via the pmlogger rc script soon after pmnewlog kills the existing pmlogger. The newly launched pmlogger rc script will wait on the lock held by pmlogger_daily, which will be released when pmnewlog returns. Once the lock is released, the rc script will launch a new pmlogger process under systemd's control.
Before that fix however, we need to change to type=notify for all PCP services (not just pmlogger). I have patches almost ready for that. Once that's all working, we have to make additional changes using systemd templates for pmlogger farms (and similar for pmie farms too).
After we've finished re-working the pmlogger logutil scripts and systemd unit configs, it's likely the problematic pmnewlog script will be retired (instead pmlogger will essentially logroll on SIGUSR1, so there will be no need to launch another one). But we'll keep this open to be sure we set the right modes on the config.
pmnewlog was deprecated in pcp-5.1.0 as part of the systemd unit config overhaul to change to type=notify. Hence DUP'ing to the RHEL-8.3 PCP rebase to pcp-5.1+
*** This bug has been marked as a duplicate of bug 1792971 ***
Verified on pcp-5.1.1-2.el8 build:
# usr/libexec/pcp/bin/pmnewlog -V <archive>
pmnewlog: functionality retired and script will be removed in a future release
Because this is retired now, marking the bug as DEFERRED.