Bug 1832262

Summary: pmie_daily.service runs as pcp but tries to do root things
Product: Red Hat Enterprise Linux 8 Reporter: Uwe Menges <uwe.menges>
Component: pcpAssignee: Nathan Scott <nathans>
Status: CLOSED ERRATA QA Contact: Jan Kurik <jkurik>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 8.1CC: agerstmayr, an.andreas+fedora, ariaanbruinsma, frank.ramsay, jkurik, kenj, mgoodwin, nathans, patrickm, rja
Target Milestone: rcKeywords: Bugfix, Triaged
Target Release: 8.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pcp-5.1.1 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-04 03:00:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Uwe Menges 2020-05-06 12:20:31 UTC
Description of problem:
The pmie_daily.service file has "User=pcp" but tries to do things that only root can do, eg. see in /var/log/pcp/pmie/pmie_daily.log:
/usr/libexec/pcp/bin/pmie_daily: line 334: cd: /root: Permission denied

Version-Release number of selected component (if applicable):
pcp-4.3.2-2.el8.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL 8.1 with eg. server pattern that brings pcp
2. Run the system
3. Observe pmie_daily.service fail, find reason in /var/log/pcp/pmie/pmie_daily.log

Actual results:
pmie_daily.service does not work and gives errors

Expected results:
pmie_daily.service works

Additional info:

Comment 1 Nathan Scott 2020-05-07 01:18:50 UTC
Thanks Uwe - I can see the problem, but don't understand why noone else (myself included) has ever observed this issue before!  (makes me wonder if something has changed in systemd in this area, for this specific release).  Anyway...

The pmie_daily script changes to `pwd` in its inner loop and is indeed expecting to have been started with that set to somewhere readable for the user its run as (pcp).  That would typically be:

$ grep pcp: /etc/passwd
pcp:x:386:386:Performance Co-Pilot:/var/lib/pcp:/sbin/nologin

... but clearly it has /root as `pwd`, so its in a half-baked state wrt running as pcp user.

I see there is a way to direct systemd unit files to force the working directory, perhaps that's our best option here.

Comment 2 Nathan Scott 2020-05-07 03:00:40 UTC
I've pushed the fix below to upstream PCP and we'll ensure its in the next RHEL-8 release Uwe.

In the meantime you should be able to workaround the issue locally by adding WorkingDirectory= lines to your local pmie_daily and pmie_check systemd unit files (/usr/lib/systemd/system/pmie_check.service and /usr/lib/systemd/system/pmie_daily.service) and then running 'systemctl daemon-reload'.  Since its not failing for me, it would be excellent if you could confirm this resolves the problem as we expect.


commit 9cc8e9ba741249d1d53813669ded2c247063d762
Author: Nathan Scott <nathans>
Date:   Thu May 7 12:54:32 2020 +1000

    pmie,pmlogger_check,daily: set systemd working directory
    
    We've had a report of failure in pmie_daily related to attempting
    to set cwd to an unreadable directory (/root).  Although not able
    to reproduce it locally, I believe the problem can be resolved by
    forcing systemd to set the working directory to pcp user accounts
    homedir (/var/lib/pcp) which is always readable by pcp.
    
    Resolves Red Hat BZ #1832262.

Comment 3 Uwe Menges 2020-05-07 09:32:25 UTC
Thanks!

I edited both services with `systemctl edit`. Note that pmie_check.service wasn't listed in `systemctl --failed`.

I no longer have the `cd` error message in /var/log/pcp/pmie/pmie_daily.log.
However, the pmie_daily.service still fails and the log contains this:

# cat /var/log/pcp/pmie/pmie_daily.log
pmie_daily: Error: no pmie instance running for host "local:"
[/etc/pcp/pmie/control.d/local:5] ... inference engine for host "local:" unchanged

# systemctl status pmie_daily.service 
● pmie_daily.service - Process PMIE logs
   Loaded: loaded (/usr/lib/systemd/system/pmie_daily.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/pmie_daily.service.d
           └─override.conf
   Active: failed (Result: exit-code) since Thu 2020-05-07 11:28:35 CEST; 57s ago
     Docs: man:pmie_daily(1)
  Process: 926361 ExecStart=/usr/libexec/pcp/bin/pmie_daily $PMIE_DAILY_PARAMS (code=exited, status=1/FAILURE)
 Main PID: 926361 (code=exited, status=1/FAILURE)

May 07 11:28:35 (edited) systemd[1]: Starting Process PMIE logs...
May 07 11:28:35 (edited) systemd[1]: pmie_daily.service: Main process exited, code=exited, status=1/FAILURE
May 07 11:28:35 (edited) systemd[1]: pmie_daily.service: Failed with result 'exit-code'.
May 07 11:28:35 (edited) systemd[1]: Failed to start Process PMIE logs.


# systemctl cat pmie_daily.service
# /usr/lib/systemd/system/pmie_daily.service
[Unit]
Description=Process PMIE logs
Documentation=man:pmie_daily(1)
ConditionPathExists=!/etc/cron.d/pcp-pmie

[Service]
Type=oneshot
Environment="PMIE_DAILY_PARAMS=-X xz -x 3"
EnvironmentFile=-/etc/sysconfig/pmie_timers
ExecStart=/usr/libexec/pcp/bin/pmie_daily $PMIE_DAILY_PARAMS
User=pcp

[Install]
RequiredBy=pmie.service

# /etc/systemd/system/pmie_daily.service.d/override.conf
[Service]
WorkingDirectory=/var/lib/pcp

Comment 4 Nathan Scott 2020-05-07 22:31:28 UTC
This is saying that pmie service is enabled but the local pmie was noticed to be not running at daily log processing time.

What's the output from 'systemctl status pmie pmie_check'? - these are the services that should be ensuring this pmie is running at the time daily log processing starts.

Comment 5 Uwe Menges 2020-05-08 11:11:39 UTC
I have good news.

First I'll answer your question:

# systemctl status pmie pmie_check
● pmie.service - Performance Metrics Inference Engine
   Loaded: loaded (/usr/lib/systemd/system/pmie.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-05-07 12:19:15 CEST; 24h ago
     Docs: man:pmie(1)
  Process: 26304 ExecStart=/usr/share/pcp/lib/pmie start (code=exited, status=0/SUCCESS)
 Main PID: 27216 (pmie)
    Tasks: 1 (limit: 3355442)
   CGroup: /system.slice/pmie.service
           └─27216 /usr/bin/pmie -b -P -l /var/log/pcp/pmie/(edited)/pmie.log -c config.default

May 07 12:19:14 (edited) systemd[1]: Starting Performance Metrics Inference Engine...
May 07 12:19:15 (edited) pmie[26304]: Starting pmie ...
May 07 12:19:15 (edited) systemd[1]: pmie.service: Can't open PID file /run/pcp/pmie.pid (yet?) after start: No such file or directory
May 07 12:19:15 (edited) systemd[1]: pmie.service: Supervising process 27216 which is not our child. We'll most likely not notice when it exits.
May 07 12:19:15 (edited) systemd[1]: Started Performance Metrics Inference Engine.

● pmie_check.service - Check PMIE instances are running
   Loaded: loaded (/usr/lib/systemd/system/pmie_check.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/pmie_check.service.d
           └─override.conf
   Active: inactive (dead) since Fri 2020-05-08 12:58:00 CEST; 1min 57s ago
     Docs: man:pmie_check(1)
  Process: 472455 ExecStart=/usr/libexec/pcp/bin/pmie_check $PMIE_CHECK_PARAMS (code=exited, status=0/SUCCESS)
 Main PID: 472455 (code=exited, status=0/SUCCESS)

May 08 12:58:00 (edited) systemd[1]: Starting Check PMIE instances are running...
May 08 12:58:00 (edited) systemd[1]: Started Check PMIE instances are running.

Also the pmie_daily.service didn't fail when called by timer tonight:

# systemctl status pmie_daily.timer pmie_daily.service
● pmie_daily.timer - Daily processing of PMIE logs
   Loaded: loaded (/usr/lib/systemd/system/pmie_daily.timer; disabled; vendor preset: disabled)
   Active: active (waiting) since Thu 2020-05-07 12:19:05 CEST; 24h ago
  Trigger: Sat 2020-05-09 00:08:00 CEST; 11h left

May 07 12:19:05 (edited) systemd[1]: Started Daily processing of PMIE logs.

● pmie_daily.service - Process PMIE logs
   Loaded: loaded (/usr/lib/systemd/system/pmie_daily.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/pmie_daily.service.d
           └─override.conf
   Active: inactive (dead) since Fri 2020-05-08 00:08:00 CEST; 12h ago
     Docs: man:pmie_daily(1)
  Process: 263279 ExecStart=/usr/libexec/pcp/bin/pmie_daily $PMIE_DAILY_PARAMS (code=exited, status=0/SUCCESS)
 Main PID: 263279 (code=exited, status=0/SUCCESS)

May 08 00:08:00 (edited) systemd[1]: Starting Process PMIE logs...
May 08 00:08:00 (edited) systemd[1]: Started Process PMIE logs.

The log `/var/log/pcp/pmie/pmie_daily.log` doesn't exist anymore, and yesterday's log has been rotated to `.prev`.

So I think the change works!
I don't know why it didn't work when (re)starting it manually yesterday.

Yours, Uwe

Comment 8 Jan Kurik 2020-07-13 10:06:09 UTC
Verifid on pcp-5.1.1-3.el8 build.

Comment 11 errata-xmlrpc 2020-11-04 03:00:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcp bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4684

Comment 12 Ariaan 2023-05-31 13:04:51 UTC
I seem to have the same issue in Fedora 38 Server.

After I enabled it in Cockpit in:

Overview -> Usage ("View metrics and history") -> Enable Metrics

Haven't found a way to disable it though. (remove service through one click)

Comment 13 Andreas Waschinski 2023-06-08 00:07:17 UTC
(In reply to Ariaan from comment #12)
> I seem to have the same issue in Fedora 38 Server.
> 
> After I enabled it in Cockpit in:
> 
> Overview -> Usage ("View metrics and history") -> Enable Metrics
> 
> Haven't found a way to disable it though. (remove service through one click)

I encountered the same problem. It seems to be related to the log rotation and one file in /var/log/pcp/pmie/YOURPCNAME/ ending up as being owned by root instead of pcp. After changing the owner accordingly, the service runs fine until at some point a new file is generated incorrectly again (after a server restart?).

Comment 14 Nathan Scott 2023-06-08 01:03:09 UTC
Hmm, thanks - can you say which specifically file it was in that directory, owned as root?

Comment 15 Andreas Waschinski 2023-06-08 01:16:00 UTC
(In reply to Nathan Scott from comment #14)
> Hmm, thanks - can you say which specifically file it was in that directory,
> owned as root?

Sure, today it was: pmie.log.20230607

Previously, on May 31 to be exact, I even had the last two rotated log files owned by root:
-rw-r--r--. 1 pcp  pcp  100 31. Mai 11:13 pmie.log
-rw-r--r--. 1 pcp  pcp  158 30. Mai 00:08 pmie.log.20230529
-rw-r--r--. 1 root root 380 30. Mai 14:48 pmie.log.20230530
-rw-r--r--. 1 root root 194 31. Mai 11:13 pmie.log.20230531

Comment 16 Nathan Scott 2023-06-08 02:34:27 UTC
Thanks Andreas, I'm auditing the code and discussing with upstream developers to see if we can find where/how this is happening.

Comment 17 Ken McDonell 2023-06-10 03:07:40 UTC
This is really a new issue, but I'll continue here as the discussion was started here.

The pmie.log.YYYYMMDD renaming of pmie's log files was introduced via commit
https://github.com/performancecopilot/pcp/commit/d85efb855b94615eebed7accf4c68d696ba1d25e
on 21 Mar 2023 and this introduced a new bug as described above in https://bugzilla.redhat.com/show_bug.cgi?id=1832262#c12

What's being observed here is a corner-case when pmie_check is run (usually from systemd) and pmie.log exists, but there is not pmie.log.YYYYMMDD file for TODAY.  In other circumstances pmie_check does the "right" thing, and the related pmie_daily script is also doing the "right" thing.

The problem has been fixed upstream in commit
https://github.com/performancecopilot/pcp/commit/5cab8aa53e204f3c2b67491b97851b0499816463
this morning.