1954576 – Increase robustness of pmie startup script

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1954576 - Increase robustness of pmie startup script

Summary: Increase robustness of pmie startup script

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pcp
Sub Component:
Version:	8.3
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	8.5
Assignee:	Mark Goodwin
QA Contact:	Jan Kurik
Docs Contact:	Apurva Bhide
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2021-04-28 12:03 UTC by Renaud Métrich
Modified:	2024-06-14 01:23 UTC (History)
CC List:	6 users (show)
Fixed In Version:	pcp-5.3.1-1.el8
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-11-09 17:49:39 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	pm-rhel: mirror+

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2021:4171	0	None	None	None	2021-11-09 17:49:56 UTC

Description Renaud Métrich 2021-04-28 12:03:23 UTC

Description of problem:

For some reason, if /usr/bin/pmie fails to start, the service unit pmie.service will die in timeout, after 120 seconds, without any helpful message to troubleshoot the issue.
We need to deliver something robust here, from the strace I can see that /usr/bin/pmie error messages are just sent to the void (/dev/null ...).

Version-Release number of selected component (if applicable):

pcp-5.1.1-4.el8_3.x86_64

How reproducible:

Always

Steps to Reproduce:
1. Intentionally restrict /var/log permissions to 700

# chmod 700 /var/log

2. Start pmie service

# systemctl start pmie
Job for pmie.service failed because a timeout was exceeded.
See "systemctl status pmie.service" and "journalctl -xe" for details.

3. Check the status

# systemctl status pmie

Actual results:

Apr 28 13:50:31 vm-rhel8 systemd[1]: Starting Performance Metrics Inference Engine...
Apr 28 13:52:31 vm-rhel8 systemd[1]: pmie.service: start operation timed out. Terminating.
Apr 28 13:52:31 vm-rhel8 pmie[8552]: Starting pmie ...
Apr 28 13:52:31 vm-rhel8 systemd[1]: pmie.service: Failed with result 'timeout'.
Apr 28 13:52:31 vm-rhel8 systemd[1]: Failed to start Performance Metrics Inference Engine.

Expected results:

Clear indication of why it failed

Additional info:

This simple reproducer is not far away from the reality: usually customers hardening systems will not change permission of /var/log to 0700 but create a group for people that can see logs (e.g. "logger" group) and set /var/log for that group to access the log (e.g. chgrp logger /var/log).
In such case, if "pcp" user is not in the "logger" group, the pmie service will not start.

Comment 1 Renaud Métrich 2021-04-28 12:06:45 UTC

Stracing systemd while performing "systemctl start pmie" shows the following:

1. /usr/bin/pmie process executes then dies due to not being able to access /var/log/pcp
2. the error message is sent to the void
3. the main startup script loops until systemd eventually kills it on timeout


# grep -w ^8746 systemd.strace | tail
8746  13:50:33.045595 write(2</dev/null<char 1:3>>, "pmie: cannot open log \"/var/log/pcp/pmie/vm-rhel8/pmie.log\" for writing : Permission denied\nLog for pmie on vm-rhel8 started Wed Apr 28 13:50:33 2021\n", 150) = 150 <0.000008>
8746  13:50:33.045622 write(2</dev/null<char 1:3>>, "\n", 1) = 1 <0.000007>
8746  13:50:33.045656 lstat("/var", {st_dev=makedev(0xfd, 0), st_ino=50331777, st_mode=S_IFDIR|0755, st_nlink=30, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=4096, st_atime=1619608507 /* 2021-04-28T13:15:07.750633555+0200 */, st_atime_nsec=750633555, st_mtime=1619082660 /* 2021-04-22T11:11:00.191009848+0200 */, st_mtime_nsec=191009848, st_ctime=1619607615 /* 2021-04-28T13:00:15.299000000+0200 */, st_ctime_nsec=299000000}) = 0 <0.000008>
8746  13:50:33.045687 lstat("/var/log", {st_dev=makedev(0xfd, 0), st_ino=33575586, st_mode=S_IFDIR|0700, st_nlink=28, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16, st_size=4096, st_atime=1619610393 /* 2021-04-28T13:46:33.365516815+0200 */, st_atime_nsec=365516815, st_mtime=1619594341 /* 2021-04-28T09:19:01.568721021+0200 */, st_mtime_nsec=568721021, st_ctime=1619610623 /* 2021-04-28T13:50:23.144832189+0200 */, st_ctime_nsec=144832189}) = 0 <0.000007>


BELOW the failing lstat():

8746  13:50:33.045715 lstat("/var/log/pcp", 0x7fff10c2a130) = -1 EACCES (Permission denied) <0.000007>

8746  13:50:33.045744 write(2</dev/null<char 1:3>>, "pmie: cannot find realpath for log /var/log/pcp/pmie/vm-rhel8/pmie.log: Permission denied\n", 90) = 90 <0.000007>
8746  13:50:33.045773 write(2</dev/null<char 1:3>>, "\n", 1) = 1 <0.000006>
8746  13:50:33.045796 write(2</dev/null<char 1:3>>, "Log finished Wed Apr 28 13:50:33 2021\n", 38) = 38 <0.000004>
8746  13:50:33.045982 exit_group(1)     = ?
8746  13:50:33.046252 +++ exited with 1 +++



All this needs hardening:
- startup script should detect /usr/bin/pmie died and not wait until it gets killed by systemd
- the error messages should be printed back to the journal

Comment 2 Anthony Zone 2021-04-28 16:29:55 UTC

I'll also add that if we remove the Type=notify and pmie can't write to /var/log, not only do we not get a valid error, but if we attempt to rule out systemd killing the service, then pmie is stuck endlessly trying to restart and never gives any kind of indication what's wrong, due to the Restart=always.

    # systemctl cat pmie.service 
    # /usr/lib/systemd/system/pmie.service
    [Unit]
    Description=Performance Metrics Inference Engine
    Documentation=man:pmie(1)
    After=network-online.target pmcd.service
    After=pmie_check.timer pmie_daily.timer
    BindsTo=pmie_check.timer pmie_daily.timer
    Wants=pmcd.service

    [Service]
    #Type=notify
    Restart=always
    NotifyAccess=all
    TimeoutSec=120

ps output just shows we're trying to start:

    root      252082       1  0 12:09 ?        00:00:00 /bin/sh /usr/share/pcp/lib/pmie start-systemd
    root      252148  252082  0 12:09 ?        00:00:00 /bin/sh /usr/libexec/pcp/bin/pmie_check
    root      252239  252148  0 12:09 ?        00:00:00 /bin/sh /usr/libexec/pcp/bin/pmie_check

Also, with SELinux disabled, PMIE never indicates why it won't start when calling it from the CLI:

    # /usr/share/pcp/lib/pmie start-systemd
    Starting pmie ... 

Attempting to strace an active running pmie start doesn't lend us any information:

    269153 12:23:39.920055 wait4(-1,  <detached ...>

It just eventually times out:

    # time /usr/share/pcp/lib/pmie start-systemd
    Starting pmie ... 

    real	1m53.825s
    user	0m4.054s
    sys	0m4.738s

So we really need a clear indication there as to why we're failing to start.

Comment 3 Mark Goodwin 2021-05-25 23:02:23 UTC

Fixed upstream for pcp-5.3.1 (devel phase) :

commit 9adf9d4790f90cb7ec64525dd517053fd07017b0
Author: Mark Goodwin <mgoodwin>
Date:   Wed May 19 12:29:29 2021 +1000

    pmieutil: do not discard pmie stderr when starting pmie service
    
    Retain and check pmie stderr messages for analysis when the
    service fails to start. Previously stderr was sent to /dev/null,
    which made it very difficult to debug (with users resorting to
    strace, ltrace and even stap scripts). Any unexpected service
    start failure messages now appear in $PCP_LOG_DIR/pmie/pmie_check.log.
    
    Have checked QA for the pmie, pmieutil and pmiectl groups.
    
    Resolves: Red Hat BZ#1954576 - Increase robustness of pmie startup script


commit 869f2635bdf0c70b7463eea3a55d736a49fd9191
Author: Mark Goodwin <mgoodwin>
Date:   Tue May 25 16:45:20 2021 +1000

    docs: update pmie_check(1) and pmie_daily(1) man page
    
    Document pmie_check.log and pmie_daily.log, particularly if the
    pmie service fails to start or exits early.
    
    Related: RHBZ#1954576


Moving to MODIFIED until we have a build for QE

Comment 9 errata-xmlrpc 2021-11-09 17:49:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcp bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4171

Note You need to log in before you can comment on or make changes to this bug.