Bug 1954576
| Summary: | Increase robustness of pmie startup script | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Renaud Métrich <rmetrich> |
| Component: | pcp | Assignee: | Mark Goodwin <mgoodwin> |
| Status: | CLOSED ERRATA | QA Contact: | Jan Kurik <jkurik> |
| Severity: | high | Docs Contact: | Apurva Bhide <abhide> |
| Priority: | high | ||
| Version: | 8.3 | CC: | agerstmayr, azone, jkurik, mgoodwin, nathans, patrickm |
| Target Milestone: | rc | Keywords: | Bugfix, Triaged |
| Target Release: | 8.5 | Flags: | pm-rhel:
mirror+
|
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | pcp-5.3.1-1.el8 | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-11-09 17:49:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Renaud Métrich
2021-04-28 12:03:23 UTC
Stracing systemd while performing "systemctl start pmie" shows the following:
1. /usr/bin/pmie process executes then dies due to not being able to access /var/log/pcp
2. the error message is sent to the void
3. the main startup script loops until systemd eventually kills it on timeout
# grep -w ^8746 systemd.strace | tail
8746 13:50:33.045595 write(2</dev/null<char 1:3>>, "pmie: cannot open log \"/var/log/pcp/pmie/vm-rhel8/pmie.log\" for writing : Permission denied\nLog for pmie on vm-rhel8 started Wed Apr 28 13:50:33 2021\n", 150) = 150 <0.000008>
8746 13:50:33.045622 write(2</dev/null<char 1:3>>, "\n", 1) = 1 <0.000007>
8746 13:50:33.045656 lstat("/var", {st_dev=makedev(0xfd, 0), st_ino=50331777, st_mode=S_IFDIR|0755, st_nlink=30, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=4096, st_atime=1619608507 /* 2021-04-28T13:15:07.750633555+0200 */, st_atime_nsec=750633555, st_mtime=1619082660 /* 2021-04-22T11:11:00.191009848+0200 */, st_mtime_nsec=191009848, st_ctime=1619607615 /* 2021-04-28T13:00:15.299000000+0200 */, st_ctime_nsec=299000000}) = 0 <0.000008>
8746 13:50:33.045687 lstat("/var/log", {st_dev=makedev(0xfd, 0), st_ino=33575586, st_mode=S_IFDIR|0700, st_nlink=28, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16, st_size=4096, st_atime=1619610393 /* 2021-04-28T13:46:33.365516815+0200 */, st_atime_nsec=365516815, st_mtime=1619594341 /* 2021-04-28T09:19:01.568721021+0200 */, st_mtime_nsec=568721021, st_ctime=1619610623 /* 2021-04-28T13:50:23.144832189+0200 */, st_ctime_nsec=144832189}) = 0 <0.000007>
BELOW the failing lstat():
8746 13:50:33.045715 lstat("/var/log/pcp", 0x7fff10c2a130) = -1 EACCES (Permission denied) <0.000007>
8746 13:50:33.045744 write(2</dev/null<char 1:3>>, "pmie: cannot find realpath for log /var/log/pcp/pmie/vm-rhel8/pmie.log: Permission denied\n", 90) = 90 <0.000007>
8746 13:50:33.045773 write(2</dev/null<char 1:3>>, "\n", 1) = 1 <0.000006>
8746 13:50:33.045796 write(2</dev/null<char 1:3>>, "Log finished Wed Apr 28 13:50:33 2021\n", 38) = 38 <0.000004>
8746 13:50:33.045982 exit_group(1) = ?
8746 13:50:33.046252 +++ exited with 1 +++
All this needs hardening:
- startup script should detect /usr/bin/pmie died and not wait until it gets killed by systemd
- the error messages should be printed back to the journal
I'll also add that if we remove the Type=notify and pmie can't write to /var/log, not only do we not get a valid error, but if we attempt to rule out systemd killing the service, then pmie is stuck endlessly trying to restart and never gives any kind of indication what's wrong, due to the Restart=always.
# systemctl cat pmie.service
# /usr/lib/systemd/system/pmie.service
[Unit]
Description=Performance Metrics Inference Engine
Documentation=man:pmie(1)
After=network-online.target pmcd.service
After=pmie_check.timer pmie_daily.timer
BindsTo=pmie_check.timer pmie_daily.timer
Wants=pmcd.service
[Service]
#Type=notify
Restart=always
NotifyAccess=all
TimeoutSec=120
ps output just shows we're trying to start:
root 252082 1 0 12:09 ? 00:00:00 /bin/sh /usr/share/pcp/lib/pmie start-systemd
root 252148 252082 0 12:09 ? 00:00:00 /bin/sh /usr/libexec/pcp/bin/pmie_check
root 252239 252148 0 12:09 ? 00:00:00 /bin/sh /usr/libexec/pcp/bin/pmie_check
Also, with SELinux disabled, PMIE never indicates why it won't start when calling it from the CLI:
# /usr/share/pcp/lib/pmie start-systemd
Starting pmie ...
Attempting to strace an active running pmie start doesn't lend us any information:
269153 12:23:39.920055 wait4(-1, <detached ...>
It just eventually times out:
# time /usr/share/pcp/lib/pmie start-systemd
Starting pmie ...
real 1m53.825s
user 0m4.054s
sys 0m4.738s
So we really need a clear indication there as to why we're failing to start.
Fixed upstream for pcp-5.3.1 (devel phase) :
commit 9adf9d4790f90cb7ec64525dd517053fd07017b0
Author: Mark Goodwin <mgoodwin>
Date: Wed May 19 12:29:29 2021 +1000
pmieutil: do not discard pmie stderr when starting pmie service
Retain and check pmie stderr messages for analysis when the
service fails to start. Previously stderr was sent to /dev/null,
which made it very difficult to debug (with users resorting to
strace, ltrace and even stap scripts). Any unexpected service
start failure messages now appear in $PCP_LOG_DIR/pmie/pmie_check.log.
Have checked QA for the pmie, pmieutil and pmiectl groups.
Resolves: Red Hat BZ#1954576 - Increase robustness of pmie startup script
commit 869f2635bdf0c70b7463eea3a55d736a49fd9191
Author: Mark Goodwin <mgoodwin>
Date: Tue May 25 16:45:20 2021 +1000
docs: update pmie_check(1) and pmie_daily(1) man page
Document pmie_check.log and pmie_daily.log, particularly if the
pmie service fails to start or exits early.
Related: RHBZ#1954576
Moving to MODIFIED until we have a build for QE
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pcp bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4171 |