Bug 1737091 - pmlogger slows down boot significantly
Summary: pmlogger slows down boot significantly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: pcp
Version: 31
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Mark Goodwin
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-08-02 15:06 UTC by Zbigniew Jędrzejewski-Szmek
Modified: 2020-03-03 17:12 UTC (History)
8 users (show)

Fixed In Version: pcp-4.3.4-1.fc30 pcp-4.3.4-1.fc29
Clone Of:
Environment:
Last Closed: 2019-08-20 01:48:45 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)

Description Zbigniew Jędrzejewski-Szmek 2019-08-02 15:06:59 UTC
Description of problem:
On my F30 laptop:
$ systemd-analyze blame
37.869s pmlogger.service    <---- !!!!                        
30.174s pmlogger_daily.service        <---- !!!!                                                           
 9.781s dkms.service        <--- this one is compiling stuff, so it's OK that it's slow. It only runs once.                                                                    
 7.960s systemd-cryptsetup@luks\x2d2d9b648a\x2d15b1\x2d4204\x2d988b\x2dec085089f8ce.service    <--- I need to type the password here, so it's OK to be a little slow too

pmlogger.service is declared as dependency on multi-user.target, so it part of the normal boot. Anything above a few seconds for a service that runs on every boot is just bad. Even a few seconds is already noticable for users.

CPUAccounting says that pmlogger.service consumed 20.796s of CPU time (that's actual CPU usage, not just wall clock time)!

Something needs to be done to make this much less costly. Right now pcp services are an order of magnitude more costly then the next service running during startup (lvm2-pvscan).

Version-Release number of selected component (if applicable):
pcp-4.3.2-1.fc30.x86_64

How reproducible:
Consistent on my laptop. On another machine with F29 it's only 8s.
On a F29 rpi3: 1min 17.988s pmlogger.service, 38.305s pmlogger_daily.service!

Comment 1 Mark Goodwin 2019-08-05 05:18:39 UTC
Hi Zbigniew, thanks for the report - this is quite likely a DUP of BZ #1721223 or at least related, but I will not mark it as such just yet without further investigation. The pmlogger and pmlogger_daily services are supposed to start in the background and should not take anywhere near the times being reported.

Regards

Comment 2 Zbigniew Jędrzejewski-Szmek 2019-08-05 06:56:16 UTC
I get the messages in #1721223 too:
MPLv2.0 contains the same clause that makes "or later" redundant the GPLv3 has:
> 10.2. Effect of New Versions
>
> You may distribute the Covered Software under the terms of the version of the License under which You originally received the Covered Software, or under the terms of any subsequent version published by the license steward.

-- Reboot --
Jul 22 15:26:33 krowka systemd[1]: pmlogger.service: Supervising process 8744 which is not our child. We'll most likely not notice when it exits.
Jul 23 09:23:08 krowka systemd[1]: pmlogger.service: New main PID 46543 does not belong to service, and PID file is not owned by root. Refusing.
Jul 23 09:23:08 krowka systemd[1]: pmlogger.service: New main PID 46543 does not belong to service, and PID file is not owned by root. Refusing.
Jul 23 09:24:33 krowka systemd[1]: pmlogger.service: start operation timed out. Terminating.
Jul 23 09:24:33 krowka systemd[1]: pmlogger.service: Failed with result 'timeout'.
Jul 23 09:24:33 krowka systemd[1]: Failed to start pmlogger.service.
Jul 23 09:24:40 krowka systemd[1]: pmlogger.service: Supervising process 54624 which is not our child. We'll most likely not notice when it exits.
Jul 24 08:07:40 krowka systemd[1]: pmlogger.service: New main PID 89367 does not belong to service, and PID file is not owned by root. Refusing.
Jul 24 08:07:40 krowka systemd[1]: pmlogger.service: New main PID 89367 does not belong to service, and PID file is not owned by root. Refusing.
Jul 24 08:09:05 krowka systemd[1]: pmlogger.service: start operation timed out. Terminating.
Jul 24 08:09:05 krowka systemd[1]: pmlogger.service: Failed with result 'timeout'.
Jul 24 08:09:05 krowka systemd[1]: Failed to start pmlogger.service.
Jul 24 08:09:12 krowka systemd[1]: pmlogger.service: Supervising process 96731 which is not our child. We'll most likely not notice when it exits.
Jul 25 10:55:16 krowka systemd[1]: pmlogger.service: New main PID 168597 does not belong to service, and PID file is not owned by root. Refusing.
Jul 25 10:55:16 krowka systemd[1]: pmlogger.service: New main PID 168597 does not belong to service, and PID file is not owned by root. Refusing.
Jul 25 10:56:46 krowka systemd[1]: pmlogger.service: start operation timed out. Terminating.
Jul 25 10:56:46 krowka systemd[1]: pmlogger.service: Failed with result 'timeout'.
Jul 25 10:56:46 krowka systemd[1]: Failed to start pmlogger.service.
Jul 25 10:56:55 krowka systemd[1]: pmlogger.service: Supervising process 174987 which is not our child. We'll most likely not notice when it exits.
Jul 26 10:17:14 krowka systemd[1]: pmlogger.service: New main PID 215230 does not belong to service, and PID file is not owned by root. Refusing.
Jul 26 10:18:44 krowka systemd[1]: pmlogger.service: start operation timed out. Terminating.
Jul 26 10:18:44 krowka systemd[1]: pmlogger.service: Failed with result 'timeout'.
Jul 26 10:18:44 krowka systemd[1]: Failed to start pmlogger.service.
Jul 26 10:18:51 krowka systemd[1]: pmlogger.service: Supervising process 226902 which is not our child. We'll most likely not notice when it exits.
Jul 27 10:45:24 krowka systemd[1]: pmlogger.service: New main PID 656858 does not belong to service, and PID file is not owned by root. Refusing.
Jul 27 10:45:24 krowka systemd[1]: pmlogger.service: New main PID 656858 does not belong to service, and PID file is not owned by root. Refusing.
Jul 27 10:46:49 krowka systemd[1]: pmlogger.service: start operation timed out. Terminating.
Jul 27 10:46:49 krowka systemd[1]: pmlogger.service: Failed with result 'timeout'.
Jul 27 10:46:49 krowka systemd[1]: Failed to start pmlogger.service.
Jul 27 10:46:57 krowka systemd[1]: pmlogger.service: Supervising process 664857 which is not our child. We'll most likely not notice when it exits.
Jul 28 00:10:20 krowka systemd[1]: pmlogger.service: New main PID 1519640 does not belong to service, and PID file is not owned by root. Refusing.
Jul 28 00:10:20 krowka systemd[1]: pmlogger.service: New main PID 1519640 does not belong to service, and PID file is not owned by root. Refusing.
Jul 28 00:10:45 krowka systemd[1]: pmlogger.service: Supervising process 1525693 which is not our child. We'll most likely not notice when it exits.
Jul 28 15:59:09 krowka systemd[1]: pmlogger.service: Consumed 13.430s CPU time.
-- Reboot --
Jul 28 16:00:11 krowka systemd[1]: pmlogger.service: Supervising process 8715 which is not our child. We'll most likely not notice when it exits.
Jul 28 16:06:19 krowka systemd[1]: pmlogger.service: Consumed 9.468s CPU time.
-- Reboot --
Jul 28 16:07:00 krowka systemd[1]: pmlogger.service: Supervising process 8701 which is not our child. We'll most likely not notice when it exits.
Jul 28 16:33:30 krowka systemd[1]: pmlogger.service: Consumed 8.991s CPU time.
-- Reboot --
Jul 28 16:35:54 krowka systemd[1]: pmlogger.service: Supervising process 8688 which is not our child. We'll most likely not notice when it exits.
Jul 28 16:44:16 krowka systemd[1]: pmlogger.service: Consumed 9.076s CPU time.
-- Reboot --
Jul 28 16:45:18 krowka systemd[1]: pmlogger.service: Supervising process 8708 which is not our child. We'll most likely not notice when it exits.
Jul 29 08:06:33 krowka systemd[1]: pmlogger.service: Consumed 8.994s CPU time.
Jul 29 08:06:33 krowka systemd[1]: pmlogger.service: Consumed 8.994s CPU time.
Jul 29 08:06:40 krowka systemd[1]: pmlogger.service: New main PID 20047 does not belong to service, and PID file is not owned by root. Refusing.
Jul 29 08:06:40 krowka systemd[1]: pmlogger.service: New main PID 20047 does not belong to service, and PID file is not owned by root. Refusing.
Jul 29 08:07:00 krowka systemd[1]: pmlogger_daily.service: Consumed 2.081s CPU time.
Jul 29 08:08:04 krowka systemd[1]: pmlogger.service: start operation timed out. Terminating.
Jul 29 08:08:04 krowka systemd[1]: pmlogger.service: Failed with result 'timeout'.
Jul 29 08:08:04 krowka systemd[1]: Failed to start Performance Metrics Archive Logger.
Jul 29 08:08:04 krowka systemd[1]: pmlogger.service: Consumed 13.033s CPU time.
Jul 29 08:08:04 krowka systemd[1]: pmlogger.service: Consumed 13.033s CPU time.
Jul 29 08:08:10 krowka systemd[1]: pmlogger.service: Supervising process 29303 which is not our child. We'll most likely not notice when it exits.
Jul 30 00:10:25 krowka systemd[1]: pmlogger.service: Consumed 8.948s CPU time.
Jul 30 00:10:25 krowka systemd[1]: pmlogger.service: Consumed 8.948s CPU time.
Jul 30 00:10:30 krowka systemd[1]: pmlogger.service: New main PID 279182 does not belong to service, and PID file is not owned by root. Refusing.
Jul 30 00:10:30 krowka systemd[1]: pmlogger.service: New main PID 279182 does not belong to service, and PID file is not owned by root. Refusing.
Jul 30 00:10:48 krowka systemd[1]: pmlogger_daily.service: Consumed 2.459s CPU time.
Jul 30 00:10:55 krowka systemd[1]: pmlogger.service: Supervising process 285219 which is not our child. We'll most likely not notice when it exits.
Jul 31 00:10:17 krowka systemd[1]: pmlogger.service: Consumed 14.707s CPU time.
Jul 31 00:10:18 krowka systemd[1]: pmlogger.service: Consumed 14.707s CPU time.
Jul 31 00:10:22 krowka systemd[1]: pmlogger.service: New main PID 1288971 does not belong to service, and PID file is not owned by root. Refusing.
Jul 31 00:10:22 krowka systemd[1]: pmlogger.service: New main PID 1288971 does not belong to service, and PID file is not owned by root. Refusing.
Jul 31 00:10:40 krowka systemd[1]: pmlogger_daily.service: Consumed 2.826s CPU time.
Jul 31 00:10:46 krowka systemd[1]: pmlogger.service: Supervising process 1295040 which is not our child. We'll most likely not notice when it exits.
Aug 01 00:10:16 krowka systemd[1]: pmlogger.service: Consumed 13.996s CPU time.
Aug 01 00:10:16 krowka systemd[1]: pmlogger.service: Consumed 13.996s CPU time.
Aug 01 00:10:20 krowka systemd[1]: pmlogger.service: New main PID 1339592 does not belong to service, and PID file is not owned by root. Refusing.
Aug 01 00:10:20 krowka systemd[1]: pmlogger.service: New main PID 1339592 does not belong to service, and PID file is not owned by root. Refusing.
Aug 01 00:10:38 krowka systemd[1]: pmlogger_daily.service: Consumed 2.731s CPU time.
Aug 01 00:10:45 krowka systemd[1]: pmlogger.service: Supervising process 1345690 which is not our child. We'll most likely not notice when it exits.
Aug 01 17:42:27 krowka systemd[1]: pmlogger.service: Consumed 12.746s CPU time.
-- Reboot --
Aug 01 17:44:19 krowka systemd[1]: pmlogger.service: Supervising process 8502 which is not our child. We'll most likely not notice when it exits.
Aug 01 17:49:56 krowka systemd[1]: pmlogger.service: Consumed 10.052s CPU time.
-- Reboot --
Aug 02 00:10:12 krowka systemd[1]: pmlogger.service: New main PID 29661 does not belong to service, and PID file is not owned by root. Refusing.
Aug 02 00:10:12 krowka systemd[1]: pmlogger.service: New main PID 29661 does not belong to service, and PID file is not owned by root. Refusing.
Aug 02 00:10:30 krowka systemd[1]: pmlogger_daily.service: Consumed 3.930s CPU time.
Aug 02 00:10:44 krowka systemd[1]: pmlogger.service: Supervising process 35782 which is not our child. We'll most likely not notice when it exits.
Aug 03 00:10:06 krowka systemd[1]: pmlogger.service: Consumed 21.355s CPU time.
Aug 03 00:10:06 krowka systemd[1]: pmlogger.service: Consumed 21.355s CPU time.

Nevertheless, I assume it's a different issue, because the times vary and they do not seem to be related to any systemd service timeout.
What kind of information should I provide?

Comment 3 Ben Cotton 2019-08-13 16:49:36 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 4 Fedora Update System 2019-08-16 01:52:51 UTC
FEDORA-2019-97183bed56 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-97183bed56

Comment 5 Fedora Update System 2019-08-16 01:53:23 UTC
FEDORA-2019-44b383ec91 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-44b383ec91

Comment 6 Fedora Update System 2019-08-17 01:27:29 UTC
pcp-4.3.4-1.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-97183bed56

Comment 7 Fedora Update System 2019-08-17 02:23:36 UTC
pcp-4.3.4-1.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-44b383ec91

Comment 8 Fedora Update System 2019-08-20 01:48:45 UTC
pcp-4.3.4-1.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.

Comment 9 Fedora Update System 2019-08-25 03:03:09 UTC
pcp-4.3.4-1.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 10 Mark Goodwin 2019-08-30 02:53:46 UTC
Just for the record, several systemd service units updates were made for pcp-4.3.4, see below. On testing, these reduce the boot time for PCP services considerably, and fix other issues too. The pmlogger.service startup time is still too long (5 to 10 seconds), but it's far better than it was. In the pcp-5.x timeframe, we're intending to improve this further by changing over to type=notify instead of type=forking. This should eliminate the benign messages about missing run/pid files and allow pmlogger (and other service daemons) to notify systemd directly once they have started using the sd_notify API.

Here are the relevant changes that went into pcp-4.3.4 :

commit 9d378a81dddd5b34f8e864b1ebf3e545e66365f6
Author: Mark Goodwin <mgoodwin>
Date:   Wed Aug 21 14:48:43 2019 +1000

    logutil: add 120s timeout in all systemd service configs
    
    Azure/CI has been failing with service timeouts (mainly qa/1124).

commit 666a3b9cd17d6dfe4d4c8cb7d43ffe960be69a08
Author: Mark Goodwin <mgoodwin>
Date:   Wed Aug 21 14:42:55 2019 +1000

    logutil: pmlogger_check.service is not required by pmlogger.service
    
    Remove RequiredBy=pmlogger.service in pmlogger_check.service.
    It causes both services to run in parallel.

commit 3800e24afbffba5854907d7238dc882e9e67570e
Author: Mark Goodwin <mgoodwin>
Date:   Fri Aug 9 16:00:31 2019 +1000

    logutil: fix pmlogger and pmlogger_daily systemd services
    
    Fedora BZ #1737091 - pmlogger slows down boot significantly
    Fedora BZ #1721223 - pmlogger.service: Failed with result 'timeout'
    
    The systemd configs for pmlogger_daily (both service and timer) should
    NOT be required by the pmlogger service - otherwise, when the pmlogger
    service attempts to start up, the pmlogger_daily service is started
    too. The pmlogger_daily service fails if the pmlogger service is not
    running, which then causes the pmlogger service to be terminated (this
    is the systemd behaviour with RequiredBy=pmlogger.service).
    
    So we end up with a lot of confusing log messages, timeouts and super
    slow boots, and ... usually no pmlogger services running at all!
    The previous commit concerning broken pipe messages is related too.
    
    With these fixes, the pmlogger service now takes 6 to 10 seconds to
    start (compared to ~ 30 seconds delay during boot and then failing to
    start at all). This is still too long, so there will need to be
    followup work to improve that for BZ#1737091.

commit 8ab6096ee8b8792627780aee5ce9c0a3fe4ee682
Author: Mark Goodwin <mgoodwin>
Date:   Fri Aug 9 10:37:26 2019 +1000

    log management - fix broken pipe errors from _is_archive() in utilproc.sh
    
    mkaf, pmafm and other utilities used by pmlogger_daily and other
    log management scripts often partially fail with mysterious broken
    pipe messages in the logs :
    
    Warnings/errors from mkaf ...
    cat: write error: Broken pipe
    
    This seems to be stderr from the _is_archive() shell function in
    utilproc.sh, which runs various decompression utils (or cat) on it's
    argument, which is then piped into dd to read the first 7 bytes (only)
    and then into awk to check for PCP file signatures. This results in
    the start of that pipeline exiting with a broken pipe, which sometimes
    (it's racy) propagates stderr messages through to the original
    _is_archive caller, causing it to think mkaf (or whatever is calling
    _is_archive) has failed .. sometimes.
    
    This patch directs stderr to /dev/null after the first stage in the
    _is_archive pipeline, and now pmlogger_daily and friends are reliably
    happier.

Comment 11 Mark Goodwin 2019-08-30 02:55:56 UTC
*** Bug 1721223 has been marked as a duplicate of this bug. ***

Comment 12 Zbigniew Jędrzejewski-Szmek 2019-08-30 10:19:21 UTC
I can confirm that the startup is much quicker now:
systemd-analyze blame says "11.052s pmlogger.service".
Thanks!

Comment 13 Mark Goodwin 2019-09-02 00:43:09 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #12)
> I can confirm that the startup is much quicker now:
> systemd-analyze blame says "11.052s pmlogger.service".
> Thanks!

Thanks for the testing Zbigniew. We'll be improving this further when we switch to type=notify (expect it to be less than one second). This work will likely be done for the next PCP release (pcp-5.0.0).

Regards


Note You need to log in before you can comment on or make changes to this bug.