Bug 1073658 - intermittent pmdasystemd failure at pmcd startup during system boot
Summary: intermittent pmdasystemd failure at pmcd startup during system boot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: pcp
Version: 19
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Nathan Scott
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-03-06 21:42 UTC by Frank Ch. Eigler
Modified: 2014-07-01 23:30 UTC (History)
6 users (show)

Fixed In Version: pcp-3.9.5-1.fc20
Clone Of:
Environment:
Last Closed: 2014-05-30 17:08:23 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)

Description Frank Ch. Eigler 2014-03-06 21:42:15 UTC
A somewhat repeatable failure affects my f19 workstation,
wherein if pmcd is started during normal system boot,
pmdasystemd can often fail (see logs below).  If pmcd is
restarted later, no further problems appear.

/var/log/pmcd/pmdasystemd.log:

Log for pmdasystemd on very.elastic.org started Thu Mar  6 07:48:06 2014

[Thu Mar  6 07:48:17] pmdasystemd(1002) Critical: __pmdaSetupPDU: PMDA pmdasystemd send creds: Broken pipe
[Thu Mar  6 07:48:17] pmdasystemd(1002) Error: PMDA Initialisation Failed

Log finished Thu Mar  6 07:48:17 2014

/var/log/pmcd/pmcd.log:

[...]
[Thu Mar  6 07:48:09] pmcd(987) Warning: pduread: timeout (after 3.000 sec) while attempting to read 12 bytes out of 12 in HDR on fd=17
pmcd: error at initial PDU exchange with systemd PMDA: Timeout waiting for a response from PMCD
[...]

journalctl -b extracts:

Mar 06 07:47:40 very.elastic.org systemd-journal[60]: Journal started
Mar 06 07:47:42 very.elastic.org systemd-journal[60]: Journal stopped
Mar 06 07:47:48 very.elastic.org systemd-journal[295]: Allowing runtime journal files to grow to 197.6M.
Mar 06 07:47:48 very.elastic.org systemd-journald[60]: Received SIGTERM
Mar 06 07:47:48 very.elastic.org systemd-journal[295]: Journal started
Mar 06 07:48:05 very.elastic.org systemd-journal[295]: Allowing system journal files to grow to 4.0G.
[...]
Mar 06 07:48:09 very.elastic.org systemd-journal[295]: Forwarding to syslog missed 90 messages.
Mar 06 07:48:11 very.elastic.org ntpd[491]: 0.0.0.0 c61c 0c clock_step -0.546935 s
Mar 06 07:48:10 very.elastic.org ntpd[491]: 0.0.0.0 c614 04 freq_mode
Mar 06 07:48:10 very.elastic.org systemd[1]: Time has been changed
Mar 06 07:48:11 very.elastic.org ntpd[491]: 0.0.0.0 c618 08 no_sys_peer


It appears to be during this clock-changing, journald-restarty time that
pmcd happens to start pmdasystemd.  For reasons unknown, pmdasystemd
appears to pmcd to take longer than 3.0s to reply to some early init
packet, and in accordance with http://oss.sgi.com/bugzilla/show_bug.cgi?id=1036
forever gives up on the pmda. :-(

I'm not sure what the best fix might be.  Delaying systemd journal
connections, switching to multithreading, ... or letting pmcd wait
a little longer during startup?

Comment 1 Nathan Scott 2014-03-10 21:46:12 UTC
Possibly a situation where PM_ERR_PMDANOTREADY might help us?
See src/pmdas/rpm/rpm.c::notready() for an example, or pmdasample.

cheers.

Comment 2 Frank Ch. Eigler 2014-04-12 00:21:10 UTC
pcpfans.git branch fche/systemd-slow contains commit 82a8aa97,
which should fix this problem.  I've been unable to reproduce
the original problem on a test computer.  (Unfortunately,
since the problem is intermittent and relies on systemd/journald
doing some mysterious concurrent activity, this is not proof
that the problem is gone.)

Comment 3 Ken McDonell 2014-04-12 00:32:33 UTC
Thanks for giving it a spin Frank.

If the problem is observed again, it would be really helpful to have the tail of pmcd.log and the pmda's systemd.log files attached to the bug.

Comment 4 Frank Ch. Eigler 2014-04-12 00:49:46 UTC
Ken, certainly will report if/when the problem reappears.
Log extracts from the original failures are here at comment #0.

Comment 5 Ken McDonell 2014-04-13 22:23:03 UTC
There are 2 different timeout scenarios, and I think the connect_pmcd() change will almost certainly fix one of them (the one documented in Comment 0 above).

But if the PMDA fails early in life again it will be important to know if this is the _other_ scenario, or my brave assertion is false.

Hence the request for the logs if a new failure is observed.

Comment 6 Fedora Update System 2014-05-15 08:11:20 UTC
pcp-3.9.4-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/pcp-3.9.4-1.fc20

Comment 7 Fedora Update System 2014-05-15 08:11:48 UTC
pcp-3.9.4-1.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/pcp-3.9.4-1.fc19

Comment 8 Fedora Update System 2014-05-15 08:12:25 UTC
pcp-3.9.4-1.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/pcp-3.9.4-1.el6

Comment 9 Fedora Update System 2014-05-16 03:02:32 UTC
Package pcp-3.9.4-1.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.9.4-1.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2014-1423/pcp-3.9.4-1.el6
then log in and leave karma (feedback).

Comment 10 Fedora Update System 2014-05-30 17:08:23 UTC
pcp-3.9.4-1.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 11 Fedora Update System 2014-06-18 20:21:07 UTC
pcp-3.9.5-1.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/pcp-3.9.5-1.fc19

Comment 12 Fedora Update System 2014-06-18 20:22:50 UTC
pcp-3.9.5-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/pcp-3.9.5-1.fc20

Comment 13 Fedora Update System 2014-07-01 23:27:30 UTC
pcp-3.9.5-1.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 14 Fedora Update System 2014-07-01 23:30:30 UTC
pcp-3.9.5-1.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.