Bug 1001987

Summary: pacemaker tries to start a resource too often
Product: [Fedora] Fedora
Reporter: lav
Component: pacemaker
Assignee: Andrew Beekhof <andrew>
Status: CLOSED CANTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 18
CC: abeekhof, andrew, fdinitto, lhh
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2014-01-14 05:47:01 UTC
Type: Bug
Attachments:
/var/log/messages (flags: none)

Description lav 2013-08-28 09:20:08 UTC
Description of problem:
When a resource fails to start, pacemaker repeatedly tries to start it again with no delay between attempts. For a service: resource, systemd then refuses even to try to start it because of its rate limit.


Version-Release number of selected component (if applicable):
pacemaker-1.1.9-0.1.70ad9fa.git.fc18.i686


How reproducible:
always

Steps to Reproduce:
1. Create a service: resource that fails to start (in my case it was service:named with a typo in its config) with op monitor interval=30s.
2. Try to manage it (pcs resource manage named).
3. See lots of messages in the log /var/log/messages; systemctl shows that it refuses to start named.service because of the rate limit.
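
The refusal in step 3 can be illustrated with a toy model of a start limiter, written in Python. This is only a sketch, not systemd code: `StartLimiter` and `accepted_starts` are hypothetical names, and the limit of 5 starts per 10-second window is an assumed value chosen for illustration (check your systemd configuration for the real one).

```python
class StartLimiter:
    """Allow at most `burst` starts per `interval` seconds (fixed window)."""

    def __init__(self, burst=5, interval=10.0):
        self.burst = burst          # assumed value, for illustration only
        self.interval = interval    # assumed value, for illustration only
        self.window_start = None
        self.count = 0

    def allow(self, now):
        """Return True if a start attempted at time `now` is accepted."""
        # Open a new window on the first attempt or once the old one expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False  # refused: too many starts in the current window


def accepted_starts(attempt_times):
    """Feed a list of attempt timestamps (seconds) through a fresh limiter."""
    limiter = StartLimiter()
    return [limiter.allow(t) for t in attempt_times]


# Eight attempts 0.1 s apart: the first five are accepted, the rest refused.
rapid = accepted_starts([i * 0.1 for i in range(8)])
# Eight attempts 10 s apart: each one opens a fresh window and is accepted.
spaced = accepted_starts([i * 10.0 for i in range(8)])
print(rapid)
print(spaced)
```

In this model, restart attempts issued without any delay exhaust the window almost immediately, while attempts spaced out by the window length are never refused.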

Actual results:
Repeated attempts to start the resource without any delay.

Expected results:
I believe it should delay additional attempts to start a resource.

Additional info:

Comment 1 Andrew Beekhof 2013-08-28 09:44:09 UTC
At the very least we need logs. Even better would be a crm_report archive.

Comment 2 lav 2013-08-28 11:36:58 UTC
Created attachment 791337
/var/log/messages

I have fixed my problem by changing "monitor interval=30s" to "monitor interval=30s start-delay=10s timeout=20s".

Anyway, here is the log.

Comment 3 Andrew Beekhof 2013-11-13 06:06:47 UTC
There's not a lot pacemaker can do here.

systemd claims that the start completed without error (clearly untrue); the recurring monitor then finds the resource stopped, so we try to recover it.

We do supply a migration-threshold option though. This would cause the cluster to give up after the indicated number of failures.
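
As a rough illustration of what migration-threshold changes, the toy model below counts failures and stops retrying once the count reaches the threshold. It is a sketch in Python, not Pacemaker code: the function name `recovery_attempts`, the `max_cycles` cap, and the always-failing resource are assumptions made for illustration.

```python
import math


def recovery_attempts(migration_threshold, max_cycles=20):
    """Count recoveries of a resource whose monitor always finds it stopped.

    Toy model (hypothetical helper, not a Pacemaker API): each cycle the
    monitor reports a failure, the fail count is incremented, and the
    resource is restarted unless the fail count has reached the threshold.
    max_cycles only bounds the simulation; it is not a cluster setting.
    """
    fail_count = 0
    restarts = 0
    for _ in range(max_cycles):
        fail_count += 1  # monitor found the resource stopped
        if fail_count >= migration_threshold:
            break  # threshold reached: give up on further recovery
        restarts += 1  # otherwise try to start the resource again
    return fail_count, restarts


# No effective threshold: recovery continues for the whole simulated window.
print(recovery_attempts(math.inf))
# Threshold of 3: the third failure ends the retries.
print(recovery_attempts(3))
```

With pcs, the option is normally set as a resource meta attribute, for example `pcs resource meta named migration-threshold=3`; treat the exact invocation as an assumption and check it against your pcs version.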

Unless you object, I'll close this for now.

Comment 4 Fedora End Of Life 2013-12-21 14:31:54 UTC
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.