Description of problem:
pacemaker won't let me start a file system resource.

[root@west-02 ~]# pcs resource enable --wait=100 west0
Error: unable to start: 'west0', please check logs for failure information
Resource 'west0' is not running on any node

I saw no pertinent messages in pacemaker's logs, /var/log/messages, or kernel messages. I'm a novice user of pacemaker. I've used it in the past to configure gfs2 file systems, but I'm not very experienced with it.

I have several pacemaker complaints here:
<rant warning>

1. It said "please check logs for failure information" but it did not tell me what logs to check. Does it mean /var/log/messages? Does it mean /var/log/cluster/corosync.log? Does it mean another log? I didn't see pertinent information in either.

2. "pcs resource show" did not indicate there was a problem (it said the resources are stopped; that's all). Similarly, "pcs resource show west0" did not indicate there was a problem either. However, "pcs status" showed several "failed actions."

3. "pcs status" showed several failed actions, but:

* fence-west-04_start_0 on west-04 'unknown error' (1): call=19, status=Error, exitreason='none',

My complaint here is "unknown error". I absolutely HATE any application that claims to have an "unknown error". It's just plain bad programming. No program should ever say that. At least they should say "sent this request, got this bad response." What does "status=Error" mean?

4. It says "exitreason='none'". Really? It exited with an error for no reason?

5. "pcs status" also said:

* west0_start_0 on west-02 'not installed' (5): call=40, status=complete, exitreason='Couldn't find device [/dev/west/west0]. Expected /dev/??? to exist',

But despite this, /dev/west/west0 DOES exist, and I can manually mount it just fine:

[root@west-02 ~]# mount -tgfs2 /dev/west/west0 /mnt/west0
[root@west-02 ~]# echo $?
0
[root@west-02 ~]#

So why is pacemaker unable to do that? Why does it not give an intelligent reason it can't?
See the log I will attach shortly.

Version-Release number of selected component (if applicable):
RHEL7.3
pacemaker-1.1.15-10.el7.x86_64

How reproducible:
Seems consistent

Steps to Reproduce:
See attachment

Actual results:
Error: unable to start: 'west0', please check logs for failure information
Resource 'west0' is not running on any node

Expected results:
The resource should either start or tell me why it didn't start.

Additional info:
I suspect the problems are leftovers from yesterday. I made some mistakes creating the resource, and accidentally used the wrong locking protocol, which caused gfs2 to withdraw. Still, I rebooted the entire cluster, then deleted and recreated those resources. I'm trying to test an experimental gfs2 module, so you'll see tainted kernel messages and such in the log, which I will attach.
Created attachment 1198769 [details]
Log of terminal session where this happened

This is a copy/paste of my terminal session where I did most of the commands. Maybe it will explain what went wrong.
It looks as if previous failures caused the failcounts to be infinity, and pacemaker refused to enable the resource after that:

[root@west-02 ~]# pcs resource failcount show west0
Failcounts for west0
 west-02: INFINITY
 west-03: INFINITY
 west-04: INFINITY

A simple:

pcs resource cleanup west0

allowed the resource to operate normally.

We discussed this on IRC and decided we could use this bz as a forum for improving the customer experience.

(1) Perhaps disabling the resource should automatically reset its failcount?
(2) Perhaps the error messages can be improved?
(3) Perhaps the failcounts should be reported by "pcs resource show"?
etc.
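To make idea (2) concrete, here is a minimal sketch of what an enable-time fail-count warning could look like. None of these function names are actual pcs internals; the only hard fact borrowed from pacemaker is that it represents INFINITY as the score 1000000 internally.

```python
# Hypothetical sketch: surface fail counts at 'pcs resource enable' time
# so the user isn't left guessing why the resource won't start.
# Not actual pcs code; pacemaker represents INFINITY as 1000000.
INFINITY = 1000000

def enable_warnings(resource, failcounts):
    """Build warning strings for nodes where prior failures block a start."""
    warnings = []
    for node, count in sorted(failcounts.items()):
        if count <= 0:
            continue
        shown = "INFINITY" if count >= INFINITY else str(count)
        warnings.append(
            f"Warning: '{resource}' has fail-count={shown} on {node}; "
            f"it may not start until 'pcs resource cleanup {resource}' is run"
        )
    return warnings

for w in enable_warnings("west0", {"west-02": INFINITY,
                                   "west-03": INFINITY,
                                   "west-04": 0}):
    print(w)
```

With the fail counts from the session above, this would emit one warning per affected node instead of the bare "please check logs" message.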
I agree with Bob that we could stand to improve the experience here. Customers hit these types of problems often, and they're usually left guessing as to what they need to do to get their resources running again.

According to Bob's captured session and his report that this is using 7.3 packages, it looks like we already have a few improvements in pcs's output that should help slightly in this area. One is that it now more clearly shows resources as "disabled" rather than target-role:Stopped (which a novice user might not understand). Another is the output at the top of his status output:

>> 3 nodes and 18 resources configured: 6 resources DISABLED and 0 BLOCKED from being started due to failures

That said, this doesn't seem to be working correctly, given that the same output lists the resource failures that contributed to the INFINITY failcount, which was then blocking these from starting. So we might need to investigate whether that's malfunctioning.

The idea to reset failcounts on disable seems like a good one. If we don't do that, then I think 'pcs resource enable' should check for failcounts and report details, or log a warning, if it finds any.

Finally, I think it would be useful to consider adding a field of some kind that indicates when a resource operation is in progress. I don't think this would have helped in Bob's case, but often customers run 'pcs resource enable', panic when the resource hasn't started 30 seconds later, and then take further action that disrupts any possibility of it starting. For instance, if their resource start op is taking a long time, there's usually no way to tell anything at all is happening in the status output, and the customer may go and disable the resource or start mucking with the underlying resources manually when they don't see any activity. If we had a "Starting" state string or some other operation-in-progress notifier, that might avoid that problem.
Let's focus this BZ on the status output (crm_mon at the pacemaker level, which is pcs status at the pcs level). Comments:

* The 7.3 packages have significant improvements in crm_mon, and this is an active area of further improvement. Related BZs include Bug 1194761, Bug 1230294, Bug 1298581, Bug 1300597, Bug 1300604, and Bug 1363907.

* The new "N resources DISABLED and N BLOCKED" output will get a small improvement in 7.4: each will only be printed if N > 0, so it won't be distracting if everything is enabled. Be aware that "blocked" here is used in a specific technical sense -- in Pacemaker configuration, "block" means unmanage the resource, i.e. don't start or stop it. A resource is counted as blocked if it failed and has on-fail=block configured, or multiple instances are (unintentionally) running and multiple-active=block is configured. It does not mean the conversational sense of "blocked" as "prevented from running". I can see how this might be confusing, and suggestions for improvement are welcome.

* crm_mon already has the ability to print fail counts (with the -f option). If we want pcs to display these by default, we can clone this bz for pcs so it can be modified to use the option. The crm_mon fail count output could probably be improved; an example:

Migration Summary:
* Node rhel7-2:
* Node rhel7-3:
   migrator: migration-threshold=1000000 fail-count=1 last-failure='Wed Sep 7 14:07:47 2016'
* Node rhel7-4:
* Node rhel7-5:
* Node rhel7-1:

Failed Actions:
* migrator_monitor_10000 on rhel7-3 'not running' (7): call=63, status=complete, exitreason='none',
    last-rc-change='Wed Sep 7 14:07:47 2016', queued=0ms, exec=0ms

* The "please check logs" message is coming from pcs, so if we want to change that, we can mention it when cloning this bz for pcs. I'm not sure there's a better alternative, since logging is highly configurable.
Users may have enabled or disabled syslog logging for corosync and/or pacemaker, and the pacemaker detail log may be /var/log/pacemaker.log, /var/log/cluster/corosync.log, some other location, or disabled, according to the configuration. And while resource agents usually log via syslog, they are external scripts, so they could use any logging they want. Perhaps "please check the system log, pacemaker detail log, and/or resource agent logs" would be the best we could do. The usefulness of those logs is another matter entirely.

* Currently, the various pcs "show" commands are intended strictly for displaying the configuration, while the "status" commands are intended for state information, so that's why "pcs resource show" didn't show errors. I believe "show" is being renamed to "config" to make this clearer. But we can raise this issue in the pcs clone bz.

* Let's break down a failure line to see what we can do better:

fence-west-04_start_0 on west-04 'unknown error' (1): call=19, status=Error, exitreason='none',

+ fence-west-04_start_0: This is the operation that failed. It is the name of the resource that failed (fence-west-04) plus the operation (start) plus the operation's recurring interval (0, meaning non-recurring). This format is useful for searching in the detail log, but we could trade that off for being more user-friendly, something like "start operation for fence-west-04". Opinions?

+ on west-04: the node the operation failed on. I think this is straightforward enough, although we could say "on node 'west-04'" if that's preferred.

+ 'unknown error' (1): the resource agent exited with status code 1 (the constant is PCMK_OCF_UNKNOWN_ERROR in the code), which indicates an error (without anything more specific). Error code 1 is used in many places, so I'm not sure there's a more helpful text we can show. Maybe we could just drop 'unknown', and say "returned 'error' (1)".
By way of example, another common value is "'not installed' (5)", indicating that the agent could not find all its dependencies.

+ call=19: an internal incremental ID that can be used to look for the failure in the detail log.

+ status=Error: This is the status of pacemaker's internal view of the operation, as opposed to the return status of the external agent itself. For example, this could be 'Not installed', which would mean that pacemaker could not find the agent itself (contrast with 'not installed' above, where the agent itself was there and executed, but could not find its dependencies). Suggestions for better wording are welcome.

+ exitreason='none': OCF resource agents may give a textual reason for an error by printing the magic string 'ocf-exit-reason:' in their output. If the agent did so, pacemaker will print it here; otherwise it will print 'none'. Probably we should just omit it entirely if not given. Perhaps if it is given, we should print it closer to the main error message, before the call ID.

Putting all that together, the same failure line might look like:

start operation for fence-west-04 on node 'west-04' returned 'error' (1): call=19, status=Error

* I don't think we should clear fail counts when a resource is disabled. It's logically unrelated. Someone might disable a resource via a script, and not realize they are hiding errors. Also, errors occur on a particular node, but disabling a resource is cluster-wide, so there's a scope inconsistency there.

* Pacemaker does have the existing capability to show whether an operation is in progress, however this is a configuration option: record-pending must be set to true for the operation (or more likely, as an operation default). It defaults to false. In all RHEL versions, crm_mon requires the -j option to display the pending state, even when the option is enabled in the configuration -- which means pcs status won't ever show it.
In the next upstream release of pacemaker (expected in 7.4), crm_mon will show pending operations by default, so enabling it in the configuration will be sufficient to make it show up in pcs status. I'm not sure why record-pending defaults to false, or how we might raise its visibility among users.
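The failure-line rewording proposed above (a friendlier operation description, 'error' instead of 'unknown error', and an exit reason printed near the main message and omitted when absent) could be prototyped along these lines. This is an illustrative sketch, not actual crm_mon code; the input fields just mirror what the status output records for a failed action.

```python
# Illustrative sketch of the proposed failure-line format; not crm_mon code.
def format_failure(resource, op, interval, node, rc_text, rc,
                   call, status, exitreason=None):
    """Render a failed action the way the breakdown above proposes."""
    # interval is kept for completeness (0 means non-recurring)
    # 'unknown error' (1) becomes just 'error' (1), per the suggestion
    if rc == 1:
        rc_text = "error"
    line = (f"{op} operation for {resource} on node '{node}' "
            f"returned '{rc_text}' ({rc})")
    # print the agent's exit reason near the main message, and omit it
    # entirely when the agent did not provide one
    if exitreason:
        line += f": {exitreason}"
    return line + f": call={call}, status={status}"

print(format_failure("fence-west-04", "start", 0, "west-04",
                     "unknown error", 1, 19, "Error"))
# -> start operation for fence-west-04 on node 'west-04' returned 'error' (1): call=19, status=Error
```

Applied to Bob's west0 failure, the same function would put "Couldn't find device [/dev/west/west0]..." right after the error text instead of at the end of the line.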
Some different perspectives inline

(In reply to Robert Peterson from comment #0)
> I have several pacemaker complaints here:
> <rant warning>
>
> 1. It said "please check logs for failure information" but it did
> not tell me what logs to check. Does it mean /var/log/messages?
> Does it mean /var/log/cluster/corosync.log? Does it mean another
> log? I didn't see pertinent information in either.

It doesn't say because it cannot know. Does the application log at all? To syslog? A custom file? Or maybe it was a problem in, or detected by, the agent; then it will be in the pacemaker logs, but we can't even say for sure where syslog is directed on your system, or where our debug logs are kept -- if at all (yes, we all use RHEL, but the code can't assume that). So, yes, it's a very bland message, but given the variation in the kinds of errors and applications managed by the cluster, it can't be more specific.

> 2. "pcs resource show" did not indicate there was a problem (it
> said the resources are stopped. That's all). Similarly,
> "pcs resource show west0" did not indicate there was a problem
> either. However, "pcs status" showed several "failed actions."

This is a good point. I think (optionally) including the list of failed actions in "pcs resource show (node)" would be a nice improvement and shouldn't be that hard.

> 3. "pcs status" showed several failed actions, but:
> * fence-west-04_start_0 on west-04 'unknown error' (1): call=19,
>   status=Error, exitreason='none',
> My complaint here is "unknown error". I absolutely HATE any
> application that claims to have an "unknown error".
> It's just plain bad programming. No program should ever say that.
> At least they should say "sent this request, got this bad
> response." What does "status=Error" mean?

One could argue that s/unknown/generic/ would be better in that output.
All pacemaker gets from the agent is a return code, '1' in this case, which is enough to know there was an error, but we have no way of knowing the root cause. Other kinds of agents (not fencing ones) have a standard set of a dozen or so different return codes that allow us to provide more information in some cases (i.e. binary not found, permission denied), but even they can't cover everything. The logs will include any output from the agent, but we do not store that in the cluster status (it can be a very long string), nor do we plan to go down the rabbit hole of trying to parse every possible agent's output and figure out a meaning.

> 4. It says "exitreason='none'". Really? It exited with an error
> for no reason?

Granted, this could be clearer, but it means for no _known_ or specific reason.

> 5. "pcs status" also said:
> * west0_start_0 on west-02 'not installed' (5): call=40, status=complete,
>   exitreason='Couldn't find device [/dev/west/west0]. Expected /dev/??? to
>   exist',

This is an example of the additional error codes I mentioned above.

> But despite this, /dev/west/west0 DOES exist, and I can manually
> mount it just fine:

Did it exist at the time the command was run, though? The agent claims it didn't. Did you check to see what the agent does, or retest it with "pcs resource debug-start west0 --all"?

> [root@west-02 ~]# mount -tgfs2 /dev/west/west0 /mnt/west0
> [root@west-02 ~]# echo $?
> 0
> [root@west-02 ~]#
> So why is pacemaker unable to do that? Why does it not give an
> intelligent reason it can't?

It did
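The "standard set of a dozen or so return codes" mentioned above is the OCF resource agent return-code set; a sketch of it as a lookup table, with the display strings seen elsewhere in this thread noted in comments ('unknown error' (1), 'not installed' (5), 'not running' (7)):

```python
# Standard OCF resource-agent return codes, per the OCF resource agent API.
# Comments note the strings pacemaker shows for the codes seen in this BZ.
OCF_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",        # shown as 'unknown error'
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",      # shown as 'not installed'
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",        # shown as 'not running'
}

def describe(rc):
    """Map an agent's exit code to its OCF constant name."""
    return OCF_CODES.get(rc, f"unrecognized rc {rc}")

print(describe(1), describe(5))
# -> OCF_ERR_GENERIC OCF_ERR_INSTALLED
```

This is why code 1 can't say more than "generic/unknown error": it is the agent's catch-all, while codes like 5 carry a specific meaning the cluster can report.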
(In reply to Ken Gaillot from comment #4)
> * I don't think we should clear fail counts when a resource is disabled.
> It's logically unrelated. Someone might disable a resource via a script, and
> not realize they are hiding errors. Also, errors occur on a particular node,
> but disabling a resource is cluster-wide, so there's a scope inconsistency
> there.

Agreed

> * Pacemaker does have the existing capability to show whether an operation
> is in progress, however this is a configuration option: record-pending must
> be set to true for the operation (or more likely, as an operation default).
> It defaults to false.

I think we're nearing the point where we should at least consider turning this on by default.

> I'm not sure why record-pending defaults to false,

The short answer is twofold:

1. It doubles the work needed to maintain the status section. Every operation produces two updates: one to say "I kicked it off" and the other to say "complete/failed".

2. It introduces a whole set of new corner cases for the PE, which now has to deal with in-progress operations.
Unfortunately, due to capacity constraints, this will not be ready for 7.4. It will be a priority for 7.5.
Due to a short time frame and limited capacity, this will not make 7.5.
(In reply to Andrew Beekhof from comment #6) > (In reply to Ken Gaillot from comment #4) > > * Pacemaker does have the existing capability to show whether an operation > > is in progress, however this is a configuration option: record-pending must > > be set to true for the operation (or more likely, as an operation default). > > It defaults to false. > > I think we're nearing the point where we should at least consider turning > this on by default. record-pending will default to true starting with 7.5 (Bug 1461976)
Moving to RHEL 8 only, as this will not make 7.7, which will be the last RHEL 7 feature release
We now have a separate Bug 1470834 to cover the failed action display, and Bug 1458953 covers fixing the disabled resources count. I believe that covers everything in this BZ that could be handled by crm_mon, so I am reassigning this to pcs for consideration of the remaining points that were brought up:

* "please check logs for failure information" is vague; maybe something like "please check the system log, pacemaker detail log, and/or resource agent logs", or even parsing the logging configuration to refer to specific locations

* "pcs resource enable" could check for reasons the resource might still not start, such as fail counts ("crm_resource --why" could be useful)

Additionally, there was some confusion over why "pcs resource show" didn't show fail counts; that command was renamed to "pcs resource config" in RHEL 8, which hopefully will be sufficient to avoid that.
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.