Description of problem:
pacemaker won't let me start a file system resource.

[root@west-02 ~]# pcs resource enable --wait=100 west0
Error: unable to start: 'west0', please check logs for failure information
Resource 'west0' is not running on any node

I saw no pertinent messages in pacemaker's logs, /var/log/messages, or kernel messages. I'm a novice user of pacemaker. I've used it in the past to configure gfs2 file systems, but I'm not very experienced with it.

I have several pacemaker complaints here:
<rant warning>

1. It said "please check logs for failure information" but it did not tell me what logs to check. Does it mean /var/log/messages? Does it mean /var/log/cluster/corosync.log? Does it mean another log? I didn't see pertinent information in either.

2. "pcs resource show" did not indicate there was a problem (it said the resources are stopped; that's all). Similarly, "pcs resource show west0" did not indicate there was a problem either. However, "pcs status" showed several "failed actions."

3. "pcs status" showed several failed actions, but:

* fence-west-04_start_0 on west-04 'unknown error' (1): call=19, status=Error, exitreason='none',

My complaint here is "unknown error". I absolutely HATE any application that claims to have an "unknown error". It's just plain bad programming. No program should ever say that. At least they should say "sent this request, got this bad response." What does "status=Error" mean?

4. It says "exitreason='none'". Really? It exited with an error for no reason?

5. "pcs status" also said:

* west0_start_0 on west-02 'not installed' (5): call=40, status=complete, exitreason='Couldn't find device [/dev/west/west0]. Expected /dev/??? to exist',

But despite this, /dev/west/west0 DOES exist, and I can manually mount it just fine:

[root@west-02 ~]# mount -tgfs2 /dev/west/west0 /mnt/west0
[root@west-02 ~]# echo $?
0
[root@west-02 ~]#

So why is pacemaker unable to do that? Why does it not give an intelligent reason it can't?
See the log I will attach shortly.

Version-Release number of selected component (if applicable):
RHEL7.3
pacemaker-1.1.15-10.el7.x86_64

How reproducible:
Seems consistent

Steps to Reproduce:
See attachment

Actual results:
Error: unable to start: 'west0', please check logs for failure information
Resource 'west0' is not running on any node

Expected results:
The resource should either start or tell me why it didn't start.

Additional info:
I suspect the problems are leftovers from yesterday. I made some mistakes creating the resource, and accidentally used the wrong locking protocol, which caused gfs2 to withdraw. Still, I rebooted the entire cluster, then deleted and recreated those resources. I'm trying to test an experimental gfs2 module, so you'll see tainted kernel messages and such in the log, which I will attach.
Created attachment 1198769 [details]
Log of terminal session where this happened

This is a copy/paste of my terminal session where I did most of the commands. Maybe it will explain what went wrong.
It looks as if previous failures caused the failcounts to be infinity, and pacemaker refused to enable the resource after that:

[root@west-02 ~]# pcs resource failcount show west0
Failcounts for west0
 west-02: INFINITY
 west-03: INFINITY
 west-04: INFINITY

A simple:

pcs resource cleanup west0

allowed the resource to operate normally.

We discussed this on IRC and decided we could use this bz as a forum for improving the customer experience.

(1) Perhaps disabling the resource should automatically reset its failcount?
(2) Perhaps the error messages can be improved?
(3) Perhaps the failcounts should be reported by "pcs resource show"?
etc.
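To make idea (2) concrete, here is a minimal sketch of what an enable-time fail-count warning could look like. None of these function names are actual pcs internals; the only hard fact borrowed from pacemaker is that it represents INFINITY as the score 1000000 internally.

```python
# Hypothetical sketch: surface fail counts at 'pcs resource enable' time
# so the user isn't left guessing why the resource won't start.
# Not actual pcs code; pacemaker represents INFINITY as 1000000.
INFINITY = 1000000

def enable_warnings(resource, failcounts):
    """Build warning strings for nodes where prior failures block a start."""
    warnings = []
    for node, count in sorted(failcounts.items()):
        if count <= 0:
            continue
        shown = "INFINITY" if count >= INFINITY else str(count)
        warnings.append(
            f"Warning: '{resource}' has fail-count={shown} on {node}; "
            f"it may not start until 'pcs resource cleanup {resource}' is run"
        )
    return warnings

for w in enable_warnings("west0", {"west-02": INFINITY,
                                   "west-03": INFINITY,
                                   "west-04": 0}):
    print(w)
```

With the fail counts from the session above, this would emit one warning per affected node instead of the bare "please check logs" message.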
I agree with Bob that we could stand to improve the experience here. Customers hit these types of problems often, and they're usually left guessing as to what they need to do to get their resources running again.

According to Bob's captured session and his report that this is using 7.3 packages, it looks like we already have a few improvements in pcs's output that should help slightly in this area. One is that it now more clearly shows resources as "disabled" rather than target-role:Stopped (which a novice user might not understand). Another is the output at the top of his status output:

>> 3 nodes and 18 resources configured: 6 resources DISABLED and 0 BLOCKED from being started due to failures

That said, this doesn't seem to be working correctly, given that the same output lists the resource failures that contributed to the INFINITY failcount, which was then blocking these from starting. So we might need to investigate whether that's malfunctioning.

The idea to reset failcounts on disable seems like a good one. If we don't do that, then I think 'pcs resource enable' should check for failcounts and report details, or log a warning, if it finds any.

Finally, I think it would be useful to consider adding a field of some kind that indicates when a resource operation is in progress. I don't think this would have helped in Bob's case, but often customers run 'pcs resource enable', panic when the resource hasn't started 30 seconds later, and then take further action that disrupts any possibility of it starting. For instance, if their resource start op is taking a long time, there's usually no way to tell anything at all is happening in the status output, and the customer may go and disable the resource or start mucking with the underlying resources manually when they don't see any activity. If we had a "Starting" state string or some other operation-in-progress notifier, that might avoid that problem.
Let's focus this BZ on the status output (crm_mon at the pacemaker level, which is pcs status at the pcs level). Comments:

* The 7.3 packages have significant improvements in crm_mon, and this is an active area of further improvement. Related BZs include Bug 1194761, Bug 1230294, Bug 1298581, Bug 1300597, Bug 1300604, and Bug 1363907.

* The new "N resources DISABLED and N BLOCKED" output will get a small improvement in 7.4: each will only be printed if N > 0, so it won't be distracting if everything is enabled. Be aware that "blocked" here is used in a specific technical sense -- in Pacemaker configuration, "block" means unmanage the resource, i.e. don't start or stop it. A resource is counted as blocked if it failed and has on-fail=block configured, or multiple instances are (unintentionally) running and multiple-active=block is configured. It does not mean the conversational sense of "blocked" as "prevented from running". I can see how this might be confusing, and suggestions for improvement are welcome.

* crm_mon already has the ability to print fail counts (with the -f option). If we want pcs to display these by default, we can clone this bz for pcs so it can be modified to use the option. The crm_mon fail count output could probably be improved; an example:

Migration Summary:
* Node rhel7-2:
* Node rhel7-3:
   migrator: migration-threshold=1000000 fail-count=1 last-failure='Wed Sep 7 14:07:47 2016'
* Node rhel7-4:
* Node rhel7-5:
* Node rhel7-1:

Failed Actions:
* migrator_monitor_10000 on rhel7-3 'not running' (7): call=63, status=complete, exitreason='none',
    last-rc-change='Wed Sep 7 14:07:47 2016', queued=0ms, exec=0ms

* The "please check logs" message is coming from pcs, so if we want to change that, we can mention it when cloning this bz for pcs. I'm not sure there's a better alternative, since logging is highly configurable.
Users may have enabled or disabled syslog logging for corosync and/or pacemaker, and the pacemaker detail log may be /var/log/pacemaker.log, /var/log/cluster/corosync.log, some other location, or disabled, according to the configuration. And while resource agents usually log via syslog, they are external scripts, so they could use any logging they want. Perhaps "please check the system log, pacemaker detail log, and/or resource agent logs" would be the best we could do. The usefulness of those logs is another matter entirely.

* Currently, the various pcs "show" commands are intended strictly for displaying the configuration, while the "status" commands are intended for state information, so that's why "pcs resource show" didn't show errors. I believe "show" is being renamed to "config" to make this clearer. But we can raise this issue in the pcs clone bz.

* Let's break down a failure line to see what we can do better:

fence-west-04_start_0 on west-04 'unknown error' (1): call=19, status=Error, exitreason='none',

+ fence-west-04_start_0: This is the operation that failed. It is the name of the resource that failed (fence-west-04) plus the operation (start) plus the operation's recurring interval (0, meaning non-recurring). This format is useful for searching in the detail log, but we could trade that off for being more user-friendly, something like "start operation for fence-west-04". Opinions?

+ on west-04: the node the operation failed on. I think this is straightforward enough, although we could say "on node 'west-04'" if that's preferred.

+ 'unknown error' (1): the resource agent exited with status code 1 (the constant is PCMK_OCF_UNKNOWN_ERROR in the code), which indicates an error (without anything more specific). Error code 1 is used in many places, so I'm not sure there's a more helpful text we can show. Maybe we could just drop 'unknown', and say "returned 'error' (1)".
By way of example, another common value is "'not installed' (5)", indicating that the agent could not find all its dependencies.

+ call=19: an internal incremental ID that can be used to look for the failure in the detail log.

+ status=Error: This is the status of pacemaker's internal view of the operation, as opposed to the return status of the external agent itself. For example, this could be 'Not installed', which would mean that pacemaker could not find the agent itself (contrast with 'not installed' above, where the agent itself was there and executed, but could not find its dependencies). Suggestions for better wording are welcome.

+ exitreason='none': OCF resource agents may give a textual reason for an error by printing the magic string 'ocf-exit-reason:' in their output. If the agent did so, pacemaker will print it here; otherwise it will print 'none'. Probably we should just omit it entirely if not given. Perhaps if it is given, we should print it closer to the main error message, before the call ID.

Putting all that together, the same failure line might look like:

start operation for fence-west-04 on node 'west-04' returned 'error' (1): call=19, status=Error

* I don't think we should clear fail counts when a resource is disabled. It's logically unrelated. Someone might disable a resource via a script, and not realize they are hiding errors. Also, errors occur on a particular node, but disabling a resource is cluster-wide, so there's a scope inconsistency there.

* Pacemaker does have the existing capability to show whether an operation is in progress, however this is a configuration option: record-pending must be set to true for the operation (or more likely, as an operation default). It defaults to false. In all RHEL versions, crm_mon requires the -j option to display the pending state, even when the option is enabled in the configuration -- which means pcs status won't ever show it.
In the next upstream release of pacemaker (expected in 7.4), crm_mon will show pending operations by default, so enabling it in the configuration will be sufficient to make it show up in pcs status. I'm not sure why record-pending defaults to false, or how we might raise its visibility among users.
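The failure-line rewording proposed above (a friendlier operation description, 'error' instead of 'unknown error', and an exit reason printed near the main message and omitted when absent) could be prototyped along these lines. This is an illustrative sketch, not actual crm_mon code; the input fields just mirror what the status output records for a failed action.

```python
# Illustrative sketch of the proposed failure-line format; not crm_mon code.
def format_failure(resource, op, interval, node, rc_text, rc,
                   call, status, exitreason=None):
    """Render a failed action the way the breakdown above proposes."""
    # interval is kept for completeness (0 means non-recurring)
    # 'unknown error' (1) becomes just 'error' (1), per the suggestion
    if rc == 1:
        rc_text = "error"
    line = (f"{op} operation for {resource} on node '{node}' "
            f"returned '{rc_text}' ({rc})")
    # print the agent's exit reason near the main message, and omit it
    # entirely when the agent did not provide one
    if exitreason:
        line += f": {exitreason}"
    return line + f": call={call}, status={status}"

print(format_failure("fence-west-04", "start", 0, "west-04",
                     "unknown error", 1, 19, "Error"))
# -> start operation for fence-west-04 on node 'west-04' returned 'error' (1): call=19, status=Error
```

Applied to Bob's west0 failure, the same function would put "Couldn't find device [/dev/west/west0]..." right after the error text instead of at the end of the line.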
Some different perspectives inline

(In reply to Robert Peterson from comment #0)
> I have several pacemaker complaints here:
> <rant warning>
>
> 1. It said "please check logs for failure information" but it did
> not tell me what logs to check. Does it mean /var/log/messages?
> Does it mean /var/log/cluster/corosync.log? Does it mean another
> log? I didn't see pertinent information in either.

It doesn't say because it cannot know. Does the application log at all? To syslog? A custom file? Or maybe it was a problem in, or detected by, the agent; then it will be in the pacemaker logs, but we can't even say for sure where syslog is directed on your system, or where our debug logs are kept -- if at all (yes, we all use RHEL, but the code can't assume that). So, yes, it's a very bland message, but given the variation in the kinds of errors and applications managed by the cluster, it can't be more specific.

> 2. "pcs resource show" did not indicate there was a problem (it
> said the resources are stopped. That's all). Similarly,
> "pcs resource show west0" did not indicate there was a problem
> either. However, "pcs status" showed several "failed actions."

This is a good point. I think (optionally) including the list of failed actions in "pcs resource show (node)" would be a nice improvement and shouldn't be that hard.

> 3. "pcs status" showed several failed actions, but:
> * fence-west-04_start_0 on west-04 'unknown error' (1): call=19,
>   status=Error, exitreason='none',
> My complaint here is "unknown error". I absolutely HATE any
> application that claims to have an "unknown error".
> It's just plain bad programming. No program should ever say that.
> At least they should say "sent this request, got this bad
> response." What does "status=Error" mean?

One could argue that s/unknown/generic/ would be better in that output.
All pacemaker gets from the agent is a return code, '1' in this case, which is enough to know there was an error, but we have no way of knowing the root cause. Other kinds of agents (not fencing ones) have a standard set of a dozen or so different return codes that allow us to provide more information in some cases (i.e. binary not found, permission denied), but even they can't cover everything. The logs will include any output from the agent, but we do not store that in the cluster status (it can be a very long string), nor do we plan to go down the rabbit hole of trying to parse every possible agent's output and figure out a meaning.

> 4. It says "exitreason='none'". Really? It exited with an error
> for no reason?

Granted, this could be clearer, but it means for no _known_ or specific reason.

> 5. "pcs status" also said:
> * west0_start_0 on west-02 'not installed' (5): call=40, status=complete,
>   exitreason='Couldn't find device [/dev/west/west0]. Expected /dev/??? to
>   exist',

This is an example of the additional error codes I mentioned above.

> But despite this, /dev/west/west0 DOES exist, and I can manually
> mount it just fine:

Did it exist at the time the command was run, though? The agent claims it didn't. Did you check to see what the agent does, or retest it with "pcs resource debug-start west0 --all"?

> [root@west-02 ~]# mount -tgfs2 /dev/west/west0 /mnt/west0
> [root@west-02 ~]# echo $?
> 0
> [root@west-02 ~]#
> So why is pacemaker unable to do that? Why does it not give an
> intelligent reason it can't?

It did
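The "standard set of a dozen or so return codes" mentioned above is the OCF resource agent return-code set; a sketch of it as a lookup table, with the display strings seen elsewhere in this thread noted in comments ('unknown error' (1), 'not installed' (5), 'not running' (7)):

```python
# Standard OCF resource-agent return codes, per the OCF resource agent API.
# Comments note the strings pacemaker shows for the codes seen in this BZ.
OCF_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",        # shown as 'unknown error'
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",      # shown as 'not installed'
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",        # shown as 'not running'
}

def describe(rc):
    """Map an agent's exit code to its OCF constant name."""
    return OCF_CODES.get(rc, f"unrecognized rc {rc}")

print(describe(1), describe(5))
# -> OCF_ERR_GENERIC OCF_ERR_INSTALLED
```

This is why code 1 can't say more than "generic/unknown error": it is the agent's catch-all, while codes like 5 carry a specific meaning the cluster can report.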
(In reply to Ken Gaillot from comment #4)
> * I don't think we should clear fail counts when a resource is disabled.
> It's logically unrelated. Someone might disable a resource via a script, and
> not realize they are hiding errors. Also, errors occur on a particular node,
> but disabling a resource is cluster-wide, so there's a scope inconsistency
> there.

Agreed

> * Pacemaker does have the existing capability to show whether an operation
> is in progress, however this is a configuration option: record-pending must
> be set to true for the operation (or more likely, as an operation default).
> It defaults to false.

I think we're nearing the point where we should at least consider turning this on by default.

> I'm not sure why record-pending defaults to false,

The short answer is twofold:

1. It doubles the work needed to maintain the status section. Every operation produces two updates: one to say "I kicked it off" and the other to say "complete/failed".

2. It introduces a whole set of new corner cases for the PE, which now has to deal with in-progress operations.
Unfortunately, due to capacity constraints, this will not be ready for 7.4. It will be a priority for 7.5.
Due to a short time frame and limited capacity, this will not make 7.5.
(In reply to Andrew Beekhof from comment #6) > (In reply to Ken Gaillot from comment #4) > > * Pacemaker does have the existing capability to show whether an operation > > is in progress, however this is a configuration option: record-pending must > > be set to true for the operation (or more likely, as an operation default). > > It defaults to false. > > I think we're nearing the point where we should at least consider turning > this on by default. record-pending will default to true starting with 7.5 (Bug 1461976)
Moving to RHEL 8 only, as this will not make 7.7, which will be the last RHEL 7 feature release
We now have a separate Bug 1470834 to cover the failed action display, and Bug 1458953 covers fixing the disabled resources count. I believe that covers everything in this BZ that could be handled by crm_mon, so I am reassigning this to pcs for consideration of the remaining points that were brought up:

* "please check logs for failure information" is vague; maybe something like "please check the system log, pacemaker detail log, and/or resource agent logs", or even parsing the logging configuration to refer to specific locations

* "pcs resource enable" could check for reasons the resource might still not start, such as fail counts ("crm_resource --why" could be useful)

Additionally, there was some confusion over why "pcs resource show" didn't show fail counts; that command was renamed to "pcs resource config" in RHEL 8, which hopefully will be sufficient to avoid that.
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.