Bug 1395508

Summary: The command "pcs constraint location some_resource prefers this_node" does not start resources on "this_node"
Product: Red Hat Enterprise Linux 7
Reporter: vlad.socaciu
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED NOTABUG
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Priority: unspecified
Version: 7.2
CC: abeekhof, cluster-maint
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2017-01-16 17:46:57 UTC
Type: Bug
Attachments:
- corosync.log
- corosync.log from node uis1ccg1
- corosync.log from node uis1ccg2
- corosync.log
- One of the resource agents

Description vlad.socaciu 2016-11-16 04:22:01 UTC
Created attachment 1221027 [details]
corosync.log

Description of problem:

The cluster has two nodes. Ten lsb resources, named "resource_ccg_gid1"-"resource_ccg_gid10", were started on one cluster node. The resource agents are shell scripts which start our own application program. Consistently, two resources start on the peer node instead; in this case, they are "resource_ccg_gid3" and "resource_ccg_gid7". The last such incident occurred around 19:48:39. We have evidence that the cluster software may not have even attempted to start "resource_ccg_gid3" and "resource_ccg_gid7", but may have some "memory" from previous runs which makes it believe that it should not start those resources on this node.

In some previous runs, however, both "resource_ccg_gid3" and "resource_ccg_gid7" were started and then stopped about three seconds later. All the resources are started with these arguments: "... meta migration-threshold="1" op monitor interval=30s op start timeout=45s on-fail=ignore".
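
For reference, the create command presumably looked something like the following (a sketch only; the resource name and the lsb script name are examples, and only the meta/op arguments are taken from the report):

    pcs resource create resource_ccg_gid1 lsb:ccg_gid1 \
        meta migration-threshold=1 \
        op monitor interval=30s \
        op start timeout=45s on-fail=ignore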

The file corosync.log will be uploaded.

Version-Release number of selected component (if applicable):

1.1.13-10.el7_2.4.x86_64

Comment 2 Ken Gaillot 2016-11-16 23:27:53 UTC
After a large number of resources are added at 10:14:44, the node starts seeing high CPU load, which persists for a long time:

Nov 15 10:15:14 [1675] uis1ccg1.uismag.com       crmd:   notice: throttle_handle_load:  High CPU load detected: 4.140000

The load leads to corosync being starved for CPU, for example:

[1497] uis1ccg1.uismag.com corosyncwarning [MAIN  ] Corosync main process was not scheduled for 1083.4751 ms (threshold is 800.0000 ms). Consider token timeout increase.

This causes brief interruptions in the cluster communication, although it appears transient enough that neither node is fenced. Occasionally, a monitor fails, for example:

Nov 15 15:20:39 [1675] uis1ccg1.uismag.com       crmd:     info: process_lrm_event:     Operation resource_ccg_gid74_monitor_10000: not running (node=uis1ccg1-app, call=1276, rc=7, cib-update=809, confirmed=false)

The resources move due to the monitor failures, because migration-threshold=1:

Nov 15 19:48:52 [1672] uis1ccg1.uismag.com    pengine:     info: get_failcount_full:    resource_ccg_gid3 has failed 1 times on uis1ccg1-app
Nov 15 19:48:52 [1672] uis1ccg1.uismag.com    pengine:  warning: common_apply_stickiness:       Forcing resource_ccg_gid3 away from uis1ccg1-app after 1 failures (max=1)
Nov 15 19:48:52 [1672] uis1ccg1.uismag.com    pengine:     info: get_failcount_full:    resource_ccg_gid7 has failed 1 times on uis1ccg1-app
Nov 15 19:48:52 [1672] uis1ccg1.uismag.com    pengine:  warning: common_apply_stickiness:       Forcing resource_ccg_gid7 away from uis1ccg1-app after 1 failures (max=1)

So, Pacemaker is responding appropriately to the situation.
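
The per-node failure count that drives this decision can be checked and reset from the command line; a sketch (resource name taken from the log excerpt above, exact output varies by pcs version):

    # show how many times the resource has failed on each node
    pcs resource failcount show resource_ccg_gid3
    # forget the failure history so the node becomes eligible again
    pcs resource cleanup resource_ccg_gid3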

I think a support case will be more helpful in this case than bugzilla, because support can look at the system as a whole (rather than just pacemaker) to address the load and corosync issues. See:

  https://access.redhat.com/start/how-to-engage-red-hat-support

Comment 3 vlad.socaciu 2016-11-18 04:52:43 UTC
Created attachment 1221710 [details]
corosync.log from node uis1ccg1

Comment 4 vlad.socaciu 2016-11-18 04:53:59 UTC
Created attachment 1221711 [details]
corosync.log from node uis1ccg2

Comment 5 vlad.socaciu 2016-11-18 05:19:45 UTC
Here is what we did:

1. We doubled the number of CPU(s) on each cluster node.

2. I modified the resource agents such that when invoked with the "start" command line argument, they will not return until it is guaranteed that a further invocation with the "status" argument will return immediately and print "result: 0".

3. We create the resources with the command line arguments: "... op monitor timeout=30s interval=30s op start timeout=45s on-fail=ignore". The "timeout" values are exaggerated.

4. Before starting each resource, we clear all location constraints ("pcs constraint location remove ..."), and then create a location constraint to the cluster node we would like the resource agent to run on. 
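
As a sketch, step 4 amounts to something like the following (the constraint ID is a placeholder; listing syntax may vary by pcs version):

    # list all constraints together with their IDs
    pcs constraint --full
    # remove each existing location constraint by ID
    pcs constraint location remove <constraint-id>
    # then express the preference for the desired node
    pcs constraint location resource_ccg_gid13 prefers uis1ccg2-app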

We started 25 resources on the node uis1ccg2-app by running the start command on uis1ccg1-app. All but resource no. 13 (the one with the bad luck) started on the expected node. Resource 13 started on uis1ccg1-app. The corosync logs were posted earlier. The VCENTER program showed a CPU usage lower than 90% on both cluster nodes at the time when the resources were brought up. The start action took place around 7:55 PM.

There is an even more troubling behavior, which seems intentional, but it is extremely inconvenient to us: if we stop and start again that resource, indicating that we would want it to run on uis1ccg2-app, the resource starts again on uis1ccg1-app, by virtue of some kind of remembrance of a previous bad experience on uis1ccg2-app, probably. The only solution is to delete the resource, and recreate it, which would be unacceptable when the product is deployed in the field.

To clarify: each cluster node is connected to two "sub-nets", and each has a different name on each sub-net, respectively: uis1ccg1, uis1ccg1-app and uis1ccg2, uis1ccg2-app. The cluster software was configured to use the "-app" node names.

Thank you very much!

Comment 6 vlad.socaciu 2016-11-20 05:12:48 UTC
I really do not understand the mentality of the cluster software: once it decided that a resource could not run on a particular node (for a reason unbeknownst to the casual user), IT WILL REFUSE TO START IT ON THAT NODE, no matter what the user commands demand.

Before executing 

    pcs resource enable...

we call 

    pcs constraint location remove ...

for every existing location constraint, 

   pcs resource clear ...

(the reason for having two commands, "location remove" and "clear" with relatively similar purposes escapes me; in our case, this seems definitely a nuisance), and finally

   pcs constraint location ... prefers uis1ccg2-app

Nonetheless, the cluster software ignores our preference, and stubbornly starts the resource (named "resource_ccg_gid16" in this case) on the opposite node, uis1ccg1-app. The log file from uis1ccg2 will be posted. I think that the last such incident occurred around Nov 19, 19:08. 

Note that around 19:09, there are complaints about high CPU load, which we dispute. There was nothing CPU-intensive running during that period of time; the VCENTER control shows less than moderate CPU activity for the entire day, and our expectations, based on what we know is running on that virtual machine, are precisely the same. 

We suspect that the CPU usage reported by the cluster is inaccurate.

Comment 7 vlad.socaciu 2016-11-20 05:19:42 UTC
Created attachment 1222177 [details]
corosync.log

"resource_ccg_gid16" refusing to start on uis1ccg2-app and questionable reports about high CPU load

Comment 8 Andrew Beekhof 2016-11-20 22:49:22 UTC
(In reply to vlad.socaciu from comment #6)
> I really do not understand the mentality of the cluster software: once it
> decided that a resource could not run on a particular node (for a reason
> unbeknownst to the casual user), IT WILL REFUSE TO START IT ON THAT NODE, no
> matter what the user commands demand.

# pcs resource cleanup --help 

Usage: pcs resource cleanup...
    cleanup [<resource id>]
        Cleans up the resource in the lrmd (useful to reset the resource
        status and failcount).  This tells the cluster to forget the
        operation history of a resource and re-detect its current state.
        This can be useful to purge knowledge of past failures that have
        since been resolved. If a resource id is not specified then all
        resources/stonith devices will be cleaned up.
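
So in this case, something like the following (resource name taken from comment 6) should make the node eligible for that resource again:

    pcs resource cleanup resource_ccg_gid16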

Comment 9 vlad.socaciu 2016-11-21 00:49:44 UTC
(In reply to Andrew Beekhof from comment #8)
> (In reply to vlad.socaciu from comment #6)
> > I really do not understand the mentality of the cluster software: once it
> > decided that a resource could not run on a particular node (for a reason
> > unbeknownst to the casual user), IT WILL REFUSE TO START IT ON THAT NODE, no
> > matter what the user commands demand.
> 
> # pcs resource cleanup --help 
> 
> Usage: pcs resource cleanup...
>     cleanup [<resource id>]
>         Cleans up the resource in the lrmd (useful to reset the resource
>         status and failcount).  This tells the cluster to forget the
>         operation history of a resource and re-detect its current state.
>         This can be useful to purge knowledge of past failures that have
>         since been resolved. If a resource id is not specified then all
>         resources/stonith devices will be cleaned up.

Thank you very much.

Comment 10 Ken Gaillot 2016-11-21 21:56:31 UTC
(In reply to vlad.socaciu from comment #5)
> Here is what we did:
> 
> 1. We doubled the number of CPU(s) on each cluster node.
> 
> 2. I modified the resource agents such that when invoked with the "start"
> command line argument, they will not return until it is guaranteed that a
> further invocation with the "status" argument will return immediately and
> print "result: 0".
> 
> 3. We create the resources with the command line arguments: "... op monitor
> timeout=30s interval=30s op start timeout=45s on-fail=ignore". The "timeout"
> values are exaggerated.
> 
> 4. Before starting each resource, we clear all location constraints ("pcs
> constraint location remove ..."), and then create a location constraint to
> the cluster node we would like the resource agent to run on. 
> 
> We started 25 resources on the node uis1ccg2-app by running the start
> command on uis1ccg1-app. All but resource no. 13 (the one with the bad
> luck), started on the expected node. Resource 13 started on uis1ccg1-app.
> The corosync logs were posted earlier. The VCENTER program showed a CPU
> usage lower than 90% on both cluster nodes, at the time when the resources
> were brought up. The start action took place around 7:55 PM.

In the logs from Nov 17, we see this sequence:

1. resource_ccg_gid13 is added to the configuration with target-role=Stopped
2. A location constraint is added for the resource to prefer uis1ccg2-app
3. target-role is cleared
4. The cluster starts the resource successfully on uis1ccg2-app, but an immediate monitor fails:

Nov 17 19:55:35 [1902] uis1ccg1.uismag.com       crmd:     info: match_graph_event:     Action resource_ccg_gid13_start_0 (44) confirmed on uis1ccg2-app (rc=0)
...
Nov 17 19:55:35 [1902] uis1ccg1.uismag.com       crmd:     info: match_graph_event:     Action resource_ccg_gid13_monitor_30000 (45) confirmed on uis1ccg2-app (rc=7)
Nov 17 19:55:35 [1902] uis1ccg1.uismag.com       crmd:     info: update_failcount:      Updating failcount for resource_ccg_gid13 on uis1ccg2-app after failed monitor: rc=7 (update=value++, time=1479441335)
Nov 17 19:55:35 [1902] uis1ccg1.uismag.com       crmd:     info: process_graph_event:   Detected action (179.45) resource_ccg_gid13_monitor_30000.277=not running: failed

5. Due to migration-threshold=1, the cluster starts the resource on the other node.

Pacemaker is reacting to the monitor failure according to its configuration. If you did a "pcs status" at this time, it would show a "Failed actions" listing with the failed monitor.

The monitor failure, combined with the configured migration threshold, overrides the location preference (there would not be much point in HA if the cluster did not recover a failed resource).

There is no indication in corosync.log why the monitor failed, so you would have to look wherever your resource agent logs messages to see if there is any more information (or add some debug messages for testing). If the resource agent logs to stderr, that should show up in corosync.log; if it uses syslog, that would likely show up in /var/log/messages.
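
One low-effort way to capture that is to have the agent log the admin command's result, e.g. via logger(1); a sketch (the tag and variable are just examples):

    # inside the agent's status branch, right after the admin command runs:
    rc=$?
    logger -t resource_ccg_gid13 "monitor: ccg_gid_admin.exe status returned $rc"
    exit $rc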

> There is an even more troubling behavior, which seems intentional, but it is
> extremely inconvenient to us: if we stop and start again that resource,
> indicating that we would want it to run on uis1ccg2-app, the resource starts
> again on uis1ccg1-app, by virtue of some kind of remembrance of a previous
> bad experience on uis1ccg2-app, probably. The only solution is to delete the
> resource, and recreate it, which would be unacceptable when the product is
> deployed in the field.

As was discussed after this, failures continue to affect the cluster's behavior until "pcs resource cleanup" is run.

There is a resource meta-attribute, failure-timeout, that allows you to expire failures automatically so that the cluster will retry recovering the resource after that much time without needing a cleanup command, but this is not the default, because most people prefer to investigate the cause of trouble before placing production services on a node again.
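
A sketch of setting it (the 10-minute value is only an example):

    pcs resource meta resource_ccg_gid13 failure-timeout=10min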

Comment 11 Ken Gaillot 2016-11-21 22:12:42 UTC
(In reply to vlad.socaciu from comment #6)
> I really do not understand the mentality of the cluster software: once it
> decided that a resource could not run on a particular node (for a reason
> unbeknownst to the casual user), IT WILL REFUSE TO START IT ON THAT NODE, no
> matter what the user commands demand.

HA is a chaotic system; a large number of inputs are all combined to produce a decision. Pacemaker takes into account numerous aspects of the configuration (constraints, utilization attributes, time-of-day-based rules, failure handling settings, etc.) plus the current state of the cluster (resource operation successes/failures, the value of node attributes, node health, etc.) to arrive at a list of actions that need to be taken.

Unfortunately, such a complex process is inherently difficult to summarize in a readable way. "pcs status" at the time of an issue can often give some clues, and the logs, while difficult to follow, give an extensive amount of information about why actions were taken.

> Before executing 
> 
>     pcs resource enable...
> 
> we call 
> 
>     pcs constraint location remove ...
> 
> for every existing location constraint, 
> 
>    pcs resource clear ...
> 
> (the reason for having two commands, "location remove" and "clear" with
> relatively similar purposes escapes me; in our case, this seems definitely a
> nuisance), and finally

"pcs resource clear" is a shorthand for a special type of "location remove". If you run "location remove" on all constraints, then "clear" is unnecessary. The special type is constraints added by "pcs resource move" and "pcs resource ban"; such constraints will have an ID starting with "cli-".

"move" and "ban" work by adding location constraints. These constraints stay in place until explicitly removed. This can confuse users later when it's not obvious why a resource is staying in an unexpected location. "clear" allows someone to remove all such constraints so the resource returns to its "natural" location.

> 
>    pcs constraint location ... prefers uis1ccg2-app
> 
> Nonetheless, the cluster software ignores our preference, and stubbornly
> starts the resource (named "resource_ccg_gid16" in this case) on the
> opposite node, uis1ccg1-app. The log file from uis1ccg2 will be posted. I
> think that the last such incident occurred around Nov 19, 19:08. 
> 
> Note that around 19:09, there are complaints about high CPU load, which we
> refute. There was nothing CPU intensive running during that period of time;
> the VCENTER control shows less than moderate CPU activity for the entire
> day, and our expectations, based on what we know that it is running on that
> virtual machine, are precisely the same. 
> 
> We suspect that the CPU usage reported by the cluster is inaccurate.

Pacemaker gets its CPU usage information from /proc/$PID/stat where $PID is the process ID of pacemaker's CIB process. This file is created by the kernel and is the same source used by ps.

Corosync detects directly in real-time whether it has been scheduled CPU cycles frequently enough to process the token. I.e. if the token expires in 1 second and corosync is not scheduled for longer than 1 second, it can't maintain coherent cluster communication and is likely to get fenced.

Seeing both messages (from completely different sources) is a clear indication there is some sort of issue. What might be unclear here is that CPU *usage* is not the same as CPU *load*. Load means processes are using or *waiting for* CPU cycles; a process can be waiting due to high CPU usage, or due to other factors such as blocked I/O.
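
A sketch of looking at both figures on a node (the "cib" process name applies to the pacemaker 1.1 series):

    # load average: tasks running *or waiting*; on Linux this includes tasks blocked on I/O
    cat /proc/loadavg
    # CPU time consumed by pacemaker's cib process (utime/stime in clock ticks),
    # read from the same /proc file pacemaker samples
    awk '{print "utime:", $14, "stime:", $15}' "/proc/$(pidof cib)/stat"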

The main area to investigate here is why the resource agent monitor is failing. If you are interested in the load issue, I'd recommend a support case, since that is beyond just pacemaker code.

Comment 12 vlad.socaciu 2016-11-22 01:53:15 UTC
(In reply to Ken Gaillot from comment #10)
> (In reply to vlad.socaciu from comment #5)
> > Here is what we did:
> > 
> > 1. We doubled the number of CPU(s) on each cluster node.
> > 
> > 2. I modified the resource agents such that when invoked with the "start"
> > command line argument, they will not return until it is guaranteed that a
> > further invocation with the "status" argument will return immediately and
> > print "result: 0".
> > 
> > 3. We create the resources with the command line arguments: "... op monitor
> > timeout=30s interval=30s op start timeout=45s on-fail=ignore". The "timeout"
> > values are exaggerated.
> > 
> > 4. Before starting each resource, we clear all location constraints ("pcs
> > constraint location remove ..."), and then create a location constraint to
> > the cluster node we would like the resource agent to run on. 
> > 
> > We started 25 resources on the node uis1ccg2-app by running the start
> > command on uis1ccg1-app. All but resource no. 13 (the one with the bad
> > luck), started on the expected node. Resource 13 started on uis1ccg1-app.
> > The corosync logs were posted earlier. The VCENTER program showed a CPU
> > usage lower than 90% on both cluster nodes, at the time when the resources
> > were brought up. The start action took place around 7:55 PM.
> 
> In the logs from Nov 17, we see this sequence:
> 
> 1. resource_ccg_gid13 is added to the configuration with target-role=Stopped
> 2. A location constraint is added for the resource to prefer uis1ccg2-app
> 3. target-role is cleared
> 4. The cluster starts the resource successfully on uis1ccg2-app, but an
> immediate monitor fails:
> 
> Nov 17 19:55:35 [1902] uis1ccg1.uismag.com       crmd:     info:
> match_graph_event:     Action resource_ccg_gid13_start_0 (44) confirmed on
> uis1ccg2-app (rc=0)
> ...
> Nov 17 19:55:35 [1902] uis1ccg1.uismag.com       crmd:     info:
> match_graph_event:     Action resource_ccg_gid13_monitor_30000 (45)
> confirmed on uis1ccg2-app (rc=7)
> Nov 17 19:55:35 [1902] uis1ccg1.uismag.com       crmd:     info:
> update_failcount:      Updating failcount for resource_ccg_gid13 on
> uis1ccg2-app after failed monitor: rc=7 (update=value++, time=1479441335)
> Nov 17 19:55:35 [1902] uis1ccg1.uismag.com       crmd:     info:
> process_graph_event:   Detected action (179.45)
> resource_ccg_gid13_monitor_30000.277=not running: failed
> 
> 5. Due to migration-threshold=1, the cluster starts the resource on the
> other node.
> 
> Pacemaker is reacting to the monitor failure according to its configuration.
> If you did a "pcs status" at this time, it would show a "Failed actions"
> listing with the failed monitor.
> 
> The monitor failure, combined with the configured migration threshold,
> overrides the location preference (there would be not much point in HA if
> the cluster did not recover a failed resource).
> 
> There is no indication in corosync.log why the monitor failed, so you would
> have to look wherever your resource agent logs messages to see if there is
> any more information (or add some debug messages for testing). If the
> resource agent logs to stderr, that should show up in corosync.log; if it
> uses syslog, that would likely show up in /var/log/messages.
> 
> > There is an even more troubling behavior, which seems intentional, but it is
> > extremely inconvenient to us: if we stop and start again that resource,
> > indicating that we would want it to run on uis1ccg2-app, the resource starts
> > again on uis1ccg1-app, by virtue of some kind of remembrance of a previous
> > bad experience on uis1ccg2-app, probably. The only solution is to delete the
> > resource, and recreate it, which would be unacceptable when the product is
> > deployed in the field.
> 
> As was discussed after this, failures continue to affect the cluster's
> behavior until "pcs resource cleanup" is run.
> 
> There is a resource meta-attribute, failure-timeout, that allows you to
> expire failures automatically so that the cluster will retry recovering the
> resource after that much time without needing a cleanup command, but this is
> not the default, because most people prefer to investigate the cause of
> trouble before placing production services on a node again.

Thank you very much, indeed!

Let me focus on the resource unresponsiveness first.

The resource agents are very simple shell scripts that I wrote; each resource agent starts the same process (running a program that somebody who left the company wrote), with different command line arguments. That program has its own log files separate from anything else, so it is very easy for us to determine what happened on our side.

I checked the sequence of events on uis1ccg2 at the time of the incident. Our program started at about 19:55:35.301. Then, there is a mystery on our side, which I cannot explain. But at 19:55:35.403, the program received a "stop" command from the cluster, and exited.

Now, isn't it perverse :-) that, according to the corosync log file, the cluster software started resource_ccg_gid13 around 19:55:35 and then stopped it in the very same second, only to declare that it failed??? Let alone the fact that, following your advice, we set the "op start" timeout to 45s!!!

It seems that there may be some incongruous occurrences here...

Comment 13 Andrew Beekhof 2016-11-24 02:52:38 UTC
(In reply to vlad.socaciu from comment #12)
> The resource agents are very simple shell scripts that I wrote;

I think we need to see them

> each
> resource agent starts the same process (running a program that somebody who
> left the company wrote), with different command line arguments. 

In particular because of this.
How do your monitor actions differentiate between the different copies of the daemon?

A broken monitor operation could definitely lead to the various behaviours reported here, so we need to rule that out.

Comment 14 vlad.socaciu 2016-11-24 03:40:18 UTC
Created attachment 1223513 [details]
One of the resource agents

The "daemon" (which is literally a Unix daemon) is ccg_gid.exe. Each of its running instances listens for commands on a different TCP port. It accepts the words "stop" and "status" as commands. ccg_gid.exe exits if it received "stop", and sends back a string which matters only to us, and it is unused anyway, in case it received "status".

ccg_gid_admin.exe is an auxiliary process which sends one of those two words to ccg_gid.exe. ccg_gid_admin.exe returns 3 if it could not connect to ccg_gid.exe, or if send() or recv() failed. Otherwise, it returns 0.

The distinction between the different copies of the daemon is naturally made by the associated resource agents which send commands to the daemon instance which they respectively started.
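
For illustration only, a hypothetical sketch of such a per-instance status check (the real attachment is not reproduced here; the port number, path, and the admin program's argument order are invented):

    #!/bin/sh
    # hypothetical agent fragment: only the status branch is shown
    PORT=5013
    case "$1" in
      status)
        /opt/ccg/ccg_gid_admin.exe "$PORT" status  # 0 if the daemon answers, 3 if it cannot connect
        rc=$?
        echo "result: $rc"
        exit $rc
        ;;
    esac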

The "start" command may take up to 1s to return. We believe that this shouldn't matter, because we create the resource with "op start timeout=45s".

Also note that the "monitor failures" -- as you perceive the incidents -- happened quite rarely after we increased the number of CPU(s) of the virtual machines. In the last two days I started and migrated 25 resources several times and there was no failure.

Comment 15 vlad.socaciu 2016-11-25 04:27:29 UTC
(In reply to Andrew Beekhof from comment #13)
> (In reply to vlad.socaciu from comment #12)
> > The resource agents are very simple shell scripts that I wrote;
> 
> I think we need to see them
> 
> > each
> > resource agent starts the same process (running a program that somebody who
> > left the company wrote), with different command line arguments. 
> 
> In particular because of this.
> How do your monitor actions differentiate between the different copies of
> the daemon?
> 
> A broken monitor operation could definitely lead to the various behaviours
> reported here, so we need to rule that out.

An addendum to my previous comment (#14): accidentally, I created a number of empty resource agent files, and started the corresponding resources, without having noticed that the agent files had size 0. The resources started and ran perfectly; I could migrate, stop and restart them without any problem. 

So much for the "monitoring" :-(.

Comment 16 Ken Gaillot 2016-11-28 22:14:20 UTC
(In reply to vlad.socaciu from comment #12)
> I checked the sequence of events on uis1ccg2 at the time of the incident.
> Our program started at about 19:55:35.301. Then, there is a mystery on our
> side, which I cannot explain. But at 19:55:35.403, the program received a
> "stop" command from the cluster, and exited.
> 
> Now, isn't it perverse :-) that, according to corosync log file, the cluster
> software started resource_ccg_gid13 around 19:55:35 and then stopped it in
> the very same second only to declare that it failed??? Let alone the fact
> that, following your advice, we set "op monitor start timeout" to 45s!!!
> 
> It seems that there may be some incongruous occurrences here... .

The timeout only matters if the command doesn't finish. If the command returns quickly (whether success or failure), pacemaker will move on, and the timeout is not needed.

So in this case, the start returned success quickly, so pacemaker ran a monitor. The monitor also returned quickly, but with a failure, so pacemaker marked it as failed and ran a stop.

Your resource agent looks fine, so most likely, "ccg_gid_admin.exe status" is sometimes returning 3, possibly when CPU load is high.

Comment 17 Ken Gaillot 2016-11-28 22:15:56 UTC
(In reply to vlad.socaciu from comment #15)
> An addendum to my previous comment (#14): accidentally, I created a number
> of empty resource agent files, and started the corresponding resources,
> without having noticed that the agent files had size 0. The resources
> started and ran perfectly; I could migrate, stop and restart them without
> any problem. 

I can't explain this. Pacemaker would be unable to execute any action, so the resources should be marked as failed.

Comment 18 vlad.socaciu 2016-11-28 22:32:26 UTC
(In reply to Ken Gaillot from comment #16)
> (In reply to vlad.socaciu from comment #12)
> > I checked the sequence of events on uis1ccg2 at the time of the incident.
> > Our program started at about 19:55:35.301. Then, there is a mystery on our
> > side, which I cannot explain. But at 19:55:35.403, the program received a
> > "stop" command from the cluster, and exited.
> > 
> > Now, isn't it perverse :-) that, according to corosync log file, the cluster
> > software started resource_ccg_gid13 around 19:55:35 and then stopped it in
> > the very same second only to declare that it failed??? Let alone the fact
> > that, following your advice, we set "op monitor start timeout" to 45s!!!
> > 
> > It seems that there may be some incongruous occurrences here... .
> 
> The timeout only matters if the command doesn't finish. If the command
> returns quickly (whether success or failure), pacemaker will move on, and
> the timeout is not needed.
> 
> So in this case, the start returned success quickly, so pacemaker ran a
> monitor. The monitor also returned quickly, but with a failure, so pacemaker
> marked it as failed and ran a stop.
> 
> Your resource agent looks fine, so most likely, "ccg_gid_admin.exe status"
> is sometimes returning 3, possibly when CPU load is high.

Thank you, Ken. 

I understand how the timeout works. 

It would be very strange if "ccg_gid_admin.exe status" returned 3 when it shouldn't, because the communication logic between ccg_gid.exe and ccg_gid_admin.exe is extremely simple -- just TCP connect(), send() and recv().

Nonetheless, lately we haven't seen any failure of this nature (a resource not starting after "enable" or after migration). But two weeks ago or so, we assigned 5 CPU(s) to each cluster node -- more than to any other virtual machine.

Comment 19 vlad.socaciu 2016-11-28 22:36:04 UTC
(In reply to Ken Gaillot from comment #17)
> (In reply to vlad.socaciu from comment #15)
> > An addendum to my previous comment (#14): accidentally, I created a number
> > of empty resource agent files, and started the corresponding resources,
> > without having noticed that the agent files had size 0. The resources
> > started and ran perfectly; I could migrate, stop and restart them without
> > any problem. 
> 
> I can't explain this. Pacemaker would be unable to execute any action, so
> the resources should be marked as failed.

This is something you guys should be able to reproduce quite easily.

For us, this manifestation is not a matter of concern -- we should be careful not to allow empty resource agents :-).

Comment 20 Ken Gaillot 2016-11-29 00:13:27 UTC
(In reply to vlad.socaciu from comment #19)
> (In reply to Ken Gaillot from comment #17)
> > (In reply to vlad.socaciu from comment #15)
> > > An addendum to my previous comment (#14): accidentally, I created a number
> > > of empty resource agent files, and started the corresponding resources,
> > > without having noticed that the agent files had size 0. The resources
> > > started and ran perfectly; I could migrate, stop and restart them without
> > > any problem. 
> > 
> > I can't explain this. Pacemaker would be unable to execute any action, so
> > the resources should be marked as failed.
> 
> This is something you guys should be able to reproduce quite easily.
> 
> For us, this manifestation is not a matter of concern -- we should be
> careful not to allow empty resource agents :-).

Huh, I didn't realize -- an empty (executable) file will return success for whatever you call it as:

# touch test-it
# chmod a+x test-it
# ./test-it start
# echo $?
0

So yes, the only solution is to not create them. :-)
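
The same exit-status rule applies to the "status" action pacemaker uses to monitor LSB resources, which is why the empty agents appeared healthy; continuing the demo above, exit 0 from "status" is read as "running":

    # ./test-it status
    # echo $?
    0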

Comment 21 vlad.socaciu 2016-11-29 00:46:50 UTC
(In reply to Ken Gaillot from comment #20)
> (In reply to vlad.socaciu from comment #19)
> > (In reply to Ken Gaillot from comment #17)
> > > (In reply to vlad.socaciu from comment #15)
> > > > An addendum to my previous comment (#14): accidentally, I created a number
> > > > of empty resource agent files, and started the corresponding resources,
> > > > without having noticed that the agent files had size 0. The resources
> > > > started and ran perfectly; I could migrate, stop and restart them without
> > > > any problem. 
> > > 
> > > I can't explain this. Pacemaker would be unable to execute any action, so
> > > the resources should be marked as failed.
> > 
> > This is something you guys should be able to reproduce quite easily.
> > 
> > For us, this manifestation is not a matter of concern -- we should be
> > careful not to allow empty resource agents :-).
> 
> Huh, I didn't realize -- an empty (executable) file will return success for
> whatever you call it as:
> 
> # touch test-it
> # chmod a+x test-it
> # ./test-it start
> # echo $?
> 0
> 
> So yes, the only solution is to not create them. :-)

Actually, I am a little confused about what an LSB resource agent should do, because there are two inconsistent sources of information:

    this: http://refspecs.linux-foundation.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

    and this: http://linux-ha.org/wiki/LSB_Resource_Agents#column-one

Initially, I followed the former, which worked with the RH 6 cluster, but I thought that something didn't work when I used it with the RH 7 cluster, so I added

    echo "result: $?"

before each of the start/stop/status commands returned from the resource agents.

So, you think that the cluster software looks only at the resource agent return code (per first specification) and not at whatever it may print to stdout (as per the second specification)?

Comment 22 Ken Gaillot 2016-11-29 15:33:46 UTC
(In reply to vlad.socaciu from comment #21)
> Actually, I am a little confused about what an LSB resource agent should do,
> because there are two inconsistent sources of information:
> 
>     this:
> http://refspecs.linux-foundation.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-
> generic/iniscrptact.html
> 
>     and this: http://linux-ha.org/wiki/LSB_Resource_Agents#column-one
> 
> Initially, I followed the former which worked with RH 6 cluster, but I
> thought that something didn't work when I used it with RH 7 cluster, and I
> added
> 
>     echo "result: $?"
> 
> before each of the start/stop/status commands returned from the resource
> agents.
> 
> So, you think that the cluster software looks only at the resource agent
> return code (per first specification) and not at whatever it may print to
> stdout (as per the second specification)?

The linux-foundation.org spec is the primary source, and is correct.

The linux-ha.org page is outdated and has been replaced by:

  http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/ap-lsb.html

but that is essentially the same. The command:

  echo "result: $?"

is not meant to be part of the init script itself, but is shown as part of testing an init script manually to make sure it is returning the correct values. The init script's stdout is not used in any way (though it may be logged).
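
For reference, a minimal sketch of the exit codes the LSB spec (and therefore pacemaker) cares about; the start_daemon/stop_daemon/daemon_running helpers are placeholders:

    #!/bin/sh
    # minimal LSB-style skeleton: only the exit status matters, stdout is ignored
    case "$1" in
      start)   start_daemon   && exit 0 || exit 1 ;;  # 0 = started (or already running)
      stop)    stop_daemon    && exit 0 || exit 1 ;;  # 0 = stopped (or already stopped)
      status)  daemon_running && exit 0 || exit 3 ;;  # 0 = running, 3 = not running
      *)       exit 2 ;;                              # invalid argument
    esac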

Comment 23 Ken Gaillot 2017-01-16 17:46:57 UTC
I believe all the questions here were eventually answered. If I missed anything, feel free to re-open.

Comment 24 vlad.socaciu 2017-03-19 22:53:59 UTC
(In reply to Ken Gaillot from comment #23)
> I believe all the questions here were eventually answered. If I missed
> anything, feel free to re-open.

I would like to have this issue revisited, because I saw the behavior again, on the customer test system, this time. By "behavior" I mean either

    a) resources not starting on the cluster node specified by the command

        pcs constraint location <resource> prefers <node>

      where <resource> is the name of the resource and <node> is the host name  of the desired node

or

    b) resources haphazardly moving to either one of the two cluster nodes (this phenomenon happens relatively "rarely").

I have a log file showing an occurrence of a), but I don't know how to post it because the bug is closed. Sorry, but I do not know how to reopen it :-(...

Comment 25 Ken Gaillot 2017-03-20 16:27:03 UTC
Let's open a new bz for the new issue, so it's easier to focus the discussion on it. If the resource agent is different from the one attached here, please attach a copy of it, too.