Bug 1415885 - Remote Node resource fails to stop
Summary: Remote Node resource fails to stop
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 7.3
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1356451
 
Reported: 2017-01-24 01:20 UTC by John Williams
Modified: 2017-02-14 17:37 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-14 17:37:55 UTC
Target Upstream Version:


Attachments
compute sosreport (9.80 MB, application/x-xz), 2017-01-24 01:20 UTC, John Williams
controller sosrepo (16.05 MB, application/x-xz), 2017-01-24 01:26 UTC, John Williams
controller-1-updated-sosreport (13.37 MB, application/x-xz), 2017-01-24 17:09 UTC, John Williams
controller-2-updated-sosreport (15.81 MB, application/x-xz), 2017-01-24 17:12 UTC, John Williams
compute-1-updated sosreport (9.79 MB, application/x-xz), 2017-01-24 17:13 UTC, John Williams
compute-2-updated sosreport (9.79 MB, application/x-xz), 2017-01-24 17:14 UTC, John Williams
xaa (19.00 MB, application/x-xz), 2017-01-24 17:15 UTC, John Williams
xab (19.00 MB, application/octet-stream), 2017-01-24 17:16 UTC, John Williams
xac (19.00 MB, application/octet-stream), 2017-01-24 17:17 UTC, John Williams
xad (19.00 MB, application/octet-stream), 2017-01-24 17:18 UTC, John Williams
xae (3.36 MB, application/octet-stream), 2017-01-24 17:18 UTC, John Williams
zaa (19.00 MB, application/x-xz), 2017-01-24 17:21 UTC, John Williams
zab (616.21 KB, application/octet-stream), 2017-01-24 17:22 UTC, John Williams
compute-0 (9.80 MB, application/x-xz), 2017-01-31 19:50 UTC, John Williams
controller-0 (15.90 MB, application/x-xz), 2017-01-31 19:50 UTC, John Williams
corosync.log (6.60 MB, text/plain), 2017-02-01 00:03 UTC, John Williams
sosreport for r14-controller-2 (pt1) (19.00 MB, application/x-xz), 2017-02-01 15:11 UTC, John Williams
sosreport for r14-controller-2 (pt2) (3.88 MB, application/octet-stream), 2017-02-01 15:11 UTC, John Williams

Description John Williams 2017-01-24 01:20:53 UTC
Created attachment 1243804 [details]
compute sosreport

Description of problem:
Enabling IHA causes the compute node to fail and shut off. Attempting to power the node back on fails, as the node is immediately shut off again while fencing is enabled. FYI: I had a debug session with Andrew Beekhof, who can add a more meaningful description of the issue.

Version-Release number of selected component (if applicable):
RHOSP-10 release

How reproducible:
Always

Steps to Reproduce:
1. Enable fencing
2. Enable IHA


SOS reports for controller-0 and compute-0 attached.

Comment 1 John Williams 2017-01-24 01:26:43 UTC
Created attachment 1243806 [details]
controller sosrepo

Comment 2 Andrew Beekhof 2017-01-24 10:45:42 UTC
The root-ish cause seems to be pacemaker remote connections (needed by the Instance HA feature).

Attempts by the crmd to start the connection time out:

Jan 23 21:38:35 [65313] r14-controller-0.r14.rcbd.lab       crmd:     info: abort_transition_graph:	Transition aborted: Action lost | source=action_timer_callback:821 complete=false
Jan 23 21:38:35 [65313] r14-controller-0.r14.rcbd.lab       crmd:  warning: cib_action_update:	rsc_op 155: r14-compute-1.r14.rcbd.lab_start_0 on r14-controller-0 timed out
Jan 23 21:38:35 [65313] r14-controller-0.r14.rcbd.lab       crmd:     info: create_operation_update:	cib_action_update: Updating resource r14-compute-1.r14.rcbd.lab after start op Timed Out (interval=0)
Jan 23 21:38:35 [65313] r14-controller-0.r14.rcbd.lab       crmd:  warning: action_timer_callback:	Timer popped (timeout=60000, abort_level=1000000, complete=false)


As do subsequent attempts by the crmd to stop it:

Jan 24 00:14:08 [65313] r14-controller-0.r14.rcbd.lab       crmd:  warning: cib_action_update:	rsc_op 21: r14-compute-1.r14.rcbd.lab_stop_0 on r14-controller-0 timed out
Jan 24 00:14:08 [65313] r14-controller-0.r14.rcbd.lab       crmd:     info: create_operation_update:	cib_action_update: Updating resource r14-compute-1.r14.rcbd.lab after stop op Timed Out (interval=0)
Jan 24 00:14:08 [65313] r14-controller-0.r14.rcbd.lab       crmd:  warning: action_timer_callback:	Timer popped (timeout=60000, abort_level=1000000, complete=false)

Note that journald and pacemaker do not agree on log times; here are the same logs in journald:

Jan 23 19:14:08 host-192-168-120-133 crmd[65313]:   error: [Action   21]: In-flight rsc op r14-compute-1.r14.rcbd.lab_stop_0 on r14-controller-0 (priority: 0, waiting: none)
Jan 23 19:14:08 host-192-168-120-133 crmd[65313]: warning: rsc_op 21: r14-compute-1.r14.rcbd.lab_stop_0 on r14-controller-0 timed out
Jan 23 19:14:08 host-192-168-120-133 crmd[65313]: warning: Timer popped (timeout=60000, abort_level=1000000, complete=false)

Meanwhile, the remote node reports that the TLS negotiation failed:

Jan 23 21:22:27 [57308] r14-compute-0.r14.rcbd.lab pacemaker_remoted:     info: lrmd_remote_listen:     New remote connection from ::ffff:192.168.140.96
Jan 23 21:22:27 [57308] r14-compute-0.r14.rcbd.lab pacemaker_remoted:   notice: lrmd_remote_listen:     LRMD client connection established. 0x7f5178250680 id: efdb4d2a-4e37-4ae4-afe0-063c494c6fdc
Jan 23 21:22:28 [57308] r14-compute-0.r14.rcbd.lab pacemaker_remoted:    error: lrmd_remote_client_msg: Remote lrmd tls handshake failed
Jan 23 21:22:28 [57308] r14-compute-0.r14.rcbd.lab pacemaker_remoted:   notice: lrmd_remote_client_destroy:     LRMD client disconnecting remote client - name: <unknown> id: efdb4d2a-4e37-4ae4-afe0-063c494c6fdc

So we need to find a way to get some additional data regarding:

- why these connections cannot be established (a quick connectivity check is sketched below)
- why the stop operations cannot complete
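
As a first pass at the connection question, basic TCP reachability of the pacemaker_remote port can be verified; a sketch only, not a step taken in this bug, assuming the default pacemaker_remote port 3121 and using a hostname from the logs above:

  # From a controller: is the remote port reachable?
  nc -zv r14-compute-0.r14.rcbd.lab 3121

  # On the compute node: is pacemaker_remoted listening?
  ss -tlnp | grep 3121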

Comment 3 Andrew Beekhof 2017-01-24 10:59:40 UTC
Can we get the following added to /etc/sysconfig/pacemaker on the controllers:

PCMK_trace_files=lrm_state.c,remote_lrmd_ra.c,remote.c

And on the computes:

PCMK_trace_files=tls_backend.c,remote.c


Note that pacemaker and pacemaker_remoted will both need to be restarted for these options to take effect.

Once restarted, can you please reproduce and send updated sosreports?
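
A sketch of how these settings might be applied (the echo/systemctl commands are illustrative rather than steps prescribed here, and assume the stock service names):

  # On each controller:
  echo 'PCMK_trace_files=lrm_state.c,remote_lrmd_ra.c,remote.c' >> /etc/sysconfig/pacemaker
  systemctl restart pacemaker

  # On each compute:
  echo 'PCMK_trace_files=tls_backend.c,remote.c' >> /etc/sysconfig/pacemaker
  systemctl restart pacemaker_remote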

Comment 5 John Williams 2017-01-24 17:05:58 UTC
I have applied the trace file changes, restarted pacemaker_remote on the compute nodes, and rebooted each of the controller nodes. I rebooted the controllers one at a time (waiting for the rebooted node to come back up before rebooting the next controller node), since running "systemctl restart pacemaker" didn't appear to be doing anything; I waited over 5 minutes on that command on each node.

After giving the controller nodes ample time to settle down once the last node was rebooted, I ran "pcs resource cleanup --force", waited several minutes, and then ran sosreports on all of the controller and compute nodes. See attached sosreports.

Please note the sosreport for the compute-0 node was too large to upload whole, so I used "split -b 19M" to break it into smaller chunks; it will need to be rejoined using "cat xa* > sosreport-r14-compute-0.r14.rcbd.lab-20170124161749.tar.xz".
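
For reference, the split/rejoin pair looks like this; the split invocation is my reconstruction (split's default output prefix is "x"), and only the cat command comes from the comment above:

  split -b 19M sosreport-r14-compute-0.r14.rcbd.lab-20170124161749.tar.xz
  # produces xaa, xab, xac, ...; rejoin with:
  cat xa* > sosreport-r14-compute-0.r14.rcbd.lab-20170124161749.tar.xz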

Comment 6 John Williams 2017-01-24 17:09:14 UTC
Created attachment 1244002 [details]
controller-1-updated-sosreport

Comment 7 John Williams 2017-01-24 17:12:03 UTC
Created attachment 1244003 [details]
controller-2-updated-sosreport

Comment 8 John Williams 2017-01-24 17:13:16 UTC
Created attachment 1244004 [details]
compute-1-updated sosreport

Comment 9 John Williams 2017-01-24 17:14:09 UTC
Created attachment 1244005 [details]
compute-2-updated sosreport

Comment 10 John Williams 2017-01-24 17:15:20 UTC
Created attachment 1244006 [details]
xaa

Comment 11 John Williams 2017-01-24 17:16:17 UTC
Created attachment 1244007 [details]
xab

Comment 12 John Williams 2017-01-24 17:17:18 UTC
Created attachment 1244008 [details]
xac

Comment 13 John Williams 2017-01-24 17:18:18 UTC
Created attachment 1244009 [details]
xad

Comment 14 John Williams 2017-01-24 17:18:56 UTC
Created attachment 1244010 [details]
xae

Comment 15 John Williams 2017-01-24 17:21:19 UTC
Created attachment 1244011 [details]
zaa

Comment 16 John Williams 2017-01-24 17:22:51 UTC
Created attachment 1244012 [details]
zab

Comment 17 John Williams 2017-01-24 17:24:21 UTC
Run "cat za* > sosreport-r14-controller-0.r14.rcbd.lab-20170124161718.tar.xz" to rejoin the controller-0 sosreport, as it too was too large to submit whole.

Comment 18 John Williams 2017-01-26 21:15:34 UTC
Any new status?

Comment 20 Andrew Beekhof 2017-01-31 00:29:14 UTC
JT: Do you recall me asking about authkeys?

Jan 24 15:51:10 [748431] r14-controller-0.r14.rcbd.lab       crmd: (lrmd_client.:1096  )   error: lrmd_tls_set_key:	No valid lrmd remote key found at /etc/pacemaker/authkey
Jan 24 15:51:10 [748431] r14-controller-0.r14.rcbd.lab       crmd: (lrmd_client.:526   )    info: lrmd_tls_connection_destroy:	TLS connection destroyed

Does that file exist, and if so, with what owner and permissions?
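
For reference, the key is normally created once and the same file copied to every cluster node and remote node; a minimal sketch based on the standard Pacemaker Remote setup, not a step taken in this bug (the 0640 mode matches the expected listing later in comment 34):

  mkdir -p --mode=0750 /etc/pacemaker
  chgrp haclient /etc/pacemaker
  dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1
  chown root:haclient /etc/pacemaker/authkey
  chmod 640 /etc/pacemaker/authkey
  # then copy the identical authkey to /etc/pacemaker/ on all other nodes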

Comment 22 John Williams 2017-01-31 15:37:43 UTC
It did not exist on the controller nodes, nor did the pacemaker directory; it did exist on the compute nodes. I had a typo in my script: --mode=0750 was typed as --mode=075O (a capital O instead of a zero) and I didn't see it. Correcting that and will regenerate the tests.

Comment 23 Ken Gaillot 2017-01-31 15:53:27 UTC
(In reply to John Williams from comment #22)
> It did not exist on the controller nodes, nor did the pacemaker directory; it
> did exist on the compute nodes. I had a typo in my script: --mode=0750 was
> typed as --mode=075O (a capital O instead of a zero) and I didn't see it.
> Correcting that and will regenerate the tests.

Most likely, we'll be able to close this as a duplicate of Bug 1381754.

Comment 24 John Williams 2017-01-31 19:42:31 UTC
Still having issues even with the keys ... generating the sosreports for controller-0 and compute-0 and will upload them shortly. There still appear to be issues with pacemaker and pacemaker_remote communications.

Comment 25 John Williams 2017-01-31 19:50:05 UTC
Created attachment 1246424 [details]
compute-0

Comment 26 John Williams 2017-01-31 19:50:53 UTC
Created attachment 1246425 [details]
controller-0

Comment 27 John Williams 2017-01-31 19:54:44 UTC
Andrew - I will hang out after work today, your morning, if you would like to have a look around.  We can do another Bluejeans session, if needed.

Comment 28 John Williams 2017-01-31 23:40:40 UTC
I redeployed the stack, enabled fencing, and reinstalled IHA, and after waiting quite a while it appears that the cluster is up and all resources have started. When testing IHA with the "echo c > /proc/sysrq-trigger" command on the node where the instance is running, the node reboots, but while it is coming up it gets shut down again. This only occurs if stonith is enabled (sudo pcs property set stonith-enabled=true). Once stonith is disabled, the compute node is allowed to rejoin the cluster. After the compute node has rejoined the cluster, stonith can be re-enabled and the compute node remains powered on.
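
Spelled out as commands, the workaround sequence described above is roughly (a sketch; only the stonith-enabled=true command appears verbatim in the comment, the rest is my illustration):

  sudo pcs property set stonith-enabled=false   # allow the fenced compute to boot and rejoin
  sudo pcs status                               # wait until the compute node shows as online
  sudo pcs property set stonith-enabled=true    # re-enable fencing once it has rejoined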

Comment 29 John Williams 2017-02-01 00:02:39 UTC
Grepping for stonith-ng in /var/log/messages, I see the following suspect messages repeated on all 3 controller nodes. 

Jan 31 18:23:38 host-192-168-120-134 stonith-ng[65214]:  notice: ipmilan-r14-compute-2 can not fence (reboot) r14-compute-1.r14.rcbd.lab: static-list
Jan 31 18:23:42 host-192-168-120-134 stonith-ng[65214]:  notice: Operation reboot of r14-compute-1.r14.rcbd.lab by <no-one> for crmd.65117@r14-controller-1.ca449c2e: No route to host

However, from all three hosts I can ping the host via the hostname "r14-compute-1.r14.rcbd.lab". I saw nothing else interesting in the messages log.

I have attached the corosync.log from r14-controller-1 as it does have more interesting information.

Comment 30 John Williams 2017-02-01 00:03:30 UTC
Created attachment 1246472 [details]
corosync.log

Comment 31 Andrew Beekhof 2017-02-01 01:20:57 UTC
Sorry I missed you on IRC this morning (remember to use my nick to get my attention :-)

We're missing the interesting logs as the DC is on r14-controller-2 now:

Jan 31 19:17:03 [4849] r14-controller-0.r14.rcbd.lab       crmd: (     utils.c:983   )    info: update_dc:	Set DC to r14-controller-2 (3.0.10)

Comment 32 John Williams 2017-02-01 15:11:19 UTC
Created attachment 1246688 [details]
sosreport  for r14-controller-2 (pt1)

Comment 33 John Williams 2017-02-01 15:11:55 UTC
Created attachment 1246689 [details]
sosreport  for r14-controller-2 (pt2)

Comment 34 Ken Gaillot 2017-02-01 18:06:54 UTC
I don't see any relevant SELinux denials, but just to be sure, can you check the results of "ls -laZ /etc/pacemaker" on all nodes? I'd expect it to look something like this:

drwxr-x---. root haclient unconfined_u:object_r:etc_t:s0   ./
drwxr-xr-x. root root     system_u:object_r:etc_t:s0       ../
-rw-r-----. root haclient unconfined_u:object_r:etc_t:s0   authkey

Unfortunately, the new corosync log starts at Feb 01 03:45:04 and the systemd journal at Jan 31 21:19:27, so they don't cover the period of interest.

Comment 35 John Williams 2017-02-01 19:11:06 UTC
Pacemaker seems to be fine.  Here is the info you requested:

[root@r14-controller-0 log]# ls -laZ /etc/pacemaker
drwxr-xr-x. root root     unconfined_u:object_r:etc_t:s0   .
drwxr-xr-x. root root     system_u:object_r:etc_t:s0       ..
-rw-r--r--. root haclient unconfined_u:object_r:user_home_t:s0 authkey

[root@r14-controller-1 cluster]# ls -laZ /etc/pacemaker
drwxr-xr-x. root root     unconfined_u:object_r:etc_t:s0   .
drwxr-xr-x. root root     system_u:object_r:etc_t:s0       ..
-rw-r--r--. root haclient unconfined_u:object_r:user_home_t:s0 authkey

[root@r14-controller-2 tmp]# ls -laZ /etc/pacemaker
drwxr-xr-x. root root     unconfined_u:object_r:etc_t:s0   .
drwxr-xr-x. root root     system_u:object_r:etc_t:s0       ..
-rw-r--r--. root haclient unconfined_u:object_r:user_home_t:s0 authkey

[root@r14-compute-0 log]# ls -laZ /etc/pacemaker
drwxr-xr-x. root root     unconfined_u:object_r:etc_t:s0   .
drwxr-xr-x. root root     system_u:object_r:etc_t:s0       ..
-rw-r--r--. root haclient unconfined_u:object_r:user_home_t:s0 authkey

[root@r14-compute-1 ~]# ls -laZ /etc/pacemaker
drwxr-xr-x. root root     unconfined_u:object_r:etc_t:s0   .
drwxr-xr-x. root root     system_u:object_r:etc_t:s0       ..
-rw-r--r--. root haclient unconfined_u:object_r:user_home_t:s0 authkey

[heat-admin@r14-compute-2 ~]$ ls -laZ /etc/pacemaker
drwxr-xr-x. root root     unconfined_u:object_r:etc_t:s0   .
drwxr-xr-x. root root     system_u:object_r:etc_t:s0       ..
-rw-r--r--. root haclient unconfined_u:object_r:user_home_t:s0 authkey

Additionally, getenforce returns "Enforcing" on all controller and compute nodes.

Comment 37 John Williams 2017-02-01 21:18:44 UTC
Looking at this line: 

sudo pcs constraint order start galera-clone then nova-evacuate 

it specifies galera-clone; however, I don't have a galera-clone resource, only a galera-master resource.

Should I change the constraint to reflect this?

Comment 38 Sean Merrow 2017-02-09 18:20:59 UTC
Hi Ken. Any update on this or thoughts on JT's question? Thanks.

Comment 39 John Williams 2017-02-09 19:08:42 UTC
I've been working with Andrew, and the issue as originally stated required a change to /usr/lib/ocf/resource.d/openstack/NovaEvacuate. The deployed version of the code only checked whether the node status was "ON" (line 169) and shut the node off.

  fence_compute ${fence_options} -o status -n ${node}
  if [ $? != 0 ]; then

Andrew and I changed the code to: 

  fence_compute ${fence_options} -o status -n ${node}
  rc=$?
  if [ $rc != 0 -a $rc != 2 ]; then

so that the "OFF" state is also accepted and the node is then brought back on. This corrected the problem as stated in the BZ.
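
To make the intent of the added rc check concrete, here is a minimal sketch of inspecting the status return code by hand; fence_options and node stand in for the values the agent uses, and treating rc 2 as "powered off" is my reading of the change above rather than something stated elsewhere in this bug:

  fence_compute ${fence_options} -o status -n ${node}
  rc=$?
  case ${rc} in
    0) echo "node reports ON" ;;
    2) echo "node reports OFF (now tolerated by the patched agent)" ;;
    *) echo "status check failed (rc=${rc})" ;;
  esac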

With regard to my last question about galera-clone, Ken and I spoke via IRC and I made that change, but I'm still not certain the change was correct.

As far as I am concerned this issue can be closed, although it would be nice to have a definitive answer to my last question.

Comment 40 Ken Gaillot 2017-02-14 17:37:55 UTC
John: To confirm, yes, given the configuration in the attached logs, which use galera-master as the name of the master/slave clone, the constraint should be specified for galera-master and not galera-clone. pcs will give an error message and reject the constraint if specified with the wrong name.
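
For reference, the corrected form of the constraint from comment 37 would then be (a sketch using the resource names from this deployment):

  sudo pcs constraint order start galera-master then nova-evacuate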

This issue was found to be related to the NovaEvacuate resource agent. The fix was incorporated into the resource-agents package, so I am closing this BZ as not a pacemaker bug.

