Bug 1321110

Summary: unseen node remains unclean after fencing
Product: Red Hat Enterprise Linux 6
Reporter: Ken Gaillot <kgaillot>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 6.8
CC: abeekhof, andrew, anprice, cfeist, cluster-maint, extras-qa, jpokorny, kgaillot, lhh, royoung, tlavigne
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: pacemaker-1.1.14-8.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1312050
Environment:
Last Closed: 2016-05-10 23:53:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1312050
Bug Blocks:

Description Ken Gaillot 2016-03-24 16:55:06 UTC
+++ This bug was initially created as a clone of Bug #1312050 +++

When I bring 2 nodes of my 3-node cluster online, 'pcs status' shows:

    Node rawhide3: UNCLEAN (offline)
    Online: [ rawhide1 rawhide2 ]

which is expected. 'pcs stonith confirm rawhide3' then says:

    Node: rawhide3 confirmed fenced

so I would now expect to see:

    Online: [ rawhide1 rawhide2 ]
    OFFLINE: [ rawhide3 ]

but instead I still see:

    Node rawhide3: UNCLEAN (offline)
    Online: [ rawhide1 rawhide2 ]

and the cluster resources don't start. Bringing rawhide3 temporarily online clears this up, but I don't think I should have to do that when the cluster is quorate (I didn't have to until recently, anyway).

The pacemaker.log entries generated by the 'pcs stonith confirm rawhide3' event are attached. It seems odd that it's attempting to fence rawhide3 after the manual confirmation.

[root@rawhide1 ~]# rpm -q pcs pacemaker
pcs-0.9.149-2.fc24.x86_64
pacemaker-1.1.14-1.fc24.1.x86_64

--- Additional comment from Ken Gaillot on 2016-03-04 13:03:09 EST ---

This does look like a regression. Will investigate further.

--- Additional comment from Andrew Beekhof on 2016-03-22 23:47:40 EDT ---

Really depends on why the cluster thinks it's unclean.

crm/sos report?

--- Additional comment from Andrew Price on 2016-03-23 08:53:07 EDT ---

(In reply to Andrew Beekhof from comment #2)
> Really depends why the cluster thinks its unclean.

Well, the third node just hasn't been booted, so the cluster doesn't know what state it's in. That's why I'm doing the manual confirm.

> crm/sos report?

On its way.

--- Additional comment from Andrew Price on 2016-03-23 08:58 EDT ---

Attached is a crm report for the startup of two of the three nodes and the 'pcs stonith confirm rawhide3' that was done at 10:15:41.

--- Additional comment from Andrew Beekhof on 2016-03-23 22:59:48 EDT ---

Looks like we're doing almost all of the right things:

Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:   notice: tengine_stonith_notify:	Peer rawhide3 was terminated (off) by a human for rawhide1: OK (ref=f3b7d46a-4094-4488-8293-13ee416efbf1) by client stonith_admin.897
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:   notice: crm_update_peer_state_iter:	crmd_peer_down: Node rawhide3[3] - state is now lost (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: peer_update_callback:	rawhide3 is now lost (was in unknown state)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: crm_update_peer_proc:	crmd_peer_down: Node rawhide3[3] - all processes are now offline
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: peer_update_callback:	Client rawhide3/peer now has status [offline] (DC=true, changed=     1)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: crm_update_peer_expected:	crmd_peer_down: Node rawhide3[3] - expected state is now down (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: erase_status_tag:	Deleting xpath: //node_state[@uname='rawhide3']/lrm
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: erase_status_tag:	Deleting xpath: //node_state[@uname='rawhide3']/transient_attributes
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: tengine_stonith_notify:	External fencing operation from stonith_admin.897 fenced rawhide3

However, when it comes to the part where the PE unpacks the node (pe-warn-406.bz2), we get:

(    unpack.c:1324  )   trace: determine_online_status_fencing:	rawhide3: in_cluster=<null>, is_peer=offline, join=down, expected=down, term=0
(    unpack.c:1498  ) warning: determine_online_status:	Node rawhide3 is unclean

So it looks like we forgot to set in_ccm="true" along with expected and join.
Possibly because the node state was previously unknown instead of online.
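To make the failure mode concrete, here is a minimal model of the check (the struct and logic are invented simplifications, not pacemaker's real determine_online_status_fencing()): even with join and expected recorded as "down", a node whose membership state (in_ccm) was never written to the CIB cannot be marked cleanly offline.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the attributes the PE reads from a
 * <node_state> entry in the CIB (not pacemaker's real structs). */
struct node_state {
    const char *in_ccm;   /* membership layer's view; NULL if never written */
    const char *join;     /* crmd join phase */
    const char *expected; /* expected join state */
};

/* Model of the decision: the node reads as cleanly offline only once
 * the membership state has actually been recorded alongside join and
 * expected; in_ccm=<null> leaves it unclean. */
bool is_unclean(const struct node_state *ns)
{
    if (ns->in_ccm == NULL) {
        return true; /* state never recorded: the PE cannot trust the rest */
    }
    return !(strcmp(ns->join, "down") == 0
             && strcmp(ns->expected, "down") == 0);
}
```

With the state from pe-warn-406.bz2 (in_ccm NULL, join=down, expected=down) the model returns unclean; once a membership value is also recorded, the same join/expected values read as cleanly offline.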

--- Additional comment from Andrew Beekhof on 2016-03-24 01:23:05 EDT ---

Ken: Can you test this for me please?

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index 34abe81..9a65e2f 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -146,6 +146,14 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
                 }
             }
 
+            if(AM_I_DC && data == NULL) {
+                /* If this is due to a manual fencing event and its the first
+                 * time we've seen this node, it is important to get
+                 * the cluster state into the cib
+                 */
+                populate_cib_nodes(node_update_cluster, __FUNCTION__);
+            }
+
             crmd_notify_node_event(node);
             break;

--- Additional comment from Ken Gaillot on 2016-03-24 12:47:03 EDT ---

This is a regression introduced by 8b98a9b2, which took node_update_cluster out of send_stonith_update(), instead relying on the membership layer to handle that. That fixed one situation (when the node rejoins before the stonith result is reported) but caused this issue (when the node doesn't rejoin).

This is not limited to manual fences; any fencing of an unseen node (that remains unseen after the fence) will exhibit the issue.

This is fixed upstream as of commit 98457d16 (which modifies send_stonith_update() rather than use the patch in Comment 6).

@Jan, can you either backport that commit or rebase on 5a6cdd11 for Fedora?

Comment 1 Ken Gaillot 2016-03-24 17:08:11 UTC
The reproducer for this is very simple, quick and reliable:

1. Configure a pacemaker cluster of at least three nodes.

2. Start all but one of the nodes. Pacemaker will fence the unseen node.

3. With any of the pacemaker-1.1.14 builds for 6.8 up through pacemaker-1.1.14-7.el6, the fenced node will be shown as unclean in 'pcs status' after being fenced (and will likely be fenced multiple times). With either earlier or later builds, the node should be shown as (cleanly) offline in the status.
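The steps above can be sketched as a shell walkthrough (node names and the fence device are placeholders, not from this report; it requires an actual three-node cluster and is not meant to run standalone):

```shell
#!/bin/sh
# Hypothetical reproducer sketch; node1..node3 and the fence device
# are illustrative placeholders.

# 1. Configure a pacemaker cluster of at least three nodes.
pcs cluster auth node1 node2 node3
pcs cluster setup --name testcluster node1 node2 node3
pcs stonith create fence-all fence_xvm pcmk_host_list="node1 node2 node3"

# 2. Start all but one of the nodes; pacemaker will fence the unseen node3.
pcs cluster start node1
pcs cluster start node2

# 3. Check the result: with the 1.1.14 builds through -7.el6, node3 stays
#    "UNCLEAN (offline)"; with -8.el6 it should show as cleanly OFFLINE.
pcs status nodes
```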

Comment 3 Ken Gaillot 2016-03-24 19:07:27 UTC
Updated build available.

Comment 7 errata-xmlrpc 2016-05-10 23:53:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0856.html