Bug 1312050

Summary: unseen node remains unclean after fencing [rawhide]
Product: [Fedora] Fedora
Component: pacemaker
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Type: Bug
Reporter: Andrew Price <anprice>
Assignee: Jan Pokorný [poki] <jpokorny>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: abeekhof, andrew, jpokorny, kgaillot, lhh
Doc Type: Bug Fix
Fixed In Version: pacemaker-1.1.14-2.5a6cdd1.git.fc24, pacemaker-1.1.14-2.5a6cdd1.git.fc23
Bug Blocks: 1321110
Cloned As: 1321110
Last Closed: 2016-04-12 09:41:11 UTC

Attachments:
    pacemaker.log excerpt
    crm report

Description Andrew Price 2016-02-25 15:49:58 UTC
Created attachment 1130590
pacemaker.log excerpt

When I bring 2 nodes of my 3-node cluster online, 'pcs status' shows:

    Node rawhide3: UNCLEAN (offline)
    Online: [ rawhide1 rawhide2 ]

which is expected. 'pcs stonith confirm rawhide3' then says:

    Node: rawhide3 confirmed fenced

so I would now expect to see:

    Online: [ rawhide1 rawhide2 ]
    OFFLINE: [ rawhide3 ]

but instead I still see:

    Node rawhide3: UNCLEAN (offline)
    Online: [ rawhide1 rawhide2 ]

and the cluster resources don't start. Bringing rawhide3 online temporarily clears this up, but I don't think I should have to do that while the cluster is quorate (and until recently I didn't).

The pacemaker.log entries generated by the 'pcs stonith confirm rawhide3' event are attached. It seems odd that it's attempting to fence rawhide3 after the manual confirmation.

[root@rawhide1 ~]# rpm -q pcs pacemaker
pcs-0.9.149-2.fc24.x86_64
pacemaker-1.1.14-1.fc24.1.x86_64

Comment 1 Ken Gaillot 2016-03-04 18:03:09 UTC
This does look like a regression. Will investigate further.

Comment 2 Andrew Beekhof 2016-03-23 03:47:40 UTC
It really depends on why the cluster thinks it's unclean.

crm/sos report?

Comment 3 Andrew Price 2016-03-23 12:53:07 UTC
(In reply to Andrew Beekhof from comment #2)
> It really depends on why the cluster thinks it's unclean.

Well, the third node just hasn't been booted, so the cluster doesn't know what state it's in. That's why I'm doing the manual confirm.

> crm/sos report?

On its way.

Comment 4 Andrew Price 2016-03-23 12:58:14 UTC
Created attachment 1139569
crm report

Attached is a crm report for the startup of two of the three nodes and the 'pcs stonith confirm rawhide3' that was done at 10:15:41.

Comment 5 Andrew Beekhof 2016-03-24 02:59:48 UTC
Looks like we're doing almost all of the right things:

Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:   notice: tengine_stonith_notify:	Peer rawhide3 was terminated (off) by a human for rawhide1: OK (ref=f3b7d46a-4094-4488-8293-13ee416efbf1) by client stonith_admin.897
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:   notice: crm_update_peer_state_iter:	crmd_peer_down: Node rawhide3[3] - state is now lost (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: peer_update_callback:	rawhide3 is now lost (was in unknown state)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: crm_update_peer_proc:	crmd_peer_down: Node rawhide3[3] - all processes are now offline
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: peer_update_callback:	Client rawhide3/peer now has status [offline] (DC=true, changed=     1)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: crm_update_peer_expected:	crmd_peer_down: Node rawhide3[3] - expected state is now down (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: erase_status_tag:	Deleting xpath: //node_state[@uname='rawhide3']/lrm
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: erase_status_tag:	Deleting xpath: //node_state[@uname='rawhide3']/transient_attributes
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: tengine_stonith_notify:	External fencing operation from stonith_admin.897 fenced rawhide3

However, when the PE unpacks the node's state (pe-warn-406.bz2), we get:

(    unpack.c:1324  )   trace: determine_online_status_fencing:	rawhide3: in_cluster=<null>, is_peer=offline, join=down, expected=down, term=0
(    unpack.c:1498  ) warning: determine_online_status:	Node rawhide3 is unclean

So it looks like we forgot to set in_ccm="true" along with expected and join.
Possibly because the node state was previously unknown instead of online.
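
To make the decision concrete, here is a minimal standalone model (plain C with hypothetical names; this is not the pacemaker source, just a simplified sketch of the check behind that pe-warn): with fencing enabled, a node whose in_cluster/in_ccm field was never written cannot be proven cleanly down, so it stays unclean even though is_peer, join and expected all say down. What matters here is whether the field was written at all:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical model of the unclean decision -- not the pacemaker code.
     * Field values mirror the trace above; NULL means "never written". */
    static bool node_is_unclean(const char *in_cluster, const char *is_peer,
                                const char *join, const char *expected)
    {
        if (in_cluster == NULL) {
            /* Membership state was never recorded, so assume the worst */
            return true;
        }
        if ((is_peer == NULL || strcmp(is_peer, "offline") == 0)
            && join != NULL && strcmp(join, "down") == 0
            && expected != NULL && strcmp(expected, "down") == 0) {
            /* All recorded fields agree the node went down cleanly (or was
             * successfully fenced): offline, but not unclean */
            return false;
        }
        /* Inconsistent state: needs (re)fencing */
        return true;
    }

    int main(void)
    {
        /* The exact state from the trace: unclean solely because
         * in_cluster was left unset */
        printf("rawhide3 unclean: %s\n",
               node_is_unclean(NULL, "offline", "down", "down") ? "yes" : "no");
        return 0;
    }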

Comment 6 Andrew Beekhof 2016-03-24 05:23:05 UTC
Ken: Can you test this for me please?

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index 34abe81..9a65e2f 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -146,6 +146,14 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
                 }
             }
 
+            if(AM_I_DC && data == NULL) {
+                /* If this is due to a manual fencing event and its the first
+                 * time we've seen this node, it is important to get
+                 * the cluster state into the cib
+                 */
+                populate_cib_nodes(node_update_cluster, __FUNCTION__);
+            }
+
             crmd_notify_node_event(node);
             break;

Comment 7 Ken Gaillot 2016-03-24 16:47:03 UTC
This is a regression introduced by 8b98a9b2, which took node_update_cluster out of send_stonith_update(), instead relying on the membership layer to handle that. That fixed one situation (when the node rejoins before the stonith result is reported) but caused this issue (when the node doesn't rejoin).

This is not limited to manual fences; any fencing of an unseen node (that remains unseen after the fence) will exhibit the issue.

This is fixed upstream as of commit 98457d16 (which modifies send_stonith_update() rather than using the patch in Comment 6).
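
For anyone reading along, a standalone sketch of why that matters (the enum and function names here are illustrative, modelled loosely on the crmd's node_update_* flags; this is not the upstream commit): any node_state field whose flag is not passed to the update simply stays unset, i.e. <null> to the policy engine, which is exactly the state the pe-warn in comment 5 tripped over:

    #include <stdio.h>

    /* Hypothetical model of the update flags send_stonith_update() passes
     * when recording a fenced peer's state -- names are illustrative only. */
    enum model_update {
        model_update_cluster  = 1 << 0, /* writes in_ccm (membership)   */
        model_update_peer     = 1 << 1, /* writes the crmd peer status  */
        model_update_join     = 1 << 2, /* writes the join state        */
        model_update_expected = 1 << 3, /* writes the expected state    */
    };

    /* Print the node_state fields a given flag set would record; anything
     * not flagged is left unset. For in_ccm, what matters to the PE is
     * "set" vs. <null>, so the model does not pin a concrete value. */
    static void show_node_state(unsigned flags)
    {
        printf("in_ccm=%s is_peer=%s join=%s expected=%s\n",
               (flags & model_update_cluster)  ? "set"     : "<null>",
               (flags & model_update_peer)     ? "offline" : "<null>",
               (flags & model_update_join)     ? "down"    : "<null>",
               (flags & model_update_expected) ? "down"    : "<null>");
    }

    int main(void)
    {
        /* After 8b98a9b2 (the regression): membership is never recorded
         * for a node that does not rejoin -> in_ccm=<null> -> unclean */
        show_node_state(model_update_peer | model_update_join
                        | model_update_expected);

        /* With the fix: all four fields are recorded, so the PE can treat
         * the fenced node as cleanly offline */
        show_node_state(model_update_cluster | model_update_peer
                        | model_update_join | model_update_expected);
        return 0;
    }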

@Jan, can you either backport that commit or rebase on 5a6cdd11 for Fedora?

Comment 8 Mike McCune 2016-03-28 22:48:26 UTC
This bug was accidentally moved from POST to MODIFIED by an error in automation; please contact mmccune with any questions.

Comment 9 Andrew Price 2016-03-29 11:52:15 UTC
(Changing the summary to be more accurate.)

Comment 10 Jan Pokorný [poki] 2016-03-30 15:21:40 UTC
re [comment 7]:
Yep, new builds are in the works.

Comment 11 Andrew Price 2016-04-01 12:32:16 UTC
I can confirm that this is fixed in the new pacemaker-1.1.14-2.5a6cdd1.git.fc25.x86_64 package - thanks!

Comment 12 Fedora Update System 2016-04-02 13:14:42 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-3c079b3005

Comment 13 Fedora Update System 2016-04-02 13:14:52 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-d9d5375015

Comment 14 Fedora Update System 2016-04-02 21:19:32 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of them in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-3c079b3005

Comment 15 Fedora Update System 2016-04-02 21:21:25 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of them in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-d9d5375015

Comment 16 Fedora Update System 2016-04-12 09:41:07 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of them in this bug report.

Comment 17 Fedora Update System 2016-04-13 07:23:31 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of them in this bug report.