Bug 1312050

Summary: unseen node remains unclean after fencing [rawhide]
Product: [Fedora] Fedora
Component: pacemaker
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Type: Bug
Reporter: Andrew Price <anprice>
Assignee: Jan Pokorný [poki] <jpokorny>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: abeekhof, andrew, jpokorny, kgaillot, lhh
Doc Type: Bug Fix
Fixed In Version: pacemaker-1.1.14-2.5a6cdd1.git.fc24, pacemaker-1.1.14-2.5a6cdd1.git.fc23
Bug Blocks: 1321110
Cloned As: 1321110
Last Closed: 2016-04-12 09:41:11 UTC

Attachments:
    pacemaker.log excerpt
    crm report

Description Andrew Price 2016-02-25 15:49:58 UTC
Created attachment 1130590
pacemaker.log excerpt

When I bring 2 nodes of my 3-node cluster online, 'pcs status' shows:

    Node rawhide3: UNCLEAN (offline)
    Online: [ rawhide1 rawhide2 ]

which is expected. 'pcs stonith confirm rawhide3' then says:

    Node: rawhide3 confirmed fenced

so I would now expect to see:

    Online: [ rawhide1 rawhide2 ]
    OFFLINE: [ rawhide3 ]

but instead I still see:

    Node rawhide3: UNCLEAN (offline)
    Online: [ rawhide1 rawhide2 ]

and the cluster resources don't start. Bringing rawhide3 online temporarily clears this up, but I don't think I should have to do that while the cluster is quorate (and until recently I didn't).

The pacemaker.log entries generated by the 'pcs stonith confirm rawhide3' event are attached. It seems odd that it's attempting to fence rawhide3 after the manual confirmation.

[root@rawhide1 ~]# rpm -q pcs pacemaker
pcs-0.9.149-2.fc24.x86_64
pacemaker-1.1.14-1.fc24.1.x86_64

Comment 1 Ken Gaillot 2016-03-04 18:03:09 UTC
This does look like a regression. Will investigate further.

Comment 2 Andrew Beekhof 2016-03-23 03:47:40 UTC
It really depends on why the cluster thinks it's unclean.

crm/sos report?

Comment 3 Andrew Price 2016-03-23 12:53:07 UTC
(In reply to Andrew Beekhof from comment #2)
> It really depends on why the cluster thinks it's unclean.

Well, the third node just hasn't been booted, so the cluster doesn't know what state it's in. That's why I'm doing the manual confirm.

> crm/sos report?

On its way.

Comment 4 Andrew Price 2016-03-23 12:58:14 UTC
Created attachment 1139569
crm report

Attached is a crm report for the startup of two of the three nodes and the 'pcs stonith confirm rawhide3' that was done at 10:15:41.

Comment 5 Andrew Beekhof 2016-03-24 02:59:48 UTC
Looks like we're doing almost all of the right things:

Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:   notice: tengine_stonith_notify:	Peer rawhide3 was terminated (off) by a human for rawhide1: OK (ref=f3b7d46a-4094-4488-8293-13ee416efbf1) by client stonith_admin.897
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:   notice: crm_update_peer_state_iter:	crmd_peer_down: Node rawhide3[3] - state is now lost (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: peer_update_callback:	rawhide3 is now lost (was in unknown state)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: crm_update_peer_proc:	crmd_peer_down: Node rawhide3[3] - all processes are now offline
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: peer_update_callback:	Client rawhide3/peer now has status [offline] (DC=true, changed=     1)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: crm_update_peer_expected:	crmd_peer_down: Node rawhide3[3] - expected state is now down (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: erase_status_tag:	Deleting xpath: //node_state[@uname='rawhide3']/lrm
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: erase_status_tag:	Deleting xpath: //node_state[@uname='rawhide3']/transient_attributes
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: tengine_stonith_notify:	External fencing operation from stonith_admin.897 fenced rawhide3

However, when the PE unpacks the node's state (pe-warn-406.bz2), we get:

(    unpack.c:1324  )   trace: determine_online_status_fencing:	rawhide3: in_cluster=<null>, is_peer=offline, join=down, expected=down, term=0
(    unpack.c:1498  ) warning: determine_online_status:	Node rawhide3 is unclean

So it looks like we forgot to set in_ccm="true" along with expected and join.
Possibly because the node state was previously unknown instead of online.
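
To make the decision concrete, here is a minimal standalone model (plain C with hypothetical names; this is not the pacemaker source, just a simplified sketch of the check behind that pe-warn): with fencing enabled, a node whose in_cluster/in_ccm field was never written cannot be proven cleanly down, so it stays unclean even though is_peer, join and expected all say down. What matters here is whether the field was written at all:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical model of the unclean decision -- not the pacemaker code.
     * Field values mirror the trace above; NULL means "never written". */
    static bool node_is_unclean(const char *in_cluster, const char *is_peer,
                                const char *join, const char *expected)
    {
        if (in_cluster == NULL) {
            /* Membership state was never recorded, so assume the worst */
            return true;
        }
        if ((is_peer == NULL || strcmp(is_peer, "offline") == 0)
            && join != NULL && strcmp(join, "down") == 0
            && expected != NULL && strcmp(expected, "down") == 0) {
            /* All recorded fields agree the node went down cleanly (or was
             * successfully fenced): offline, but not unclean */
            return false;
        }
        /* Inconsistent state: needs (re)fencing */
        return true;
    }

    int main(void)
    {
        /* The exact state from the trace: unclean solely because
         * in_cluster was left unset */
        printf("rawhide3 unclean: %s\n",
               node_is_unclean(NULL, "offline", "down", "down") ? "yes" : "no");
        return 0;
    }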

Comment 6 Andrew Beekhof 2016-03-24 05:23:05 UTC
Ken: Can you test this for me please?

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index 34abe81..9a65e2f 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -146,6 +146,14 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
                 }
             }
 
+            if(AM_I_DC && data == NULL) {
+                /* If this is due to a manual fencing event and its the first
+                 * time we've seen this node, it is important to get
+                 * the cluster state into the cib
+                 */
+                populate_cib_nodes(node_update_cluster, __FUNCTION__);
+            }
+
             crmd_notify_node_event(node);
             break;

Comment 7 Ken Gaillot 2016-03-24 16:47:03 UTC
This is a regression introduced by 8b98a9b2, which took node_update_cluster out of send_stonith_update(), instead relying on the membership layer to handle that. That fixed one situation (when the node rejoins before the stonith result is reported) but caused this issue (when the node doesn't rejoin).

This is not limited to manual fences; any fencing of an unseen node (that remains unseen after the fence) will exhibit the issue.

This is fixed upstream as of commit 98457d16 (which modifies send_stonith_update() rather than using the patch in Comment 6).
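
For anyone reading along, a standalone sketch of why that matters (the enum and function names here are illustrative, modelled loosely on the crmd's node_update_* flags; this is not the upstream commit): any node_state field whose flag is not passed to the update simply stays unset, i.e. <null> to the policy engine, which is exactly the state the pe-warn in comment 5 tripped over:

    #include <stdio.h>

    /* Hypothetical model of the update flags send_stonith_update() passes
     * when recording a fenced peer's state -- names are illustrative only. */
    enum model_update {
        model_update_cluster  = 1 << 0, /* writes in_ccm (membership)   */
        model_update_peer     = 1 << 1, /* writes the crmd peer status  */
        model_update_join     = 1 << 2, /* writes the join state        */
        model_update_expected = 1 << 3, /* writes the expected state    */
    };

    /* Print the node_state fields a given flag set would record; anything
     * not flagged is left unset. For in_ccm, what matters to the PE is
     * "set" vs. <null>, so the model does not pin a concrete value. */
    static void show_node_state(unsigned flags)
    {
        printf("in_ccm=%s is_peer=%s join=%s expected=%s\n",
               (flags & model_update_cluster)  ? "set"     : "<null>",
               (flags & model_update_peer)     ? "offline" : "<null>",
               (flags & model_update_join)     ? "down"    : "<null>",
               (flags & model_update_expected) ? "down"    : "<null>");
    }

    int main(void)
    {
        /* After 8b98a9b2 (the regression): membership is never recorded
         * for a node that does not rejoin -> in_ccm=<null> -> unclean */
        show_node_state(model_update_peer | model_update_join
                        | model_update_expected);

        /* With the fix: all four fields are recorded, so the PE can treat
         * the fenced node as cleanly offline */
        show_node_state(model_update_cluster | model_update_peer
                        | model_update_join | model_update_expected);
        return 0;
    }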

@Jan, can you either backport that commit or rebase on 5a6cdd11 for Fedora?

Comment 8 Mike McCune 2016-03-28 22:48:26 UTC
This bug was accidentally moved from POST to MODIFIED by an error in automation; please contact mmccune with any questions.

Comment 9 Andrew Price 2016-03-29 11:52:15 UTC
(Changing the summary to be more accurate.)

Comment 10 Jan Pokorný [poki] 2016-03-30 15:21:40 UTC
re [comment 7]:
Yep, new builds are in the works.

Comment 11 Andrew Price 2016-04-01 12:32:16 UTC
I can confirm that this is fixed in the new pacemaker-1.1.14-2.5a6cdd1.git.fc25.x86_64 package - thanks!

Comment 12 Fedora Update System 2016-04-02 13:14:42 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-3c079b3005

Comment 13 Fedora Update System 2016-04-02 13:14:52 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-d9d5375015

Comment 14 Fedora Update System 2016-04-02 21:19:32 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of them in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-3c079b3005

Comment 15 Fedora Update System 2016-04-02 21:21:25 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of them in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-d9d5375015

Comment 16 Fedora Update System 2016-04-12 09:41:07 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of them in this bug report.

Comment 17 Fedora Update System 2016-04-13 07:23:31 UTC
pacemaker-1.1.14-2.5a6cdd1.git.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of them in this bug report.