Bug 1321110 - unseen node remains unclean after fencing
Summary: unseen node remains unclean after fencing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: pacemaker
Version: 6.8
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1312050
Blocks:
 
Reported: 2016-03-24 16:55 UTC by Ken Gaillot
Modified: 2016-05-10 23:53 UTC
CC List: 11 users

Fixed In Version: pacemaker-1.1.14-8.el6
Doc Type: Bug Fix
Doc Text:
Clone Of: 1312050
Environment:
Last Closed: 2016-05-10 23:53:08 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0856 0 normal SHIPPED_LIVE pacemaker bug fix and enhancement update 2016-05-10 22:44:25 UTC

Description Ken Gaillot 2016-03-24 16:55:06 UTC
+++ This bug was initially created as a clone of Bug #1312050 +++

When I bring 2 nodes of my 3-node cluster online, 'pcs status' shows:

    Node rawhide3: UNCLEAN (offline)
    Online: [ rawhide1 rawhide2 ]

which is expected. 'pcs stonith confirm rawhide3' then says:

    Node: rawhide3 confirmed fenced

so I would now expect to see:

    Online: [ rawhide1 rawhide2 ]
    OFFLINE: [ rawhide3 ]

but instead I still see:

    Node rawhide3: UNCLEAN (offline)
    Online: [ rawhide1 rawhide2 ]

and the cluster resources don't start. Bringing rawhide3 temporarily online clears this up, but I don't think I should have to do that if the cluster is quorate (I didn't have to until recently, anyway).

The pacemaker.log entries generated by the 'pcs stonith confirm rawhide3' event are attached. It seems odd that it's attempting to fence rawhide3 after the manual confirmation.

[root@rawhide1 ~]# rpm -q pcs pacemaker
pcs-0.9.149-2.fc24.x86_64
pacemaker-1.1.14-1.fc24.1.x86_64
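For reference, the sequence being described is roughly the following (a sketch only; the node names are the ones above, and the start-up commands are assumed, since the report just says the nodes were brought online):

    # bring up two of the three nodes; rawhide3 stays down
    [root@rawhide1 ~]# pcs cluster start
    [root@rawhide2 ~]# pcs cluster start

    [root@rawhide1 ~]# pcs status                      # rawhide3: UNCLEAN (offline) -- expected
    [root@rawhide1 ~]# pcs stonith confirm rawhide3    # manually confirm the fence
    [root@rawhide1 ~]# pcs status                      # rawhide3 still UNCLEAN instead of OFFLINE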

--- Additional comment from Ken Gaillot on 2016-03-04 13:03:09 EST ---

This does look like a regression. Will investigate further.

--- Additional comment from Andrew Beekhof on 2016-03-22 23:47:40 EDT ---

Really depends on why the cluster thinks it's unclean.

crm/sos report?

--- Additional comment from Andrew Price on 2016-03-23 08:53:07 EDT ---

(In reply to Andrew Beekhof from comment #2)
> Really depends why the cluster thinks its unclean.

Well, the third node just hasn't been booted, so the cluster doesn't know what state it's in. That's why I'm doing the manual confirm.

> crm/sos report?

On its way.

--- Additional comment from Andrew Price on 2016-03-23 08:58 EDT ---

Attached is a crm report for the startup of two of the three nodes and the 'pcs stonith confirm rawhide3' that was done at 10:15:41.

--- Additional comment from Andrew Beekhof on 2016-03-23 22:59:48 EDT ---

Looks like we're doing almost all of the right things:

Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:   notice: tengine_stonith_notify:	Peer rawhide3 was terminated (off) by a human for rawhide1: OK (ref=f3b7d46a-4094-4488-8293-13ee416efbf1) by client stonith_admin.897
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:   notice: crm_update_peer_state_iter:	crmd_peer_down: Node rawhide3[3] - state is now lost (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: peer_update_callback:	rawhide3 is now lost (was in unknown state)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: crm_update_peer_proc:	crmd_peer_down: Node rawhide3[3] - all processes are now offline
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: peer_update_callback:	Client rawhide3/peer now has status [offline] (DC=true, changed=     1)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: crm_update_peer_expected:	crmd_peer_down: Node rawhide3[3] - expected state is now down (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: erase_status_tag:	Deleting xpath: //node_state[@uname='rawhide3']/lrm
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: erase_status_tag:	Deleting xpath: //node_state[@uname='rawhide3']/transient_attributes
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk       crmd:     info: tengine_stonith_notify:	External fencing operation from stonith_admin.897 fenced rawhide3

However, when it comes to the part where the PE unpacks the node (pe-warn-406.bz2), we get:

(    unpack.c:1324  )   trace: determine_online_status_fencing:	rawhide3: in_cluster=<null>, is_peer=offline, join=down, expected=down, term=0
(    unpack.c:1498  ) warning: determine_online_status:	Node rawhide3 is unclean

So it looks like we forgot to set in_ccm="true" along with expected and join.
Possibly because the node state was previously unknown instead of online.
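For context, a rough sketch of the node_state entry the policy engine sees in the CIB status section (attribute names inferred from the trace above and the xpath deletions in the log; values illustrative):

    <node_state id="3" uname="rawhide3" crmd="offline" join="down" expected="down"/>
    <!-- in_ccm is absent (the in_cluster=<null> above), so the PE cannot tell
         whether rawhide3 left the membership cleanly and marks it unclean -->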

--- Additional comment from Andrew Beekhof on 2016-03-24 01:23:05 EDT ---

Ken: Can you test this for me please?

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index 34abe81..9a65e2f 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -146,6 +146,14 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
                 }
             }
 
+            if(AM_I_DC && data == NULL) {
+                /* If this is due to a manual fencing event and its the first
+                 * time we've seen this node, it is important to get
+                 * the cluster state into the cib
+                 */
+                populate_cib_nodes(node_update_cluster, __FUNCTION__);
+            }
+
             crmd_notify_node_event(node);
             break;

--- Additional comment from Ken Gaillot on 2016-03-24 12:47:03 EDT ---

This is a regression introduced by 8b98a9b2, which took node_update_cluster out of send_stonith_update(), instead relying on the membership layer to handle that. That fixed one situation (when the node rejoins before the stonith result is reported) but caused this issue (when the node doesn't rejoin).

This is not limited to manual fences; any fencing of an unseen node (that remains unseen after the fence) will exhibit the issue.

This is fixed upstream as of commit 98457d16 (which modifies send_stonith_update() rather than use the patch in Comment 6).

@Jan, can you either backport that commit or rebase on 5a6cdd11 for Fedora?

Comment 1 Ken Gaillot 2016-03-24 17:08:11 UTC
The reproducer for this is very simple, quick and reliable:

1. Configure a pacemaker cluster of at least three nodes.

2. Start all but one of the nodes. Pacemaker will fence the unseen node.

3. With any of the pacemaker-1.1.14 builds for 6.8, up to and including pacemaker-1.1.14-7.el6, the fenced node will be shown as unclean in pcs status after being fenced (and will likely be fenced multiple times). With either earlier or later builds, the node should be shown as (cleanly) offline in the status (see the sketch below).
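A minimal shell sketch of these steps (hypothetical node names; assumes pcs manages the cluster and a working fence device is configured):

    # start the cluster on two of the three nodes; node3 is never started
    [root@node1 ~]# pcs cluster start
    [root@node2 ~]# pcs cluster start

    # once the cluster is quorate, pacemaker fences the unseen node3
    [root@node1 ~]# pcs status
    # affected builds: "Node node3: UNCLEAN (offline)" persists and node3 may
    #                  be fenced repeatedly
    # earlier/later builds: node3 is listed under "OFFLINE: [ node3 ]"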

Comment 3 Ken Gaillot 2016-03-24 19:07:27 UTC
updated build available

Comment 7 errata-xmlrpc 2016-05-10 23:53:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0856.html

