Bug 1312050
| Field | Value |
|---|---|
| Summary | unseen node remains unclean after fencing [rawhide] |
| Product | Fedora |
| Component | pacemaker |
| Version | rawhide |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Reporter | Andrew Price <anprice> |
| Assignee | Jan Pokorný [poki] <jpokorny> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | abeekhof, andrew, jpokorny, kgaillot, lhh |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | pacemaker-1.1.14-2.5a6cdd1.git.fc24, pacemaker-1.1.14-2.5a6cdd1.git.fc23 |
| Doc Type | Bug Fix |
| Type | Bug |
| Bug Blocks | 1321110 (view as bug list) |
| Last Closed | 2016-04-12 09:41:11 UTC |
This does look like a regression. Will investigate further.

Really depends why the cluster thinks it's unclean. crm/sos report?

(In reply to Andrew Beekhof from comment #2)
> Really depends why the cluster thinks it's unclean.

Well, the third node just hasn't been booted, so it doesn't know what state it's in. That's why I'm doing the manual confirm.

> crm/sos report?

On its way.

Created attachment 1139569 [details]
crm report

Attached is a crm report for the startup of two of the three nodes and the 'pcs stonith confirm rawhide3' that was done at 10:15:41.
Looks like we're doing almost all of the right things:

```
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk crmd: notice: tengine_stonith_notify: Peer rawhide3 was terminated (off) by a human for rawhide1: OK (ref=f3b7d46a-4094-4488-8293-13ee416efbf1) by client stonith_admin.897
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk crmd: notice: crm_update_peer_state_iter: crmd_peer_down: Node rawhide3[3] - state is now lost (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk crmd: info: peer_update_callback: rawhide3 is now lost (was in unknown state)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk crmd: info: crm_update_peer_proc: crmd_peer_down: Node rawhide3[3] - all processes are now offline
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk crmd: info: peer_update_callback: Client rawhide3/peer now has status [offline] (DC=true, changed=1)
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk crmd: info: crm_update_peer_expected: crmd_peer_down: Node rawhide3[3] - expected state is now down (was (null))
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='rawhide3']/lrm
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='rawhide3']/transient_attributes
Mar 23 10:15:41 [801] rawhide1.andrewprice.me.uk crmd: info: tengine_stonith_notify: External fencing operation from stonith_admin.897 fenced rawhide3
```

However, when it comes to the part where the PE unpacks the node (pe-warn-406.bz2), we get:

```
( unpack.c:1324 ) trace: determine_online_status_fencing: rawhide3: in_cluster=<null>, is_peer=offline, join=down, expected=down, term=0
( unpack.c:1498 ) warning: determine_online_status: Node rawhide3 is unclean
```

So it looks like we forgot to set in_ccm="true" along with expected and join. Possibly because the node state was previously unknown instead of online.

Ken: Can you test this for me please?
```diff
diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index 34abe81..9a65e2f 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -146,6 +146,14 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
         }
     }
 
+    if(AM_I_DC && data == NULL) {
+        /* If this is due to a manual fencing event and its the first
+         * time we've seen this node, it is important to get
+         * the cluster state into the cib
+         */
+        populate_cib_nodes(node_update_cluster, __FUNCTION__);
+    }
+
     crmd_notify_node_event(node);
     break;
```

This is a regression introduced by 8b98a9b2, which took node_update_cluster out of send_stonith_update(), instead relying on the membership layer to handle that. That fixed one situation (when the node rejoins before the stonith result is reported) but caused this issue (when the node doesn't rejoin). This is not limited to manual fences; any fencing of an unseen node (that remains unseen after the fence) will exhibit the issue.

This is fixed upstream as of commit 98457d16 (which modifies send_stonith_update() rather than using the patch in Comment 6). @Jan, can you either backport that commit or rebase on 5a6cdd11 for Fedora?

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

(Changing the summary to be more accurate.)

re [comment 7]: Yep, new builds are in the works.

I can confirm that this is fixed in the new pacemaker-1.1.14-2.5a6cdd1.git.fc25.x86_64 package - thanks!

pacemaker-1.1.14-2.5a6cdd1.git.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-3c079b3005

pacemaker-1.1.14-2.5a6cdd1.git.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-d9d5375015

pacemaker-1.1.14-2.5a6cdd1.git.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-3c079b3005

pacemaker-1.1.14-2.5a6cdd1.git.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-d9d5375015

pacemaker-1.1.14-2.5a6cdd1.git.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

pacemaker-1.1.14-2.5a6cdd1.git.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.
Created attachment 1130590 [details]
pacemaker.log excerpt

When I bring 2 nodes of my 3-node cluster online, 'pcs status' shows:

```
Node rawhide3: UNCLEAN (offline)
Online: [ rawhide1 rawhide2 ]
```

which is expected. 'pcs stonith confirm rawhide3' then says:

```
Node: rawhide3 confirmed fenced
```

so I would now expect to see:

```
Online: [ rawhide1 rawhide2 ]
OFFLINE: [ rawhide3 ]
```

but instead I still see:

```
Node rawhide3: UNCLEAN (offline)
Online: [ rawhide1 rawhide2 ]
```

and the cluster resources don't start. Bringing rawhide3 temporarily online clears this up, but I don't think I should have to do that if the cluster is quorate (I didn't before recently, anyway).

The pacemaker.log entries generated by the 'pcs stonith confirm rawhide3' event are attached. It seems odd that it's attempting to fence rawhide3 after the manual confirmation.

```
[root@rawhide1 ~]# rpm -q pcs pacemaker
pcs-0.9.149-2.fc24.x86_64
pacemaker-1.1.14-1.fc24.1.x86_64
```