Bug 996850
Summary: | Unfence at cluster startup with fence_scsi | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Andrew Beekhof <abeekhof> | ||||||
Component: | pacemaker | Assignee: | Andrew Beekhof <abeekhof> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||||
Severity: | urgent | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 6.4 | CC: | cluster-maint, dvossel, fdinitto, jkortus, tlavigne | ||||||
Target Milestone: | rc | ||||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | pacemaker-1.1.10-14.el6 | Doc Type: | Bug Fix | ||||||
Doc Text: |
Cause:
Fencing is configured in Pacemaker not cman
Consequence:
The call to fence_node -U in the cman init script is a no-op
Fix:
Support the concept of automated unfencing in pacemaker
Result:
Messages such as these should be visible in the log files during startup:
stonith-ng[7472]: notice: cib_device_update: Unfencing ourselves with fence_scsi (Fabric)
...
stonith-ng[7472]: notice: log_operation: Operation 'on' [7485] (call -1 from stonith-ng) for host 'pcmk-5' with device 'Fabric' returned: 0
|
Story Points: | --- | ||||||
Clone Of: | 996576 | Environment: | |||||||
Last Closed: | 2013-11-21 12:10:07 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Comment 4
Jaroslav Kortus
2013-09-16 16:24:12 UTC
Created attachment 798357 [details]
crmd core
Unfortunately I can't attach a meaningful backtrace, as the debuginfo rpm seems to have been built incorrectly:
error: /mnt/redhat/brewroot/packages/pacemaker/1.1.10/9.el6/x86_64/pacemaker-debuginfo-1.1.10-9.el6.x86_64.rpm: Header SHA1 digest: BAD
(In reply to Jaroslav Kortus from comment #4)
> On pacemaker-1.1.10-9.el6.x86_64 (selinux=permissive):
>
> SCSI fencing is still not functional.
>
> stonith config command:
> pcs stonith create scsi fence_scsi devices="/dev/sdb"
> pcmk_host_list="bucek-02.cluster-qe.lab.eng.brq.redhat.com"
> pcmk_host_check="static-list" action="off"

Core dumps aside, this isn't doing what you think it's doing.

Although it is now possible to specify action= as part of the device, it is not recommended generally and definitely not a good idea for this case in particular.

Pacemaker uses these device definitions for more than just fencing; it also uses them for unfencing and health checks. So the reason unfencing does not work is that you've forced it to use action=off.

If this is the only kind of fencing being configured, then you can set the cluster-wide option:

stonith-action = enum [reboot]
    Action to send to STONITH device
    Allowed values: reboot, poweroff, off

eg. stonith-action=off

Alternatively, you can set the per-device parameter:

pcmk_reboot_action = string [reboot]
    Advanced use only: An alternate command to run instead of 'reboot'

eg. pcmk_reboot_action=off

I'll begin investigating the core dump now

Had to reinstall the machine, it started to fail on any RPM operation. Fresh install+debuginfos shows:

Program terminated with signal 11, Segmentation fault.
#0 0x0000003b74c4812c in _IO_vfprintf_internal (s=<value optimized out>, format=<value optimized out>, ap=<value optimized out>) at vfprintf.c:1641
1641 process_string_arg (((struct printf_spec *) NULL));
(gdb) bt
#0 0x0000003b74c4812c in _IO_vfprintf_internal (s=<value optimized out>, format=<value optimized out>, ap=<value optimized out>) at vfprintf.c:1641
#1 0x0000003b74cffcc0 in ___vsnprintf_chk (s=0x7fffd7dc1450 "2 pending LRM operations at shutdown... waiting", maxlen=<value optimized out>, flags=1, slen=<value optimized out>, format=0x1fd3150 "%d pending LRM operations at %s%s", args=0x7fffd7dc1420) at vsnprintf_chk.c:65
#2 0x0000003b7c014774 in vsnprintf (cs=0x1fd4080, ap=0x7fffd7dc16a0) at /usr/include/bits/stdio2.h:78
#3 qb_log_real_va_ (cs=0x1fd4080, ap=0x7fffd7dc16a0) at log.c:193
#4 0x0000003b7c01484c in qb_log_from_external_source (function=<value optimized out>, filename=<value optimized out>, format=<value optimized out>, priority=<value optimized out>, lineno=<value optimized out>, tags=<value optimized out>) at log.c:340
#5 0x000000000041e244 in lrm_state_verify_stopped (lrm_state=0x1f7a840, cur_state=S_STOPPING, log_level=<value optimized out>) at lrm.c:370
#6 0x00000000004236d0 in do_lrm_control (action=144115188075855872, cause=<value optimized out>, cur_state=S_STOPPING, current_input=<value optimized out>, msg_data= 0x1fd4bf0) at lrm.c:282
#7 0x0000000000408353 in s_crmd_fsa_actions (fsa_data=0x1fd4bf0) at fsa.c:416
#8 0x0000000000409626 in s_crmd_fsa (cause=<value optimized out>) at fsa.c:231
#9 0x0000000000411f9e in crm_fsa_trigger (user_data=<value optimized out>) at callbacks.c:256
#10 0x0000003b7cc2a8f3 in crm_trigger_dispatch (source=0x1e74a00, callback=<value optimized out>, userdata=<value optimized out>) at mainloop.c:105
#11 0x0000003b7603feb2 in g_main_dispatch (context=0x1e680e0) at gmain.c:2149
#12 g_main_context_dispatch (context=0x1e680e0) at gmain.c:2702
#13 0x0000003b76043d68 in g_main_context_iterate (context=0x1e680e0, block=1, dispatch=1, self=<value optimized out>) at gmain.c:2780
#14 0x0000003b76044275 in g_main_loop_run (loop=0x1e68ed0) at gmain.c:2988
#15 0x00000000004054de in crmd_init () at main.c:154
#16 0x000000000040581c in main (argc=1, argv=0x7fffd7dc1d08) at main.c:121

Created attachment 799388 [details]
crm_report from the crash event
Hi Andrew,

thanks for clarification, this command works as expected:
pcs stonith create --force scsi fence_scsi devices="/dev/sdb" pcmk_host_list="bucek-02" pcmk_host_check="static-list" pcmk_reboot_action="off"

We should probably either document this necessity for fence_scsi or (probably better) add reboot option to fence_scsi which would be equal to "off".

Observed bugs:
1. it still registers all nodes which is IMHO wrong.
2. when you delete the stonith device and create it again, it does not trigger the registration again. It should register with the devices whenever the stonith device is created (possibly even modified).

What do you think?

(In reply to Jaroslav Kortus from comment #10)
> Hi Andrew,
>
> thanks for clarification, this command works as expected:
> pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> pcmk_reboot_action="off"
>
> We should probably either document this necessity for fence_scsi or
> (probably better) add reboot option to fence_scsi which would be equal to
> "off".

Agreed that the latter is probably preferable.

>
> Observed bugs:
> 1. it still registers all nodes which is IMHO wrong.

Not sure I follow this. Can you elaborate.

> 2. when you delete the stonith device and create it again, it does not
> trigger the registration again.

I thought that was the intended behaviour.

> It should register with the devices whenever
> the stonith device is created (possibly even modified).
>
>
> What do you think?

We can do that. For 6.5 or >= 6.6 ?

(In reply to Andrew Beekhof from comment #11)
> > 2. when you delete the stonith device and create it again, it does not
> > trigger the registration again.
>
> I thought that was the intended behaviour.

This is the same behaviour as cman/rgmanager based cluster. If you change cluster.conf you will need to re-run unfencing manually. I don't see this either as a regression or something we need to address in 6.5.

>
> > It should register with the devices whenever
> > the stonith device is created (possibly even modified).
> >
> >
> > What do you think?
>
> We can do that. For 6.5 or >= 6.6 ?

RFE.. as above.

(In reply to Andrew Beekhof from comment #11)
> (In reply to Jaroslav Kortus from comment #10)
> > Hi Andrew,
> >
> > thanks for clarification, this command works as expected:
> > pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> > pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> > pcmk_reboot_action="off"
> > Observed bugs:
> > 1. it still registers all nodes which is IMHO wrong.
>
> Not sure I follow this. Can you elaborate.

With the command above there should be only one key, as the fencing is valid for one node only. Instead, it creates 3 keys (one for each node).

(In reply to Jaroslav Kortus from comment #13)
> (In reply to Andrew Beekhof from comment #11)
> > (In reply to Jaroslav Kortus from comment #10)
> > > Hi Andrew,
> > >
> > > thanks for clarification, this command works as expected:
> > > pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> > > pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> > > pcmk_reboot_action="off"
> > > Observed bugs:
> > > 1. it still registers all nodes which is IMHO wrong.
> >
> > Not sure I follow this. Can you elaborate.
>
> With the command above there should be only one key, as the fencing is valid
> for one node only. Instead, it creates 3 keys (one for each node).

Do you have three of these devices defined, each with a single node in pcmk_host_list?

a) Don't do that. Just a single device is enough/recommended
b) That still shouldn't result in what you're seeing

How many times do you see this message?

crm_notice("Unfencing ourselves with %s (%s)", agent, device->id);

Because it is protected by a static boolean and is the only path to the unfencing logic:

if(have_fence_scsi == FALSE && safe_str_eq(agent, "fence_scsi")) {
    stonith_device_t *device = g_hash_table_lookup(device_list, rsc->id);

    if(device) {
        have_fence_scsi = TRUE;
        crm_notice("Unfencing ourselves with %s (%s)", agent, device->id);
        schedule_internal_command(__FUNCTION__, device, "on", stonith_our_uname, 0, NULL, unfence_cb);

(In reply to Andrew Beekhof from comment #14)
> (In reply to Jaroslav Kortus from comment #13)
> > (In reply to Andrew Beekhof from comment #11)
> > > (In reply to Jaroslav Kortus from comment #10)
> > > > Hi Andrew,
> > > >
> > > > thanks for clarification, this command works as expected:
> > > > pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> > > > pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> > > > pcmk_reboot_action="off"
> > > > Observed bugs:
> > > > 1. it still registers all nodes which is IMHO wrong.
> > >
> > > Not sure I follow this. Can you elaborate.
> >
> > With the command above there should be only one key, as the fencing is valid
> > for one node only. Instead, it creates 3 keys (one for each node).
>
> Do you have three of these devices defined, each with a single node in
> pcmk_host_list?
>
> a) Don't do that. Just a single device is enough/recommended

The fencing command presented was the only one in the conf.

> b) That still shouldn't result in what you're seeing
>
> How many times do you see this message?
>
> crm_notice("Unfencing ourselves with %s (%s)", agent, device->id);
>

I can see it once per node:
Sep 19 15:43:29 bucek-01 stonith-ng[16607]: notice: cib_device_update: Unfencing ourselves with fence_scsi (fence_scsi)
Sep 19 15:43:29 bucek-02 stonith-ng[17038]: notice: cib_device_update: Unfencing ourselves with fence_scsi (fence_scsi)
Sep 19 15:43:29 bucek-03 stonith-ng[16939]: notice: cib_device_update: Unfencing ourselves with fence_scsi (fence_scsi)

(In reply to Fabio Massimo Di Nitto from comment #12)
> (In reply to Andrew Beekhof from comment #11)
>
> > > 2. when you delete the stonith device and create it again, it does not
> > > trigger the registration again.
> >
> > I thought that was the intended behaviour.
>
> This is the same behaviour as cman/rgmanager based cluster. If you change
> cluster.conf you will need to re-run unfencing manually. I don't see this
> either as a regression or something we need to address in 6.5.

I never said this was a regression, just a difference from what I expected. Considering that it behaves consistently with cman/rgmanager, do you think this should be changed (via RFE)?

Would it make sense to make it work that way or are there more reasons for current implementation?

(In reply to Jaroslav Kortus from comment #16)
> (In reply to Fabio Massimo Di Nitto from comment #12)
> > (In reply to Andrew Beekhof from comment #11)
> > >
> > > > 2. when you delete the stonith device and create it again, it does not
> > > > trigger the registration again.
> > >
> > > I thought that was the intended behaviour.
> >
> > This is the same behaviour as cman/rgmanager based cluster. If you change
> > cluster.conf you will need to re-run unfencing manually. I don't see this
> > either as a regression or something we need to address in 6.5.
>
> I never said this was a regression, just a difference from what I expected.
> Considering that it behaves consistently with cman/rgmanager, do you think
> this should be changed (via RFE)?

If anything it would be an RFE yes.

>
> Would it make sense to make it work that way or are there more reasons for
> current implementation?

I don't know if there are any side effects of doing it as you suggest. We will need to investigate with Marek for possible corner cases such as "clean up after a config change" and stuff like that...

(In reply to Andrew Beekhof from comment #11)
> (In reply to Jaroslav Kortus from comment #10)
> > Hi Andrew,
> >
> > thanks for clarification, this command works as expected:
> > pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> > pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> > pcmk_reboot_action="off"
> >
> > We should probably either document this necessity for fence_scsi or
> > (probably better) add reboot option to fence_scsi which would be equal to
> > "off".
>
> Agreed that the latter is probably preferable.

I've looked at other fence agents and why it worked with cman, and the reason is that cman probably uses the default actions of the agents (which defaults to "off" in our case).

So maybe the desired change would be to change pacemaker to leave it also empty by default, so that the agent's default action is used. This sounds like a risky change to me, so it would probably not make it to 6.5, but we could document it for now and solve it better in 6.5.

Any comments on this?

(In reply to Jaroslav Kortus from comment #18)
> I've looked at other fence agents and why it worked with cman and the reason
> is that cman probably uses default actions of the agents (which is defaults
> to "off" for in our case).
>
> So maybe the desired change would be to change pacemaker to leave it also
> empty by default, so that the agent's default action is used. This sounds
> like a risky change to me,

agreed

> so it would probably not make it to 6.5, but we
> could document it for now and solve it better in 6.5.
>
> Any comments on this?

Documenting for now sounds reasonable and we can have the discussion for 6.6

I would probably still lean towards changing fence_scsi at this point though.

(In reply to Jaroslav Kortus from comment #16)
> I never said this was a regression, just a difference from what I expected.
> Considering that it behaves consistently with cman/rgmanager, do you think
> this should be changed (via RFE)?

I think it's a reasonable request. It's how I would have written it too.

> Would it make sense to make it work that way or are there more reasons for
> current implementation?

I think someone was trying to prevent unfencing without first rebooting. However I have no objection.

follow up created as https://bugzilla.redhat.com/show_bug.cgi?id=1014978

with pacemaker-1.1.10-12.el6.x86_64 scsi fencing works as expected provided that you set pcmk_reboot_action="off". Nodes are fenced properly and unfence only self on startup.

I will mark this as verified when fix for comment 8 is in.

Additional fixes are in -14

I can see fix for comment 8 in pacemaker-1.1.10-14.el6.src.rpm. Marking as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1635.html