996850 – Unfence at cluster startup with fence_scsi

Summary: Unfence at cluster startup with fence_scsi

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	6.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Andrew Beekhof
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-08-14 07:03 UTC by Andrew Beekhof
Modified:	2013-11-21 12:10 UTC
CC List:	5 users
Fixed In Version:	pacemaker-1.1.10-14.el6
Doc Type:	Bug Fix
Doc Text:	Cause: Fencing is configured in Pacemaker, not cman.
Consequence: The call to fence_node -U in the cman init script is a no-op.
Fix: Support the concept of automated unfencing in Pacemaker.
Result: Messages such as these should be visible in the log files during startup:
stonith-ng[7472]: notice: cib_device_update: Unfencing ourselves with fence_scsi (Fabric)
...
stonith-ng[7472]: notice: log_operation: Operation 'on' [7485] (call -1 from stonith-ng) for host 'pcmk-5' with device 'Fabric' returned: 0
Clone Of:	996576
Environment:
Last Closed:	2013-11-21 12:10:07 UTC
Target Upstream Version:
Embargoed:

Attachments
crmd core (222.47 KB, application/x-compressed-tar) 2013-09-16 16:25 UTC, Jaroslav Kortus
crm_report from the crash event (440.88 KB, application/x-bzip2) 2013-09-18 13:49 UTC, Jaroslav Kortus

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1014978	0	medium	CLOSED	fence-agents should have a sane a default for the reboot operation	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHSA-2013:1635	0	normal	SHIPPED_LIVE	Low: pacemaker security, bug fix, and enhancement update	2013-11-20 21:53:44 UTC

Internal Links: 1014978

Comment 4 Jaroslav Kortus 2013-09-16 16:24:12 UTC

On pacemaker-1.1.10-9.el6.x86_64 (selinux=permissive):

SCSI fencing is still not functional.

stonith config command:
pcs stonith create scsi fence_scsi devices="/dev/sdb" pcmk_host_list="bucek-02.cluster-qe.lab.eng.brq.redhat.com" pcmk_host_check="static-list" action="off"

You can either set the action="off" parameter, in which case the node is actually fenced when needed, or leave it out, in which case unfencing at startup succeeds. Getting both (unfence at startup and working fencing) is not possible.

Note that fence_scsi does not support the "reboot" operation; for our purposes the action must be either "on" or "off".

Summary:
1. It seems that the unfence action="on" is appended to the list of options instead of replacing the existing action=... at unfence time.
2. Unfencing is applied to all hosts, ignoring static-list and pcmk_host_list.
3. crmd does not handle an unfence failure well and dumps core (the core will be attached).
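
The first item in the summary contrasts appending an action to a device's option list with replacing the one already there. A minimal Python sketch of that difference follows. The function names and the list-of-pairs representation are assumptions made for illustration; they are not stonith-ng's data structures, and this report does not confirm that stonith-ng builds its arguments this way.

```python
def append_action(options, action):
    """Add the action at the end, leaving any earlier action= entry in place."""
    return options + [("action", action)]


def replace_action(options, action):
    """Drop every existing action= entry, then add the requested one."""
    kept = [(key, value) for key, value in options if key != "action"]
    return kept + [("action", action)]


# Device options modelled on the pcs command in this comment.
device_options = [
    ("devices", "/dev/sdb"),
    ("pcmk_host_check", "static-list"),
    ("action", "off"),
]

appended = append_action(device_options, "on")
replaced = replace_action(device_options, "on")

# Appending leaves two conflicting action entries for the agent to sort out.
print([value for key, value in appended if key == "action"])
# Replacing leaves only the action requested for unfencing.
print([value for key, value in replaced if key == "action"])
```

With the appended list, which of the two action entries wins depends on how the receiving agent parses its input, so the sketch deliberately makes no claim about that.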


Log from the crmd crash. Note the "fence_scsi: [error] main::do_action_off ]" part, which clearly suggests that the agent was called with the off action:
[ service pacemaker start ]
Sep 16 18:00:05 bucek-02 corosync[9833]:   [QUORUM] Members[3]: 1 2 3
Sep 16 18:00:05 bucek-02 corosync[9833]:   [CPG   ] chosen downlist: sender r(0) ip(10.34.70.43) ; members(old:2 left:0)
Sep 16 18:00:05 bucek-02 corosync[9833]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 16 18:00:09 bucek-02 fenced[9893]: fenced 3.0.12.1 started
Sep 16 18:00:09 bucek-02 dlm_controld[9915]: dlm_controld 3.0.12.1 started
Sep 16 18:00:10 bucek-02 gfs_controld[9968]: gfs_controld 3.0.12.1 started
Sep 16 18:00:13 bucek-02 pacemakerd[10049]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Sep 16 18:00:13 bucek-02 pacemakerd[10049]:   notice: main: Starting Pacemaker 1.1.10-9.el6 (Build: 368c726):  generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc nagios  corosync-plugin cman
Sep 16 18:00:13 bucek-02 lrmd[10057]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Sep 16 18:00:13 bucek-02 cib[10055]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Sep 16 18:00:13 bucek-02 attrd[10058]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Sep 16 18:00:13 bucek-02 stonith-ng[10056]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Sep 16 18:00:13 bucek-02 attrd[10058]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Sep 16 18:00:13 bucek-02 stonith-ng[10056]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Sep 16 18:00:13 bucek-02 pengine[10059]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Sep 16 18:00:13 bucek-02 crmd[10060]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Sep 16 18:00:13 bucek-02 crmd[10060]:   notice: main: CRM Git Version: 368c726
Sep 16 18:00:13 bucek-02 cib[10055]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Sep 16 18:00:13 bucek-02 attrd[10058]:   notice: main: Starting mainloop...
Sep 16 18:00:14 bucek-02 crmd[10060]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Sep 16 18:00:14 bucek-02 stonith-ng[10056]:   notice: setup_cib: Watching for stonith topology changes
Sep 16 18:00:14 bucek-02 crmd[10060]:   notice: cman_event_callback: Membership 220: quorum acquired
Sep 16 18:00:14 bucek-02 crmd[10060]:   notice: crm_update_peer_state: cman_event_callback: Node bucek-01.cluster-qe.lab.eng.brq.redhat.com[1] - state is now member (was (null))
Sep 16 18:00:14 bucek-02 crmd[10060]:   notice: crm_update_peer_state: cman_event_callback: Node bucek-02.cluster-qe.lab.eng.brq.redhat.com[2] - state is now member (was (null))
Sep 16 18:00:14 bucek-02 crmd[10060]:   notice: crm_update_peer_state: cman_event_callback: Node bucek-03.cluster-qe.lab.eng.brq.redhat.com[3] - state is now member (was (null))
Sep 16 18:00:14 bucek-02 crmd[10060]:   notice: do_started: The local CRM is operational
Sep 16 18:00:14 bucek-02 crmd[10060]:   notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Sep 16 18:00:15 bucek-02 stonith-ng[10056]:   notice: stonith_device_register: Added 'scsi' to the device list (1 active devices)
Sep 16 18:00:15 bucek-02 stonith-ng[10056]:   notice: cib_device_update: Unfencing ourselves with fence_scsi (scsi)
Sep 16 18:00:16 bucek-02 attrd[10058]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
Sep 16 18:00:16 bucek-02 crmd[10060]:   notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Sep 16 18:00:17 bucek-02 stonith-ng[10056]:    error: log_operation: Operation 'on' [10073] (call -1 from stonith-ng) for host 'bucek-02.cluster-qe.lab.eng.brq.redhat.com' with device 'scsi' returned: 1 (Operation not permitted)
Sep 16 18:00:17 bucek-02 stonith-ng[10056]:  warning: log_operation: scsi:10073 [ Sep 16 18:00:17 fence_scsi: [error] main::do_action_off ]
Sep 16 18:00:17 bucek-02 crmd[10060]:    error: crm_ipc_read: Connection to stonith-ng failed
Sep 16 18:00:17 bucek-02 pacemakerd[10049]:    error: pcmk_child_exit: Child process stonith-ng (10056) exited: Network is down (100)
Sep 16 18:00:17 bucek-02 crmd[10060]:    error: mainloop_gio_callback: Connection to stonith-ng[0xac4700] closed (I/O condition=17)
Sep 16 18:00:17 bucek-02 pacemakerd[10049]:  warning: pcmk_child_exit: Pacemaker child process stonith-ng no longer wishes to be respawned. Shutting ourselves down.
Sep 16 18:00:17 bucek-02 pacemakerd[10049]:   notice: pcmk_shutdown_worker: Shuting down Pacemaker
Sep 16 18:00:17 bucek-02 pacemakerd[10049]:   notice: stop_child: Stopping crmd: Sent -15 to process 10060
Sep 16 18:00:17 bucek-02 crmd[10060]:     crit: tengine_stonith_connection_destroy: Fencing daemon connection failed
Sep 16 18:00:17 bucek-02 crmd[10060]:   notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Sep 16 18:00:17 bucek-02 attrd[10058]:   notice: attrd_cs_dispatch: Update relayed from bucek-01.cluster-qe.lab.eng.brq.redhat.com
Sep 16 18:00:17 bucek-02 attrd[10058]:   notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (1379347217)
Sep 16 18:00:18 bucek-02 crmd[10060]:    error: te_connect_stonith: Sign-in failed: triggered a retry
Sep 16 18:00:18 bucek-02 attrd[10058]:   notice: attrd_perform_update: Sent update 4: shutdown=1379347217
Sep 16 18:00:19 bucek-02 crmd[10060]:    error: te_connect_stonith: Sign-in failed: triggered a retry
Sep 16 18:00:19 bucek-02 crmd[10060]:   notice: do_state_transition: State transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_HA_MESSAGE origin=route_message ]
Sep 16 18:00:19 bucek-02 crmd[10060]:   notice: lrm_state_verify_stopped: Stopped 1 recurring operations at shutdown... waiting (0 ops remaining)
Sep 16 18:00:19 bucek-02 kernel: crmd[10060] general protection ip:3e1104812c sp:7fff56c2b9a0 error:0 in libc-2.12.so[3e11000000+18a000]
Sep 16 18:00:19 bucek-02 abrtd: Directory 'ccpp-2013-09-16-18:00:19-10060' creation detected
Sep 16 18:00:19 bucek-02 abrt[10083]: Saved core dump of pid 10060 (/usr/libexec/pacemaker/crmd) to /var/spool/abrt/ccpp-2013-09-16-18:00:19-10060 (7098368 bytes)
Sep 16 18:00:19 bucek-02 pacemakerd[10049]:    error: child_death_dispatch: Managed process 10060 (crmd) dumped core
Sep 16 18:00:19 bucek-02 pacemakerd[10049]:   notice: pcmk_child_exit: Child process crmd terminated with signal 11 (pid=10060, core=1)
Sep 16 18:00:19 bucek-02 pacemakerd[10049]:   notice: stop_child: Stopping pengine: Sent -15 to process 10059
Sep 16 18:00:19 bucek-02 pacemakerd[10049]:   notice: stop_child: Stopping attrd: Sent -15 to process 10058
Sep 16 18:00:19 bucek-02 attrd[10058]:   notice: main: Exiting...
Sep 16 18:00:19 bucek-02 pacemakerd[10049]:   notice: stop_child: Stopping lrmd: Sent -15 to process 10057
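
When checking logs like the one above, the stonith-ng log_operation line is the one that records whether the startup unfence worked. The following is a rough Python sketch for pulling the fields out of such a line. The regular expression is derived only from the two log_operation lines quoted in this report, and the function names are hypothetical; other Pacemaker versions may format the message differently.

```python
import re

# Pattern inferred from the log_operation lines quoted in this bug report.
LOG_OPERATION = re.compile(
    r"log_operation: Operation '(?P<operation>[^']+)' \[(?P<pid>\d+)\] "
    r"\(call (?P<call>-?\d+) from (?P<client>[^)]+)\) "
    r"for host '(?P<host>[^']+)' with device '(?P<device>[^']+)' "
    r"returned: (?P<rc>-?\d+)"
)


def parse_log_operation(line):
    """Return the fields of a log_operation line, or None if it does not match."""
    match = LOG_OPERATION.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for name in ("pid", "call", "rc"):
        fields[name] = int(fields[name])
    return fields


def unfence_succeeded(line):
    """True only for an 'on' operation that returned 0."""
    fields = parse_log_operation(line)
    return fields is not None and fields["operation"] == "on" and fields["rc"] == 0


failed = ("Sep 16 18:00:17 bucek-02 stonith-ng[10056]:    error: log_operation: "
          "Operation 'on' [10073] (call -1 from stonith-ng) for host "
          "'bucek-02.cluster-qe.lab.eng.brq.redhat.com' with device 'scsi' "
          "returned: 1 (Operation not permitted)")
print(unfence_succeeded(failed))
```

The same helper applied to the "returned: 0" message from the Doc Text field reports a successful unfence.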

Comment 5 Jaroslav Kortus 2013-09-16 16:25:36 UTC

Created attachment 798357
crmd core

Unfortunately I can't attach a meaningful backtrace, as the debuginfo RPM seems to be bad:
error: /mnt/redhat/brewroot/packages/pacemaker/1.1.10/9.el6/x86_64/pacemaker-debuginfo-1.1.10-9.el6.x86_64.rpm: Header SHA1 digest: BAD

Comment 6 Andrew Beekhof 2013-09-18 04:56:08 UTC

(In reply to Jaroslav Kortus from comment #4)
> On pacemaker-1.1.10-9.el6.x86_64 (selinux=permissive):
> 
> SCSI fencing is still not functional.
> 
> stonith config command:
> pcs stonith create scsi fence_scsi devices="/dev/sdb"
> pcmk_host_list="bucek-02.cluster-qe.lab.eng.brq.redhat.com"
> pcmk_host_check="static-list" action="off"

Core dumps aside, this isn't doing what you think it's doing.
Although it is now possible to specify action= as part of the device, it is not recommended in general and is definitely not a good idea in this case.

Pacemaker uses these device definitions for more than just fencing; it also uses them for unfencing and health checks.

So the reason unfencing does not work is that you've forced it to use action=off.

If this is the only kind of fencing being configured, then you can set the cluster-wide option:

       stonith-action = enum [reboot]
           Action to send to STONITH device

           Action to send to STONITH device Allowed values: reboot, poweroff, off

e.g.

   stonith-action=off

Alternatively, you can set the per-device parameter:

   pcmk_reboot_action = string [reboot]
           Advanced use only: An alternate command to run instead of 'reboot'

e.g.

   pcmk_reboot_action=off
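
The difference between the three settings discussed in this comment can be modelled roughly as follows. This is a simplified Python sketch based only on the explanation above, not on Pacemaker's source: both function names are hypothetical, and the precedence shown (a device-level action= overriding every operation) is an assumption drawn from the statement that action=off forced the unfence to run "off".

```python
def fencing_request(cluster_options):
    """Action the cluster asks for when fencing a node (stonith-action, default reboot)."""
    return cluster_options.get("stonith-action", "reboot")


def effective_action(requested, device_params):
    """Action the fence agent ends up running, in this simplified model."""
    # A device-level action= applies to every operation, including the "on"
    # used for unfencing. This is the pitfall described in this comment.
    if "action" in device_params:
        return device_params["action"]
    # pcmk_reboot_action substitutes another command for 'reboot' only.
    if requested == "reboot":
        return device_params.get("pcmk_reboot_action", "reboot")
    return requested


# Device-level action=off: the startup unfence ("on") is turned into "off".
print(effective_action("on", {"action": "off"}))
# pcmk_reboot_action=off: unfencing stays "on", fencing becomes "off".
print(effective_action("on", {"pcmk_reboot_action": "off"}))
print(effective_action("reboot", {"pcmk_reboot_action": "off"}))
# Cluster-wide stonith-action=off: the cluster asks for "off" directly.
print(effective_action(fencing_request({"stonith-action": "off"}), {}))
```

In this model, only the device-level action= interferes with unfencing, which matches the behaviour reported in comment 4.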


I'll begin investigating the core dump now.

Comment 8 Jaroslav Kortus 2013-09-18 13:14:28 UTC

I had to reinstall the machine because it started to fail on every RPM operation. A fresh install with the debuginfo packages shows:

Program terminated with signal 11, Segmentation fault.
#0  0x0000003b74c4812c in _IO_vfprintf_internal (s=<value optimized out>, format=<value optimized out>, ap=<value optimized out>) at vfprintf.c:1641
1641		  process_string_arg (((struct printf_spec *) NULL));
(gdb) bt
#0  0x0000003b74c4812c in _IO_vfprintf_internal (s=<value optimized out>, format=<value optimized out>, ap=<value optimized out>) at vfprintf.c:1641
#1  0x0000003b74cffcc0 in ___vsnprintf_chk (s=0x7fffd7dc1450 "2 pending LRM operations at shutdown... waiting", maxlen=<value optimized out>, flags=1, 
    slen=<value optimized out>, format=0x1fd3150 "%d pending LRM operations at %s%s", args=0x7fffd7dc1420) at vsnprintf_chk.c:65
#2  0x0000003b7c014774 in vsnprintf (cs=0x1fd4080, ap=0x7fffd7dc16a0) at /usr/include/bits/stdio2.h:78
#3  qb_log_real_va_ (cs=0x1fd4080, ap=0x7fffd7dc16a0) at log.c:193
#4  0x0000003b7c01484c in qb_log_from_external_source (function=<value optimized out>, filename=<value optimized out>, format=<value optimized out>, 
    priority=<value optimized out>, lineno=<value optimized out>, tags=<value optimized out>) at log.c:340
#5  0x000000000041e244 in lrm_state_verify_stopped (lrm_state=0x1f7a840, cur_state=S_STOPPING, log_level=<value optimized out>) at lrm.c:370
#6  0x00000000004236d0 in do_lrm_control (action=144115188075855872, cause=<value optimized out>, cur_state=S_STOPPING, current_input=<value optimized out>, msg_data=
    0x1fd4bf0) at lrm.c:282
#7  0x0000000000408353 in s_crmd_fsa_actions (fsa_data=0x1fd4bf0) at fsa.c:416
#8  0x0000000000409626 in s_crmd_fsa (cause=<value optimized out>) at fsa.c:231
#9  0x0000000000411f9e in crm_fsa_trigger (user_data=<value optimized out>) at callbacks.c:256
#10 0x0000003b7cc2a8f3 in crm_trigger_dispatch (source=0x1e74a00, callback=<value optimized out>, userdata=<value optimized out>) at mainloop.c:105
#11 0x0000003b7603feb2 in g_main_dispatch (context=0x1e680e0) at gmain.c:2149
#12 g_main_context_dispatch (context=0x1e680e0) at gmain.c:2702
#13 0x0000003b76043d68 in g_main_context_iterate (context=0x1e680e0, block=1, dispatch=1, self=<value optimized out>) at gmain.c:2780
#14 0x0000003b76044275 in g_main_loop_run (loop=0x1e68ed0) at gmain.c:2988
#15 0x00000000004054de in crmd_init () at main.c:154
#16 0x000000000040581c in main (argc=1, argv=0x7fffd7dc1d08) at main.c:121

Comment 9 Jaroslav Kortus 2013-09-18 13:49:11 UTC

Created attachment 799388
crm_report from the crash event

Comment 10 Jaroslav Kortus 2013-09-18 14:08:53 UTC

Hi Andrew,

thanks for clarification, this command works as expected:
pcs stonith create --force scsi fence_scsi devices="/dev/sdb" pcmk_host_list="bucek-02" pcmk_host_check="static-list" pcmk_reboot_action="off"

We should probably either document this necessity for fence_scsi or (probably better) add reboot option to fence_scsi which would be equal to "off".

Observed bugs:
1. it still registers all nodes which is IMHO wrong.
2. when you delete the stonith device and create it again, it does not trigger the registration again. It should register with the devices whenever the stonith device is created (possibly even modified).


What do you think?

Comment 11 Andrew Beekhof 2013-09-19 07:32:19 UTC

(In reply to Jaroslav Kortus from comment #10)
> Hi Andrew,
> 
> thanks for clarification, this command works as expected:
> pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> pcmk_reboot_action="off"
> 
> We should probably either document this necessity for fence_scsi or
> (probably better) add reboot option to fence_scsi which would be equal to
> "off".

Agreed that the latter is probably preferable.

> 
> Observed bugs:
> 1. it still registers all nodes which is IMHO wrong.

Not sure I follow this.  Can you elaborate.

> 2. when you delete the stonith device and create it again, it does not
> trigger the registration again.

I thought that was the intended behaviour.

> It should register with the devices whenever
> the stonith device is created (possibly even modified).
> 
> 
> What do you think?

We can do that.  For 6.5 or >= 6.6 ?

Comment 12 Fabio Massimo Di Nitto 2013-09-19 07:43:15 UTC

(In reply to Andrew Beekhof from comment #11)

> > 2. when you delete the stonith device and create it again, it does not
> > trigger the registration again.
> 
> I thought that was the intended behaviour.

This is the same behaviour as cman/rgmanager based cluster. If you change cluster.conf you will need to re-run unfencing manually. I don't see this either as a regression or something we need to address in 6.5.

> 
> > It should register with the devices whenever
> > the stonith device is created (possibly even modified).
> > 
> > 
> > What do you think?
> 
> We can do that.  For 6.5 or >= 6.6 ?

RFE.. as above.

Comment 13 Jaroslav Kortus 2013-09-19 10:30:53 UTC

(In reply to Andrew Beekhof from comment #11)
> (In reply to Jaroslav Kortus from comment #10)
> > Hi Andrew,
> > 
> > thanks for clarification, this command works as expected:
> > pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> > pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> > pcmk_reboot_action="off"
> > Observed bugs:
> > 1. it still registers all nodes which is IMHO wrong.
> 
> Not sure I follow this.  Can you elaborate.

With the command above there should be only one key, as the fencing is valid for one node only. Instead, it creates 3 keys (one for each node).

Comment 14 Andrew Beekhof 2013-09-19 11:16:55 UTC

(In reply to Jaroslav Kortus from comment #13)
> (In reply to Andrew Beekhof from comment #11)
> > (In reply to Jaroslav Kortus from comment #10)
> > > Hi Andrew,
> > > 
> > > thanks for clarification, this command works as expected:
> > > pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> > > pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> > > pcmk_reboot_action="off"
> > > Observed bugs:
> > > 1. it still registers all nodes which is IMHO wrong.
> > 
> > Not sure I follow this.  Can you elaborate.
> 
> With the command above there should be only one key, as the fencing is valid
> for one node only. Instead, it creates 3 keys (one for each node).

Do you have three of these devices defined, each with a single node in pcmk_host_list?

a) Don't do that. Just a single device is enough/recommended
b) That still shouldn't result in what you're seeing

How many times do you see this message?

   crm_notice("Unfencing ourselves with %s (%s)", agent, device->id);

I ask because the message is protected by a static boolean, and this is the only path to the unfencing logic:

  if(have_fence_scsi == FALSE && safe_str_eq(agent, "fence_scsi")) {
      stonith_device_t *device = g_hash_table_lookup(device_list, rsc->id);

      if(device) {
          have_fence_scsi = TRUE;
          crm_notice("Unfencing ourselves with %s (%s)", agent, device->id);
          schedule_internal_command(__FUNCTION__, device, "on", stonith_our_uname, 0, NULL, unfence_cb);
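
The guard in the excerpt above is a once-per-process flag. The toy Python model below shows the consequence, assuming each node runs its own stonith-ng process with its own copy of the flag. The class and method names are hypothetical, and the device lookup from the C excerpt is omitted for brevity.

```python
class StonithDaemon:
    """Toy stand-in for one node's stonith-ng process (not Pacemaker code)."""

    def __init__(self, node_name):
        self.node_name = node_name
        self.have_fence_scsi = False  # plays the role of the static boolean
        self.scheduled = []           # (action, target) pairs queued so far

    def device_update(self, agent, device_id):
        """Queue an 'on' for ourselves the first time a fence_scsi device appears."""
        if not self.have_fence_scsi and agent == "fence_scsi":
            self.have_fence_scsi = True
            self.scheduled.append(("on", self.node_name))


daemons = [StonithDaemon(name) for name in ("bucek-01", "bucek-02", "bucek-03")]
for daemon in daemons:
    daemon.device_update("fence_scsi", "scsi")
    daemon.device_update("fence_scsi", "scsi")  # repeated update is ignored

print([daemon.scheduled for daemon in daemons])
```

Under these assumptions each process schedules exactly one "on" for its own node, no matter how many times the device is updated. Because the flag is never reset, deleting and re-creating the device within the same process would not schedule another one.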

Comment 15 Jaroslav Kortus 2013-09-19 13:45:03 UTC

(In reply to Andrew Beekhof from comment #14)
> (In reply to Jaroslav Kortus from comment #13)
> > (In reply to Andrew Beekhof from comment #11)
> > > (In reply to Jaroslav Kortus from comment #10)
> > > > Hi Andrew,
> > > > 
> > > > thanks for clarification, this command works as expected:
> > > > pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> > > > pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> > > > pcmk_reboot_action="off"
> > > > Observed bugs:
> > > > 1. it still registers all nodes which is IMHO wrong.
> > > 
> > > Not sure I follow this.  Can you elaborate.
> > 
> > With the command above there should be only one key, as the fencing is valid
> > for one node only. Instead, it creates 3 keys (one for each node).
> 
> Do you have three of these devices defined, each with a single node in
> pcmk_host_list?
> 
> a) Don't do that. Just a single device is enough/recommended

The fencing command shown was the only one in the configuration.

> b) That still shouldn't result in what you're seeing
> 
> How many times do you see this message?
> 
>    crm_notice("Unfencing ourselves with %s (%s)", agent, device->id);
> 

I can see it once per node:
Sep 19 15:43:29 bucek-01 stonith-ng[16607]:   notice: cib_device_update: Unfencing ourselves with fence_scsi (fence_scsi)
Sep 19 15:43:29 bucek-02 stonith-ng[17038]:   notice: cib_device_update: Unfencing ourselves with fence_scsi (fence_scsi)
Sep 19 15:43:29 bucek-03 stonith-ng[16939]:   notice: cib_device_update: Unfencing ourselves with fence_scsi (fence_scsi)

Comment 16 Jaroslav Kortus 2013-09-19 13:48:39 UTC

(In reply to Fabio Massimo Di Nitto from comment #12)
> (In reply to Andrew Beekhof from comment #11)
> 
> > > 2. when you delete the stonith device and create it again, it does not
> > > trigger the registration again.
> > 
> > I thought that was the intended behaviour.
> 
> This is the same behaviour as cman/rgmanager based cluster. If you change
> cluster.conf you will need to re-run unfencing manually. I don't see this
> either as a regression or something we need to address in 6.5.

I never said this was a regression, just a difference from what I expected. Considering that it behaves consistently with cman/rgmanager, do you think this should be changed (via RFE)? 

Would it make sense to make it work that way or are there more reasons for current implementation?

Comment 17 Fabio Massimo Di Nitto 2013-09-19 13:51:08 UTC

(In reply to Jaroslav Kortus from comment #16)
> (In reply to Fabio Massimo Di Nitto from comment #12)
> > (In reply to Andrew Beekhof from comment #11)
> > 
> > > > 2. when you delete the stonith device and create it again, it does not
> > > > trigger the registration again.
> > > 
> > > I thought that was the intended behaviour.
> > 
> > This is the same behaviour as cman/rgmanager based cluster. If you change
> > cluster.conf you will need to re-run unfencing manually. I don't see this
> > either as a regression or something we need to address in 6.5.
> 
> I never said this was a regression, just a difference from what I expected.
> Considering that it behaves consistently with cman/rgmanager, do you think
> this should be changed (via RFE)? 

If anything, it would be an RFE, yes.

> 
> Would it make sense to make it work that way or are there more reasons for
> current implementation?

I don't know if there are any side effects of doing it as you suggest. We will need to investigate with Marek for possible corner cases such as "clean up after a config change" and the like.

Comment 18 Jaroslav Kortus 2013-09-19 14:52:46 UTC

(In reply to Andrew Beekhof from comment #11)
> (In reply to Jaroslav Kortus from comment #10)
> > Hi Andrew,
> > 
> > thanks for clarification, this command works as expected:
> > pcs stonith create --force scsi fence_scsi devices="/dev/sdb"
> > pcmk_host_list="bucek-02" pcmk_host_check="static-list"
> > pcmk_reboot_action="off"
> > 
> > We should probably either document this necessity for fence_scsi or
> > (probably better) add reboot option to fence_scsi which would be equal to
> > "off".
> 
> > Agreed that the latter is probably preferable.

I've looked at other fence agents and at why this worked with cman. The reason is probably that cman uses the agents' default action (which is "off" in our case).

So maybe the desired change would be to change pacemaker to leave it also empty by default, so that the agent's default action is used. This sounds like a risky change to me, so it would probably not make it to 6.5, but we could document it for now and solve it better in 6.5.

Any comments on this?

Comment 19 Andrew Beekhof 2013-09-19 23:38:11 UTC

(In reply to Jaroslav Kortus from comment #18)
> I've looked at other fence agents and at why this worked with cman. The
> reason is probably that cman uses the agents' default action (which is
> "off" in our case).
> 
> So maybe the desired change would be to change pacemaker to leave it also
> empty by default, so that the agent's default action is used. This sounds
> like a risky change to me,

agreed

> so it would probably not make it to 6.5, but we
> could document it for now and solve it better in 6.5.
> 
> Any comments on this?

Documenting it for now sounds reasonable, and we can have the discussion for 6.6.
I would probably still lean towards changing fence_scsi at this point, though.

Comment 20 Andrew Beekhof 2013-09-19 23:41:32 UTC

(In reply to Jaroslav Kortus from comment #16)
> I never said this was a regression, just a difference from what I expected.
> Considering that it behaves consistently with cman/rgmanager, do you think
> this should be changed (via RFE)? 

I think it's a reasonable request.  It's how I would have written it too.

> Would it make sense to make it work that way or are there more reasons for
> current implementation?

I think someone was trying to prevent unfencing without first rebooting.
However, I have no objection.

Comment 22 Jaroslav Kortus 2013-10-03 09:06:52 UTC

Follow-up created as https://bugzilla.redhat.com/show_bug.cgi?id=1014978

Comment 23 Jaroslav Kortus 2013-10-03 10:38:53 UTC

With pacemaker-1.1.10-12.el6.x86_64, SCSI fencing works as expected, provided that you set pcmk_reboot_action="off".

Nodes are fenced properly and unfence only themselves on startup.

I will mark this as verified when the fix for comment 8 is in.

Comment 24 Andrew Beekhof 2013-10-04 03:53:50 UTC

Additional fixes are in -14

Comment 25 Jaroslav Kortus 2013-10-07 18:40:27 UTC

I can see the fix for comment 8 in pacemaker-1.1.10-14.el6.src.rpm. Marking as verified.

Comment 26 errata-xmlrpc 2013-11-21 12:10:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1635.html
