Bug 865161 - fence_scsi requires both fencing and unfencing section in cluster.conf
Status: CLOSED DUPLICATE of bug 877098
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: luci
Version: 6.3
Severity: unspecified
Target Milestone: rc
Assigned To: Ryan McCabe
QA Contact: Cluster QE
Keywords: TestOnly
Depends On: 887349 987070
Blocks: 893574 893575
Reported: 2012-10-10 17:56 EDT by Christoph Torlinsky
Modified: 2013-11-15 15:21 EST
CC: 12 users

Doc Type: Bug Fix
Clones: 893574 893575
Last Closed: 2013-01-09 13:55:35 EST
Type: Bug
Attachments: none
Description Christoph Torlinsky 2012-10-10 17:56:38 EDT
Description of problem:

While debugging fencing and other components, I noticed that on my virtual
test cluster I was able to join the fence domain on a non-SCSI-3 PR compliant
device. Very odd; this should error out or at least throw an exception/warning.

Version-Release number of selected component (if applicable):

6.3
How reproducible:

Try debugging (I'm doing fencing debugs)

Steps to Reproduce:
1. Form a cluster (generic two-node).
2. Check the device (in this case /dev/sdc, which is a shared virtual VDI)
with sg_persist (example: # sg_persist --out --register --param-sark=123abc /dev/sdc).
This should error out:
  ATA VBOX HARDDISK 1.0
  Peripheral device type: disk
  PR out: command not supported

3. Turn on debugging in the cluster and configure fence_scsi:
ccs -h hanode1 --addfencedev scsi3 agent=fence_scsi devices=/dev/sdb
ccs -h hanodeq --addmethod DiskRes hanode1
ccs -h hanodeq --addmethod DiskRes hanode2
ccs -h hanode1 --addfenceinst scsi3 hanode1 DiskRes
ccs -h hanode1 --addfenceinst scsi3 hanode2 DiskRes

But do NOT configure unfencing.

Start up and activate the cluster, then look at fenced.

Actual results:
Oct 10 23:30:23 fenced fenced 3.0.12.1 started
Oct 10 23:30:26 fenced cpg_join fenced:daemon ...
Oct 10 23:30:26 fenced setup_cpg_daemon 11
Oct 10 23:30:26 fenced group_mode 3 compat 0
Oct 10 23:30:26 fenced fenced:daemon conf 2 1 0 memb 1 2 join 1 left
Oct 10 23:30:26 fenced fenced:daemon ring 1:180 2 memb 1 2
Oct 10 23:30:27 fenced receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
Oct 10 23:30:27 fenced daemon node 2 max 0.0.0.0 run 0.0.0.0
Oct 10 23:30:27 fenced daemon node 2 join 1349904626 left 0 local quorum 1349904623
Oct 10 23:30:27 fenced run protocol from nodeid 2
Oct 10 23:30:27 fenced receive_protocol from 1 max 1.1.1.0 run 0.0.0.0
Oct 10 23:30:27 fenced daemon node 1 max 0.0.0.0 run 0.0.0.0
Oct 10 23:30:27 fenced daemon node 1 join 1349904626 left 0 local quorum 1349904623
Oct 10 23:30:27 fenced daemon run 1.1.1 max 1.1.1
Oct 10 23:30:27 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
Oct 10 23:30:27 fenced daemon node 1 max 1.1.1.0 run 0.0.0.0
Oct 10 23:30:27 fenced daemon node 1 join 1349904626 left 0 local quorum 1349904623
Oct 10 23:30:28 fenced client connection 3 fd 14
Oct 10 23:30:28 fenced /cluster/fence_daemon/@post_join_delay is 20
Oct 10 23:30:28 fenced /cluster/fence_daemon/@post_fail_delay is 3
Oct 10 23:30:28 fenced added 2 nodes from ccs
Oct 10 23:30:28 fenced cpg_join fenced:default ...
Oct 10 23:30:28 fenced fenced:default conf 2 1 0 memb 1 2 join 1 left
Oct 10 23:30:28 fenced add_change cg 1 joined nodeid 1
Oct 10 23:30:28 fenced add_change cg 1 m 2 j 1 r 0 f 0
Oct 10 23:30:28 fenced check_ringid cluster 180 cpg 0:0
Oct 10 23:30:28 fenced fenced:default ring 1:180 2 memb 1 2
Oct 10 23:30:28 fenced check_ringid done cluster 180 cpg 1:180
Oct 10 23:30:28 fenced check_quorum done
Oct 10 23:30:28 fenced send_start 1:1 flags 1 started 0 m 2 j 1 r 0 f 0
Oct 10 23:30:28 fenced receive_start 1:1 len 152
Oct 10 23:30:28 fenced match_change 1:1 matches cg 1
Oct 10 23:30:28 fenced wait_messages cg 1 need 1 of 2
Oct 10 23:30:28 fenced receive_start 2:5 len 152
Oct 10 23:30:28 fenced match_change 2:5 matches cg 1
Oct 10 23:30:28 fenced wait_messages cg 1 got all 2
Oct 10 23:30:28 fenced set_master from 0 to complete node 2
Oct 10 23:30:28 fenced receive_complete 2:5 len 152
Oct 10 23:48:21 fenced cluster node 2 removed seq 184
Oct 10 23:48:21 fenced fenced:daemon conf 1 0 1 memb 1 join left 2
Oct 10 23:48:21 fenced fenced:daemon conf 1 0 1 memb 1 join left 2
Oct 10 23:48:21 fenced fenced:daemon ring 1:184 1 memb 1
Oct 10 23:48:21 fenced fenced:default conf 1 0 1 memb 1 join left 2
Oct 10 23:48:21 fenced add_change cg 2 remove nodeid 2 reason 3
Oct 10 23:48:21 fenced add_change cg 2 m 1 j 0 r 1 f 1
Oct 10 23:48:21 fenced add_victims node 2
Oct 10 23:48:21 fenced check_ringid cluster 184 cpg 1:180
Oct 10 23:48:21 fenced fenced:default ring 1:184 1 memb 1
Oct 10 23:48:21 fenced check_ringid done cluster 184 cpg 1:184
Oct 10 23:48:21 fenced check_quorum done
Oct 10 23:48:21 fenced send_start 1:2 flags 2 started 1 m 1 j 0 r 1 f 1
Oct 10 23:48:21 fenced receive_start 1:2 len 152
Oct 10 23:48:21 fenced match_change 1:2 matches cg 2
Oct 10 23:48:21 fenced wait_messages cg 2 got all 1
Oct 10 23:48:21 fenced set_master from 2 to complete node 1
Oct 10 23:48:21 fenced delay post_fail_delay 3 quorate_from_last_update 0
Oct 10 23:48:22 fenced cluster node 2 added seq 192
Oct 10 23:48:22 fenced fenced:daemon conf 2 1 0 memb 1 2 join 2 left
Oct 10 23:48:22 fenced fenced:daemon ring 1:192 2 memb 1 2
Oct 10 23:48:22 fenced receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
Oct 10 23:48:22 fenced daemon node 2 max 0.0.0.0 run 0.0.0.0
Oct 10 23:48:22 fenced daemon node 2 join 1349905702 left 1349905701 local quorum 1349904623


Expected results:

I would not have expected to be able to join a fence domain here.

Additional info:

I used a shared VBOX device for the LUNs (it was all I had); it is not SCSI-3 PR compliant.
Comment 1 Christoph Torlinsky 2012-10-10 17:59:19 EDT
I did actually use /dev/sdc (ignore the /dev/sdb)
# sg_persist --out --register --param-sark=123abc /dev/sdc
  ATA       VBOX HARDDISK     1.0 
  Peripheral device type: disk
PR out:, command not supported


So yes, it still starts against it.
Comment 3 Christoph Torlinsky 2012-10-10 18:14:15 EDT
 <fence_daemon post_fail_delay="3" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="hanode1" nodeid="1">
                        <fence>
                                <method name="DiskRes">
                                        <device name="scsi3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="hanode2" nodeid="2">
                        <fence>
                                <method name="DiskRes">
                                        <device name="scsi3"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_scsi" devices="/dev/sdc" name="scsi3"/>
        </fencedevices>
Comment 4 Ryan O'Hara 2012-10-11 10:03:16 EDT
You need to configure unfencing. With your current configuration, fence_scsi is not called at all. In other words, at no point is fence_scsi checking that your device(s) support SCSI-3 PR. If you configure unfencing (as you should), when you run 'service cman start' it will attempt to create registrations/reservations. If that fails, then the node should not be allowed to join the fence domain.
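[Editorial illustration, not part of the original comment: an unfencing configuration for the reporter's setup from comment 3 would look roughly like this. The node, method, and device names are taken from this bug; the <unfence> element with action="on" follows the fence_node(8) convention quoted later in this bug.]

```xml
<clusternode name="hanode1" nodeid="1">
        <fence>
                <method name="DiskRes">
                        <device name="scsi3"/>
                </method>
        </fence>
        <unfence>
                <device name="scsi3" action="on"/>
        </unfence>
</clusternode>
```

The equivalent ccs invocation, per the Cluster Administration guide excerpt in comment 18, would be along the lines of: ccs -h hanode1 --addunfence scsi3 hanode1 action=on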
Comment 5 Christoph Torlinsky 2012-10-11 10:33:48 EDT
Hey there, as mentioned it does error out if unfencing is set. However,
per our documents an unfence action is not necessarily set in all instances,
so the initial SCSI fence registration should also error out when it detects
a non-compliant disk, the way unfencing does; there appears to be no
enforcement during the initial fence step. Fencing should error out too,
which is what I'm saying; currently it allows the node to fence.

 The main use for unfencing is with storage/SAN (non-power) agents.

       When using power-based fencing agents, the fencing  action  itself  is
       supposed  to  turn  a  node  back on after first turning the power off
       (this happens automatically with a "reboot" action, and  needs  to  be
       configured explicitly as "off" + "on" otherwise.)

       When  using  storage-based  fencing  agents, the fencing action is not
       allowed to re-enable a node after disabling it.  Re-enabling a  fenced
       node  is  only safe once the node has been rebooted.  A natural way to
       re-enable a fenced node’s access to storage, is for that node  to  re-
       enable  the  access  itself during its startup process.  The cman init
       script calls fence_node -U (nodename defaults to local  nodename  when
       unfencing).   Unfencing a node without an <unfence> configuration (see
       below) is a no-op.

       The basic differences between fencing and unfencing:

       Fencing

       1. libfence: fence_node(), command line: fence_node nodename

       2. Turns off or disables a node.

       3. Agents run with the default action of "off", "disable" or "reboot".

       4. Performed by a cluster node against another node that fails (by the
          fenced daemon)
Comment 6 Christoph Torlinsky 2012-10-11 10:45:20 EDT
fence_scsi  is  an  I/O  fencing  agent  that uses SCSI-3 persistent reservations to control
       access to shared storage devices. These devices must support SCSI-3 persistent  reservations
       (SPC-3 or greater) as well as the "preempt-and-abort" subcommand.



The above should not work, but it does...

Joining fence domain...                                 [  OK  ]
[root@faust cluster]# ps -ef | grep fence
root      4504     1  0 16:42 ?        00:00:00 fenced
root      4878  4504  0 16:43 ?        00:00:00 /usr/bin/perl /usr/sbin/fence_scsi
Comment 7 Ryan O'Hara 2012-10-11 11:28:05 EDT
Unfencing is not optional with fence_scsi.
Comment 8 Christoph Torlinsky 2012-10-11 11:41:33 EDT
Hey there, I agree. However, fence_scsi works against a non-compliant disk
even before we get to the unfencing bit, which is to be set (page 67 of the
Cluster Admin Guide); the unfencing bit is covered by the fence_node man page.
All I'm saying is that fence_scsi does not appear to be enforcing SCSI-3 compliance at cman startup.
Comment 9 David Teigland 2012-10-11 11:48:03 EDT
fenced (and fence_tool) are agnostic/unaware of any fence agent specifics, like fence_scsi.  So, it won't catch any errors/problems/misconfiguration at the agent level.

As Ryan said, fence_scsi only works if <unfence> is configured in cluster.conf.  When unfence is configured, cman init will run fence_node -U, which causes a node to unfence itself.  It does this *before* doing fence_tool join, and if the unfence step fails, cman init aborts and does not run fence_tool join -- this cman startup failure in init is the only protection against joining the fence domain without first registering SCSI PR via unfencing.

If <unfence> is not configured for fence_scsi, the rest of the stack will not notice or fail.  Ideally, the cluster.conf validation would see that fence_scsi is configured without <unfence> and warn or fail to validate the config.  I'm not sure if this would be possible to check for or not.
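[Editorial illustration, not part of the original comment: the cluster.conf validation David proposes — flagging fence_scsi devices that no node unfences — could be sketched as a standalone check. This is hypothetical code, not part of any shipped tool; the element and attribute names follow the cluster.conf excerpts in this bug.]

```python
import xml.etree.ElementTree as ET

def scsi_devices_missing_unfence(conf_xml):
    """Return names of fence devices with agent="fence_scsi" that no
    clusternode references from an <unfence> section."""
    root = ET.fromstring(conf_xml)
    # All fence devices driven by the fence_scsi agent.
    scsi = {fd.get("name")
            for fd in root.iter("fencedevice")
            if fd.get("agent") == "fence_scsi"}
    # All device names referenced from any node's <unfence> section.
    unfenced = {dev.get("name")
                for node in root.iter("clusternode")
                for unf in node.findall("unfence")
                for dev in unf.iter("device")}
    return sorted(scsi - unfenced)

# A config shaped like the reporter's: fence_scsi configured, no <unfence>.
BAD = """<cluster name="webcluster" config_version="1">
  <clusternodes>
    <clusternode name="hanode1" nodeid="1">
      <fence><method name="DiskRes"><device name="scsi3"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" devices="/dev/sdc" name="scsi3"/>
  </fencedevices>
</cluster>"""

print(scsi_devices_missing_unfence(BAD))  # ['scsi3']
```

A validator like this would have flagged the reporter's configuration before cman startup.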
Comment 10 David Teigland 2012-10-11 11:51:00 EDT
I'm confused about how/why you're running things (e.g. cman startup and fence_scsi) before/without fully configuring cluster.conf. Is this something the documentation encourages?
Comment 11 David Teigland 2012-10-11 12:08:57 EDT
Also, fence_scsi should obviously fail immediately and return/log an error if the device it's operating on does not support SCSI3PR.
Comment 12 Christoph Torlinsky 2012-10-11 12:45:18 EDT
I agree with you. For what it's worth, the cluster will start up without
unfencing being in the cluster.conf and will add whatever fence_scsi device
I pass along regardless, as you can see here:

[root@faust cluster]# ccs -h hanode1 --lsfenceinst
hanode1
  DiskRes
    scsi3: 
hanode2
  DiskRes
    scsi3: 
[root@faust cluster]# ccs -h hanode1 --lsfencedev
scsi3: devices=/dev/sdc, agent=fence_scsi
[root@faust cluster]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    312   2012-10-11 16:53:08  hanode1
   2   M    324   2012-10-11 17:06:11  hanode2
[root@faust cluster]# 


I reckon either something in cman startup needs to catch this, or the
sg_persist commands don't fully test the devices out when called from
one of the fence_scsi routines that invoke sg_persist.

[root@faust bin]# sg_persist --out --register --param-sark=123abc /dev/sdc
[root@faust bin]# sg_persist -vv --out --register --param-sark=123abc /dev/sdc
open /dev/sdc with flags=0x800
    inquiry cdb: 12 00 00 00 24 00 
  ATA       
Peripheral device type: disk
open /dev/sdc with flags=0x802
    Persistent Reservation Out cmd: 5f 00 00 00 00 00 00 00 18 00 
    Persistent Reservation Out parameters:
 00     00 00 00 00 00 00 00 00  00 00 00 00 00 12 3a bc    ..............:.
 10     00 00 00 00 00 00 00 00                             ........        
persistent reserve out:  Fixed format, current;  Sense key: Illegal Request
 Additional sense: Invalid command operation code
 Raw sense data (in hex):
        70 00 05 00 00 00 00 0a  00 00 00 00 20 00 00 00    
        00 00                                               
PR out:, command not supported
Comment 13 Ryan O'Hara 2012-10-11 13:07:36 EDT
The sg_persist calls to create registrations are done within fence_scsi when it is given the "on" action (i.e. unfencing). If fence_scsi attempts to create registrations via sg_persist and that command fails, fence_scsi will detect this and return an error. If you do not set up unfencing, this never happens.
Comment 14 David Teigland 2012-10-11 13:13:19 EDT
So you are creating and starting a cluster without any fencing devices, and then adding the SCSI fencing devices once the cluster is already running?  That's something we should prohibit -- we might be able to catch it being done in some cases and report an error, but probably not all, which means we'd need to document this as something you cannot do.
Comment 15 Christoph Torlinsky 2012-10-11 13:14:47 EDT
So in this case cluster init should catch it when unfencing is missing from
the cluster.conf and a fence device is added. I ran some of the sg_persist
commands from the fence_scsi script and they do catch my device not being
compliant; cman nevertheless goes ahead, brings up the cluster, and adds the
SCSI devices as fence devices.
Comment 16 Christoph Torlinsky 2012-10-11 13:17:21 EDT
I'm adding the fence devices first and defining my method, then later deciding
whether to use the unfence action. But I can omit it entirely: the cluster
starts up and lets me add it later, without throwing a warning or stopping me
from doing so against a disk that should not be allowed.
Comment 17 David Teigland 2012-10-11 13:26:06 EDT
It sounds like we need to fix the ccs tool and probably the documentation.  
Adding fence_scsi to cluster.conf requires:
- the entire cluster must first be stopped
- the full configuration must be added (devices, unfencing) before the cluster is started

"whether to use unfence action on or off"
I'm not sure I understand this part, <unfence> always uses the on action.
Is there a procedure or documentation that implies this is a decision to make?
Comment 18 Christoph Torlinsky 2012-10-11 13:34:33 EDT
Yes, there is: page 68 of the RHEL 6 Admin Guide (the latest I could find):
 To configure unfencing for the storage based fence device on this node, execute the following command:
ccs -h host --addunfence fencedevicename node action=on|off
You will need to add a fence method for each node in the cluster. The following commands configure a fence method for each node with the method name SAN. The device for the fence method specifies sanswitch as the device name, which is a device previously configured with the --addfencedev option, as described in Section 5.5, “Configuring Fence Devices”. 

Furthermore, I took my method and fence device out entirely and was still
able to start up the cluster, as you can see here:

[root@faust cluster]# ccs -h hanode1 --rmmethod DiskRes hanode2
ccs -h hanode1 --rmmethod DiskRes hanode1

 ccs -h hanode1 --rmfencedev scsi3
[root@faust cluster]# ccs -h hanode1 --sync --activate
[root@faust cluster]# ccs -h hanode1 --startall
Started hanode1
Started hanode2
[root@faust cluster]# clustat
Cluster Status for webcluster @ Thu Oct 11 19:26:47 2012
Member Status: Quorate

 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 hanode1                                                       1 Online, Local
 hanode2                                                       2 Online
more cluster.conf
<?xml version="1.0"?>
<cluster config_version="45" name="webcluster">
        <fence_daemon post_fail_delay="3" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="hanode1" nodeid="1">
                        <fence/>
                </clusternode>
                <clusternode name="hanode2" nodeid="2">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <resources>
                        <script file="/etc/init.d/httpd start" name="httpd"/>
                </resources>
                <service autostart="0" name="webserver" recovery="relocate">
                        <apache config_file="/etc/httpd/conf/httpd.conf" name="web" server_root="/etc/httpd" shutdown_wait="0"/>
                        <ip address="10.1.3.100"/>
                </service>
        </rm>
        <logging debug="on"/>
</cluster>


So I take it back: fence_scsi does all the right things, and so does fence_node.
They just never get called to enforce anything in the cluster unless the unfence action is set, and that is not enforced at cluster startup.
Comment 19 David Teigland 2012-10-11 13:57:36 EDT
Thanks for finding this.  The docs should be updated to state that the cluster must be stopped while adding a fence device that requires unfencing.  (It's probably safest to stop the cluster while adding any fencing, but the argument for that is not quite so clear.)

It would also be nice if the ccs command refused to add unfencing to a live cluster, with an error stating the same.
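[Editorial illustration, not part of the original comment: the refusal David suggests could be a small guard in a ccs-like tool. The function and error message below are invented for illustration only.]

```python
class LiveClusterError(Exception):
    """Raised when unfencing configuration is attempted on a running cluster."""

def check_unfence_change_allowed(cluster_running):
    """Refuse to add unfencing while the cluster is up, per comment 19.

    Hypothetical guard: a real tool would determine cluster state itself
    rather than take it as a parameter.
    """
    if cluster_running:
        raise LiveClusterError(
            "unfencing must be configured before the cluster is started; "
            "stop the cluster, add the <unfence> sections, then start it")
    return True
```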
Comment 20 Christoph Torlinsky 2012-10-11 14:17:36 EDT
Yeah, that would be safest: when a SCSI-3 PR fence device is configured, the
cluster should be down (to ensure there are no changes in state), not running,
and unfencing should be enforced so the sg_persist check runs to really
validate that the devices comply with the SCSI-3 spec.

I would check for incorrect configuration during cman startup and at the
ccs command-line level. This is an easy one to get wrong because the
documentation is not clear enough, and getting it wrong can cause serious
outages or data corruption, as you can imagine. The utilities all do the
right thing; the cluster itself is just loose about enforcing them. I would
not rely on documentation fixes alone, in my opinion.
SCSI-3 fencing, in addition to other (network-based) methods, is a great
thing to have working.
Comment 21 Ryan O'Hara 2012-10-11 14:47:38 EDT
Can we agree this is not a fence_scsi bug? Perhaps a bug for documentation and/or cman init script.
Comment 22 Christoph Torlinsky 2012-10-11 15:08:33 EDT
Sure, fence_scsi appears to be OK (the sg_persist calls in it do the test); it's just not being called right. It seems to me like both a ccs issue and a cman init (cluster startup) issue, in how fencing of SCSI devices is invoked at configuration and startup.
Fixing the documentation seems like a loose workaround to me, but it should be done as well.
Comment 23 Christoph Torlinsky 2012-10-11 15:53:50 EDT
Just to clarify: counting on customers to read documentation and hoping that
will correct cluster state is probably not a good idea, especially when
fencing is involved at the storage layer. It would be better to ensure that
the cluster does not fire up with incorrect settings/storage. Just my thoughts here.
Comment 24 Fabio Massimo Di Nitto 2012-10-15 23:43:34 EDT
(In reply to comment #21)
> Can we agree this is not a fence_scsi bug? Perhaps a bug for documentation
> and/or cman init script.

(In reply to comment #9)
> fenced (and fence_tool) are agnostic/unaware of any fence agent specifics,
> like fence_scsi.  So, it won't catch any errors/problems/misconfiguration at
> the agent level.

cman init lands in the same area as fenced and tools. It is agnostic/unaware of specific agents requirements.

Either reassign to documentation or to the different UIs.
Comment 26 Fabio Massimo Di Nitto 2013-01-09 08:45:02 EST
Moving to luci; clones for ccs and docs will arrive right after.
Comment 27 Jan Pokorný 2013-01-09 10:19:46 EST
Issue delegation was too slow here, as this should already be fixed in luci
going into 6.4.

I added enforcement of unfencing (even if it wasn't configured initially,
it will be forced, and the user is explicitly notified about this) in
connection with [bug 877098] (cf. comments 15, 18, 20), which targeted
fence_sanlock. However, the "unfencing is mandatory" principle is the same
for fence_scsi, so this one got it straight away (a trivial change).


Christoph (or Fabio), please check that >=luci-0.26.0-35.el6 (as in
RHEL 6.4 in preparation) works for you in this regard.
I am also asking Radek to have a look at this specifically.

If no negative feedback occurs, I will close this bug shortly (or feel
free to beat me to it).


P.S. I would note that it is irresponsible *not* to state such a critical
constraint in the man page (see [bug 887349]) in the first place.
Comment 29 Fabio Massimo Di Nitto 2013-01-09 13:55:35 EST
OK, you can just close the bug if this is already handled by luci.

I honestly didn't have time to look at the history, and the original bug was targeting components that had "no clue" about fencing configuration. The original was also targeted for 6.5, but apparently "cloning" decided to force it back to 6.4.

*** This bug has been marked as a duplicate of bug 877098 ***
