Bug 865161
| Field | Value |
|---|---|
| Summary | fence_scsi requires both fencing and unfencing section in cluster.conf |
| Product | Red Hat Enterprise Linux 6 |
| Component | luci |
| Version | 6.3 |
| Status | CLOSED DUPLICATE |
| Severity | unspecified |
| Priority | unspecified |
| Reporter | Christoph Torlinsky <christ> |
| Assignee | Ryan McCabe <rmccabe> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | ccaulfie, cfeist, christ, cluster-maint, dyasny, fdinitto, jpokorny, lhh, mherbert, rpeterso, rsteiger, teigland |
| Target Milestone | rc |
| Keywords | TestOnly |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2013-01-09 18:55:35 UTC |
| Clones | 893574, 893575 |
| Bug Depends On | 887349, 987070 |
| Bug Blocks | 893574, 893575 |

Description
Christoph Torlinsky
2012-10-10 21:56:38 UTC
I did actually use /dev/sdc (ignore the /dev/sdb):

    # sg_persist --out --register --param-sark=123abc /dev/sdc
      ATA       VBOX HARDDISK     1.0
      Peripheral device type: disk
    PR out:, command not supported

So yes, it still starts against it:

    <fence_daemon post_fail_delay="3" post_join_delay="20"/>
    <clusternodes>
      <clusternode name="hanode1" nodeid="1">
        <fence>
          <method name="DiskRes">
            <device name="scsi3"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="hanode2" nodeid="2">
        <fence>
          <method name="DiskRes">
            <device name="scsi3"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
      <fencedevice agent="fence_scsi" devices="/dev/sdc" name="scsi3"/>
    </fencedevices>

You need to configure unfencing. With your current configuration, fence_scsi is not called at all. In other words, at no point is fence_scsi checking that your device(s) support SCSI-3 PR. If you configure unfencing (as you should), when you run 'service cman start' it will attempt to create registrations/reservations. If that fails, then the node should not be allowed to join the fence domain.

Hey there, as mentioned, it does error out if unfence is set to on. However, per our documents an unfence action may not necessarily be set in all instances, so the initial SCSI fence registration should also error out when it detects a non-compliant disk, like unfence does. There appears to be no enforcement during the initial fence step. Fencing should error out too, which is what I'm saying; currently it allows it to fence.

The main use for unfencing is with storage/SAN (non-power) agents. When using power-based fencing agents, the fencing action itself is supposed to turn a node back on after first turning the power off (this happens automatically with a "reboot" action, and needs to be configured explicitly as "off" + "on" otherwise.) When using storage-based fencing agents, the fencing action is not allowed to re-enable a node after disabling it.
Re-enabling a fenced node is only safe once the node has been rebooted. A natural way to re-enable a fenced node’s access to storage, is for that node to re-enable the access itself during its startup process. The cman init script calls fence_node -U (nodename defaults to local nodename when unfencing). Unfencing a node without an <unfence> configuration (see below) is a no-op.

The basic differences between fencing and unfencing:

Fencing
1. libfence: fence_node(), command line: fence_node nodename
2. Turns off or disables a node.
3. Agents run with the default action of "off", "disable" or "reboot".
4. Performed by a cluster node against another node that fails (by the fenced daemon)

fence_scsi is an I/O fencing agent that uses SCSI-3 persistent reservations to control access to shared storage devices. These devices must support SCSI-3 persistent reservations (SPC-3 or greater) as well as the "preempt-and-abort" subcommand.

above should not work, but it does...

    Joining fence domain...                                    [  OK  ]
    [root@faust cluster]# ps -ef | grep fence
    root      4504     1  0 16:42 ?        00:00:00 fenced
    root      4878  4504  0 16:43 ?        00:00:00 /usr/bin/perl /usr/sbin/fence_scsi

Unfencing is not optional with fence_scsi.

Hey there, I agree - however - fence_scsi works against a non-compliant disk, even before we get to the unfencing bit which is to be set (page 67 of the Cluster Admin Guide). The unfencing bit is covered by the fence_node man page. All I'm saying is that it appears fence_scsi is not really enforcing SCSI3-compliance at cman startup.

fenced (and fence_tool) are agnostic/unaware of any fence agent specifics, like fence_scsi. So, it won't catch any errors/problems/misconfiguration at the agent level.

As Ryan said, fence_scsi only works if <unfence> is configured in cluster.conf. When unfence is configured, cman init will run fence_node -U, which causes a node to unfence itself.
It does this *before* doing fence_tool join, and if the unfence step fails, cman init aborts and does not run fence_tool join. This cman startup failure in init is the only protection against joining the fence domain without first registering SCSI PR via unfencing. If <unfence> is not configured for fence_scsi, the rest of the stack will not notice or fail.

Ideally, the cluster.conf validation would see that fence_scsi is configured without <unfence> and warn or fail to validate the config. I'm not sure if this would be possible to check for or not.

I'm confused about how/why you're running things, e.g. cman startup and fence_scsi, before/without fully configuring cluster.conf? Is this something that the documentation is encouraging?

Also, fence_scsi should obviously fail immediately and return/log an error if the device it's operating on does not support SCSI3PR.

I agree with you. For what it's worth, the cluster will start up without unfence in cluster.conf and add the fence_scsi device I pass along regardless, as you can see here:

    [root@faust cluster]# ccs -h hanode1 --lsfenceinst
    hanode1
      DiskRes
        scsi3:
    hanode2
      DiskRes
        scsi3:
    [root@faust cluster]# ccs -h hanode1 --lsfencedev
    scsi3: devices=/dev/sdc, agent=fence_scsi
    [root@faust cluster]# cman_tool nodes
    Node  Sts   Inc   Joined               Name
       1   M    312   2012-10-11 16:53:08  hanode1
       2   M    324   2012-10-11 17:06:11  hanode2
    [root@faust cluster]#

I reckon it's either something in the cman startup that needs to be caught, or the sg_persist commands don't fully test the devices when they are called from one of the routines that call sg_persist:

    [root@faust bin]# sg_persist --out --register --param-sark=123abc /dev/sdc
    sg_persist -vv --out --register --param-sark=123abc /dev/sdc
    open /dev/sdc with flags=0x800
        inquiry cdb: 12 00 00 00 24 00
      ATA
      Peripheral device type: disk
    open /dev/sdc with flags=0x802
        Persistent Reservation Out cmd: 5f 00 00 00 00 00 00 00 18 00
        Persistent Reservation Out parameters:
     00     00 00 00 00 00 00 00 00  00 00 00 00 00 12 3a bc    ..............:.
     10     00 00 00 00 00 00 00 00                             ........
    persistent reserve out:  Fixed format, current;  Sense key: Illegal Request
     Additional sense: Invalid command operation code
     Raw sense data (in hex):
            70 00 05 00 00 00 00 0a  00 00 00 00 20 00 00 00
            00 00
    PR out:, command not supported

The sg_persist calls to create registrations are done within fence_scsi when it is given the "on" action (i.e. unfencing). If fence_scsi attempts to create registrations via sg_persist and that command fails, fence_scsi will detect this and return an error. If you do not set up unfencing, this will never happen.

So you are creating and starting a cluster without any fencing devices, and then adding the SCSI fencing devices once the cluster is already running? That's something that we should prohibit. We might be able to catch it being done in some cases and report an error, but probably not all, which means we'd need to document this as something you cannot do.

So in this case the cluster init should catch it when unfence is missing in cluster.conf and a fence device is added. I ran some of the sg_persist commands from the fence_scsi script and they do catch that my device is not compliant; cman still goes ahead, however, brings up the cluster, and adds the SCSI devices as fence devices. I'm adding the fence devices first and defining my method, then later deciding whether to use unfence action on or off. But I can omit it entirely, and the cluster starts up and lets me add it later without throwing a warning or stopping me from doing so against a disk that it should not be allowed against.

It sounds like we need to fix the ccs tool and probably the documentation. Adding fence_scsi to cluster.conf requires:
- the entire cluster must first be stopped
- the full configuration must be added (devices, unfencing) before the cluster is started

"whether to use unfence action on or off"

I'm not sure I understand this part, <unfence> always uses the on action.
Is there a procedure or documentation that implies this is a decision to make?

Yes, there is: page 68 of the RHEL6 Admin Guide (latest I could find):

To configure unfencing for the storage based fence device on this node, execute the following command:

    ccs -h host --addunfence fencedevicename node action=on|off

You will need to add a fence method for each node in the cluster. The following commands configure a fence method for each node with the method name SAN. The device for the fence method specifies sanswitch as the device name, which is a device previously configured with the --addfencedev option, as described in Section 5.5, “Configuring Fence Devices”.

Furthermore, I took my method and fence device out entirely, and was able to start up the cluster, as you can see here:

    [root@faust cluster]# ccs -h hanode1 --rmmethod DiskRes hanode2
    ccs -h hanode1 --rmmethod DiskRes hanode1
    ccs -h hanode1 --rmfencedev scsi3
    [root@faust cluster]# ccs -h hanode1 --sync --activate
    [root@faust cluster]# ccs -h hanode1 --startall
    Started hanode1
    Started hanode2
    [root@faust cluster]# clustat
    Cluster Status for webcluster @ Thu Oct 11 19:26:47 2012
    Member Status: Quorate

     Member Name        ID   Status
     ------ ----        ---- ------
     hanode1            1    Online, Local
     hanode2            2    Online

more cluster.conf

    <?xml version="1.0"?>
    <cluster config_version="45" name="webcluster">
      <fence_daemon post_fail_delay="3" post_join_delay="20"/>
      <clusternodes>
        <clusternode name="hanode1" nodeid="1">
          <fence/>
        </clusternode>
        <clusternode name="hanode2" nodeid="2">
          <fence/>
        </clusternode>
      </clusternodes>
      <cman expected_votes="1" two_node="1"/>
      <fencedevices/>
      <rm>
        <resources>
          <script file="/etc/init.d/httpd start" name="httpd"/>
        </resources>
        <service autostart="0" name="webserver" recovery="relocate">
          <apache config_file="/etc/httpd/conf/httpd.conf" name="web" server_root="/etc/httpd" shutdown_wait="0"/>
          <ip address="10.1.3.100"/>
        </service>
      </rm>
      <logging debug="on"/>
    </cluster>

So I take it back: fence_scsi does all the right things, and so does fence_node. They just never get called to enforce anything in the cluster (unless the unfence action is set), and that is not enforced at cluster startup.

Thanks for finding this. The docs should be updated to state that the cluster must be stopped while adding a fence device that requires unfencing. (It's probably safest to stop the cluster while adding any fencing, but the argument for that is not quite so clear.) It would also be nice if the ccs command refused to add unfencing to a live cluster, with an error stating the same.

Yeah, that would be safest. When a SCSI3-PR fence device is configured, the cluster should be down (to ensure there are no changes in state) and not in a running state, and it should also enforce the unfence so that the sg_persist check runs and really validates that the devices comply with the SCSI3 instructions per the SCSI spec. I would check for incorrect configuration both during cman startup and at the ccs command line level. This is an easy one to get wrong because the documentation is not clear enough, and if it's gotten wrong it can cause some serious outages or data corruption, as you can imagine. It's like the utilities all do the right thing, just the cluster itself is loose about enforcing with them. I would not rely on documentation fixes alone, in my opinion. SCSI3 fencing in addition to other methods (network based) is a great thing to have working.

Can we agree this is not a fence_scsi bug? Perhaps a bug for documentation and/or cman init script.

Sure, fence_scsi appears to be OK (the sg_persist calls in it do the test). It's just not being called at the right time, which seems to me like both a ccs issue and a cman init (cluster startup) issue in how fencing of SCSI devices is invoked upon configuration and startup. Fixing the documentation seems like a loose workaround to me, but should be done also.
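
The configuration check asked for above (fence_scsi configured without a matching <unfence>) can be sketched as follows. This is only a minimal illustration, not the code of ccs, luci, or the cman init script: the function name `nodes_missing_unfence`, the constant `UNFENCE_REQUIRED_AGENTS`, and the sample `BROKEN` configuration are hypothetical, and the only agent it treats as requiring unfencing is the one named in this report.

```python
import xml.etree.ElementTree as ET

# Assumption: only fence_scsi is listed, because it is the agent discussed here.
UNFENCE_REQUIRED_AGENTS = {"fence_scsi"}


def nodes_missing_unfence(conf_xml):
    """Return the names of cluster nodes that fence through an agent that
    requires unfencing but have no <unfence> entry (action="on") for it."""
    root = ET.fromstring(conf_xml)
    # Names of the fence devices whose agent requires unfencing.
    needs_unfence = {
        dev.get("name")
        for dev in root.findall("./fencedevices/fencedevice")
        if dev.get("agent") in UNFENCE_REQUIRED_AGENTS
    }
    missing = []
    for node in root.findall("./clusternodes/clusternode"):
        fenced_with = {d.get("name") for d in node.findall("./fence/method/device")}
        unfenced_with = {
            d.get("name")
            for d in node.findall("./unfence/device")
            if d.get("action") == "on"
        }
        # A node is misconfigured if it fences with such a device
        # but does not unfence with the same device.
        if (fenced_with & needs_unfence) - unfenced_with:
            missing.append(node.get("name"))
    return missing


# Hypothetical sample: hanode1 lacks <unfence>, hanode2 has it.
BROKEN = """<cluster config_version="1" name="webcluster">
  <clusternodes>
    <clusternode name="hanode1" nodeid="1">
      <fence><method name="DiskRes"><device name="scsi3"/></method></fence>
    </clusternode>
    <clusternode name="hanode2" nodeid="2">
      <fence><method name="DiskRes"><device name="scsi3"/></method></fence>
      <unfence><device name="scsi3" action="on"/></unfence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" devices="/dev/sdc" name="scsi3"/>
  </fencedevices>
</cluster>"""

if __name__ == "__main__":
    print(nodes_missing_unfence(BROKEN))  # ['hanode1']
```

A check like this only inspects the configuration. It does not test whether the device actually supports SCSI-3 persistent reservations; per the comments above, that test happens when fence_scsi runs with the "on" action.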
Just to clarify, counting on customers to read documentation and hoping it will correct cluster state is probably not a good idea, especially when fencing is involved at the storage layer. It would be better to ensure that the cluster does not fire up with incorrect settings / storage. Just my thoughts here.

(In reply to comment #21)
> Can we agree this is not a fence_scsi bug? Perhaps a bug for documentation
> and/or cman init script.

(In reply to comment #9)
> fenced (and fence_tool) are agnostic/unaware of any fence agent specifics,
> like fence_scsi. So, it won't catch any errors/problems/misconfiguration at
> the agent level.

cman init lands in the same area as fenced and the tools. It is agnostic/unaware of specific agents' requirements. Either reassign to documentation or to the different UIs.

Moving to luci, clone for ccs and docs will arrive right after.

Issue delegation was too slow here, as it should already be fixed in luci going to 6.4. I added enforcement of unfencing (even if it wasn't configured initially, it will be forced, and the user is notified of this explicitly) in connection with [bug 877098] (cf. comments 15, 18, 20), which targeted fence_sanlock. The "unfence is mandatory" principle is the same for fence_scsi, so this one got it straight away (trivial change).

Christoph (or Fabio), please check that >=luci-0.26.0-35.el6 (as in RHEL 6.4 in preparation) works for you in this regard. I am also asking Radek to have a look at this specifically. If no negative feedback occurs, I am closing this bug shortly (or feel free to beat me to it).

P.S. I would note that it is irresponsible *not* to state such a critical constraint in the man page (see [bug 887349]) in the first place.

OK, you can just close the bug if this is already handled by luci. I honestly didn't have time to look at the history, and the original bug was targeting components that had "no clue" about fencing configuration. The original was also targeted for 6.5 but apparently "cloning" decided to force it back to 6.4.

*** This bug has been marked as a duplicate of bug 877098 ***
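
The behaviour described for luci (unfencing is forced when it was not configured, and the user is told about it) can be illustrated with the sketch below. It is not luci's implementation, which is not shown in this report: the function `force_unfence`, its `agents` parameter, and the sample `CONF` are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET


def force_unfence(conf_xml, agents=("fence_scsi",)):
    """Add <unfence> entries with action="on" to every node that fences
    through one of `agents` but lacks them.
    Returns (new_xml, names_of_changed_nodes)."""
    root = ET.fromstring(conf_xml)
    targets = {
        dev.get("name")
        for dev in root.findall("./fencedevices/fencedevice")
        if dev.get("agent") in agents
    }
    changed = []
    for node in root.findall("./clusternodes/clusternode"):
        used = {d.get("name") for d in node.findall("./fence/method/device")}
        unfence = node.find("unfence")
        present = set() if unfence is None else {
            d.get("name")
            for d in unfence.findall("device")
            if d.get("action") == "on"
        }
        todo = sorted((used & targets) - present)
        if not todo:
            continue
        if unfence is None:
            unfence = ET.SubElement(node, "unfence")
        for name in todo:
            # Unfencing always uses the "on" action.
            ET.SubElement(unfence, "device", {"name": name, "action": "on"})
        changed.append(node.get("name"))
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical sample modelled on the configuration in the description.
CONF = """<cluster config_version="1" name="webcluster">
  <clusternodes>
    <clusternode name="hanode1" nodeid="1">
      <fence><method name="DiskRes"><device name="scsi3"/></method></fence>
    </clusternode>
    <clusternode name="hanode2" nodeid="2">
      <fence><method name="DiskRes"><device name="scsi3"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" devices="/dev/sdc" name="scsi3"/>
  </fencedevices>
</cluster>"""

if __name__ == "__main__":
    new_xml, changed = force_unfence(CONF)
    for name in changed:
        print("unfencing was added for node", name)
```

Running the function a second time on its own output changes nothing, so it is safe to apply on every configuration save. As the thread notes, the cluster should still be stopped while such a fence device is added.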