Bug 738384 - fix scsi unfencing to allow simultaneous unfences
Summary: fix scsi unfencing to allow simultaneous unfences
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: fence-agents
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 6.3
Assignee: Ryan O'Hara
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 743047
Reported: 2011-09-14 17:01 UTC by Jaroslav Kortus
Modified: 2011-12-06 12:23 UTC (History)

Fixed In Version: fence-agents-3.1.5-10.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-06 12:23:17 UTC


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:1599 normal SHIPPED_LIVE fence-agents bug fix and enhancement update 2011-12-06 00:51:16 UTC

Description Jaroslav Kortus 2011-09-14 17:01:42 UTC
Description of problem:
When fence_node -U is run during unfencing operations (registering with shared SCSI devices), the unfencing may fail with fence_scsi: [error] main::do_reserve (err=99).

This happens because no reservation is present on the device when the command is issued. All nodes fire the reservation command at the same time, so some of them succeed and some of them fail.

It would be good if fence_scsi, after getting error 99, checked again whether a reservation is present. If it is, just continue. If it is not, wait for a random time (0-5 seconds) and then try once more or fail.
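
The retry proposed above could look roughly like the following Python sketch. It only illustrates the control flow and is not the actual fence_scsi code. The names reserve_with_retry, try_reserve and reservation_present are hypothetical, the device access is left to caller-supplied functions, and the final re-check after the second attempt is an added assumption rather than part of the request.

```python
import random
import time
from typing import Callable


def reserve_with_retry(
    try_reserve: Callable[[], bool],
    reservation_present: Callable[[], bool],
    max_wait: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the device ends up reserved (by this node or another).

    try_reserve         -- hypothetical callable; True when our reserve succeeds,
                           False when it fails (error 99 in the report).
    reservation_present -- hypothetical callable; True when any reservation
                           exists on the device.
    """
    if try_reserve():
        return True  # we created the reservation ourselves

    # Our reserve failed. Another node may simply have won the race.
    if reservation_present():
        return True

    # No reservation yet: back off for a random 0..max_wait seconds so that
    # nodes started at the same moment do not collide again.
    sleep(random.uniform(0.0, max_wait))

    # Try once more; the extra check afterwards is an assumption of this sketch.
    return try_reserve() or reservation_present()
```

Passing the two checks in as callables keeps the sketch independent of how the reservation is actually issued and read back (for example via sg_persist).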

Version-Release number of selected component (if applicable):
fence-agents-3.1.5-8.el6.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Start the cluster with SCSI reservations configured (the failure may already occur at this step)
2. Remove all reservations/registrations from the device
3. Issue fence_node -U cluster-wide (via cssh, for example)
  
Actual results:
- some of the nodes will fail the unfence operation
- "kernel: sd 1:0:1:1: reservation conflict" or similar message in syslog
- error 99 appears in the fence_scsi log file (if logging is configured)

Expected results:
No failure when a retry is possible.

Additional info:

Comment 9 Jaroslav Kortus 2011-09-16 12:21:07 UTC
(07:16:41) [root@marathon-01:/usr/sbin]$ for dev in /dev/sda /dev/sdb; do sg_persist -i -k $dev; sg_persist -i -r $dev; done
  WINSYS    SX2394R           361H
  Peripheral device type: disk
  PR generation=0x88, 5 registered reservation keys follow:
    0x512a0001
    0x512a0002
    0x512a0003
    0x512a0005
    0x512a0004
  WINSYS    SX2394R           361H
  Peripheral device type: disk
  PR generation=0x88, Reservation follows:
    Key=0x512a0001
    scope: LU_SCOPE,  type: Write Exclusive, registrants only
  WINSYS    SX2394R           361H
  Peripheral device type: disk
  PR generation=0x98, 5 registered reservation keys follow:
    0x512a0001
    0x512a0002
    0x512a0005
    0x512a0003
    0x512a0004
  WINSYS    SX2394R           361H
  Peripheral device type: disk
  PR generation=0x98, Reservation follows:
    Key=0x512a0001
    scope: LU_SCOPE,  type: Write Exclusive, registrants only

Sep 16 07:16:41 marathon-02 kernel: sd 1:0:1:1: reservation conflict
Sep 16 07:16:41 marathon-02 fence_node[17674]: unfence marathon-02 success

Works as expected with the patch.
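
For anyone who wants to check this kind of output automatically, here is a minimal Python sketch that extracts the registered keys and the reservation holder from text in the format shown above. It assumes the output layout matches this comment exactly. The function names registered_keys and reservation_holder are hypothetical and not part of fence-agents or sg3_utils.

```python
import re
from typing import List, Optional


def registered_keys(output: str) -> List[str]:
    """Return the keys printed by `sg_persist -i -k` (one hex value per line)."""
    return re.findall(r"^[ \t]*(0x[0-9a-fA-F]+)[ \t]*$", output, flags=re.MULTILINE)


def reservation_holder(output: str) -> Optional[str]:
    """Return the key from the `Key=...` line of `sg_persist -i -r`, or None."""
    match = re.search(r"^[ \t]*Key=(0x[0-9a-fA-F]+)", output, flags=re.MULTILINE)
    return match.group(1) if match else None
```

With the /dev/sda output above, registered_keys would list the five 0x512a000N keys in the order printed, and reservation_holder would give 0x512a0001.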

Comment 10 Ryan O'Hara 2011-09-19 17:28:04 UTC
Pushed to master branch.

commit d532e41a3d2a9d85db4b87b80c36119f59534c85

Comment 11 Ryan O'Hara 2011-09-19 17:38:02 UTC
Pushed to RHEL6 branch.

commit e5bf447139c7ba7c128f615a8bcbf46174d0945a

Comment 14 errata-xmlrpc 2011-12-06 12:23:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1599.html

