Description of problem:
When fence_node -U is run during unfencing (registering with shared SCSI devices), the unfence operation may fail with:

fence_scsi: [error] main::do_reserve (err=99)

This happens because no reservation is present on the device when the command is issued, and all of the nodes fire the reservation command at once, so some of them succeed and some of them fail.

It would be good if fence_scsi, after getting error 99, checked again whether a reservation is present. If it is, just continue. If it is not, wait for a random time (0-5 seconds) and then try once more or fail.

Version-Release number of selected component (if applicable):
fence-agents-3.1.5-8.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a cluster with SCSI reservations configured (the failure may already occur at this point).
2. Remove all reservations/registrations from the device.
3. Issue a cluster-wide fence_node -U (via cssh, for example).

Actual results:
- Some of the nodes fail the unfence operation.
- "kernel: sd 1:0:1:1: reservation conflict" or a similar message appears in syslog.
- Error 99 is present in the fence_scsi log file (if logging is configured).

Expected results:
No failure if a retry was possible.

Additional info:
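The retry behaviour requested above could look roughly like the following Python sketch. It is not the fence_scsi code and not the patch that was eventually committed; do_reserve and reservation_present are hypothetical callables standing in for the agent's own reserve and reservation-check steps, and the return-code convention (0 for success, 99 for the failure seen in this report) is an assumption made for illustration.

```python
import random
import time

ERR_RESERVE = 99  # error code seen in the fence_scsi log in this report


def reserve_with_retry(do_reserve, reservation_present, max_wait=5.0,
                       sleep=time.sleep, rng=random.uniform):
    """Reserve the device, retrying once after a random delay on error 99.

    do_reserve          -- hypothetical callable, returns 0 on success or an
                           error code on failure
    reservation_present -- hypothetical callable, returns True if any node
                           currently holds the reservation
    """
    err = do_reserve()
    if err != ERR_RESERVE:
        # Success (0) or an unrelated error: pass it through unchanged.
        return err

    # Another node may have won the race. If a reservation now exists,
    # there is nothing left to do.
    if reservation_present():
        return 0

    # Still no reservation: wait a random 0..max_wait seconds so the nodes
    # do not collide again, then try exactly once more.
    sleep(rng(0.0, max_wait))
    return do_reserve()
```

The random delay is what keeps the nodes from retrying in lockstep; the sleep and rng parameters exist only so the function can be exercised without real waiting.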
(07:16:41) [root@marathon-01:/usr/sbin]$ for dev in /dev/sda /dev/sdb; do sg_persist -i -k $dev; sg_persist -i -r $dev; done
  WINSYS SX2394R 361H
  Peripheral device type: disk
  PR generation=0x88, 5 registered reservation keys follow:
    0x512a0001
    0x512a0002
    0x512a0003
    0x512a0005
    0x512a0004
  WINSYS SX2394R 361H
  Peripheral device type: disk
  PR generation=0x88, Reservation follows:
    Key=0x512a0001
    scope: LU_SCOPE, type: Write Exclusive, registrants only
  WINSYS SX2394R 361H
  Peripheral device type: disk
  PR generation=0x98, 5 registered reservation keys follow:
    0x512a0001
    0x512a0002
    0x512a0005
    0x512a0003
    0x512a0004
  WINSYS SX2394R 361H
  Peripheral device type: disk
  PR generation=0x98, Reservation follows:
    Key=0x512a0001
    scope: LU_SCOPE, type: Write Exclusive, registrants only

Sep 16 07:16:41 marathon-02 kernel: sd 1:0:1:1: reservation conflict
Sep 16 07:16:41 marathon-02 fence_node[17674]: unfence marathon-02 success

Works as expected with the patch.
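For anyone repeating this verification across many devices, the key listing can be checked with a small script instead of by eye. The Python sketch below assumes the output text looks like the sg_persist -i -k listing quoted in this comment; parse_registered_keys is a hypothetical helper name, and other sg3_utils versions may format the output differently.

```python
import re


def parse_registered_keys(output):
    """Return the registered keys (as ints) found in `sg_persist -i -k` text.

    Assumes the layout shown in this report: a header ending in
    "registered reservation keys follow:" followed by hex keys.
    """
    header = re.search(r"registered reservation keys? follows?:", output)
    if header is None:
        return []
    # Only look after the header so "PR generation=0x.." is not mistaken
    # for a key.
    return [int(k, 16) for k in re.findall(r"0x[0-9a-fA-F]+", output[header.end():])]


sample = (
    "WINSYS SX2394R 361H\n"
    "Peripheral device type: disk\n"
    "PR generation=0x88, 5 registered reservation keys follow:\n"
    "0x512a0001\n0x512a0002\n0x512a0003\n0x512a0005\n0x512a0004\n"
)
keys = parse_registered_keys(sample)
all_registered = len(keys) == 5 and 0x512a0002 in keys
```

Here all_registered is true when all five node keys are listed, including the key of the node that logged the reservation conflict.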
Pushed to master branch. commit d532e41a3d2a9d85db4b87b80c36119f59534c85
Pushed to RHEL6 branch. commit e5bf447139c7ba7c128f615a8bcbf46174d0945a
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1599.html