When an uncached gnbd fails, gnbd_monitor fences the server if it is nonresponsive. Then it waits for all current users of the device to close it. Finally, it tries to contact the server at regular intervals. If the server comes back up and reexports the device, gnbd_monitor is supposed to reimport it and start the monitoring all over again.

Currently, the check to make sure that the reimport was successful is wrong, so usually, after the device has been successfully reimported, gnbd_monitor will not reset. The next time that the device fails, gnbd_monitor will skip the fence steps and simply try to reimport the device. This means that in cases where the gnbd server is nonresponsive but the gnbd server node is still alive, gnbd_monitor will not fence the server after the first time. Fixing this problem involves changing the line if (check_recvd(dev) == 1) to if (check_recvd(dev) >= 0), which is the correct thing to check for.

A related issue is the requirement that gnbd_monitor wait until all users have closed the device. This is an unnecessary requirement, and it makes it much harder to use dm-multipath, since dm-multipath keeps failed paths open.
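To illustrate why the old check leaves gnbd_monitor stuck, here is a minimal sketch. It is not the gnbd_monitor source: the phase names and the helper functions are hypothetical, and it assumes that check_recvd() returns a negative value on failure and some non-negative value (not necessarily 1) when the reimport succeeded.

```c
#include <stdbool.h>

/* Hypothetical monitor phases, for illustration only. */
enum monitor_phase {
    PHASE_MONITORING,         /* reset: a later failure fences again */
    PHASE_WAITING_FOR_SERVER  /* not reset: a later failure skips fencing */
};

/* Old check: only a return value of exactly 1 counts as success. */
static bool reimport_ok_old(int recvd)
{
    return recvd == 1;
}

/* Fixed check: any non-negative return value counts as success.
 * Assumes negative values signal an error. */
static bool reimport_ok_fixed(int recvd)
{
    return recvd >= 0;
}

/* Decide the next phase from the (assumed) check_recvd() return value. */
static enum monitor_phase next_phase(int recvd, bool (*reimport_ok)(int))
{
    return reimport_ok(recvd) ? PHASE_MONITORING : PHASE_WAITING_FOR_SERVER;
}
```

Under these assumptions, a successful reimport that yields any non-negative value other than 1 is treated as a failure by the old check, so the monitor never returns to its initial state; the fixed check resets it.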
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0170.html