When an uncached gnbd fails, gnbd_monitor fences the server if it is
nonresponsive. Then it waits for all current users of the device to close it.
Finally it tries to contact the server at regular intervals. If the server
comes back up and reexports the device, gnbd_monitor is supposed to reimport it
and start the monitoring all over again.
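
To make the intended sequence concrete, here is a minimal sketch of it as a
small state machine. This is not the gnbd_monitor code: the state names, the
observation struct and monitor_step() are all hypothetical and only illustrate
the steps described above.

```c
#include <stdbool.h>

/* Hypothetical states for illustration; not the names used in gnbd_monitor. */
enum monitor_state {
    MONITORING,      /* device is healthy, watching for a failure */
    WAIT_FOR_CLOSE,  /* fence step done, waiting for users to close the device */
    RECONNECTING     /* polling the server at regular intervals */
};

/* Hypothetical snapshot of what the monitor sees on one pass. */
struct observation {
    bool device_failed;      /* the uncached gnbd has failed */
    bool server_responsive;  /* the gnbd server answers requests */
    bool device_open;        /* some user still holds the device open */
    bool reimport_ok;        /* the device was reexported and reimported */
};

/* Advance one step. Sets *fence when the server should be fenced. */
enum monitor_state monitor_step(enum monitor_state s,
                                const struct observation *o, bool *fence)
{
    *fence = false;
    switch (s) {
    case MONITORING:
        if (!o->device_failed)
            return MONITORING;
        /* Fence only if the server is nonresponsive. */
        *fence = !o->server_responsive;
        return WAIT_FOR_CLOSE;
    case WAIT_FOR_CLOSE:
        return o->device_open ? WAIT_FOR_CLOSE : RECONNECTING;
    case RECONNECTING:
        /* A successful reimport must reset the monitor to the start. */
        return o->reimport_ok ? MONITORING : RECONNECTING;
    }
    return s;
}
```

In this sketch, the bug described in this report corresponds to the
RECONNECTING case never returning to MONITORING, so the fence decision in the
MONITORING case is never reached again.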
Currently, the check to make sure that the reimport was successful is wrong, so
usually, after the device has been successfully reimported, gnbd_monitor will
not reset. The next time that the device fails, gnbd_monitor will skip the fence
steps and simply try to reimport the device. This means that in cases where
the gnbd server is nonresponsive, but the gnbd server node is still alive,
gnbd_monitor will not fence the server after the first time.
Fixing this problem involves changing the line
    if (check_recvd(dev) == 1)
to
    if (check_recvd(dev) >= 0)
which is obviously the correct thing to check for.
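
The difference between the two conditions can be shown in isolation. This is
only a sketch: this report does not say what check_recvd() returns, so the
code below assumes it returns a negative value on failure and some
non-negative value on success. The helper names reimport_ok_old() and
reimport_ok_new() are hypothetical.

```c
#include <stdbool.h>

/*
 * Assumption (not stated in this report): check_recvd() returns a negative
 * value on failure and a non-negative value on success. The argument
 * stands in for that return value.
 */

/* Old condition: only a return value of exactly 1 counts as success. */
bool reimport_ok_old(int recvd)
{
    return recvd == 1;
}

/* Fixed condition: any non-negative return value counts as success. */
bool reimport_ok_new(int recvd)
{
    return recvd >= 0;
}
```

Under that assumption, a successful reimport that returns anything other than
1 (for example 0) fails the old condition, so gnbd_monitor does not reset,
while the new condition accepts it and still rejects negative error values.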
A related issue is the requirement that gnbd_monitor waits until all users have
closed the device. This is an unnecessary requirement, and it makes it much
harder to use dm-multipath, since dm-multipath keeps failed paths open.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.