Created attachment 516605
Suggested patch for /usr/share/cluster/utils/fs-lib.sh
Description of problem:
The customer uses a High Availability Add-On cluster with an iSCSI shared disk. When access to the iSCSI disk fails, the active node stops and restarts the service in an endless loop.
The resource definition of the filesystem on the shared disk is as follows:
<fs device="/dev/sdb" fstype="ext4" mountpoint="/data01" name="data_fs"/>
The root cause is that when rgmanager tries to restart the service, mounting the filesystem fails with return code 32, but /usr/share/cluster/utils/fs-lib.sh does not treat that return code as an error.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Configure a cluster with an iSCSI shared disk and create a filesystem resource on it. Do not use qdisk.
2. Emulate a disk path error by blocking iSCSI access with iptables on the active cluster node.
# iptables -A INPUT -m tcp -p tcp --sport 3260 -j REJECT
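For repeated test runs it may help to wrap the rule from step 2 so that it can be removed again afterwards. This is only a sketch: the function names and the IPTABLES and ISCSI_PORT variables are introduced here for illustration, and the commands need to be run as root on the active node.

```shell
#!/bin/bash
# Sketch only: add and remove the REJECT rule used in step 2.
# IPTABLES is an assumed override so the command can be stubbed or pointed
# at a specific binary; it defaults to the iptables found in PATH.
ISCSI_PORT=3260

# Block incoming iSCSI traffic (same rule as in step 2).
block_iscsi() {
    ${IPTABLES:-iptables} -A INPUT -m tcp -p tcp --sport "$ISCSI_PORT" -j REJECT
}

# Delete the rule again; -D takes the same rule specification as -A.
unblock_iscsi() {
    ${IPTABLES:-iptables} -D INPUT -m tcp -p tcp --sport "$ISCSI_PORT" -j REJECT
}
```

Calling block_iscsi starts the emulated failure, and unblock_iscsi restores access once the test is finished.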
Actual results:
rgmanager stops and restarts the service in an endless loop.
Expected results:
The service is relocated to the other node.
Additional info:
See the attachment for the suggested patch to /usr/share/cluster/utils/fs-lib.sh. It treats all non-zero return codes as errors when mounting the filesystem. In my lab cluster, the service was relocated successfully with the patch applied. However, I'm not sure whether it is appropriate to handle ALL non-zero return codes as errors.
Nonzero return codes should be treated as errors, according to the mount man page.
Also, it appears that your patch would work.
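For reference, the mount man page documents the exit status as a set of bits that can be ORed together, so any non-zero value indicates some kind of failure. A small helper along these lines can make a value such as 32 readable when debugging. The function name is made up for this sketch and is not part of fs-lib.sh.

```shell
#!/bin/bash
# Decode a mount(8) exit status into readable text.
# Bit meanings follow the mount(8) man page; describe_mount_rc is an
# illustrative name, not an existing rgmanager function.
describe_mount_rc() {
    local rc=$1
    local out=""
    if [ "$rc" -eq 0 ]; then
        echo "success"
        return 0
    fi
    # The status is a bitmask, so test each documented bit in turn.
    [ $((rc & 1)) -ne 0 ]  && out="${out}incorrect invocation or permissions; "
    [ $((rc & 2)) -ne 0 ]  && out="${out}system error; "
    [ $((rc & 4)) -ne 0 ]  && out="${out}internal mount bug; "
    [ $((rc & 8)) -ne 0 ]  && out="${out}user interrupt; "
    [ $((rc & 16)) -ne 0 ] && out="${out}problems writing or locking /etc/mtab; "
    [ $((rc & 32)) -ne 0 ] && out="${out}mount failure; "
    [ $((rc & 64)) -ne 0 ] && out="${out}some mount succeeded; "
    # Strip the trailing separator before printing.
    echo "${out%; }"
}
```

For the case in this report, `describe_mount_rc 32` reports a mount failure.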
Basically, if mount fails, the resource agent should return a failure -- this is for all values of failure, not just '1'. In this case, the device is missing, and mount returned the generic '32' error code for a failed mount, which was not handled.
This should be simple to fix.
Available in rhel6-fixes branch upstream.
As for testing, I don't have a setup that triggers a mount error other than 1 at the moment, but the patch is simple enough and has been tested in the netfs.sh code.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Cause: The fs-lib.sh resource agent library ignored mount errors other than '1'.
Consequence: When a mount returned an error other than 1 (for example, a failed mount of an iSCSI device), fs-lib.sh treated the mount as successful.
Fix: fs-lib.sh now recognizes the other error codes.
Result: fs-lib.sh now recognizes all mount errors and fails properly.
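The behaviour described in the note above can be sketched roughly as follows. This is not the actual fs-lib.sh code or the attached patch: try_mount and the MOUNT_CMD override are hypothetical names used only to show the difference between testing for exit status 1 and testing for any non-zero exit status.

```shell
#!/bin/bash
# Hypothetical sketch, not the real fs-lib.sh implementation.
# MOUNT_CMD is an assumed override so the mount command can be stubbed;
# it defaults to the mount found in PATH.
try_mount() {
    local dev=$1
    local mp=$2
    local fstype=$3
    local rc

    ${MOUNT_CMD:-mount} -t "$fstype" "$dev" "$mp"
    rc=$?

    # A check such as [ "$rc" -eq 1 ] would miss status 32.
    # Treat every non-zero status as a failure instead.
    if [ "$rc" -ne 0 ]; then
        echo "mount of $dev on $mp failed with exit status $rc" >&2
        return 1
    fi
    return 0
}
```

With the resource from this report, the call would look like `try_mount /dev/sdb /data01 ext4`, and a failed mount with status 32 makes the function return 1 so that the caller can report the start as failed.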
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.