Description of problem
======================

Gluster volume creation fails when the process is done by the rhs-hadoop-install script and the cluster is big or slow. The most serious problem is that in some cases, the volume is created in an inconsistent state.

Version-Release number of selected component (if applicable)
============================================================

using iso RHSS-2.1.bd-20140219.n.0 [1]

glusterfs-fuse-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-3.4.0.44rhs-1.el6rhs.x86_64
rhs-hadoop-install-0_72-2.el6rhs.noarch

[1] http://download.eng.bos.redhat.com/nightly/RHSS-2.1.bd-20140219.n.0/

How reproducible
================

Sometimes; it depends on the cluster. On the other hand, I'm able to reproduce it every time on my cluster of virtual machines (ping me to request access).

Steps to Reproduce
==================

1. Install the RHSS iso on a cluster of machines and allocate a device on each machine (for the XFS volume which will hold the bricks).
2. Use the rhs-hadoop-install script (included in the RHSS-2.1.bd iso) for the gluster volume setup:
   2.1 cd /usr/share/rhs-hadoop-install
   2.2 create a hosts file with the names of the machines of the cluster, one per line (see /usr/share/rhs-hadoop-install/README.txt for details; an example is shown after this list)
   2.3 run the installer:

       ./install.sh --debug -y /dev/vdb1

       where /dev/vdb1 is the device on which the XFS volume will be created (for the brick); this device is expected to exist on every node of the cluster.
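For illustration, a hosts file for the four nodes used in the logs below would look like this (assuming the plain one-hostname-per-line format that README.txt describes):

~~~
dhcp-37-123.lab.eng.brq.redhat.com
dhcp-37-113.lab.eng.brq.redhat.com
dhcp-37-118.lab.eng.brq.redhat.com
dhcp-37-119.lab.eng.brq.redhat.com
~~~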
Actual results
==============

Two different failures can happen:

i) the 'volume create' operation fails, so the volume is not created:

~~~
vol create: volume create: HadoopVol: failed: Staging failed on dhcp-37-118.lab.eng.brq.redhat.com. Error: Host dhcp-37-119.lab.eng.brq.redhat.com not connected
Staging failed on dhcp-37-113.lab.eng.brq.redhat.com. Error: Host dhcp-37-119.lab.eng.brq.redhat.com not connected
...verify vol create wait: 2 seconds
...verify vol create wait: 4 seconds
...verify vol create wait: 6 seconds
...verify vol create wait: 8 seconds
...verify vol create wait: 10 seconds
...verify vol create wait: 12 seconds
...verify vol create wait: 14 seconds
...verify vol create wait: 16 seconds
...verify vol create wait: 18 seconds
...verify vol create wait: 20 seconds
...verify vol create wait: 22 seconds
...verify vol create wait: 24 seconds
...verify vol create wait: 26 seconds
...verify vol create wait: 28 seconds
...verify vol create wait: 30 seconds
...verify vol create wait: 32 seconds
...verify vol create wait: 34 seconds
...verify vol create wait: 36 seconds
...verify vol create wait: 38 seconds
...verify vol create wait: 40 seconds
ERROR: Volume "HadoopVol" creation failed with error 1
Bricks=" dhcp-37-123.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-113.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-118.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-119.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol"
~~~

note: 'gluster peer status' shows all nodes as peers in the cluster and connected

ii) the 'volume create' operation seems to succeed and the volume is created, but one node which is supposed to hold a brick for the volume doesn't know about this volume, so any operation on the volume fails:

~~~
vol create: volume create: HadoopVol: success: please start the volume to access data
Volume "HadoopVol" created...
vol start: volume start: HadoopVol: failed: Staging failed on dhcp-37-119.lab.eng.brq.redhat.com. Error: Volume HadoopVol does not exist
Volume "HadoopVol" started...
-- dhcp-37-123.lab.eng.brq.redhat.com --
mount volume
ERROR: dhcp-37-123.lab.eng.brq.redhat.com: mount /mnt/glusterfs: Mount failed. Please check the log file for more details.
~~~

This seems like a serious problem to me, because in this case the new volume ends up in an inconsistent state and I was unable to do anything about it, including simple removal:

~~~
# gluster volume delete HadoopVol
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: HadoopVol: failed: Staging failed on dhcp-37-119.lab.eng.brq.redhat.com. Error: Volume HadoopVol does not exist
~~~

note: in this case, the command 'gluster volume list' on dhcp-37-119.lab.eng.brq.redhat.com doesn't show the volume, even though the create operation succeeded.

Expected results
================

The volume is created without any problems.

Additional info
===============

i) When I rerun the same operations by hand, it works:

~~~
root@mb1_master:/usr/share/rhs-hadoop-install# time gluster volume create HadoopVol replica 2 dhcp-37-123.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-113.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-118.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-119.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol
volume create: HadoopVol: success: please start the volume to access data

real    0m5.743s
user    0m0.115s
sys     0m0.024s
root@mb1_master:/usr/share/rhs-hadoop-install# time gluster --mode=script volume start HadoopVol
volume start: HadoopVol: success

real    0m4.310s
user    0m0.113s
sys     0m0.022s
~~~

note that the total time to do this was about 10 seconds

ii) When I add an additional sleep after the peer probe operation and before the volume creation, it fixes the problem on my virtual machines:

~~~
root@mb1_master:/usr/share/rhs-hadoop-install# diff -u install.sh.pristine install.sh
--- install.sh.pristine 2014-03-03 18:38:01.174825913 +0100
+++ install.sh  2014-03-03 18:39:19.674823213 +0100
@@ -867,6 +867,7 @@
   create_trusted_pool
   verify_pool_created
   # create vol
+  sleep 5
   out="$(ssh -oStrictHostKeyChecking=no root@$firstNode "
        gluster volume create $VOLNAME replica $REPLICA_CNT $bricks 2>&1")"
   err=$?
~~~
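A fixed sleep is of course fragile; a bigger or slower cluster might need more than 5 seconds. A more robust variant would be to poll 'gluster peer status' until all probed peers report as connected before attempting the volume create. A minimal sketch only (the function name, the peer-count argument, and the retry bounds are my own assumptions, not taken from install.sh):

~~~
# Sketch: wait until the expected number of peers report
# "Peer in Cluster (Connected)" instead of sleeping a fixed interval.
# expected_peers (number of nodes minus the local one) and the retry
# limit are assumptions, not values from install.sh.
wait_for_peers() {
  local expected_peers=$1
  local tries=0 connected

  while (( tries < 30 )); do   # give up after ~60 seconds
    connected=$(gluster peer status | grep -c 'Peer in Cluster (Connected)')
    (( connected >= expected_peers )) && return 0
    sleep 2
    (( tries++ ))
  done
  return 1   # peers did not all connect in time
}

# e.g. on this 4-node cluster: wait for the 3 remote peers, then create the volume
wait_for_peers 3 || { echo "ERROR: peers not connected" >&2; exit 1; }
~~~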
Maybe BZ 984881 is related to this one.
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release against which you reported this issue is now End of Life; please see https://access.redhat.com/support/policy/updates/rhs/. If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.