Description of problem
======================

Gluster volume creation fails when the process is done by the rhs-hadoop-install script and the cluster is big or slow. The most serious problem is that in some cases, the volume is created in an inconsistent state.

Version-Release number of selected component (if applicable)
============================================================

using iso RHSS-2.1.bd-20140219.n.0 [1]

glusterfs-fuse-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-3.4.0.44rhs-1.el6rhs.x86_64
rhs-hadoop-install-0_72-2.el6rhs.noarch

[1] http://download.eng.bos.redhat.com/nightly/RHSS-2.1.bd-20140219.n.0/

How reproducible
================

Sometimes; it depends on the cluster. On the other hand, I'm able to reproduce it every time on my cluster of virtual machines (ping me to request access).

Steps to Reproduce
==================

1. Install the RHSS iso on a cluster of machines and allocate a device on each machine (for the XFS volume which will hold the bricks).
2. Use the rhs-hadoop-install script (included in the RHSS-2.1.bd iso) for the gluster volume setup:
   2.1 cd /usr/share/rhs-hadoop-install
   2.2 create a hosts file with the names of the machines of the cluster, one per line (see /usr/share/rhs-hadoop-install/README.txt for details; an example is shown after this list)
   2.3 run the installer:

       ./install.sh --debug -y /dev/vdb1

       where /dev/vdb1 is the device on which the XFS volume will be created (for the brick); this device is expected to exist on every node of the cluster.
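For illustration, a hosts file for the four nodes used in the logs below would look like this (assuming the plain one-hostname-per-line format that README.txt describes):

~~~
dhcp-37-123.lab.eng.brq.redhat.com
dhcp-37-113.lab.eng.brq.redhat.com
dhcp-37-118.lab.eng.brq.redhat.com
dhcp-37-119.lab.eng.brq.redhat.com
~~~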
Actual results
==============

Two different failures can happen:

i) the 'volume create' operation fails, so the volume is not created:

~~~
vol create: volume create: HadoopVol: failed: Staging failed on dhcp-37-118.lab.eng.brq.redhat.com. Error: Host dhcp-37-119.lab.eng.brq.redhat.com not connected
Staging failed on dhcp-37-113.lab.eng.brq.redhat.com. Error: Host dhcp-37-119.lab.eng.brq.redhat.com not connected
...verify vol create wait: 2 seconds
...verify vol create wait: 4 seconds
...verify vol create wait: 6 seconds
...verify vol create wait: 8 seconds
...verify vol create wait: 10 seconds
...verify vol create wait: 12 seconds
...verify vol create wait: 14 seconds
...verify vol create wait: 16 seconds
...verify vol create wait: 18 seconds
...verify vol create wait: 20 seconds
...verify vol create wait: 22 seconds
...verify vol create wait: 24 seconds
...verify vol create wait: 26 seconds
...verify vol create wait: 28 seconds
...verify vol create wait: 30 seconds
...verify vol create wait: 32 seconds
...verify vol create wait: 34 seconds
...verify vol create wait: 36 seconds
...verify vol create wait: 38 seconds
...verify vol create wait: 40 seconds
ERROR: Volume "HadoopVol" creation failed with error 1
Bricks=" dhcp-37-123.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-113.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-118.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-119.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol"
~~~

note: 'gluster peer status' shows all nodes as peers in the cluster and connected

ii) the 'volume create' operation seems to succeed and the volume is created, but one node which is supposed to hold a brick for the volume doesn't know about this volume, so any operation on the volume fails:

~~~
vol create: volume create: HadoopVol: success: please start the volume to access data
Volume "HadoopVol" created...
vol start: volume start: HadoopVol: failed: Staging failed on dhcp-37-119.lab.eng.brq.redhat.com. Error: Volume HadoopVol does not exist
Volume "HadoopVol" started...
-- dhcp-37-123.lab.eng.brq.redhat.com --
mount volume
ERROR: dhcp-37-123.lab.eng.brq.redhat.com: mount /mnt/glusterfs: Mount failed. Please check the log file for more details.
~~~

This seems like a serious problem to me, because in this case the new volume ends up in an inconsistent state and I was unable to do anything about it, including simple removal:

~~~
# gluster volume delete HadoopVol
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: HadoopVol: failed: Staging failed on dhcp-37-119.lab.eng.brq.redhat.com. Error: Volume HadoopVol does not exist
~~~

note: in this case, the command 'gluster volume list' on dhcp-37-119.lab.eng.brq.redhat.com doesn't show the volume, even though the create operation succeeded.

Expected results
================

The volume is created without any problems.

Additional info
===============

i) When I rerun the same operations by hand, it works:

~~~
root@mb1_master:/usr/share/rhs-hadoop-install# time gluster volume create HadoopVol replica 2 dhcp-37-123.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-113.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-118.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol dhcp-37-119.lab.eng.brq.redhat.com:/mnt/brick1/HadoopVol
volume create: HadoopVol: success: please start the volume to access data

real    0m5.743s
user    0m0.115s
sys     0m0.024s
root@mb1_master:/usr/share/rhs-hadoop-install# time gluster --mode=script volume start HadoopVol
volume start: HadoopVol: success

real    0m4.310s
user    0m0.113s
sys     0m0.022s
~~~

note that the total time to do this was about 10 seconds

ii) When I add an additional sleep after the peer probe operation and before the volume creation, it fixes the problem on my virtual machines:

~~~
root@mb1_master:/usr/share/rhs-hadoop-install# diff -u install.sh.pristine install.sh
--- install.sh.pristine 2014-03-03 18:38:01.174825913 +0100
+++ install.sh  2014-03-03 18:39:19.674823213 +0100
@@ -867,6 +867,7 @@
   create_trusted_pool
   verify_pool_created
   # create vol
+  sleep 5
   out="$(ssh -oStrictHostKeyChecking=no root@$firstNode "
        gluster volume create $VOLNAME replica $REPLICA_CNT $bricks 2>&1")"
   err=$?
~~~
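A fixed sleep is of course fragile; a bigger or slower cluster might need more than 5 seconds. A more robust variant would be to poll 'gluster peer status' until all probed peers report as connected before attempting the volume create. A minimal sketch only (the function name, the peer-count argument, and the retry bounds are my own assumptions, not taken from install.sh):

~~~
# Sketch: wait until the expected number of peers report
# "Peer in Cluster (Connected)" instead of sleeping a fixed interval.
# expected_peers (number of nodes minus the local one) and the retry
# limit are assumptions, not values from install.sh.
wait_for_peers() {
  local expected_peers=$1
  local tries=0 connected

  while (( tries < 30 )); do   # give up after ~60 seconds
    connected=$(gluster peer status | grep -c 'Peer in Cluster (Connected)')
    (( connected >= expected_peers )) && return 0
    sleep 2
    (( tries++ ))
  done
  return 1   # peers did not all connect in time
}

# e.g. on this 4-node cluster: wait for the 3 remote peers, then create the volume
wait_for_peers 3 || { echo "ERROR: peers not connected" >&2; exit 1; }
~~~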
Maybe BZ 984881 is related to this one.
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release against which you reported this issue is now End of Life; please see https://access.redhat.com/support/policy/updates/rhs/. If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.