Bug 1161893 - volume no longer available after update to 3.6.1
Summary: volume no longer available after update to 3.6.1
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-11-08 21:52 UTC by Mauro Mozzarelli
Modified: 2016-08-23 12:50 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 12:50:15 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Mauro Mozzarelli 2014-11-08 21:52:07 UTC
Description of problem:
I am running CentOS 6.6 with glusterfs-epel.repo, and I usually receive automated updates. To date I have had no issues with glusterfs package updates.
Today I received an automatic update from 3.5.2 to 3.6.1.
Since the update glusterfs has stopped working and I am no longer able to mount my volume. I tried to diagnose the problem but cannot find any clue, so I am now restoring my volumes as plain xfs filesystems, no longer shared, and uninstalling glusterfs. I am afraid that if this is the impact of an update, I can no longer rely on a filesystem that has served me well for the past 3 months.

Version-Release number of selected component (if applicable):
3.6.1 received via glusterfs-epel.repo

How reproducible:
Update from 3.5.2 to 3.6.1 and observe what happens to the volume.

Steps to Reproduce:
1. Start with CentOS 6.6 and glusterfs-epel.repo 3.5.2; create a brick and a gluster volume
2. Allow the update to 3.6.1
3. The volume mount will hang for several minutes or will fail; "gluster volume status" shows the volume as not online. The logs give a mortal non-developer no clue as to what is wrong.
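The reproduction path above can be sketched as shell commands (a sketch only; the volume name gv0, brick paths, and mount point are hypothetical stand-ins for the reporter's setup):

```shell
# Sketch of the reported reproduction on CentOS 6.6 (names are hypothetical).
yum install glusterfs-server-3.5.2            # start from the working 3.5.2
gluster volume create gv0 replica 2 node1:/brick1/gv0 node2:/brick1/gv0
gluster volume start gv0
yum update 'glusterfs*'                       # pulls 3.6.1 from glusterfs-epel
service glusterd restart
mount -t glusterfs node1:/gv0 /mnt/gv0        # reportedly hangs or fails here
gluster volume status gv0                     # reportedly shows the volume as not online
```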

Actual results:
Mounting the volume hangs or fails

Expected results:
glusterd starts; the volume is active and mountable

Additional info:

Comment 1 Mauro Mozzarelli 2014-11-09 00:53:51 UTC
I have done more tests, as follows, on two physical CentOS 6.6 x86_64 servers:
1) I erased all gluster packages and manually removed the directories that were left over: /var/lib/glusterd and /var/log/glusterfs
2) I yum re-installed glusterfs packages 3.6.1 from glusterfs-epel.repo
glusterfs-fuse
glusterfs-libs
glusterfs-server
glusterfs-geo-replication
glusterfs-api
glusterfs
glusterfs-cli

3) I re-created the brick filesystems using mkfs.xfs -f -i size=512 [mydev] on two nodes that here I will refer to as node1 and node2
4) I activated glusterd service using "chkconfig glusterd on" and I started glusterd using "service glusterd start"
5) I created a volume following the exact instructions at 
http://www.gluster.org/community/documentation/index.php/QuickStart


6) I started and mounted the volume. So far so good
7) I set nfs.disable on for the volume since I have several nfs exported filesystems and they would conflict
8) I included the volume mount in /etc/fstab as glusterfs
9) I filled the volume with some data
10) I shut down node2
11) I shut down node1 and then rebooted it (node2 still shut down)
12) at boot, "glusterd start" hung for several minutes, stalling the whole server boot process
13) once the server finally resumed booting, the gluster volume failed to mount
14) I tried to restart glusterd manually, with the same results
15) I started node2; the same happened on node2
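The steps above, condensed into a script sketch (run on each node; the brick device /dev/sdb1 is hypothetical, and gv0 stands in for the volume created via the QuickStart guide):

```shell
# Fresh-install reproduction sketch on node1/node2 (CentOS 6.6).
yum remove 'glusterfs*'
rm -rf /var/lib/glusterd /var/log/glusterfs   # leftover state from 3.5.2
yum install glusterfs-server glusterfs-fuse   # installs 3.6.1 from glusterfs-epel
mkfs.xfs -f -i size=512 /dev/sdb1             # brick device is hypothetical
chkconfig glusterd on
service glusterd start
# ...create, start and mount the volume per the QuickStart guide...
gluster volume set gv0 nfs.disable on
echo 'node1:/gv0 /mnt/gv0 glusterfs defaults,_netdev 0 0' >> /etc/fstab
# shut down node2, then reboot node1: glusterd hangs at boot and the mount fails
```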

------------------------

At this point I removed all gluster 3.6.1 packages and repeated the above installing 3.5.2 packages instead.

With 3.5.2 I found no problem whatsoever: glusterd does not hang and the filesystem gets mounted, whether the nodes are started one at a time or both at the same time. If I leave node2 shut down, even for days, when I boot it, it re-synchronizes and everything works fine.

I would appreciate it if this issue could be fixed in release 3.6.1, to support the use cases that work perfectly with releases up to 3.5.2.

I also recommend implementing full regression testing before a new release.

THANK YOU.

Comment 2 Lalatendu Mohanty 2014-11-10 07:26:00 UTC
Marking the version as 3.6.0, since a 3.6.1 tag is not available.

Comment 3 Lalatendu Mohanty 2014-11-10 10:17:18 UTC
Mauro, thanks for reporting the bug. It looks like the gluster volume is going offline after the upgrade. At the moment we are not sure what caused it. It would be helpful if you could provide the gluster logs (/var/log/glusterfs/) from your setup. Meanwhile we are trying to reproduce this issue.

Comment 4 Lalatendu Mohanty 2014-11-10 10:52:12 UTC
Also, which protocol are you using to mount the volume on the client, i.e. fuse, nfs or smb?

Comment 5 Mauro Mozzarelli 2014-11-10 11:54:01 UTC
Thank you for looking after this issue.
I no longer have the logs because, in the haste of restoring service, I did not back up the /var gluster directories, and I recovered the data by directly accessing the brick/volume directory on the most up-to-date node.
However, I was able to reproduce the issue after I completely deleted the glusterfs packages and /var directories and re-installed the packages as explained above.

To mount the filesystem I put an entry in /etc/fstab as specified by the on-line manual:

node1:/gv_home /home glusterfs defaults,_netdev 0 0

I am not sure which protocol this would use. Is it fuse?
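For reference: an fstab entry of type glusterfs is mounted by the FUSE client, so it shows up in /proc/mounts with filesystem type fuse.glusterfs. A quick check (a sketch, using the /home mount point from the entry above):

```shell
# Confirm which protocol the gluster mount is using: a FUSE mount of the
# volume appears in /proc/mounts with type "fuse.glusterfs".
grep ' /home ' /proc/mounts
# expected form: node1:/gv_home /home fuse.glusterfs rw,... 0 0
```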

After re-installing 3.5.2 and recreating the replicated gv everything is working fine again. I also modified my glusterfs-epel.repo path to pick only 3.5 updates and thus ignore 3.6.

My current configuration (use case) is a 2-node replica/mirror with node2 normally turned off. When I turn on node2 the volume is synchronized automatically, and I can then use node2 as a replacement for node1, which can now be turned off. This works very well up to release 3.5.2.

I believe this can be easily reproduced as explained in my second message.

Comment 6 Mauro Mozzarelli 2014-11-10 11:57:53 UTC
When I had 3.6.1 and the volume was no longer mountable after the reboot, I noticed that it was in a non-started status, so I tried to start it, but that did not make any difference, as it failed to start.

Comment 7 Lalatendu Mohanty 2014-11-10 12:06:01 UTC
We tried to reproduce this issue as described below, but could not.

Note, however, that the recommended upgrade steps are documented at http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.6

Server 1 : 3.5.2 rhsauto046.lab.eng.blr.redhat.com

Server 2 : 3.5.2 rhsauto057.lab.eng.blr.redhat.com


-- Server 1 -- 

[root@rhsauto046 yum.repos.d]# ps aux | grep gluster
root      5879  0.0  0.5 420208 20580 ?        Ssl  15:51   0:00 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
root      5944  0.0  0.5 649980 21612 ?        Ssl  16:31   0:00 /usr/sbin/glusterfsd -s rhsauto046.lab.eng.blr.redhat.com --volfile-id gv0.rhsauto046.lab.eng.blr.redhat.com.bricks-gv0 -p /var/lib/glusterd/vols/gv0/run/rhsauto046.lab.eng.blr.redhat.com-bricks-gv0.pid -S /var/run/ddf23bad40708c9856f322f3de0004ae.socket --brick-name /bricks/gv0 -l /var/log/glusterfs/bricks/bricks-gv0.log --xlator-option *-posix.glusterd-uuid=780164d4-15d1-4422-a4ff-9dc7483bbd27 --brick-port 49152 --xlator-option gv0-server.listen-port=49152
root      5958  0.0  1.2 317932 49668 ?        Ssl  16:31   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/5df21af9684cc4c5cdc8d281c4c0dcde.socket
root      5962  0.0  0.6 335400 27448 ?        Ssl  16:31   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/8255b1d2da6f936c5950eb2747fe20e7.socket --xlator-option *replicate*.node-uuid=780164d4-15d1-4422-a4ff-9dc7483bbd27
root      6046  0.0  0.0 103252   804 pts/1    S+   17:05   0:00 grep gluster



[root@rhsauto046 yum.repos.d]# gluster v i
 
Volume Name: gv0
Type: Replicate
Volume ID: 48567fcf-7b41-4906-bd92-a3c52bb2a135
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhsauto057.lab.eng.blr.redhat.com:/bricks/gv0
Brick2: rhsauto046.lab.eng.blr.redhat.com:/bricks/gv0



--/snip--

Did an online upgrade (# yum update glusterfs) to the 3.6.1 rpms. After the upgrade:

--/snip--

[root@rhsauto046 yum.repos.d]# rpm -qa | grep gluster
glusterfs-api-3.6.1-1.el6.x86_64
glusterfs-libs-3.6.1-1.el6.x86_64
glusterfs-cli-3.6.1-1.el6.x86_64
glusterfs-3.6.1-1.el6.x86_64
glusterfs-server-3.6.1-1.el6.x86_64
glusterfs-fuse-3.6.1-1.el6.x86_64


[root@rhsauto046 yum.repos.d]# ps aux | grep gluster
root      6145  4.1  0.4 440076 16628 ?        Ssl  17:06   0:00 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
root      6158  0.2  0.5 609700 21400 ?        Ssl  17:06   0:00 /usr/sbin/glusterfsd -s rhsauto046.lab.eng.blr.redhat.com --volfile-id gv0.rhsauto046.lab.eng.blr.redhat.com.bricks-gv0 -p /var/lib/glusterd/vols/gv0/run/rhsauto046.lab.eng.blr.redhat.com-bricks-gv0.pid -S /var/run/ddf23bad40708c9856f322f3de0004ae.socket --brick-name /bricks/gv0 -l /var/log/glusterfs/bricks/bricks-gv0.log --xlator-option *-posix.glusterd-uuid=780164d4-15d1-4422-a4ff-9dc7483bbd27 --brick-port 49152 --xlator-option gv0-server.listen-port=49152
root      6169  1.0  1.3 408532 53588 ?        Ssl  17:06   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/5df21af9684cc4c5cdc8d281c4c0dcde.socket
root      6176  1.0  0.4 442376 19304 ?        Ssl  17:06   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/8255b1d2da6f936c5950eb2747fe20e7.socket --xlator-option *replicate*.node-uuid=780164d4-15d1-4422-a4ff-9dc7483bbd27
root      6193  0.0  0.0 103252   808 pts/1    S+   17:06   0:00 grep gluster



[root@rhsauto046 yum.repos.d]# gluster v info
 
Volume Name: gv0
Type: Replicate
Volume ID: 48567fcf-7b41-4906-bd92-a3c52bb2a135
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhsauto057.lab.eng.blr.redhat.com:/bricks/gv0
Brick2: rhsauto046.lab.eng.blr.redhat.com:/bricks/gv0

--/snip--


The volume was mountable and accessible from the client, and gluster volume info showed "Started" even after the online upgrade. We did this for both servers.

Please note that the proper upgrade procedure is as documented in the link above.

Comment 8 Mauro Mozzarelli 2014-11-10 12:44:41 UTC
Did you disable nfs?
Did you shut down both nodes and start only one after the upgrade?

Comment 9 Mauro Mozzarelli 2014-11-10 12:54:27 UTC
As in my second message, I reproduced the issue even after re-installing 3.6.1 from scratch, without involving an upgrade. I noticed that you use localhost for mounting the volume. In my case the nodes are multihomed, with multiple network cards, and I use the FQDN associated with the network interface card of the local network sharing the gv. I will try 3.6.1 again, and this time I will post the logs.

Comment 10 Mauro Mozzarelli 2014-11-11 13:09:51 UTC
I have now re-installed 3.6.1 for the purpose of providing more information.
Here are the steps followed:

1) removed all 3.5.2 packages and removed /var/lib/glusterd and /var/log/glusterfs to prepare for a fresh install
2) installed 3.6.1 packages:
glusterfs-3.6.1-1.el6.x86_64
glusterfs-api-3.6.1-1.el6.x86_64
glusterfs-cli-3.6.1-1.el6.x86_64
glusterfs-fuse-3.6.1-1.el6.x86_64
glusterfs-geo-replication-3.6.1-1.el6.x86_64
glusterfs-libs-3.6.1-1.el6.x86_64
glusterfs-rdma-3.6.1-1.el6.x86_64
glusterfs-server-3.6.1-1.el6.x86_64

and started daemon using:
service glusterd start

3) created new directory /brick1/gv1 on two physical nodes (node1 and node2)
4) gluster volume create gv1 replica 2 node1:/brick1/gv1 node2:/brick1/gv1
5) gluster volume start gv1
6) gluster volume set gv1 nfs.disable on
7) mkdir /mnt/gv1
8) mount -t glusterfs node1:/gv1 /mnt/gv1 (on both nodes)
9) copied some data into /mnt/gv1; replication works on both nodes
10) then shut down the glusterfs services on both nodes (this is quicker than a full reboot as in message 2 and has the same effect):
umount /mnt/gv1
service glusterfsd stop [I do not know how this got started; I did not start it]
service glusterd stop
killed the last remaining glusterfs process, which did not want to die

11) on node1 only:
# service glusterd start  [OK]
# gluster volume info
Volume Name: gv1
Type: Replicate
Volume ID: 5de2ebc7-b4d6-44c7-8137-211caa286e87
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: node1:/brick1/gv1
Brick2: node2:/brick1/gv1
Options Reconfigured:
nfs.disable: on
# gluster volume status
Status of volume: gv1
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick sirius:/brick1/gv1				N/A	N	N/A
Self-heal Daemon on localhost				N/A	N	N/A
 
Task Status of Volume gv1
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume start gv1 
volume start: gv1: failed: Volume gv1 already started


At this point the volume is not mountable. The same happens when I reboot, only it takes several minutes for glusterd to start. I will now revert to 3.5.2, as I have only these servers. I will keep the logs; please let me know which ones you want and how to get them to you. Thank you.
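To get the requested logs to the developers, something like the following could bundle them for attachment (a sketch; the default CentOS log and volfile paths and the archive name are assumptions):

```shell
# Sketch: bundle the default gluster log and volfile directories for
# attachment to the bug. Override LOGDIR/VOLDIR/OUT to suit the setup.
LOGDIR=${LOGDIR:-/var/log/glusterfs}
VOLDIR=${VOLDIR:-/var/lib/glusterd/vols}
OUT=${OUT:-/tmp/gluster-logs-$(hostname)-$(date +%Y%m%d).tar.gz}
tar czf "$OUT" "$LOGDIR" "$VOLDIR" 2>/dev/null
echo "attach $OUT to the bug report"
```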

Comment 11 Mauro Mozzarelli 2014-11-18 10:25:54 UTC
Are there any updates on this blocking issue for v3.6.1? Did you want the logs?

Comment 12 Mauro Mozzarelli 2014-12-12 17:41:52 UTC
Still running glusterfs 3.5.3. Has version 3.6 been fixed yet?

Comment 13 Nicolas R. 2015-07-08 06:57:46 UTC
Hello

I've got exactly the same problem on Redhat 7 with Gluster 3.6.3.

With a replicated volume between two nodes, everything is fine when both are up.

If one is down, after rebooting the second, the volume is not mounted:
the volume is started but offline.

# gluster v i

Volume Name: data-sync
Type: Replicate
Volume ID: 32813e8c-5c58-4b48-b872-ac792cdc4505
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: host1:/gluster
Brick2: host2:/gluster

# gluster v status
Status of volume: data-sync
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick host2:/gluster                             N/A     N       N/A
NFS Server on localhost                                 N/A     N       N/A
Self-heal Daemon on localhost                           N/A     N       N/A

Task Status of Volume data-sync
------------------------------------------------------------------------------
There are no active volume tasks


The only solution is not to restart the service, but to stop and start the volume.

Any idea how to fix it?
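The workaround described above, as commands (a sketch; data-sync is the volume name from this comment, and --mode=script suppresses gluster's interactive confirmation prompt):

```shell
# Workaround sketch: stop and start the volume instead of restarting glusterd.
gluster --mode=script volume stop data-sync
gluster volume start data-sync
gluster volume status data-sync     # bricks should now show as Online
```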

Comment 14 Mauro Mozzarelli 2015-07-09 19:24:22 UTC
Hello Nicolas, the only solution I have found to date is to downgrade to 3.5.
It would be good to have an update from the developers.

Comment 15 Mauro Mozzarelli 2015-09-28 21:14:40 UTC
I have now upgraded to release 3.7.4 and I found the issue is resolved in this version.

Comment 16 Pranith Kumar K 2016-08-23 12:50:15 UTC
Closing based on https://bugzilla.redhat.com/show_bug.cgi?id=1161893#c15

