Description of problem:
I am running CentOS 6.6 with glusterfs-epel.repo configured and I normally receive automated updates. To date I had no issues with glusterfs package updates. Today I received an automatic update from 3.5.2 to 3.6.1. Since the update, glusterfs has stopped working and I am no longer able to mount my volume. I tried to diagnose the problem but cannot find any clue, so I am now restoring my volume as plain xfs filesystems, no longer shared, and uninstalling glusterfs. I am afraid that if this is the impact of an update, I can no longer rely on this filesystem, which had served me well for the past 3 months.

Version-Release number of selected component (if applicable): 3.6.1, received via glusterfs-epel.repo

How reproducible: update from 3.5.2 and see what happens to the volume

Steps to Reproduce:
1. Start with CentOS 6.6 and glusterfs-epel.repo 3.5.2; create a brick and a gluster volume
2. Allow the update to 3.6.1
3. The volume mount will hang for several minutes or fail to mount; "gluster volume status" shows the volume as not online. The logs fail to give a clue to a mortal non-developer of what is wrong.

Actual results: mounting the volume hangs or fails

Expected results: glusterd starts, the volume is active and mountable

Additional info:
I have done more tests as follows on two physical CentOS 6.6 x86_64 servers:

1) I erased all gluster packages and manually removed the directories that were left over: /var/lib/glusterd and /var/log/glusterfs
2) I re-installed the 3.6.1 glusterfs packages via yum from glusterfs-epel.repo: glusterfs-fuse, glusterfs-libs, glusterfs-server, glusterfs-geo-replication, glusterfs-api, glusterfs, glusterfs-cli
3) I re-created the brick filesystems using "mkfs.xfs -f -i size=512 [mydev]" on two nodes that I will refer to here as node1 and node2
4) I activated the glusterd service using "chkconfig glusterd on" and started glusterd using "service glusterd start"
5) I created a volume following the exact instructions at http://www.gluster.org/community/documentation/index.php/QuickStart
6) I started and mounted the volume. So far so good.
7) I set nfs.disable on for the volume, since I have several NFS-exported filesystems and they would conflict
8) I included the volume mount in /etc/fstab as glusterfs
9) I filled the volume with some data
10) I shut down node2
11) I shut down node1 and then rebooted it (node1), with node2 still shut down
12) At boot, "glusterd start" hung for several minutes, thus hanging the whole server boot process
13) Once the server finally resumed booting, the gluster volume failed to mount
14) I tried to restart glusterd manually, still with the same results
15) I started node2; same situation on node2

------------------------

At this point I removed all gluster 3.6.1 packages and repeated the above, installing the 3.5.2 packages instead. With 3.5.2 I found no problem whatsoever: glusterd does not hang and the filesystem gets mounted, no matter whether one node is started at a time or both at the same time. If I decide to leave node2 shut down, when I boot it, even after days, it re-synchronizes and everything works fine. I would appreciate it if this issue could be fixed in release 3.6.1, so that the use cases working perfectly with releases up to 3.5.2 remain supported.
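For reference, steps 3 through 7 above correspond roughly to the following commands (a sketch, not a literal transcript: the device path /dev/sdX, the brick directories, and the volume name gv_home are placeholders standing in for my actual values):

```shell
# Sketch of steps 3-7 above; /dev/sdX, the brick paths, and the volume name
# are placeholders. Wrapped in a function so nothing runs when sourced.
setup_replica_volume() {
    mkfs.xfs -f -i size=512 /dev/sdX                  # step 3: brick filesystem
    chkconfig glusterd on                             # step 4: enable at boot
    service glusterd start                            # step 4: start the daemon
    gluster volume create gv_home replica 2 \
        node1:/brick1/gv node2:/brick1/gv             # step 5: per the QuickStart
    gluster volume start gv_home                      # step 6
    gluster volume set gv_home nfs.disable on         # step 7: avoid NFS clash
}
```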
I also recommend implementing full regression testing before each new release. THANK YOU.
Marking the version as 3.6.0, as the 3.6.1 tag is not available.
Mauro, thanks for reporting the bug. It looks like the gluster volume goes offline after the upgrade. At the moment we are not sure what caused it. It would be helpful if you could provide us the gluster logs (/var/log/glusterfs/) from your setup. Meanwhile we are trying to reproduce this issue.
Also, which protocol are you using to mount the volume on the client, i.e. fuse, nfs or smb?
Thank you for looking into this issue. I no longer have the logs because, in the haste of restoring service, I did not make a backup of the /var gluster directories, and I recovered the data by accessing the brick/volume directory directly on the most up-to-date node. However, I was able to reproduce the issue after I completely deleted the glusterfs packages and the /var directories and re-installed the packages as explained above.

To mount the filesystem I put an entry in /etc/fstab as specified by the on-line manual:

node1:/gv_home /home glusterfs defaults,_netdev 0 0

I am not sure which protocol this would use. Is it fuse?

After re-installing 3.5.2 and recreating the replicated volume, everything is working fine again. I also modified my glusterfs-epel.repo path to pick up only 3.5 updates and thus ignore 3.6.

My current configuration (use case) is a 2-node replica/mirror with node2 normally turned off. When I turn on node2, the volume is synchronized automatically and I can then use node2 as a replacement for node1, which can then be turned off. This works very well up to release 3.5.2. I believe this can be easily reproduced as explained in my second message.
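To answer my own question as far as I understand it: an fstab entry with filesystem type "glusterfs" is handled by mount.glusterfs and uses the FUSE client (not NFS or SMB). A quick way to confirm on a system where the volume is mounted (a sketch; only the function definition, nothing runs until it is called):

```shell
# The fstab entry from above; the "glusterfs" type means the FUSE client:
#   node1:/gv_home  /home  glusterfs  defaults,_netdev  0 0
# On a system with the volume mounted, the kernel reports the filesystem
# type as fuse.glusterfs.
check_fuse_mount() {
    grep fuse.glusterfs /proc/mounts
}
```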
I think that when I had 3.6.1 and, after the reboot, the volume was no longer mountable, I noticed it was in "not started" status, so I tried to start it, but that did not make any difference as it failed to start.
We tried to reproduce this issue as mentioned below but could not reproduce it. However, the recommended upgrade steps are as documented at http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.6

Server 1 : 3.5.2 rhsauto046.lab.eng.blr.redhat.com
Server 2 : 3.5.2 rhsauto057.lab.eng.blr.redhat.com

-- Server 1 --

[root@rhsauto046 yum.repos.d]# ps aux | grep gluster
root 5879 0.0 0.5 420208 20580 ? Ssl 15:51 0:00 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
root 5944 0.0 0.5 649980 21612 ? Ssl 16:31 0:00 /usr/sbin/glusterfsd -s rhsauto046.lab.eng.blr.redhat.com --volfile-id gv0.rhsauto046.lab.eng.blr.redhat.com.bricks-gv0 -p /var/lib/glusterd/vols/gv0/run/rhsauto046.lab.eng.blr.redhat.com-bricks-gv0.pid -S /var/run/ddf23bad40708c9856f322f3de0004ae.socket --brick-name /bricks/gv0 -l /var/log/glusterfs/bricks/bricks-gv0.log --xlator-option *-posix.glusterd-uuid=780164d4-15d1-4422-a4ff-9dc7483bbd27 --brick-port 49152 --xlator-option gv0-server.listen-port=49152
root 5958 0.0 1.2 317932 49668 ? Ssl 16:31 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/5df21af9684cc4c5cdc8d281c4c0dcde.socket
root 5962 0.0 0.6 335400 27448 ? Ssl 16:31 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/8255b1d2da6f936c5950eb2747fe20e7.socket --xlator-option *replicate*.node-uuid=780164d4-15d1-4422-a4ff-9dc7483bbd27
root 6046 0.0 0.0 103252 804 pts/1 S+ 17:05 0:00 grep gluster

[root@rhsauto046 yum.repos.d]# gluster v i

Volume Name: gv0
Type: Replicate
Volume ID: 48567fcf-7b41-4906-bd92-a3c52bb2a135
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhsauto057.lab.eng.blr.redhat.com:/bricks/gv0
Brick2: rhsauto046.lab.eng.blr.redhat.com:/bricks/gv0
--/snip--

Did an online upgrade (# yum update glusterfs) to the 3.6.1 rpms.
After the upgrade:

--/snip--
[root@rhsauto046 yum.repos.d]# rpm -qa | grep gluster
glusterfs-api-3.6.1-1.el6.x86_64
glusterfs-libs-3.6.1-1.el6.x86_64
glusterfs-cli-3.6.1-1.el6.x86_64
glusterfs-3.6.1-1.el6.x86_64
glusterfs-server-3.6.1-1.el6.x86_64
glusterfs-fuse-3.6.1-1.el6.x86_64

[root@rhsauto046 yum.repos.d]# ps aux | grep gluster
root 6145 4.1 0.4 440076 16628 ? Ssl 17:06 0:00 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
root 6158 0.2 0.5 609700 21400 ? Ssl 17:06 0:00 /usr/sbin/glusterfsd -s rhsauto046.lab.eng.blr.redhat.com --volfile-id gv0.rhsauto046.lab.eng.blr.redhat.com.bricks-gv0 -p /var/lib/glusterd/vols/gv0/run/rhsauto046.lab.eng.blr.redhat.com-bricks-gv0.pid -S /var/run/ddf23bad40708c9856f322f3de0004ae.socket --brick-name /bricks/gv0 -l /var/log/glusterfs/bricks/bricks-gv0.log --xlator-option *-posix.glusterd-uuid=780164d4-15d1-4422-a4ff-9dc7483bbd27 --brick-port 49152 --xlator-option gv0-server.listen-port=49152
root 6169 1.0 1.3 408532 53588 ? Ssl 17:06 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/5df21af9684cc4c5cdc8d281c4c0dcde.socket
root 6176 1.0 0.4 442376 19304 ? Ssl 17:06 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/8255b1d2da6f936c5950eb2747fe20e7.socket --xlator-option *replicate*.node-uuid=780164d4-15d1-4422-a4ff-9dc7483bbd27
root 6193 0.0 0.0 103252 808 pts/1 S+ 17:06 0:00 grep gluster

[root@rhsauto046 yum.repos.d]# gluster v info

Volume Name: gv0
Type: Replicate
Volume ID: 48567fcf-7b41-4906-bd92-a3c52bb2a135
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhsauto057.lab.eng.blr.redhat.com:/bricks/gv0
Brick2: rhsauto046.lab.eng.blr.redhat.com:/bricks/gv0
--/snip--

The volume was mountable and accessible from the client, and "gluster volume info" showed "Started" even after the online upgrade.
We did this for both servers. Please note that the proper upgrade procedure should be as mentioned earlier in the doc.
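For clarity, a rough sketch of the upgrade flow (a paraphrase, not a substitute for the linked Upgrade_to_3.6 guide; the package glob and SysV service commands assume the EL6 layout used in this report):

```shell
# Paraphrased upgrade flow (see the Upgrade_to_3.6 link for the real steps);
# defined as a function so it is not executed as-is.
offline_upgrade() {
    service glusterd stop            # stop the management daemon
    pkill gluster || true            # stop remaining brick/NFS/self-heal processes
    yum -y update 'glusterfs*'       # pull the 3.6.1 rpms from glusterfs-epel.repo
    service glusterd start           # restart; bricks respawn from the volfiles
}
```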
Did you disable nfs? Did you shut down both nodes and start only one after the upgrade?
As in my second message, I reproduced the issue even after re-installing 3.6.1 from scratch, without involving an upgrade. I noticed that you use localhost for mounting the volume. In my case the nodes are multihomed with multiple network cards, and I use the FQDN associated with the network interface card of the local network sharing the gluster volume. I will try 3.6.1 again, and this time I will post the logs.
I have now re-installed 3.6.1 for the purpose of providing more information. Here are the steps followed:

1) Removed all 3.5.2 packages and removed /var/lib/glusterd and /var/log/glusterfs to prepare for a fresh install
2) Installed the 3.6.1 packages:
glusterfs-3.6.1-1.el6.x86_64
glusterfs-api-3.6.1-1.el6.x86_64
glusterfs-cli-3.6.1-1.el6.x86_64
glusterfs-fuse-3.6.1-1.el6.x86_64
glusterfs-geo-replication-3.6.1-1.el6.x86_64
glusterfs-libs-3.6.1-1.el6.x86_64
glusterfs-rdma-3.6.1-1.el6.x86_64
glusterfs-server-3.6.1-1.el6.x86_64
and started the daemon using: service glusterd start
3) Created a new directory /brick1/gv1 on the two physical nodes (node1 and node2)
4) gluster volume create gv1 replica 2 node1:/brick1/gv1 node2:/brick1/gv1
5) gluster volume start gv1
6) gluster volume set gv1 nfs.disable on
7) mkdir /mnt/gv1
8) mount -t glusterfs node1:/gv1 /mnt/gv1 (on both nodes)
9) Copied some data into /mnt/gv1; replication works on both nodes
10) Now shut down the glusterfs services on both nodes (this is quicker than a full reboot as in message 2 and has the same effect):
umount /mnt/gv1
service glusterfsd stop [I do not know how this got started; I didn't start it]
service glusterd stop
killed the last remaining glusterfs process that did not want to die
11) On node1 only:

# service glusterd start
[OK]
# gluster volume info

Volume Name: gv1
Type: Replicate
Volume ID: 5de2ebc7-b4d6-44c7-8137-211caa286e87
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: node1:/brick1/gv1
Brick2: node2:/brick1/gv1
Options Reconfigured:
nfs.disable: on

# gluster volume status
Status of volume: gv1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick sirius:/brick1/gv1                                N/A     N       N/A
Self-heal Daemon on localhost                           N/A     N       N/A

Task Status of Volume gv1
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume start gv1
volume start: gv1: failed: Volume gv1 already started

At this point the volume is not mountable. The same happens when I reboot, except that it takes several minutes for glusterd to start. I will now revert back to 3.5.2 as I have only these servers. I will keep the logs; please let me know which ones you want and how to get them to you. Thank you.
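For anyone scripting around this failure, the broken state can be detected by parsing the Online column of "gluster volume status". A small sketch, run here against sample lines taken from this report rather than a live cluster:

```shell
# Sample "gluster volume status" lines from this report: the brick's Online
# column reads "N" even though the volume's Status is "Started".
sample='Brick sirius:/brick1/gv1 N/A N N/A
Self-heal Daemon on localhost N/A N N/A'

# Count brick lines whose second-to-last field (the Online column) is "N".
# On a live node the input would come from: gluster volume status gv1
offline=$(printf '%s\n' "$sample" | awk '/^Brick/ && $(NF-1) == "N" { n++ } END { print n+0 }')
echo "offline bricks: $offline"    # prints "offline bricks: 1"
```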
Are there any updates on this blocking issue for v3.6.1? Did you want the logs?
Still running glusterfs 3.5.3. Has version 3.6 been fixed yet?
Hello, I've got exactly the same problem on Red Hat 7 with Gluster 3.6.3. With a replicated volume between two nodes, it's fine if both are up. If one is down, after rebooting the second, the volume is not mounted: the volume is started but offline.

# gluster v i

Volume Name: data-sync
Type: Replicate
Volume ID: 32813e8c-5c58-4b48-b872-ac792cdc4505
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: host1:/gluster
Brick2: host2:/gluster

# gluster v status
Status of volume: data-sync
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick host2:/gluster                                    N/A     N       N/A
NFS Server on localhost                                 N/A     N       N/A
Self-heal Daemon on localhost                           N/A     N       N/A

Task Status of Volume data-sync
------------------------------------------------------------------------------
There are no active volume tasks

The only solution is not restarting the service, but stopping and starting the volume. Any idea how to fix it?
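The workaround just described (bounce the volume, not the service) could be sketched as follows; the volume name data-sync comes from this comment, and the function is only defined, not executed:

```shell
# Workaround sketch for the started-but-offline state after a reboot:
# restarting glusterd does not help, but stopping and starting the volume does.
# Note: "gluster volume stop" normally asks for interactive confirmation.
bounce_volume() {
    gluster volume stop data-sync
    gluster volume start data-sync
}
```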
Hello Nicholas, the only solution I found to date was to downgrade to 3.5. It would be good to have an update from the developers.
I have now upgraded to release 3.7.4 and I found the issue is resolved in this version.
Closing based on https://bugzilla.redhat.com/show_bug.cgi?id=1161893#c15