Description of problem:
Volume information is out of sync in the cluster.

Setup:
Cluster formed from four nodes:
rhs-client6.lab.eng.blr.redhat.com
rhs-client7.lab.eng.blr.redhat.com
rhs-client8.lab.eng.blr.redhat.com
rhs-client9.lab.eng.blr.redhat.com

"gluster volume info <volume-name>" on rhs-client6, rhs-client8 and rhs-client9 shows the information for <volume-name>, but on rhs-client7 the volume does not exist.

Note: The volume was deleted from the cluster, but rhs-client6, rhs-client8 and rhs-client9 were not updated about the deletion of the volume.

Version-Release number of selected component (if applicable):
=============================================================
[10/12/12 - 12:39:51 root@rhs-client6 ~]# gluster --version
glusterfs 3.3.0rhsvirt1 built on Oct 8 2012 15:23:00
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[10/12/12 - 12:39:54 root@rhs-client6 ~]#

[10/12/12 - 12:39:28 root@rhs-client6 ~]# rpm -qa | grep gluster
glusterfs-geo-replication-3.3.0rhsvirt1-7.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-7.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-fuse-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-7.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch

Actual results:
===============
Client-6:
=========
[10/12/12 - 12:42:01 root@rhs-client6 ~]# gluster volume info replicate-rhevh

Volume Name: replicate-rhevh
Type: Replicate
Volume ID: 89b7b672-63c3-41c9-a4db-6939e3f20f3c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhs-client6.lab.eng.blr.redhat.com:/disk2
Brick2: rhs-client7.lab.eng.blr.redhat.com:/disk2

[10/12/12 - 12:42:14 root@rhs-client6 ~]# gluster volume info | grep replicate-rhevh
Volume Name: replicate-rhevh
Volume Name: replicate-rhevh2

[10/12/12 - 12:43:22 root@rhs-client6 ~]# gluster volume status replicate-rhevh
Volume replicate-rhevh does not exist

[10/12/12 - 12:50:52 root@rhs-client6 tmp]# ps -eaf | grep glusterfsd | grep replicate
root 18307 1 0 12:05 ? 00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id replicate-rhevh.rhs-client6.lab.eng.blr.redhat.com.disk2 -p /var/lib/glusterd/vols/replicate-rhevh/run/rhs-client6.lab.eng.blr.redhat.com-disk2.pid -S /tmp/e8a7646941763241021808ad1c937947.socket --brick-name /disk2 -l /var/log/glusterfs/bricks/disk2.log --xlator-option *-posix.glusterd-uuid=9a3167f5-6050-4291-bdc5-96be6ee740c4 --brick-port 24014 --xlator-option replicate-rhevh-server.listen-port=24014
root 18313 1 0 12:05 ? 00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id replicate-rhevh2.rhs-client6.lab.eng.blr.redhat.com.replicate-disk -p /var/lib/glusterd/vols/replicate-rhevh2/run/rhs-client6.lab.eng.blr.redhat.com-replicate-disk.pid -S /tmp/37182cbe060e1cee60a593871c8ad75c.socket --brick-name /replicate-disk -l /var/log/glusterfs/bricks/replicate-disk.log --xlator-option *-posix.glusterd-uuid=9a3167f5-6050-4291-bdc5-96be6ee740c4 --brick-port 24011 --xlator-option replicate-rhevh2-server.listen-port=24011
[10/12/12 - 12:50:56 root@rhs-client6 tmp]#

Client-7:
=========
[10/12/12 - 12:41:09 root@rhs-client7 ~]# gluster volume info replicate-rhevh
Volume replicate-rhevh does not exist

[10/12/12 - 12:41:22 root@rhs-client7 ~]# gluster volume info | grep replicate-rhevh
Volume Name: replicate-rhevh2

[10/12/12 - 12:42:34 root@rhs-client7 ~]# gluster volume status replicate-rhevh
Volume replicate-rhevh does not exist

[10/12/12 - 12:50:00 root@rhs-client7 tar]# ps -eaf | grep glusterfsd | grep replicate
root 24995 1 0 12:05 ? 00:00:02 /usr/sbin/glusterfsd -s localhost --volfile-id replicate-rhevh2.rhs-client7.lab.eng.blr.redhat.com.replicate-disk -p /var/lib/glusterd/vols/replicate-rhevh2/run/rhs-client7.lab.eng.blr.redhat.com-replicate-disk.pid -S /tmp/34ce168cca1ffd0f64c69b974431b3a4.socket --brick-name /replicate-disk -l /var/log/glusterfs/bricks/replicate-disk.log --xlator-option *-posix.glusterd-uuid=b9d6cb21-051f-4791-9476-734856e77fbf --brick-port 24013 --xlator-option replicate-rhevh2-server.listen-port=24013
[10/12/12 - 12:50:03 root@rhs-client7 tar]#

Note: Client-8 and Client-9 have the same information as Client-6.
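One quick way to confirm the out-of-sync state is to compare each node's persisted volume store under /var/lib/glusterd/vols. This is only a sketch, assuming passwordless ssh as root to the four nodes in this setup; it is not part of the original reproduction steps.

# List the persisted volume directories on every peer; a volume directory
# present on some nodes but missing on others indicates stale glusterd state.
for node in rhs-client6 rhs-client7 rhs-client8 rhs-client9; do
    echo "== ${node}.lab.eng.blr.redhat.com =="
    ssh root@${node}.lab.eng.blr.redhat.com 'ls /var/lib/glusterd/vols/'
done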
Created attachment 625839 [details] sosreports and /var/lib/glusterd directory files
This is a general behavior/bug in glusterd; it is not specific to 2.0+.
*** Bug 865406 has been marked as a duplicate of this bug. ***
http://review.gluster.org/4188
Marking ON_QA for the rhs-2.1.0 flag. Let us know which RHS 2.0.z update needs this fix.
Updating summary since this is a general bug.
Verified the fix on the build:
==============================
glusterfs 3.4.0.22rhs built on Aug 23 2013 01:58:42

"volume sync" command scenarios:
========================================
a. gluster volume sync <hostname>
b. gluster volume sync <hostname> all
c. gluster volume sync <hostname> <volume_name>
d. gluster volume sync <localhost>

The following cases test the above 4 scenarios.

==============================================================================
Case 1: (1 x 2 replicate volumes, 2 storage nodes)
==============================================================================
1. Start glusterd on both storage nodes.

2. Peer probe storage_node2 (from storage_node1).

3. On storage_node1 execute:

for i in `seq 1 5`; do
    gluster v create "vol_rep_$i" replica 2 storage_node1:/rhs/bricks/vol_rep_${i}_b0 storage_node2:/rhs/bricks/vol_rep_${i}_b1 --mode=script
    gluster v set "vol_rep_$i" self-heal-daemon off
    gluster v info "vol_rep_$i"
    gluster v start "vol_rep_$i"
done

4. killall glusterfsd glusterd glusterfs on storage_node2.

5. On storage_node1 execute:

rm -rf /rhs/bricks/*

for i in `seq 1 5`; do
    gluster v stop vol_rep_${i} --mode=script
    gluster v delete vol_rep_${i} --mode=script
    gluster v create vol_rep_${i} replica 2 storage_node1:/rhs/bricks/vol_rep_${i}_b0 storage_node1:/rhs/bricks/vol_rep_${i}_b1 --mode=script
    gluster v set vol_rep_${i} self-heal-daemon on
    gluster v info vol_rep_${i}
    gluster v start vol_rep_${i}
done

6. Restart glusterd on storage_node2 (service glusterd start).

7. Check the peer status from both nodes. (Both nodes will be in "Peer Rejected" state for each other.)

8. From storage_node2 execute:
+++++++++++++++++++++++++++++++++++
a. "gluster volume sync <storage_node1> vol_rep_1"
   Expected: volume 'vol_rep_1' information should be synced from storage_node1 to storage_node2.
   Actual: as expected

b. "gluster volume sync <storage_node1> all"
   Expected: all volumes' information should be synced from storage_node1 to storage_node2.
   Actual: as expected

c. "gluster volume sync <storage_node2>"
   Expected: "volume sync: failed: sync from localhost not allowed"
   Actual: as expected

==============================================================================
Case 2: (1 x 2 replicate volumes, 2 storage nodes)
==============================================================================
1. Start glusterd on both storage nodes.

2. Peer probe storage_node2 (from storage_node1).

3. On storage_node1 execute:

for i in `seq 1 5`; do
    gluster v create "vol_rep_$i" replica 2 storage_node1:/rhs/bricks/vol_rep_${i}_b0 storage_node2:/rhs/bricks/vol_rep_${i}_b1 --mode=script
    gluster v set "vol_rep_$i" self-heal-daemon off
    gluster v info "vol_rep_$i"
    gluster v start "vol_rep_$i"
done

4. killall glusterfsd glusterd glusterfs on storage_node2.

5. On storage_node1 execute:

rm -rf /rhs/bricks/*

for i in `seq 1 5`; do
    gluster v stop vol_rep_${i} --mode=script
    gluster v delete vol_rep_${i} --mode=script
    gluster v create vol_rep_${i} replica 2 storage_node1:/rhs/bricks/vol_rep_${i}_b0 storage_node1:/rhs/bricks/vol_rep_${i}_b1 --mode=script
    gluster v set vol_rep_${i} self-heal-daemon on
    gluster v info vol_rep_${i}
    gluster v start vol_rep_${i}
done

6. Restart glusterd on storage_node2 (service glusterd start).

7. Check the peer status from both nodes. (Both nodes will be in "Peer Rejected" state for each other.)

8. From storage_node1 execute:
++++++++++++++++++++++++++++++++++++
a. "gluster volume sync <storage_node2> vol_rep_1"
   Expected: volume 'vol_rep_1' information should be synced from storage_node2 to storage_node1.
   Actual: as expected

b. "gluster volume sync <storage_node2>"
   Expected: all volumes' information should be synced from storage_node2 to storage_node1.
   Actual: as expected

==============================================================================
Note:
==============================================================================
The above 2 cases verify this bug. However, the peers remain in "Peer Rejected" state and the volumes are not restarted. That issue is tracked in bug https://bugzilla.redhat.com/show_bug.cgi?id=865700

Moving this bug from ON_QA to Verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html