Created attachment 589746 [details]
glusterd logs

Description of problem:
-----------------------
The gluster CLI reported "operation failed" for the command "gluster volume heal vol info" during the self-heal of 50K files.

create_files.sh (creates 50K files of 1M-10M each; ~270GB of disk space required):
-----------------
#!/bin/bash
path=$1
count_value=1
for i in {1..50000}; do
    if [ $count_value -gt 10 ]; then
        count_value=1
    fi
    echo -e "Creating File : file.$i\n"
    dd if=/dev/urandom of=$path/file.$i bs=1M count=$count_value
    echo -e "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
    let "count_value = $count_value + 1"
done

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
3.3.0qa45

Steps to Reproduce:
---------------------
1. Create a distribute-replicate volume (3x3).
2. Bring down 2 bricks from each replicate sub-volume.
3. Create a fuse mount.
4. On the mount, execute the script "create_files.sh".
5. Once the script has run to completion, bring the bricks back online ("gluster v start <vol_name> force").
6. On a storage node, execute:
   a. gluster v heal vol full
      Heal operation on volume vol has been successful

Actual Results:
------------------
[06/06/12 - 02:05:48 root@AFR-Server1 ~]# gluster v heal vol info
[06/06/12 - 02:09:31 root@AFR-Server1 ~]# echo $?
130
[06/06/12 - 02:09:34 root@AFR-Server1 ~]# gluster v heal vol info
operation failed
[06/06/12 - 02:11:17 root@AFR-Server1 ~]# gluster v heal vol info healed
operation failed
[06/06/12 - 02:11:28 root@AFR-Server1 ~]# gluster v heal vol info heal-failed
operation failed
[06/06/12 - 02:11:45 root@AFR-Server1 ~]# gluster v heal vol info heal-failed
operation failed

Additional info:
---------------
[06/06/12 - 02:18:58 root@AFR-Server1 ~]# gluster v info

Volume Name: vol
Type: Distributed-Replicate
Volume ID: b2f7f458-598e-456f-af7c-aa5af0036393
Status: Started
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.16.159.184:/export_b1/dir1
Brick2: 10.16.159.188:/export_b1/dir1
Brick3: 10.16.159.196:/export_b1/dir1
Brick4: 10.16.159.184:/export_c1/dir1
Brick5: 10.16.159.188:/export_c1/dir1
Brick6: 10.16.159.196:/export_c1/dir1
Brick7: 10.16.159.184:/export_d1/dir1
Brick8: 10.16.159.188:/export_d1/dir1
Brick9: 10.16.159.196:/export_d1/dir1
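Since "gluster v heal vol info" fails intermittently (exit code 130 above, then "operation failed" on later attempts), a small retry wrapper can help distinguish a transient failure from a persistent one when scripting around this bug. This is an illustrative sketch, not part of the gluster tooling; the function name `retry_cmd` is hypothetical:

```shell
#!/bin/bash
# retry_cmd: run a command up to $1 times, sleeping $2 seconds between
# attempts; returns 0 on first success, else the last exit status.
# Hypothetical helper, not a gluster utility.
retry_cmd() {
    local max=$1 delay=$2
    shift 2
    local rc=1
    for ((i = 1; i <= max; i++)); do
        "$@" && return 0
        rc=$?
        echo "attempt $i failed (exit $rc), retrying in ${delay}s" >&2
        sleep "$delay"
    done
    return "$rc"
}

# Example use against the failing command from the transcript above:
# retry_cmd 5 10 gluster v heal vol info
```

A persistent non-zero status after several attempts suggests the failure is not just a momentary glusterd lock collision.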
[06/12/12 - 07:55:15 root@AFR-Server1 ~]# gluster v heal vol info
[06/12/12 - 07:57:20 root@AFR-Server1 ~]# echo $?
110
[06/12/12 - 07:57:26 root@AFR-Server1 ~]# gluster v heal vol info
operation failed
[06/12/12 - 07:57:29 root@AFR-Server1 ~]# gluster v heal vol info healed
operation failed
[06/12/12 - 07:57:31 root@AFR-Server1 ~]# ps -ef | grep gluster
root 14466     1  0 02:29 ?        00:00:01 glusterd
root 14519     1 41 02:29 ?        02:14:36 /usr/local/sbin/glusterfsd -s localhost --volfile-id vol.10.16.159.184.export_b1-dir1 -p /var/lib/glusterd/vols/vol/run/10.16.159.184-export_b1-dir1.pid -S /tmp/74414b0b8aa5135f694fce1fc36d154b.socket --brick-name /export_b1/dir1 -l /usr/local/var/log/glusterfs/bricks/export_b1-dir1.log --xlator-option *-posix.glusterd-uuid=1f9181c2-44c4-475f-9554-5446f94a64b1 --brick-port 24009 --xlator-option vol-server.listen-port=24009
root 15649     1  0 07:55 ?        00:00:00 /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /usr/local/var/log/glusterfs/nfs.log -S /tmp/93490882b2b0ff4340fcd061ab2749fe.socket
root 15655     1  7 07:55 ?        00:00:10 /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /usr/local/var/log/glusterfs/glustershd.log -S /tmp/2aa8553740d2e5cbc70758ce1064a8c0.socket --xlator-option *replicate*.node-uuid=1f9181c2-44c4-475f-9554-5446f94a64b1
root 15700 11867  0 07:57 pts/0    00:00:00 grep gluster
[06/12/12 - 07:57:35 root@AFR-Server1 ~]#
[06/12/12 - 07:57:48 root@AFR-Server1 ~]# gluster v heal vol info healed
operation failed
[06/12/12 - 07:57:52 root@AFR-Server1 ~]# gluster system:: fsm log
number of transitions: 50
Old State: [Unlock sent]  New State: [Default]  Event: [GD_OP_EVENT_ALL_ACC]  timestamp: [2012-06-12 07:55:06]
Old State: [Default]  New State: [Lock sent]  Event: [GD_OP_EVENT_START_LOCK]  timestamp: [2012-06-12 07:55:08]
Old State: [Lock sent]  New State: [Lock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Lock sent]  New State: [Lock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Lock sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_ALL_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Stage op sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Stage op sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Stage op sent]  New State: [Brick op sent]  Event: [GD_OP_EVENT_STAGE_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Brick op sent]  New State: [Commit op sent]  Event: [GD_OP_EVENT_ALL_ACK]  timestamp: [2012-06-12 07:55:08]
Old State: [Commit op sent]  New State: [Commit op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Commit op sent]  New State: [Commit op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Commit op sent]  New State: [Unlock sent]  Event: [GD_OP_EVENT_COMMIT_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Unlock sent]  New State: [Unlock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Unlock sent]  New State: [Unlock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Unlock sent]  New State: [Default]  Event: [GD_OP_EVENT_ALL_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Default]  New State: [Lock sent]  Event: [GD_OP_EVENT_START_LOCK]  timestamp: [2012-06-12 07:55:08]
Old State: [Lock sent]  New State: [Lock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Lock sent]  New State: [Lock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Lock sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_ALL_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Stage op sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Stage op sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Stage op sent]  New State: [Brick op sent]  Event: [GD_OP_EVENT_STAGE_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Brick op sent]  New State: [Commit op sent]  Event: [GD_OP_EVENT_ALL_ACK]  timestamp: [2012-06-12 07:55:08]
Old State: [Commit op sent]  New State: [Commit op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Commit op sent]  New State: [Commit op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Commit op sent]  New State: [Unlock sent]  Event: [GD_OP_EVENT_COMMIT_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Unlock sent]  New State: [Unlock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Unlock sent]  New State: [Unlock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Unlock sent]  New State: [Default]  Event: [GD_OP_EVENT_ALL_ACC]  timestamp: [2012-06-12 07:55:08]
Old State: [Default]  New State: [Lock sent]  Event: [GD_OP_EVENT_START_LOCK]  timestamp: [2012-06-12 07:55:15]
Old State: [Lock sent]  New State: [Lock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Lock sent]  New State: [Lock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Lock sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_ALL_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Stage op sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Stage op sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Stage op sent]  New State: [Brick op sent]  Event: [GD_OP_EVENT_STAGE_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Brick op sent]  New State: [Commit op sent]  Event: [GD_OP_EVENT_ALL_ACK]  timestamp: [2012-06-12 07:55:15]
Old State: [Commit op sent]  New State: [Commit op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Commit op sent]  New State: [Commit op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Commit op sent]  New State: [Unlock sent]  Event: [GD_OP_EVENT_COMMIT_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Unlock sent]  New State: [Unlock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Unlock sent]  New State: [Unlock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Unlock sent]  New State: [Default]  Event: [GD_OP_EVENT_ALL_ACC]  timestamp: [2012-06-12 07:55:15]
Old State: [Default]  New State: [Lock sent]  Event: [GD_OP_EVENT_START_LOCK]  timestamp: [2012-06-12 07:55:20]
Old State: [Lock sent]  New State: [Lock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:20]
Old State: [Lock sent]  New State: [Lock sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:20]
Old State: [Lock sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_ALL_ACC]  timestamp: [2012-06-12 07:55:20]
Old State: [Stage op sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:20]
Old State: [Stage op sent]  New State: [Stage op sent]  Event: [GD_OP_EVENT_RCVD_ACC]  timestamp: [2012-06-12 07:55:20]
Old State: [Stage op sent]  New State: [Brick op sent]  Event: [GD_OP_EVENT_STAGE_ACC]  timestamp: [2012-06-12 07:55:20]
Created attachment 591366 [details] statedump of glustershd process
Created attachment 591369 [details] statedump of glustershd process
*** Bug 852191 has been marked as a duplicate of this bug. ***
My bug got marked as a duplicate. My use case involves 3-5 million files. Is this being actively worked on? Any idea whether a fix will land before 3.4 (or during 3.4)? I'm comfortable building from git if needed.

Thanks,
Rob
I've built from the latest git commit (a032de191ec32be363d1feedfbd839f6dbde2579) and can still reproduce this. Let me know how I can help.

Thanks,
Rob
Logs from 3.3.1 (from single server that is healing, quorum 2 (of 3)) when healing stops:

[2013-01-28 09:31:52.652900] I [client-handshake.c:453:client_set_lk_version_cbk] 0-newdata-client-0: Server lk version = 1
[2013-01-28 09:31:52.653354] I [client-handshake.c:453:client_set_lk_version_cbk] 0-newdata-client-1: Server lk version = 1
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-01-28 09:40:15
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0)[0x7f76254024a0]
/lib/x86_64-linux-gnu/libc.so.6(+0x129024)[0x7f76254f5024]
/lib/x86_64-linux-gnu/libc.so.6(xdr_string+0x116)[0x7f76254f4886]
/usr/lib/libgfxdr.so.0(xdr_gd1_mgmt_brick_op_req+0x26)[0x7f76259b38e6]
/usr/lib/libgfxdr.so.0(xdr_to_generic+0x44)[0x7f76259ae494]
/usr/sbin/glusterfs(glusterfs_handle_translator_op+0xb0)[0x7f762627b0c0]
/usr/lib/libglusterfs.so.0(synctask_wrap+0x12)[0x7f7625e1cd92]
/lib/x86_64-linux-gnu/libc.so.6(+0x48650)[0x7f7625414650]
I am seeing volume heal commands fail occasionally on my 6-node 3x2 setups. Here is an example from heal info split-brain:

# gluster volume heal healtest info split-brain
operation failed
# echo $?
255

This happened at just about 20:32:07. In etc-glusterfs-glusterd.vol.log I see:

[2013-02-27 20:32:08.081243] E [glusterd-utils.c:278:glusterd_lock] 0-glusterd: Unable to get lock for uuid: c3538fdf-c16e-4eaf-9650-6cf0caf7478b, lock held by: 647b8ff3-8110-4eab-956b-df78235fb192
[2013-02-27 20:32:08.081254] E [glusterd-handler.c:470:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1

I'll attach the "gluster system:: fsm log" output next time I run into this.
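The log excerpt above shows the failure mode: the cluster-wide glusterd lock is held by another peer's transaction, so the heal command's own transaction cannot start. To see how often this happens and which peer holds the lock, the glusterd log can be summarized. A minimal sketch; the function name `lock_holders` is illustrative, and the log-line format is taken from the excerpt above and may differ between builds:

```shell
#!/bin/bash
# lock_holders: count, per holder UUID, how many "Unable to get lock"
# errors appear in a glusterd log file ($1). Illustrative helper, not
# part of gluster; the message format matches the excerpt above.
lock_holders() {
    grep 'Unable to get lock for uuid' "$1" |
        sed -n 's/.*lock held by: \([0-9a-f-]*\).*/\1/p' |
        sort | uniq -c | sort -rn
}

# Example (log path as seen on this setup; adjust to your install):
# lock_holders /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
```

If one UUID dominates the counts, that peer is repeatedly holding the lock when heal commands run, which matches the intermittent "operation failed" behavior reported here.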
"gluster volume heal <volname> info" has been rewritten to handle a large number of files needing self-heal. Moving it to ON_QA.