
Bug 829170

Summary: gluster cli returning "operation failed" output for command "gluster v heal vol info"
Product: [Community] GlusterFS
Reporter: Shwetha Panduranga <shwetha.h.panduranga>
Component: glusterd
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: 3.3-beta
CC: bturner, gluster-bugs, laurent.chouinard, mailbox, ndevos, Rob.Hendelman, rwheeler
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.4.0.53rhs
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 852294 (view as bug list)
Environment:
Last Closed: 2014-07-11 11:10:23 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Bug Depends On:
Bug Blocks: 852294, 858475
Attachments:
- glusterd logs (no flags)
- statedump of glustershd process (no flags)
- statedump of glustershd process (no flags)

Description Shwetha Panduranga 2012-06-06 02:57:26 EDT
Created attachment 589746 [details]
glusterd logs

Description of problem:
-----------------------
The gluster CLI reported "operation failed" for the command "gluster volume heal vol info" during the self-heal of 50K files.

create_files.sh (creates 50,000 files of 1MB-10MB each; ~270GB of disk space is required):
-----------------
#!/bin/bash
# Create 50,000 files under $1 whose sizes cycle through 1MB..10MB.
path=$1
count_value=1
for i in {1..50000}; do
	if [ "$count_value" -gt 10 ]; then
		count_value=1
	fi
	echo -e "Creating File : file.$i\n"
	dd if=/dev/urandom of="$path/file.$i" bs=1M count="$count_value"
	echo -e "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
	count_value=$((count_value + 1))
done
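
As a sanity check on the ~270GB figure: the script's sizes cycle 1..10 MB, so every 10 files consume 55 MB, and 50,000 files make 5,000 such cycles. A standalone sketch mirroring that cycling logic:

```shell
# Sum the file sizes produced by create_files.sh without writing anything.
total_mb=0
count_value=1
for i in $(seq 1 50000); do
    if [ "$count_value" -gt 10 ]; then
        count_value=1
    fi
    total_mb=$((total_mb + count_value))
    count_value=$((count_value + 1))
done
echo "Total: ${total_mb} MB (~$((total_mb / 1024)) GiB)"
# → Total: 275000 MB (~268 GiB)
```

That is 5,000 cycles x 55 MB = 275,000 MB, i.e. roughly the stated 270GB.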


Version-Release number of selected component (if applicable):
--------------------------------------------------------------
3.3.0qa45

Steps to Reproduce:
---------------------
1. Create a distribute-replicate volume (3x3).
2. Bring down 2 bricks from each replicate sub-volume.
3. Create a FUSE mount.
4. On the mount, execute the script "create_files.sh".
5. Once the script has completed, bring the bricks back up ("gluster v start <vol_name> force").
6. On a storage node, execute:
-------------------------
a. gluster v heal vol full
Heal operation on volume vol has been successful
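
The steps above can be sketched as a script. The hostnames (server1..server3) and the mount point are placeholders, and the brick paths match the volume info below; it needs a live GlusterFS deployment, so it skips itself when the CLI is absent:

```shell
#!/bin/bash
VOL=vol
if command -v gluster >/dev/null 2>&1; then
    # 1. Create a 3x3 distribute-replicate volume (replica sets are groups
    #    of 3 consecutive bricks) and start it.
    gluster volume create "$VOL" replica 3 \
        server1:/export_b1/dir1 server2:/export_b1/dir1 server3:/export_b1/dir1 \
        server1:/export_c1/dir1 server2:/export_c1/dir1 server3:/export_c1/dir1 \
        server1:/export_d1/dir1 server2:/export_d1/dir1 server3:/export_d1/dir1
    gluster volume start "$VOL"

    # 2. Bring down 2 bricks from each replica sub-volume, e.g. by killing
    #    the glusterfsd processes on server2 and server3.

    # 3-4. Create a FUSE mount and run the workload.
    mount -t glusterfs server1:/"$VOL" "/mnt/$VOL"
    ./create_files.sh "/mnt/$VOL"

    # 5. Bring the downed bricks back up.
    gluster volume start "$VOL" force

    # 6. Trigger a full self-heal, then query heal status (the command
    #    that fails in this report).
    gluster volume heal "$VOL" full
    gluster volume heal "$VOL" info
else
    echo "gluster CLI not found; skipping live reproduction"
fi
```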

Actual Result:
------------------
[06/06/12 - 02:05:48 root@AFR-Server1 ~]# gluster v heal vol  info
[06/06/12 - 02:09:31 root@AFR-Server1 ~]# echo $?
130
[06/06/12 - 02:09:34 root@AFR-Server1 ~]# gluster v heal vol  info
operation failed
[06/06/12 - 02:11:17 root@AFR-Server1 ~]# gluster v heal vol  info healed
operation failed
[06/06/12 - 02:11:28 root@AFR-Server1 ~]# gluster v heal vol  info heal-failed
operation failed

[06/06/12 - 02:11:45 root@AFR-Server1 ~]# gluster v heal vol  info heal-failed
operation failed
  
Additional info:
---------------

[06/06/12 - 02:18:58 root@AFR-Server1 ~]# gluster v info
 
Volume Name: vol
Type: Distributed-Replicate
Volume ID: b2f7f458-598e-456f-af7c-aa5af0036393
Status: Started
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.16.159.184:/export_b1/dir1
Brick2: 10.16.159.188:/export_b1/dir1
Brick3: 10.16.159.196:/export_b1/dir1
Brick4: 10.16.159.184:/export_c1/dir1
Brick5: 10.16.159.188:/export_c1/dir1
Brick6: 10.16.159.196:/export_c1/dir1
Brick7: 10.16.159.184:/export_d1/dir1
Brick8: 10.16.159.188:/export_d1/dir1
Brick9: 10.16.159.196:/export_d1/dir1
Comment 1 Shwetha Panduranga 2012-06-12 08:04:20 EDT
[06/12/12 - 07:55:15 root@AFR-Server1 ~]# gluster v heal vol info
[06/12/12 - 07:57:20 root@AFR-Server1 ~]# echo $?
110
[06/12/12 - 07:57:26 root@AFR-Server1 ~]# gluster v heal vol info
operation failed
[06/12/12 - 07:57:29 root@AFR-Server1 ~]# gluster v heal vol info healed
operation failed
[06/12/12 - 07:57:31 root@AFR-Server1 ~]# ps -ef | grep gluster
root     14466     1  0 02:29 ?        00:00:01 glusterd
root     14519     1 41 02:29 ?        02:14:36 /usr/local/sbin/glusterfsd -s localhost --volfile-id vol.10.16.159.184.export_b1-dir1 -p /var/lib/glusterd/vols/vol/run/10.16.159.184-export_b1-dir1.pid -S /tmp/74414b0b8aa5135f694fce1fc36d154b.socket --brick-name /export_b1/dir1 -l /usr/local/var/log/glusterfs/bricks/export_b1-dir1.log --xlator-option *-posix.glusterd-uuid=1f9181c2-44c4-475f-9554-5446f94a64b1 --brick-port 24009 --xlator-option vol-server.listen-port=24009
root     15649     1  0 07:55 ?        00:00:00 /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /usr/local/var/log/glusterfs/nfs.log -S /tmp/93490882b2b0ff4340fcd061ab2749fe.socket
root     15655     1  7 07:55 ?        00:00:10 /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /usr/local/var/log/glusterfs/glustershd.log -S /tmp/2aa8553740d2e5cbc70758ce1064a8c0.socket --xlator-option *replicate*.node-uuid=1f9181c2-44c4-475f-9554-5446f94a64b1
root     15700 11867  0 07:57 pts/0    00:00:00 grep gluster
[06/12/12 - 07:57:35 root@AFR-Server1 ~]# 
[06/12/12 - 07:57:48 root@AFR-Server1 ~]# gluster v heal vol info healed
operation failed
[06/12/12 - 07:57:52 root@AFR-Server1 ~]# gluster system:: fsm log
number of transitions: 50
Old State: [Unlock sent]
New State: [Default]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:06]

Old State: [Default]
New State: [Lock sent]
Event    : [GD_OP_EVENT_START_LOCK]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Brick op sent]
Event    : [GD_OP_EVENT_STAGE_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Brick op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_ALL_ACK]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_COMMIT_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Default]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Default]
New State: [Lock sent]
Event    : [GD_OP_EVENT_START_LOCK]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Brick op sent]
Event    : [GD_OP_EVENT_STAGE_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Brick op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_ALL_ACK]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_COMMIT_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Default]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Default]
New State: [Lock sent]
Event    : [GD_OP_EVENT_START_LOCK]
timestamp: [2012-06-12 07:55:15]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Lock sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Stage op sent]
New State: [Brick op sent]
Event    : [GD_OP_EVENT_STAGE_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Brick op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_ALL_ACK]
timestamp: [2012-06-12 07:55:15]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Commit op sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_COMMIT_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Unlock sent]
New State: [Default]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Default]
New State: [Lock sent]
Event    : [GD_OP_EVENT_START_LOCK]
timestamp: [2012-06-12 07:55:20]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Lock sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Stage op sent]
New State: [Brick op sent]
Event    : [GD_OP_EVENT_STAGE_ACC]
timestamp: [2012-06-12 07:55:20]
Comment 2 Shwetha Panduranga 2012-06-13 02:55:29 EDT
Created attachment 591366 [details]
statedump of glustershd process
Comment 3 Shwetha Panduranga 2012-06-13 02:58:12 EDT
Created attachment 591369 [details]
statedump of glustershd process
Comment 4 Pranith Kumar K 2012-08-29 02:47:45 EDT
*** Bug 852191 has been marked as a duplicate of this bug. ***
Comment 5 Rob.Hendelman 2012-08-29 08:28:11 EDT
My bug got marked as a duplicate. My use case has 3-5 million files. Is this being actively worked on? Any idea if a fix will land prior to 3.4 (or during 3.4)? I'm comfortable building from git if needed.

Thanks,

Rob
Comment 6 Rob.Hendelman 2012-09-11 10:04:26 EDT
I've built from the latest git commit, a032de191ec32be363d1feedfbd839f6dbde2579, and can still reproduce this. Let me know how I can help.

Thanks,

Rob
Comment 7 Rob.Hendelman 2013-01-28 11:23:21 EST
Logs from 3.3.1 (from single server that is healing, quorum 2 (of 3)) when healing stops:

[2013-01-28 09:31:52.652900] I [client-handshake.c:453:client_set_lk_version_cbk] 0-newdata-client-0: Server lk version = 1
[2013-01-28 09:31:52.653354] I [client-handshake.c:453:client_set_lk_version_cbk] 0-newdata-client-1: Server lk version = 1
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-01-28 09:40:15
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0)[0x7f76254024a0]
/lib/x86_64-linux-gnu/libc.so.6(+0x129024)[0x7f76254f5024]
/lib/x86_64-linux-gnu/libc.so.6(xdr_string+0x116)[0x7f76254f4886]
/usr/lib/libgfxdr.so.0(xdr_gd1_mgmt_brick_op_req+0x26)[0x7f76259b38e6]
/usr/lib/libgfxdr.so.0(xdr_to_generic+0x44)[0x7f76259ae494]
/usr/sbin/glusterfs(glusterfs_handle_translator_op+0xb0)[0x7f762627b0c0]
/usr/lib/libglusterfs.so.0(synctask_wrap+0x12)[0x7f7625e1cd92]
/lib/x86_64-linux-gnu/libc.so.6(+0x48650)[0x7f7625414650]
Comment 8 Ben Turner 2013-03-05 11:14:32 EST
I am seeing volume heal commands fail occasionally on my 6-node 3x2 setups. Here is an example from heal info split-brain:

# gluster volume heal healtest info split-brain
operation failed
# echo $?
255

This happened at just about 20:32:07.  In the etc-glusterfs-glusterd.vol.log I see:

[2013-02-27 20:32:08.081243] E [glusterd-utils.c:278:glusterd_lock] 0-glusterd: Unable to get lock for uuid: c3538fdf-c16e-4eaf-9650-6cf0caf7478b, lock held by: 647b8ff3-8110-4eab-956b-df78235fb192
[2013-02-27 20:32:08.081254] E [glusterd-handler.c:470:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1

I'll attach the "gluster system:: fsm log" output next time I run into this.
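
The "Unable to get lock" messages above suggest the failure here is transient: glusterd takes a cluster-wide lock per transaction, and heal info fails outright if another transaction holds it. A hedged workaround sketch (the `retry` helper and the `flaky` stub are illustrative, not part of gluster; in practice one would wrap `gluster v heal vol info`):

```shell
# retry: run a command up to N times with a delay between attempts, for
# commands that can fail transiently while another glusterd transaction
# holds the cluster lock.
retry() {
    local attempts=$1 delay=$2
    shift 2
    local i
    for i in $(seq 1 "$attempts"); do
        "$@" && return 0
        sleep "$delay"
    done
    return 1
}

# Demonstration with a stub that fails twice, then succeeds (standing in
# for "gluster v heal vol info" during lock contention):
tries=0
flaky() { tries=$((tries + 1)); [ "$tries" -ge 3 ]; }
retry 5 0 flaky && echo "succeeded after $tries attempts"
# → succeeded after 3 attempts
```

This only papers over contention; it does not help when heal info fails deterministically on large file counts, as in the original report.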
Comment 9 Pranith Kumar K 2014-01-03 09:01:51 EST
gluster volume heal <volname> info has been rewritten to handle a large number of files needing self-heal. Moving it to ON_QA.