Bug 829170 - gluster cli returning "operation failed" output for command "gluster v heal vol info"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.3-beta
Hardware: Unspecified
OS: Unspecified
Priority/Severity: medium / medium
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Duplicates: 852191
Depends On:
Blocks: 852294 858475
 
Reported: 2012-06-06 06:57 UTC by Shwetha Panduranga
Modified: 2014-07-11 15:10 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.4.0.53rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 852294
Environment:
Last Closed: 2014-07-11 15:10:23 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
glusterd logs (57.78 KB, text/x-log), 2012-06-06 06:57 UTC, Shwetha Panduranga
statedump of glustershd process (33.07 KB, application/octet-stream), 2012-06-13 06:55 UTC, Shwetha Panduranga
statedump of glustershd process (80.00 KB, application/x-tar), 2012-06-13 06:58 UTC, Shwetha Panduranga

Description Shwetha Panduranga 2012-06-06 06:57:26 UTC
Created attachment 589746 [details]
glusterd logs

Description of problem:
-----------------------
The gluster CLI reported "operation failed" for the command "gluster volume heal vol info" while self-heal of the 50K files was in progress.

create_files.sh (creates 50K files of 1MB-10MB each; roughly 270GB of disk space is required):
-----------------
#!/bin/bash

# Create 50,000 files under the given path, cycling the file size from 1MB to 10MB.
path=$1
count_value=1
for i in {1..50000};do
	if [ $count_value -gt 10 ]; then
		count_value=1
	fi
	echo -e "Creating File : file.$i\n"
	dd if=/dev/urandom of="$path/file.$i" bs=1M count=$count_value
	echo -e "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
	count_value=$(( count_value + 1 ))
done
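
The script takes the target directory as its only argument; for example (the mount point shown here is illustrative, assuming the fuse mount from the steps below):

bash create_files.sh /mnt/glusterfs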


Version-Release number of selected component (if applicable):
--------------------------------------------------------------
3.3.0qa45

Steps to Reproduce:
---------------------
1. Create a distribute-replicate volume (3x3).
2. Bring down 2 bricks from each replicate sub-volume.
3. Create a fuse mount.
4. On the mount, execute the script "create_files.sh".
5. Once the script has completed, bring the bricks back online ("gluster v start <vol_name> force").
6. On a storage node, execute (a consolidated shell sketch of steps 2-6 follows the output below):
-------------------------
a. gluster v heal vol full
Heal operation on volume vol has been successful
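
For reference, here is a minimal shell sketch of the reproduction sequence. It assumes the 3x3 distributed-replicate volume already exists; the volume name, mount point, and the way bricks are brought down are illustrative assumptions, not taken verbatim from this report.

reproduce_heal_info.sh (sketch):
-------------------------
#!/bin/bash
VOL=vol                 # assumed volume name
MNT=/mnt/glusterfs      # assumed fuse mount point

# Step 2: list the bricks, then bring down 2 bricks per replicate sub-volume
# (for example by stopping the corresponding glusterfsd processes shown here).
gluster volume status "$VOL"

# Step 3: create a fuse mount of the volume.
mount -t glusterfs localhost:/"$VOL" "$MNT"

# Step 4: generate the 50K test files on the mount.
bash create_files.sh "$MNT"

# Step 5: bring the downed bricks back online.
gluster volume start "$VOL" force

# Step 6: trigger a full self-heal, then query heal status while it runs.
gluster volume heal "$VOL" full
gluster volume heal "$VOL" info
echo "heal info exit status: $?"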

Actual Result:
------------------
[06/06/12 - 02:05:48 root@AFR-Server1 ~]# gluster v heal vol  info
[06/06/12 - 02:09:31 root@AFR-Server1 ~]# echo $?
130
[06/06/12 - 02:09:34 root@AFR-Server1 ~]# gluster v heal vol  info
operation failed
[06/06/12 - 02:11:17 root@AFR-Server1 ~]# gluster v heal vol  info healed
operation failed
[06/06/12 - 02:11:28 root@AFR-Server1 ~]# gluster v heal vol  info heal-failed
operation failed

[06/06/12 - 02:11:45 root@AFR-Server1 ~]# gluster v heal vol  info heal-failed
operation failed
  
Additional info:
---------------

[06/06/12 - 02:18:58 root@AFR-Server1 ~]# gluster v info
 
Volume Name: vol
Type: Distributed-Replicate
Volume ID: b2f7f458-598e-456f-af7c-aa5af0036393
Status: Started
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.16.159.184:/export_b1/dir1
Brick2: 10.16.159.188:/export_b1/dir1
Brick3: 10.16.159.196:/export_b1/dir1
Brick4: 10.16.159.184:/export_c1/dir1
Brick5: 10.16.159.188:/export_c1/dir1
Brick6: 10.16.159.196:/export_c1/dir1
Brick7: 10.16.159.184:/export_d1/dir1
Brick8: 10.16.159.188:/export_d1/dir1
Brick9: 10.16.159.196:/export_d1/dir1

Comment 1 Shwetha Panduranga 2012-06-12 12:04:20 UTC
[06/12/12 - 07:55:15 root@AFR-Server1 ~]# gluster v heal vol info
[06/12/12 - 07:57:20 root@AFR-Server1 ~]# echo $?
110
[06/12/12 - 07:57:26 root@AFR-Server1 ~]# gluster v heal vol info
operation failed
[06/12/12 - 07:57:29 root@AFR-Server1 ~]# gluster v heal vol info healed
operation failed
[06/12/12 - 07:57:31 root@AFR-Server1 ~]# ps -ef | grep gluster
root     14466     1  0 02:29 ?        00:00:01 glusterd
root     14519     1 41 02:29 ?        02:14:36 /usr/local/sbin/glusterfsd -s localhost --volfile-id vol.10.16.159.184.export_b1-dir1 -p /var/lib/glusterd/vols/vol/run/10.16.159.184-export_b1-dir1.pid -S /tmp/74414b0b8aa5135f694fce1fc36d154b.socket --brick-name /export_b1/dir1 -l /usr/local/var/log/glusterfs/bricks/export_b1-dir1.log --xlator-option *-posix.glusterd-uuid=1f9181c2-44c4-475f-9554-5446f94a64b1 --brick-port 24009 --xlator-option vol-server.listen-port=24009
root     15649     1  0 07:55 ?        00:00:00 /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /usr/local/var/log/glusterfs/nfs.log -S /tmp/93490882b2b0ff4340fcd061ab2749fe.socket
root     15655     1  7 07:55 ?        00:00:10 /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /usr/local/var/log/glusterfs/glustershd.log -S /tmp/2aa8553740d2e5cbc70758ce1064a8c0.socket --xlator-option *replicate*.node-uuid=1f9181c2-44c4-475f-9554-5446f94a64b1
root     15700 11867  0 07:57 pts/0    00:00:00 grep gluster
[06/12/12 - 07:57:35 root@AFR-Server1 ~]# 
[06/12/12 - 07:57:48 root@AFR-Server1 ~]# gluster v heal vol info healed
operation failed
[06/12/12 - 07:57:52 root@AFR-Server1 ~]# gluster system:: fsm log
number of transitions: 50
Old State: [Unlock sent]
New State: [Default]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:06]

Old State: [Default]
New State: [Lock sent]
Event    : [GD_OP_EVENT_START_LOCK]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Brick op sent]
Event    : [GD_OP_EVENT_STAGE_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Brick op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_ALL_ACK]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_COMMIT_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Default]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Default]
New State: [Lock sent]
Event    : [GD_OP_EVENT_START_LOCK]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Lock sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Stage op sent]
New State: [Brick op sent]
Event    : [GD_OP_EVENT_STAGE_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Brick op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_ALL_ACK]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Commit op sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_COMMIT_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Unlock sent]
New State: [Default]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:08]

Old State: [Default]
New State: [Lock sent]
Event    : [GD_OP_EVENT_START_LOCK]
timestamp: [2012-06-12 07:55:15]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Lock sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Stage op sent]
New State: [Brick op sent]
Event    : [GD_OP_EVENT_STAGE_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Brick op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_ALL_ACK]
timestamp: [2012-06-12 07:55:15]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Commit op sent]
New State: [Commit op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Commit op sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_COMMIT_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Unlock sent]
New State: [Unlock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Unlock sent]
New State: [Default]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:15]

Old State: [Default]
New State: [Lock sent]
Event    : [GD_OP_EVENT_START_LOCK]
timestamp: [2012-06-12 07:55:20]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Lock sent]
New State: [Lock sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Lock sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_ALL_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Stage op sent]
New State: [Stage op sent]
Event    : [GD_OP_EVENT_RCVD_ACC]
timestamp: [2012-06-12 07:55:20]

Old State: [Stage op sent]
New State: [Brick op sent]
Event    : [GD_OP_EVENT_STAGE_ACC]
timestamp: [2012-06-12 07:55:20]

Comment 2 Shwetha Panduranga 2012-06-13 06:55:29 UTC
Created attachment 591366 [details]
statedump of glustershd process

Comment 3 Shwetha Panduranga 2012-06-13 06:58:12 UTC
Created attachment 591369 [details]
statedump of glustershd process

Comment 4 Pranith Kumar K 2012-08-29 06:47:45 UTC
*** Bug 852191 has been marked as a duplicate of this bug. ***

Comment 5 Rob.Hendelman 2012-08-29 12:28:11 UTC
My bug was marked as a duplicate. My use case has 3-5 million files. Is this being actively worked on? Any idea whether a fix will land before 3.4 (or during 3.4)? I'm comfortable building from git if needed.

Thanks,

Rob

Comment 6 Rob.Hendelman 2012-09-11 14:04:26 UTC
I've built with the latest git commit a032de191ec32be363d1feedfbd839f6dbde2579 and can still reproduce this. Let me know how I can help.

Thanks,

Rob

Comment 7 Rob.Hendelman 2013-01-28 16:23:21 UTC
Logs from 3.3.1 (from a single server that is healing, quorum 2 of 3) when healing stops:

[2013-01-28 09:31:52.652900] I [client-handshake.c:453:client_set_lk_version_cbk] 0-newdata-client-0: Server lk version = 1
[2013-01-28 09:31:52.653354] I [client-handshake.c:453:client_set_lk_version_cbk] 0-newdata-client-1: Server lk version = 1
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-01-28 09:40:15
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0)[0x7f76254024a0]
/lib/x86_64-linux-gnu/libc.so.6(+0x129024)[0x7f76254f5024]
/lib/x86_64-linux-gnu/libc.so.6(xdr_string+0x116)[0x7f76254f4886]
/usr/lib/libgfxdr.so.0(xdr_gd1_mgmt_brick_op_req+0x26)[0x7f76259b38e6]
/usr/lib/libgfxdr.so.0(xdr_to_generic+0x44)[0x7f76259ae494]
/usr/sbin/glusterfs(glusterfs_handle_translator_op+0xb0)[0x7f762627b0c0]
/usr/lib/libglusterfs.so.0(synctask_wrap+0x12)[0x7f7625e1cd92]
/lib/x86_64-linux-gnu/libc.so.6(+0x48650)[0x7f7625414650]

Comment 8 Ben Turner 2013-03-05 16:14:32 UTC
I am seeing volume heal commands fail occasionally on my 6-node 3x2 setups. Here is an example from heal info split-brain:

# gluster volume heal healtest info split-brain
operation failed
# echo $?
255

This happened at just about 20:32:07.  In the etc-glusterfs-glusterd.vol.log I see:

[2013-02-27 20:32:08.081243] E [glusterd-utils.c:278:glusterd_lock] 0-glusterd: Unable to get lock for uuid: c3538fdf-c16e-4eaf-9650-6cf0caf7478b, lock held by: 647b8ff3-8110-4eab-956b-df78235fb192
[2013-02-27 20:32:08.081254] E [glusterd-handler.c:470:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1

I'll attach the "gluster system:: fsm log" output next time I run into this.
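
For anyone hitting the same "Unable to get lock" error, here is a minimal diagnostic sketch for mapping the "lock held by" UUID to a node. It uses only standard gluster and shell commands; the log path is an assumption (adjust it to the install prefix, e.g. the /usr/local/var/log/glusterfs path seen earlier in this bug).

check_glusterd_lock.sh (sketch):
-------------------------
#!/bin/bash
# Log path is an assumption; it may also live under /usr/local/var/log/glusterfs.
LOG=/var/log/glusterfs/etc-glusterfs-glusterd.vol.log

# Show the most recent cluster-lock failures recorded by glusterd.
grep "Unable to get lock" "$LOG" | tail -n 5

# Local node UUID, to check whether this node itself holds the lock.
cat /var/lib/glusterd/glusterd.info

# Peer UUIDs, to identify which peer the "lock held by" UUID belongs to.
gluster peer status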

Comment 9 Pranith Kumar K 2014-01-03 14:01:51 UTC
gluster volume heal <volname> info has been re-written to handle a large number of files that need self-heal. Moving it to ON_QA.
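
As a minimal verification sketch (an assumption about how this can be re-checked, not part of the original comment), rerun the heal-info query from the reproduction above once a large self-heal backlog exists and confirm the command now lists entries instead of failing:

gluster volume heal vol info
echo "exit status: $?"   # expect 0 and per-brick entries, not "operation failed"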

