While a self-heal is in progress on a mount, the mount process may crash if cluster.data-self-heal is changed from "off" to "on" using the volume set operation.
Workaround: Ensure that no self-heals are pending on the volume before changing cluster.data-self-heal.
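One way to check that no heals are pending before toggling the option is the standard heal info subcommand; a sketch, assuming a volume named healtest (substitute your own volume name):

```
# List pending self-heal entries on each brick before changing the option.
gluster volume heal healtest info

# Only once every brick reports "Number of entries: 0" is it safe to run:
gluster volume set healtest cluster.data-self-heal on
```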
Description of problem:
When running the test case "Test self-heal of 50k files (self-heal-daemon)", the client crashed while creating data. Here is what I saw in the shell:
32768 bytes (33 kB) copied, 0.00226773 s, 14.4 MB/s
1+0 records in
1+0 records out
32768 bytes (33 kB) copied, 0.00233886 s, 14.0 MB/s
dd: opening `/gluster-mount/small/37773.small': Software caused connection abort
dd: opening `/gluster-mount/small/37774.small': Transport endpoint is not connected
dd: opening `/gluster-mount/small/37775.small': Transport endpoint is not connected
And in the gluster mount logs:
client-0 to healtest-client-1, metadata - Pending matrix: [ [ 0 2 ] [ 0 0 ] ], on /small/37757.small
[2014-02-14 18:56:15.169667] I [afr-self-heal-common.c:2906:afr_log_self_heal_completion_status] 0-healtest-replicate-0: metadata self heal is successfully completed, metadata self heal from source healtest-client-0 to healtest-client-1, metadata - Pending matrix: [ [ 0 2 ] [ 0 0 ] ], on /small/37771.small
[2014-02-14 18:56:15.275690] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2014-02-14 18:56:15.276117] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2014-02-14 18:56:15.278740] I [dht-shared.c:311:dht_init_regex] 0-healtest-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
[2014-02-14 18:56:15.278975] I [glusterfsd-mgmt.c:1379:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2014-02-14 18:56:15.279009] I [glusterfsd-mgmt.c:1379:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-02-14 18:56:15
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.59rhs
/lib64/libc.so.6(+0x32920)[0x7fd0fb464920]
/usr/lib64/glusterfs/3.4.0.59rhs/xlator/cluster/replicate.so(afr_sh_data_lock_rec+0x77)[0x7fd0f53a9a27]
/usr/lib64/glusterfs/3.4.0.59rhs/xlator/cluster/replicate.so(afr_sh_data_open_cbk+0x178)[0x7fd0f53ab398]
/usr/lib64/glusterfs/3.4.0.59rhs/xlator/protocol/client.so(client3_3_open_cbk+0x18b)[0x7fd0f560e82b]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7fd0fc1a7f45]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7fd0fc1a9507]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7fd0fc1a4d88]
/usr/lib64/glusterfs/3.4.0.59rhs/rpc-transport/socket.so(+0x8d86)[0x7fd0f7a44d86]
/usr/lib64/glusterfs/3.4.0.59rhs/rpc-transport/socket.so(+0xa69d)[0x7fd0f7a4669d]
/usr/lib64/libglusterfs.so.0(+0x61ad7)[0x7fd0fc413ad7]
/usr/sbin/glusterfs(main+0x5f8)[0x4068b8]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fd0fb450cdd]
/usr/sbin/glusterfs[0x4045c9]
---------
Version-Release number of selected component (if applicable):
glusterfs 3.4.0.59rhs
How reproducible:
I have only seen this crash once in 2-3 runs of this or very similar test cases.
Steps to Reproduce:
I hit this during a batch run of automated test cases:
TCMS - 198855 223406 226909 226912 237832 238530 238539
The test case that saw the crash was 238530:
1. Create a 1x2 volume across 2 nodes.
2. Set the volume option 'self-heal-daemon' to "off" using the command "gluster volume set <vol_name> self-heal-daemon off" from one of the storage nodes.
3. Bring all brick processes offline on one node.
4. Create 50k files with:
mkdir -p ${MOUNT_POINT}/small
for i in `seq 1 $3`; do
dd if=/dev/zero of=${MOUNT_POINT}/small/$i.small bs=$4 count=1
done
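For reference, the loop above is taken from a larger test script in which $3 is the file count and $4 the dd block size. A self-contained equivalent, with hypothetical variable names standing in for those positional parameters (writing to a scratch directory rather than a real gluster mount), would be:

```shell
#!/bin/sh
# Sketch of the file-creation step. MOUNT_POINT, COUNT and BS are
# placeholders for the original script's mount point and $3 / $4.
MOUNT_POINT=${MOUNT_POINT:-/tmp/gluster-repro}
COUNT=${COUNT:-50}          # the test case used 50000
BS=${BS:-32768}             # 32 KiB per file, matching the dd output above

mkdir -p "$MOUNT_POINT/small"
for i in $(seq 1 "$COUNT"); do
    # One block of zeroes per file, as in the original loop.
    dd if=/dev/zero of="$MOUNT_POINT/small/$i.small" bs="$BS" count=1 2>/dev/null
done
```

Running this against the mount point of a 1x2 replica volume with one brick down reproduces the write pattern the test case used.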
Actual results:
Crash on the client during file creation.
Expected results:
No crash.
Additional info:
I was only able to get the core file and sosreport from the client before the hosts were reclaimed. I'll attempt to reproduce it again to gather more data.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2015-0038.html