Bug 802030 - self-heal is not triggered by glustershd even when all the bricks are online
Status: CLOSED CANTFIX
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: Pranith Kumar K
Depends On:
Blocks:
 
Reported: 2012-03-10 07:12 EST by Shwetha Panduranga
Modified: 2015-12-01 11:45 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-04-01 02:23:45 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
glustershd log (1.79 MB, text/x-log)
2012-03-10 07:12 EST, Shwetha Panduranga

Description Shwetha Panduranga 2012-03-10 07:12:45 EST
Created attachment 569079
glustershd log

glustershd reports: "no active sinks for performing self-heal on file" even when all the bricks are online. Hence self-heal is not triggered.
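The message can be confirmed in the self-heal daemon log (default log path assumed; adjust if your installation logs elsewhere):

# grep "no active sinks" /var/log/glusterfs/glustershd.log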

Version-Release number of selected component (if applicable):
3.3.0qa26

How reproducible:
often

Steps to Reproduce (a consolidated shell sketch follows the list):
1. Create a distribute-replicate volume (2 x 3, 200GB total available space) and start the volume.
2. Create 2 gluster mounts and one nfs mount.
3. Start dd on one of the gluster mounts and on the nfs mount:
4. dd if=/dev/zero of=gfsf1 bs=1M count=51200 on the gluster mount
5. dd if=/dev/zero of=nfsf1 bs=1M count=51200 on the nfs mount
6. Enable quota on the volume.
7. Set the quota limit-usage to some value (it was set to 150GB here).
8. Bring down 2 bricks (one from each replica set).
9. Leave dd still in progress on both mounts.
10. Bring the bricks back up.
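
Consolidated reproduction sketch (a non-authoritative outline; hostnames, brick paths and mount points are taken from the report or assumed, so adjust them to your setup):
--------------------------------------------------------------------------------
gluster volume create dstore1 replica 3 \
    192.168.2.35:/export1 192.168.2.36:/export1 192.168.2.37:/export1 \
    192.168.2.35:/export2 192.168.2.36:/export2 192.168.2.37:/export2
gluster volume start dstore1

mount -t glusterfs 192.168.2.35:/dstore1 /mnt/gfs1        # gluster mount (assumed mount point)
mount -t nfs -o vers=3 192.168.2.35:/dstore1 /mnt/nfs1    # nfs mount (assumed mount point)

dd if=/dev/zero of=/mnt/gfs1/gfsf1 bs=1M count=51200 &    # ~50GB write on gluster mount
dd if=/dev/zero of=/mnt/nfs1/nfsf1 bs=1M count=51200 &    # ~50GB write on nfs mount

gluster volume quota dstore1 enable
gluster volume quota dstore1 limit-usage / 150GB

# bring down one brick from each replica set (e.g. kill its brick process),
# let dd keep writing for a while, then bring the bricks back online:
gluster volume start dstore1 force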

Actual results:
------------------
glustershd didn't trigger self-heal on the files, and after some time (approximately 30 minutes) the xattrs of the file became all zeros, but the file size still differs across the bricks.

[03/10/12 - 21:58:57 root@APP-SERVER1 ~]# gluster volume start dstore1 force
Starting volume dstore1 has been successful

[03/10/12 - 22:00:30 root@APP-SERVER1 ~]# gluster volume status dstore1 

Status of volume: dstore1
Brick							Port	Online	Pid
------------------------------------------------------------------------------
192.168.2.35:/export1					24009	Y	8806
192.168.2.36:/export1					24009	Y	23655
192.168.2.37:/export1					24009	Y	13195
192.168.2.35:/export2					24010	Y	8811
192.168.2.36:/export2					24010	Y	23660
192.168.2.37:/export2					24010	Y	13200

Brick1:-
-------
# file: export1/nfsf1
trusted.afr.dstore1-client-0=0x000000000000000000000000
trusted.afr.dstore1-client-1=0x000000000000000000000000
trusted.afr.dstore1-client-2=0x000000000000000000000000
trusted.gfid=0x0e53a4d9c45340f0b9bd085c70a6af41
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri=0x000000019b411000

Brick2:-
------
# file: export1/nfsf1
trusted.afr.dstore1-client-0=0x000569e30000000000000000
trusted.afr.dstore1-client-1=0x000000000000000000000000
trusted.afr.dstore1-client-2=0x000000000000000000000000
trusted.gfid=0x0e53a4d9c45340f0b9bd085c70a6af41
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri=0x0000000c16982000

Brick3:-
-----
# file: export1/nfsf1
trusted.afr.dstore1-client-0=0x000569e30000000000000000
trusted.afr.dstore1-client-1=0x000000000000000000000000
trusted.afr.dstore1-client-2=0x000000000000000000000000
trusted.gfid=0x0e53a4d9c45340f0b9bd085c70a6af41
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri=0x0000000c16982000
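
For reference, the 12-byte trusted.afr changelog value is conventionally interpreted as three big-endian 32-bit pending counters (data, metadata, entry). A minimal bash sketch that decodes the non-zero value shown for Brick2 above:

x=000569e30000000000000000    # trusted.afr.dstore1-client-0 on Brick2, without the 0x prefix
echo "data pending:     $((16#${x:0:8}))"
echo "metadata pending: $((16#${x:8:8}))"
echo "entry pending:    $((16#${x:16:8}))"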

[03/10/12 - 22:54:21 root@APP-SERVER3 ~]# du -h /export1/nfsf1
49G	/export1/nfsf1

[03/10/12 - 22:55:11 root@APP-SERVER3 ~]# getfattr -m . -d -e hex /export1/nfsf1
getfattr: Removing leading '/' from absolute path names
# file: export1/nfsf1
trusted.afr.dstore1-client-0=0x000000000000000000000000
trusted.afr.dstore1-client-1=0x000000000000000000000000
trusted.afr.dstore1-client-2=0x000000000000000000000000
trusted.gfid=0x0e53a4d9c45340f0b9bd085c70a6af41
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri=0x0000000c16982000

[03/10/12 - 22:54:37 root@APP-SERVER2 ~]# du -h /export1/nfsf1
49G	/export1/nfsf1

[03/10/12 - 22:55:06 root@APP-SERVER2 ~]# 
[03/10/12 - 22:58:28 root@APP-SERVER2 ~]# getfattr -m . -d -e hex /export1/nfsf1
getfattr: Removing leading '/' from absolute path names
# file: export1/nfsf1
trusted.afr.dstore1-client-0=0x000000000000000000000000
trusted.afr.dstore1-client-1=0x000000000000000000000000
trusted.afr.dstore1-client-2=0x000000000000000000000000
trusted.gfid=0x0e53a4d9c45340f0b9bd085c70a6af41
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri=0x0000000c16982000

[03/10/12 - 22:54:47 root@APP-SERVER1 ~]# du -h /export1/nfsf1
6.5G	/export1/nfsf1

[03/10/12 - 22:54:53 root@APP-SERVER1 ~]# getfattr -m . -d -e hex /export1/nfsf1
getfattr: Removing leading '/' from absolute path names
# file: export1/nfsf1
trusted.afr.dstore1-client-0=0x000000000000000000000000
trusted.afr.dstore1-client-1=0x000000000000000000000000
trusted.afr.dstore1-client-2=0x000000000000000000000000
trusted.gfid=0x0e53a4d9c45340f0b9bd085c70a6af41
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri=0x000000019b411000

glustershd logs:-
------------------
[2012-03-10 22:00:33.613995] I [client-handshake.c:1308:client_setvolume_cbk] 0-dstore1-client-0: clnt-lk-version = 1, server-lk-version = 0
[2012-03-10 22:00:33.614040] I [client-handshake.c:1334:client_setvolume_cbk] 0-dstore1-client-0: Connected to 192.168.2.35:24009, attached to remote volume '/export1'.
[2012-03-10 22:00:33.614141] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-0: starting crawl 1 for dstore1-client-0
[2012-03-10 22:00:33.623052] W [client.c:1990:client_rpc_notify] 0-dstore1-client-3: Cancelling the grace timer
[2012-03-10 22:00:33.623224] I [afr-self-heald.c:949:afr_dir_crawl] 0-dstore1-replicate-1: Crawl completed on dstore1-client-5
[2012-03-10 22:00:33.623621] I [client-handshake.c:1533:select_server_supported_programs] 0-dstore1-client-3: Using Program GlusterFS 3.3.0qa26, Num (1298437), Version (330)
[2012-03-10 22:00:33.627940] I [client-handshake.c:1308:client_setvolume_cbk] 0-dstore1-client-3: clnt-lk-version = 1, server-lk-version = 0
[2012-03-10 22:00:33.627971] I [client-handshake.c:1334:client_setvolume_cbk] 0-dstore1-client-3: Connected to 192.168.2.35:24010, attached to remote volume '/export2'.
[2012-03-10 22:00:33.628201] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-1: starting crawl 1 for dstore1-client-3
[2012-03-10 22:00:33.637846] I [afr-self-heald.c:890:afr_find_child_position] 0-dstore1-replicate-0: child dstore1-client-0 is remote
[2012-03-10 22:00:33.638076] W [client.c:1990:client_rpc_notify] 0-dstore1-client-1: Cancelling the grace timer
[2012-03-10 22:00:33.639141] I [client-handshake.c:1533:select_server_supported_programs] 0-dstore1-client-1: Using Program GlusterFS 3.3.0qa26, Num (1298437), Version (330)
[2012-03-10 22:00:33.640539] I [afr-self-heald.c:890:afr_find_child_position] 0-dstore1-replicate-1: child dstore1-client-3 is remote
[2012-03-10 22:00:33.641984] I [client-handshake.c:1308:client_setvolume_cbk] 0-dstore1-client-1: clnt-lk-version = 1, server-lk-version = 0
[2012-03-10 22:00:33.642028] I [client-handshake.c:1334:client_setvolume_cbk] 0-dstore1-client-1: Connected to 192.168.2.36:24009, attached to remote volume '/export1'.
[2012-03-10 22:00:33.642128] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-0: starting crawl 1 for dstore1-client-1
[2012-03-10 22:00:33.645176] W [client.c:1990:client_rpc_notify] 0-dstore1-client-4: Cancelling the grace timer
[2012-03-10 22:00:33.647412] I [client-handshake.c:1533:select_server_supported_programs] 0-dstore1-client-4: Using Program GlusterFS 3.3.0qa26, Num (1298437), Version (330)
[2012-03-10 22:00:33.648519] I [afr-self-heald.c:890:afr_find_child_position] 0-dstore1-replicate-0: child dstore1-client-1 is remote
[2012-03-10 22:00:33.648788] I [client-handshake.c:1308:client_setvolume_cbk] 0-dstore1-client-4: clnt-lk-version = 1, server-lk-version = 0
[2012-03-10 22:00:33.648823] I [client-handshake.c:1334:client_setvolume_cbk] 0-dstore1-client-4: Connected to 192.168.2.36:24010, attached to remote volume '/export2'.
[2012-03-10 22:00:33.648912] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-1: starting crawl 1 for dstore1-client-4
[2012-03-10 22:00:33.654426] I [afr-self-heald.c:890:afr_find_child_position] 0-dstore1-replicate-1: child dstore1-client-4 is remote
[2012-03-10 22:00:33.664607] I [afr-self-heald.c:949:afr_dir_crawl] 0-dstore1-replicate-0: Crawl completed on dstore1-client-2
[2012-03-10 22:10:33.744321] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-1: starting crawl 1 for dstore1-client-5
[2012-03-10 22:10:33.744787] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-0: starting crawl 1 for dstore1-client-2
[2012-03-10 22:10:33.744861] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-0: starting crawl 1 for dstore1-client-0
[2012-03-10 22:10:33.744939] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-1: starting crawl 1 for dstore1-client-3
[2012-03-10 22:10:33.745160] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-0: starting crawl 1 for dstore1-client-1
[2012-03-10 22:10:33.745265] I [afr-self-heald.c:1048:afr_start_crawl] 0-dstore1-replicate-1: starting crawl 1 for dstore1-client-4
[2012-03-10 22:10:33.745474] I [afr-self-heald.c:890:afr_find_child_position] 0-dstore1-replicate-0: child dstore1-client-2 is local
[2012-03-10 22:10:33.745678] I [afr-self-heald.c:890:afr_find_child_position] 0-dstore1-replicate-1: child dstore1-client-5 is local
[2012-03-10 22:10:33.748841] I [afr-common.c:1313:afr_launch_self_heal] 0-dstore1-replicate-1: background  data self-heal triggered. path: , reason: lookup detected pending operations
[2012-03-10 22:10:33.750670] I [afr-common.c:1313:afr_launch_self_heal] 0-dstore1-replicate-0: background  data self-heal triggered. path: , reason: lookup detected pending operations
[2012-03-10 22:10:33.754028] I [afr-self-heal-data.c:738:afr_sh_data_fix] 0-dstore1-replicate-1: no active sinks for performing self-heal on file 
[2012-03-10 22:10:33.755123] I [afr-self-heal-common.c:2037:afr_self_heal_completion_cbk] 0-dstore1-replicate-1: background  data self-heal completed on
Comment 1 Pranith Kumar K 2012-03-29 06:40:10 EDT
Shwetha reproduced the bug; the self-heal completed successfully and the xattrs were removed. Even the md5sum matches on both files, but the source file shows a du of 49G whereas the sink shows 6.5G; ls -l shows 49G on both source and sink. This happens because self-heal does the equivalent of 'truncate -s 49G' on the sink file and then performs the diff self-heal. Since AFR cannot differentiate between a hole on the sink and all-zero data on the source, it does not write anything to the sink file, so at the end of the self-heal the sink file's du is still 6.5G and no extra data is synced from source to sink.
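
To illustrate the hole-vs-zeros point: a sparse file (hole) and an explicitly zero-filled file of the same size read back identically, so their checksums match, but they allocate different amounts of disk space, so du differs. A small sketch using hypothetical /tmp paths:

truncate -s 1G /tmp/sparse_file                        # hole only, almost no blocks allocated
dd if=/dev/zero of=/tmp/filled_file bs=1M count=1024   # 1GB of real zero-filled blocks
md5sum /tmp/sparse_file /tmp/filled_file               # identical checksums
du -h /tmp/sparse_file /tmp/filled_file                # ~0 vs ~1.0G allocated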
Comment 2 Pranith Kumar K 2012-04-01 02:23:45 EDT
There is no way to find out whether a region of the file is a hole or is filled with zeros, so we can't fix this issue. Since it does not affect functionality, it should be fine.
