I host raw VM images on:

Volume Name: vmimages
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: ewcs2:/var/spool/glusterfs/a_vmimages
Brick2: ewcs4:/var/spool/glusterfs/a_vmimages
Brick3: ewcs7:/var/spool/glusterfs/a_vmimages
Brick4: ewcs2:/var/spool/glusterfs/b_vmimages
Brick5: ewcs4:/var/spool/glusterfs/b_vmimages
Brick6: ewcs7:/var/spool/glusterfs/b_vmimages
Brick7: ewcs2:/var/spool/glusterfs/c_vmimages
Brick8: ewcs4:/var/spool/glusterfs/c_vmimages
Brick9: ewcs7:/var/spool/glusterfs/c_vmimages
Brick10: ewcs2:/var/spool/glusterfs/d_vmimages
Brick11: ewcs4:/var/spool/glusterfs/d_vmimages
Brick12: ewcs7:/var/spool/glusterfs/d_vmimages

After replacing ewcs7:/var/spool/glusterfs/c_vmimages due to a disk failure, I ran stat() against the files (VM images) in that volume. These VMs were live at the time. The background self-heal was scheduled and completed; the extended attributes, however, still showed pending operations.

Before:
/var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x0000accd0000000000000000

Log:
[2012-05-11 08:24:28.756410] I [afr-common.c:790:afr_lookup_done] 1-vmimages-replicate-3: background data self-heal triggered. path: /intranet.img
[2012-05-11 08:30:10.230581] I [afr-self-heal-algorithm.c:532:sh_diff_loop_driver_done] 1-vmimages-replicate-3: diff self-heal on /intranet.img: 56 blocks of 49152 were different (0.11%)
[2012-05-11 08:30:10.459175] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 1-vmimages-replicate-3: background data self-heal completed on /intranet.img

After:
/var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x000000f80000000000000000

Trying again with the client log level set to DEBUG led me to this anomaly:

[2012-05-11 09:35:04.900303] D [afr-self-heal-data.c:573:afr_sh_data_sync_prepare] 1-vmimages-replicate-3: self-healing file /intranet.img from subvolume vmimages-client-9 to 1 other

Why is it self-healing from vmimages-client-9 when it's vmimages-client-11 that's dirty?
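For anyone following along: as I understand the AFR changelog format, each trusted.afr value packs three big-endian 32-bit counters (pending data, metadata, and entry operations, in that order), so a value like the "before" one above can be decoded with plain bash arithmetic. This is just an illustrative decode, not from the report itself:

```shell
# Decode a trusted.afr changelog value into its three counters.
# Assumed layout: 4 bytes pending data ops, 4 bytes pending
# metadata ops, 4 bytes pending entry ops, big-endian.
xattr=0x0000accd0000000000000000
hex=${xattr#0x}                 # strip the 0x prefix
data=$((16#${hex:0:8}))         # first 8 hex digits
meta=$((16#${hex:8:8}))         # next 8 hex digits
entry=$((16#${hex:16:8}))       # last 8 hex digits
echo "data=$data metadata=$meta entry=$entry"
# → data=44237 metadata=0 entry=0
```

So 0x0000accd... means 0xaccd = 44237 data operations pending against that subvolume, with no metadata or entry changes outstanding.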
Created attachment 583940 [details] client state dump
Created attachment 583941 [details] ewcs2:/var/share/glusterd/d_vmimages state dump
Created attachment 583942 [details] ewcs4:/var/share/glusterd/d_vmimages state dump
Created attachment 583943 [details] ewcs7:/var/share/glusterd/d_vmimages state dump
We think it's possible that the changed drive is a red herring. I didn't start looking for dirty attributes until that happened, but I did take ewcs7 down for an upgrade on the 3rd, so it's possible that's when it started. intranet.img isn't the only file showing this behavior. Here are the non-zero xattrs remaining:

ewcs{2,4}:

# file: var/spool/glusterfs/a_vmimages/blog.img
trusted.afr.vmimages-client-2=0x000036910000000000000000

# file: var/spool/glusterfs/a_vmimages/mysql1.img
trusted.afr.vmimages-client-2=0x0000ab3b0000000000000000

# file: var/spool/glusterfs/c_vmimages/bazaar.img
trusted.afr.vmimages-client-8=0x00002d720000000000000000

# file: var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x0000273a0000000000000000

# file: var/spool/glusterfs/d_vmimages/squid1.img
trusted.afr.vmimages-client-11=0x00009d510000000000000000

# file: var/spool/glusterfs/d_vmimages/web2.img
trusted.afr.vmimages-client-11=0x00009ff00000000000000000

On ewcs7 they are all zero.

I also realized that these may be relevant:

[2012-05-11 09:32:55.844205] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (144): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.844807] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (156): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.844916] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.845259] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.882763] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.882928] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:55.882981] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:55.883365] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (144): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.883386] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-0: we're done locking
[2012-05-11 09:32:55.883402] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-0: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.119427] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.119641] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:56.120151] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.120334] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.120450] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (156): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.120487] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-0: we're done locking
[2012-05-11 09:32:56.120503] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-0: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.120721] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.121838] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.121894] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.121919] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.252114] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.252441] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.252876] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.252958] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.253320] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.253343] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.253367] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.314070] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (159): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.314354] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.314827] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.315231] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (159): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.315253] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.315269] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
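For reference, a listing like the one above comes from getfattr; a small filter makes it easy to sweep a brick for files with pending changelogs. The glob and helper name below are mine, not from the report:

```shell
# Filter for `getfattr -d -m trusted.afr -e hex <file>` output that
# keeps only changelog lines with a non-zero value. A hypothetical
# sweep over one server's bricks would look like:
#   getfattr -d -m trusted.afr -e hex \
#       /var/spool/glusterfs/*_vmimages/*.img 2>/dev/null | nonzero_afr
nonzero_afr() {
    grep '^trusted\.afr\.' | grep -v '=0x0\{24\}$'
}

# Example against captured output: the all-zero line is dropped.
printf '%s\n' \
    'trusted.afr.vmimages-client-11=0x00009d510000000000000000' \
    'trusted.afr.vmimages-client-10=0x000000000000000000000000' \
    | nonzero_afr
# → trusted.afr.vmimages-client-11=0x00009d510000000000000000
```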
It seems the re-open of the fd after you brought the brick back up did not succeed; it keeps returning EBADFD for that fd. That could be why the sync never completes: writes are not happening on the new brick. How did you perform the disk replacement? Did you bring the brick down?
I just found that there is a problem in the client xlator with re-opening a file after a disconnect and re-connect. If, at the time of re-open after the brick reconnects, the file is not present on the brick (since the disk was replaced, I am assuming no files were present on the disk when the brick was brought back up), the re-open fails, and further operations on the file through that fd happen on only one brick (assuming a 2-way replica). So the file will never be in sync as long as the fd stays open. For hosting VMs this is a problem.
That sounds like that's it. When the drive failed, I killed glusterfsd for that brick, replaced the drive (partitioned, formatted, etc.), then restarted the brick with "gluster volume start vmimages force".
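For future readers, the sequence described above amounts to roughly the following. This is an ops sketch of the reporter's own steps, not a recommendation; the process pattern, device, and filesystem choice are placeholders:

```
# On the server with the failed disk (names are illustrative):

# 1. Stop the brick process serving the dead disk.
pkill -f 'glusterfsd.*c_vmimages'

# 2. Partition/format the new disk and mount it at the brick path.
mkfs.xfs /dev/sdX1
mount /dev/sdX1 /var/spool/glusterfs/c_vmimages

# 3. Restart the missing brick process.
gluster volume start vmimages force
```

Per the comments above, on the affected version any fds held open across that restart may fail to re-open against the new brick, so the files behind them stay out of sync until those fds are closed and reopened.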
CHANGE: http://review.gluster.org/4357 (protocol/client: Add fdctx back to saved-list after reopen) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4358 (protocol/client: Periodically attempt reopens) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4386 (tests: Added util functions) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4387 (test: test re-opens of open-fds) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4464 (protocol/client: Avoid double free of frame) merged in master by Anand Avati (avati)
Please backport to release-3.3
CHANGE: http://review.gluster.org/4540 (Tests: Disable open-behind) merged in master by Anand Avati (avati)